Changeset - a6c88c797216
[Not reviewed]
Merge
Gebrekirstos Gebremeskel - 2014-06-12 06:53:59
destinycome@gmail.com
Merge branch 'master' of https://scm.cwi.nl/IA/cikm-paper
1 file changed with 17 insertions and 10 deletions:
0 comments (0 inline, 0 general)
mypaper-final.tex
 
@@ -198,7 +198,7 @@ and zoom in on \emph{only} the filtering step; and conduct an in-depth analysis
 main components.  In particular, we study the effect of cleansing,
 entity profiling, type of entity filtered for (Wikipedia or Twitter), and
 document category (social, news, etc) on the filtering components'
-performance. The main contribution of the
+performance. The main contributions of the
 paper are an in-depth analysis of the factors that affect entity-based
 stream filtering, identifying optimal entity profiles without
 compromising precision, describing and classifying relevant documents
 
@@ -220,7 +220,7 @@ section \ref{sec:unfil}. Finally, we present our conclusions in
 
 
 \section{Data Description}\label{sec:desc}
-We base this analysis on the TREC-KBA 2013 dataset%
+Our study is carried out using the TREC-KBA 2013 dataset%
 \footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}}
 that consists of three main parts: a time-stamped stream corpus, a set of
 KB entities to be curated, and a set of relevance judgments. A CCR
 
@@ -234,14 +234,17 @@ is a  dump of  raw HTML pages. The cleansed version is the raw data
 after its HTML tags are stripped off and only English documents
 identified with Chromium Compact Language Detector
 \footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}
-are included.  The stream corpus is organized in hourly folders each
-of which contains many  chunk files. Each chunk file contains between
-hundreds and hundreds of thousands of serialized  thrift objects. One
-thrift object is one document. A document could be a blog article, a
+are included.  The stream corpus is organized in hourly folders, each
+of which contains many ``chunk files''. The chunk files contain
+hundreds up to hundreds of thousands of semi-structured documents,
+serialized as thrift objects (one thrift object corresponding to one
+document). A document could be a blog article, a
 news article, or a social media post (including tweet). The stream
-corpus comes from three sources: TREC KBA 2012 (social, news and
-linking) \footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}},
-arxiv\footnote{\url{http://arxiv.org/}}, and
+corpus has been derived from three main sources: TREC KBA 2012
+(containing blogs, news and web pages with shortened urls)%
+\footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}%
+},
+ArXiV\footnote{\url{http://arxiv.org/}}, and
 spinn3r\footnote{\url{http://spinn3r.com/}}.
 Table \ref{tab:streams} shows the sources, the number of hourly
 directories, and the number of chunk files.
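
The cleansing step described in the hunk above (HTML tags stripped, only documents identified as English by the Chromium Compact Language Detector retained) can be sketched in a few lines. A minimal illustration, assuming the pycld2 Python binding of that detector; the regex-based tag stripper and helper names are ours, not the paper's pipeline:

```python
import re

import pycld2  # Python binding to the Chromium Compact Language Detector

TAG_RE = re.compile(r"<[^>]+>")  # crude, illustrative HTML tag stripper


def cleanse(raw_html):
    """Strip HTML tags from a raw document, leaving plain text."""
    return TAG_RE.sub(" ", raw_html)


def is_english(text):
    """True if the detector reliably identifies the text as English."""
    is_reliable, _bytes_found, details = pycld2.detect(text)
    # details holds (languageName, languageCode, percent, score) tuples,
    # best guess first
    return is_reliable and details[0][1] == "en"
```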
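The same hunk rewrites the description of the corpus layout: hourly folders of ``chunk files'', each serializing hundreds up to hundreds of thousands of documents as thrift objects. A sketch of iterating over such a layout, assuming the streamcorpus Python package that accompanied TREC KBA and a locally decompressed copy of the corpus (the distributed chunks are additionally compressed and encrypted); the path and printed attributes are illustrative:

```python
import os

from streamcorpus import Chunk  # reference reader for KBA chunk files


def iter_documents(corpus_root):
    """Yield one document (a deserialized thrift object) at a time."""
    for hour_dir in sorted(os.listdir(corpus_root)):  # hourly folders
        hour_path = os.path.join(corpus_root, hour_dir)
        for chunk_name in sorted(os.listdir(hour_path)):
            # one chunk file = many serialized thrift objects,
            # one thrift object = one document
            for item in Chunk(os.path.join(hour_path, chunk_name)):
                yield item


for doc in iter_documents("/data/trec-kba-2013"):  # hypothetical path
    print(doc.stream_id, doc.source)  # e.g. social, news, arxiv
    break
```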
 
@@ -271,7 +274,11 @@ directories, and the number of chunk files.
 
 \subsection{KB entities}
 
-The KB entities consist of 20 Twitter entities and 121 Wikipedia entities. The selected entities are, on purpose, sparse. The entities consist of 71 people, 1 organization, and 24 facilities.
+The KB entities in the dataset consist of 20 Twitter entities and 121
+Wikipedia entities. These selected entities are, on purpose, ``sparse
+entities'' in the sense that they only occur in relatively few
+documents. The collection of entities consists of 71 people entities,
+1 organization entity, and 24 facilities entities.
 
 \subsection{Relevance judgments}
 