Changeset - a6c88c797216
[Not reviewed]
Merge
0 1 0
Gebrekirstos Gebremeskel - 11 years ago 2014-06-12 06:53:59
destinycome@gmail.com
Merge branch 'master' of https://scm.cwi.nl/IA/cikm-paper
1 file changed with 19 insertions and 12 deletions:
0 comments (0 inline, 0 general)
mypaper-final.tex
 
@@ -195,17 +195,17 @@ sub-components of the pipeline.
 
 
 
In this paper, we therefore fix the subsequent steps of the pipeline,
 
and zoom in on \emph{only} the filtering step, conducting an in-depth analysis of its
 
main components.  In particular, we study the effect of cleansing,
 
entity profiling, type of entity filtered for (Wikipedia or Twitter), and
 
document category (social, news, etc.) on the filtering components'
 
-performance. The main contribution of the
 
+performance. The main contributions of the
 
paper are an in-depth analysis of the factors that affect entity-based
 
stream filtering, identifying optimal entity profiles without
 
compromising precision, describing and classifying relevant documents
 
-that are not amenable to filtering , and estimating the upper-bound
 
+that are not amenable to filtering, and estimating the upper-bound
 
of recall on entity-based filtering.
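As an illustration of how these factors combine, the Python sketch below enumerates one filtering condition per combination of corpus version, entity profile, entity type, and document category; the concrete factor levels shown are placeholders for illustration, not the exact configurations evaluated in this paper.

\begin{verbatim}
# Illustrative only: the factor levels below are assumed placeholders,
# not the exact configurations evaluated in this paper.
from itertools import product

corpus_versions = ["raw", "cleansed"]
entity_profiles = ["canonical_name", "name_variants"]    # assumed profiles
entity_types    = ["wikipedia", "twitter"]
doc_categories  = ["news", "social", "weblog", "other"]  # assumed categories

# Each combination is one filtering condition whose precision and recall
# can be measured against the relevance judgments.
for condition in product(corpus_versions, entity_profiles,
                         entity_types, doc_categories):
    print(condition)
\end{verbatim}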
 
 
 
The rest of the paper is organized as follows. Section \ref{sec:desc}

describes the dataset, and section \ref{sec:fil} defines the task. In

section \ref{sec:lit}, we discuss related literature, followed by a
 
@@ -217,13 +217,13 @@ impact of filtering choices on classification in section
 
section \ref{sec:unfil}. Finally, we present our conclusions in
 
section \ref{sec:conc}.
 
 
 
 
 \section{Data Description}\label{sec:desc}
 
-We base this analysis on the TREC-KBA 2013 dataset%
 
+Our study is carried out using the TREC-KBA 2013 dataset%
 
\footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}}
 
that consists of three main parts: a time-stamped stream corpus, a set of
 
KB entities to be curated, and a set of relevance judgments. A CCR
 
system then has to identify, for each KB entity, which documents in the
 
stream corpus are to be considered by the human curator.
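As a rough illustration of this filtering step, the toy Python sketch below emits candidate document-entity pairs whenever a profile string of an entity occurs in a document; the surface-form profiles and the example data are placeholders, not the actual system studied in this paper.

\begin{verbatim}
# Toy sketch of entity-based filtering: emit candidate document-entity
# pairs when a profile string occurs in the document text. The profile
# (surface names per entity) is an assumed, simplified representation.
def filter_stream(documents, entity_profiles):
    # documents: iterable of (doc_id, text)
    # entity_profiles: dict mapping entity id -> list of name variants
    for doc_id, text in documents:
        lowered = text.lower()
        for entity, names in entity_profiles.items():
            if any(name.lower() in lowered for name in names):
                yield entity, doc_id

docs = [("doc-1", "A post mentioning Example Entity in passing.")]
profiles = {"Example_Entity": ["Example Entity"]}
print(list(filter_stream(docs, profiles)))
\end{verbatim}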
 
 
@@ -231,20 +231,23 @@ stream corpus are to be considered by the human curator.
 
raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB,

respectively, after xz-compression and GPG encryption. The raw data

is a dump of raw HTML pages. The cleansed version is the raw data
 
after its HTML tags are stripped off, and only English documents

identified with the Chromium Compact Language Detector
 
\footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}
 
-are included.  The stream corpus is organized in hourly folders each

-of which contains many  chunk files. Each chunk file contains between

-hundreds and hundreds of thousands of serialized  thrift objects. One

-thrift object is one document. A document could be a blog article, a

-news article, or a social media post (including tweet).  The stream

-corpus comes from three sources: TREC KBA 2012 (social, news and

-linking) \footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}},

-arxiv\footnote{\url{http://arxiv.org/}}, and
 
+are included. The stream corpus is organized in hourly folders, each

+of which contains many ``chunk files''. The chunk files contain

+from hundreds up to hundreds of thousands of semi-structured documents,

+serialized as thrift objects (one thrift object corresponding to one

+document). A document could be a blog article, a

+news article, or a social media post (including tweets). The stream

+corpus has been derived from three main sources: TREC KBA 2012

+(containing blogs, news and web pages with shortened URLs)%

+\footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}%

+},

+arXiv\footnote{\url{http://arxiv.org/}}, and
 
spinn3r\footnote{\url{http://spinn3r.com/}}.
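To make this layout concrete, the sketch below walks the hourly directories and reads the serialized thrift documents, assuming the streamcorpus Python package that accompanies the TREC KBA data and chunk files that have already been GPG-decrypted and xz-decompressed; the corpus path is a placeholder.

\begin{verbatim}
# Sketch of iterating over the corpus layout described above. Assumes the
# streamcorpus Python package and chunk files that are already decrypted
# and decompressed; the corpus_root path is a placeholder.
import os
from streamcorpus import Chunk

corpus_root = "kba-streamcorpus-2013"   # hourly directories live here

for hour_dir in sorted(os.listdir(corpus_root)):
    hour_path = os.path.join(corpus_root, hour_dir)
    for chunk_name in sorted(os.listdir(hour_path)):
        # Each chunk file holds many serialized StreamItem thrift objects,
        # one per document (blog post, news article, tweet, ...).
        for item in Chunk(path=os.path.join(hour_path, chunk_name)):
            text = item.body.clean_visible   # empty for the raw version
            source = item.source             # e.g. news, social, arxiv
            # hand (item.stream_id, text, source) to the filtering step
\end{verbatim}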
 
Table \ref{tab:streams} shows the sources, the number of hourly
 
directories, and the number of chunk files.
 
\begin{table}
 
\caption{Documents retrieved from the different sources}
 
\begin{center}
 
@@ -268,13 +271,17 @@ directories, and the number of chunk files.
 
\end{center}
 
\label{tab:streams}
 
\end{table}
 
 
\subsection{KB entities}
 
 
-The KB entities consist of 20 Twitter entities and 121 Wikipedia entities. The selected entities are, on purpose, sparse. The entities consist of 71 people, 1 organization, and 24 facilities.
 
+The KB entities in the dataset consist of 20 Twitter entities and 121

+Wikipedia entities. These selected entities are, on purpose, ``sparse

+entities'' in the sense that they only occur in relatively few

+documents. The collection of entities consists of 71 person entities,

+1 organization entity, and 24 facility entities.
 
 
\subsection{Relevance judgments}
 
 
TREC-KBA provided relevance judgments for training and
 
testing. Relevance judgments are given as document-entity
 
pairs. Documents with citation-worthy content for a given entity are