diff --git a/mypaper-final.tex b/mypaper-final.tex
index c4c099ae7d2d60cf2496f8871be9fe637c9f3336..8939e984f17e9a8b444f774a1f1529a48e24db3e 100644
--- a/mypaper-final.tex
+++ b/mypaper-final.tex
@@ -198,11 +198,11 @@ and zoom in on \emph{only} the filtering step; and conduct an in-depth analysis
 main components. In particular, we study the effect of cleansing,
 entity profiling, type of entity filtered for (Wikipedia or Twitter),
 and document category (social, news, etc) on the filtering components'
-performance. The main contribution of the
+performance. The main contributions of the
 paper are an in-depth analysis of the factors that affect
 entity-based stream filtering, identifying optimal entity profiles
 without compromising precision, describing and classifying relevant documents
-that are not amenable to filtering , and estimating the upper-bound
+that are not amenable to filtering, and estimating the upper-bound
 of recall on entity-based filtering.
 
 
@@ -220,7 +220,7 @@ section \ref{sec:unfil}. Finally, we present our conclusions in
 
 \section{Data Description}\label{sec:desc}
 
-We base this analysis on the TREC-KBA 2013 dataset%
+Our study is carried out using the TREC-KBA 2013 dataset%
 \footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}} that
 consists of three main parts: a time-stamped stream corpus, a set of
 KB entities to be curated, and a set of relevance judgments. A CCR
@@ -234,14 +234,17 @@ is a dump of raw HTML pages. The cleansed version is the raw data
 after its HTML tags are stripped off and only English documents
 identified with Chromium Compact Language Detector
 \footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}
-are included. The stream corpus is organized in hourly folders each
-of which contains many chunk files. Each chunk file contains between
-hundreds and hundreds of thousands of serialized thrift objects. One
-thrift object is one document. A document could be a blog article, a
-news article, or a social media post (including tweet). The stream
-corpus comes from three sources: TREC KBA 2012 (social, news and
-linking) \footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}},
-arxiv\footnote{\url{http://arxiv.org/}}, and
+are included. The stream corpus is organized in hourly folders, each
+of which contains many ``chunk files''. The chunk files contain
+hundreds to hundreds of thousands of semi-structured documents,
+serialized as thrift objects (one thrift object corresponding to one
+document). A document could be a blog article, a
+news article, or a social media post (including tweets). The stream
+corpus has been derived from three main sources: TREC KBA 2012
+(containing blogs, news and web pages with shortened URLs)%
+\footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}%
+},
+arXiv\footnote{\url{http://arxiv.org/}}, and
 spinn3r\footnote{\url{http://spinn3r.com/}}. Table \ref{tab:streams}
 shows the sources, the number of hourly directories, and the number
 of chunk files.
@@ -271,7 +274,11 @@ directories, and the number of chunk files.
 
 \subsection{KB entities}
 
- The KB entities consist of 20 Twitter entities and 121 Wikipedia entities. The selected entities are, on purpose, sparse. The entities consist of 71 people, 1 organization, and 24 facilities.
+The KB entities in the dataset consist of 20 Twitter entities and 121
+Wikipedia entities. These entities were selected, on purpose, to be
+``sparse entities'' in the sense that they occur in relatively few
+documents. The collection comprises 71 person entities,
+1 organization entity, and 24 facility entities.
 
 \subsection{Relevance judgments}