diff --git a/mypaper-final.tex b/mypaper-final.tex index bcfb7c9d7314942f4f533165204f4a50979697e9..d85f86c9bb90a36f370d5ee68e37fb5168db177d 100644 --- a/mypaper-final.tex +++ b/mypaper-final.tex @@ -129,19 +129,9 @@ relevance of the document-entity pair under consideration. We analyze how these factors (and the design choices made in their corresponding system components) affect filtering performance. We identify and characterize the relevant documents that do not pass -<<<<<<< HEAD -<<<<<<< HEAD -the filtering stage by examining their contents. This way, we give -estimate of a practical upper-bound of recall for entity-centric stream -======= the filtering stage by examining their contents. This way, we estimate a practical upper-bound of recall for entity-centric stream ->>>>>>> 68fbea2f0372ab9b4199b88f980dbf5e97f49063 -======= -the filtering stage by examing their contents. This way, we -estimate a practical upper-bound of recall for entity-centric stream ->>>>>>> 3eb20e9cca3d074a4001a593e626a9269cb5608c -filtering. +filtering. \end{abstract} % A category with the (minimum) three required fields @@ -225,33 +215,6 @@ The rest of the paper is organized as follows: \textbf{TODO!!} \section{Data Description} -<<<<<<< HEAD -We base this analysis on the TREC-KBA 2013 dataset% -\footnote{http://http://trec-kba.org/trec-kba-2013.shtml} -that consists of three main parts: a time-stamped stream corpus, a set of -KB entities to be curated, and a set of relevance judgments. A CCR -system now has to identify for each KB entity which documents in the -stream corpus are to be considered by the human curator. - -\subsection{Stream corpus} The stream corpus comes in two versions: -raw and cleaned. The raw and cleansed versions are 6.45TB and 4.5TB -respectively, after xz-compression and GPG encryption. The raw data -is a dump of raw HTML pages. 
The cleansed version is the raw data -after its HTML tags are stripped off and only English documents -identified with Chromium Compact Language Detector -\footnote{https://code.google.com/p/chromium-compact-language-detector/} -are included. The stream corpus is organized in hourly folders each -of which contains many chunk files. Each chunk file contains between -hundreds and hundreds of thousands of serialized thrift objects. One -thrift object is one document. A document could be a blog article, a -news article, or a social media post (including tweet). The stream -corpus comes from three sources: TREC KBA 2012 (social, news and -linking) \footnote{http://trec-kba.org/kba-stream-corpus-2012.shtml}, -arxiv\footnote{http://arxiv.org/}, and -spinn3r\footnote{http://spinn3r.com/}. -Table \ref{tab:streams} shows the sources, the number of hourly -directories, and the number of chunk files. -======= We base this analysis on the TREC-KBA 2013 dataset% \footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}} that consists of three main parts: a time-stamped stream corpus, a set of @@ -277,7 +240,6 @@ arxiv\footnote{\url{http://arxiv.org/}}, and spinn3r\footnote{\url{http://spinn3r.com/}}. Table \ref{tab:streams} shows the sources, the number of hourly directories, and the number of chunk files. ->>>>>>> 3eb20e9cca3d074a4001a593e626a9269cb5608c \begin{table} \caption{Retrieved documents from different sources} \begin{center} @@ -458,10 +420,29 @@ Redirect &49 \\ \end{table} -<<<<<<< HEAD -We have a total of 121 Wikipedia entities. Every entity has a DBpedia label. Only 82 entities have a name string and only 49 entities have redirect strings. Most of the entities have only one string, but some have several redirect sterings. One entity, Buddy\_MacKay, has the highest (12) number of redirect strings. 6 entities have birth names, 1 entity has a nick name, 1 entity has alias and 4 entities have alternative names. 
- -We combined the different name variants we extracted to form a set of strings for each KB entity. For Twitter entities, we used the display names that we collected . We consider the names of the entities that are part of the URL as canonical. For example in http://en.wikipedia.org/wiki/Benjamin\_Bronfman, Benjamin Bronfman is a canonical name of the entity. From the combined name variants and the canonical names, we created four sets of profiles for each entity: canonical(cano) canonical partial (cano-part), all name variants combined (all) and partial names of all name variants(all-part). We refer to the last two profiles as name-variant and name-variant partial. The names in paranthesis are used in table captions. +The collection contains a total of 121 Wikipedia entities. +Every entity has a corresponding DBpedia label. Only 82 entities have +a name string and only 49 entities have redirect strings. (Most of the +entities have only one string, except for a few cases with multiple +redirect strings; Buddy\_MacKay has the highest number (12) of +redirect strings.) + +We combine the different name variants we extracted to form a set of +strings for each KB entity. For Twitter entities, we use the display +names that we collected. We consider the name of an entity that is +part of its URL as canonical. For example, in the entity URL\\ +\url{http://en.wikipedia.org/wiki/Benjamin_Bronfman}\\ +Benjamin Bronfman is the canonical name of the entity. +An example is given in Table \ref{tab:profile}. + +From the combined name variants and +the canonical names, we create four sets of profiles for each +entity: canonical (cano), canonical partial (cano-part), all name +variants combined (all), and partial names of all name +variants (all-part). We refer to the last two profiles as name-variant +and name-variant partial. The names in parentheses are used in table +captions. 
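As an illustration, the construction of the four profile sets described above can be sketched in Python. The helper names, the whitespace tokenisation used for the "partial" profiles, and the "Ben Bronfman" name variant are our assumptions for the sake of the example, not the authors' exact implementation:

```python
from urllib.parse import unquote

def canonical_name(wiki_url):
    # Take the trailing path segment of the entity URL and map
    # underscores back to spaces (assumption: URL-style naming).
    return unquote(wiki_url.rsplit("/", 1)[-1]).replace("_", " ")

def build_profiles(wiki_url, name_variants):
    # Build the four profile sets named in the text:
    # cano, cano-part, all, all-part.
    cano = {canonical_name(wiki_url)}
    cano_part = {tok for name in cano for tok in name.split()}
    all_names = cano | set(name_variants)
    all_part = {tok for name in all_names for tok in name.split()}
    return {"cano": cano, "cano-part": cano_part,
            "all": all_names, "all-part": all_part}

# Worked example with the Benjamin Bronfman entity; "Ben Bronfman"
# is a hypothetical name variant used only for illustration.
profiles = build_profiles(
    "http://en.wikipedia.org/wiki/Benjamin_Bronfman",
    ["Ben Bronfman"])
```

Under this sketch, the cano profile is {"Benjamin Bronfman"}, while the all-part profile also picks up the token "Ben" from the extra name variant.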
+ \begin{table*} \caption{Example entity profiles (upper part Wikipedia, lower part Twitter)} @@ -480,36 +461,8 @@ We combined the different name variants we extracted to form a set of strings f \hline \end{tabular} \end{center} -\label{tab:breakdown} +\label{tab:profile} \end{table*} - - - - -======= -The collection contains a total number of 121 Wikipedia entities. -Every entity has a corresponding DBpedia label. Only 82 entities have -a name string and only 49 entities have redirect strings. (Most of the -entities have only one string, except for a few cases with multiple -redirect strings; Buddy\_MacKay, has the highest (12) number of -redirect strings.) - -We combine the different name variants we extracted to form a set of -strings for each KB entity. For Twitter entities, we used the display -names that we collected. - -We consider the names of the entities that -are part of the URL as canonical. For example in entity\\ -\url{http://en.wikipedia.org/wiki/Benjamin_Bronfman}\\ -Benjamin Bronfman is a canonical name of the entity. From the combined name variants and -the canonical names, we created four sets of profiles for each -entity: canonical(cano) canonical partial (cano-part), all name -variants combined (all) and partial names of all name -variants(all-part). We refer to the last two profiles as name-variant -and name-variant partial. The names in parentheses are used in table -captions. - ->>>>>>> 3eb20e9cca3d074a4001a593e626a9269cb5608c \subsection{Annotation Corpus} The annotation set combines the annotations from before the Training Time Range (TTR) and from the Evaluation Time Range (ETR), and consists of 68405 annotations. Its breakdown into training and test sets is shown in Table \ref{tab:breakdown}.