diff --git a/mypaper-final.tex b/mypaper-final.tex
index c35b576841de63d7dccd375e50472815ac1513dd..f2d442259cf3b05e92aa4b74be669f7276e228b8 100644
--- a/mypaper-final.tex
+++ b/mypaper-final.tex
@@ -30,7 +30,7 @@
 \usepackage{booktabs}
 \usepackage{multirow}
 \usepackage{todonotes}
-\usepackage{url}
+\usepackage{url}

 \begin{document}

@@ -65,7 +65,7 @@
 % without further effort on your part as the last section in
 % the body of your article BEFORE References or any Appendices.

-\numberofauthors{2} % in this sample file, there are a *total*
+\numberofauthors{8} % in this sample file, there are a *total*
 % of EIGHT authors. SIX appear on the 'first-page' (for formatting
 % reasons) and the remaining two appear in the \additionalauthors section.
 %
@@ -131,7 +131,7 @@
 corresponding system components) affect filtering performance. We identify
 and characterize the relevant documents that do not pass the filtering
 stage by examining their contents. This way, we estimate a practical
 upper-bound of recall for entity-centric stream
-filtering.
+filtering.

 \end{abstract}

 % A category with the (minimum) three required fields
@@ -215,31 +215,31 @@
 The rest of the paper is organized as follows: \textbf{TODO!!}

 \section{Data Description}

-We base this analysis on the TREC-KBA 2013 dataset%
-\footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}}
-that consists of three main parts: a time-stamped stream corpus, a set of
-KB entities to be curated, and a set of relevance judgments. A CCR
-system now has to identify for each KB entity which documents in the
-stream corpus are to be considered by the human curator.
-
-\subsection{Stream corpus} The stream corpus comes in two versions:
-raw and cleaned. The raw and cleansed versions are 6.45TB and 4.5TB
-respectively, after xz-compression and GPG encryption. The raw data
-is a dump of raw HTML pages. The cleansed version is the raw data
-after its HTML tags are stripped off and only English documents
-identified with Chromium Compact Language Detector
-\footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}
-are included. The stream corpus is organized in hourly folders each
-of which contains many chunk files. Each chunk file contains between
-hundreds and hundreds of thousands of serialized thrift objects. One
-thrift object is one document. A document could be a blog article, a
-news article, or a social media post (including tweet). The stream
-corpus comes from three sources: TREC KBA 2012 (social, news and
-linking) \footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}},
-arxiv\footnote{\url{http://arxiv.org/}}, and
-spinn3r\footnote{\url{http://spinn3r.com/}}.
-Table \ref{tab:streams} shows the sources, the number of hourly
-directories, and the number of chunk files.
+We base this analysis on the TREC-KBA 2013 dataset%
+\footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}}
+that consists of three main parts: a time-stamped stream corpus, a set of
+KB entities to be curated, and a set of relevance judgments. A CCR
+system has to identify, for each KB entity, which documents in the
+stream corpus should be considered by the human curator.
+
+\subsection{Stream corpus} The stream corpus comes in two versions:
+raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB
+respectively, after xz-compression and GPG encryption. The raw data
+is a dump of raw HTML pages. The cleansed version is derived from the
+raw data by stripping the HTML tags and keeping only the English
+documents, as identified with the Chromium Compact Language Detector%
+\footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}.
+The stream corpus is organized in hourly folders, each of which
+contains many chunk files. A chunk file contains between a few hundred
+and hundreds of thousands of serialized thrift objects; one thrift
+object is one document. A document can be a blog article, a news
+article, or a social media post (including tweets). The stream corpus
+comes from three sources: TREC KBA 2012 (social, news and
+linking)\footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}},
+arxiv\footnote{\url{http://arxiv.org/}}, and
+spinn3r\footnote{\url{http://spinn3r.com/}}.
+Table \ref{tab:streams} shows the sources, the number of hourly
+directories, and the number of chunk files.
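+For concreteness, a minimal sketch of reading one chunk file with the
+\texttt{streamcorpus} Python package (an assumption for illustration;
+the chunk is taken to be already GPG-decrypted and xz-decompressed):
+\begin{verbatim}
+# pip install streamcorpus
+from streamcorpus import Chunk
+
+for si in Chunk('chunk.sc'):      # one thrift object = one document
+    text = si.body.clean_visible  # cleansed text; None in raw data
+    print(si.stream_id, si.source, len(text or ''))
+\end{verbatim}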
 \begin{table}
 \caption{Documents retrieved from the different sources}
 \begin{center}
@@ -388,16 +388,16 @@
 from DBpedia and Twitter. \

 \subsection{Entity Profiling}

-We build entity profiles for the KB entities of interest. We have two
-types: Twitter and Wikipedia. Both entities have been selected, on
-purpose by the track organisers, to occur only sparsely and be less-documented.
-For the Wikipedia entities, we fetch different name variants
-from DBpedia: name, label, birth name, alternative names,
-redirects, nickname, or alias.
-These extraction results are summarized in Table
-\ref{tab:sources}.
-For the Twitter entities, we visit
-their respective Twitter pages and fetch their display names.
+We build entity profiles for the KB entities of interest. These are of
+two types: Twitter and Wikipedia. Entities of both types have been
+selected on purpose by the track organisers to occur only sparsely and
+to be less well documented.
+For the Wikipedia entities, we fetch different name variants
+from DBpedia: name, label, birth name, alternative names,
+redirects, nickname, and alias.
+These extraction results are summarized in Table
+\ref{tab:sources}.
+For the Twitter entities, we visit
+their respective Twitter pages and fetch their display names.

 \begin{table}
 \caption{Number of different DBpedia name variants}
 \begin{center}
@@ -420,29 +420,29 @@
 Redirect &49 \\
 \end{table}

-The collection contains a total number of 121 Wikipedia entities.
-Every entity has a corresponding DBpedia label. Only 82 entities have
-a name string and only 49 entities have redirect strings. (Most of the
-entities have only one string, except for a few cases with multiple
-redirect strings; Buddy\_MacKay, has the highest (12) number of
-redirect strings.)
-
-We combine the different name variants we extracted to form a set of
-strings for each KB entity. For Twitter entities, we used the display
-names that we collected. We consider the names of the entities that
-are part of the URL as canonical. For example in entity\\
-\url{http://en.wikipedia.org/wiki/Benjamin_Bronfman}\\
-Benjamin Bronfman is a canonical name of the entity.
-An example is given in Table \ref{tab:profile}.
-
-From the combined name variants and
-the canonical names, we created four sets of profiles for each
-entity: canonical(cano) canonical partial (cano-part), all name
-variants combined (all) and partial names of all name
-variants(all-part). We refer to the last two profiles as name-variant
-and name-variant partial. The names in parentheses are used in table
-captions.
-
+The collection contains 121 Wikipedia entities in total.
+Every entity has a corresponding DBpedia label, but only 82 entities
+have a name string and only 49 have redirect strings. (Most entities
+have only one redirect string; a few have several, and
+Buddy\_MacKay has the highest number, 12.)
+
+We combine the different name variants we extracted to form a set of
+strings for each KB entity. For Twitter entities, we use the display
+names that we collected. We consider the name of an entity that is
+part of its URL as canonical. For example, for the entity\\
+\url{http://en.wikipedia.org/wiki/Benjamin_Bronfman},\\
+Benjamin Bronfman is the canonical name.
+An example is given in Table \ref{tab:profile}.
+
+From the combined name variants and
+the canonical names, we create four profiles for each
+entity: canonical (cano), canonical partial (cano-part), all name
+variants combined (all), and partial names of all name
+variants (all-part). We refer to the last two profiles as name-variant
+and name-variant partial. The names in parentheses are used in table
+captions.
+
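+A minimal sketch of this construction (assuming that ``partial names''
+are the whitespace-separated parts of each name, and using the
+hypothetical variant `Ben Bronfman' purely for illustration):
+\begin{verbatim}
+def build_profiles(canonical, variants):
+    # variants: name strings fetched from DBpedia (or the
+    # Twitter display name); canonical: name from the URL
+    def parts(names):
+        return {tok for name in names for tok in name.split()}
+    cano = {canonical}
+    all_names = {canonical} | set(variants)
+    return {'cano': cano,
+            'cano-part': cano | parts(cano),
+            'all': all_names,
+            'all-part': all_names | parts(all_names)}
+
+profiles = build_profiles('Benjamin Bronfman', ['Ben Bronfman'])
+\end{verbatim}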
 \begin{table*}
 \caption{Example entity profiles (upper part Wikipedia, lower part
 Twitter)}
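+These profiles drive the filtering stage: a document passes the filter
+for an entity if one of the profile strings occurs in it. A minimal
+sketch, assuming case-insensitive substring matching (the exact
+matching rule is an assumption for illustration):
+\begin{verbatim}
+def passes_filter(doc_text, profile):
+    # profile: one of the four string sets built above
+    text = doc_text.lower()
+    return any(name.lower() in text for name in profile)
+
+passes_filter('Interview with Benjamin Bronfman ...',
+              {'Benjamin Bronfman'})   # -> True
+\end{verbatim}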
@@ -500,11 +500,11 @@
 The annotation set is a combination of the annotations from before the Training

-%Most (more than 80\%) of the annotation documents are in the test set.
-The 2013 training and test data contain 68405
-annotations, of which 50688 are unique document-entity pairs. Out of
-these, 24162 unique document-entity pairs are vital (9521) or relevant
-(17424).
+%Most (more than 80\%) of the annotation documents are in the test set.
+The 2013 training and test data contain 68405
+annotations, of which 50688 are unique document-entity pairs. Of
+these, 24162 unique document-entity pairs are vital (9521) or relevant
+(17424).

@@ -617,12 +617,12 @@
 If we look at the recall performances for the raw corpus, filtering documents

 \subsection{Relevance Rating: vital and relevant}

-When comparing recall for vital and relevant, we observe that
-canonical names are more effective for vital than for relevant
-entities, in particular for the Wikipedia entities.
-%For example, the recall for news is 80.1 and for social is 76, while the corresponding recall in relevant is 75.6 and 63.2 respectively.
-We conclude that the most relevant documents mention the
-entities by their common name variants.
+When comparing recall for vital and relevant, we observe that
+canonical names are more effective for vital than for relevant
+documents, in particular for the Wikipedia entities.
+%For example, the recall for news is 80.1 and for social is 76, while the corresponding recall in relevant is 75.6 and 63.2 respectively.
+We conclude that the most relevant documents mention the
+entities by their common name variants.
 % \subsection{Difference by document categories}
 %
@@ -635,17 +635,17 @@
 entities by their common name variants.

 \subsection{Recall across document categories: others, news and social}

-The recall for Wikipedia entities in Table \ref{tab:name} ranged from
-61.8\% (canonicals) to 77.9\% (name-variants). Table
-\ref{tab:source-delta} shows how recall is distributed across document
-categories. For Wikipedia entities, across all entity profiles, others
-have a higher recall followed by news, and then by social. While the
-recall for news ranges from 76.4\% to 98.4\%, the recall for social
-documents ranges from 65.7\% to 86.8\%. In Twitter entities, however,
-the pattern is different. In canonicals (and their partials), social
-documents achieve higher recall than news.
+The recall for Wikipedia entities in Table \ref{tab:name} ranges from
+61.8\% (canonicals) to 77.9\% (name-variants). Table
+\ref{tab:source-delta} shows how recall is distributed across document
+categories. For Wikipedia entities, across all entity profiles, the
+others category has the highest recall, followed by news and then
+social. While the recall for news ranges from 76.4\% to 98.4\%, the
+recall for social documents ranges from 65.7\% to 86.8\%. For Twitter
+entities, however, the pattern is different. For canonicals (and their
+partials), social documents achieve higher recall than news.
 %This indicates that social documents refer to Twitter entities by their canonical names (user names) more than news do. In name-variant partial, news achieve better results than social. The difference in recall between canonicals and name-variants shows that news do not refer to Twitter entities by their user names; they refer to them by their display names.
-Overall, across all entities types and all entity profiles, documents
+Overall, across all entity types and all entity profiles, documents
 in the others category achieve a higher recall than news, and news
 documents, in turn, achieve higher recall than social documents.
 % This suggests that social documents are the hardest to retrieve. This makes sense since social posts such as tweets and blogs are short and are more likely to point to other resources, or use short informal names.
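+For clarity, the recall in these tables can be read as the retained
+fraction of judged document-entity pairs (our reading of the setup):
+with $J$ the set of vital or relevant document-entity pairs in the
+annotation set and $F$ the set of pairs that pass the filter,
+\[
+  \mathrm{recall} = \frac{|J \cap F|}{|J|}.
+\]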
@@ -672,59 +672,59 @@
 in the others category achieve a higher recall than news, and news documents, in
 % all\_part in relevant.

 \subsection{Entity Types: Wikipedia and Twitter}

-Table \ref{tab:name} summarizes the differences between Wikipedia and
-Twitter entities. Wikipedia entities' canonical representation
-achieves a recall of 70\%, while canonical partial achieves a recall of 86.1\%. This is an
-increase in recall of 16.1\%. By contrast, the increase in recall of
-name-variant partial over name-variant is 8.3\%.
-%This high increase in recall when moving from canonical names to their
-%partial names, in comparison to the lower increase when moving from
-%all name variants to their partial names can be explained by
-%saturation: documents have already been extracted by the different
-%name variants and thus using their partial names do not bring in many
-%new relevant documents.
-For Wikipedia entities, canonical
-partial achieves better recall than name-variant in both the cleansed and
-the raw corpus. %In the raw extraction, the difference is about 3.7.
-In Twitter entities, recall of canonical matching is very low.%
-\footnote{Canonical
-and canonical partial are the same for Twitter entities because they
-are one word strings. For example in https://twitter.com/roryscovel,
-``roryscovel`` is the canonical name and its partial is identical.}
-%The low recall is because the canonical names of Twitter entities are
-%not really names; they are usually arbitrarily created user names. It
-%shows that documents refer to them by their display names, rarely
-%by their user name, which is reflected in the name-variant recall
-%(67.9\%). The use of name-variant partial increases the recall to
-%88.2\%.
-
-The tables in \ref{tab:name} and \ref{tab:source-delta} show a higher recall
-for Wikipedia than for Twitter entities. Generally, at both
-aggregate and document category levels, we observe that recall
-increases as we move from canonicals to canonical partial, to
-name-variant, and to name-variant partial. The only case where this
-does not hold is in the transition from Wikipedia's canonical partial
-to name-variant. At the aggregate level (as can be inferred from Table
-\ref{tab:name}), the difference in performance between canonical and
-name-variant partial is 31.9\% on all entities, 20.7\% on Wikipedia
-entities, and 79.5\% on Twitter entities.
-
-Section \ref{sec:analysis} discusses the most plausible explanations for these findings.
+Table \ref{tab:name} summarizes the differences between Wikipedia and
+Twitter entities. The canonical representation of Wikipedia entities
+achieves a recall of 70\%, while canonical partial achieves a recall
+of 86.1\%, an increase of 16.1\%. By contrast, the increase in recall of
+name-variant partial over name-variant is 8.3\%.
+%This high increase in recall when moving from canonical names to their
+%partial names, in comparison to the lower increase when moving from
+%all name variants to their partial names can be explained by
+%saturation: documents have already been extracted by the different
+%name variants and thus using their partial names do not bring in many
+%new relevant documents.
+For Wikipedia entities, canonical
+partial achieves better recall than name-variant in both the cleansed and
+the raw corpus. %In the raw extraction, the difference is about 3.7.
+For Twitter entities, recall of canonical matching is very low.%
+\footnote{Canonical
+and canonical partial are the same for Twitter entities because they
+are one-word strings. For example, in \url{https://twitter.com/roryscovel},
+``roryscovel'' is the canonical name and its partial is identical.}
+%The low recall is because the canonical names of Twitter entities are
+%not really names; they are usually arbitrarily created user names. It
+%shows that documents refer to them by their display names, rarely
+%by their user name, which is reflected in the name-variant recall
+%(67.9\%). The use of name-variant partial increases the recall to
+%88.2\%.
+
+Tables \ref{tab:name} and \ref{tab:source-delta} show a higher recall
+for Wikipedia than for Twitter entities. Generally, at both the
+aggregate and the document category level, we observe that recall
+increases as we move from canonicals to canonical partial, to
+name-variant, and to name-variant partial. The only case where this
+does not hold is the transition from Wikipedia's canonical partial
+to name-variant. At the aggregate level (as can be inferred from Table
+\ref{tab:name}), the difference in recall between canonical and
+name-variant partial is 31.9\% on all entities, 20.7\% on Wikipedia
+entities, and 79.5\% on Twitter entities.
+
+Section \ref{sec:analysis} discusses the most plausible explanations for these findings.
 %% TODO: PERHAPS SUMMARY OF DISCUSSION HERE
-
+
 \section{Impact on classification}

-In the overall experimental setup, classification, ranking, and
-evaluation are kept constant. Following \cite{balog2013multi}
-settings, we use
-WEKA's\footnote{\url{http://www.cs.waikato.ac.nz/~ml/weka/}} Classification
-Random Forest. However, we use fewer numbers of features which we
-found to be more effective. We determined the effectiveness of the
-features by running the classification algorithm using the fewer
-features we implemented and their features. Our feature
-implementations achieved better results. The total numbers of
-features we used are 13 and are listed below.
+In the overall experimental setup, classification, ranking, and
+evaluation are kept constant. Following the settings of
+\cite{balog2013multi}, we use
+WEKA's\footnote{\url{http://www.cs.waikato.ac.nz/~ml/weka/}}
+Random Forest classifier. However, we use fewer features, which we
+found to be more effective: we ran the classification algorithm once
+with our feature implementations and once with their features, and
+our implementations achieved better results. In total we use 13
+features, listed below.
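+As a minimal sketch of this step (using scikit-learn's random forest
+as a stand-in for WEKA's implementation, purely for illustration),
+each document-entity pair is represented by a 13-dimensional feature
+vector; the 13 features themselves are described below:
+\begin{verbatim}
+import numpy as np
+from sklearn.ensemble import RandomForestClassifier
+
+rng = np.random.default_rng(0)
+X = rng.random((100, 13))         # one 13-dim vector per pair
+y = rng.integers(0, 2, size=100)  # stand-in relevance labels
+clf = RandomForestClassifier(n_estimators=100, random_state=0)
+clf.fit(X, y)
+print(clf.predict(rng.random((5, 13))))
+\end{verbatim}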
 \paragraph*{Google's Cross Lingual Dictionary (GCLD)}

@@ -961,7 +961,9 @@ There is a trade-off between using a richer entity-profile and retrieval of irre
 In vital ranking, across all entity profiles and types of corpus, Wikipedia's
 canonical partial achieves better performance than any other Wikipedia entity
 profile. In vital-relevant documents too, Wikipedia's canonical partial
 achieves the best result. In the raw corpus, it achieves a little less than
 name-variant partial. For Twitter entities, the name-variant partial profile
 achieves the highest F-score across all entity profiles and types of corpus.

-Cleansing impacts Twitter
+There are three interesting observations:
+
+1) Cleansing impacts Twitter
 entities and relevant documents. This is validated by the observation
 that recall gains in Twitter entities and the relevant categories in
 the raw corpus also translate into overall performance
@@ -979,7 +981,7 @@
 transformation and cleansing processes.

 %%%% NEEDS WORK:

-Taking both performance (recall at filtering and overall F-score
+2) Taking both performance (recall at filtering and overall F-score
 during evaluation) into account, there is a clear trade-off between using a
 richer entity profile and retrieving irrelevant documents. The richer the
 profile, the more relevant documents it retrieves, but also the more
 irrelevant ones. To put this into perspective, let us compare the number of
 documents retrieved with canonical partial and with name-variant partial.
 Using the raw corpus, the former retrieves a total of 2547487 documents and
 achieves a recall of 72.2\%. By contrast, the latter retrieves a total of
 4735318 documents and achieves a recall of 90.2\%. The total number of
 documents extracted increases by 85.9\% for a recall gain of 18\%. The rest
 of the documents, that is 67.9\%, are newly introduced irrelevant documents.
 Wikipedia's canonical partial is the best entity profile for Wikipedia
 entities. It is interesting to see that the retrieval of thousands of
 additional vital-relevant document-entity pairs by name-variant partial does
 not translate into an increase in overall performance. It is even more
 interesting since, to the best of our knowledge, canonical partial was not
 considered as a contending profile for stream filtering by any participant.
 With this understanding, there is actually no need to fetch different name
 variants from DBpedia, saving time and computational resources.
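+For concreteness, the arithmetic behind these figures:
+\[
+  \frac{4735318 - 2547487}{2547487} \approx 85.9\%,
+  \qquad 90.2\% - 72.2\% = 18\%.
+\]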