% THIS IS SIGPROC-SP.TEX - VERSION 3.1
% WORKS WITH V3.2SP OF ACM_PROC_ARTICLE-SP.CLS
% APRIL 2009
%
% It is an example file showing how to use the 'acm_proc_article-sp.cls' V3.2SP
% LaTeX2e document class file for Conference Proceedings submissions.
% ----------------------------------------------------------------------------------------------------------------
% This .tex file (and associated .cls V3.2SP) *DOES NOT* produce:
% 1) The Permission Statement
% 2) The Conference (location) Info information
% 3) The Copyright Line with ACM data
% 4) Page numbering
% ---------------------------------------------------------------------------------------------------------------
% It is an example which *does* use the .bib file (from which the .bbl file
% is produced).
% REMEMBER HOWEVER: After having produced the .bbl file,
% and prior to final submission,
% you need to 'insert' your .bbl file into your source .tex file so as to provide
% ONE 'self-contained' source file.
%
% Questions regarding SIGS should be sent to
% Adrienne Griscti ---> griscti@acm.org
%
% Questions/suggestions regarding the guidelines, .tex and .cls files, etc. to
% Gerald Murray ---> murray@hq.acm.org
%
% For tracking purposes - this is V3.1SP - APRIL 2009
\documentclass{acm_proc_article-sp}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{todonotes}
\begin{document}
\title{Entity-Centric Stream Filtering and Ranking: Filtering and Unfilterable Documents}
%
% You need the command \numberofauthors to handle the 'placement
% and alignment' of the authors beneath the title.
%
% For aesthetic reasons, we recommend 'three authors at a time'
% i.e. three 'name/affiliation blocks' be placed beneath the title.
%
% NOTE: You are NOT restricted in how many 'rows' of
% "name/affiliations" may appear. We just ask that you restrict
% the number of 'columns' to three.
%
% Because of the available 'opening page real-estate'
% we ask you to refrain from putting more than six authors
% (two rows with three columns) beneath the article title.
% More than six makes the first-page appear very cluttered indeed.
%
% Use the \alignauthor commands to handle the names
% and affiliations for an 'aesthetic maximum' of six authors.
% Add names, affiliations, addresses for
% the seventh etc. author(s) as the argument for the
% \additionalauthors command.
% These 'additional authors' will be output/set for you
% without further effort on your part as the last section in
% the body of your article BEFORE References or any Appendices.
\numberofauthors{8} % in this sample file, there are a *total*
% of EIGHT authors. SIX appear on the 'first-page' (for formatting
% reasons) and the remaining two appear in the \additionalauthors section.
%
% \author{
% % You can go ahead and credit any number of authors here,
% % e.g. one 'row of three' or two rows (consisting of one row of three
% % and a second row of one, two or three).
% %
% % The command \alignauthor (no curly braces needed) should
% % precede each author name, affiliation/snail-mail address and
% % e-mail address. Additionally, tag each line of
% % affiliation/address with \affaddr, and tag the
% % e-mail address with \email.
% %
% % 1st. author
% \alignauthor
% Ben Trovato\titlenote{Dr.~Trovato insisted his name be first.}\\
% \affaddr{Institute for Clarity in Documentation}\\
% \affaddr{1932 Wallamaloo Lane}\\
% \affaddr{Wallamaloo, New Zealand}\\
% \email{trovato@corporation.com}
% % 2nd. author
% \alignauthor
% G.K.M. Tobin\titlenote{The secretary disavows
% any knowledge of this author's actions.}\\
% \affaddr{Institute for Clarity in Documentation}\\
% \affaddr{P.O. Box 1212}\\
% \affaddr{Dublin, Ohio 43017-6221}\\
% \email{webmaster@marysville-ohio.com}
% }
% There's nothing stopping you putting the seventh, eighth, etc.
% author on the opening page (as the 'third row') but we ask,
% for aesthetic reasons that you place these 'additional authors'
% in the \additional authors block, viz.
% Just remember to make sure that the TOTAL number of authors
% is the number that will appear on the first page PLUS the
% number that will appear in the \additionalauthors section.
\maketitle
\begin{abstract}
Entity-centric information processing requires complex pipelines involving both natural language processing and information retrieval components. In entity-centric stream filtering and ranking, the pipeline involves four important stages: filtering, classification, ranking (scoring), and evaluation. Filtering is an important step that creates a manageable working set of documents from a web-scale corpus for the subsequent stages, and it thus determines the performance of the overall system. Keeping the subsequent steps constant, we zoom in on the filtering stage and conduct an in-depth analysis of its main components: cleansing, entity profiles, relevance levels, document categories, and entity types, with a view to understanding the factors and choices that affect filtering performance. The study identifies the most effective entity profiles, identifies the relevant documents that defy filtering, and examines their contents manually. The paper classifies the ways unfilterable documents mention entities in text and estimates the practical upper bound of recall in entity-based filtering.
\end{abstract}
% A category with the (minimum) three required fields
\category{H.4}{Information Filtering}{Miscellaneous}
%A category including the fourth, optional field follows...
%\category{D.2.8}{Software Engineering}{Metrics}[complexity measures, performance measures]
\terms{Theory}
\keywords{Information Filtering; Cumulative Citation Recommendation; knowledge maintenance; Stream Filtering; emerging entities} % NOT required for Proceedings
\section{Introduction}
In 2012, the Text REtrieval Conference (TREC) introduced the Knowledge Base Acceleration (KBA) track to help knowledge base (KB) curators. The track addresses a critical need: given KB (Wikipedia or Twitter) entities, filter a stream for relevant documents, rank the retrieved documents, and recommend them to the KB curators. The track is timely because the number of entities in a KB on the one hand, and the huge amount of new content on the Web on the other, make manual KB maintenance challenging. TREC KBA's main task, Cumulative Citation Recommendation (CCR), aims at filtering a stream to identify citation-worthy documents, ranking them, and recommending them to KB curators.
Filtering is a crucial step in CCR: it selects a potentially relevant working set of documents for the subsequent steps of the pipeline out of a large collection of stream documents. The TREC Filtering track defines filtering as a ``system that sifts through stream of incoming information to find documents that are relevant to a set of user needs represented by profiles'' \cite{robertson2002trec}. Adaptive filtering, one task of the Filtering track, starts with a persistent user profile and a very small number of positive examples. The filtering step used in CCR systems fits under adaptive filtering: the profiles are represented by persistent KB (Wikipedia or Twitter) entities and there is a small set of relevance judgments representing positive examples.
TREC KBA 2013 participants applied filtering as a first step to produce a smaller working set for subsequent experiments. As the subsequent steps of the pipeline use the output of the filter, the final performance of the system depends on this step; in particular, it determines the recall of the overall system. However, all submitted systems suffered from poor recall \cite{frank2013stream}. The most important components of the filtering step are cleansing and entity profiling, and each involves choices. For example, there are two versions of the corpus: cleansed and raw. Different approaches used different entity profiles for filtering, ranging from KB entities' canonical names, to DBpedia name variants, to bold words in the first paragraph of the Wikipedia entities' profiles and anchor texts from other Wikipedia pages, to exact names and WordNet synonyms. Moreover, the type of entities (Wikipedia or Twitter) and the category of documents (news, blogs, tweets) can influence filtering.
A variety of approaches have been employed to solve the CCR challenge. Each participant reports the steps of their pipeline and the final results in comparison to other systems. A typical TREC KBA poster or talk explains the system pipeline and reports the final results. The systems may employ similar (even the same) steps, but the choices they make at every step usually differ. In such a situation, it becomes hard to identify the factors that result in improved performance, and there is a lack of insight across the different approaches. This makes it hard to know whether the improvement in performance of a particular approach is due to preprocessing, filtering, classification, scoring, or any other sub-component of the pipeline.
In this paper, we hold the subsequent steps of the pipeline fixed, zoom in on the filtering step, and conduct an in-depth analysis of its main components. In particular, we study cleansing, different entity profiles, the type of entities (Wikipedia or Twitter), and the document categories (social, news, etc.). The main contributions of the paper are an in-depth analysis of the factors that affect entity-based stream filtering, the identification of optimal entity profiles that do not compromise precision, a description and classification of relevant documents that are not amenable to filtering, and an estimate of the upper bound of recall for entity-based filtering.
The rest of the paper is organized as follows: we first describe the dataset and task and review related work, then present our method, report the experiments and results, and close with a discussion and conclusions.
\section{Data Description}
We use the TREC-KBA 2013 dataset\footnote{http://trec-kba.org/trec-kba-2013.shtml}. The dataset consists of a time-stamped stream corpus, a set of KB entities, and a set of relevance judgments.
\subsection{Stream corpus} The stream corpus comes in two versions: raw and cleansed. After xz-compression and GPG encryption, the raw and cleansed versions are 6.45TB and 4.5TB respectively. The raw data is a dump of raw HTML pages. The cleansed version is the raw data with HTML tags stripped and non-English documents removed. The stream corpus is organized in hourly folders, each of which contains many chunk files. Each chunk file contains between hundreds and hundreds of thousands of serialized thrift objects; one thrift object is one document. A document can be a blog article, a news article, or a social media post (including tweets). The stream corpus comes from three sources: TREC KBA 2012 (social, news and linking)\footnote{http://trec-kba.org/kba-stream-corpus-2013.shtml}, arxiv\footnote{http://arxiv.org/}, and spinn3r\footnote{http://spinn3r.com/}. Table \ref{tab:streams} shows the sub-streams with their numbers of documents and chunk files.
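For concreteness, the following is a minimal sketch of how such a corpus layout can be traversed, assuming GPG decryption has already been performed; the \texttt{parse\_chunk} helper is hypothetical and stands in for whichever thrift deserialization routine is used.
\begin{verbatim}
import lzma, os

def iter_stream_items(corpus_root, parse_chunk):
    # corpus_root contains one folder per hour; each folder holds
    # xz-compressed chunk files of serialized thrift documents.
    for hour_dir in sorted(os.listdir(corpus_root)):
        hour_path = os.path.join(corpus_root, hour_dir)
        for chunk_name in sorted(os.listdir(hour_path)):
            chunk_path = os.path.join(hour_path, chunk_name)
            with lzma.open(chunk_path, 'rb') as f:
                # parse_chunk (hypothetical) yields one document
                # per serialized thrift object in the chunk file.
                for doc in parse_chunk(f):
                    yield doc
\end{verbatim}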
\begin{table*}
\caption{Number of documents and chunk files per sub-stream}
\begin{center}
\begin{tabular}{l*{4}{l}l}
Documents & Chunk files & Sub-stream \\
\hline
126,952 &11,851 &arxiv \\
394,381,405 & 688,974 & social \\
134,933,117 & 280,658 & news \\
5,448,875 &12,946 &linking \\
57,391,714 &164,160 & MAINSTREAM\_NEWS (spinn3r)\\
36,559,578 &85,769 & FORUM (spinn3r)\\
14,755,278 &36,272 & CLASSIFIED (spinn3r)\\
52,412 &9,499 &REVIEW (spinn3r)\\
7,637 &5,168 &MEMETRACKER (spinn3r)\\
1,040,520,595 & 2,222,554 & Total\\
\end{tabular}
\end{center}
\label{tab:streams}
\end{table*}
\subsection{KB entities}
The KB entities consist of 121 Wikipedia entities and 20 Twitter entities, selected, on purpose, to be sparse. They include 71 people, 1 organization, and 24 facilities.
\subsection{Relevance judgments}
TREC-KBA provided relevance judgments for training and testing. Relevance judgments are given to document-entity pairs. Documents with citation-worthy content for a given entity are annotated as \emph{vital}, while documents with tangentially relevant content, or that lack freshness, or whose content is useful only for an initial KB dossier, are annotated as \emph{relevant}. Documents with no relevant content are labeled \emph{neutral}, and spam is labeled \emph{garbage}. The inter-annotator agreement on vital was 70\% in 2012 and 76\% in 2013, owing to the more refined definition of vital and the distinction made between vital and relevant.
\subsection{Stream Filtering}
Given a stream of documents consisting of news items, blogs and social media on the one hand and KB entities (Wikipedia, Twitter) on the other, we study the factors and choices that affect filtering performance. Specifically, we conduct an in-depth analysis of the cleansing step, the entity-profile construction, the document category of the stream items, and the type of entities (Wikipedia or Twitter). We also study the impact of these choices on classification performance. Finally, we conduct a manual examination of the relevant documents that defy filtering. We strive to answer the following research questions:
\begin{enumerate}
\item Does cleansing affect filtering and subsequent performance?
\item What is the most effective way of representing entity profiles?
\item Is filtering different for Wikipedia and Twitter entities?
\item Are some types of documents easily filterable and others not?
\item Does a gain in recall at the filtering step translate to a gain in F-measure at the end of the pipeline?
\item What are the vital (relevant) documents that are not retrievable by a system?
\item Are there vital (relevant) documents that are not filterable by a reasonable system?
\end{enumerate}
\subsection{Evaluation}
We report recall for the filtering step. For the end-to-end pipeline, we follow the TREC KBA evaluation and report F-measure and Scaled Utility (SU), the utility measure used in the TREC Filtering track \cite{robertson2002trec}.
\subsection{Literature Review}
There has been a great deal of recent interest in entity-based filtering and ranking, one manifestation of which is the introduction of TREC KBA in 2012. Following that, a number of studies have been carried out on the topic \cite{frank2012building, ceccarelli2013learning, taneva2013gem, wang2013bit, balog2013multi}. These works are based on the KBA 2012 task and dataset and address the whole problem of entity filtering and ranking. TREC KBA continued in 2013, but the task underwent some changes. The main changes between 2012 and 2013 are in the number of entities, the type of entities, the corpus, and the relevance ratings.
The number of entities increased from 29 to 141, and now includes 20 Twitter entities. The TREC KBA 2012 corpus is 1.9TB after xz-compression and contains 400M documents. By contrast, the KBA 2013 corpus is 6.45TB after xz-compression and GPG encryption; a version with all non-English documents removed is 4.5TB and consists of 1 billion documents. The 2013 corpus subsumes the 2012 corpus and adds content from spinn3r, namely mainstream news, forum, arxiv, classified, reviews and memetracker. A more important difference, however, is a change in the definition of the relevance ratings vital and relevant. While in KBA 2012 a document was judged vital if it had citation-worthy content for a given entity, in 2013 it must also be fresh, that is, the content must trigger an edit of the given entity's KB entry.
While the tasks of 2012 and 2013 are fundamentally the same, the approaches varied due to the size of the corpus. In 2013, all participants used filtering to reduce the size of the corpus, but in different ways: many used two or more name variants from DBpedia, such as labels, names, redirects, birth names, aliases, nicknames, same-as and alternative names \cite{wang2013bit,dietzumass,liu2013related, zhangpris}. Although most participants used DBpedia name variants, none of them used all of them. A few other participants used bold words in the first paragraph of the Wikipedia entity's profile and anchor texts from other Wikipedia pages \cite{bouvierfiltering, niauniversity}. One participant used a Boolean \emph{and} built from the tokens of the canonical names \cite{illiotrec2013}.
All of these studies used filtering as a first step to generate a smaller set of documents, and many systems suffered from poor recall that strongly affected their overall performance \cite{frank2012building}. Although systems used different entity profiles to filter the stream and achieved different performance levels, there is no study of the factors and choices that affect the filtering step itself. Filtering has, of course, been extensively examined in the TREC Filtering track \cite{robertson2002trec}. However, those studies were isolated in the sense that they were intended to optimize recall. The scenario here is different: documents carry relevance ratings, so we can study filtering in connection with relevance to the entities by coupling filtering to the later stages of the pipeline. To the best of our knowledge this is new, and the TREC KBA problem setting and datasets offer a good opportunity to examine this aspect of filtering.
Moreover, there has been no study at this scale of what types of documents defy filtering and why. In this paper, we conduct a manual examination of the documents that are missed and classify them into different categories. We also estimate the general upper bound of recall using the different entity profiles and choose the profile that results in the best overall performance as measured by F-measure.
\section{Method}
We work with the subset of stream corpus documents for which annotations exist. For this purpose, we extracted the annotated documents from the full corpus; all our experiments are based on this smaller subset. We experiment with all KB entities. For each KB entity, we extract different name variants from DBpedia and Twitter.
\subsection{Entity Profiling}
We build profiles for the KB entities of interest. There are two types of entities: Twitter and Wikipedia, both selected, on purpose, to be sparse and less documented. For the Twitter entities, we visit their respective Twitter pages and manually fetch their display names. For the Wikipedia entities, we fetch different name variants from DBpedia, namely name, label, birth name, alternative names, redirects, nickname, and alias. The extraction results are shown in Table \ref{tab:sources}.
\begin{table*}
\caption{Name variants retrieved from DBpedia: number of entities having each variant and total number of strings}
\begin{center}
\begin{tabular}{l*{4}{c}l}
Name variant & Entities & Strings \\
\hline
Name &82&82\\
Label &121 &121\\
Redirect &49&96 \\
Birth Name &6&6\\
Nickname &1 &1\\
Alias &1 &1\\
Alternative Names &4&4\\
\hline
\end{tabular}
\end{center}
\label{tab:sources}
\end{table*}
We have a total of 121 Wikipedia entities. Not every entity has a value for every name variant: every entity has a DBpedia label, but only 82 entities have a name string and only 49 have redirect strings. Most entities have only one redirect string, but some have 2, 3, 4 or 5; one entity, Buddy\_MacKay, has the highest number (12) of redirect strings. 6 entities have birth names, only 1 (Chuck Pankow) has a nickname, ``The Peacemaker'', only 1 entity has an alias, and only 4 have alternative names.
We combined the different name variants we extracted to form a set of strings for each KB entity. Specifically, we merged the names, labels, redirects, birth names, nicknames and alternative names of each entity. For Twitter entities, we used the display names that we collected. We consider the name of an entity that is part of its URL to be its canonical name; for example, in http://en.wikipedia.org/wiki/Benjamin\_Bronfman, Benjamin Bronfman is the canonical name of the entity. From the combined name variants and the canonical names, we created four profiles for each entity: canonical names (cano), partial names of canonical names (cano\_part), all name variants (all), and partial names of all name variants (all\_part).
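As an illustration, the construction of the four profiles from the collected strings can be sketched as follows; treating partial names as the whitespace-separated tokens of each name string is our assumption here.
\begin{verbatim}
def build_profiles(canonical_name, name_variants):
    # canonical_name: name taken from the entity URL,
    #                 e.g. "Benjamin Bronfman".
    # name_variants:  merged DBpedia strings (name, label, redirects,
    #                 birth name, nickname, alias, alternative names),
    #                 or the display name for a Twitter entity.
    cano = {canonical_name.lower()}
    cano_part = {tok for name in cano for tok in name.split()}
    all_names = cano | {v.lower() for v in name_variants}
    all_part = {tok for name in all_names for tok in name.split()}
    return {"cano": cano, "cano_part": cano_part,
            "all": all_names, "all_part": all_part}
\end{verbatim}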
\subsection{Annotation Corpus}
The annotation set combines the annotations from the Training Time Range (TTR) and the Evaluation Time Range (ETR) and consists of 68405 annotations. Its breakdown into training and test sets is shown in Table \ref{tab:breakdown}.
\begin{table}
\caption{Breakdown of vital and relevant annotations by entity type for the training and test sets}
\begin{center}
\begin{tabular}{l*{3}{c}r}
&&Vital&Relevant &Total \\
\hline
\multirow{3}{*}{Training} &Wikipedia & 1932 &2051& 3672\\
&Twitter&189 &314&488 \\
&All Entities&2121&2365&4160\\
\hline
\multirow{3}{*}{Testing}&Wikipedia &6139 &12375 &16160 \\
&Twitter&1261 &2684&3842 \\
&All Entities&7400 &12059&20002 \\
\hline
\multirow{3}{*}{Total} & Wikipedia &8071 &14426&19832 \\
&Twitter &1450 &2998&4330 \\
&All entities&9521 &17424&24162 \\
\hline
\end{tabular}
\end{center}
\label{tab:breakdown}
\end{table}
Most (more than 80\%) of the annotation documents are in the test set. Across the training and test data for 2013, there are 68405 annotations, of which 50688 are unique document-entity pairs. Of these 50688 pairs, 24162 are vital or relevant: 9521 vital and 17424 relevant.
\section{Experiments and Results}
We conducted experiments to study the effect of cleansing, different entity profiles, types of entities, categories of documents, and relevance ratings (vital or relevant), as well as the impact on classification. In the following subsections, we present and discuss the results.
\subsection{Cleansing: raw or cleansed}
\begin{table}
\caption{Recall (\%) of vital or relevant documents retrieved under different entity profiles (upper part: cleansed corpus, lower part: raw corpus)}
\begin{center}
\begin{tabular}{l@{\quad}lllllll}
\hline
\multicolumn{1}{l}{\rule{0pt}{12pt}}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano partial}&\multicolumn{1}{l}{\rule{0pt}{12pt}all}&\multicolumn{1}{l}{\rule{0pt}{12pt}all\_part}\\[5pt]
\hline
all-entities &51.0 &61.7 &66.2 &78.4 \\
Wikipedia &61.8 &74.8 &71.5 &77.9\\
twitter &1.9 &1.9 &41.7 &80.4\\
\hline
\hline
all-entities &59.0 &72.2 &79.8 &90.2\\
Wikipedia &70.0 &86.1 &82.4 &90.7\\
twitter & 8.7 &8.7 &67.9 &88.2\\
\hline
\end{tabular}
\end{center}
\label{tab:name}
\end{table}
The upper part of Table \ref{tab:name} shows the recall performance on the cleansed version and the lower part on the raw version. Recall increases substantially for all entity types in the raw version: by 8.2 to 12.8 percentage points for Wikipedia entities, by 6.8 to 26.2 for Twitter entities, and by 8.0 to 13.6 for all entities. These increases are substantial; to put them into perspective, an 11.8-point increase in recall on all entities corresponds to retrieving 2864 more unique document-entity pairs. This suggests that cleansing has removed documents that could otherwise be retrieved.
\subsection{Entity Profiles}
Looking at the recall performance on the raw corpus, filtering documents by canonical names achieves a recall of 59\%. Adding the other name variants improves the recall to 79.8\%, an increase of 20.8 percentage points; that is, 20.8\% of the documents mention the entities by names other than their canonical names. Partial names of canonical names achieve a recall of 72.2\% and partial names of all name variants achieve 90.2\%, which means that 18\% of the documents mention the entities by partial names of non-canonical name variants.
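The recall figures reported here and below can be computed with a simple containment check of profile strings against document text; a minimal sketch, assuming a document-entity pair counts as retrieved when any profile string occurs as a substring of the document text:
\begin{verbatim}
def filter_recall(annotations, doc_text, profiles, profile_key):
    # annotations: iterable of (doc_id, entity_id) pairs judged
    #              vital or relevant.
    # doc_text:    dict mapping doc_id -> document text (raw or cleansed).
    # profiles:    dict mapping entity_id -> its four profile sets.
    matched, total = 0, 0
    for doc_id, entity_id in annotations:
        total += 1
        text = doc_text[doc_id].lower()
        if any(name in text for name in profiles[entity_id][profile_key]):
            matched += 1
    return matched / total if total else 0.0
\end{verbatim}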
\subsection{Breakdown of results by document source category}
\begin{table*}
\caption{Recall deltas (percentage points) between entity profiles, broken down by entity type, relevance rating and document category}
\begin{center}\begin{tabular}{l*{9}{c}r}
&& \multicolumn{3}{ c| }{All entities} & \multicolumn{3}{ c| }{Wikipedia} &\multicolumn{3}{ c| }{Twitter} \\
& &Others&news&social & Others&news&social & Others&news&social \\
\hline
\multirow{4}{*}{Vital} &cano &82.2 &65.6 &70.9 &90.9 &80.1 &76.8 &8.1 &6.3 & 30.5\\
&cano\_part - cano &8.2 &14.9 &12.3 &9.1 &18.6 &14.1 &0 &0 &0 \\
&all - cano &12.6 &19.7 &12.3 &5.5 &15.8 &8.4 &73 &35.9 &38.3 \\
&all\_part - cano\_part&9.7 &18.7 &12.7 &0 &0.5 &5.1 &93.2 & 93 &64.4 \\
&all\_part - all &5.4 &13.9 &12.7 &3.6 &3.3 &10.8 &20.3 &57.1 &26.1 \\
\hline
\multirow{4}{*}{Relevant} &cano &84.2 &53.4 &55.6 &88.4 &75.6 &63.2 &10.6 &2.2 & 6\\
&cano\_part - cano &10.5 &15.1 &12.2 &11.1 &21.7 &14.1 &0 &0 &0 \\
&all - cano &11.7 &36.6 &17.3 &9.2 &19.5 &9.9 &54.5 &76.3 &66 \\
&all\_part - cano\_part &4.2 &26.9 &15.8 &0.2 &0.7 &6.7 &72.2 &87.6 &75 \\
&all\_part - all &3 &5.4 &10.7 &2.1 &2.9 &11 &18.2 &11.3 &9 \\
\hline
\multirow{4}{*}{total} &cano & 81.1 &56.5 &58.2 &87.7 &76.3 &65.6 &9.8 &1.4 &13.5\\
&cano\_part - cano &10.9 &15.5 &12.4 &11.9 &21.3 &14.4 &0 &0 &0\\
&all - cano &13.8 &30.6 &16.9 &9.1 &18.9 &10.2 &63.6 &61.8 &57.5 \\
&all\_part - cano\_part &7.2 &24.8 &15.9 &0.1 &0.7 &6.8 &82.2 &89.1 &71.3\\
&all\_part - all &4.3 &9.7 &11.4 &3.0 &3.1 &11.0 &18.9 &27.3 &13.8\\
\hline
\end{tabular}
\end{center}
\label{tab:source-delta2}
\end{table*}
\begin{table*}
\caption{Recall (\%) of the four entity profiles, broken down by entity type, relevance rating and document category}
\begin{center}\begin{tabular}{l*{9}{c}r}
&& \multicolumn{3}{ c| }{All entities} & \multicolumn{3}{ c| }{Wikipedia} &\multicolumn{3}{ c| }{Twitter} \\
& &Others&news&social & Others&news&social & Others&news&social \\
\hline
\multirow{4}{*}{Vital} &cano &82.2& 65.6& 70.9& 90.9& 80.1& 76.8& 8.1& 6.3& 30.5\\
&cano part & 90.4& 80.6& 83.1& 100.0& 98.7& 90.9& 8.1& 6.3& 30.5\\
&all & 94.8& 85.4& 83.1& 96.4& 95.9& 85.2& 81.1& 42.2& 68.8\\
&all part &100& 99.2& 95.9& 100.0& 99.2& 96.0& 100& 99.3& 94.9\\
\hline
\multirow{4}{*}{Relevant} &cano & 84.2& 53.4& 55.6& 88.4& 75.6& 63.2& 10.6& 2.2& 6.0\\
&cano part &94.7& 68.5& 67.8& 99.6& 97.3& 77.3& 10.6& 2.2& 6.0\\
&all & 95.8& 90.1& 72.9& 97.6& 95.1& 73.1& 65.2& 78.4& 72.0\\
&all part &98.8& 95.5& 83.7& 99.7& 98.0& 84.1& 83.3& 89.7& 81.0\\
\hline
\multirow{4}{*}{total} &cano & 81.1& 56.5& 58.2& 87.7& 76.4& 65.7& 9.8& 3.6& 13.5\\
&cano part &92.0& 72.0& 70.6& 99.6& 97.7& 80.1& 9.8& 3.6& 13.5\\
&all & 94.8& 87.1& 75.2& 96.8& 95.3& 75.8& 73.5& 65.4& 71.1\\
&all part & 99.2& 96.8& 86.6& 99.8& 98.4& 86.8& 92.4& 92.7& 84.9\\
\hline
\end{tabular}
\end{center}
\label{tab:source-delta}
\end{table*}
We break down the results of the different entity profiles on the raw corpus by source category and relevance rating (vital or relevant). In total, there are 24162 vital or relevant unique entity-document pairs, 9521 vital and 17424 relevant. These documents fall into 8 source categories: 0.98\% arxiv, 0.034\% classified, 0.34\% forum, 5.65\% linking, 11.53\% mainstream-news, 18.40\% news, 12.93\% social and 50.2\% weblog.
We regroup the 8 source categories into three, for two reasons. First, some categories are very similar: mainstream-news and news are similar, existing separately only because they were collected from different sources, by different groups, and at different times; we refer to them as news from now on. The same holds for weblog and social, which we refer to as social. Second, some categories have so few annotations that treating them independently makes little sense. The majority of vital or relevant annotations are social (social plus weblog, 63.13\%); news (mainstream-news plus news) makes up 30\%. Thus, news and social together account for about 93\% of all annotations. The remaining categories make up about 7\% and are grouped as others.
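The regrouping can be expressed as a simple mapping from the eight source labels to the three groups used in the rest of the paper (a sketch; the exact label strings used in the corpus are assumptions):
\begin{verbatim}
CATEGORY_MAP = {
    "mainstream_news": "news",     "news":       "news",
    "weblog":          "social",   "social":     "social",
    "arxiv":           "others",   "classified": "others",
    "forum":           "others",   "linking":    "others",
}
\end{verbatim}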
The results broken down by document category are presented in the multi-dimensional Table \ref{tab:source-delta}. There are three outer column groups: all entities, Wikipedia, and Twitter. Each outer group consists of the document categories others, news, and social. The rows are grouped into vital, relevant, and total, each with the four entity profiles.
\subsection{ Relevance Rating: Vital and relevant}
Comparing the recall performance on vital and relevant documents, we observe that canonical names perform better on vital than on relevant documents. This is especially true for Wikipedia entities: for example, the recall on vital documents is 80.1\% for news and 76.8\% for social, while the corresponding recall on relevant documents is 75.6\% and 63.2\% respectively. Generally, recall on vital documents is higher than on relevant documents, suggesting that vital documents are more likely to mention the entities and, when they do, to use their common name variants.
% \subsection{Difference by document categories}
%
% Generally, there is greater variation in relevant rank than in vital. This is specially true in most of the Delta's for Wikipedia. This maybe be explained by news items referring to vital documents by a some standard name than documents that are relevant. Twitter entities show greater deltas than Wikipedia entities in both vital and relevant. The greater variation can be explained by the fact that the canonical name of Twitter entities retrieves very few documents. The deltas that involve canonical names of Twitter entities, thus, show greater deltas.
%
% If we look in recall performances, In Wikipedia entities, the order seems to be others, news and social. This means that others achieve a higher recall than news than social. However, in Twitter entities, it does not show such a strict pattern. In all, entities also, we also see almost the same pattern of other, news and social.
\subsection{Recall across document categories (others, news and social)}
The recall for Wikipedia entities in Table \ref{tab:name} ranges from 61.8\% (canonical names) to 77.9\% (partial names of all name variants). We looked at how this recall is distributed across the three document categories. In the Wikipedia columns of Table \ref{tab:source-delta}, we see, across all entity profiles, that others achieve the highest recall, followed by news; social documents achieve the lowest recall. While the news recall ranges from 76.4\% to 98.4\%, the recall for social documents ranges from 65.7\% to 86.8\%. This pattern (others above news above social) holds across all name variants for Wikipedia entities. Note that the others category comprises arxiv (scientific documents), classifieds, forums and linking.
For Twitter entities, however, the pattern is different. With canonical names (and their partials), social documents achieve higher recall than news, suggesting that social documents refer to Twitter entities by their canonical names (user names) more often than news does. With partial names of all name variants, news achieves better results than social. The difference in recall between canonical names and partial names of all name variants shows that news does not refer to Twitter entities by their user names, but by their display names.
Overall, across all entity types and all entity profiles, others achieve better recall than news, and news, in turn, achieves higher recall than social documents. This suggests that social documents are the hardest to retrieve, which makes sense since social posts are short and are more likely to point to other resources or to use short, informal names.
We computed four percentage-point increases in recall (deltas) between the different entity profiles (Table \ref{tab:source-delta2}). The first delta is between partial names of canonical names and canonical names. The second is between all name variants and canonical names. The third is between partial names of all name variants and partial names of canonical names, and the fourth between partial names of all name variants and all name variants. We believe these four deltas have a clear interpretation: the delta between all name variants and canonical names shows the percentage of documents that the additional name variants retrieve but the canonical name does not. Similarly, the delta between partial names of all name variants and partial names of canonical names shows the percentage of document-entity pairs that can be gained by the partial names of the name variants.
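Concretely, the four deltas are recall differences between pairs of profiles; a minimal sketch, taking the recall values (in percent) of the four profiles as input:
\begin{verbatim}
def profile_deltas(recall):
    # recall: dict with entries "cano", "cano_part", "all", "all_part".
    return {
        "cano_part - cano":     recall["cano_part"] - recall["cano"],
        "all - cano":           recall["all"]       - recall["cano"],
        "all_part - cano_part": recall["all_part"]  - recall["cano_part"],
        "all_part - all":       recall["all_part"]  - recall["all"],
    }
\end{verbatim}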
In most of the deltas, news, followed by social, followed by others, shows the greatest differences. This suggests that news refers to entities by different names rather than by a single standard name, which is counter-intuitive: one would expect news to mention entities by consistent names, thereby reducing the deltas. For Wikipedia entities, the deltas between canonical partial names and canonical names, and between all name variants and canonical names, are high, suggesting that partial names and the other name variants bring in new documents that cannot be retrieved by canonical names alone. The remaining two deltas are very small, suggesting that partial names of all name variants bring in few new relevant documents. For Twitter entities, the name variants do bring in new documents.
% The biggest delta observed is in Twitter entities between partials of all name variants and partials of canonicals (93\%). delta. Both of them are for news category. For Wikipedia entities, the highest delta observed is 19.5\% in cano\_part - cano followed by 17.5\% in all\_part in relevant.
\subsection{Entity Types: Wikipedia and Twitter}
Table \ref{tab:name} shows the difference between Wikipedia and Twitter entities. Wikipedia entities' canonical names achieve a recall of 70\%, and partial names of canonical names achieve a recall of 86.1\%, an increase of 16.1 percentage points. By contrast, the increase of partial names of all name variants over all name variants is only 8.3 points. The large increase when moving from canonical names to their partial names, compared to the smaller increase when moving from all name variants to their partial names, can be explained by saturation: documents have already been retrieved by the different name variants, so using their partial names does not bring in many new relevant documents. One interesting observation is that, for Wikipedia entities, partial names of canonical names achieve better recall than all name variants. This holds in both the cleansed and the raw extractions. %In the raw extraction, the difference is about 3.7.
For Twitter entities, the picture is different. Canonical names and their partial names perform the same, and recall is very low. They are identical because Twitter user names are single words; for example, in https://twitter.com/roryscovel, ``roryscovel'' is the canonical name and its partial name is the same. Recall is low because the canonical names of Twitter entities are not really names: they are usually arbitrarily created user names. Documents do not refer to Twitter entities by their user names; they refer to them by their display names, which is reflected in the recall of all name variants (67.9\%). Using partial names of the display names increases recall to 88.2\%.
At the aggregate level (both Twitter and Wikipedia entities), we observe two important patterns. 1) Recall increases as we move from canonical names to canonical partial names, to all name variants, and to partial names of all name variants. Since this is not the case for Wikipedia entities, the influence comes from the Twitter entities. 2) Canonical names retrieve the fewest vital or relevant documents and partial names of all name variants retrieve the most. The difference is 31.2 percentage points on all entities, 20.7 on Wikipedia entities, and 79.5 on Twitter entities, a substantial difference in performance.
Tables \ref{tab:name} and \ref{tab:source-delta} show that recall for Wikipedia entities is higher than for Twitter entities, indicating that Wikipedia entities are easier to match in documents. This can be due to two reasons: 1) Wikipedia entities are relatively well described compared to Twitter entities; the fact that we can retrieve different name variants from DBpedia is a measure of that richer description, whereas for Twitter entities we have only two names, the user name and the display name collected from their Twitter pages. 2) Twitter entities lack a rich external profile such as DBpedia from which alternative names can be collected. The results also show that Twitter entities are mentioned by their display names more than by their user names. However, social documents mention Twitter entities by their user names more than news does, suggesting a difference in convention between news and social documents.
\subsection{Impact on classification}
In the overall experimental setup, classification, ranking, and evaluation are kept constant. Here, we present results showing how the choices of corpus, entity type, and entity profile impact these later stages of the pipeline. Tables \ref{tab:class-vital} and \ref{tab:class-vital-relevant} report performance in F-measure and Scaled Utility (SU).
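For reference, the F-measure reported below is the balanced harmonic mean of precision $P$ and recall $R$,
\[
F = \frac{2PR}{P+R},
\]
computed over the vital (respectively vital-relevant) judgments.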
\begin{table*}
\caption{Vital classification performance (F and SU) under different entity profiles (upper part: cleansed, lower part: raw)}
\begin{center}
\begin{tabular}{ll@{\quad}lllllll}
\hline
&\multicolumn{1}{l}{\rule{0pt}{12pt}}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano partial}&\multicolumn{1}{l}{\rule{0pt}{12pt}all}&\multicolumn{1}{l}{\rule{0pt}{12pt}all\_part}\\[5pt]
all-entities &F& 0.241&0.261&0.259&0.265\\
&SU&0.259 &0.258 &0.263 &0.262 \\
Wikipedia &F&0.252&0.274& 0.265&0.271\\
&SU& 0.261& 0.259& 0.265&0.264 \\
twitter &F&0.105&0.105&0.218&0.228\\
&SU &0.105&0.250&0.254&0.253\\
\hline
\hline
all-entities &F & 0.240 &0.272 &0.250&0.251\\
&SU& 0.258 &0.151 &0.264 &0.258\\
Wikipedia&F &0.257&0.257&0.257&0.255\\
&SU & 0.265&0.265 &0.266 & 0.259\\
twitter&F &0.188&0.188&0.208&0.231\\
&SU& 0.269 &0.250 &0.250&0.253\\
\hline
\end{tabular}
\end{center}
\label{tab:class-vital}
\end{table*}
\begin{table*}
\caption{Vital-relevant classification performance (F and SU) under different entity profiles (upper part: cleansed, lower part: raw)}
\begin{center}
\begin{tabular}{ll@{\quad}lllllll}
\hline
&\multicolumn{1}{l}{\rule{0pt}{12pt}}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano partial}&\multicolumn{1}{l}{\rule{0pt}{12pt}all}&\multicolumn{1}{l}{\rule{0pt}{12pt}all\_part}\\[5pt]
all-entities &F& 0.497&0.560&0.579&0.607\\
&SU&0.468 &0.484 &0.483 &0.492 \\
Wikipedia &F&0.546&0.618&0.599&0.617\\
&SU&0.494 &0.513 &0.498 &0.508 \\
twitter &F&0.142&0.142& 0.458&0.542\\
&SU &0.317&0.328&0.392&0.392\\
\hline
\hline
all-entities &F& 0.509 &0.594 &0.590&0.612\\
&SU &0.459 &0.502 &0.478 &0.488\\
Wikipedia &F&0.550&0.617&0.605&0.618\\
&SU & 0.483&0.498 &0.487 & 0.495\\
twitter &F&0.210&0.210&0.499&0.580\\
&SU& 0.319 &0.317 &0.421&0.446\\
\hline
\end{tabular}
\end{center}
\label{tab:class-vital-relevant}
\end{table*}
On Wikipedia entities, except for the canonical profile, the cleansed version achieves better results than the raw version. On Twitter entities, the raw corpus achieves better results in all profiles except partial names of all name variants. Over all entities (both Wikipedia and Twitter), the cleansed version performs better in three profiles; only for canonical partial names does raw perform better. This result is interesting because we saw in previous sections that the raw corpus achieves higher recall; in the case of partial names of name variants, for example, 10\% more relevant documents are retrieved. This suggests that a gain in recall does not necessarily translate into a gain in F-measure. One explanation is that the raw corpus brings in many false positives from, among other things, related links and adverts.
For Wikipedia entities, canonical partial names achieve the highest performance; for Twitter entities, the partial names of all name variants do. On vital-relevant, raw achieves better results in three cases (all except canonical partials), and for Twitter entities the raw corpus is consistently better. In terms of entity profiles, Wikipedia's canonical partial names achieve the best F-score; for Twitter, partial names of all name variants do.
The raw corpus thus appears to matter more for Twitter entities, and an increase in recall does not necessarily mean an increase in F-measure. The fact that canonical partial names achieve the best results is interesting: partial names were used as a baseline in TREC KBA 2012, but none of the KBA participants actually used partial names for filtering.
\subsection{Missing relevant documents \label{miss}}
There is a trade-off between using a richer entity profile and retrieving irrelevant documents: the richer the profile, the more relevant documents it retrieves, but also the more irrelevant ones. To put this into perspective, let us compare the numbers of documents retrieved by partial names of all name variants and by partial names of canonical names. On the raw corpus, partial names of canonical names extract a total of 2547487 documents and achieve a recall of 72.2\%. By contrast, partial names of all name variants extract a total of 4735318 documents and achieve a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18 percentage points; the remaining 67.9\% are newly introduced irrelevant documents. There is an advantage in excluding irrelevant documents from filtering because they confuse the later stages of the pipeline.
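Concretely, the two figures follow from the totals above:
\[
\frac{4{,}735{,}318 - 2{,}547{,}487}{2{,}547{,}487} \approx 85.9\%,
\qquad
90.2\% - 72.2\% = 18.0\ \text{points}.
\]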
Using partial names of all name variants for filtering is an aggressive attempt to retrieve as many relevant documents as possible at the cost of retrieving irrelevant ones. Even so, we still miss about 2363 (10\%) of the vital-relevant documents. Why are these documents missed? If they do not mention the entities by partial names of name variants, how do they refer to them? Table \ref{tab:miss} shows the documents that we miss in the cleansed and raw corpora: the upper part shows the numbers of documents missing from the cleansed and raw versions, and the lower part shows the intersections and exclusions of the two.
\begin{table}
\caption{The number of documents that are missing from raw and cleansed extractions. }
\begin{center}
\begin{tabular}{l@{\quad}llllll}
\hline
\multicolumn{1}{l}{\rule{0pt}{12pt}category}&\multicolumn{1}{l}{\rule{0pt}{12pt}Vital }&\multicolumn{1}{l}{\rule{0pt}{12pt}Relevant }&\multicolumn{1}{l}{\rule{0pt}{12pt}Total }\\[5pt]
\hline
Cleansed &1284 & 1079 & 2363 \\
Raw & 276 & 4951 & 5227 \\
\hline
missing only from cleansed &1065&2016&3081\\
missing only from raw &57 &160 &217 \\
Missing from both &219 &1927&2146\\
\hline
\end{tabular}
\end{center}
\label{tab:miss}
\end{table}
It is natural to assume that the document-entity pairs extracted from the cleansed corpus are a subset of those extracted from the raw corpus. We find that this is not the case. There are 217 unique entity-document pairs that are retrieved from the cleansed corpus but not from the raw one; 57 of them are vital. Similarly, there are 3081 document-entity pairs that are missing from the cleansed corpus but present in the raw one; 1065 of them are vital. Examining the documents reveals that the cause is missing text in the corresponding version of the document. All the documents that we miss from the raw corpus are social, specifically from the social category (not weblogs): tweets and posts from other social media. To meet the format of the raw data, some of them must have been converted after collection and lost part or all of their content on the way. The same holds for the documents that we miss from the cleansed corpus: part of the content is lost in conversion. In both cases, the mention of the entity happens to be in the part of the text that was cut out during conversion.
The most interesting set of relevance judgments are those that we miss from both the raw and cleansed extractions: 2146 unique document-entity pairs, 219 of them with vital relevance judgments. The missed vital annotations involve 28 Wikipedia and 7 Twitter entities, 35 in total. The document categories show that the great majority (86.7\%) of these documents are social, suggesting that social documents (tweets and blogs) can talk about an entity without mentioning it by name. This is, of course, in line with intuition.
Vital documents show higher recall than relevant documents. This is not surprising, as vital documents are more likely to mention the entities than relevant ones. Across document categories, we observe a recall ordering of others, followed by news, followed by social: social documents are the hardest to retrieve. This can be explained by the fact that social documents (tweets, blogs) are more likely to point to a resource without mentioning the entity, whereas news documents mention the entities they talk about.
%
% \begin{table*}
% \caption{Breakdown of missing documents by sources for cleansed, raw and cleansed-and-raw}
% \begin{center}\begin{tabular}{l*{9}r}
% &others&news&social \\
% \hline
%
% &missing from raw only & 0 &0 &217 \\
% &missing from cleansed only &430 &1321 &1341 \\
%
% &missing from both &19 &317 &2196 \\
%
%
%
% \hline
% \end{tabular}
% \end{center}
% \label{tab:miss-category}
% \end{table*}
It is, however, interesting to look into the actual content of the documents to gain insight into the ways a document can talk about an entity without mentioning it. We collected 35 documents, one for each entity, for manual examination. We present the reasons below.
\paragraph{Outgoing link mentions} A post (tweet) contains an outgoing link to a page that mentions the entity.
\paragraph{Event place - event} A document that talks about an event is vital to the location entity where the event takes place. For example, the Maha Music Festival takes place at Lewis and Clark Landing, and a document about the festival is vital for the park. There are also cases where an event's address places it in a park, making the document vital to the park; this amounts to being mentioned by an address that belongs to a larger space.
\paragraph{Entity - related entity} A document about one important figure, such as an artist or athlete, can be vital to another, especially if the two are contending for the same title, or one has snatched a title or award from the other.
\paragraph{Organization - main activity} A document about the area in which a company is active can be vital for that organization. For example, Atacocha is a mining company, and a news item on mining waste was annotated vital for it.
\paragraph{Entity - group} If an entity belongs to a certain group (class), a news item about the group can be vital for its individual members. FrankandOak is named an innovative company, and a news item about the group of innovative companies is relevant for it. Another example is a big event to which an entity is related, such as film awards for actors.
\paragraph{Artist - work} Documents that discuss the work of an artist can be relevant to the artist, for instance a book or film being vital for its author, director or actor. Robocop is a film whose screenplay is by Joshua Zetumer; a blog post about the film was judged vital for Joshua Zetumer.
\paragraph{Politician - constituency} A major political event in a certain constituency is vital for the politician from that constituency. A good example is a weblog about two North Dakota counties being declared drought disasters, which is vital for Joshua Boschee, a politician and member of the North Dakota Democratic party.
\paragraph{Head - organization} A document about an organization of which the entity is the head can be vital for the entity. Jasper\_Schneider is the USDA Rural Development state director for North Dakota, and an article about the problems of primary health centers in North Dakota was judged vital for him.
\paragraph{World knowledge} Some connections are impossible to make without world knowledge. For example, ``refreshments, treats, gift shop specials, `bountiful, fresh and fabulous holiday decor,' a demonstration of simple ways to create unique holiday arrangements for any home; free and open to the public'' is judged relevant to Hjemkomst\_Center. This is a social media post, and unless one knows the person posting it, there is no way to tell from the text alone. Similarly, ``learn about the gray wolf's hunting and feeding behaviors and watch the wolves have their evening meal of a full deer carcass; \$15 for members, \$20 for nonmembers'' is judged vital to Red\_River\_Zoo.
\paragraph{No document content} Some documents were found to have no content at all.
\paragraph{Not clear why} For some documents, it is not clear why they were annotated vital for the given entities.
% To gain more insight, I sampled for each 35 entities, one document-entity pair and looked into the contents. The results are in \ref{tab:miss from both}
%
% \begin{table*}
% \caption{Missing documents and their mentions }
% \begin{center}
%
% \begin{tabular}{l*{4}{l}l}
% &entity&mentioned by &remark \\
% \hline
% Jeremy McKinnon & Jeremy McKinnon& social, mentioned in read more link\\
% Blair Thoreson & & social, There is no mention by name, the article talks about a subject that is political (credit rating), not apparent to me\\
% Lewis and Clark Landing&&Normally, maha music festival does not mention ,but it was held there \\
% Cementos Lima &&It appears a mistake to label it vital. the article talks about insurance and centos lima is a cement company.entity-deleted from wiki\\
% Corn Belt Power Cooperative & &No content at all\\
% Marion Technical Institute&&the text could be of any place. talks about a place whose name is not mentioned.
% roryscovel & &Talks about a video hinting that he might have seen in the venue\\
% Jim Poolman && talks of party convention, of which he is member politician\\
% Atacocha && No mention by name The article talks about waste from mining and Anacocha is a mining company.\\
% Joey Mantia & & a mention of a another speeedskater\\
% Derrick Alston&&Text swedish, no mention.\\
% Paul Johnsgard&& not immediately clear why \\
% GandBcoffee&& not immediately visible why\\
% Bob Bert && talks about a related media and entertainment\\
% FrankandOak&& an article that talks about a the realease of the most innovative companies of which FrankandOak is one. \\
% KentGuinn4Mayor && a theft in a constituency where KentGuinn4Mayor is vying.\\
% Hjemkomst Center && event announcement without mentioning where. it takes a a knowledge of \\
% BlossomCoffee && No content\\
% Scotiabank Per\%25C3\%25BA && no content\\
% Drew Wrigley && politics and talk of oilof his state\\
% Joshua Zetumer && mentioned by his film\\
% Théo Mercier && No content\\
% Fargo Air Museum && No idea why\\
% Stevens Cooperative School && no content\\
% Joshua Boschee && No content\\
% Paul Marquart && No idea why\\
% Haven Denney && article on skating competition\\
% Red River Zoo && animal show in the zoo, not indicated by name\\
% RonFunches && talsk about commedy, but not clear whyit is central\\
% DeAnne Smith && No mention, talks related and there are links\\
% Richard Edlund && talks an ward ceemony in his field \\
% Jennifer Baumgardner && no idea why\\
% Jeff Tamarkin && not clear why\\
% Jasper Schneider &&no mention, talks about rural development of which he is a director \\
% urbren00 && No content\\
% \hline
% \end{tabular}
% \end{center}
% \label{tab:miss from both}
% \end{table*}
We also observed that, although documents have different document ids, several of them have the same content. In the vital annotations, there are only three (88 news and 409 weblog). Of the 35 vital document-entity pairs we examined, 13 are news and 22 are social.
\section{Analysis and Discussion}
We conducted experiments to study the impact on recall of the different components of the filtering step of an entity-based filtering and ranking pipeline. Specifically, we studied the impact of cleansing, entity profiles, relevance ratings and document categories, as well as the documents that are missed, and we measured the impact of these choices on classification performance.
Experimental results on the TREC-KBA problem setting and dataset show that cleansing removes all or parts of some documents' contents, making them impossible to retrieve even though they can be retrieved from the raw version. Using the raw corpus brings in documents that cannot be retrieved from the cleansed corpus; this holds for all entity profiles and all entity types. The recall increase is between 6.8 and 26.2 percentage points, which in actual document-entity pairs amounts to thousands.
The choice of entity profile also makes a big difference in recall. Except for Wikipedia entities, where partial names of canonical names achieve better recall than all name variants, there is a steady increase in recall from canonical names to partial canonical names, to all name variants, to partial names of all name variants. The difference between partial names of all name variants and canonical names is 31.2 percentage points, and between partial names of all name variants and partial names of canonical names it is 18.0 points.
Does this increase in recall, as we move from a leaner profile to a richer one, translate into an increase in classification performance? The results show that it does not. In most profiles, for both Wikipedia and all entities, the cleansed version performs better than the raw version; for Twitter entities, the raw corpus performs better except with all name variants. However, the differences in performance are so small that they can be ignored. The highest performance for Wikipedia entities is achieved with partial names of canonical names, rather than with partial names of all name variants, which retrieve 18.0\% more documents.
For vital plus relevant, however, the raw corpus performs better except with partial canonical names. In all cases, Wikipedia's canonical partial names achieve better performance than any other profile. This is interesting because retrieving thousands of additional document-entity pairs did not translate into an increase in classification performance. One reason why an increase in recall does not translate into an increase in F-measure is the retrieval of many false positives, which confuse the classifier. A good profile for Wikipedia entities thus seems to be canonical partial names, suggesting that there is actually no need to fetch different name variants; for Twitter entities, partial names of their display names make good profiles.
\section{Conclusions}
In this paper, we examined the filtering step of the CCR pipeline while holding the classification stage constant. In particular, we studied the cleansing step, different entity profiles, the type of entities (Wikipedia or Twitter), the categories of documents (news, social, or others) and the vital (relevant) documents. While doing so, we attempted to answer the following research questions:
\begin{enumerate}
\item Does cleansing affect filtering and subsequent performance?
\item What is the most effective way of representing entity profiles?
\item Is filtering different for Wikipedia and Twitter entities?
\item Are some types of documents easily filterable and others not?
\item Does a gain in recall at the filtering step translate to a gain in F-measure at the end of the pipeline?
\item What are the vital (relevant) documents that are not retrievable by a system?
\end{enumerate}
While the use of the raw corpus and some entity profiles increases recall substantially, it does not bring a considerable improvement in classification performance. The absence of an improvement for some entity profiles can be explained by the false positives they bring in, which confuse the system; the absence of a considerable improvement with the raw corpus is harder to explain.
Although Wikipedia and Twitter entities are similar in the sense that both are sparse, our results warrant treating them differently. We have seen, for example, that the best profile representation for Wikipedia entities is partial names of canonical names, while for Twitter entities it is partial names of all name variants. Moreover, Twitter entities achieve better results with the raw corpus, while Wikipedia entities do so with the cleansed corpus.
The category of documents also has an impact on performance. News items show greater variation in performance between the different entity profiles, indicating that news uses a wider, less uniform way of referring to entities than social documents do. This is counter-intuitive, as one would normally expect news to be more standardized than social documents (blogs and tweets).
As we move from lean to very rich entity profiles, we observe that there are documents that we still miss no matter what. While some could be retrieved with some modifications, others cannot: no matter how rich a representation of the entities is used, a system will not capture them. These are documents that refer to the entity in one of the following ways: in a page linked to by a ``read more'' link, by venue-event, by world knowledge, by party-politician, by company-related event, by entity-related entity, or by an artist's work. We also found that some documents have no content at all, and for another group of documents it is not clear why they are relevant.
%ACKNOWLEDGMENTS are optional
%\section{Acknowledgments}
%
% The following two commands are all you need in the
% initial runs of your .tex file to
% produce the bibliography for the citations in your paper.
\bibliographystyle{abbrv}
\bibliography{sigproc} % sigproc.bib is the name of the Bibliography in this case
% You must have a proper ".bib" file
% and remember to run:
% latex bibtex latex latex
% to resolve all references
%
% ACM needs 'a single self-contained file'!
%
%APPENDICES are optional
%\balancecolumns
\end{document}
|