Changeset - ef04fe3de859
Gebrekirstos Gebremeskel - 11 years ago 2014-06-11 19:59:31
destinycome@gmail.com
updated
2 files changed with 145 insertions and 154 deletions:
mypaper-final.tex
 
@@ -141,10 +141,10 @@ documents (news, blog, tweets) can influence filtering.
 
 
 \section{Data Description}
 
We use the TREC-KBA 2013 dataset\footnote{http://trec-kba.org/trec-kba-2013.shtml}. The dataset consists of a time-stamped stream corpus, a set of KB entities, and a set of relevance judgments.
 
\subsection{Stream corpus} The stream corpus comes in two versions: raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB respectively, after xz-compression and GPG encryption. The raw data is a dump of raw HTML pages. The cleansed version is derived from the raw data by stripping HTML tags and keeping only the English documents, identified with the Chromium Compact Language Detector\footnote{https://code.google.com/p/chromium-compact-language-detector/}. The stream corpus is organized into hourly folders, each of which contains many chunk files. Each chunk file contains between hundreds and hundreds of thousands of serialized thrift objects; one thrift object is one document. A document can be a blog article, a news article, or a social media post (including tweets). The stream corpus comes from three sources: TREC KBA 2012 (social, news and linking)\footnote{http://trec-kba.org/kba-stream-corpus-2013.shtml}, arxiv\footnote{http://arxiv.org/}, and spinn3r\footnote{http://spinn3r.com/}. Table \ref{tab:streams} shows the sources, the number of hourly directories, and the number of chunk files.
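To make the cleansing step concrete, the following is a minimal sketch of how such a step can be implemented. It is written in Python and assumes the pycld2 binding as a stand-in for the Chromium Compact Language Detector, together with a crude regular-expression tag stripper; the actual pipeline used to produce the cleansed corpus may differ.

\begin{verbatim}
import re
import pycld2 as cld2  # assumed stand-in binding for the Chromium CLD

TAG_RE = re.compile(r"<[^>]+>")  # crude HTML tag stripper, for illustration only

def cleanse(raw_html):
    """Return tag-stripped text if the document is English, else None."""
    text = TAG_RE.sub(" ", raw_html)
    try:
        is_reliable, _, details = cld2.detect(text)
    except cld2.error:
        return None
    # details holds (languageName, languageCode, percent, score) tuples
    if is_reliable and details[0][1] == "en":
        return text
    return None
\end{verbatim}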
 
 
\begin{table}

\caption{Stream corpus sources with the number of hourly directories and chunk files}
 
\begin{center}
 
 
 \begin{tabular}{l*{4}{l}l}
 
@@ -165,7 +165,7 @@ We use TREC-KBA 2013 dataset \footnote{http://http://trec-kba.org/trec-kba-2013.
 
\end{tabular}
 
\end{center}
 
\label{tab:streams}
 
\end{table}
 
 
\subsection{KB entities}
 
 
@@ -177,7 +177,7 @@ TREC-KBA provided relevance judgments for training and testing. Relevance judgme
 
 
 
 \subsection{Stream Filtering}
 
 Given a stream of documents (news items, blogs and social media posts) on the one hand, and KB entities (Wikipedia, Twitter) on the other, we study the factors and choices that affect filtering performance. Specifically, we conduct an in-depth analysis of the cleansing step, the entity-profile construction, the document category of the stream items, and the type of entity (Wikipedia or Twitter). We also study the impact of these choices on classification performance. Finally, we manually examine the relevant documents that defy filtering. We strive to answer the following research questions:
 
 
 
 \begin{enumerate}
 
 \item Does cleansing affect filtering and subsequent performance?
 
@@ -185,8 +185,7 @@ TREC-KBA provided relevance judgments for training and testing. Relevance judgme
 
  \item Is filtering different for Wikipedia and Twitter entities?
 
 \item Are some types of documents easily filterable and others not?
 
 \item Does a gain in recall at the filtering step translate to a gain in F-measure at the end of the pipeline?
 
 \item Are there vital (relevant) documents that are not filterable by a reasonable system?
 \item What are the vital (relevant) documents that are not retrievable by a system?
 
\end{enumerate}
 
 
\subsection{Evaluation}
 
@@ -209,12 +208,12 @@ We work with the subset of stream corpus documents  for whom there exist  annota
 
 
\subsection{Entity Profiling}
 
We build profiles for the KB entities of interest. There are two types of entities: Twitter and Wikipedia. Both sets are deliberately selected to be sparse and less documented. For the Twitter entities, we visit their respective Twitter pages and manually fetch their display names. For the Wikipedia entities, we fetch different name variants from DBpedia, namely name, label, birth name, alternative names, redirects, nickname, and alias. The extraction results are in Table \ref{tab:sources}.
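As an illustration, the sketch below collects labels and redirect labels for one entity from the public DBpedia SPARQL endpoint using the SPARQLWrapper library; the remaining variants (birth name, nickname, alias, alternative names) are fetched from the corresponding DBpedia properties in the same way. The properties shown and the example URI are indicative rather than a description of our exact extraction code.

\begin{verbatim}
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_variants(uri):
    """Fetch English labels and redirect labels for one DBpedia resource."""
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX dbo:  <http://dbpedia.org/ontology/>
      SELECT DISTINCT ?name WHERE {
        { <%s> rdfs:label ?name }
        UNION
        { ?r dbo:wikiPageRedirects <%s> . ?r rdfs:label ?name }
        FILTER (lang(?name) = "en")
      }""" % (uri, uri))
    rows = sparql.query().convert()["results"]["bindings"]
    return {row["name"]["value"] for row in rows}

# e.g. dbpedia_variants("http://dbpedia.org/resource/Buddy_MacKay")
\end{verbatim}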
 
\begin{table}

\caption{Number of entities and name strings per DBpedia name variant}
 
\begin{center}
 
 
 \begin{tabular}{l*{4}{c}l}
 
 Name variant & Entities & Strings \\
 
\hline
 
 Name  &82&82\\
 
 Label   &121 &121\\
 
@@ -228,19 +227,19 @@ Redirect  &49&96 \\
 
\end{tabular}
 
\end{center}
 
\label{tab:sources}
 
\end{table}
 
 
 
We have a total of 121 Wikipedia entities. Not every entity has a value for every name variant. Every entity has a DBpedia label, but only 82 entities have a name string and only 49 have redirect strings. Most entities have a single string per variant, but some have several redirect strings; one entity, Buddy\_MacKay, has the highest number (12) of redirect strings. Six entities have birth names, one entity has a nickname, one has an alias, and four have alternative names.
 
 
We combined the different name variants we extracted to form a set of strings for each KB entity. Specifically, we merged the names, labels, redirects, birth names, nicknames and alternative names of each entity. For Twitter entities, we used the display names that we collected. We consider the name of an entity that is part of its URL to be its canonical name. For example, in http://en.wikipedia.org/wiki/Benjamin\_Bronfman, Benjamin Bronfman is the canonical name of the entity. From the combined name variants and the canonical names, we created four profiles for each entity: canonical names (cano), partial names of the canonical names (cano-part), all name variants combined (all), and partial names of all name variants (all-part). The names in parentheses are used in the table captions.
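A minimal sketch of how the four profiles can be built from an entity URL and its extracted name variants is shown below; here we assume that ``partial names'' are the individual whitespace-separated tokens of each name string.

\begin{verbatim}
def build_profiles(wiki_url, variants):
    """Return the four entity profiles: cano, cano-part, all, all-part."""
    # canonical name: last part of the URL, underscores read as spaces
    canonical = wiki_url.rstrip("/").split("/")[-1].replace("_", " ")
    cano = {canonical}
    cano_part = set(canonical.split())       # partials of the canonical name
    all_names = set(variants) | cano         # all name variants combined
    all_part = {tok for name in all_names for tok in name.split()}
    return {"cano": cano, "cano-part": cano_part,
            "all": all_names, "all-part": all_part}

# e.g. build_profiles("http://en.wikipedia.org/wiki/Benjamin_Bronfman",
#                     variants_from_dbpedia)
\end{verbatim}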
 
\subsection{Annotation Corpus}
 
 
The annotation set is a combination of the annotations from before the Training Time Range (TTR) and from the Evaluation Time Range (ETR), and consists of 68405 annotations. Its breakdown into training and test sets is shown in Table \ref{tab:breakdown}.
 
 
 
\begin{table}
 
\caption{Number of annotation documents by relevance rating (vital, relevant) and by training/testing split}
 
\begin{center}
 
\begin{tabular}{l*{3}{c}r}
 
 &&Vital&Relevant  &Total \\
 
@@ -271,20 +270,20 @@ The annotation set is a combination of the annotations from before the Training
 
 
 
 
Most (more than 80\%) of the annotation documents are in the test set. In the 2013 training and test data combined, there are 68405 annotations, of which 50688 are unique document-entity pairs. Out of these 50688 pairs, 24162 are judged vital or relevant, of which 9521 are vital and 17424 are relevant.
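The unique-pair counts above are obtained by collapsing multiple annotations of the same document-entity pair. A small sketch of the deduplication we have in mind follows; keeping the highest rating per pair, and the numeric encoding of the ratings, are illustrative assumptions rather than the official TREC-KBA rules.

\begin{verbatim}
from collections import defaultdict

def unique_pairs(annotations):
    """annotations: iterable of (doc_id, entity, rating); 2 = vital, 1 = relevant."""
    best = defaultdict(int)
    for doc_id, entity, rating in annotations:
        # assumption: keep the highest rating given to a pair
        best[(doc_id, entity)] = max(best[(doc_id, entity)], rating)
    vital = sum(1 for r in best.values() if r == 2)
    relevant = sum(1 for r in best.values() if r == 1)
    return len(best), vital, relevant
\end{verbatim}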
 
 
 
 
 
\section{Experiments and Results}
 
 We conducted experiments to study the effect of cleansing, of different entity profiles, of entity types, of document categories, and of relevance ratings (vital or relevant), as well as their impact on classification. In the following subsections, we present and discuss the results.
 
 
 
 \subsection{Cleansing: raw or cleansed}
 
\begin{table}
 
\caption{Vital-relevant documents retrieved under different name variants (upper part: cleansed, lower part: raw)}
 
\begin{center}
 
\begin{tabular}{l@{\quad}lllllll}
 
\hline
 
 &cano &cano-part &all &all-part \\
 
\hline
 
 
 
@@ -307,10 +306,10 @@ Most (more than 80\%) of the annotation documents are in the test set.  In both
 
\end{table}
 
 
 
The upper part of Table \ref{tab:name} shows recall on the cleansed version and the lower part on the raw version. Recall increases substantially on the raw version for all entity types: by 8.2 to 12.8 points for Wikipedia entities, by 6.8 to 26.2 points for Twitter entities, and by 8.0 to 13.6 points over all entities. To put this into perspective, an 11.8-point increase in recall on all entities corresponds to 2864 additional unique document-entity pairs. This suggests that cleansing removes some documents that could otherwise be retrieved.
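As a rough sanity check, assuming recall is computed over the 24162 vital-relevant document-entity pairs, an 11.8-point increase corresponds to
\[
\Delta\mathrm{pairs} \approx \Delta R \times 24162 = 0.118 \times 24162 \approx 2851,
\]
which is consistent with the 2864 additional pairs reported above; the small difference presumably comes from rounding of the recall figures.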
 
 
\subsection{Entity Profiles}
 
Looking at recall on the raw corpus, filtering documents by canonical names achieves a recall of 59\%. Adding the other name variants improves recall to 79.8\%, an increase of 20.8 points; that is, 20.8\% of the documents mention the entities by names other than their canonical names. Canonical partial achieves a recall of 72\%, and name-variant partial achieves 90.2\%, which means that 18.2\% of the documents mention the entities by partial names of non-canonical name variants.
 
 
 
\subsection{Breakdown of results by document source category}
 
@@ -318,31 +317,28 @@ If we look at the recall performances for the raw corpus,   filtering documents
 
  
 
  
 
  \begin{table*}
 
\caption{Breakdown of recall percentage increases (deltas) by document category}
 
\begin{center}\begin{tabular}{l*{9}{c}r}
 
 && \multicolumn{3}{ c| }{All entities}  & \multicolumn{3}{ c| }{Wikipedia} &\multicolumn{3}{ c| }{Twitter} \\ 
 
 & &others&news&social & others&news&social &  others&news&social \\
 
\hline
 
 
 
\multirow{5}{*}{Vital} &cano                 &82.2 &65.6    &70.9          &90.9 &80.1   &76.8             &8.1   &6.3   &   30.5\\
			 &cano-part $-$ cano  	&8.2  &14.9    &12.3           &9.1  &18.6   &14.1             &0      &0       &0  \\
                         &all $-$ cano         	&12.6  &19.7    &12.3          &5.5  &15.8   &8.4             &73   &35.9    &38.3  \\
	                 &all-part $-$ cano-part&9.7    &18.7  &12.7       &0    &0.5  &5.1        &93.2 & 93 &64.4 \\
	                 &all-part $-$ all     	&5.4  &13.9     &12.7           &3.6  &3.3    &10.8              &20.3   &57.1    &26.1 \\
 
	                 \hline
 
	                 
 
\multirow{5}{*}{Relevant} &cano                 &84.2 &53.4    &55.6          &88.4 &75.6   &63.2             &10.6   &2.2    &  6\\
			 &cano-part $-$ cano  	&10.5  &15.1    &12.2          &11.1  &21.7   &14.1             &0   &0    &0  \\
                         &all $-$ cano         	&11.7  &36.6    &17.3          &9.2  &19.5   &9.9             &54.5   &76.3   &66  \\
	                 &all-part $-$ cano-part &4.2  &26.9   &15.8          &0.2    &0.7    &6.7           &72.2   &87.6 &75 \\
	                 &all-part $-$ all     	&3    &5.4     &10.7           &2.1  &2.9    &11              &18.2   &11.3    &9 \\
 
	                 
 
	                 \hline
 
\multirow{5}{*}{Total} 	&cano                &    81.1   &56.5   &58.2         &87.7 &76.3   &65.6          &9.8  &1.4    &13.5\\
			&cano-part $-$ cano   	&10.9   &15.5   &12.4         &11.9  &21.3   &14.4          &0     &0       &0\\
			&all $-$ cano         	&13.8   &30.6   &16.9         &9.1  &18.9   &10.2          &63.6  &61.8    &57.5 \\
                        &all-part $-$ cano-part	&7.2   &24.8   &15.9          &0.1    &0.7    &6.8           &82.2  &89.1    &71.3\\
                        &all-part $-$ all     	&4.3   &9.7    &11.4           &3.0  &3.1   &11.0          &18.9  &27.3    &13.8\\
 
	                 
 
                                  	                 
 
	                
 
@@ -355,10 +351,10 @@ If we look at the recall performances for the raw corpus,   filtering documents
 
 
 
 \begin{table*}
 
\caption{Breakdown of recall performance by document source category}
 
\begin{center}\begin{tabular}{l*{9}{c}r}
 
 && \multicolumn{3}{ c| }{All entities}  & \multicolumn{3}{ c| }{Wikipedia} &\multicolumn{3}{ c| }{Twitter} \\ 
 
 & &others&news&social & others&news&social &  others&news&social \\
 
\hline
 
 
 
\multirow{4}{*}{Vital} &cano                 &82.2& 65.6& 70.9& 90.9&  80.1& 76.8&   8.1&  6.3&  30.5\\
 
@@ -387,13 +383,13 @@ If we look at the recall performances for the raw corpus,   filtering documents
 
 
The results of the different entity profiles on the raw corpus are broken down by source category and relevance rating (vital or relevant). In total, there are 24162 vital or relevant unique entity-document pairs, of which 9521 are vital and 17424 are relevant. These documents fall into 8 source categories: 0.98\% arxiv (a), 0.034\% classified (c), 0.34\% forum (f), 5.65\% linking (l), 11.53\% mainstream-news (m-n), 18.40\% news (n), 12.93\% social (s) and 50.2\% weblog (w).
 
 
The results of the breakdown are presented in Table \ref{tab:source-delta}. The 8 document source categories are regrouped into three for two reasons. 1) Some categories are very similar to each other: mainstream-news and news are similar and exist separately only because they were collected from different sources, by different groups, and at different times; we refer to them jointly as news from now on. The same holds for weblog and social, which we refer to as social. 2) Some categories have so few annotations that treating them independently makes little sense. The majority of vital or relevant annotations are social (social and weblog, 63.13\%), and news (mainstream-news and news) make up 30\%; together, news and social account for about 93\% of all annotations. The remaining categories make up about 7\% and are grouped as others.
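The regrouping itself is a simple mapping from the eight source labels to the three groups, as sketched below; the exact label strings used in the corpus may differ from the ones written out here.

\begin{verbatim}
SOURCE_GROUP = {
    "mainstream_news": "news",   "news": "news",
    "weblog": "social",          "social": "social",
    "arxiv": "others",           "classified": "others",
    "forum": "others",           "linking": "others",
}

def regroup(source_label):
    # default to "others" for any unexpected label
    return SOURCE_GROUP.get(source_label, "others")
\end{verbatim}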
 
 
Table \ref{tab:source-delta} is a multi-dimensional table with three outer column groups: all entities, Wikipedia, and Twitter. Each outer group consists of the document categories others, news, and social. The rows are grouped into vital, relevant, and total, each of which contains the four entity profiles.
 
 
 
 
 
 
 
 \subsection{Relevance Rating: vital and relevant}
 
 
 
 When comparing recall on vital and relevant documents, we observe that canonical names achieve higher recall on vital than on relevant documents. This is especially true for Wikipedia entities: for example, the vital recall for news is 80.1 and for social 76.8, while the corresponding relevant recalls are 75.6 and 63.2 respectively. In general, recall on vital documents is higher than on relevant documents, suggesting that vital documents are more likely to mention the entities, and to do so using their common name variants.
 
 
 
@@ -408,44 +404,45 @@ The results of the breakdown by document categories is presented in a multi-dime
 
 
 
  
 
\subsection{Recall across document categories: others, news and social}

The recall for Wikipedia entities in Table \ref{tab:name} ranges from 61.8\% (canonical) to 77.9\% (name-variant). Table \ref{tab:source-delta} shows how this recall is distributed across the three document categories. For Wikipedia entities, across all entity profiles, others achieve the highest recall, followed by news and then social: while news recall ranges from 76.4\% to 98.4\%, recall for social documents ranges from 65.7\% to 86.8\%. Note that the others category covers arxiv (scientific documents), classified, forum and linking documents. For Twitter entities, however, the pattern is different: with canonical names (and their partials), social documents achieve higher recall than news, suggesting that social documents refer to Twitter entities by their user names more often than news do, whereas with name-variant partial, news achieve better results than social.

Overall, across all entity types and all entity profiles, others achieve higher recall than news, and news, in turn, achieve higher recall than social documents. This suggests that social documents are the hardest to retrieve, which is plausible since social posts are short and more likely to point to other resources or to use short, informal names.
 
 
 
 
We computed four percentage increases in recall (deltas) between the different entity profiles (Table \ref{tab:source-delta2}). The first delta is between canonical partial and canonical, the second between name-variant and canonical, the third between name-variant partial and canonical partial, and the fourth between name-variant partial and name-variant. We believe these four deltas have a clear meaning: the delta between name-variant and canonical is the percentage of documents that the additional name variants retrieve but the canonical name does not, and, similarly, the delta between name-variant partial and canonical partial is the percentage of document-entity pairs gained by the partial names of the name variants.

In most of the deltas, news shows the greatest increase, followed by social and then others. This suggests that news refer to entities by several different names rather than by a single standard name, which is counter-intuitive, since one would expect news to mention entities by consistent names and thereby keep the deltas small. For Wikipedia entities, the deltas between canonical partial and canonical, and between name-variant and canonical, are high, suggesting that partial names and the other name variants bring in documents that cannot be retrieved by canonical names alone; the remaining two deltas are small, suggesting that the partial names of the name variants bring in few new relevant documents. For Twitter entities, it is the name variants that bring in new documents.
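Each delta is a simple difference between the recall values of two profiles, as in the sketch below; the example uses the all-entities recall values on the raw corpus reported earlier (59\%, 72\%, 79.8\% and 90.2\%).

\begin{verbatim}
def deltas(recall):
    """recall: dict keyed by 'cano', 'cano-part', 'all', 'all-part' (percent)."""
    return {
        "cano-part - cano":     recall["cano-part"] - recall["cano"],
        "all - cano":           recall["all"]       - recall["cano"],
        "all-part - cano-part": recall["all-part"]  - recall["cano-part"],
        "all-part - all":       recall["all-part"]  - recall["all"],
    }

# deltas({"cano": 59.0, "cano-part": 72.0, "all": 79.8, "all-part": 90.2})
# gives approximately 13.0, 20.8, 18.2 and 10.4.
\end{verbatim}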
 
  
 
 \subsection{Entity Types: Wikipedia and Twitter}
 
Table \ref{tab:name} shows the difference between Wikipedia and Twitter entities. For Wikipedia entities, canonical achieves a recall of 70\% and canonical partial achieves 86.1\%, an increase of 16.1 points. By contrast, the increase of name-variant partial over name-variant is only 8.3 points. This can be explained by saturation: most documents have already been retrieved by the different name variants, so adding their partial names brings in few new relevant documents. One interesting observation is that, for Wikipedia entities, canonical partial achieves higher recall than name-variant, in both the cleansed and the raw corpus.

For Twitter entities, the picture is different. Canonical and canonical partial perform identically, and their recall is very low. They are identical because the canonical names of Twitter entities are single-word strings: for https://twitter.com/roryscovel, for example, ``roryscovel'' is the canonical name and its partial is the same string. The recall is low because these canonical names are not really names; they are usually arbitrarily created user names. Documents rarely refer to Twitter entities by their user names; they refer to them by their display names, which is reflected in the name-variant recall (67.9\%). Using name-variant partial increases the recall to 88.2\%.
 
 
At both the aggregate level (Wikipedia and Twitter entities combined) and the document-category level, we observe that recall increases as we move from canonical to canonical partial, to name-variant, and to name-variant partial. The only exception is the transition from canonical partial to name-variant for Wikipedia entities. At the aggregate level (as can be inferred from Table \ref{tab:name}), the difference in recall between canonical and name-variant partial is 31.9\% on all entities, 20.7\% on Wikipedia entities, and 79.5\% on Twitter entities, a substantial difference.
 
 
 
Tables \ref{tab:name} and \ref{tab:source-delta} show that recall for Wikipedia entities is higher than for Twitter entities, indicating that Wikipedia entities are easier to match in documents. This can be attributed to two reasons: 1) Wikipedia entities are better described than Twitter entities; the fact that we can retrieve several name variants from DBpedia is itself a sign of richer description, whereas for Twitter entities we only have two names, the user name and the display name collected from their Twitter pages. 2) There is no DBpedia-like resource for Twitter entities from which alternative names can be collected. The tables also show that Twitter entities are mentioned by their display names more than by their user names. However, social documents mention Twitter entities by their user names more often than news do, suggesting a difference in naming conventions between news and social documents.
 
 
 
 
   \subsection{Impact on classification}
 
 In the overall experimental setup, classification, ranking, and evaluation are kept constant. Here, we present results showing how the choices of corpus, entity type, and entity profile impact these later stages of the pipeline. Tables \ref{tab:class-vital} and \ref{tab:class-vital-relevant} show the performances in F-measure and SU.
 
\begin{table*}
 
\caption{Vital performance under different name variants (upper part: cleansed, lower part: raw)}
 
\begin{center}
 
\begin{tabular}{ll@{\quad}lllllll}
 
\hline
 
  &&cano&cano-part&all  &all-part \\
 
 
 
   all-entities &F& 0.241&0.261&0.259&0.265\\
 
@@ -465,7 +462,7 @@ The tables in \ref{tab:name} and \ref{tab:source-delta} show recall for Wikipedi
 
   Wikipedia&F &0.257&0.257&0.257&0.255\\
 
   &SU	     & 0.265&0.265 &0.266 & 0.259\\
 
   twitter&F &0.188&0.188&0.208&0.231\\
 
	&SU&    0.269 &0.250 &0.250&0.253\\
 
\hline
 
 
\end{tabular}
 
@@ -479,9 +476,9 @@ The tables in \ref{tab:name} and \ref{tab:source-delta} show recall for Wikipedi
 
\begin{center}
 
\begin{tabular}{ll@{\quad}lllllll}
 
\hline
 
 &&cano&cano-part&all  &all-part \\
 
 
   all-entities &F& 0.497&0.560&0.579&0.607\\
 
	      &SU&0.468  &0.484 &0.483 &0.492 \\	
 
@@ -510,20 +507,22 @@ The tables in \ref{tab:name} and \ref{tab:source-delta} show recall for Wikipedi
 
 
 
 
Table \ref{tab:class-vital} shows the classification performance for documents judged vital. On Wikipedia entities, except for the canonical profile, the cleansed version achieves better results than the raw version. On Twitter entities, however, the raw corpus achieves better results for all entity profiles except name-variant partial. At the aggregate level (Wikipedia and Twitter combined), the cleansed version achieves better results for three profiles; only for canonical partial does raw perform better. Overall, cleansed achieves better results than raw. This is interesting because we saw in previous sections that the raw corpus achieves higher recall than the cleansed one; with name-variant partial, for example, 10\% more relevant documents are retrieved from the raw corpus. The gain in recall on the raw corpus does not translate into a gain in F-measure; in most cases the F-measure even decreases. One explanation is that the raw corpus also brings in many false positives from related links, adverts and similar page elements, which confuse the classifier.
 
For Wikipedia entities, canonical partial achieves the highest performance; for Twitter entities, name-variant partial achieves better results.

For the vital-relevant category (Table \ref{tab:class-vital-relevant}), the picture is different. Except for canonical partial, raw achieves better results in all cases, and for Twitter entities the raw corpus achieves better results in all cases. In terms of entity profiles, Wikipedia's canonical partial achieves the best F-score; for Twitter, as before, name-variant partial does. The raw corpus has more effect on relevant documents and on Twitter entities.
 
 
An increase in recall does not necessarily mean an increase in F-measure. The fact that canonical partial achieves better results is interesting: partial names were used as a baseline in TREC KBA 2012, but, to the best of our knowledge, none of the KBA participants actually used partial names for filtering.
 
 
 
\subsection{Missing vital-relevant documents \label{miss}}

There is a trade-off between using a richer entity profile and retrieving irrelevant documents: the richer the profile, the more relevant documents it retrieves, but also the more irrelevant ones. To put this into perspective, compare the numbers of documents retrieved with canonical partial and with name-variant partial. On the raw corpus, canonical partial extracts a total of 2547487 documents and achieves a recall of 72.2\%, while name-variant partial extracts a total of 4735318 documents and achieves a recall of 90.2\%. The total number of extracted documents increases by 85.9\% for a recall gain of 18 points; most of the newly retrieved documents are irrelevant. There is an advantage in excluding irrelevant documents at the filtering stage because they confuse the later stages of the pipeline.
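The 85.9\% figure follows directly from the two document counts:
\[
\frac{4735318 - 2547487}{2547487} \approx 0.859 .
\]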
 
 
 Using name-variant partial for filtering is an aggressive attempt to retrieve as many relevant documents as possible, at the cost of also retrieving irrelevant ones. Even so, we still miss about 2363 (10\%) of the vital-relevant documents. Why are these documents missed? If they do not mention the entities by partial names of their name variants, how do they refer to them? Table \ref{tab:miss} shows the documents that we miss with respect to the cleansed and raw corpora. The upper part shows the number of documents missing from the cleansed and raw versions of the corpus; the lower part shows the intersections and exclusions in each corpus.
 
 
\begin{table}
 
\caption{The number of documents missing from the raw and cleansed extractions}
 
\begin{center}
 
\begin{tabular}{l@{\quad}llllll}
 
\hline
 
@@ -545,14 +544,13 @@ Raw & 276 & 4951 & 5227 \\
 
\label{tab:miss}
 
\end{table}
 
 
One would assume that the set of document-entity pairs extracted from the cleansed corpus is a subset of those extracted from the raw corpus. We find that this is not the case. There are 217 unique entity-document pairs that are retrieved from the cleansed corpus but not from the raw one; 57 of them are vital. Similarly, 3081 document-entity pairs are missing from the cleansed corpus but present in the raw one; 1065 of them are vital. Examining the content of the documents reveals that, in each case, part of the text is missing from the corresponding version. All the documents that we miss from the raw corpus are social, such as tweets and posts from other social media; to meet the format of the raw data (binary byte arrays), some of them must have been converted after collection and lost part or all of their content on the way. The documents that we miss from the cleansed corpus are similar: part or all of the content is lost during the cleansing process (the removal of HTML tags and non-English documents). In both cases, the mention of the entity happened to be in the part of the text that was cut out during transformation.
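The counts in Table \ref{tab:miss} reduce to simple set operations over (document, entity) pairs, as sketched below; the variable names are ours.

\begin{verbatim}
def missing_breakdown(judged, cleansed_hits, raw_hits):
    """All arguments are sets of (doc_id, entity) pairs."""
    missing_cleansed = judged - cleansed_hits        # not retrieved from cleansed
    missing_raw      = judged - raw_hits             # not retrieved from raw
    only_in_cleansed = missing_raw - missing_cleansed    # e.g. the 217 pairs
    only_in_raw      = missing_cleansed - missing_raw    # e.g. the 3081 pairs
    missing_both     = missing_cleansed & missing_raw    # e.g. the 2146 pairs
    return only_in_cleansed, only_in_raw, missing_both
\end{verbatim}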
 
 
 
 
 The most interesting relevance judgments are those that we miss from both the raw and the cleansed extractions. These are 2146 unique document-entity pairs, 219 of which have vital judgments. The missed vital annotations involve 28 Wikipedia and 7 Twitter entities, 35 in total. The great majority (86.7\%) of the missed documents are social, suggesting that social documents (tweets and blogs) can discuss entities without mentioning them by name more often than news and other documents do. This is in line with intuition.
 
   
 
   
 
  Vital documents show higher recall than relevant ones. This is not surprising, as vital documents are more likely to mention the entities. Across document categories, we consistently observe that others achieve the highest recall, followed by news and then social. Social documents are the hardest to retrieve; this can be explained by the fact that social documents (tweets, blogs) are more likely to point to a resource without mentioning the entities, whereas news documents typically mention the entities they talk about.
 
%    
 
   
 
   
 
%    
 
@@ -575,21 +573,6 @@ converting.  In both cases the mention of the entity happened to be on the part
 
% \label{tab:miss-category}
 
% \end{table*}
 
 
However, it is interesting to look into the actual content of the documents to gain insight into the ways a document can talk about an entity without mentioning it by name. We collected 35 documents, one for each entity, for manual examination. Below, we present the reasons we identified.
 
\paragraph{Outgoing link mentions} A post (e.g. a tweet) contains an outgoing link to a page that mentions the entity.
 
\paragraph{Event place - event} A document that talks about an event is vital to the location entity where the event takes place. For example, the Maha Music Festival takes place at Lewis and Clark\_Landing, and a document talking about the festival is vital for the park. There are also cases where an event's address places it inside a park, making the document vital to the park; this is essentially being mentioned by an address that belongs to a larger space.
 
\paragraph{Entity - related entity} A document about one important figure, such as an artist or athlete, can be vital to another. This is especially true if the two are contending for the same title, or one has taken a title or award from the other.
 
\paragraph{Organization - main activity} A document that talks about an area in which a company is active is vital for that organization. For example, Atacocha is a mining company, and a news item on mining waste was annotated vital for it.
 
\paragraph{Entity - group} If an entity belongs to a certain group (class), a news item about the group can be vital for its individual members. FrankandOak, for example, is named an innovative company, and a news item that talks about the group of innovative companies is relevant for it. Another example is a big event to which an entity is related, such as film awards for actors.
 
\paragraph{Artist - work} Documents that discuss the work of an artist can be relevant to the artist, for example a book or film being vital for the book's author or the film's director (or actors). Robocop is a film whose screenplay is by Joshua Zetumer, and a blog that talks about the film was judged vital for Joshua Zetumer.
 
\paragraph{Politician - constituency} A major political event in a certain constituency is vital for the politician from that constituency. 
 
 A good example is a weblog about two North Dakota counties being declared drought disasters; the item is vital for Joshua Boschee, a politician and member of the North Dakota Democratic party.
 
\paragraph{Head - organization} A document that talks about an organization of which the entity is the head can be vital for the entity. Jasper\_Schneider is the USDA Rural Development state director for North Dakota, and an article about the problems of primary health centers in North Dakota is judged vital for him.
 
\paragraph{World knowledge} Some connections are impossible to make without world knowledge. For example, ``refreshments, treats, gift shop specials, `bountiful, fresh and fabulous holiday decor,' a demonstration of simple ways to create unique holiday arrangements for any home; free and open to the public'' is judged relevant to Hjemkomst\_Center. This is a social media post, and unless one knows the person posting it, nothing in the text itself reveals the connection. Similarly, ``learn about the gray wolf's hunting and feeding behaviors and watch the wolves have their evening meal of a full deer carcass; \$15 for members, \$20 for nonmembers'' is judged vital to Red\_River\_Zoo.
 
\paragraph{No document content} Some documents were found to have no content at all.
 
\paragraph{Not clear why} It is not clear why some documents are annotated vital for some entities.
 
 
 
 
 
 
%    To gain more insight, I sampled for each 35 entities, one document-entity pair and looked into the contents. The results are in \ref{tab:miss from both}
 
@@ -642,7 +625,7 @@ However, it is interesting to look into the actual content of the documents to g
 
% \label{tab:miss from both}
 
% \end{table*}
 
 
We also observed that several documents with different document ids have the same content (88 news and 409 weblog documents among the vital annotations). Of the 35 vital document-entity pairs we examined, 13 are news and 22 are social.
 
 
 
 
   
 
  
 
@@ -650,44 +633,68 @@ We also observed that although documents have different document ids, several of
 
 
 
We conducted experiments to study the impact on recall of the different components of the filtering stage of an entity-based filtering and ranking pipeline. Specifically, we studied the impact of cleansing, entity profiles, relevance ratings, and document categories, and we examined the documents that are missed. We also measured the impact of these factors and choices on the later stages of the pipeline.
 
 
Experimental results on the TREC-KBA problem setting and dataset show that cleansing can remove part or all of the content of documents, making them difficult to retrieve; these documents can otherwise be retrieved from the raw version. The use of the raw corpus brings in documents that cannot be retrieved from the cleansed corpus, for all entity profiles and all entity types. The recall difference between the cleansed and raw corpora ranges from 6.8\% to 26.2\%, which in actual document-entity pairs amounts to thousands of documents, a substantial increase. However, the recall increases do not always translate into an improved overall F-score. For the vital rating, on both Wikipedia entities and all entities combined, the cleansed version performs better than the raw version; on Twitter entities, the raw corpus performs better except for the name-variant profile, though the difference is negligible. For vital-relevant, by contrast, the raw corpus performs better across all entity profiles and entity types, except for the canonical partial profile of Wikipedia entities.
 
 
The use of different entity profiles also shows a big difference in recall. Except for Wikipedia entities, where canonical partial achieves higher recall than name-variant, there is a steady increase in recall from canonical to canonical partial, to name-variant, and to name-variant partial. This pattern also holds across the document categories. Here too, however, the relationship between the recall gained by moving to a richer profile and the overall performance as measured by F-score is not linear.
 
 
For the vital rating, across all entity profiles and both versions of the corpus, Wikipedia's canonical partial profile achieves better performance than any other Wikipedia profile. For vital-relevant documents too, Wikipedia's canonical partial achieves the best result; on the raw corpus it is only slightly below name-variant partial. For Twitter entities, the name-variant partial profile achieves the highest F-score across all entity profiles and both versions of the corpus.
 
 
 
However, for vital plus relevant, the raw corpus performs  better except in partial canonical names. In all cases, Wikipedia's canonical partial names achieves better performance than any other profile. This is interesting because the retrieval of thousands of document-entity pairs did not translate to an increase in performance  in classification. One reason why an increase in recall does not translate to an increase in F-measure later is because of the retrieval of many false positives which confuse the classifier. A good profile for Wikipedia entities seem canonical partial names suggesting that there is actually no need to go and fetch different names variants. For Twitter entities, the use of partial names of their display names are  good entity profiles. 
 
Two observations stand out. 1) Cleansing affects Twitter entities and relevant documents: the recall gains obtained on the raw corpus for Twitter entities and for the relevant category also translate into overall performance gains. This implies that cleansing removes more relevant and social documents than it does vital and news documents. That it removes more relevant than vital documents can be explained by the fact that cleansing strips related links and adverts, which may contain a mention of the entities; in one example we examined, cleansing removed an image whose accompanying text mentioned an entity name and was actually relevant. That it removes social documents is consistent with the fact that most of the documents missing from the cleansed corpus are social, and all of the documents missing from the raw corpus are social; in both cases, social documents seem to suffer most from text transformation and cleansing. 2) Taking both recall at filtering and overall F-score at evaluation into account, canonical partial is the best entity profile for Wikipedia entities. It is interesting that the thousands of additional vital-relevant document-entity pairs retrieved by name-variant partial do not translate into better overall performance, all the more so since, to the best of our knowledge, canonical partial was not considered a contending profile for stream filtering by any participant. With this understanding, there is no need to fetch the different name variants from DBpedia, saving both time and computational resources.
 
 
The deltas between entity profiles, relevance ratings, and document categories reveal four differences between Wikipedia and Twitter entities. 1) For Wikipedia entities, the difference between canonical partial and canonical is higher (16.1\%) than between name-variant partial and name-variant (18.3\%). This can be explained by saturation: the documents have already been extracted by the name variants, so adding their partials does not bring in many new relevant documents. 2) Twitter entities are mentioned by name-variant or name-variant partial, as seen in the high recall these profiles achieve compared to the low recall of canonical (or canonical partial). This indicates that documents (especially news and others) almost never use user names to refer to Twitter entities; name-variant partials are the best entity profiles for Twitter entities. 3) Comparatively speaking, however, social documents refer to Twitter entities by their user names more often than news and others do, suggesting a difference in adherence to naming conventions. 4) Wikipedia entities achieve higher recall and higher overall performance.
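The deltas referred to above are simply pairwise recall differences between a richer and a leaner profile,
\[
\Delta(p_{\mathrm{rich}}, p_{\mathrm{lean}}) = \mbox{recall}(p_{\mathrm{rich}}) - \mbox{recall}(p_{\mathrm{lean}}),
\]
so that, for example, the 16.1\% above is the delta between canonical partial and canonical for Wikipedia entities.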
 
 
The high recall, and the subsequently higher overall performance, of Wikipedia entities can be attributed to two reasons. 1) Wikipedia entities are better described than Twitter entities; the fact that we can retrieve different name variants from DBpedia is itself a measure of this richer description. Rich descriptions play a role both in filtering and in the computation of features such as similarity measures in later stages of the pipeline. By contrast, we have only two names for Twitter entities: their user names and the display names we collect from their Twitter pages. 2) There is no DBpedia-like resource for Twitter entities from which alternative names can be collected.
 
 
 
In the experimental results, we also observed that recall scores in the vital category are higher than in the relevant category. This confirms a commonly held assumption: frequency of mention is related to relevance. It is the same assumption under which term frequency is used as an indicator of document relevance in many information retrieval systems. The more often a document mentions an entity explicitly by name, the more likely the document is vital for that entity.
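As an illustration (tf-idf is not part of our pipeline), the classic tf-idf weight of a term $t$ in a document $d$ grows with the number of times $t$ is mentioned:
\[
w_{t,d} = \mathit{tf}_{t,d} \cdot \log\frac{N}{\mathit{df}_t},
\]
where $\mathit{tf}_{t,d}$ is the number of occurrences of $t$ in $d$, $\mathit{df}_t$ the number of documents containing $t$, and $N$ the size of the collection.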
 
 
 
Across document categories, recall is highest for others, followed by news, and then social; social documents are the hardest to retrieve. This can be explained by the fact that social documents (tweets and blogs) are more likely to point to a resource where the entity is mentioned, to mention the entity with a short abbreviation, or to talk about the entity without naming it at all, relying on shared context. News documents, by contrast, mention the entities they discuss using the common name variants more than social documents do. However, the greater difference in recall between entity profiles in the news category indicates that news refers to a given entity by several different names rather than by one standard name, whereas others show the least variation in how they refer to entities; social documents fall in between. For Wikipedia entities, the deltas between canonical partial and canonical, and between name-variant and canonical, are high, indicating that canonical partials and name variants bring in new relevant documents that cannot be retrieved by canonicals alone. The other two deltas are very small, suggesting that the partial names of name variants do not bring in new relevant documents.
 
 
 
\section{Unfilterable Documents}
 
There is a trade-off between using a richer entity profile and retrieving irrelevant documents: the richer the profile, the more relevant documents it retrieves, but also the more irrelevant ones. To put this into perspective, let us compare the numbers of documents retrieved with canonical partial and with name-variant partial. On the raw corpus, the former retrieves a total of 2,547,487 documents and achieves a recall of 72.2\%; the latter retrieves a total of 4,735,318 documents and achieves a recall of 90.2\%. The total number of documents retrieved thus increases by 85.9\% for a recall gain of 18 percentage points, and most of the additional documents are newly introduced irrelevant ones.
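The trade-off follows directly from these counts:
\[
\frac{4{,}735{,}318 - 2{,}547{,}487}{2{,}547{,}487} \approx 0.859,
\]
an 85.9\% increase in the number of retrieved documents, against a recall gain of $90.2 - 72.2 = 18$ percentage points.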
 
 
% 
 
 
We observed that there are vital-relevant documents missed only on the raw corpus, and others missed only on the cleansed corpus; the reason is the transformation from one format to the other. The most interesting documents are those missed on both. We first identified the KB entities that have a vital relevance judgment but whose documents cannot be retrieved at all (35 in total) and then manually examined the documents' content to find out why they are missing.
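A minimal sketch of how these missed pairs can be isolated, assuming the relevance judgments and the two filtering runs are available as sets of (document id, entity) pairs; the function and variable names are ours, not those of the actual pipeline:

\begin{verbatim}
# Assumed inputs: sets of (document_id, entity) tuples.
#   judged:             pairs with a vital (or relevant) judgment
#   retrieved_raw:      pairs surviving the filter on the raw corpus
#   retrieved_cleansed: pairs surviving the filter on the cleansed corpus
def missed_pairs(judged, retrieved_raw, retrieved_cleansed):
    missed_raw_only = (judged - retrieved_raw) & retrieved_cleansed
    missed_cleansed_only = (judged - retrieved_cleansed) & retrieved_raw
    missed_both = judged - (retrieved_raw | retrieved_cleansed)
    return missed_raw_only, missed_cleansed_only, missed_both
\end{verbatim}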
 
 
 
 
 
We observed that, among the missing documents, different document ids can have the same content and be judged multiple times for a given entity.  %In the vital annotation, there are 88 news, and 409 weblog. 
 
Avoiding such duplicates, we randomly selected 35 documents, one for each entity: 13 news and 22 social. Below, we classify the situations in which a document can be vital for an entity even though it does not mention the entity through any of the entity profiles we used for filtering.
 
\paragraph{Outgoing link mentions} A post (e.g., a tweet) whose outgoing link points to a page that mentions the entity.
 
\paragraph{Event place - event} A document that talks about an event is vital to the location entity where the event takes place. For example, the Maha Music Festival takes place at Lewis\_and\_Clark\_Landing, so a document about the festival is vital for the park. There are also cases where an event's address places it in a park, which makes the document vital to the park; the entity is in effect mentioned through an address that belongs to a larger space.
 
\paragraph{Entity - related entity} A document about one important figure, such as an artist or athlete, can be vital to another. This is especially true if the two are contending for the same title, or if one has snatched a title or award from the other.
 
\paragraph{Organization - main activity} A document that talks about the area in which a company is active is vital for the organization. For example, Atacocha is a mining company, and a news item on mining waste was annotated vital for it.
 
\paragraph{Entity - group} If an entity belongs to a certain group (class), a news item about the group can be vital for its individual members. FrankandOak is named an innovative company, and a news item about the group of innovative companies is relevant for it. Another example is a big event to which an entity is related, such as film awards for actors.
 
\paragraph{Artist - work} Documents that discuss an artist's work can be relevant to the artist. Such cases include books or films being vital for the book's author or the film's director (or actor). RoboCop is a film whose screenplay was written by Joshua Zetumer, and a blog post about the film was judged vital for him.
 
\paragraph{Politician - constituency} A major political event in a certain constituency is vital for a politician from that constituency. A good example is a weblog post about two North Dakota counties being declared drought disasters, which is vital for Joshua Boschee, a politician and member of the North Dakota Democratic Party.
 
\paragraph{Head - organization} A document that talks about an organization of which the entity is the head can be vital for the entity. Jasper\_Schneider is the USDA Rural Development state director for North Dakota, and an article about the problems of primary health centers in North Dakota was judged vital for him.
 
\paragraph{World knowledge} Some relevance judgments are impossible to make without world knowledge. For example, ``refreshments, treats, gift shop specials, `bountiful, fresh and fabulous holiday decor,' a demonstration of simple ways to create unique holiday arrangements for any home; free and open to the public'' is judged relevant to Hjemkomst\_Center. This is a social media post, and unless one knows the person posting it, nothing in the text itself reveals the connection. Similarly, ``learn about the gray wolf's hunting and feeding behaviors and watch the wolves have their evening meal of a full deer carcass; \$15 for members, \$20 for nonmembers'' is judged vital to Red\_River\_Zoo.
 
\paragraph{No document content} Some documents were found to have no content at all.
 
\paragraph{Not clear why} For some documents, it is not clear why they were annotated vital for the entity.
 
 
 
While the use of the raw corpus and of the richer entity profiles increases recall substantially, it does not bring a comparable improvement in classification performance. The lack of improvement for some entity profiles can be explained by the false positives they bring in, which confuse the classifier; the lack of a considerable improvement on the raw corpus, however, remains unexplained.
 
 
Although Wikipedia and Twitter entities are similar in the sense that both are sparse entities, our results indicate that they should be treated differently. We have seen, for example, that the best profile for Wikipedia entities is the partial names of canonical names, while the best profile for Twitter entities is the partial names of name variants. Moreover, Twitter entities achieve better results with the raw corpus, whereas Wikipedia entities do so with the cleansed corpus.
 
 
The document categories also have an impact on performance. News items show greater variation in performance across the different entity profiles, indicating that news refers to entities in a wider, less uniform way than social documents do. This is counter-intuitive, since one would normally expect news to be more standardized than social documents (blogs and tweets).
 
 
  
 
As we move from leaner to richer entity profiles, we observe that there are still documents that we miss no matter what. While some could be retrieved with further modifications, others could not: no matter how rich a representation of the entities is used, a filtering system does not seem to capture them. These are documents in which the entity is referenced in one of the following ways: in a page linked to from the document (for example behind a ``read more'' link), venue-event, world knowledge, party-politician, company-related event, entity-related entity, or artist-artist's work. We also found that some documents have no content at all, and for another group of documents it is not clear why they are relevant.
 
 
 
 
 
\section{Conclusions}
 
In this paper, we examined the filtering stage of entity-centric stream filtering and ranking while holding the later stages of the pipeline fixed. In particular, we studied the cleansing step, different entity profiles, the type of entities (Wikipedia or Twitter), the categories of documents (news, social, or others), and the relevance ratings. We attempted to address the following research questions: 1) does cleansing affect filtering and subsequent performance? 2) what is the most effective entity profile? 3) is filtering different for Wikipedia and Twitter entities? 4) are some types of documents easier to filter than others? 5) does a gain in recall at the filtering step translate into a gain in F-measure at the end of the pipeline? and 6) under which circumstances can vital documents not be retrieved?
 
 
Cleansing does remove parts or the entire content of some documents, making them irretrievable. However, because of the false positives that are introduced, the recall gains of the raw corpus and of some richer entity profiles do not necessarily translate into overall performance gains. The conclusions on cleansing are therefore mixed: it helps overall performance for vital documents and Wikipedia entities, but hurts it for Twitter entities and the relevant category of the relevance ranking. Vital and relevant documents also differ in retrieval performance: vital documents are easier to filter than relevant ones.
 
 
 
Despite an aggressive attempt to filter as many vital-relevant documents as possible, we observe that some documents are still missed. While some could be retrieved with further modifications, others cannot: no matter how rich a representation of the entities an information filtering system uses, it does not capture them. The circumstances under which this happens are varied. Some documents have no content at all, and for others the judgment appears subjective (it is not clear why they were judged vital). The main circumstances under which vital documents defy filtering, however, are: outgoing link mentions, venue-event, entity-related entity, organization-main area of operation, entity-group, artist-artist's work, party-politician, and world knowledge.
 
 
 
%ACKNOWLEDGMENTS are optional
sigproc.bib
 
 
 
@inproceedings{balog2013multi,
 
      title={Multi-step classification approaches to cumulative citation recommendation},
 
      author={Balog, Krisztian and Ramampiaro, Heri and Takhirov, Naimdjon and N{\o}rv{\aa}g, Kjetil},
 
@@ -15,12 +14,6 @@
 
  year={2012}
 
}
 
 
@article{frank2013stream,
 
  title={Evaluating Stream Filtering for Entity Profile Updates for TREC 2013},
 
  author={Frank, John R and Bauer, J and Kleiman-Weiner, Max and Roberts, Daniel A and Tripuraneni, Nilesh and Zhang, Ce and R{\'e}, Christopher and Voorhees, Ellen and Soboroff, Ian},

  journal={Proceedings of The 22nd TREC},
 
  year={2013}
 
}
 
 
@inproceedings{robertson2002trec,
 
  title={The TREC 2002 Filtering Track Report.},
 
@@ -32,10 +25,12 @@
 
  year={2002}
 
}
 
 
 
 
@inproceedings{wang2013bit,
 
  title={BIT and MSRA at TREC KBA CCR Track 2013},
 
  author={Wang, Jingang and Song, Dandan and Lin, Chin-Yew and Liao, Lejian},
 
  booktitle={Notebook of the Text REtrieval Conference},

  journal={Proceedings of The 22nd TREC},
 
  year={2013}
 
}
 
 
@@ -43,18 +38,17 @@
 
@inproceedings{bordino2013penguins,
 
  title={Penguins in sweaters, or serendipitous entity search on user-generated content},
 
  author={Bordino, Ilaria and Mejova, Yelena and Lalmas, Mounia},
 
  booktitle={Proceedings of the 22nd ACM international conference on Conference on information \& knowledge management},
 
  journal={Proceedings of the 22nd ACM international conference on Conference on information \& knowledge management},
 
  pages={109--118},
 
  year={2013},
 
  organization={ACM}
 
}
 
 
 
 
@inproceedings{ceccarelli2013learning,
 
  title={Learning relatedness measures for entity linking},
 
  author={Ceccarelli, Diego and Lucchese, Claudio and Orlando, Salvatore and Perego, Raffaele and Trani, Salvatore},
 
  booktitle={Proceedings of the 22nd ACM international conference on Conference on information \& knowledge management},
 
  journal={Proceedings of the 22nd ACM international conference on Conference on information \& knowledge management},
 
  pages={139--148},
 
  year={2013},
 
  organization={ACM}
 
@@ -63,90 +57,80 @@
 
@inproceedings{taneva2013gem,
 
  title={Gem-based entity-knowledge maintenance},
 
  author={Taneva, Bilyana and Weikum, Gerhard},
 
  booktitle={Proceedings of the 22nd ACM international conference on Conference on information \& knowledge management},
 
  journal={Proceedings of the 22nd ACM international conference on Conference on information \& knowledge management},
 
  pages={149--158},
 
  year={2013},
 
  organization={ACM}
 
}
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
@inproceedings{abbes2013irit,
 
  title={IRIT at TREC Knowledge Base Acceleration 2013: Cumulative Citation Recommendation Task},
 
  author={Abbes, Rafik and Pinel-Sauvagnat, Karen and Hernandez, Nathalie and Boughanem, Mohand},
 
  booktitle={Notebook of the Text REtrieval Conference},

  journal={Proceedings of The 22nd TREC},
 
  year={2013}
 
}
 
 
 
@inproceedings{liu2013related,
 
  title={A Related Entity based Approach for Knowledge Base Acceleration},
 
  author={Liu, Xitong and Fang, H},
 
  booktitle={Notebook of the Text REtrieval Conference},

  journal={Proceedings of The 22nd TREC},
 
  year={2013}
 
}
 
 
 
@article{dietzumass,
 
  title={UMass at TREC 2013 Knowledge Base Acceleration Track},
 
  author={Dietz, Laura and Dalton, Jeffrey},

  booktitle={Notebook of the Text REtrieval Conference},

  journal={Proceedings of The 22nd TREC},
 
  year={2013}
 
}
 
 
@article{zhangpris,
 
  title={PRIS at TREC2013 Knowledge Base Acceleration Track},
 
  author={Zhang, Chunyun and Xu, Weiran and Liu, Ruifang and Zhang, Weitai and Zhang, Dai and Ji, Janshu and Yang, Jing},

  booktitle={Notebook of the Text REtrieval Conference},

  journal={Proceedings of The 22nd TREC},
 
  year={2013}
 
}
 
 
 
@article{illiotrec2013,
 
  title={The University of Illinois' Graduate School of Library and Information Science at TREC 2013},

  author={Efron, Miles and Willis, Craig and Organisciak, Peter and Balsamo, Brian and Lucic, Ana},

  booktitle={Notebook of the Text REtrieval Conference},

  journal={Proceedings of The 22nd TREC},
 
  year={2013}
 
}
 
 
@article{uvakba2013,
 
  title={Filtering Documents over Time for Evolving Topics - The University of Amsterdam at TREC 2013 KBA CCR},

  author={Kenter, Tom},

  booktitle={Notebook of the Text REtrieval Conference},

  journal={Proceedings of The 22nd TREC},
 
  year={2013}
 
}
 
 
@article{bouvierfiltering,
 
  title={Filtering Entity Centric Documents using Numerics and Temporals features within RF Classifier},
 
  author={Bouvier, Vincent and Bellot, Patrice},

  booktitle={Notebook of the Text REtrieval Conference},

  journal={Proceedings of The 22nd TREC},
 
  year={2013}
 
}
 
 
@article{niauniversity,
 
  title={University of Florida Knowledge Base Acceleration Notebook},
 
  author={Nia, Morteza Shahriari and Grant, Christan and Peng, Yang and Wang, Daisy Zhe and Petrovic, Milenko},

  booktitle={Notebook of the Text REtrieval Conference},

  journal={Proceedings of The 22nd TREC},
 
  year={2013}
 
}
 
 
 
 