Changeset - 430e35705d08
[Not reviewed]
0 1 0
Gebrekirstos Gebremeskel - 11 years ago 2014-06-10 11:11:59
destinycome@gmail.com
updated
1 file changed with 33 insertions and 32 deletions:
0 comments (0 inline, 0 general)
mypaper-final.tex
 
@@ -134,26 +134,26 @@ documents (news, blog, tweets) can influence filtering.
 
 
 
 In this paper, we hold the subsequent steps of the pipeline fixed, zoom in on the filtering step, and conduct an in-depth analysis of its main components. In particular, we study cleansing, different entity profiles, the type of entities (Wikipedia or Twitter), and the type of documents (social, news, etc.). The main contributions of the paper are:

 \begin{itemize}
  \item an in-depth analysis of the factors that affect entity-based stream filtering;
  \item identifying optimal entity profiles that do not compromise precision;
  \item describing relevant documents that are not amenable to filtering, and thereby estimating an upper bound on entity-based filtering.
 \end{itemize}

 The rest of the paper is organized as follows: 
 
 
 
 
 
 
 
 
 \section{Data and Task description}
 
 
 
We use the TREC KBA-CCR-2013 dataset\footnote{http://trec-kba.org/trec-kba-2013.shtml} and problem setting. The dataset consists of a time-stamped stream corpus, a set of KB entities, and a set of relevance judgments. 
 
\subsection{Stream corpus} The stream corpus comes in two versions: raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB respectively, after xz-compression and GPG encryption. The raw data is a dump of raw HTML pages. The cleansed version is the raw data after its HTML tags are stripped off and non-English documents are removed. The stream corpus is organized in hourly folders, each of which contains many chunk files. Each chunk file contains between hundreds and hundreds of thousands of serialized thrift objects. One thrift object is one document. A document can be a blog article, a news article, or a social media post (including tweets). The stream corpus comes from three sources: TREC KBA 2012 (social, news and linking)\footnote{http://trec-kba.org/kba-stream-corpus-2013.shtml}, arxiv\footnote{http://arxiv.org/}, and spinn3r\footnote{http://spinn3r.com/}. Table \ref{tab:streams} shows, for each sub-stream, the number of documents and the number of chunk files. 
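For concreteness, the following sketch shows one way to walk the hourly folders and read the thrift-serialized documents. It assumes the \texttt{streamcorpus} Python package distributed for TREC KBA and chunk files that have already been GPG-decrypted; the corpus path and the exact \texttt{Chunk} usage are illustrative assumptions rather than part of the official task setup.

\begin{verbatim}
# Minimal sketch: iterate hourly folders and read thrift-serialized documents.
# Assumes the TREC KBA `streamcorpus` Python package and chunk files that are
# already decrypted (illustrative only; adapt paths and API details as needed).
import os
from streamcorpus import Chunk   # assumed API: Chunk(path=...) yields StreamItems

CORPUS_ROOT = "kba-streamcorpus-2013"      # hypothetical local path

def iter_stream_items(corpus_root):
    """Yield one StreamItem (one document) per serialized thrift object."""
    for hour_dir in sorted(os.listdir(corpus_root)):          # hourly folders
        hour_path = os.path.join(corpus_root, hour_dir)
        if not os.path.isdir(hour_path):
            continue
        for chunk_name in sorted(os.listdir(hour_path)):      # chunk files
            for item in Chunk(path=os.path.join(hour_path, chunk_name)):
                yield item

for si in iter_stream_items(CORPUS_ROOT):
    text = si.body.clean_visible or si.body.raw   # cleansed text, else raw HTML
    # ... pass `text` and si.stream_id on to the filtering step
\end{verbatim}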
 
 
\begin{table*}
 
\caption{Number of documents and chunk files per sub-stream source}
 
\begin{center}
 
 
 \begin{tabular}{l*{4}{l}l}
 
 documents     &   chunk files    &    Sub-stream \\
 
\hline
 
 
126,952         &11,851         &arxiv \\
 
394,381,405      &   688,974        & social \\
 
134,933,117       &  280,658       &  news \\
 
@@ -163,28 +163,28 @@ We use the TREC KBA-CCR-2013 dataset \footnote{http://http://trec-kba.org/trec-k
 
14,755,278         &36,272     &    CLASSIFIED (spinn3r)\\
 
52,412         &9,499         &REVIEW (spinn3r)\\
 
7,637         &5,168         &MEMETRACKER (spinn3r)\\
 
1,040,520,595   &      2,222,554 &        Total\\
 
 
\end{tabular}
 
\end{center}
 
\label{tab:streams}
 
\end{table*}
 
 
\subsection{KB entities}
 
 
 
 The KB entities consist of 20 Twitter entities and 121 Wikipedia entities. The selected entities are, by design, sparse. The entities comprise 71 people, 1 organization, and 24 facilities.  
 
\subsection{Relevance judgments}
 
 
 
TREC-KBA provided relevance judgments for training and testing. Relevance judgments are given to document-entity pairs. Documents with citation-worthy content for a given entity are annotated as \emph{vital}, while documents with tangentially relevant content, documents that lack freshness, or documents whose content can be useful for an initial KB dossier are annotated as \emph{relevant}. Documents with no relevant content are labeled \emph{neutral}, and spam is labeled \emph{garbage}. The inter-annotator agreement on vital was 70\% in 2012 and 76\% in 2013, due to the more refined definition of vital and the distinction made between vital and relevant. 
 
 
 
 
 \subsection{Research questions}
 
 Given a stream of documents (news items, blogs, and social media posts) on the one hand, and KB entities (Wikipedia, Twitter) on the other, we study the cleansing step, entity-profile construction, the category of the stream items, the type of entities (Wikipedia or Twitter), and the impact on classification. In particular, we strive to answer the following questions:
 
 
 
 \begin{enumerate}
 
  \item Does cleansing affect filtering and subsequent performance?
 
  \item What is the most effective way of representing an entity profile?
 
  \item Is filtering different for Wikipedia and Twitter entities?
 
  \item Are some types of documents easier to filter than others?
 
  \item Does a gain in recall at filtering step translate to a gain in F-measure at the end of the pipeline?
 
@@ -269,25 +269,25 @@ The annotation set is a combination of the annotations from before the Training
 
 
 
 
 
 
Most (more than 80\%) of the annotation documents are in the test set. Some annotation documents do not have cleansed content. In both the training and test data for 2013, there are 68405 annotations, of which 50688 are unique document-entity pairs. Out of these 50688, 24162 unique document-entity pairs are vital or relevant, of which 9521 are vital and 17424 are relevant. 
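The counts above amount to simple bookkeeping over the relevance judgment file. The sketch below is a hedged illustration: the column positions and the numeric rating encoding are assumptions (placeholders), since the actual TREC-KBA judgment file has additional columns.

\begin{verbatim}
# Sketch: count unique document-entity pairs per relevance label.
# Column indices and the rating encoding are assumptions; adapt them to the
# actual TREC-KBA judgment file format.
import csv
from collections import Counter

def count_judgments(path, doc_col=0, ent_col=1, rating_col=2):
    best = {}                       # (doc_id, entity) -> highest rating seen
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue
            key = (row[doc_col], row[ent_col])
            rating = int(row[rating_col])
            best[key] = max(best.get(key, rating), rating)
    return Counter(best.values())   # e.g. how many pairs are vital / relevant
\end{verbatim}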
 
 
 
 
 
\section{Experiments and Results}
 
 We conducted experiments to study the effect of cleansing, different entity profiles, types of entities, categories of documents, relevance ranks (vital or relevant), and the impact on classification. For ease of understanding, we present the results in two categories: cleansing and document categories. In each case we study the number of annotated documents that are retrieved. 
 
 
 
 
 \subsection{Cleansing: raw or cleansed}
 
\begin{table}
 
\caption{Vital or relevant documents retrieved under different entity profiles; upper part from the cleansed corpus, lower part from the raw corpus}
 
\begin{center}
 
\begin{tabular}{l@{\quad}lllllll}
 
\hline
 
\multicolumn{1}{l}{\rule{0pt}{12pt}}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano partial}&\multicolumn{1}{l}{\rule{0pt}{12pt}all}&\multicolumn{1}{l}{\rule{0pt}{12pt}all\_part}\\[5pt]
 
\hline
 
 
 
 
   all-entities   &51.0  &61.7  &66.2  &78.4 \\	
 
   Wikipedia      &61.8  &74.8  &71.5  &77.9\\
 
@@ -303,30 +303,24 @@ Most (more than 80\%) of the annotation documents are in the test set. Some anno
 
 
\end{tabular}
 
\end{center}
 
\label{tab:name}
 
\end{table}
 
 
 
The upper part of Table \ref{tab:name} shows the recall performance on the cleansed version and the lower part on the raw version. The recall for all entity types increases substantially in the raw version. The recall increase on Wikipedia entities varies from 8.2 to 12.8 percentage points, on Twitter entities from 6.8 to 26.2, and on all entities from 8.0 to 13.6. These increases are substantial. To put them into perspective, an 11.8-point increase in recall on all entities corresponds to the retrieval of 2864 more unique document-entity pairs. This suggests that cleansing has removed some documents that we could otherwise retrieve. 
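To make the conversion from recall points to document counts explicit (assuming the recall denominator is the 24162 vital-or-relevant document-entity pairs):
\[
0.118 \times 24162 \approx 2851,
\]
which is consistent with the reported 2864 additional pairs, given that the 11.8-point difference is itself a rounded figure.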
 
 
\subsection{Entity Profiles}
 
Looking at the recall performance on the raw corpus, filtering documents by canonical names achieves a recall of 59\%. Adding the other name variants improves recall to 79.8\%, an increase of 20.8 percentage points. This means that 20.8\% of the documents mention the entities by names other than their canonical names. Partial names of canonical names achieve a recall of 72\%, and partial names of all name variants achieve 90.2\%. This says that 18.2\% of the documents mention the entities by partial names of non-canonical name variants. 
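The four entity profiles can be made concrete with a short sketch. As an illustration we assume that name variants come from DBpedia labels (plus the display name for Twitter entities) and that partial names are the whitespace-separated parts of each name; matching is a simple case-insensitive substring test.

\begin{verbatim}
# Illustrative construction of the four entity profiles used for filtering:
# cano, cano_part, all (name variants), all_part. Treating "partial names"
# as whitespace-separated name parts is an assumption of this sketch.
def build_profiles(canonical, variants):
    """canonical: str; variants: list of alternative names (e.g. DBpedia labels)."""
    def parts(names):
        return {p for name in names for p in name.split() if len(p) > 1}

    cano = {canonical}
    all_names = {canonical, *variants}
    return {
        "cano": cano,
        "cano_part": cano | parts(cano),
        "all": all_names,
        "all_part": all_names | parts(all_names),
    }

def matches(profile, text):
    """True if any profile string occurs in the document text."""
    text = text.lower()
    return any(name.lower() in text for name in profile)

# e.g. profiles = build_profiles("<canonical name>", ["<variant 1>", "<variant 2>"])
\end{verbatim}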
 
 
\subsection{Entity Type (Wikipedia vs. Twitter)}
 
Recall performance on Wikipedia entities shows that canonical names achieve a recall of 70\%, and partial names of canonical names achieve a recall of 86.1\%, an increase of 16.1 percentage points. By contrast, the increase of partial names of all name variants over all name variants is only 8.3 points. The large increase when moving from canonical names to their partial names, compared to the smaller increase when moving from all name variants to their partial names, can be explained by saturation: most documents have already been extracted by the different name variants, so using partial names does not bring in many new documents. One interesting observation is that, for Wikipedia entities, partial names of canonical names achieve better results than all name variants. This holds in both the cleansed and raw extractions. %In the raw extraction, the difference is about 3.7. 
 
 
For Twitter entities, however, the picture is different. Canonical names and their partial names perform the same, and the recall is very low. Canonical names and partial canonical names are identical for Twitter entities because they are one-word names. For example, for https://twitter.com/roryscovel, ``roryscovel'' is the canonical name and its partial name is the same. The recall is very low because the canonical names of Twitter entities are not really names; they are usually arbitrarily created user names. This shows that people do not refer to Twitter entities by their user names; they refer to them by their display names, which is reflected in the recall of all name variants (67.9\%). Using partial names of all name variants increases recall to 88.2\%.
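In other words, the canonical name of a Twitter entity in this setting is simply the user name taken from the profile URL, and the display name is the only other variant. A small helper (with a hard-coded display name purely for illustration) makes the distinction explicit.

\begin{verbatim}
# The canonical name of a Twitter entity is its user name from the URL; the
# display name is taken from the profile page (hard-coded here for illustration).
from urllib.parse import urlparse

def twitter_canonical(profile_url):
    """'https://twitter.com/roryscovel' -> 'roryscovel'"""
    return urlparse(profile_url).path.strip("/").lower()

canonical = twitter_canonical("https://twitter.com/roryscovel")  # 'roryscovel'
variants = ["Rory Scovel"]     # display name, assumed to be scraped separately
\end{verbatim}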
 
 
At the aggregate level (Twitter and Wikipedia entities together), we observe two important patterns. 1) Recall increases as we move from canonical names to canonical partial names, to all name variants, and to partial names of name variants. We saw that this is not the case for Wikipedia entities; the influence, therefore, comes from the Twitter entities. 2) Canonical names retrieve the fewest vital or relevant documents, and partial names of all name variants retrieve the most. The difference in performance is 31.9 percentage points on all entities, 20.7 on Wikipedia entities, and 79.5 on Twitter entities. This is a substantial difference. 
 
 
\subsection{Breakdown of results by document source category}
 
 
  
 
  
 
  \begin{table*}
 
\caption{Breakdown of recall by source category, and deltas between entity profiles}
 
\begin{center}\begin{tabular}{l*{9}{c}r}
 
 && \multicolumn{3}{ c| }{All entities}  & \multicolumn{3}{ c| }{Wikipedia} &\multicolumn{3}{ c| }{Twitter} \\ 
 
 & &Others&news&social & Others&news&social &  Others&news&social \\
 
\hline
 
 
 
@@ -386,62 +380,69 @@ When we talk at an aggregate-level (both Twitter and Wikipedia entities), we obs
 
	                 
 
\hline
 
\end{tabular}
 
\end{center}
 
\label{tab:source-delta}
 
\end{table*}
 
    
 
 
The results of the different entity profiles on the raw corpus are broken down by source category and relevance rank (vital or relevant). In total, there are 24162 vital or relevant unique entity-document pairs; 9521 of them are vital and 17424 are relevant. These documents fall into 8 source categories: 0.98\% arxiv (a), 0.034\% classified (c), 0.34\% forum (f), 5.65\% linking (l), 11.53\% mainstream-news (m-n), 18.40\% news (n), 12.93\% social (s) and 50.2\% weblog (w). 
 
 
The 8 document source categories are regrouped into three, for two reasons. 1) Some categories are very similar to each other. Mainstream-news and news are similar; they exist separately only because they were collected from different sources, by different groups, and at different times. We call them news from now on. The same is true for weblog and social, which we call social from now on. 2) Some categories have so few annotations that treating them independently does not make much sense. The majority of vital or relevant annotations are social (social and weblog, 63.13\%), and news (mainstream-news and news) makes up 30\%; together, news and social account for about 93\% of all annotations. The remaining categories make up about 7\% and are grouped as others. 
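This regrouping amounts to a simple lookup table; the source strings below follow the category names listed above, though the literal labels in the corpus metadata may be spelled differently (an assumption of this sketch).

\begin{verbatim}
# Regroup the eight source categories into news, social, and others.
# The literal source strings in the corpus metadata may differ (assumption).
REGROUP = {
    "mainstream_news": "news",
    "news": "news",
    "weblog": "social",
    "social": "social",
    "arxiv": "others",
    "classified": "others",
    "forum": "others",
    "linking": "others",
}

def regroup(source):
    return REGROUP.get(source.lower(), "others")
\end{verbatim}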
 
 
 
The results of the breakdown by document category are presented in the multi-dimensional Table \ref{tab:source-delta}. There are three outer columns: all entities, Wikipedia, and Twitter. Each outer column consists of the document categories others, news, and social. The rows consist of vital, relevant, and total, each of which has the four entity profiles.   
 
 
 
 
 
 
 
 
 \subsection{Relevance Rating: Vital and Relevant}
 
 
 
 When comparing the recall performance on vital and on relevant, we observe that canonical names do better on vital than on relevant. This is especially true for Wikipedia entities. For example, the vital recall for news is 80.1\% and for social 76\%, while the corresponding relevant recalls are 75.6\% and 63.2\% respectively. We can generally see that recall on vital is better than recall on relevant, suggesting that vital documents are more likely to mention the entities and, when they do, to use some of their common name variants. 
 
 
 
%  \subsection{Difference by document categories}
 
%  
 
 
 
%  Generally, there is greater variation in relevant rank than in vital. This is specially true in most of the Delta's for Wikipedia. This  maybe be explained by news items referring to  vital documents by a some standard name than documents that are relevant. Twitter entities show greater deltas than Wikipedia entities in both vital and relevant. The greater variation can be explained by the fact that the canonical name of Twitter entities retrieves very few documents. The deltas that involve canonical names of Twitter entities, thus, show greater deltas.  
 
%  
 
 
% If we look in recall performances, In Wikipedia entities, the order seems to be others, news and social. This means that others achieve a higher recall than news than social.  However, in Twitter entities, it does not show such a strict pattern. In all, entities also, we also see almost the same pattern of other, news and social. 
 
 
 
 
  
 
\subsection{Document category: others, news and social}
 
 
The recall for Wikipedia entities in Table \ref{tab:name} ranges from 61.8\% (canonical names) to 77.9\% (partial names of name variants). We looked at how this recall is distributed across the three document categories. In the Wikipedia column of Table \ref{tab:source-delta} we see, across all entity profiles, that others achieve the highest recall, followed by news; social documents achieve the lowest. While the news recall ranges from 76.4\% to 98.4\%, the recall for social documents ranges from 65.7\% to 86.8\%. This pattern (others above news above social) holds across all entity profiles for Wikipedia entities. Note that the others category stands for arxiv (scientific documents), classifieds, forums and linking.
 
 
 
For Twitter entities, however, the pattern is different. With canonical names (and their partials), social documents achieve higher recall than news. This suggests that social documents refer to Twitter entities by their canonical names (user names) more often than news does. With partial names of all name variants, news achieves better results than social. The difference in recall between canonical names and partial names of all name variants shows that news does not refer to Twitter entities by their user names; it refers to them by their display names.
 
 
Overall, across all entity types and all entity profiles, others achieve better recall than news, and news, in turn, achieves higher recall than social documents. This suggests that social documents are the hardest to retrieve. This makes sense, since social posts are short and are more likely to point to other resources or to use short, informal names.
 
 
 
 
 
 
 
We computed four percentage-point increases in recall (deltas) between the different entity profiles (see Table \ref{tab:source-delta2}). The first delta is between partial names of canonical names and canonical names. The second is between all name variants and canonical names. The third is between partial names of name variants and partial names of canonical names, and the fourth between partial names of name variants and all name variants. We believe these four deltas have a clear meaning. The delta between all name variants and canonical names shows the percentage of documents that the additional name variants retrieve but the canonical name does not. Similarly, the delta between partial names of name variants and partial names of canonical names shows the percentage of document-entity pairs that is gained by the partial names of the name variants. 
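Given the per-profile recall figures, the four deltas are plain pairwise differences; the sketch below spells out which profile pairs are subtracted (recall values in percent).

\begin{verbatim}
# The four recall deltas between entity profiles, in percentage points.
# `recall` maps profile name -> recall in percent.
def profile_deltas(recall):
    return {
        "cano_part - cano":     recall["cano_part"] - recall["cano"],
        "all - cano":           recall["all"]       - recall["cano"],
        "all_part - cano_part": recall["all_part"]  - recall["cano_part"],
        "all_part - all":       recall["all_part"]  - recall["all"],
    }

# Example with the raw-corpus, all-entities figures quoted earlier:
# profile_deltas({"cano": 59.0, "cano_part": 72.0, "all": 79.8, "all_part": 90.2})
\end{verbatim}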
 
 
 
In most of the deltas, news, followed by social, followed by others, shows the greatest difference. This suggests that news refers to entities by different names rather than by one standard name. This is counter-intuitive, since one would expect news to mention entities by some consistent name(s), thereby reducing the difference. For Wikipedia entities, the deltas between canonical partials and canonicals, and between all name variants and canonicals, are high, suggesting that partial names and the other name variants bring in new documents that cannot be retrieved by canonical names alone. The remaining two deltas are very small, suggesting that partial names of all name variants do not bring in many new relevant documents. For Twitter entities, the name variants do bring in new documents. 
 
 
% The  biggest delta  observed is in Twitter entities between partials of all name variants and partials of canonicals (93\%). delta. Both of them are for news category.  For Wikipedia entities, the highest delta observed is 19.5\% in cano\_part - cano followed by 17.5\% in all\_part in relevant.  
 
  
 
 
 \subsection{Entity Type: Wikipedia and Twitter}
 
 
 
 
 
 
 
 
Tables \ref{tab:name} and \ref{tab:source-delta} show that recall for Wikipedia entities is higher than for Twitter entities. This indicates that Wikipedia entities are easier to match in documents than Twitter entities. This can be due to two reasons: 1) Wikipedia entities are described in relatively more detail than Twitter entities; the fact that we can retrieve different name variants from DBpedia is a measure of this richer description. By contrast, we have only two names for Twitter entities: their user names and their display names, which we collect from their Twitter pages. 2) The Twitter entities are more obscure, which is why they are not in Wikipedia in the first place. Another point is that Twitter entities are mentioned by their display names more than by their user names. We also observed that social documents mention Twitter entities by their user names more than news does, suggesting a difference in naming conventions between news and social documents. 
 
 
 
 
   \subsection{Impact on classification}
 
 In the overall experimental setup, the goal is to keep the classification step constant. Here, we present results showing how filtering affects end-to-end performance. Tables \ref{tab:class-vital} and \ref{tab:class-vital-relevant} show the performance in F-measure and SU. 
 
\begin{table*}
 
\caption{Vital performance under different entity profiles; upper part from the cleansed corpus, lower part from the raw corpus}
 
\begin{center}
 
\begin{tabular}{ll@{\quad}lllllll}
 
\hline
 
\multicolumn{1}{l}{\rule{0pt}{12pt}}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano partial}&\multicolumn{1}{l}{\rule{0pt}{12pt}all}&\multicolumn{1}{l}{\rule{0pt}{12pt}all\_part}\\[5pt]
 
\hline
Cleansed
 
@@ -513,56 +514,56 @@ We have looked into the effect of cleansing in filtering. Further, we have looke
 
 
Let us look at cleansed versus raw for Wikipedia entities: the cleansed version gives better results than the raw version in three of the profiles (all except canonical). For Twitter entities, however, the raw corpus achieves a better score in all profiles except name variants. For all entities, cleansed achieves better results in three profiles (only for canonical partial does raw perform better). 
 
 
In terms of profiles, Wikipedia's canonical partial names achieve the highest performance. For Twitter, the partial names of name variants achieve better results. 
 
 
For vital plus relevant: for Wikipedia entities, raw achieves better results in three cases (all except canonical partials), and for Twitter entities the raw corpus achieves better results throughout. In terms of entity profiles, Wikipedia's canonical partial names achieve the best F-score; for Twitter, it is the partial names of canonical names. 
 
 
The raw corpus seems to have a larger effect on the performance for Twitter entities. An increase in recall does not necessarily mean an increase in F-measure. 
 
 
 
\subsection{Missing relevant documents \label{miss}}
 
 
There is a trade-off between using a richer entity profile and retrieving irrelevant documents: the richer the profile, the more relevant documents it retrieves, but also the more irrelevant ones. To put this into perspective, let us compare the number of documents retrieved with partial names of name variants and with partial names of canonical names. Using the raw corpus, the partial names of canonical names extract a total of 2,547,487 documents and achieve a recall of 72.2\%. By contrast, the partial names of name variants extract a total of 4,735,318 documents and achieve a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18 percentage points. The rest of the documents, that is 67.9\%, are newly introduced irrelevant documents. There is an advantage in excluding irrelevant documents from filtering because they confuse the later stages of the pipeline. 
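The reported 85.9\% increase follows directly from the two totals:
\[
\frac{4{,}735{,}318 - 2{,}547{,}487}{2{,}547{,}487} \;=\; \frac{2{,}187{,}831}{2{,}547{,}487} \;\approx\; 0.859 .
\]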
 
 
 
 The use of the partial names of name variants for filtering is therefore an aggressive attempt to retrieve as many relevant documents as possible at the cost of retrieving irrelevant documents. However, we still miss about 2363 (10\%) of the vital-relevant documents. Why are these documents missed? If they are not mentioned by partial names of name variants, what are they mentioned by? Table \ref{tab:miss} shows the documents that we miss with respect to the cleansed and raw corpus. The upper part shows the number of documents missing from the cleansed and raw versions of the corpus; the lower part shows the intersections and exclusions in each corpus.  
 
 
\begin{table}
 
\caption{The number of documents missing from the raw and cleansed extractions}
 
\begin{center}
 
\begin{tabular}{l@{\quad}llllll}
 
\hline
 
\multicolumn{1}{l}{\rule{0pt}{12pt}category}&\multicolumn{1}{l}{\rule{0pt}{12pt}Vital }&\multicolumn{1}{l}{\rule{0pt}{12pt}Relevant }&\multicolumn{1}{l}{\rule{0pt}{12pt}Total }\\[5pt]
 
\hline
 
 
Cleansed &1284 & 1079 & 2363 \\
 
Raw & 276 & 4951 & 5227 \\
 
\hline
 
 missing only from cleansed &1065&2016&3081\\
 
  missing only from raw  &57 &160 &217 \\
 
  Missing from both &219 &1927&2146\\
 
\hline
 
 
 
 
\end{tabular}
 
\end{center}
 
\label{tab:miss}
 
\end{table}
 
 
 
 It is natural to assume that the set of document-entity pairs extracted from the cleansed corpus is a subset of those extracted from the raw corpus. We find that this is not the case. There are 217 unique entity-document pairs that are retrieved from the cleansed corpus but not from the raw one; 57 of them are vital. Similarly, there are 3081 document-entity pairs that are missing from the cleansed corpus but present in the raw one; 1065 of them are vital. Examining the content of the documents reveals that this is due to text missing from the corresponding document. All the documents that we miss from the raw corpus are social, specifically from the category social (not from weblogs). These are documents such as tweets and posts from other social media. To meet the format of the raw data, some of them must have been converted after collection and lost part or all of their content along the way. It is similar for the documents that we miss from the cleansed corpus: part of the content is lost in converting. In both cases, the mention of the entity happened to be in the part of the text that was cut out during conversion. 
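The figures in Table \ref{tab:miss} are plain set operations over the vital-or-relevant document-entity pairs. A sketch, assuming \texttt{retrieved\_cleansed} and \texttt{retrieved\_raw} are the sets of (document, entity) pairs matched in each corpus version and \texttt{judged} is the set of all vital-or-relevant pairs:

\begin{verbatim}
# Missing-document accounting as set operations over (doc_id, entity) pairs.
# `judged` = all vital-or-relevant pairs; the two retrieved sets come from
# filtering the cleansed and the raw corpus with the all_part profile.
def missing_breakdown(judged, retrieved_cleansed, retrieved_raw):
    miss_cleansed = judged - retrieved_cleansed
    miss_raw = judged - retrieved_raw
    return {
        "missing from cleansed":      len(miss_cleansed),
        "missing from raw":           len(miss_raw),
        "missing only from cleansed": len(miss_cleansed - miss_raw),
        "missing only from raw":      len(miss_raw - miss_cleansed),
        "missing from both":          len(miss_cleansed & miss_raw),
    }
\end{verbatim}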
 
 
 
 
 
 The interesting set of relevance judgments are those that we miss from both the raw and cleansed extractions. These are 2146 unique document-entity pairs, 219 of which have vital relevance judgments. The missed vital annotations involve 28 Wikipedia and 7 Twitter entities, 35 entities in total. Looking into the document categories shows that the great majority (86.7\%) of these documents are social. This suggests that social documents (tweets and blogs) can talk about the entities without mentioning them by name. This is, of course, in line with intuition. 
 
   
 
   
 
   
 
   
 
%    
 
%    \begin{table*}
 
% \caption{Breakdown of missing documents by sources for cleansed, raw and cleansed-and-raw}
 
% \begin{center}\begin{tabular}{l*{9}r}
 
%   &others&news&social \\
 
% \hline
 
% 
 
% 			&missing from raw only &   0 &0   &217 \\
 
@@ -570,40 +571,40 @@ converting.  In both cases the mention of the entity happened to be on the part
 
% 
 
%                          &missing from both    &19 &317     &2196 \\
 
%                         
 
%                          
 
% 
 
% \hline
 
% \end{tabular}
 
% \end{center}
 
% \label{tab:miss-category}
 
% \end{table*}
 
 
However, it is interesting to look into the actual content of the documents to gain insight into the ways a document can talk about an entity without mentioning it by name. We collected 35 documents, one for each entity, for manual examination. Below we present the reasons.
 
 
\paragraph{Outgoing link mentions} A post (tweet) with an outgoing link which mentions the entity.
 
\paragraph{Event place - Event} A document that talks about an event is vital for the location entity where it takes place. For example, the Maha Music Festival takes place in Lewis and Clark\_Landing, and a document talking about the festival is vital for the park. There are also cases where an event's address places the event in a park, and because of that the document becomes vital for the park. This is basically being mentioned by an address that belongs to a larger place. 
 
\paragraph{Entity - related entity} A document about an important figure such as an artist or athlete can be vital for another one. This is especially true if the two are contending for the same title, or one has snatched a title or award from the other. 
 
\paragraph{Organization - main activity} A document that talks about an area in which a company is active is vital for the organization. For example, Atacocha is a mining company, and a news item on mining waste was annotated vital. 
 
\paragraph{Entity - group} If an entity belongs to a certain group (class), a news item about the group can be vital for the individual members. FrankandOak is named an innovative company, and a news item that talks about a group of innovative companies is relevant for it. Another example is a big event to which an entity is related, such as film awards for actors. 
 
\paragraph{Artist - work} Documents that discuss the work of artists can be relevant for the artists. Such cases include books or films being vital for the book's author or the film's director (or actors). Robocop is a film whose screenplay is by Joshua Zetumer; a blog that talks about the film was judged vital for Joshua Zetumer. 
 
\paragraph{Politician - constituency} A major political event in a certain constituency is vital for the politician from that constituency. 
 
 A good example is a weblog about two North Dakota counties being declared drought disasters; this news is vital for Joshua Boschee, a politician and member of the North Dakota Democratic Party.  
 
 
\paragraph{head - organization} A document that talks about an organization of which the entity is the head can be vital for the entity.  Jasper\_Schneider is USDA Rural Development state director for North Dakota and an article about problems of primary health centers in North Dakota is judged vital for him. 
 
\paragraph{World Knowledge} Some annotations are impossible to understand without world knowledge. For example, ``refreshments, treats, gift shop specials, `bountiful, fresh and fabulous holiday decor,' a demonstration of simple ways to create unique holiday arrangements for any home; free and open to the public'' is judged relevant for Hjemkomst\_Center. This is a social media post, and unless one knows the person posting it, there is no way to tell this from the text. Similarly, ``learn about the gray wolf's hunting and feeding behaviors and watch the wolves have their evening meal of a full deer carcass; \$15 for members, \$20 for nonmembers'' is judged vital for Red\_River\_Zoo.  
 
\paragraph{No document content} Some documents were found to have no content.
 
\paragraph{Not clear why} It is not clear why some documents are annotated vital for some entities.
 
 
 
 
 
 
 
%    To gain more insight, I sampled for each 35 entities, one document-entity pair and looked into the contents. The results are in \ref{tab:miss from both}
 
%    
 
%    \begin{table*}
 
% \caption{Missing documents and their mentions }
 
% \begin{center}
 
% 
 
%  \begin{tabular}{l*{4}{l}l}
 
%  &entity&mentioned by &remark \\
 
% \hline
 
%  Jeremy McKinnon  & Jeremy McKinnon& social, mentioned in read more link\\
 
% Blair Thoreson   & & social, There is no mention by name, the article talks about a subject that is political (credit rating), not apparent to me\\
 
%   Lewis and Clark Landing&&Normally, maha music festival does not mention ,but it was held there \\
 
@@ -636,25 +637,25 @@ Although they have different document ids, many of the documents have the same c
 
% DeAnne Smith && No mention, talks related and there are links\\
 
% Richard Edlund && talks an ward ceemony in his field \\
 
% Jennifer Baumgardner && no idea why\\
 
% Jeff Tamarkin && not clear why\\
 
% Jasper Schneider &&no mention, talks about rural development of which he is a director \\
 
% urbren00 && No content\\
 
% \hline
 
% \end{tabular}
 
% \end{center}
 
% \label{tab:miss from both}
 
% \end{table*}
 
 
 
We also observed that, although documents have different document ids, several of them have the same content. In the vital annotations, there are only three (88 news and 409 weblog). Of the 35 vital document-entity pairs we examined, 13 are news and 22 are social. 
 
 
   
 
  
 
\section{Analysis and Discussion}
 
 
 
We conducted experiments to study the impact on recall of the different components of the filtering step of the CCR pipeline. Specifically, we studied the impact of cleansing, entity profiles, relevance rankings, document categories, and the documents that are missed.
 
 
Experimental results on the TREC-KBA task show that cleansing removes documents, or parts of documents, making them difficult to retrieve; these documents can otherwise be retrieved from the raw version. The use of the raw corpus brings in documents that cannot be retrieved from the cleansed corpus. This is true for all entity profiles and all entity types. The recall increase is between 6.8\% and 26.2\%; in actual document-entity pairs, this increase is in the thousands. 