Changeset - 4d6b80d23816
Arjen de Vries (arjen) - 11 years ago 2014-06-12 01:44:11
arjen.de.vries@cwi.nl
reorganization after telconf
1 file changed with 186 insertions and 113 deletions:
mypaper-final.tex
 
@@ -35,6 +35,10 @@
 
 
\title{Entity-Centric Stream Filtering and Ranking: Filtering and Unfilterable Documents
 
}
 
%SUGGESTION:
 
%\title{The Impact of Entity-Centric Stream Filtering on Recall and
 
%  Missed Documents}
 
 
%
 
% You need the command \numberofauthors to handle the 'placement
 
% and alignment' of the authors beneath the title.
 
@@ -140,14 +144,14 @@ documents (news, blog, tweets) can influence filtering.
 
 
 
 
 \section{Data Description}
 
 
 
We use the TREC-KBA 2013 dataset\footnote{http://trec-kba.org/trec-kba-2013.shtml}. The dataset consists of a time-stamped stream corpus, a set of KB entities, and a set of relevance judgments. 
 
\subsection{Stream corpus} The stream corpus comes in two versions: raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB respectively, after xz-compression and GPG encryption. The raw data is a dump of raw HTML pages. The cleansed version is the raw data after the HTML tags are stripped off, keeping only the English documents identified with the Chromium Compact Language Detector\footnote{https://code.google.com/p/chromium-compact-language-detector/}. The stream corpus is organized in hourly folders, each of which contains many chunk files. Each chunk file contains between a few hundred and hundreds of thousands of serialized Thrift objects; one Thrift object is one document. A document can be a blog article, a news article, or a social media post (including tweets). The stream corpus comes from three sources: TREC KBA 2012 (social, news and linking)\footnote{http://trec-kba.org/kba-stream-corpus-2012.shtml}, arxiv\footnote{http://arxiv.org/}, and spinn3r\footnote{http://spinn3r.com/}. Table \ref{tab:streams} shows the sources, the number of hourly directories, and the number of chunk files. 
 
 
\begin{table}
 
\caption{Documents and chunk files per sub-stream}
 
\begin{center}
 
 
 
 \begin{tabular}{r*{4}{r}l}
 
 Sub-stream     &   documents    &    chunk files \\
 
\hline
 
 
@@ -170,10 +174,35 @@ We use TREC-KBA 2013 dataset \footnote{http://http://trec-kba.org/trec-kba-2013.
 
\subsection{KB entities}
 
 
 The KB entities consist of 20 Twitter entities and 121 Wikipedia entities, deliberately selected to be sparse. The entities comprise 71 people, 1 organization, and 24 facilities.  
 
 
\subsection{Relevance judgments}
 
 
TREC-KBA provided relevance judgments for training and testing. Relevance judgments are given to document-entity pairs. Documents with citation-worthy content for a given entity are annotated as \emph{vital}, while documents with tangentially relevant content, documents that lack freshness, or documents whose content can be useful for an initial KB dossier are annotated as \emph{relevant}. Documents with no relevant content are labeled \emph{neutral}, and spam is labeled \emph{garbage}. The inter-annotator agreement on vital was 70\% in 2012 and 76\% in 2013; the improvement is due to the more refined definition of vital and the distinction made between vital and relevant. 
 
 
\subsection{Breakdown of annotations by document source category}
 
 
%The results of the different entity profiles on the raw corpus are
 
%broken down by source categories and relevance rank% (vital, or
 
%relevant).  
 
 
In total, there are 24162 vital or relevant unique entity-document
pairs; 9521 of them are vital and 17424 are relevant. These documents
are categorized into 8 source categories: 0.98\% arxiv (a), 0.034\%
classified (c), 0.34\% forum (f), 5.65\% linking (l), 11.53\%
mainstream-news (m-n), 18.40\% news (n), 12.93\% social (s) and 50.2\%
weblog (w). We regrouped these source categories into three groups,
``news'', ``social'', and ``others'', for two reasons. 1) Some groups
are very similar to each other: mainstream-news and news exist
separately, in the first place, only because they were collected from
two different sources, by different groups and at different times; we
call them news from now on. The same holds for weblog and social,
which we call social from now on. 2) Some groups have so few
annotations that treating them independently makes little sense. The
majority of vital or relevant annotations are social (social and
weblog, 63.13\%), and news (mainstream-news and news) makes up 30\%;
together, news and social account for about 93\% of all
annotations. The remaining 7\% are grouped as others. 
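For concreteness, the regrouping can be sketched as follows (a Python sketch; the mapping and percentages are those given above):

```python
# Regrouping of the 8 TREC-KBA source categories into the three groups
# used in the analysis.
REGROUP = {
    "arxiv": "others", "classified": "others", "forum": "others", "linking": "others",
    "mainstream-news": "news", "news": "news",
    "social": "social", "weblog": "social",
}

# Percentage of vital/relevant annotations per source category (from the text).
share = {
    "arxiv": 0.98, "classified": 0.034, "forum": 0.34, "linking": 5.65,
    "mainstream-news": 11.53, "news": 18.40, "social": 12.93, "weblog": 50.2,
}

grouped = {}
for cat, pct in share.items():
    grouped[REGROUP[cat]] = grouped.get(REGROUP[cat], 0.0) + pct
# grouped: social ~63.1%, news ~29.9%, others ~7.0%
```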
 
 
 
 \section{Stream Filtering}
 
@@ -194,7 +223,7 @@ TREC-KBA provided relevance judgments for training and testing. Relevance judgme
 
 
The TREC filtering track and the filtering stage of an entity-centric stream filtering and ranking pipeline have different purposes. The TREC filtering track's goal is binary classification: for each incoming document, decide whether it is relevant for a given profile or not. In our case, documents carry graded relevance, and the goal of the filtering stage is to pass on as many potentially relevant documents as possible while admitting as few irrelevant documents as possible, so as not to obfuscate the later stages of the pipeline. Filtering as part of the pipeline requires a delicate balance between retrieving relevant and irrelevant documents. Because of this, filtering can only be studied by binding it to the later stages of the entity-centric pipeline, and this bond influences how we evaluate it. 
 
 
 
To achieve this, we use recall percentages in the filtering stage for the different choices of entity profiles. However, we use the overall performance to select the best entity profiles. To generate the overall pipeline performance we use the official TREC KBA evaluation metric and scripts \cite{frank2013stream} to report max-F, the maximum F-score obtained over all relevance cut-offs. 
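A minimal sketch of the max-F computation, assuming confidence-ranked system output (the scores below are hypothetical; the official TREC KBA scripts implement the actual metric):

```python
def max_f(scored, n_relevant):
    """Return the maximum F1 over all confidence cut-offs.

    scored: list of (confidence, is_relevant) pairs for returned documents.
    n_relevant: total number of relevant documents in the ground truth."""
    scored = sorted(scored, key=lambda x: -x[0])  # highest confidence first
    best, tp = 0.0, 0
    for i, (_, rel) in enumerate(scored, start=1):
        tp += rel                      # true positives above this cut-off
        precision = tp / i
        recall = tp / n_relevant
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```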
 
 
 
\section{Literature Review}
 
@@ -318,42 +347,36 @@ The upper part of Table \ref{tab:name} shows the recall performances on the clea
 
If we look at the recall performances on the raw corpus, filtering documents by canonical names achieves a recall of 59\%. Adding other name variants improves the recall to 79.8\%, an increase of 20.8\%; that is, 20.8\% of the documents mention the entities by names other than their canonical names. Canonical partial achieves a recall of 72\% and name-variant partial achieves 90.2\%, which means that 18.2\% of the documents mention the entities by partial names of non-canonical name variants. 
 
 
 
\subsection{Breakdown of results by document source category}
 
 
  
 
  
 
  \begin{table*}
 
\caption{Breakdown of  recall percentage increases by document categories }
 
\begin{center}\begin{tabular}{l*{9}{c}r}
 
 && \multicolumn{3}{ c| }{All entities}  & \multicolumn{3}{ c| }{Wikipedia} &\multicolumn{3}{ c| }{Twitter} \\ 
 
 & &others&news&social & others&news&social &  others&news&social \\
 
\hline
 
 
 
\multirow{4}{*}{Vital}	 &cano-part $-$ cano  	&8.2  &14.9    &12.3           &9.1  &18.6   &14.1             &0      &0       &0  \\
 
                         &all$-$ cano         	&12.6  &19.7    &12.3          &5.5  &15.8   &8.4             &73   &35.9    &38.3  \\
 
	                 &all-part $-$ cano-part&9.7    &18.7  &12.7       &0    &0.5  &5.1        &93.2 & 93 &64.4 \\
 
	                 &all-part $-$ all     	&5.4  &13.9     &12.7           &3.6  &3.3    &10.8              &20.3   &57.1    &26.1 \\
 
	                 \hline
 
	                 
 
\multirow{4}{*}{Relevant}  &cano-part $-$ cano  	&10.5  &15.1    &12.2          &11.1  &21.7   &14.1             &0   &0    &0  \\
 
                         &all $-$ cano         	&11.7  &36.6    &17.3          &9.2  &19.5   &9.9             &54.5   &76.3   &66  \\
 
	                 &all-part $-$ cano-part &4.2  &26.9   &15.8          &0.2    &0.7    &6.7           &72.2   &87.6 &75 \\
 
	                 &all-part $-$ all     	&3    &5.4     &10.7           &2.1  &2.9    &11              &18.2   &11.3    &9 \\
 
	                 
 
	                 \hline
 
\multirow{4}{*}{total} 	&cano-part $-$ cano   	&10.9   &15.5   &12.4         &11.9  &21.3   &14.4          &0     &0       &0\\
 
			&all $-$ cano         	&13.8   &30.6   &16.9         &9.1  &18.9   &10.2          &63.6  &61.8    &57.5 \\
 
                        &all-part $-$ cano-part	&7.2   &24.8   &15.9          &0.1    &0.7    &6.8           &82.2  &89.1    &71.3\\
 
                        &all-part $-$ all     	&4.3   &9.7    &11.4           &3.0  &3.1   &11.0          &18.9  &27.3    &13.8\\	                 
 
	                 
 
                                  	                 
 
	                
 
	                 
 
\hline
 
\end{tabular}
 
\end{center}
 
\label{tab:source-delta2}
 
\end{table*}
 
 
 
 
 \begin{table*}
 
@@ -387,10 +410,8 @@ If we look at the recall performances for the raw corpus,   filtering documents
 
\end{table*}
 
    
 
 
 
 
The breakdown of the raw corpus by document source category is presented in Table
 
\ref{tab:source-delta}.  
 
 
 
 
 
 
 
@@ -418,11 +439,25 @@ Overall, across all entities types and all entity profiles, others achieve highe
 
% This suggests that social documents are the hardest  to retrieve.  This  makes sense since social posts such as tweets and blogs are short and are more likely to point to other resources, or use short informal names.
 
 
 
 
 
 
We computed four percentage increases in recall (deltas) between the different entity profiles (Table \ref{tab:source-delta2}). The first delta is the recall difference between canonical partial and canonical; the second between name-variant and canonical; the third between name-variant partial and canonical partial; and the fourth between name-variant partial and name-variant. We believe these four deltas have a clear meaning. The delta between name-variant and canonical is the percentage of documents that the name variants retrieve but the canonical name does not. Similarly, the delta between name-variant partial and canonical partial is the percentage of document-entity pairs gained by the partial names of the name variants. 
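As an illustration, a minimal sketch computing these four deltas from the aggregate recall percentages reported for the raw corpus (59\%, 72\%, 79.8\%, and 90.2\%):

```python
# Aggregate recall percentages on the raw corpus, from the text.
recall = {"cano": 59.0, "cano_part": 72.0, "all": 79.8, "all_part": 90.2}

deltas = {
    # gain from using partials of the canonical name
    "cano_part - cano": recall["cano_part"] - recall["cano"],
    # gain from adding name variants to the canonical name
    "all - cano": recall["all"] - recall["cano"],
    # gain from partials of name variants over partials of the canonical name
    "all_part - cano_part": recall["all_part"] - recall["cano_part"],
    # gain from partials of name variants over the name variants themselves
    "all_part - all": recall["all_part"] - recall["all"],
}
```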
 
 
  
 
  \subsection{Entity Types: Wikipedia and Twitter}
 
Table \ref{tab:name} shows the difference between Wikipedia and Twitter entities. For Wikipedia entities, canonical achieves a recall of 70\% and canonical partial achieves 86.1\%, an increase in recall of 16.1\%. By contrast, the increase in recall of name-variant partial over name-variant is 8.3\%. 
 
@@ -436,12 +471,12 @@ In Twitter entities, however, it is different. Both canonical and their partials
 
Tables \ref{tab:name} and \ref{tab:source-delta} show that recall for Wikipedia entities is higher than for Twitter entities. Generally, at both the aggregate and document-category levels, we observe that recall increases as we move from canonical to canonical partial, to name-variant, and to name-variant partial. The only case where this does not hold is the transition from Wikipedia's canonical partial to name-variant. At the aggregate level (as can be inferred from Table \ref{tab:name}), the difference in performance between canonical and name-variant partial is 31.9\% on all entities, 20.7\% on Wikipedia entities, and 79.5\% on Twitter entities. This is a substantial performance difference. 
 
 
 
%% TODO: PERHAPS SUMMARY OF DISCUSSION HERE
 
 
 
 
 
\section{Impact on classification}
 
%   In the overall experimental setup, classification, ranking,  and evaluationn are kept constant. 
 
 
 Here, we present results showing how the choices of corpus, entity types, and entity profiles impact these later stages of the pipeline. In Tables \ref{tab:class-vital} and \ref{tab:class-vital-relevant}, we show the performances in max-F. 
 
\begin{table*}
 
\caption{Vital performance under different entity profiles (upper part from cleansed, lower part from raw)}
 
\begin{center}
 
@@ -451,24 +486,24 @@ The tables in \ref{tab:name} and \ref{tab:source-delta} show recall for Wikipedi
 
  &&cano&cano-part&all  &all-part \\
 
 
 
 
   all-entities &max-F& 0.241&0.261&0.259&0.265\\
 
%	      &SU&0.259  &0.258 &0.263 &0.262 \\	
 
   Wikipedia &max-F&0.252&0.274& 0.265&0.271\\
 
%	      &SU& 0.261& 0.259&  0.265&0.264 \\
 
   
 
   
 
 
   twitter &max-F&0.105&0.105&0.218&0.228\\
 
%     &SU &0.105&0.250& 0.254&0.253\\
 
  
 
 
 
\hline
 
\hline
 
 
  all-entities &max-F & 0.240 &0.272 &0.250&0.251\\
 
%	  &SU& 0.258   &0.151  &0.264  &0.258\\
 
   Wikipedia&max-F &0.257&0.257&0.257&0.255\\
 
%   &SU	     & 0.265&0.265 &0.266 & 0.259\\
 
   twitter&max-F &0.188&0.188&0.208&0.231\\
 
%	&SU&    0.269 &0.250 &0.250&0.253\\
 
\hline
 
 
\end{tabular}
 
@@ -486,23 +521,23 @@ The tables in \ref{tab:name} and \ref{tab:source-delta} show recall for Wikipedi
 
 
 &&cano&cano-part&all  &all-part \\
 
 
 
   all-entities &max-F& 0.497&0.560&0.579&0.607\\
 
%	      &SU&0.468  &0.484 &0.483 &0.492 \\	
 
   Wikipedia &max-F&0.546&0.618&0.599&0.617\\
 
%   &SU&0.494  &0.513 &0.498 &0.508 \\
 
   
 
 
   twitter &max-F&0.142&0.142& 0.458&0.542\\
 
%    &SU &0.317&0.328&0.392&0.392\\
 
  
 
 
 
\hline
 
\hline
 
 
  all-entities &max-F& 0.509 &0.594 &0.590&0.612\\
 
%    &SU       &0.459   &0.502  &0.478  &0.488\\
 
   Wikipedia &max-F&0.550&0.617&0.605&0.618\\
 
%   &SU	     & 0.483&0.498 &0.487 & 0.495\\
 
   twitter &max-F&0.210&0.210&0.499&0.580\\
 
%	&SU&    0.319  &0.317 &0.421&0.446\\
 
\hline
 
 
\end{tabular}
 
@@ -521,40 +556,6 @@ In vital-relevant category (Table \ref{tab:class-vital-relevant}), the performan
 
%The fact that canonical partial names achieve better results is interesting.  We know that partial names were used as a baseline in TREC KBA 2012, but no one of the KBA participants actually used partial names for filtering.
 
 
 
\subsection{Missing vital-relevant documents \label{miss}}
 
 
% 
 
 
 The use of name-variant partial for filtering is an aggressive attempt to retrieve as many relevant documents as possible, at the cost of retrieving irrelevant documents. However, we still miss about 2363 (10\%) of the vital-relevant documents. Why are these documents missed? If they are not mentioned by partial names of name variants, what are they mentioned by? Table \ref{tab:miss} shows the documents that we miss with respect to the cleansed and raw corpora. The upper part shows the number of documents missing from the cleansed and raw versions of the corpus; the lower part shows the intersections and exclusions in each corpus.  
 
 
\begin{table}
 
\caption{The number of documents missing  from raw and cleansed extractions. }
 
\begin{center}
 
\begin{tabular}{l@{\quad}llllll}
 
\hline
 
\multicolumn{1}{l}{\rule{0pt}{12pt}category}&\multicolumn{1}{l}{\rule{0pt}{12pt}Vital }&\multicolumn{1}{l}{\rule{0pt}{12pt}Relevant }&\multicolumn{1}{l}{\rule{0pt}{12pt}Total }\\[5pt]
 
\hline
 
 
Cleansed &1284 & 1079 & 2363 \\
 
Raw & 276 & 4951 & 5227 \\
 
\hline
 
 Missing only from cleansed &1065&2016&3081\\

  Missing only from raw  &57 &160 &217 \\

  Missing from both &219 &1927&2146\\
 
\hline
 
 
 
 
\end{tabular}
 
\end{center}
 
\label{tab:miss}
 
\end{table}
 
 
One would assume that the set of document-entity pairs extracted from the cleansed corpus is a subset of those extracted from the raw corpus. We find that this is not the case. There are 217 unique entity-document pairs that are retrieved from the cleansed corpus but not from the raw one; 57 of them are vital. Similarly, there are 3081 document-entity pairs that are missing from the cleansed corpus but present in the raw one; 1065 of them are vital. Examining the content of the documents reveals that part of the text is missing from the corresponding document. All the documents that we miss from the raw corpus are social: tweets, blog posts, and posts from other social media. To meet the format of the raw data (a binary byte array), some of them must have been converted after collection, and on the way lost part or all of their content. It is similar for the documents that we miss from the cleansed corpus: part or all of the content is lost during the cleansing process (the removal of HTML tags and non-English documents). In both cases the mention of the entity happened to be in the part of the text that was cut out during transformation. 
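The set relations behind Table \ref{tab:miss} can be sketched with a toy example (the pair identifiers below are hypothetical; the actual sets contain document-entity pairs):

```python
# Annotated pairs, and the pairs actually retrieved from each corpus version.
annotated = {1, 2, 3, 4, 5, 6}
retrieved_raw = {1, 2, 3}
retrieved_cleansed = {1, 2, 4}

missing_cleansed = annotated - retrieved_cleansed   # pairs missed in cleansed
missing_raw = annotated - retrieved_raw             # pairs missed in raw

# Lost only during cleansing (still present in raw):
only_missing_from_cleansed = missing_cleansed - missing_raw
# Lost only in the raw byte-array conversion (still present in cleansed):
only_missing_from_raw = missing_raw - missing_cleansed
# Missed in both versions -- the truly unfilterable pairs:
missing_from_both = missing_cleansed & missing_raw
```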
 
 
 
 
 The most interesting set of relevance judgments is the one we miss from both the raw and cleansed extractions: 2146 unique document-entity pairs, 219 of which have vital relevance judgments. The missed vital annotations involve 28 Wikipedia and 7 Twitter entities, 35 in total. The great majority (86.7\%) of these documents are social. This suggests that social documents (tweets and blogs) talk about entities without mentioning them by name more often than news and other documents do. This is, of course, in line with intuition. 
 
   
 
   
 
%    
 
   
 
@@ -647,10 +648,47 @@ Experimental results show that cleansing can remove entire or parts of the conte
 
 
The use of different profiles also shows a big difference in recall. Except for Wikipedia entities, where canonical partial performs better than name-variant, there is a steady increase in recall from canonical to canonical partial, to name-variant, and to name-variant partial. This pattern is also observed across the document categories. However, here too, the relationship between the gain in recall as we move from a less rich profile to a richer one and the overall performance as measured by F-score is not linear. 
 
 
 
 
 
 
In vital ranking, across all entity profiles and corpus versions, Wikipedia's canonical partial achieves better performance than any other Wikipedia entity profile. For vital-relevant documents too, Wikipedia's canonical partial achieves the best result; on the raw corpus, it achieves slightly less than name-variant partial. For Twitter entities, the name-variant partial profile achieves the highest F-score across all entity profiles and corpus versions.  
 
 
 
 
There are three interesting observations: 

1) Cleansing impacts Twitter entities and relevant documents. This is
validated by the observation that the recall gains on the raw corpus
for Twitter entities and the relevant category also translate into
overall performance gains. It implies that cleansing removes relevant
and social documents more than it does vital and news documents. That
it removes relevant documents more than vital can be explained by
cleansing removing related links and adverts, which may contain a
mention of the entities; one example we saw was an image, accompanied
by text containing an entity name, that was actually relevant but was
removed by cleansing. That it removes social documents is consistent
with the fact that most of the documents missing from the cleansed
corpus are social, and all the documents missing from the raw corpus
are social. In both cases, social documents suffer most from the text
transformation and cleansing processes. 
 
 
%%%% NEEDS WORK:
 
 
2) Taking both performances (recall at filtering and overall F-score
during evaluation) into account, there is a clear trade-off between using a richer entity profile and retrieving irrelevant documents. The richer the profile, the more relevant documents it retrieves, but also the more irrelevant documents. To put this into perspective, let us compare the number of documents retrieved with canonical partial and with name-variant partial. Using the raw corpus, the former retrieves a total of 2547487 documents and achieves a recall of 72.2\%; the latter retrieves a total of 4735318 documents and achieves a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18\%; the difference, 67.9\%, consists of newly introduced irrelevant documents. 
 
 
3) Wikipedia's canonical partial is the best entity profile for Wikipedia entities. It is interesting that the retrieval of thousands of additional vital-relevant document-entity pairs by name-variant partial does not translate into an increase in overall performance. It is even more interesting since, to the best of our knowledge, canonical partial was not considered a contending profile for stream filtering by any participant. With this understanding, there is actually no need to fetch the different name variants from DBpedia, saving time and computational resources.
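The trade-off arithmetic in observation 2 can be checked directly (a sketch using the figures quoted above):

```python
# Raw-corpus figures for canonical-partial vs. name-variant-partial, from the text.
docs_cano_part, recall_cano_part = 2_547_487, 72.2
docs_all_part, recall_all_part = 4_735_318, 90.2

# Relative growth in the number of retrieved documents (~85.9%).
extra_docs_pct = 100 * (docs_all_part - docs_cano_part) / docs_cano_part
# Absolute recall gain in percentage points (18.0).
recall_gain = recall_all_part - recall_cano_part
```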
 
 
 
%%%%%%%%%%%%
 
 
 
 
 
The deltas between entity profiles, relevance ratings, and document categories reveal four differences between Wikipedia and Twitter entities. 1) For Wikipedia entities, the difference between canonical partial and canonical (16.1\%) is higher than between name-variant partial and name-variant (8.3\%). This can be explained by saturation: documents have already been extracted by the name-variants, so using their partials does not bring in many new relevant documents. 2) Twitter entities are mentioned by name-variant or name-variant partial, as seen in the high recall these profiles achieve compared to the low recall of canonical (or its partial). This indicates that documents (especially news and others) almost never use user names to refer to Twitter entities; name-variant partials are the best entity profiles for Twitter entities. 3) Comparatively speaking, however, social documents refer to Twitter entities by their user names more than news and others do, suggesting a difference in adherence to naming standards. 4) Wikipedia entities achieve higher recall and higher overall performance. 
 
 
 
 
 
\section{Unfilterable documents}
 
As shown above, there is a trade-off between using a richer entity profile and retrieving irrelevant documents: richer profiles retrieve more relevant documents, but also substantially more irrelevant ones. Here we examine the vital-relevant documents that even the richest profile fails to retrieve.
 
 
\subsection{Missing vital-relevant documents \label{miss}}
 
 
% 
 
 
 We observed that there are vital-relevant documents that we miss from the raw corpus only, and similarly from the cleansed corpus only; the reason for this is the transformation from one format to another. The most interesting documents are those that we miss from both the raw and cleansed corpus. We first identified the KB entities that have a vital relevance judgment and whose documents cannot be retrieved (35 in total) and manually examined the documents' content to find out why they are missing. 
 
 The use of name-variant partial for filtering is an aggressive attempt to retrieve as many relevant documents as possible, at the cost of retrieving irrelevant ones. Even so, we still miss about 2363 (10\%) of the vital-relevant documents. Why are these documents missed? If they are not mentioned by name variants or their partial names, how are they mentioned? Table \ref{tab:miss} shows the documents that we miss with respect to the cleansed and raw corpora. The upper part shows the number of documents missing from the cleansed and raw versions of the corpus; the lower part shows the intersections and exclusions in each corpus. 
 
 
\begin{table}
 
\caption{The number of documents missing  from raw and cleansed extractions. }
 
\begin{center}
 
\begin{tabular}{l@{\quad}lll}
 
\hline
 
\multicolumn{1}{l}{\rule{0pt}{12pt}Category}&\multicolumn{1}{l}{\rule{0pt}{12pt}Vital }&\multicolumn{1}{l}{\rule{0pt}{12pt}Relevant }&\multicolumn{1}{l}{\rule{0pt}{12pt}Total }\\[5pt]
 
\hline
 
 
Cleansed &1284 & 1079 & 2363 \\
 
Raw & 276 & 4951 & 5227 \\
 
\hline
 
 Missing only from cleansed &1065&2016&3081\\
 
 Missing only from raw  &57 &160 &217 \\
 
 Missing from both &219 &1927&2146\\
 
\hline
 
 
 
 
\end{tabular}
 
\end{center}
 
\label{tab:miss}
 
\end{table}
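The lower part of Table \ref{tab:miss} follows from plain set operations on the two sets of missed document-entity pairs; a minimal sketch, using hypothetical pair identifiers rather than real corpus ids:

```python
# hypothetical document-entity pair identifiers (not real corpus ids)
missed_from_cleansed = {"d1-e1", "d2-e1", "d3-e2"}
missed_from_raw = {"d3-e2", "d4-e3"}

only_cleansed = missed_from_cleansed - missed_from_raw  # missing only from cleansed
only_raw = missed_from_raw - missed_from_cleansed       # missing only from raw
both = missed_from_cleansed & missed_from_raw           # missing from both

print(len(only_cleansed), len(only_raw), len(both))  # -> 2 1 1
```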
 
 
One would assume that the document-entity pairs extracted from the cleansed corpus are a subset of those extracted from the raw corpus. We find that this is not the case: there are 217 unique document-entity pairs that are retrieved from the cleansed corpus but not from the raw one, 57 of them vital. Similarly, there are 3081 document-entity pairs that are missing from the cleansed corpus but present in the raw one, 1065 of them vital. Examining the content of the documents reveals that, in each case, a part of the text is missing from the corresponding document. All the documents that we miss from the raw corpus are social: tweets, blog articles, and posts from other social media. To meet the format of the raw data (a binary byte array), some of them must have been converted after collection and lost part or all of their content on the way. The documents that we miss from the cleansed corpus are similar: part or all of the content is lost during the cleansing process (the removal of HTML tags and non-English documents). In both cases, the mention of the entity happened to be in the part of the text that was cut out during transformation. 
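Cleansing can drop an entity mention even when the document itself survives, for instance when the mention occurs only inside markup. A minimal sketch with a naive tag stripper standing in for the corpus cleansing step, and a hypothetical entity name:

```python
import re

# hypothetical raw document: the entity name appears only in tag attributes
raw = ('<p>Interview with <a href="/wiki/Jane_Doe" title="Jane Doe">'
       'the pianist</a> today.</p>')

# naive stand-in for the cleansing step (removal of HTML tags)
cleansed = re.sub(r'<[^>]+>', ' ', raw)

print('Jane Doe' in raw)       # -> True
print('Jane Doe' in cleansed)  # -> False
```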
 
 
 
 
 The most interesting set of relevance judgments are those that we miss from both the raw and cleansed extractions. These are 2146 unique document-entity pairs, 219 of which have vital relevance judgments. The missed vital annotations involve 28 Wikipedia and 7 Twitter entities, 35 in total. The great majority (86.7\%) of the documents are social. This suggests that social documents (tweets and blogs) can talk about entities without mentioning them by name more often than news and others do. This is, of course, in line with intuition. 
 
   
 
 
 
%%%%%%%%%%%%%%%%%%%%%%
 
 
 
 
 
 
 
 We also observed that, among the missing documents, different document ids can have the same content and be judged multiple times for a given entity.  %In the vital annotations, there are 88 news and 409 weblog documents. 
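Such content-level duplicates can be detected by hashing the document bodies, independently of their ids; a minimal sketch with hypothetical documents:

```python
import hashlib
from collections import defaultdict

# hypothetical stream documents: two ids share the same body
docs = {
    "doc-001": "Same article body.",
    "doc-002": "Same article body.",
    "doc-003": "A different post.",
}

# group document ids by a hash of their content
by_hash = defaultdict(list)
for doc_id, text in docs.items():
    by_hash[hashlib.sha1(text.encode("utf-8")).hexdigest()].append(doc_id)

duplicate_groups = [ids for ids in by_hash.values() if len(ids) > 1]
print(duplicate_groups)  # -> [['doc-001', 'doc-002']]
```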