Changeset - e9656e726fbf
[Not reviewed]
0 1 0
Gebrekirstos Gebremeskel - 11 years ago 2014-06-12 04:24:01
destinycome@gmail.com
updated
1 file changed with 137 insertions and 135 deletions:
0 comments (0 inline, 0 general)
mypaper-final.tex
 
@@ -21,25 +21,25 @@
 
% Questions regarding SIGS should be sent to
 
% Adrienne Griscti ---> griscti@acm.org
 
%
 
% Questions/suggestions regarding the guidelines, .tex and .cls files, etc. to
 
% Gerald Murray ---> murray@hq.acm.org
 
%
 
% For tracking purposes - this is V3.1SP - APRIL 2009
 
 
\documentclass{acm_proc_article-sp}
 
\usepackage{booktabs}
 
\usepackage{multirow}
 
\usepackage{todonotes}
 
\usepackage{url}
 
 
 
\begin{document}
 
 
\title{Entity-Centric Stream Filtering and Ranking: Filtering and Unfilterable Documents}
 
%SUGGESTION:
 
%\title{The Impact of Entity-Centric Stream Filtering on Recall and
 
%  Missed Documents}
 
 
%
 
% You need the command \numberofauthors to handle the 'placement
 
% and alignment' of the authors beneath the title.
 
@@ -56,25 +56,25 @@
 
% (two rows with three columns) beneath the article title.
 
% More than six makes the first-page appear very cluttered indeed.
 
%
 
% Use the \alignauthor commands to handle the names
 
% and affiliations for an 'aesthetic maximum' of six authors.
 
% Add names, affiliations, addresses for
 
% the seventh etc. author(s) as the argument for the
 
% \additionalauthors command.
 
% These 'additional authors' will be output/set for you
 
% without further effort on your part as the last section in
 
% the body of your article BEFORE References or any Appendices.
 
 
\numberofauthors{2} %  in this sample file, there are a *total*
 
 
% of EIGHT authors. SIX appear on the 'first-page' (for formatting
 
% reasons) and the remaining two appear in the \additionalauthors section.
 
%
 
% \author{
 
% % You can go ahead and credit any number of authors here,
 
% % e.g. one 'row of three' or two rows (consisting of one row of three
 
% % and a second row of one, two or three).
 
% %
 
% % The command \alignauthor (no curly braces needed) should
 
% % precede each author name, affiliation/snail-mail address and
 
% % e-mail address. Additionally, tag each line of
 
% % affiliation/address with \affaddr, and tag the
 
@@ -122,25 +122,25 @@ be more manageable for the subsequent stages.
 
Nevertheless, this step has a large impact on the recall that can be
 
maximally attained. Therefore, in this study, we focus on just

this filtering stage and conduct an in-depth analysis of the main design

decisions here: how to cleanse the noisy text obtained online,
 
the methods to create entity profiles, the
 
types of entities of interest, document type, and the grade of
 
relevance of the document-entity pair under consideration.
 
We analyze how these factors (and the design choices made in their
 
corresponding system components) affect filtering performance.
 
We identify and characterize the relevant documents that do not pass
 
the filtering stage by examining their contents. This way, we

estimate a practical upper bound of recall for entity-centric stream

filtering.
 
 
\end{abstract}
 
% A category with the (minimum) three required fields
 
\category{H.4}{Information Filtering}{Miscellaneous}
 
 
%A category including the fourth, optional field follows...
 
%\category{D.2.8}{Software Engineering}{Metrics}[complexity measures, performance measures]
 
 
\terms{Theory}
 
 
\keywords{Information Filtering; Cumulative Citation Recommendation; Knowledge Maintenance; Stream Filtering; Emerging Entities} % NOT required for Proceedings
 
 
@@ -206,49 +206,49 @@ document category (social, news, etc) on the filtering components'
 
performance. The main contributions of the

paper are an in-depth analysis of the factors that affect entity-based

stream filtering, identifying optimal entity profiles without

compromising precision, describing and classifying relevant documents

that are not amenable to filtering, and estimating the upper bound

of recall for entity-based filtering.
 
 
The rest of the paper is organized as follows:
 
 
\textbf{TODO!!}
 
 
 \section{Data Description}
 
We base this analysis on the TREC-KBA 2013 dataset%
 
\footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}}
 
that consists of three main parts: a time-stamped stream corpus, a set of
 
KB entities to be curated, and a set of relevance judgments. A CCR
 
system now has to identify for each KB entity which documents in the
 
stream corpus are to be considered by the human curator.
 

	
 
\subsection{Stream corpus} The stream corpus comes in two versions:

raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB

respectively, after xz-compression and GPG encryption. The raw data

is a dump of raw HTML pages. The cleansed version is the raw data

after its HTML tags are stripped off and only English documents

identified with the Chromium Compact Language Detector

\footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}

are included.  The stream corpus is organized in hourly folders, each

of which contains many chunk files. Each chunk file contains from

hundreds to hundreds of thousands of serialized thrift objects. One

thrift object is one document. A document can be a blog article, a

news article, or a social media post (including tweets).  The stream
 
corpus comes from three sources: TREC KBA 2012 (social, news and
 
linking) \footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}},
 
arxiv\footnote{\url{http://arxiv.org/}}, and
 
spinn3r\footnote{\url{http://spinn3r.com/}}.
 
Table \ref{tab:streams} shows the sources, the number of hourly
 
directories, and the number of chunk files.
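As a rough illustration of how this layout can be traversed, the following is our own sketch (not the official TREC-KBA tooling); it assumes the hourly directories sit under a single root directory and that each chunk file name starts with its sub-stream name, both of which are assumptions on our part.

\begin{verbatim}
# Minimal sketch (ours): walking the hourly-folder / chunk-file layout.
# Assumes chunk file names begin with the sub-stream, e.g. "news-....sc.xz.gpg".
from pathlib import Path
from collections import Counter

def count_chunks(corpus_root):
    hour_dirs = [d for d in Path(corpus_root).iterdir() if d.is_dir()]
    chunks_per_source = Counter()
    for hour_dir in hour_dirs:
        for chunk in hour_dir.iterdir():
            source = chunk.name.split("-", 1)[0]  # assumed naming convention
            chunks_per_source[source] += 1
    return len(hour_dirs), chunks_per_source

# Example: n_hours, per_source = count_chunks("kba-streamcorpus-2013/")
\end{verbatim}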
 
 
\begin{table}
 
\caption{Documents and chunk files retrieved from the different sources}
 
\begin{center}
 
 
 \begin{tabular}{l*{4}{l}l}
 
 Documents     &   Chunk files    &    Sub-stream \\
 
\hline
 
 
126,952         &11,851         &arxiv \\
 
394,381,405      &   688,974        & social \\
 
134,933,117       &  280,658       &  news \\
 
5,448,875         &12,946         &linking \\
 
@@ -379,79 +379,79 @@ All of the studies used filtering as their first step to generate a smaller set
 
 
Moreover, there has been no study at this scale, nor into what types of documents defy filtering and why. In this paper, we conduct a manual examination of the documents that are missed and classify them into different categories. We also estimate the general upper bound of recall using the different entity profiles, and choose the best profile, that is, the one that results in increased overall performance as measured by F-measure. 
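For reference, the measures referred to throughout this analysis follow their standard definitions (the notation below is ours): with $R$ the set of judged vital or relevant document-entity pairs and $S$ the set of pairs that pass the filter,
\[
  \mathit{recall} = \frac{|S \cap R|}{|R|}, \qquad
  \mathit{precision} = \frac{|S \cap R|}{|S|}, \qquad
  F = \frac{2 \cdot \mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}}.
\]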
 
 
\section{Method}
 
All analyses in this paper are carried out on the documents that have
 
relevance assessments associated with them. For this purpose, we

extracted those documents from the full corpus. We experiment with all
 
KB entities. For each KB entity, we extract different name variants
 
from DBpedia and Twitter.
 
 
 
\subsection{Entity Profiling}
 
We build entity profiles for the KB entities of interest. We have two
 
types: Twitter and Wikipedia. Both types of entities have been selected on

purpose by the track organisers to occur only sparsely and be less-documented.
 
For the Wikipedia entities, we fetch different name variants
 
from DBpedia: name, label, birth name, alternative names,
 
redirects, nickname, or alias. 
 
These extraction results are summarized in Table
 
\ref{tab:sources}.
 
For the Twitter entities, we visit
 
their respective Twitter pages and fetch their display names. 
 
 
\begin{table}
 
\caption{Number of different DBpedia name variants}
 
\begin{center}
 
 
 \begin{tabular}{lr}

 Name variant & No. of strings  \\
 
\hline
 
 Name  &82\\
 
 Label   &121\\
 
Redirect  &49 \\
 
 Birth Name &6\\
 
 Nickname & 1\\
 
 Alias &1 \\
 
 Alternative Names &4\\
 
 
\hline
 
\end{tabular}
 
\end{center}
 
\label{tab:sources}
 
\end{table}
 
 
 
The collection contains a total of 121 Wikipedia entities.
 
Every entity has a corresponding DBpedia label.  Only 82 entities have
 
a name string and only 49 entities have redirect strings. (Most of the
 
entities have only one string, except for a few cases with multiple
 
redirect strings; Buddy\_MacKay has the highest number (12) of
 
redirect strings.) 
 

	
 
We combine the different name variants we extracted to form a set of
 
strings for each KB entity. For Twitter entities, we use the display

names that we collected. We consider the name of the entity that

is part of the URL as canonical. For example, for the entity\\

\url{http://en.wikipedia.org/wiki/Benjamin_Bronfman}\\

Benjamin Bronfman is the canonical name of the entity. 
 
An example is given in Table \ref{tab:profile}.
 

	
 
From the combined name variants and
 
the canonical names, we create four profiles for each

entity: canonical (cano), canonical partial (cano-part), all name

variants combined (all), and partial names of all name

variants (all-part). We refer to the last two profiles as name-variant
 
and name-variant partial. The names in parentheses are used in table
 
captions.
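To make the construction concrete, the following is a minimal sketch of how the four profiles could be derived (our illustration, not the code used for the experiments; whether the canonical string is folded into the ``all'' profile, and the exact matching used during filtering, are assumptions on our part).

\begin{verbatim}
# Sketch (ours): building the four entity profiles and a naive filter check.
def partials(names):
    """Individual tokens of every name (the '-part' profiles)."""
    return sorted({token for name in names for token in name.split()})

def build_profiles(entity_url, name_variants):
    # Canonical name: last URL component, underscores read as spaces.
    canonical = entity_url.rstrip("/").rsplit("/", 1)[-1].replace("_", " ")
    profiles = {
        "cano": [canonical],
        "cano-part": partials([canonical]),
        "all": sorted(set(name_variants)),  # DBpedia variants / display names
    }
    profiles["all-part"] = partials(profiles["all"])
    return profiles

def passes_filter(document_text, profile):
    """Assumed matching: case-insensitive substring match on any profile string."""
    text = document_text.lower()
    return any(name.lower() in text for name in profile)

profiles = build_profiles(
    "http://en.wikipedia.org/wiki/Benjamin_Bronfman",
    ["Ben Brewer", "Benjamin Zachary Bronfman"],
)
# profiles["cano"]      -> ['Benjamin Bronfman']
# profiles["cano-part"] -> ['Benjamin', 'Bronfman']
\end{verbatim}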
 

	
 
 
 
 
\begin{table*}
 
\caption{Example entity profiles (upper part Wikipedia, lower part Twitter)}
 
\begin{center}
 
\begin{tabular}{l*{3}{c}}
 
 &Wikipedia&Twitter \\
 
\hline
 
 
 &Benjamin\_Bronfman& roryscovel\\
 
  cano&[Benjamin Bronfman] &[roryscovel]\\
 
  cano-part &[Benjamin, Bronfman]&[roryscovel]\\
 
  all&[Ben Brewer, Benjamin Zachary Bronfman] &[Rory Scovel] \\
 
@@ -491,29 +491,29 @@ The annotation set is a combination of the annotations from before the Training
 
	                 
 
\hline
 
\end{tabular}
 
\end{center}
 
\label{tab:breakdown}
 
\end{table}
 
 
 
 
 
 
 
%Most (more than 80\%) of the annotation documents are in the test set.
 
The 2013 training and test data contain 68,405

annotations, of which 50,688 are unique document-entity pairs. Out of

these, 24,162 unique document-entity pairs are vital (9,521) or relevant

(17,424).
 
 
 
 
 
 
\section{Experiments and Results}
 
 We conducted experiments to study the effect of cleansing, different entity profiles, types of entities, categories of documents, and relevance ratings (vital or relevant), as well as the impact on classification. In the following subsections, we present and discuss the results for each of these factors.
 
 
 
 \subsection{Cleansing: raw or cleansed}
 
\begin{table}
 
\caption{Percentage of vital or relevant documents retrieved under different name variants (upper part from cleansed, lower part from raw)}
 
\begin{center}
 
\begin{tabular}{l@{\quad}rrrrrrr}
 
\hline
 
@@ -608,53 +608,53 @@ If we look at the recall performances for the raw corpus,   filtering documents
 
\label{tab:source-delta}
 
\end{table*}
 
    
 
 
%The break down of the raw corpus by document source category is presented in Table
 
%\ref{tab:source-delta}.  
 
 
 
 
 
 
 
 
 
 \subsection{Relevance Rating: vital and relevant}
 
 
 
When comparing recall for vital and relevant, we observe that
 
canonical names are more effective for vital than for relevant

documents, in particular for the Wikipedia entities. 
 
%For example, the recall for news is 80.1 and for social is 76, while the corresponding recall in relevant is 75.6 and 63.2 respectively.
 
We conclude that the most relevant documents mention the
 
entities by their common name variants.
 
 
%  \subsection{Difference by document categories}
 
%  
 
 
 
%  Generally, there is greater variation in relevant rank than in vital. This is specially true in most of the Delta's for Wikipedia. This  maybe be explained by news items referring to  vital documents by a some standard name than documents that are relevant. Twitter entities show greater deltas than Wikipedia entities in both vital and relevant. The greater variation can be explained by the fact that the canonical name of Twitter entities retrieves very few documents. The deltas that involve canonical names of Twitter entities, thus, show greater deltas.  
 
%  
 
 
% If we look in recall performances, In Wikipedia entities, the order seems to be others, news and social. This means that others achieve a higher recall than news than social.  However, in Twitter entities, it does not show such a strict pattern. In all, entities also, we also see almost the same pattern of other, news and social. 
 
 
 
 
  
 
\subsection{Recall across document categories: others, news and social}
 
The recall for Wikipedia entities in Table \ref{tab:name} ranged from
 
61.8\% (canonicals) to 77.9\% (name-variants).  Table
 
\ref{tab:source-delta} shows how recall is distributed across document
 
categories. For Wikipedia entities, across all entity profiles, the others

category has the highest recall, followed by news, and then by social.  While the

recall for news ranges from 76.4\% to 98.4\%, the recall for social

documents ranges from 65.7\% to 86.8\%. For Twitter entities, however,

the pattern is different. For canonicals (and their partials), social
 
documents achieve higher recall than news.
 
 
%This indicates that social documents refer to Twitter entities by their canonical names (user names) more than news do. In name- variant partial, news achieve better results than social. The difference in recall between canonicals and name-variants show that news do not refer to Twitter entities by their user names, they refer to them by their display names.
 
Overall, across all entity types and all entity profiles, documents
 
in the others category achieve a higher recall than news, and news documents, in turn, achieve higher recall than social documents. 
 
 
% This suggests that social documents are the hardest  to retrieve.  This  makes sense since social posts such as tweets and blogs are short and are more likely to point to other resources, or use short informal names.
 
 
 
%%NOTE TABLE REMOVED:\\\\
 
%
 
%We computed four percentage increases in recall (deltas)  between the
 
%different entity profiles (Table \ref{tab:source-delta2}). The first
 
%delta is the recall percentage between canonical partial  and
 
%canonical. The second  is  between name= variant and canonical. The
 
%third is the difference between name-variant partial  and canonical
 
@@ -663,77 +663,77 @@ in the others category achieve a higher recall than news, and news documents, in
 
%delta between name-variant and canonical means the percentage of
 
%documents that the new name variants retrieve, but the canonical name
 
%does not. Similarly, the delta between  name-variant partial and
 
%partial canonical-partial means the percentage of document-entity
 
%pairs that can be gained by the partial names of the name variants. 
 
% The  biggest delta  observed is in Twitter entities between partials
 
% of all name variants and partials of canonicals (93\%). delta. Both
 
% of them are for news category.  For Wikipedia entities, the highest
 
% delta observed is 19.5\% in cano\_part - cano followed by 17.5\% in
 
% all\_part in relevant. 
 
  
 
  \subsection{Entity Types: Wikipedia and Twitter}
 
Table \ref{tab:name} summarizes the differences between Wikipedia and
 
Twitter entities.  Wikipedia entities' canonical representation
 
achieves a recall of 70\%, while canonical partial achieves a recall of 86.1\%. This is an
 
increase in recall of 16.1\%. By contrast, the increase in recall of
 
name-variant partial over name-variant is 8.3\%.
 
%This high increase in recall when moving from canonical names to their
 
%partial names, in comparison to the lower increase when moving from
 
%all name variants to their partial names can be explained by
 
%saturation: documents have already been extracted by the different
 
%name variants and thus using their partial names do not bring in many
 
%new relevant documents.
 
For Wikipedia entities, canonical
 
partial achieves better recall than name-variant in both the cleansed and
 
the raw corpus.  %In the raw extraction, the difference is about 3.7.
 
For Twitter entities, recall of canonical matching is very low.%

\footnote{Canonical

and canonical partial are the same for Twitter entities because they

are one-word strings. For example, in \url{https://twitter.com/roryscovel},

``roryscovel'' is the canonical name and its partial is identical.}
 
%The low recall is because the canonical names of Twitter entities are
 
%not really names; they are usually arbitrarily created user names. It
 
%shows that  documents  refer to them by their display names, rarely
 
%by their user name, which is reflected in the name-variant recall
 
%(67.9\%). The use of name-variant partial increases the recall to
 
%88.2\%.
 
 
 
 
Tables \ref{tab:name} and \ref{tab:source-delta} show a higher recall
 
for Wikipedia than for Twitter entities. Generally, at both
 
aggregate and document category levels, we observe that recall
 
increases as we move from canonicals to canonical partial, to
 
name-variant, and to name-variant partial. The only case where this
 
does not hold is in the transition from Wikipedia's canonical partial
 
to name-variant. At the aggregate level (as can be inferred from Table
 
\ref{tab:name}), the difference in performance between  canonical  and
 
name-variant partial is 31.9\% on all entities, 20.7\% on Wikipedia
 
entities, and 79.5\% on Twitter entities. 
 

	
 
Section \ref{sec:analysis} discusses the most plausible explanations for these findings.
 
 
%% TODO: PERHAPS SUMMARY OF DISCUSSION HERE
 

	
 
 
\section{Impact on classification}
 
In the overall experimental setup, classification, ranking, and
 
evaluation are kept constant. Following the settings of \cite{balog2013multi},

we use

WEKA's\footnote{\url{http://www.cs.waikato.ac.nz/~ml/weka/}} Random Forest

classifier. However, we use a smaller number of features, which we

found to be more effective. We determined the effectiveness of the

features by running the classification algorithm with our smaller

feature set and with their original features; our feature

implementations achieved better results. In total we use 13

features, which are listed below.
 
 
  
 
\paragraph*{Google's Cross Lingual Dictionary (GCLD)}
 
 
This is a mapping of strings to Wikipedia concepts and vice versa
 
\cite{spitkovsky2012cross}. 
 
(1) the probability with which a string is used as anchor text for

a Wikipedia entity
 
 
\paragraph*{jac} 
 
  Jaccard similarity between the document and the entity's Wikipedia page
 
\paragraph*{cos} 
 
  Cosine similarity between the document and the entity's Wikipedia page
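For reference, these two similarities follow their standard definitions (the notation is ours: $T_d$ and $T_e$ are the term sets of the document and of the entity's Wikipedia page, and $\vec{d}$, $\vec{e}$ their term vectors):
\[
  \mathit{jac}(d,e) = \frac{|T_d \cap T_e|}{|T_d \cup T_e|}, \qquad
  \mathit{cos}(d,e) = \frac{\vec{d} \cdot \vec{e}}{\|\vec{d}\|\,\|\vec{e}\|}.
\]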
 
@@ -952,43 +952,45 @@ The use of different profiles also shows a big difference in recall. Except in t
 
 
 
%%%%% MOVED FROM LATER ON - CHECK FLOW
 
 
There is a trade-off between using a richer entity profile and retrieving irrelevant documents. The richer the profile, the more relevant documents it retrieves, but also the more irrelevant documents. To put this into perspective, let us compare the number of documents that are retrieved with canonical partial and with name-variant partial. Using the raw corpus, the former retrieves a total of 2,547,487 documents and achieves a recall of 72.2\%. By contrast, the latter retrieves a total of 4,735,318 documents and achieves a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18\%. The rest of the documents, that is 67.9\%, are newly introduced irrelevant documents. 
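The 85.9\% and 18\% figures follow directly from these counts:
\[
  \frac{4{,}735{,}318 - 2{,}547{,}487}{2{,}547{,}487} \approx 0.859, \qquad
  90.2\% - 72.2\% = 18\%.
\]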
 
 
%%%%%%%%%%%%
 
 
 
In vital ranking, across all entity profiles and both versions of the corpus, Wikipedia's canonical partial achieves better performance than any other Wikipedia entity profile. For vital-relevant documents too, Wikipedia's canonical partial achieves the best result; in the raw corpus, it achieves a little less than name-variant partial. For Twitter entities, the name-variant partial profile achieves the highest F-score across all entity profiles and both versions of the corpus.  
 
 
 
 
There are 3 interesting observations: 
 
 
1) cleansing impacts Twitter
 
entities and relevant documents. This is validated by the

observation that recall gains in Twitter entities and the relevant

categories in the raw corpus also translate into overall performance

gains. This observation implies that cleansing removes more relevant and

social documents than it does vital and news documents. That it removes relevant

documents more than vital ones can be explained by the fact that cleansing

removes the related links and adverts, which may contain a mention of

the entities. One example we saw was that cleansing removed an

image whose text contained an entity name, and the document was actually relevant. And

that it removes social documents can be explained by the fact that

most of the documents missing from the cleansed corpus are

social, and all of the documents missing from the raw corpus are

social. So in both cases, social documents seem to suffer from the text

transformation and cleansing processes. 
 
 
%%%% NEEDS WORK:
 
 
2) Taking both performance (recall at filtering and overall F-score

during evaluation) into account, there is a clear trade-off between using a richer entity profile and retrieving irrelevant documents: as noted above, moving from canonical partial to name-variant partial on the raw corpus increases the number of retrieved documents by 85.9\% for a recall gain of only 18\%, the remainder being newly introduced irrelevant documents. 
 
 
3) Wikipedia's canonical partial is the best entity profile for Wikipedia entities. It is interesting to see that the retrieval of thousands of vital-relevant document-entity pairs by name-variant partial does not translate into an increase in overall performance. It is even more interesting since, to the best of our knowledge, canonical partial was not considered as a contending profile for stream filtering by any participant. With this understanding, there is actually no need to go and fetch the different name variants from DBpedia, saving time and computational resources.
 
 
 
%%%%%%%%%%%%
 
 
 
 
 
The deltas between entity profiles, relevance ratings, and document categories reveal four differences between Wikipedia and Twitter entities. 1) For Wikipedia entities, the difference between canonical partial and canonical is higher (16.1\%) than between name-variant partial and name-variant (8.3\%). This can be explained by saturation: documents have already been extracted by the name-variants, and thus using their partials does not bring in many new relevant documents. 2) Twitter entities are mentioned by name-variant or name-variant partial, as seen in the high recall these profiles achieve compared to the low recall achieved by canonicals (or their partials). This indicates that documents (especially news and others) almost never use user names to refer to Twitter entities. Name-variant partials are the best entity profiles for Twitter entities. 3) However, comparatively speaking, social documents refer to Twitter entities by their user names more than news and others do, suggesting a difference in

adherence to naming standards. 4) Wikipedia entities achieve higher recall and higher overall performance. 