Changeset - e26f865b0658
Gebrekirstos Gebremeskel - 11 years ago 2014-06-12 00:27:40
destinycome@gmail.com
updated
1 file changed with 28 insertions and 41 deletions:
mypaper-final.tex
 
@@ -102,29 +102,9 @@
 
\maketitle
 
\begin{abstract}
 
 
Entity-centric information processing requires complex pipelines involving both natural language processing and information retrieval components. In entity-centric stream filtering and ranking, the pipeline involves four stages: filtering, classification, ranking (scoring) and evaluation. Filtering is an initial step that extracts a manageable working set of documents from the web-scale corpus for the subsequent stages of the pipeline; it therefore determines the maximally attainable performance of the overall system.

This paper investigates the filtering stage in isolation, in the context of a cumulative citation recommendation problem. Keeping the subsequent steps constant, we conduct an in-depth analysis of the main factors that determine filtering effectiveness: cleansing noisy web data, the methods used to create entity profiles, the types of entity of interest, the document category, and the relevance level of the entity-document pair under consideration. We analyze how these factors (and the design choices made in their corresponding system components) affect filtering performance, and identify the most effective entity profiling. We also identify and characterize the relevant documents that do not pass the filtering stage, and conduct a manual examination of their contents. The paper classifies the ways unfilterable documents are mentioned in text and estimates the practical upper bound of recall in entity-based filtering.
 
 
\end{abstract}
 
% A category with the (minimum) three required fields
 
@@ -196,8 +176,12 @@ TREC-KBA provided relevance judgments for training and testing. Relevance judgme
 
 
 
 
 
 \section{Stream Filtering}
 
 
 
 The TREC Filtering track defines filtering as a ``system that sifts through stream of incoming information to find documents that are relevant to a set of user needs represented by profiles'' \cite{robertson2002trec}. Its information needs are long-term and are represented by persistent profiles, unlike the traditional search system whose ad hoc information need is represented by a search query. Adaptive filtering, one task of the filtering track, starts with a persistent user profile and a very small number of positive examples. A filtering system can improve its user profiles with feedback obtained from interaction with users, and thereby improve its performance. The filtering stage of entity-based stream filtering and ranking can be likened to the adaptive filtering task of the filtering track: the persistent information needs are the KB entities, and the relevance judgments are the small number of positive examples.
 
 
 
 
 
 Stream filtering: given a stream of documents consisting of news items, blogs and social media on the one hand, and KB entities on the other, filter the stream for potentially relevant documents such that the relevance classifier (ranker) achieves the maximum possible performance. Specifically, we conduct an in-depth analysis of the choices and factors affecting the cleansing step, the entity-profile construction, the document category of the stream items, and the type of entities (Wikipedia or Twitter), and finally their impact on the overall performance of the pipeline. We also conduct a manual examination of the vital documents that defy filtering; a minimal sketch of the filtering step follows the research questions below. We strive to answer the following research questions:
 
 
 
 \begin{enumerate}
 
  \item Does cleansing affect filtering and subsequent performance
 
@@ -208,10 +192,12 @@ TREC-KBA provided relevance judgments for training and testing. Relevance judgme
 
 \item What are the vital (relevant) documents that are not retrievable by a system?
 
\end{enumerate}
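
As a point of reference, the filtering step itself reduces to scanning each stream document for the strings in an entity's profile. The following is a minimal, illustrative sketch, not our actual implementation; it assumes profiles are plain sets of lower-cased mention strings:

\begin{verbatim}
# Minimal sketch of entity-centric stream filtering (illustrative).
# Assumes: `stream` yields (doc_id, text) pairs and `profiles` maps
# each KB entity to a set of lower-cased mention strings.
def filter_stream(stream, profiles):
    for doc_id, text in stream:
        body = text.lower()
        for entity, mentions in profiles.items():
            if any(m in body for m in mentions):
                # Candidate entity-document pair; the later stages
                # of the pipeline classify and rank it.
                yield entity, doc_id
\end{verbatim}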
 
 
\subsection{Evaluation}
 
 
The TREC filtering task and the filtering stage of the entity-centric stream filtering and ranking pipeline have different purposes. The TREC filtering track's goal is the binary classification of documents: for each incoming document, it decides whether the document is relevant or not for a given profile. In our case, the documents have graded relevance, and the goal of the filtering stage is to pass on as many potentially relevant documents as possible while admitting as few irrelevant documents as possible, so as not to obfuscate the later stages of the pipeline. Filtering as part of the pipeline requires a delicate balance between retrieving relevant documents and excluding irrelevant ones. Because of this, filtering in this case can only be studied by binding it to the later stages of the entity-centric pipeline. This bond influences how we do evaluation.
 
 
 
To achieve this, we use recall percentages in the filtering stage for the different choices of entity profiles. However, we use the overall performance to select the best entity profiles. To generate the overall pipeline performance we use the official TREC KBA evaluation metrics and scripts \cite{frank2013stream}. The primary metric is the peak F-score across relevance cut-offs, and the secondary metric is scaled utility (SU), which measures how well a system filters out irrelevant documents.
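
For reference, the linear utility underlying SU can be written, in our notation and following the TREC-11 filtering track \cite{robertson2002trec}, with $R^{+}$ the number of relevant documents retrieved, $N^{+}$ the number of non-relevant documents retrieved, and $R$ the total number of relevant documents, as
\[
U = 2R^{+} - N^{+}, \qquad U_{norm} = \frac{U}{2R}, \qquad
SU = \frac{\max(U_{norm}, U_{min}) - U_{min}}{1 - U_{min}},
\]
where $U_{min}$ is a lower bound on the normalized utility (set to $-0.5$ in TREC-11).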
 
 
 
\section{Literature Review}
 
There has been a great deal of recent interest in entity-based filtering and ranking. One manifestation of this is the introduction of TREC KBA in 2012. Following that, a number of studies have addressed the topic \cite{frank2012building, ceccarelli2013learning, taneva2013gem, wang2013bit, balog2013multi}. These works are based on the KBA 2012 task and dataset, and they address the whole problem of entity filtering and ranking. TREC KBA continued in 2013, but the task underwent some changes. The main changes between 2012 and 2013 are in the number of entities, the type of entities, the corpus, and the relevance rankings.
 
 
The number of entities increased from 29 to 141, and now includes 20 Twitter entities. The TREC KBA 2012 corpus is 1.9TB after xz-compression and has 400M documents. By contrast, the KBA 2013 corpus is 6.45TB after xz-compression and GPG encryption. A version with all non-English documents removed is 4.5TB and consists of 1 billion documents. The 2013 corpus subsumed the 2012 corpus and added further content from spinn3r, namely mainstream news, forum, arxiv, classified, reviews and meme-tracker. A more important difference, however, is a change in the definitions of the relevance ratings vital and relevant. While in KBA 2012 a document was judged vital if it had citation-worthy content for a given entity, in 2013 it must also have freshness, that is, the content must trigger an editing of the given entity's KB entry.
 
@@ -223,7 +209,7 @@ All of the studies used filtering as their first step to generate a smaller set
 
Moreover, there has been no study at this scale, nor a study into what types of documents defy filtering and why. In this paper, we conduct a manual examination of the documents that are missed and classify them into different categories. We also estimate the general upper bound of recall using the different entity profiles and choose the best profile, that is, the one resulting in an increased overall performance as measured by F-measure.
 
 
\section{Method}
 
We work with the subset of stream-corpus documents for which relevance assessments exist. For this purpose, we extracted the documents that have annotations from the big corpus; all our experiments are based on this smaller subset. We experiment with all KB entities. For each KB entity, we extract different name variants from DBpedia and Twitter.
 
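As an illustration of the name-variant extraction, the DBpedia strings for a single entity can be collected from the public DBpedia SPARQL endpoint. The sketch below uses the SPARQLWrapper package and a subset of the properties involved (labels, redirects, birth names); it is a sketch of the idea, not our exact extraction code:

\begin{verbatim}
# Sketch: collect DBpedia name variants for one KB entity.
from SPARQLWrapper import SPARQLWrapper, JSON

ENTITY = "<http://dbpedia.org/resource/Buddy_MacKay>"
QUERY = """
SELECT DISTINCT ?v WHERE {
  { %s rdfs:label ?v }
  UNION { ?r dbo:wikiPageRedirects %s . ?r rdfs:label ?v }
  UNION { %s dbo:birthName ?v }
  FILTER (lang(?v) = "en")
}""" % (ENTITY, ENTITY, ENTITY)

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
rows = sparql.query().convert()["results"]["bindings"]
variants = {row["v"]["value"] for row in rows}
\end{verbatim}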
 
 
\subsection{Entity Profiling}
 
@@ -233,15 +219,15 @@ We build profiles for the KB entities of interest. We have two types: Twitter an
 
\begin{center}
 
 
 \begin{tabular}{lc}
 Name variant & No. of strings \\
\hline
 Name & 82 \\
 Label & 121 \\
 Redirect & 49 \\
 Birth Name & 6 \\
 Nickname & 1 \\
 Alias & 1 \\
 Alternative Names & 4 \\
\hline
\end{tabular}
 
@@ -252,10 +238,10 @@ Redirect  &49&96 \\
 
 
We have a total of 121 Wikipedia entities. Every entity has a DBpedia label. Only 82 entities have a name string and only 49 entities have redirect strings. Most entities have only one redirect string, but some have several; one entity, Buddy\_MacKay, has the highest number (12) of redirect strings. 6 entities have birth names, 1 entity has a nickname, 1 entity has an alias, and 4 entities have alternative names.
 
 
 
We combined the different name variants we extracted to form a set of strings for each KB entity. Specifically, we merged the names, labels, redirects, birth names, nicknames, aliases and alternative names of each entity. For Twitter entities, we used the display names that we collected. We consider the name of an entity that is part of its URL as canonical. For example, in http://en.wikipedia.org/wiki/Benjamin\_Bronfman, Benjamin Bronfman is the canonical name of the entity. From the combined name variants and the canonical names, we created four sets of profiles for each entity: canonical (cano), canonical partial (cano-part), all name variants combined (all), and partial names of all name variants (all-part). We refer to the last two profiles as name-variant and name-variant partial. The names in parentheses are used in table captions.
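
A minimal sketch of how these four profiles can be derived from the collected strings follows; here, token splitting stands in for whatever partial-name scheme is applied, and the function name is ours:

\begin{verbatim}
# Sketch: build the four entity profiles from the name strings.
# `canonical` is the URL-derived name; `variants` merges labels,
# redirects, birth names, nicknames, aliases, alternative names.
def build_profiles(canonical, variants):
    def partials(names):
        # Partial names: individual tokens of each full name.
        return {tok for name in names for tok in name.split()}
    allnames = {canonical} | set(variants)
    return {
        "cano":      {canonical},
        "cano-part": partials({canonical}),
        "all":       allnames,
        "all-part":  partials(allnames),
    }
\end{verbatim}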
 
\subsection{Annotation Corpus}
 
 
 
The annotation set is a combination of the annotations from before the Training Time Range (TTR) and the Evaluation Time Range (ETR) and consists of 68405 annotations. Its breakdown into training and test sets is shown in Table \ref{tab:breakdown}.
 
 
 
\begin{table}
 
@@ -438,7 +424,7 @@ Overall, across all entities types and all entity profiles, others achieve highe
 
We computed four percentage increases in recall (deltas) between the different entity profiles (Table \ref{tab:source-delta2}). The first delta is the recall difference between canonical partial and canonical. The second is between name-variant and canonical. The third is the difference between name-variant partial and canonical partial, and the fourth between name-variant partial and name-variant. We believe these four deltas have a clear interpretation. The delta between name-variant and canonical means the percentage of documents that the new name variants retrieve but the canonical name does not. Similarly, the delta between name-variant partial and canonical partial means the percentage of document-entity pairs that can be gained by using the partial names of the name variants.
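
In our notation, with $R(p)$ the recall achieved by entity profile $p$:
\[
\Delta_1 = R(\mbox{cano-part}) - R(\mbox{cano}), \qquad
\Delta_2 = R(\mbox{all}) - R(\mbox{cano}),
\]
\[
\Delta_3 = R(\mbox{all-part}) - R(\mbox{cano-part}), \qquad
\Delta_4 = R(\mbox{all-part}) - R(\mbox{all}).
\]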
 
% The  biggest delta  observed is in Twitter entities between partials of all name variants and partials of canonicals (93\%). delta. Both of them are for news category.  For Wikipedia entities, the highest delta observed is 19.5\% in cano\_part - cano followed by 17.5\% in all\_part in relevant.  
 
  
 
 
  \subsection{Entity Types: Wikipedia and Twitter}
 
Table \ref{tab:name} shows the difference between Wikipedia and Twitter entities. For Wikipedia entities, the canonical profile achieves a recall of 70\%, and canonical partial achieves a recall of 86.1\%, an increase of 16.1\%. By contrast, the increase in recall of name-variant partial over name-variant is 8.3\%.
 
%The high increase in recall when moving from canonical names  to their partial names, in comparison to the lower increase when moving from all name variants to their partial names can be explained by saturation. This is to mean that documents have already been extracted by the different name variants and thus using their partial names does not bring in many new relevant documents. 
 
One interesting observation is that, for Wikipedia entities, canonical partial achieves better recall than name-variant in both the cleansed and the raw corpus.  %In the raw extraction, the difference is about 3.7. 
 
@@ -536,7 +522,7 @@ In vital-relevant category (Table \ref{tab:class-vital-relevant}), the performan
 
 
 
\subsection{Missing vital-relevant documents \label{miss}}
 
 
 
% 
 
 
 The use of name-variant partial for filtering is an aggressive attempt to retrieve as many relevant documents as possible, at the cost of retrieving irrelevant documents. However, we still miss about 2363 (10\%) of the vital-relevant documents. Why are these documents missed? If they do not mention the partial names of the name variants, how are the entities mentioned? Table \ref{tab:miss} shows the documents that we miss with respect to the cleansed and raw corpora. The upper part shows the number of documents missing from the cleansed and raw versions of the corpus. The lower part of the table shows the intersections and exclusions in each corpus.
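
The intersections and exclusions in the lower part of the table amount to simple set operations over the missed entity-document pairs; as a sketch (the set names are ours):

\begin{verbatim}
# Sketch: partition the missed vital-relevant documents.
# missed_clean / missed_raw: sets of (entity, doc_id) pairs not
# retrieved from the cleansed / raw corpus, respectively.
missed_both = missed_clean & missed_raw  # missed in both versions
only_clean  = missed_clean - missed_raw  # raw text recovers these
only_raw    = missed_raw - missed_clean  # cleansing recovers these
\end{verbatim}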
 
@@ -686,6 +672,7 @@ There is a trade-off between using a richer entity-profile and retrieval of irre
 
 
 
 We observed that among the missing documents, different document ids can have the same content and be judged multiple times for a given entity. %In the vital annotation, there are 88 news, and 409 weblog. 

 Avoiding duplicates, we randomly selected 35 documents, one for each entity. Of these, 13 are news and 22 are social documents. Below, we classify the situations under which a document can be vital for an entity without containing a mention of the entity under any of the entity profiles we used for filtering.
 
 
\paragraph{Outgoing link mentions} A post (tweet) contains an outgoing link to a page that mentions the entity.
 
\paragraph{Event place - Event} A document that talks about an event is vital to the location entity where the event takes place. For example, the Maha Music Festival takes place at Lewis and Clark Landing, and a document talking about the festival is vital for the park. There are also cases where an event's address places it in a park, and the document thereby becomes vital to the park. This is, in essence, being mentioned by an address that belongs to a larger place.
 
\paragraph{Entity - related entity} A document about an important figure such as an artist or athlete can be vital to another. This is especially true if the two are contending for the same title, or if one has snatched a title or award from the other.