HCDA/cikm-paper Changeset - e26f865b0658 · Centrum Wiskunde & Informatica (CWI)

@@ -93,47 +93,27 @@

% }

% There's nothing stopping you putting the seventh, eighth, etc.

% author on the opening page (as the 'third row') but we ask,

% for aesthetic reasons that you place these 'additional authors'

% in the \additional authors block, viz.

% Just remember to make sure that the TOTAL number of authors

% is the number that will appear on the first page PLUS the

% number that will appear in the \additionalauthors section.

\maketitle

\begin{abstract}

Entity-centric information processing requires complex pipelines

involving both natural language processing and information retrieval

components. In entity-centric stream filtering and ranking, the

pipeline involves four stages: filtering, classification,

ranking (scoring) and evaluation. Filtering is an initial step that

extracts a working-set of documents from the web-scale corpus, aiming

for a smaller size collection that would be more manageable in the

subsequent stages of the pipeline. This filtering step therefore

determines the maximally attainable performance of the overall system.

This paper investigates the filtering stage in isolation, in the context

of a cumulative citation recommendation problem. We conduct an

in-depth analysis of the main factors that determine filtering

effectiveness: cleansing noisy web data, methods to create entity

profiles, the types of entity of interest, document category, and the

relevance level of the entity-document pair under consideration.

We analyze how these factors (and the design choices made in their

corresponding system components) affect filtering performance.

We identify and characterize the relevant documents that do not pass the

filtering stage, and conduct a manual examination into their

contents. The paper classifies the ways unfilterable documents

are mentioned in text and estimates the practical upper-bound of

recall in entity-based filtering.

Entity-centric information processing requires complex pipelines involving both natural language processing and information retrieval components. In entity-centric stream filtering and ranking, the pipeline involves four important stages: filtering, classification, ranking (scoring), and evaluation. Filtering is an important step that creates a manageable working set of documents from a web-scale corpus for the next stages. It thus determines the performance of the overall system. Keeping the subsequent steps constant, we zoom in on the filtering stage and conduct an in-depth analysis of the main components of cleansing, entity profiles, relevance levels, category of documents and entity types, with a view to understanding the factors and choices that affect filtering performance. The study demonstrates the most effective entity profiling, identifies those relevant documents that defy filtering, and conducts a manual examination into their contents. The paper classifies the ways unfilterable documents

are mentioned in text and estimates the practical upper-bound of recall in  entity-based filtering.

\end{abstract}

% A category with the (minimum) three required fields

\category{H.4}{Information Filtering}{Miscellaneous}

%A category including the fourth, optional field follows...

%\category{D.2.8}{Software Engineering}{Metrics}[complexity measures, performance measures]

\terms{Theory}

\keywords{Information Filtering; Cumulative Citation Recommendation; knowledge maintenance; Stream Filtering;  emerging entities} % NOT required for Proceedings

@@ -187,84 +167,90 @@ We use TREC-KBA 2013 dataset \footnote{http://http://trec-kba.org/trec-kba-2013.

\label{tab:streams}

\end{table}

\subsection{KB entities}

 The KB entities consist of 20 Twitter entities and 121 Wikipedia entities. The selected entities are, on purpose, sparse. The entities consist of 71 people, 1 organization, and 24 facilities.

\subsection{Relevance judgments}

TREC-KBA provided relevance judgments for training and testing. Relevance judgments are given to document-entity pairs. Documents with citation-worthy content for a given entity are annotated as \emph{vital}, while documents with tangentially relevant content, documents that lack freshness, or documents with content that can be useful for an initial KB dossier are annotated as \emph{relevant}. Documents with no relevant content are labeled \emph{neutral} and spam is labeled \emph{garbage}. The inter-annotator agreement on vital was 70\% in 2012 and 76\% in 2013. This is due to the more refined definition of vital and the distinction made between vital and relevant.

 \subsection{Stream Filtering}

 Given a stream of documents of news items, blogs and social media on one hand and KB entities (Wikipedia, Twitter)  on the other,  we study the factors and choices that affect filtering performance. Specifically, we conduct in-depth analysis on the cleansing step, the entity-profile construction, the document category of the stream items, and the type of entities (Wikipedia or Twitter). We also study the impact of choices on classification performance. Finally, we conduct manual examination of the relevant documents that defy filtering. We strive to answer the following research questions:

 \section{Stream Filtering}

 The TREC Filtering track defines filtering as a ``system that sifts through stream of incoming information to find documents that are relevant to a set of user needs represented by profiles'' \cite{robertson2002trec}. Its information needs are long-term and are represented by persistent profiles, unlike a traditional search system, whose ad hoc information need is represented by a search query. Adaptive Filtering, one task of the Filtering track, starts with a persistent user profile and a very small number of positive examples. A filtering system can improve its user profiles with feedback obtained from interaction with users, and thereby improve its performance. The filtering stage of entity-based stream filtering and ranking can be likened to the adaptive filtering task of the Filtering track. The persistent information needs are the KB entities, and the relevance judgments are the small number of positive examples.

 Stream filtering: given a stream of documents of news items, blogs and social media on one hand and KB entities on the other, filter the stream for potentially relevant documents such that the relevance classifier (ranker) achieves the highest possible performance. Specifically, we conduct an in-depth analysis of the choices and factors affecting the cleansing step, the entity-profile construction, the document category of the stream items, and the type of entities (Wikipedia or Twitter), and of their impact on the overall performance of the pipeline. Finally, we conduct a manual examination of the vital documents that defy filtering. We strive to answer the following research questions:

 \begin{enumerate}

  \item Does cleansing affect filtering and subsequent performance?

  \item What is the most effective way of representing entity profiles?

  \item Is filtering different for Wikipedia and Twitter entities?

  \item Are some types of documents easily filterable and others not?

  \item Does a gain in recall at the filtering step translate to a gain in F-measure at the end of the pipeline?

  \item What are the vital (relevant) documents that are not retrievable by a system?

\end{enumerate}

\subsection{Evaluation}

The TREC filtering track

TREC filtering and filtering as part of the entity-centric stream filtering and ranking pipeline have different purposes. The goal of the TREC Filtering track is the binary classification of documents: for each incoming document, the system decides whether it is relevant or not for a given profile. In our case, the documents have graded relevance ratings, and the goal of the filtering stage is to retain as many potentially relevant documents as possible while passing as few irrelevant documents as possible, so as not to obfuscate the later stages of the pipeline. Filtering as part of the pipeline therefore needs a delicate balance between retrieving relevant documents and excluding irrelevant ones. Because of this, filtering in this setting can only be studied by binding it to the later stages of the entity-centric pipeline. This bond influences how we do evaluation.

\subsection{Literature Review}

To achieve this, we use recall percentages in the filtering stage for the different choices of entity profiles. However, we use the overall performance to select the best entity profiles. To generate the overall pipeline performance we use the official TREC KBA evaluation metrics and scripts \cite{frank2013stream}. The primary metric is the peak F-score across different relevance cut-offs, and the secondary metric is scaled utility (SU), which measures how well a system filters out irrelevant documents.
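The two quantities used above can be illustrated with a minimal sketch. It is not the official TREC KBA scorer; it only shows recall over document-entity pairs and the highest F-measure over a set of cut-offs. All pair identifiers and precision/recall values are hypothetical and serve illustration only.

```python
def recall(retrieved, relevant):
    """Fraction of relevant document-entity pairs that survive filtering."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


def f_measure(precision, recall_value):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall_value == 0:
        return 0.0
    return 2 * precision * recall_value / (precision + recall_value)


def peak_f(precision_recall_by_cutoff):
    """Highest F-measure over (precision, recall) pairs, one pair per cut-off."""
    return max(f_measure(p, r) for p, r in precision_recall_by_cutoff)


# Hypothetical document-entity pairs, for illustration only.
relevant_pairs = {
    ("doc1", "entityA"), ("doc2", "entityA"),
    ("doc3", "entityB"), ("doc4", "entityB"),
}
retrieved_pairs = {
    ("doc1", "entityA"), ("doc3", "entityB"),
    ("doc4", "entityB"), ("doc9", "entityB"),
}

# Three of the four relevant pairs pass the filter.
filter_recall = recall(retrieved_pairs, relevant_pairs)

# Hypothetical (precision, recall) values at two cut-offs.
best_f = peak_f([(0.5, 0.5), (0.25, 1.0)])
```

Recall is computed on the filtering output alone, while the F-measure requires the scores produced by the later stages of the pipeline.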

\section{Literature Review}

There has been a great deal of interest of late in entity-based filtering and ranking. One manifestation of that is the introduction of TREC KBA in 2012. Following that, a number of research works have been done on the topic \cite{frank2012building, ceccarelli2013learning, taneva2013gem, wang2013bit, balog2013multi}. These works are based on the KBA 2012 task and dataset, and they address the whole problem of entity filtering and ranking. TREC KBA continued in 2013, but the task underwent some changes. The main changes between 2012 and 2013 are in the number of entities, the type of entities, the corpus and the relevance ratings.

The number of entities increased from 29 to 141, and it included 20 Twitter entities. The TREC KBA 2012 corpus is 1.9TB after XZ-compression and has 400M documents. By contrast, the KBA 2013 corpus is 6.45TB after XZ-compression and GPG encryption. A version with all non-English documents removed is 4.5TB and consists of 1 billion documents. The 2013 corpus subsumed the 2012 corpus and added others from spinn3r, namely main-stream news, forum, arxiv, classified, reviews and meme-tracker. A more important difference, however, is a change in the definitions of the relevance ratings vital and relevant. While in KBA 2012 a document was judged vital if it had citation-worthy content for a given entity, in 2013 it must also have freshness, that is, the content must trigger an edit of the given entity's KB entry.

While the tasks of 2012 and 2013 are fundamentally the same, the approaches varied due to the size of the corpus. In 2013, all participants used filtering to reduce the size of the big corpus. They used different ways of filtering: many of them used two or more name variants from DBpedia, such as labels, names, redirects, birth names, aliases, nicknames, same-as and alternative names \cite{wang2013bit,dietzumass,liu2013related, zhangpris}. Although most of the participants used DBpedia name variants, none of them used all the name variants. A few other participants used bold words in the first paragraph of the Wikipedia entity's profile and anchor texts from other Wikipedia pages \cite{bouvierfiltering, niauniversity}. One participant used a Boolean \emph{and} query built from the tokens of the canonical names \cite{illiotrec2013}.

All of the studies used filtering as their first step to generate a smaller set of documents. Many systems suffered from poor recall, which strongly affected their overall performance \cite{frank2012building}. Although systems used different entity profiles to filter the stream, and achieved different performance levels, there is no study on the factors and choices that affect the filtering step itself. Of course, filtering has been extensively examined in TREC Filtering \cite{robertson2002trec}. However, those studies were isolated in the sense that they were intended to optimize recall. What we have here is a different scenario: documents have relevance ratings. We therefore want to study filtering in connection with relevance to the entities, and this can be done by coupling filtering to the later stages of the pipeline. This is new to the best of our knowledge, and the TREC KBA problem setting and datasets offer a good opportunity to examine this aspect of filtering.

Moreover, there has not been a chance to study filtering at this scale, nor a study into what types of documents defy filtering and why. In this paper, we conduct a manual examination of the documents that are missing and classify them into different categories. We also estimate the general upper bound of recall using the different entity profiles and choose the best profile, that is, the one that results in increased overall performance as measured by F-measure.

\section{Method}

We work with the subset of stream corpus documents for which annotations exist. For this purpose, we extracted the annotated documents from the big corpus. All our experiments are based on this smaller subset. We experiment with all KB entities. For each KB entity, we extract different name variants from DBpedia and Twitter.

We work with the documents that have relevance assessments. For this purpose, we extracted those documents from the big corpus. We experiment with all KB entities. For each KB entity, we extract different name variants from DBpedia and Twitter.

\subsection{Entity Profiling}

We build profiles for the KB entities of interest. We have two types: Twitter and Wikipedia. Both types of entities are selected, on purpose, to be sparse and less documented. For the Twitter entities, we visit their respective Twitter pages and manually fetch their display names. For the Wikipedia entities, we fetch different name variants from DBpedia, namely name, label, birth name, alternative names, redirects, nickname, and alias. The extraction results are in Table \ref{tab:sources}.

\begin{table}

\caption{Number of different DBpedia name variants}

\begin{center}

 \begin{tabular}{l*{4}{c}l}

 &Name variant& Number of strings \\

 Name variant& No. of strings  \\

\hline

 Name  &82&82\\

 Label   &121 &121\\

Redirect  &49&96 \\

 Birth Name &6&6\\

 Nickname &1&1\\

 Alias &1 &1\\

 Alternative Names &4&4\\

 Name  &82\\

 Label   &121\\

Redirect  &49 \\

 Birth Name &6\\

 Nickname &1\\

 Alias &1 \\

 Alternative Names &4\\

\hline

\end{tabular}

\end{center}

\label{tab:sources}

\end{table}

We have a total of 121 Wikipedia entities. Every entity has a DBpedia label. Only 82 entities have a name string and only 49 entities have redirect strings. Most of the entities have only one string, but some have several redirect strings. One entity, Buddy\_MacKay, has the highest number (12) of redirect strings. Six entities have birth names, one entity has a nickname, one entity has an alias, and four entities have alternative names.

We combined the different names we extracted to form a set of strings for each KB entity. Specifically, we merged the names, labels, redirects, birth names, nicknames and alternative names of each entity. For Twitter entities, we used the display names that we collected. We consider the names of the entities that are part of the URL as canonical. For example, in http://en.wikipedia.org/wiki/Benjamin\_Bronfman, Benjamin Bronfman is a canonical name of the entity. From the combined name variants and the canonical names, we created four sets of profiles for each entity: canonical (cano), canonical partial (cano-part), all name variants combined (all) and partial names of all name variants (all-part). The names in parentheses are used in table captions.

We combined the different name variants we extracted to form a set of strings for each KB entity. For Twitter entities, we used the display names that we collected. We consider the names of the entities that are part of the URL as canonical. For example, in http://en.wikipedia.org/wiki/Benjamin\_Bronfman, Benjamin Bronfman is a canonical name of the entity. From the combined name variants and the canonical names, we created four sets of profiles for each entity: canonical (cano), canonical partial (cano-part), all name variants combined (all) and partial names of all name variants (all-part). We refer to the last two profiles as name-variant and name-variant partial. The names in parentheses are used in table captions.
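The construction of the four profiles can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used in the experiments: partial names are assumed to be the whitespace-separated tokens of each name, the combined profile is assumed to include the canonical name, and matching is assumed to be a case-insensitive substring test. The name variant passed in the example is hypothetical.

```python
def canonical_name(url):
    """Take the last path segment of the URL and turn underscores into spaces."""
    return url.rstrip("/").rsplit("/", 1)[-1].replace("_", " ")


def partials(names):
    """Assumption: partial names are the whitespace-separated tokens of each name."""
    return {token for name in names for token in name.split()}


def build_profiles(url, name_variants):
    """Return the four profiles keyed by the labels used in the table captions."""
    cano = {canonical_name(url)}
    # Assumption: the combined profile also contains the canonical name.
    all_names = cano | set(name_variants)
    return {
        "cano": cano,
        "cano-part": partials(cano),
        "all": all_names,
        "all-part": partials(all_names),
    }


def matches(profile, text):
    """Assumed matching strategy: case-insensitive substring test."""
    lowered = text.lower()
    return any(name.lower() in lowered for name in profile)


wiki_profiles = build_profiles(
    "http://en.wikipedia.org/wiki/Benjamin_Bronfman",
    ["Ben Bronfman"],  # hypothetical name variant, for illustration only
)
twitter_profiles = build_profiles("https://twitter.com/roryscovel", [])
```

With these assumptions a one-word canonical name yields identical canonical and canonical partial profiles, as is the case for the Twitter example. Plain substring matching on short tokens retrieves many unrelated documents, which illustrates why partial profiles trade precision for recall.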

\subsection{Annotation Corpus}

%The annotation set is a combination of the annotations from before the Training Time Range(TTR) and Evaluation Time Range (ETR) and consists of 68405 annotations.  Its breakdown into training and test sets is  shown in Table \ref{tab:breakdown}.

The annotation set is a combination of the annotations from before the Training Time Range(TTR) and Evaluation Time Range (ETR) and consists of 68405 annotations.  Its breakdown into training and test sets is  shown in Table \ref{tab:breakdown}.

\begin{table}

\caption{Number of annotation documents with respect to different categories(relevance rating, training and testing)}

\begin{center}

\begin{tabular}{l*{3}{c}r}

 &&Vital&Relevant  &Total \\

\hline

\multirow{2}{*}{Training}  &Wikipedia & 1932  &2051& 3672\\

			  &Twitter&189   &314&488 \\

			   &All Entities&2121&2365&4160\\

@@ -429,25 +415,25 @@ The recall for Wikipedia entities in Table \ref{tab:name} ranged from 61.8\% (ca

%This indicates that social documents refer to Twitter entities by their canonical names (user names) more than news do. In name- variant partial, news achieve better results than social. The difference in recall between canonicals and name-variants show that news do not refer to Twitter entities by their user names, they refer to them by their display names.

Overall, across all entity types and all entity profiles, others achieve higher recall than news, and news, in turn, achieve higher recall than social documents.

% This suggests that social documents are the hardest  to retrieve.  This  makes sense since social posts such as tweets and blogs are short and are more likely to point to other resources, or use short informal names.

We computed four percentage increases in recall (deltas) between the different entity profiles (Table \ref{tab:source-delta2}). The first delta is the increase in recall from canonical to canonical partial. The second is from canonical to name-variant. The third is from canonical partial to name-variant partial, and the fourth from name-variant to name-variant partial. We believe these four deltas have a clear interpretation. The delta between name-variant and canonical is the percentage of documents that the new name variants retrieve but the canonical name does not. Similarly, the delta between name-variant partial and canonical partial is the percentage of document-entity pairs that can be gained by the partial names of the name variants.
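The four deltas can be written down as a small sketch, assuming each delta is the plain difference between two recall percentages. The profile labels follow the table captions. The canonical and canonical partial values in the example are the Wikipedia figures quoted in the text; the two name-variant values are hypothetical placeholders.

```python
# Each pair is (richer profile, baseline profile).
DELTA_PAIRS = [
    ("cano-part", "cano"),
    ("all", "cano"),
    ("all-part", "cano-part"),
    ("all-part", "all"),
]


def recall_deltas(recall_by_profile):
    """Difference in recall (percentage points) for each profile pair."""
    return {
        f"{richer} - {baseline}": round(
            recall_by_profile[richer] - recall_by_profile[baseline], 1
        )
        for richer, baseline in DELTA_PAIRS
    }


example_recall = {
    "cano": 70.0,
    "cano-part": 86.1,
    "all": 82.0,       # hypothetical value, for illustration only
    "all-part": 90.3,  # hypothetical value, for illustration only
}
deltas = recall_deltas(example_recall)
```

A delta computed this way counts the additional relevant document-entity pairs that the richer profile retrieves, expressed as a share of all relevant pairs.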

% The  biggest delta  observed is in Twitter entities between partials of all name variants and partials of canonicals (93\%). delta. Both of them are for news category.  For Wikipedia entities, the highest delta observed is 19.5\% in cano\_part - cano followed by 17.5\% in all\_part in relevant.

  \subsection{Entity Types: Wikipedia and Twitter)}

  \subsection{Entity Types: Wikipedia and Twitter}

Table \ref{tab:name} shows the difference between Wikipedia and Twitter entities. For Wikipedia entities, canonical achieves a recall of 70\%, and canonical partial achieves a recall of 86.1\%, an increase in recall of 16.1\%. By contrast, the increase in recall of name-variant partial over name-variant is 8.3\%.

%The high increase in recall when moving from canonical names  to their partial names, in comparison to the lower increase when moving from all name variants to their partial names can be explained by saturation. This is to mean that documents have already been extracted by the different name variants and thus using their partial names does not bring in many new relevant documents.

One interesting observation is that, for Wikipedia entities, canonical partial achieves better recall than name-variant in both the cleansed and the raw corpus.  %In the raw extraction, the difference is about 3.7.

For Twitter entities, however, it is different. Canonical and canonical partial perform the same and the recall is very low. They are the same for Twitter entities because the canonical names are one-word strings. For example, in https://twitter.com/roryscovel, ``roryscovel'' is the canonical name and its partial is also the same.

%The low recall is because the canonical names of Twitter entities are not really names; they are usually arbitrarily created user names. It shows that  documents  refer to them by their display names, rarely by their user name, which is reflected in the name-variant recall (67.9\%). The use of name-variant partial increases the recall to 88.2\%.

Tables \ref{tab:name} and \ref{tab:source-delta} show that recall for Wikipedia entities is higher than for Twitter entities. Generally, at both aggregate and document category levels, we observe that recall increases as we move from canonical to canonical partial, to name-variant, and to name-variant partial. The only case where this does not hold is in the transition from Wikipedia's canonical partial to name-variant. At the aggregate level (as can be inferred from Table \ref{tab:name}), the difference in performance between canonical and name-variant partial is 31.9\% on all entities, 20.7\% on Wikipedia entities, and 79.5\% on Twitter entities. This is a substantial performance difference.

@@ -527,25 +513,25 @@ The tables in \ref{tab:name} and \ref{tab:source-delta} show recall for Wikipedi

Table \ref{tab:class-vital} shows the recall performance for documents judged vital. On Wikipedia entities, except in the canonical profile, the cleansed version achieves better results than the raw version. However, on Twitter entities, the raw corpus achieves better results in all entity profiles (except in name-variant partial). At an aggregate (both Wikipedia and Twitter) level, we see that cleansed achieves better results in three profiles. Only in canonical partial does raw perform better. Overall, cleansed achieves better results than raw. This result is interesting because we saw in previous sections that the raw corpus achieves higher recall than cleansed. In the case of name-variant partial, for example, 10\% more relevant documents are retrieved in the raw corpus. The gain in recall in the raw corpus does not translate into a gain in F-measure. In fact, in most cases F-measure decreased. % One explanation for this is that it brings in many false positives from, among related links, adverts, etc.

For Wikipedia entities,  canonical partial  achieves the highest performance. For Twitter, name-variant partial achieves  better results.

In the vital-relevant category (Table \ref{tab:class-vital-relevant}), the performances are different. Except in canonical partial, raw achieves better results in all cases. For Twitter entities, the raw corpus achieves better results in all cases. In terms of entity profiles, Wikipedia's canonical partial achieves the best F-score. For Twitter, as before, it is canonical partial. The raw corpus has more effect on relevant documents and Twitter entities.

%The fact that canonical partial names achieve better results is interesting.  We know that partial names were used as a baseline in TREC KBA 2012, but no one of the KBA participants actually used partial names for filtering.

\subsection{Missing vital-relevant documents \label{miss}}

 The use of name-variant partial for filtering is an aggressive attempt to retrieve as many relevant documents as possible at the cost of retrieving irrelevant documents. However, we still miss about 2363 (10\%) of the vital-relevant documents. Why are these documents missed? If the entities are not mentioned by partial names of name variants, what are they mentioned by? Table \ref{tab:miss} shows the documents that we miss with respect to the cleansed and raw corpus. The upper part shows the number of documents missing from the cleansed and raw versions of the corpus. The lower part of the table shows the intersections and exclusions in each corpus.

\begin{table}

\caption{The number of documents missing  from raw and cleansed extractions. }

\begin{center}

\begin{tabular}{l@{\quad}llllll}

\hline

\multicolumn{1}{l}{\rule{0pt}{12pt}category}&\multicolumn{1}{l}{\rule{0pt}{12pt}Vital }&\multicolumn{1}{l}{\rule{0pt}{12pt}Relevant }&\multicolumn{1}{l}{\rule{0pt}{12pt}Total }\\[5pt]

\hline

@@ -677,24 +663,25 @@ Across document categories, we observe a pattern in recall of others, followed b

\section{Unfilterable documents}

There is a trade-off between using a richer entity profile and the retrieval of irrelevant documents. The richer the profile, the more relevant documents it retrieves, but also the more irrelevant documents. To put this into perspective, let us compare the number of documents that are retrieved with canonical partial and with name-variant partial. Using the raw corpus, the former retrieves a total of 2547487 documents and achieves a recall of 72.2\%. By contrast, the latter retrieves a total of 4735318 documents and achieves a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18\%. The rest of the documents, that is 67.9\%, are newly introduced irrelevant documents.
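The figures in this comparison can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers quoted above and follows the computation as the text states it; the variable names are introduced here for illustration.

```python
# Documents retrieved from the raw corpus, as quoted in the text.
cano_part_docs = 2547487
all_part_docs = 4735318

# Recall (in percent) of the two profiles, as quoted in the text.
cano_part_recall = 72.2
all_part_recall = 90.2

# Relative growth of the working set when moving to name-variant partial.
doc_increase = (all_part_docs - cano_part_docs) / cano_part_docs * 100

# Recall gain in percentage points.
recall_gain = all_part_recall - cano_part_recall

# Remainder as computed in the text: growth not accounted for by the recall gain.
remainder = round(doc_increase, 1) - round(recall_gain, 1)

print(f"increase in documents: {doc_increase:.1f}%")
print(f"recall gain: {recall_gain:.1f} points")
print(f"remainder: {remainder:.1f}%")
```

Note that the first quantity is relative to the size of the smaller working set, while the second is a difference between two recall percentages.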

 We observed that there are vital-relevant documents that we miss from raw only, and similarly from cleansed only. The reason for this is the transformation from one format to another. The most interesting documents are those that we miss from both the raw and the cleansed corpus. We first identified the KB entities that have a vital relevance judgment and whose documents cannot be retrieved (35 in total) and conducted a manual examination of the content of those documents to find out why they are missing.

 We  observed  that among the missing documents, different document ids can have the same content, and be judged multiple times for a given entity.  %In the vital annotation, there are 88 news, and 409 weblog.

 Avoiding duplicates, we randomly selected 35 documents, one for each entity. The documents are 13 news and 22 social. Below, we classify the situations in which a document can be vital for an entity without mentioning the entity by any of the entity profiles we used for filtering.

\paragraph{Outgoing link mentions} A post (tweet) with an outgoing link which mentions the entity.

\paragraph{Event place - Event} A document that talks about an event is vital to the location entity where it takes place. For example, Maha Music Festival takes place in Lewis and Clark\_Landing, and a document talking about the festival is vital for the park. There are also cases where an event's address places the event in a park and, because of that, the document becomes vital to the park. This is essentially a mention by an address that belongs to a larger space.

\paragraph{Entity - related entity} A document about an important figure such as an artist or athlete can be vital to another. This is especially true if the two are contending for the same title, or one has taken a title or award from the other.

\paragraph{Organization - main activity} A document that talks about an area in which the company is active is vital for the organization. For example, Atacocha is a mining company, and a news item on mining waste was annotated vital.

\paragraph{Entity - group} If an entity belongs to a certain group (class), a news item about the group can be vital for the individual members. For example, FrankandOak was named an innovative company, and a news item that talks about the group of innovative companies is relevant for it. Another example is a big event to which an entity is related, such as film awards for actors.

\paragraph{Artist - work} Documents that discuss the work of artists can be relevant to the artists. Such cases include books or films being vital for the book author or the director (actor) of the film. Robocop is a film whose screenplay is by Joshua Zetumer. A blog that talks about the film was judged vital for Joshua Zetumer.

\paragraph{Politician - constituency} A major political event in a certain constituency is vital for the politician from that constituency.

 A good example is a weblog that talks about two North Dakota counties being drought disasters. The news is vital for Joshua Boschee, a politician and a member of the North Dakota Democratic party.

\paragraph{Head - organization} A document that talks about an organization of which the entity is the head can be vital for the entity. Jasper\_Schneider is the USDA Rural Development state director for North Dakota, and an article about problems of primary health centers in North Dakota is judged vital for him.

\paragraph{World Knowledge} Some judgments are impossible to make without world knowledge. For example, ``refreshments, treats, gift shop specials, `bountiful, fresh and fabulous holiday decor,' a demonstration of simple ways to create unique holiday arrangements for any home; free and open to the public'' is judged relevant to Hjemkomst\_Center. This is a social media post, and unless one knows the person posting it, nothing in the text itself reveals the connection to the entity. Similarly, ``learn about the gray wolf's hunting and feeding behaviors and watch the wolves have their evening meal of a full deer carcass; \$15 for members, \$20 for nonmembers'' is judged vital to Red\_River\_Zoo.

\paragraph{No document content} Some documents were found to have no content.

\paragraph{Not clear why} It is not clear why some documents are annotated vital for some entities.