Changeset - 60fbfbab0287
Arjen de Vries (arjen) - 2014-06-12 02:03:44
arjen.de.vries@cwi.nl
abstract new
1 file changed with 24 insertions and 3 deletions:
mypaper-final.tex
 
@@ -85,51 +85,72 @@
 
%        \affaddr{Institute for Clarity in Documentation}\\
 
%        \affaddr{1932 Wallamaloo Lane}\\
 
%        \affaddr{Wallamaloo, New Zealand}\\
 
%        \email{trovato@corporation.com}
 
% % 2nd. author
 
% \alignauthor
 
% G.K.M. Tobin\titlenote{The secretary disavows
 
% any knowledge of this author's actions.}\\
 
%        \affaddr{Institute for Clarity in Documentation}\\
 
%        \affaddr{P.O. Box 1212}\\
 
%        \affaddr{Dublin, Ohio 43017-6221}\\
 
%        \email{webmaster@marysville-ohio.com}
 
% }
 
% There's nothing stopping you putting the seventh, eighth, etc.
 
% author on the opening page (as the 'third row') but we ask,
 
% for aesthetic reasons that you place these 'additional authors'
 
% in the \additional authors block, viz.
 
% Just remember to make sure that the TOTAL number of authors
 
% is the number that will appear on the first page PLUS the
 
% number that will appear in the \additionalauthors section.
 
 
\maketitle
 
\begin{abstract}
 
 
 
Entity-centric information processing requires complex pipelines involving both natural language processing and information retrieval components. In entity-centric stream filtering and ranking, the pipeline involves four important stages: filtering, classification, ranking (scoring) and evaluation. Filtering is an important step that creates a manageable working set of documents from a web-scale corpus for the next stages, and it thus determines the performance of the overall system. Keeping the subsequent steps constant, we zoom in on the filtering stage and conduct an in-depth analysis of its main components: cleansing, entity profiles, relevance levels, document categories and entity types, with a view to understanding the factors and choices that affect filtering performance. The study determines the most effective entity profiling, identifies the relevant documents that defy filtering, and examines their contents manually. The paper classifies the ways unfilterable documents are mentioned in text and estimates the practical upper bound of recall in entity-based filtering.
 
Cumulative citation recommendation refers to the problem faced by
knowledge base curators, who need to continuously screen the media for
updates regarding the knowledge base entries they manage. Automatic
system support for this entity-centric information processing problem
requires complex pipe\-lines involving both natural language
processing and information retrieval components. The default pipeline
involves four stages: filtering, classification, ranking (or scoring),
and evaluation. Filtering is an initial step that reduces the
web-scale corpus of news and other relevant information sources that
may contain entity mentions into a working set of documents that should
be more manageable for the subsequent stages.
This step has a large impact on the recall that can be achieved.
Keeping the subsequent steps constant, we therefore zoom in on the
filtering stage and conduct an in-depth analysis of the main design
decisions here:
cleansing noisy web data, the methods to create entity profiles, the
types of entities of interest, the document type, and the grade of
relevance of the document-entity pair under consideration.
We analyze how these factors (and the design choices made in their
corresponding system components) affect filtering performance.
We identify and characterize the relevant documents that do not pass
the filtering stage by examining their contents. In this way, we
estimate a practical upper bound of recall for entity-centric stream
filtering.
 
 
\end{abstract}
 
% A category with the (minimum) three required fields
 
\category{H.4}{Information Filtering}{Miscellaneous}
 
 
%A category including the fourth, optional field follows...
 
%\category{D.2.8}{Software Engineering}{Metrics}[complexity measures, performance measures]
 
 
\terms{Theory}
 
 
\keywords{Information Filtering; Cumulative Citation Recommendation; knowledge maintenance; Stream Filtering; emerging entities} % NOT required for Proceedings
 
 
\section{Introduction}
 
In 2012, the Text REtrieval Conference (TREC) introduced the Knowledge Base Acceleration (KBA) track to help Knowledge Base (KB) curators. The track addresses a critical need of KB curators: given KB (Wikipedia or Twitter) entities, filter a stream for relevant documents, rank the retrieved documents, and recommend them to the curators. The track is crucial and timely because the number of entities in a KB on the one hand, and the huge amount of new information content on the Web on the other, make manual KB maintenance challenging. TREC KBA's main task, Cumulative Citation Recommendation (CCR), aims to filter a stream to identify citation-worthy documents, rank them, and recommend them to KB curators.
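To make this default pipeline concrete, the following minimal sketch (ours, for illustration only; the stage functions are passed in as parameters and are not taken from any submitted system) composes the filtering, classification and ranking stages:

% Illustration only; not part of any TREC KBA system.
\begin{verbatim}
def ccr_pipeline(stream, profiles, passes_filter, classify, score):
    # Filtering: reduce the web-scale stream to a working set.
    working_set = [d for d in stream if passes_filter(d, profiles)]
    # Classification: label each document-entity pair.
    labeled = [(d, classify(d, profiles)) for d in working_set]
    # Ranking (scoring): order citation-worthy candidates for the
    # curator; `score` maps a (document, label) pair to a number.
    return sorted(labeled, key=score, reverse=True)
\end{verbatim}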
 
  
 
   
 
Filtering is a crucial step in CCR: it selects, from a large collection of stream documents, a potentially relevant working set for the subsequent steps of the pipeline. The TREC Filtering track defines filtering as a ``system that sifts through stream of incoming information to find documents that are relevant to a set of user needs represented by profiles'' \cite{robertson2002trec}. Adaptive filtering, one task of the Filtering track, starts with a persistent user profile and a very small number of positive examples. The filtering step used in CCR systems fits under adaptive filtering: the profiles are represented by persistent KB (Wikipedia or Twitter) entities, and there is a small set of relevance judgments representing positive examples.
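As a toy illustration of this profile-based view of filtering (our sketch, not a system from the track; it assumes a profile is simply a set of surface-form strings):

% Illustration only; exact-match filtering over a toy stream.
\begin{verbatim}
def passes_filter(document_text, profile):
    # A document passes if any profile surface form occurs in it.
    text = document_text.lower()
    return any(name.lower() in text for name in profile)

profile = {"Barack Obama", "Obama"}   # toy persistent entity profile
docs = ["Obama spoke today.", "No mention of the entity here."]
working_set = [d for d in docs if passes_filter(d, profile)]
# working_set == ["Obama spoke today."]
\end{verbatim}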
 
 
 
TREC KBA 2013's participants applied filtering as a first step to produce a smaller working set for subsequent experiments. As the subsequent steps of the pipeline use the output of the filter, the final performance of the system depends on this important step. The filtering step particularly determines the recall of the overall system. However, all submitted systems suffered from poor recall \cite{frank2013stream}. The most important components of the filtering step are cleansing and entity profiling, and each involves choices. For example, there are two versions of the corpus: cleansed and raw. Different approaches used different entity profiles for filtering, varying from the KB entities' canonical names, to DBpedia name variants, to the bold words in the first paragraph of the Wikipedia entities' pages together with anchor texts from other Wikipedia pages, to exact names plus WordNet synonyms; a sketch of these variants follows below. Moreover, the type of entities (Wikipedia or Twitter) and the category of documents (news, blogs, tweets) can influence filtering.
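The sketch below illustrates these profile variants over a hypothetical data model of ours; the field names are assumptions, not the KBA corpus schema:

% Illustration only; hypothetical entity data model.
\begin{verbatim}
from dataclasses import dataclass, field

@dataclass
class Entity:
    canonical_name: str
    dbpedia_name_variants: set = field(default_factory=set)
    bold_terms: set = field(default_factory=set)    # bold words, 1st paragraph
    anchor_texts: set = field(default_factory=set)  # from other Wikipedia pages

def build_profile(e, variant):
    # Each variant yields a different set of surface forms to filter on.
    if variant == "canonical":
        return {e.canonical_name}
    if variant == "dbpedia":
        return {e.canonical_name} | e.dbpedia_name_variants
    if variant == "wikipedia_surface":
        return e.bold_terms | e.anchor_texts
    raise ValueError(f"unknown profile variant: {variant}")
\end{verbatim}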
 
 
 
 
A variety of approaches has been employed to solve the CCR challenge. Each participant reports the steps of the pipeline and the final results in comparison to other systems; a typical TREC KBA poster presentation or talk explains the system pipeline and reports the final results. The systems may employ similar (or even the same) steps, but the choices they make at every step usually differ. In such a situation, it becomes hard to identify the factors that result in improved performance. There is a lack of insight across the different approaches, which makes it hard to know whether the improvement in performance of a particular approach is due to preprocessing, filtering, classification, scoring, or any of the sub-components of the pipeline.
 
 