From 16cf454c73fada6ab8df9727c73eb5752870a201 2014-06-11 20:00:40
From: Gebrekirstos Gebremeskel
Date: 2014-06-11 20:00:40
Subject: [PATCH] Merge branch 'master' of https://scm.cwi.nl/IA/cikm-paper
---
diff --git a/mypaper-final.tex b/mypaper-final.tex
index ac2f10bd55e18a1933db7b28580fde2c3ae66721..f6fbc7491693ca4b95b13603fd87455b3cb8782d 100644
--- a/mypaper-final.tex
+++ b/mypaper-final.tex
@@ -102,9 +102,29 @@
 \maketitle
 \begin{abstract}
-
-Entity-centric information processing requires complex pipelines involving both natural language processing and information retrieval components. In entity-centric stream filtering and ranking, the pipeline involves four important stages: filtering, classification, ranking(scoring) and evaluation. Filtering is an important step that creates a manageable working set of documents from a web-scale corpus for the next stages. It thus determines the performance of the overall system. Keeping the subsequent steps constant, we zoom in on the filtering stage and conduct an in-depth analysis of the main components of cleansing, entity profiles, relevance levels, category of documents and entity types with a view to understanding the factors and choices that affect filtering performance. The study demonstrates the most effective entity profiling, identifies those relevant documents that defy filtering and conducts manual examination into their contents. The paper classifies the ways unfilterable documents
-are mentioned in text and estimates the practical upper-bound of recall in entity-based filtering.
+Entity-centric information processing requires complex pipelines
+involving both natural language processing and information retrieval
+components. In entity-centric stream filtering and ranking, the
+pipeline involves four stages: filtering, classification,
+ranking (scoring) and evaluation. Filtering is an initial step that
+extracts a working set of documents from the web-scale corpus, aiming
+for a smaller collection that is more manageable in the
+subsequent stages of the pipeline. This filtering step therefore
+determines the maximally attainable performance of the overall system.
+
+This paper investigates the filtering stage in isolation, in the context
+of a cumulative citation recommendation problem. We conduct an
+in-depth analysis of the main factors that determine filtering
+effectiveness: cleansing noisy web data, methods to create entity
+profiles, the types of entity of interest, document category, and the
+relevance level of the entity-document pair under consideration.
+We analyze how these factors (and the design choices made in their
+corresponding system components) affect filtering performance.
+We identify and characterize the relevant documents that do not pass the
+filtering stage, and conduct a manual examination of their
+contents. The paper classifies the ways unfilterable documents
+are mentioned in text and estimates the practical upper bound of
+recall in entity-based filtering.
 \end{abstract}
 % A category with the (minimum) three required fields