HCDA/cikm-paper Changeset - 72c469e5cb90 · Centrum Wiskunde & Informatica (CWI)


Arjen de Vries (arjen) - 11 years ago 2014-06-12 03:39:03
arjen.de.vries@cwi.nl

conflict traces removed

1 file changed with 0 insertions and 5 deletions:

mypaper-final.tex


@@ -35,199 +35,194 @@
 \title{Entity-Centric Stream Filtering and Ranking: Filtering and Unfilterable Documents
+}
 %SUGGESTION:
 %\title{The Impact of Entity-Centric Stream Filtering on Recall and
 %  Missed Documents}
+%
 % You need the command \numberofauthors to handle the 'placement
 % and alignment' of the authors beneath the title.
+%
 % For aesthetic reasons, we recommend 'three authors at a time'
 % i.e. three 'name/affiliation blocks' be placed beneath the title.
+%
 % NOTE: You are NOT restricted in how many 'rows' of
 % "name/affiliations" may appear. We just ask that you restrict
 % the number of 'columns' to three.
+%
 % Because of the available 'opening page real-estate'
 % we ask you to refrain from putting more than six authors
 % (two rows with three columns) beneath the article title.
 % More than six makes the first-page appear very cluttered indeed.
+%
 % Use the \alignauthor commands to handle the names
 % and affiliations for an 'aesthetic maximum' of six authors.
 % Add names, affiliations, addresses for
 % the seventh etc. author(s) as the argument for the
 % \additionalauthors command.
 % These 'additional authors' will be output/set for you
 % without further effort on your part as the last section in
 % the body of your article BEFORE References or any Appendices.
 \numberofauthors{8} %  in this sample file, there are a *total*
 % of EIGHT authors. SIX appear on the 'first-page' (for formatting
 % reasons) and the remaining two appear in the \additionalauthors section.
+%
 % \author{
 % % You can go ahead and credit any number of authors here,
 % % e.g. one 'row of three' or two rows (consisting of one row of three
 % % and a second row of one, two or three).
 % %
 % % The command \alignauthor (no curly braces needed) should
 % % precede each author name, affiliation/snail-mail address and
 % % e-mail address. Additionally, tag each line of
 % % affiliation/address with \affaddr, and tag the
 % % e-mail address with \email.
 % %
 % % 1st. author
 % \alignauthor
 % Ben Trovato\titlenote{Dr.~Trovato insisted his name be first.}\\
 %        \affaddr{Institute for Clarity in Documentation}\\
 %        \affaddr{1932 Wallamaloo Lane}\\
 %        \affaddr{Wallamaloo, New Zealand}\\
 %        \email{trovato@corporation.com}
 % % 2nd. author
 % \alignauthor
 % G.K.M. Tobin\titlenote{The secretary disavows
 % any knowledge of this author's actions.}\\
 %        \affaddr{Institute for Clarity in Documentation}\\
 %        \affaddr{P.O. Box 1212}\\
 %        \affaddr{Dublin, Ohio 43017-6221}\\
 %        \email{webmaster@marysville-ohio.com}
 % }
 % There's nothing stopping you putting the seventh, eighth, etc.
 % author on the opening page (as the 'third row') but we ask,
 % for aesthetic reasons that you place these 'additional authors'
 % in the \additional authors block, viz.
 % Just remember to make sure that the TOTAL number of authors
 % is the number that will appear on the first page PLUS the
 % number that will appear in the \additionalauthors section.
 \maketitle
 \begin{abstract}
 Cumulative citation recommendation refers to the problem faced by
 knowledge base curators, who need to continuously screen the media for
 updates regarding the knowledge base entries they manage. Automatic
 system support for this entity-centric information processing problem
 requires complex pipe\-lines involving both natural language
 processing and information retrieval components. The pipeline
 encountered in a variety of systems that approach this problem
 involves four stages: filtering, classification, ranking (or scoring),
 and evaluation. Filtering is only an initial step that reduces the
 web-scale corpus of news and other relevant information sources that
 may contain entity mentions into a working set of documents that should
 be more manageable for the subsequent stages.
 Nevertheless, this step has a large impact on the recall that can be
 maximally attained! Therefore, in this study, we focus on just
 this filtering stage and conduct an in-depth analysis of the main design
 decisions here: how to cleanse the noisy text obtained online,
 the methods to create entity profiles, the
 types of entities of interest, document type, and the grade of
 relevance of the document-entity pair under consideration.
 We analyze how these factors (and the design choices made in their
 corresponding system components) affect filtering performance.
 We identify and characterize the relevant documents that do not pass
 the filtering stage by examining their contents. This way, we
 estimate a practical upper-bound of recall for entity-centric stream
 filtering.
 \end{abstract}
 % A category with the (minimum) three required fields
 \category{H.4}{Information Filtering}{Miscellaneous}
 %A category including the fourth, optional field follows...
 %\category{D.2.8}{Software Engineering}{Metrics}[complexity measures, performance measures]
 \terms{Theory}
 \keywords{Information Filtering; Cumulative Citation Recommendation; knowledge maintenance; Stream Filtering;  emerging entities} % NOT required for Proceedings
 \section{Introduction}
 In 2012, the Text REtrieval Conference (TREC) introduced the Knowledge Base Acceleration (KBA) track to help Knowledge Base (KB) curators. The track addresses a critical need of KB curators: given KB (Wikipedia or Twitter) entities, filter a stream for relevant documents, rank the retrieved documents, and recommend them to the curators. The track is crucial and timely because the large number of entities in a KB on the one hand, and the huge amount of new content on the Web on the other, make manual KB maintenance challenging. TREC KBA's main task, Cumulative Citation Recommendation (CCR), aims at filtering a stream to identify citation-worthy documents, ranking them, and recommending them to KB curators.
  Filtering is a crucial step in CCR for selecting a potentially
  relevant set of working documents for subsequent steps of the
  pipeline out of a big collection of stream documents. The TREC
  Filtering track defines filtering as a ``system that sifts through
  stream of incoming information to find documents that are relevant to
  a set of user needs represented by profiles''
  \cite{robertson2002trec}.
 In the specific setting of CCR, these profiles are
 represented by persistent KB entities (Wikipedia pages or Twitter
 users, in the TREC scenario).
  TREC-KBA 2013's participants applied filtering as a first step to
  produce a smaller working set for subsequent experiments. As the
  subsequent steps of the pipeline use the output of the filter, the
  final performance of the system is dependent on this step.  The
  filtering step particularly determines the recall of the overall
  system. However, all 141 runs submitted by 13 teams suffered from
  poor recall, as pointed out in the track's overview paper
  \cite{frank2013stream}.
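 One way to make this dependence explicit is the following sketch, whose
 notation is an illustrative assumption rather than part of the track
 definition. For a given entity, let $R$ denote the (non-empty) set of
 relevant stream documents, $F$ the set of documents that pass the
 filter, and $S \subseteq F$ the set finally returned by the pipeline. Then
 \begin{equation}
 \frac{|S \cap R|}{|R|} \;\le\; \frac{|F \cap R|}{|R|},
 \end{equation}
 so, under these assumptions, the recall of the filtering step bounds
 the recall that any later stage can attain.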
 The most important components of the filtering step are cleansing
 (referring to pre-processing noisy web text into a canonical ``clean''
 text format), and
 entity profiling (creating a representation of the entity against which
 stream documents can be matched). For each component, different
 choices can be made. In the specific case of TREC KBA, organisers have
 provided two different versions of the corpus: one that is already cleansed,
 and one that is the raw data as originally collected by the organisers.
 Also, different
 approaches use different entity profiles for filtering, varying from
 using just the KB entities' canonical names to looking up DBpedia name
 variants, and from using the bold words in the first paragraph of the Wikipedia
 entity's page to using anchor texts from other Wikipedia pages, and from
 using the exact name as given to WordNet-derived synonyms. The type of entities
 (Wikipedia or Twitter) and the category of documents in which they
 occur (news, blogs, or tweets) cause further variations.
 % A variety of approaches are employed  to solve the CCR
 % challenge. Each participant reports the steps of the pipeline and the
 % final results in comparison to other systems.  A typical TREC KBA
 % poster presentation or talk explains the system pipeline and reports
 % the final results. The systems may employ similar (even the same)
 % steps  but the choices they make at every step are usually
 % different.
 In such a situation, it becomes hard to identify the factors that
 result in improved performance. There is  a lack of insight across
 different approaches. This makes  it hard to know whether the
 improvement in performance of a particular approach is due to
 preprocessing, filtering, classification, scoring  or any of the
 sub-components of the pipeline.
 In this paper, we therefore fix the subsequent steps of the pipeline,
 and zoom in on \emph{only} the filtering step; and conduct an in-depth analysis of its
 main components.  In particular, we study the effect of cleansing,
 entity profiling, type of entity filtered for (Wikipedia or Twitter), and
 document category (social, news, etc.) on the filtering components'
 performance. The main contributions of the
 paper are an in-depth analysis of the factors that affect entity-based
 stream filtering, the identification of optimal entity profiles without
 compromising precision, a description and classification of relevant documents
 that are not amenable to filtering, and an estimate of the upper-bound
 of recall on entity-based filtering.
 The rest of the paper is organized as follows:
 \textbf{TODO!!}
  \section{Data Description}
 We base this analysis on the TREC-KBA 2013 dataset%
 \footnote{http://trec-kba.org/trec-kba-2013.shtml}
 that consists of three main parts: a time-stamped stream corpus, a set of
 KB entities to be curated, and a set of relevance judgments. A CCR
 system now has to identify for each KB entity which documents in the
 stream corpus are to be considered by the human curator.
 \subsection{Stream corpus} The stream corpus comes in two versions:
 raw and cleansed. These versions are 6.45TB and 4.5TB
 respectively,  after xz-compression and GPG encryption. The raw data
 is a  dump of  raw HTML pages. The cleansed version is the raw data
 after its HTML tags are stripped off and only English documents
