HCDA/cikm-paper Changeset - a6c88c797216 · Centrum Wiskunde & Informatica (CWI)

Gebrekirstos Gebremeskel - 11 years ago 2014-06-12 06:53:59
destinycome@gmail.com

Merge branch 'master' of https://scm.cwi.nl/IA/cikm-paper

1 file changed with 17 insertions and 10 deletions:

mypaper-final.tex

  pipeline out of a big collection of stream documents. Filtering sifts incoming information for content relevant to user profiles \cite{robertson2002trec}. In the specific setting of CCR, these profiles are
 represented by persistent KB entities (Wikipedia pages or Twitter
 users, in the TREC scenario).
  TREC-KBA 2013's participants applied Filtering as a first step  to
  produce a smaller working set for subsequent experiments. As the
  subsequent steps of the pipeline use the output of the filter, the
  final performance of the system is dependent on this step.  The
  filtering step particularly determines the recall of the overall
  system. However, all 141 runs submitted by 13 teams suffered from
  poor recall, as pointed out in the track's overview paper
  \cite{frank2013stream}.
 The most important components of the filtering step are cleansing
 (referring to pre-processing noisy web text into a canonical ``clean''
 text format), and
 entity profiling (creating a representation of the entity against which
 the stream documents can be matched). For each component, different
 choices can be made. In the specific case of TREC KBA, organisers have
 provided two different versions of the corpus: one that is already cleansed,
 and one that is the raw data as originally collected by the organisers.
 Also, different
 approaches use different entity profiles for filtering, varying from
 using just the KB entities' canonical names to looking up DBpedia name
 variants, from using the bold words in the first paragraph of the entity's
 Wikipedia page to using anchor texts from other Wikipedia pages, and from
 using the exact name as given to using WordNet-derived synonyms. The type of entities
 (Wikipedia or Twitter) and the category of documents in which they
 occur (news, blogs, or tweets) cause further variations.
 % A variety of approaches are employed  to solve the CCR
 % challenge. Each participant reports the steps of the pipeline and the
 % final results in comparison to other systems.  A typical TREC KBA
 % poster presentation or talk explains the system pipeline and reports
 % the final results. The systems may employ similar (even the same)
 % steps  but the choices they make at every step are usually
 % different.
 In such a situation, it becomes hard to identify the factors that
 result in improved performance. There is  a lack of insight across
 different approaches. This makes  it hard to know whether the
 improvement in performance of a particular approach is due to
 preprocessing, filtering, classification, scoring  or any of the
 sub-components of the pipeline.
 In this paper, we therefore fix the subsequent steps of the pipeline,
 and zoom in on \emph{only} the filtering step, conducting an in-depth analysis of its
 main components.  In particular, we study the effect of cleansing,
 entity profiling, type of entity filtered for (Wikipedia or Twitter), and
 document category (social, news, etc.) on the filtering components'
-performance. The main contribution of the
+performance. The main contributions of the
 paper are an in-depth analysis of the factors that affect entity-based
 stream filtering, identifying optimal entity profiles without
 compromising precision, describing and classifying relevant documents
 that are not amenable to filtering, and estimating the upper-bound
 of recall on entity-based filtering.
 The rest of the paper is organized as follows. Section \ref{sec:desc}
 describes the dataset and Section \ref{sec:fil} defines the task. In
 Section \ref{sec:lit}, we discuss related literature, followed by a
 discussion of our method in Section \ref{sec:mthd}. Following that, we
 present the experimental results in Section \ref{sec:expr}, and discuss and
 analyze them in Section \ref{sec:analysis}. Towards the end, we discuss the
 impact of filtering choices on classification in Section
 \ref{sec:impact}, and examine and categorize unfilterable documents in
 Section \ref{sec:unfil}. Finally, we present our conclusions in
 Section \ref{sec:conc}.
  \section{Data Description}\label{sec:desc}
-We base this analysis on the TREC-KBA 2013 dataset%
+Our study is carried out using the TREC-KBA 2013 dataset%
 \footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}}
 that consists of three main parts: a time-stamped stream corpus, a set of
 KB entities to be curated, and a set of relevance judgments. A CCR
 system now has to identify for each KB entity which documents in the
 stream corpus are to be considered by the human curator.
 \subsection{Stream corpus} The stream corpus comes in two versions:
 raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB
 respectively,  after xz-compression and GPG encryption. The raw data
 is a  dump of  raw HTML pages. The cleansed version is the raw data
 after its HTML tags are stripped off and only English documents
 identified with Chromium Compact Language Detector
 \footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}
-are included.  The stream corpus is organized in hourly folders each
-of which contains many  chunk files. Each chunk file contains between
-hundreds and hundreds of thousands of serialized  thrift objects. One
-thrift object is one document. A document could be a blog article, a
+are included.  The stream corpus is organized in hourly folders, each
+of which contains many ``chunk files''. The chunk files contain
+hundreds up to hundreds of thousands of semi-structured documents,
+serialized as thrift objects (one thrift object corresponding to one
+document). A document could be a blog article, a
 news article, or a social media post (including tweets). The stream
-corpus comes from three sources: TREC KBA 2012 (social, news and
-linking) \footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}},
-arxiv\footnote{\url{http://arxiv.org/}}, and
+corpus has been derived from three main sources: TREC KBA 2012
+(containing blogs, news and web pages with shortened URLs)%
+\footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}%
+},
+arXiv\footnote{\url{http://arxiv.org/}}, and
 spinn3r\footnote{\url{http://spinn3r.com/}}.
 Table \ref{tab:streams} shows the sources, the number of documents,
 and the number of chunk files.
 \begin{table}
 \caption{Number of documents and chunk files per sub-stream source}
 \begin{center}
  \begin{tabular}{l*{4}{l}l}
  documents     &   chunk files    &    Sub-stream \\
 \hline
 ,952         &11,851         &arxiv \\
 ,381,405      &   688,974        & social \\
 ,933,117       &  280,658       &  news \\
 ,448,875         &12,946         &linking \\
 ,391,714         &164,160      &   MAINSTREAM\_NEWS (spinn3r)\\
 ,559,578         &85,769      &   FORUM (spinn3r)\\
 ,755,278         &36,272     &    CLASSIFIED (spinn3r)\\
 ,412         &9,499         &REVIEW (spinn3r)\\
 ,637         &5,168         &MEMETRACKER (spinn3r)\\
 ,040,520,595   &      2,222,554 &        Total\\
 \end{tabular}
 \end{center}
 \label{tab:streams}
 \end{table}
 \subsection{KB entities}
- The KB entities consist of 20 Twitter entities and 121 Wikipedia entities. The selected entities are, on purpose, sparse. The entities consist of 71 people, 1 organization, and 24 facilities.
+The KB entities in the dataset consist of 20 Twitter entities and 121
+Wikipedia entities. These selected entities are, on purpose, ``sparse
+entities'' in the sense that they only occur in relatively few
+documents. The collection of entities consists of 71 people entities,
+1 organization entity, and 24 facilities entities.
 \subsection{Relevance judgments}
 TREC-KBA provided relevance judgments for training and
 testing. Relevance judgments are given as document-entity
 pairs. Documents with citation-worthy content for a given entity are
 annotated as \emph{vital}, while documents with tangentially
 relevant content, documents that lack freshness, or documents with content
 that can be useful for an initial KB dossier are annotated as
 \emph{relevant}. Documents with no relevant content are labeled
 \emph{neutral} and spam is labeled as \emph{garbage}.
 %The inter-annotator agreement on vital in 2012 was 70\% while in 2013 it
 %is 76\%. This is due to the more refined definition of vital and the
 %distinction made between vital and relevant.
 \subsection{Breakdown of results by document source category}
 %The results of the different entity profiles on the raw corpus are
 %broken down by source categories and relevance rank% (vital, or
 %relevant).
 In total, the dataset contains 24162 unique entity-document
 pairs, vital or relevant; 9521 of these have been labeled as vital,
 and the remaining 17424 as relevant.
 All documents are categorized into 8 source categories: 0.98\%
 arxiv(a), 0.034\% classified(c), 0.34\% forum(f), 5.65\% linking(l),
 11.53\% mainstream-news(m-n), 18.40\% news(n), 12.93\% social(s) and
 50.2\% weblog(w). We have regrouped these source categories into three
 groups, ``news'', ``social'', and ``other'', for two reasons: 1) Some groups
 are very similar to each other. Mainstream-news and news are
 similar; they exist separately in the first place only
 because they were collected from two different sources, by different
 groups and at different times. We call them news from now on.  The
 same is true of weblog and social, and we call them social from now
 on.   2) Some groups have so few annotations that treating
 them independently does not make much sense. The majority of vital or
 relevant annotations are social (social and weblog) (63.13\%). News
 (mainstream-news and news) makes up 30\%. Thus, news and social together make up about
 93\% of all annotations.  The rest make up about 7\% and are all
 grouped as ``other''.
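 As a rough consistency check on this regrouping, using only the
 percentages stated above and the same rounding as the text, the share
 of the four small categories (arxiv, classified, forum, linking) and
 the combined share of news and social follow from:
 \[
   0.98 + 0.034 + 0.34 + 5.65 = 7.004 \approx 7 \quad \mbox{(other, in \%)},
 \]
 \[
   100 - 7.004 = 92.996 \approx 93 \quad \mbox{(news and social, in \%)}.
 \]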
  \section{Stream Filtering}\label{sec:fil}
  The TREC Filtering track defines filtering as a ``system that sifts
  through stream of incoming information to find documents that are
  relevant to a set of user needs represented by profiles''
  \cite{robertson2002trec}. Its information needs are long-term and are
  represented by persistent profiles, unlike the traditional search system
  whose ad hoc information need is represented by a search
