HCDA/cikm-paper Changeset - 94186893ebc1 · Centrum Wiskunde & Informatica (CWI)


Arjen de Vries (arjen) - 11 years ago 2014-06-12 06:37:05
arjen.de.vries@cwi.nl

ref error

1 file changed with 10 insertions and 3 deletions:

mypaper-final.tex


@@ -185,51 +185,58 @@ occur (news, blogs, or tweets) cause further variations.
 % poster presentation or talk explains the system pipeline and reports
 % the final results. The systems may employ similar (even the same)
 % steps  but the choices they make at every step are usually
 % different.
 In such a situation, it becomes hard to identify the factors that
 result in improved performance. There is  a lack of insight across
 different approaches. This makes  it hard to know whether the
 improvement in performance of a particular approach is due to
 preprocessing, filtering, classification, scoring  or any of the
 sub-components of the pipeline.
 In this paper, we therefore fix the subsequent steps of the pipeline,
 zoom in on \emph{only} the filtering step, and conduct an in-depth analysis of its
 main components.  In particular, we study the effect of cleansing,
 entity profiling, type of entity filtered for (Wikipedia or Twitter), and
 document category (social, news, etc.) on the filtering components'
 performance. The main contributions of the
 paper are an in-depth analysis of the factors that affect entity-based
 stream filtering, identifying optimal entity profiles without
 compromising precision, describing and classifying relevant documents
 that are not amenable to filtering, and estimating the upper bound
 of recall on entity-based filtering.
 The rest of the paper  is organized as follows. Section \ref{sec:desc} describes the dataset and section \ref{sec:fil} defines the task. In section  \ref{sec:lit}, we discuss related litrature folowed by a discussion of our method in \ref{sec:mthd}. Following that,  we present the experimental resulsy in \ref{sec:expr}, and discuss and analyze them in \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in section \ref{sec:impact}, examine and categorize unfilterable docuemnts in section \ref{sec:unfil}. Finally, we present our conclusions in \ref{sec:conc}.
 The rest of the paper  is organized as follows. Section \ref{sec:desc} describes the dataset and section \ref{sec:fil} defines the task. In section  \ref{sec:lit}, we discuss related literature folowed by a discussion of our method in \ref{sec:mthd}. Following that,  we present the experimental resulsy in \ref{sec:expr}, and discuss and analyze them in \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in section \ref{sec:impact}, examine and categorize unfilterable documents in section \ref{sec:unfil}. Finally, we present our conclusions in \ref{}{sec:conc}.
 The rest of the paper is organized as follows. Section~\ref{sec:desc}
 describes the dataset and Section~\ref{sec:fil} defines the task. In
 Section~\ref{sec:lit}, we discuss related literature, followed by a
 discussion of our method in Section~\ref{sec:mthd}. Following that, we
 present the experimental results in Section~\ref{sec:expr}, and discuss and
 analyze them in Section~\ref{sec:analysis}. Towards the end, we discuss the
 impact of filtering choices on classification in
 Section~\ref{sec:impact}, and examine and categorize unfilterable documents in
 Section~\ref{sec:unfil}. Finally, we present our conclusions in
 Section~\ref{sec:conc}.
  \section{Data Description}\label{sec:desc}
 We base this analysis on the TREC-KBA 2013 dataset%
 \footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}}
 that consists of three main parts: a time-stamped stream corpus, a set of
 KB entities to be curated, and a set of relevance judgments. A CCR
 system now has to identify for each KB entity which documents in the
 stream corpus are to be considered by the human curator.
 \subsection{Stream corpus} The stream corpus comes in two versions:
 raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB,
 respectively, after xz-compression and GPG encryption. The raw data
 is a dump of raw HTML pages. The cleansed version is the raw data
 after its HTML tags are stripped off and only English documents
 identified with the Chromium Compact Language Detector%
 \footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}
 are included.  The stream corpus is organized in hourly folders each
 of which contains many  chunk files. Each chunk file contains between
 hundreds and hundreds of thousands of serialized  thrift objects. One
 thrift object is one document. A document could be a blog article, a
 news article, or a social media post (including tweets).  The stream
 corpus comes from three sources: TREC KBA 2012 (social, news and
