Changeset - cc006e719ae7
Gebrekirstos Gebremeskel - 11 years ago 2014-06-12 05:15:00
destinycome@gmail.com
updated
2 files changed with 8 insertions and 1 deletion:
mypaper-final.tex
 
@@ -160,97 +160,97 @@ users, in the TREC scenario).
 
final performance of the system is dependent on this step. The filtering step particularly determines the recall of the overall system. However, all 141 runs submitted by 13 teams suffered from poor recall, as pointed out in the track's overview paper \cite{frank2013stream}.
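Throughout, recall is meant in the standard sense, i.e., the fraction of the relevant documents that survive the filtering step:
\[
\text{recall} = \frac{|\text{relevant} \cap \text{retained}|}{|\text{relevant}|}.
\]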
 
 
The most important components of the filtering step are cleansing (pre-processing noisy web text into a canonical ``clean'' text format) and entity profiling (creating a representation of the entity against which stream documents can be matched). For each component, different choices can be made. In the specific case of TREC KBA, the organisers provided two versions of the corpus: one that is already cleansed, and one consisting of the raw data as originally collected. Approaches also differ in the entity profiles they use for filtering: from the KB entities' canonical names to DBpedia name variants, from the bold words in the first paragraph of an entity's Wikipedia page to anchor texts from other Wikipedia pages, and from the exact name as given to WordNet-derived synonyms. The type of entities (Wikipedia or Twitter) and the category of documents in which they occur (news, blogs, or tweets) cause further variation.
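To make the role of the entity profile concrete, the following minimal Python sketch filters documents against a set of surface forms. The helper names and the example variants are illustrative only, and do not describe any particular submitted system:

\begin{verbatim}
# Minimal sketch of entity-profile-based filtering. The helpers
# and the example name variants are hypothetical, not taken from
# any submitted TREC KBA system.

def build_profile(canonical_name, name_variants=()):
    """Collect lower-cased surface forms for one KB entity."""
    forms = {canonical_name.lower()}
    forms.update(variant.lower() for variant in name_variants)
    return forms

def passes_filter(document_text, profile):
    """Keep a document if any surface form occurs in its text."""
    text = document_text.lower()
    return any(form in text for form in profile)

# A richer profile (e.g. DBpedia name variants) raises recall,
# possibly at the cost of precision.
profile = build_profile("William Henry Gates", ["Bill Gates"])
passes_filter("Bill Gates spoke at the conference.", profile)  # True
\end{verbatim}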
 
% A variety of approaches are employed  to solve the CCR
 
% challenge. Each participant reports the steps of the pipeline and the
 
% final results in comparison to other systems.  A typical TREC KBA
 
% poster presentation or talk explains the system pipeline and reports
 
% the final results. The systems may employ similar (even the same)
 
% steps  but the choices they make at every step are usually
 
% different. 
 
In such a situation, it becomes hard to identify the factors that lead to improved performance. There is a lack of insight across the different approaches, which makes it hard to know whether the performance improvement of a particular approach is due to preprocessing, filtering, classification, scoring, or any other sub-component of the pipeline.
 
 
 
In this paper, we therefore fix the subsequent steps of the pipeline and zoom in on \emph{only} the filtering step, conducting an in-depth analysis of its main components. In particular, we study the effect of cleansing, entity profiling, the type of entity filtered for (Wikipedia or Twitter), and the document category (social, news, etc.) on the filtering components' performance. The main contributions of the paper are an in-depth analysis of the factors that affect entity-based stream filtering, the identification of optimal entity profiles that do not compromise precision, a description and classification of relevant documents that are not amenable to filtering, and an estimate of the upper bound of recall for entity-based filtering.
 
 
The rest of the paper is organized as follows. Section \ref{sec:desc} describes the dataset and Section \ref{sec:fil} defines the task. In Section \ref{sec:lit}, we discuss related literature, followed by a discussion of our method in Section \ref{sec:mthd}. Following that, we present the experimental results in Section \ref{sec:expr}, and discuss and analyze them in Section \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in Section \ref{sec:impact}, and examine and categorize unfilterable documents in Section \ref{sec:unfil}. Finally, we present our conclusions in Section \ref{sec:conc}.
 
 
 
 \section{Data Description}\label{sec:desc}
 
We base this analysis on the TREC-KBA 2013 dataset%
\footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}}
that consists of three main parts: a time-stamped stream corpus, a set of KB entities to be curated, and a set of relevance judgments. A CCR system then has to identify, for each KB entity, which documents in the stream corpus should be considered by the human curator.
 
 
\subsection{Stream corpus} The stream corpus comes in two versions: raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB, respectively, after xz-compression and GPG encryption. The raw data is a dump of raw HTML pages. The cleansed version is the raw data after its HTML tags have been stripped, retaining only English documents as identified with the Chromium Compact Language Detector\footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}. The stream corpus is organized into hourly folders, each of which contains many chunk files. Each chunk file contains between a few hundred and hundreds of thousands of serialized thrift objects; one thrift object is one document. A document can be a blog article, a news article, or a social media post (including tweets). The stream corpus comes from three sources: TREC KBA 2012 (social, news, and linking)\footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}}, arxiv\footnote{\url{http://arxiv.org/}}, and spinn3r\footnote{\url{http://spinn3r.com/}}. Table \ref{tab:streams} shows the sources, the number of hourly directories, and the number of chunk files.
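To illustrate how the corpus is typically consumed, the sketch below iterates over a single chunk file. It assumes the streamcorpus Python package released alongside the TREC KBA corpora; the file path is made up, and attribute names may differ between package versions:

\begin{verbatim}
# Sketch: reading one chunk file of serialized thrift documents.
# Assumes the `streamcorpus` Python package; the path is made up
# and attribute names may differ between package versions.
import streamcorpus

for item in streamcorpus.Chunk(path='news/example-chunk.sc.xz'):
    # One thrift object = one document (a StreamItem).
    # The cleansed corpus fills body.clean_visible with the
    # tag-stripped, English-only text; the raw corpus only
    # guarantees body.raw (the original HTML).
    if item.body and item.body.clean_visible:
        print(item.stream_id, len(item.body.clean_visible))
\end{verbatim}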
 
\begin{table}
\caption{Number of documents and chunk files per source}
\begin{center}
\begin{tabular}{rrl}
Documents & Chunk files & Sub-stream \\
\hline
126,952 & 11,851 & arxiv \\
394,381,405 & 688,974 & social \\
134,933,117 & 280,658 & news \\
5,448,875 & 12,946 & linking \\
57,391,714 & 164,160 & MAINSTREAM\_NEWS (spinn3r) \\
36,559,578 & 85,769 & FORUM (spinn3r) \\
14,755,278 & 36,272 & CLASSIFIED (spinn3r) \\
52,412 & 9,499 & REVIEW (spinn3r) \\
7,637 & 5,168 & MEMETRACKER (spinn3r) \\
\hline
1,040,520,595 & 2,222,554 & Total \\
\end{tabular}
sigproc.bib
 
@@ -89,48 +89,55 @@
 
  author={Dietz, Laura and Dalton, Jeffrey},
 
 journal={Proceedings of the 22nd TREC},
 
  year={2013}
 
}
 
 
@article{zhangpris,
 
  title={PRIS at TREC2013 Knowledge Base Acceleration Track},
 
  author={Zhang, Chunyun and Xu, Weiran and Liu, Ruifang and Zhang, Weitai and Zhang, Dai and Ji, Janshu and Yang, Jing},
 
  journal={Proceedings of the 22nd TREC},
 
  year={2013}
 
}
 
 
 
@article{illiotrec2013,
 
  title={The University of Illinois' Graduate School of Library and Information Science at TREC 2013},

  author={Efron, Miles and Willis, Craig and Organisciak, Peter and Balsamo, Brian and Lucic, Ana},

  journal={Proceedings of the 22nd TREC},
 
  year={2013}
 
}
 
 
@article{uvakba2013,
 
  title={Filtering Documents over Time for Evolving Topics - The University of Amsterdam at TREC 2013 KBA CCR},
 
  author={Kenter, Tom},
 
  journal={Proceedings of the 22nd TREC},
 
  year={2013}
 
}
 
 
@article{bouvierfiltering,
 
  title={Filtering Entity Centric Documents using Numerics and Temporals features within RF Classifier},
 
  author={Bouvier, Vincent and Bellot, Patrice},
 
  journal={Proceedings of the 22nd TREC},
 
  year={2013}
 
}
 
 
@article{niauniversity,
 
  title={University of Florida Knowledge Base Acceleration Notebook},
 
  author={Nia, Morteza Shahriari and Grant, Christan and Peng, Yang and Wang, Daisy Zhe and Petrovic, Milenko},
 
  journal={Proceedings of the 22nd TREC},
 
  year={2013}
 
}
 
 
@article{frank2013stream,
 
  title={Evaluating Stream Filtering for Entity Profile Updates for TREC 2013},
 
  author={Frank, John R and Bauer, J and Kleiman-Weiner, Max and Roberts, Daniel A and Tripuraneni, Nilesh and Zhang, Ce and R{\'e}, Christopher and Voorhees, Ellen and Soboroff, Ian},

  journal={Proceedings of the 22nd TREC},
 
  year={2013}
 
}
 
 
@inproceedings{spitkovsky2012cross,
 
  title={A Cross-Lingual Dictionary for English Wikipedia Concepts.},
 
  author={Spitkovsky, Valentin I and Chang, Angel X},
 
  booktitle={LREC},
 
  pages={3168--3175},
 
  year={2012}
 
}
 
\ No newline at end of file