HCDA/cikm-paper Changeset - e4e795a2cd60 · Centrum Wiskunde & Informatica (CWI)

Changeset - e4e795a2cd60


Arjen de Vries (arjen) - 11 years ago 2014-06-12 05:23:36
arjen.de.vries@cwi.nl

footnote lesson applied

1 file changed with 7 insertions and 3 deletions:

mypaper-final.tex

 We base this analysis on the TREC-KBA 2013 dataset%
 \footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}}
 that consists of three main parts: a time-stamped stream corpus, a set of
 KB entities to be curated, and a set of relevance judgments. A CCR
 system now has to identify for each KB entity which documents in the
 stream corpus are to be considered by the human curator.
 \subsection{Stream corpus} The stream corpus comes in two versions:
 raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB
 respectively,  after xz-compression and GPG encryption. The raw data
 is a  dump of  raw HTML pages. The cleansed version is the raw data
 after its HTML tags are stripped off and only English documents
-identified with Chromium Compact Language Detector
+identified with Chromium Compact Language Detector%
 \footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}
 are included.  The stream corpus is organized in hourly folders each
 of which contains many  chunk files. Each chunk file contains between
 hundreds and hundreds of thousands of serialized  thrift objects. One
 thrift object is one document. A document could be a blog article, a
 news article, or a social media post (including tweets).  The stream
 corpus comes from three sources: TREC KBA 2012 (social, news and
-linking) \footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}},
-arxiv\footnote{\url{http://arxiv.org/}}, and
+linking)%
+\footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}%
+},
+arxiv%
+\footnote{\url{http://arxiv.org/}%
+}, and
 spinn3r\footnote{\url{http://spinn3r.com/}}.
 Table \ref{tab:streams} shows the sources, the number of hourly
 directories, and the number of chunk files.
 \begin{table}
 \caption{Number of documents and chunk files per source}
 \begin{center}
  \begin{tabular}{l*{4}{l}l}
  documents     &   chunk files    &    Sub-stream \\
 \hline
 ,952         &11,851         &arxiv \\
