From 153537134253e564e7e8a9cf99ead6679e875dbb 2014-06-12 05:24:27
From: Gebrekirstos Gebremeskel
Date: 2014-06-12 05:24:27
Subject: [PATCH] merge

Merge branch 'master' of https://scm.cwi.nl/IA/cikm-paper
---
diff --git a/mypaper-final.tex b/mypaper-final.tex
index 13077f056ee452137910c91969c62f7564d2a96f..83b176020f513891dc269b588939ffdf315feaa3 100644
--- a/mypaper-final.tex
+++ b/mypaper-final.tex
@@ -225,7 +225,7 @@ raw and cleaned. The raw and cleansed versions are 6.45TB and 4.5TB
 respectively, after xz-compression and GPG encryption. The raw data
 is a dump of raw HTML pages. The cleansed version is the raw data
 after its HTML tags are stripped off and only English documents
-identified with Chromium Compact Language Detector
+identified with Chromium Compact Language Detector%
 \footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}
 are included. The stream corpus is organized in hourly folders each
 of which contains many chunk files. Each chunk file contains between
@@ -233,8 +233,12 @@ hundreds and hundreds of thousands of serialized thrift objects. One
 thrift object is one document. A document could be a blog article, a
 news article, or a social media post (including tweet). The stream
 corpus comes from three sources: TREC KBA 2012 (social, news and
-linking) \footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}},
-arxiv\footnote{\url{http://arxiv.org/}}, and
+linking)%
+\footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}%
+},
+arxiv%
+\footnote{\url{http://arxiv.org/}%
+}, and
 spinn3r\footnote{\url{http://spinn3r.com/}}. Table
 \ref{tab:streams} shows the sources, the number of hourly
 directories, and the number of chunk files.