Changeset - 153537134253
[Not reviewed]
Merge
Gebrekirstos Gebremeskel - 2014-06-12 05:24:27
destinycome@gmail.com
Merge branch 'master' of https://scm.cwi.nl/IA/cikm-paper
1 file changed with 7 insertions and 3 deletions:
0 comments (0 inline, 0 general)
mypaper-final.tex
 
@@ -225,7 +225,7 @@ raw and cleaned. The raw and cleansed versions are 6.45TB and 4.5TB
 respectively,  after xz-compression and GPG encryption. The raw data
 is a  dump of  raw HTML pages. The cleansed version is the raw data
 after its HTML tags are stripped off and only English documents
-identified with Chromium Compact Language Detector
+identified with Chromium Compact Language Detector%
 \footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}
 are included.  The stream corpus is organized in hourly folders each
 of which contains many  chunk files. Each chunk file contains between
@@ -233,8 +233,12 @@ hundreds and hundreds of thousands of serialized  thrift objects. One
 thrift object is one document. A document could be a blog article, a
 news article, or a social media post (including tweet).  The stream
 corpus comes from three sources: TREC KBA 2012 (social, news and
-linking) \footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}},
-arxiv\footnote{\url{http://arxiv.org/}}, and
+linking)%
+\footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}%
+},
+arxiv%
+\footnote{\url{http://arxiv.org/}%
+}, and
 spinn3r\footnote{\url{http://spinn3r.com/}}.
 Table \ref{tab:streams} shows the sources, the number of hourly
 directories, and the number of chunk files.