diff --git a/mypaper-final.tex b/mypaper-final.tex
index 4fed872dee5eda82720b9ffa7e716c95a5b44946..2570fd89e80c91e50152cafd878b5ed28c6f1fcb 100644
--- a/mypaper-final.tex
+++ b/mypaper-final.tex
@@ -221,7 +221,7 @@ raw and cleaned.
 The raw and cleansed versions are 6.45TB and 4.5TB respectively, after
 xz-compression and GPG encryption. The raw data is a dump of raw HTML pages.
 The cleansed version is the raw data after its HTML tags are stripped off and only English documents
-identified with Chromium Compact Language Detector
+identified with Chromium Compact Language Detector%
 \footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}
 are included. The stream corpus is organized in hourly folders each of
 which contains many chunk files. Each chunk file contains between
@@ -229,8 +229,12 @@ hundreds and hundreds of thousands of serialized thrift objects. One
 thrift object is one document. A document could be a blog article, a
 news article, or a social media post (including tweet). The stream
 corpus comes from three sources: TREC KBA 2012 (social, news and
-linking) \footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}},
-arxiv\footnote{\url{http://arxiv.org/}}, and
+linking)%
+\footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}%
+},
+arxiv%
+\footnote{\url{http://arxiv.org/}%
+}, and
 spinn3r\footnote{\url{http://spinn3r.com/}}. Table \ref{tab:streams}
 shows the sources, the number of hourly directories, and the number of
 chunk files.