HCDA/cikm-paper Changeset - f498db58227f · Centrum Wiskunde & Informatica (CWI)

Changeset - f498db58227f

Parent rev.

Child rev.

[Not reviewed]

0 1 0

Gebrekirstos Gebremeskel - 11 years ago 2014-06-12 05:44:17
destinycome@gmail.com

updated

1 file changed with 4 insertions and 8 deletions:

mypaper-final.tex

0 comments (0 inline, 0 general)

mypaper-final.tex

➞

Show inline comments

 respectively,  after xz-compression and GPG encryption. The raw data
 is a  dump of  raw HTML pages. The cleansed version is the raw data
 after its HTML tags are stripped off and only English documents
-identified with Chromium Compact Language Detector%
+identified with Chromium Compact Language Detector
 \footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}
 are included.  The stream corpus is organized in hourly folders each
 of which contains many  chunk files. Each chunk file contains between
 thrift object is one document. A document could be a blog article, a
 news article, or a social media post (including tweet).  The stream
 corpus comes from three sources: TREC KBA 2012 (social, news and
 linking)%
 \footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}%
 },
 arxiv%
 \footnote{\url{http://arxiv.org/}%
 }, and
 linking) \footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}},
 arxiv\footnote{\url{http://arxiv.org/}}, and
 spinn3r\footnote{\url{http://spinn3r.com/}}.
 Table \ref{tab:streams} shows the sources, the number of hourly
 directories, and the number of chunk files.
 \section{Conclusions} \label{sec:conc}
 In this paper, we examined the filtering stage of the entity-centric stream filtering and ranking  by holding the later stages of fixed. In particular, we studied the cleansing step, different entity profiles, type of entities(Wikipedia or Twitter), categories of documents(news, social, or others) and the relevance ratings. We attempted to address the following research questions: 1) does cleansing affect filtering and subsequent performance? 2) what is the most effective way of entity profiling? 3) is filtering different for Wikipedia and Twitter entities? 4) are some type of documents easily filterable and others not? 5) does a gain in recall at filtering step translate to a gain in F-measure at the end of the pipeline? and 6) what are the circumstances under which vital documents can not be retrieved?
-Cleansing does remove parts or entire contents of documents making them irretrievable. However, because of the introduction of false positives, recall gains by  raw corpus and some  richer entity profiles do not necessarily translate to overall performance gain. The results conclusion on this is mixed in the sense that cleansing helps improve the recall on vital documents and Wikipedia entities, but reduces the recall on Twitter entities and the relative category of relevance ranking. Vital and relevant documents show a difference in retrieval nonperformance documents are easier to filter than relevant.
+Cleansing does remove parts or entire contents of documents making them irretrievable. However, because of the introduction of false positives, recall gains by  raw corpus and some  richer entity profiles do not necessarily translate to overall performance gain. The  conclusion on this is mixed in the sense that cleansing helps improve the recall on vital documents and Wikipedia entities, but reduces the recall on Twitter entities and the relative category of relevance ranking. Vital and relevant documents show a difference in retrieval performance. Vital documents are easier to filter than relevant.
 Despite an aggressive attempt to filter as many vital-relevant documents as possible,  we observe that there are still documents that we miss. While some are possible to retrieve with some modifications, some others are not. There are some document that indicate that an information filtering system does not seem to get them no matter how rich representation of entities they use. These circumstances under which this happens are many. We found that some documents have no content at all, subjectivity(it is not clear why some are judged vital). However, the main circumstances under which vital  documents can defy filtering is: outgoing link mentions,

0 comments (0 inline, 0 general)