Changeset - 2049a41bcedf
[Not reviewed]
Arjen de Vries (arjen) - 11 years ago 2014-06-12 05:43:29
arjen.de.vries@cwi.nl
working on conclusions
1 file changed with 40 insertions and 8 deletions:
0 comments (0 inline, 0 general)
mypaper-final.tex
 
@@ -1058,7 +1058,7 @@ and name-variants bring in new relevant documents that can not be retrieved by c
 
 
% 
 
 
 The use of name-variant partial for filtering is an aggressive attempt to retrieve as many relevant documents as possible at the cost of retrieving irrelevant documents. However, we still miss about  2363(10\%) of the vital-relevant documents.  Why are these documents missed? If they are not mentioned by partial names of name variants, what are they mentioned by? Table \ref{tab:miss} shows the documents that we miss with respect to cleansed and raw corpus.  The upper part shows the number of documents missing from cleansed and raw versions of the corpus. The lower part of the table shows the intersections and exclusions in each corpus.  
 
 The use of name-variant partial for filtering is an exhaustive attempt to retrieve as many relevant documents as possible at the cost of retrieving irrelevant documents. However, we still miss 2363 (about 10\%) of the vital-relevant documents. Why are these documents missed? If they are not mentioned by partial names of name variants, what are they mentioned by? Table \ref{tab:miss} shows the documents that we miss with respect to the cleansed and raw corpora. The upper part shows the number of documents missing from the cleansed and raw versions of the corpus. The lower part of the table shows the intersections and exclusions in each corpus.
 
 
\begin{table}
 
\caption{The number of documents missing  from raw and cleansed extractions. }
 
@@ -1115,13 +1115,45 @@ We observed that there are vital-relevant documents that we miss from raw only,
 
 
 
\section{Conclusions} \label{sec:conc}
 
In this paper, we examined the filtering stage of the entity-centric stream filtering and ranking  by holding the later stages of fixed. In particular, we studied the cleansing step, different entity profiles, type of entities(Wikipedia or Twitter), categories of documents(news, social, or others) and the relevance ratings. We attempted to address the following research questions: 1) does cleansing affect filtering and subsequent performance? 2) what is the most effective way of entity profiling? 3) is filtering different for Wikipedia and Twitter entities? 4) are some type of documents easily filterable and others not? 5) does a gain in recall at filtering step translate to a gain in F-measure at the end of the pipeline? and 6) what are the circumstances under which vital documents can not be retrieved? 
 
 
Cleansing does remove parts or entire contents of documents making them irretrievable. However, because of the introduction of false positives, recall gains by  raw corpus and some  richer entity profiles do not necessarily translate to overall performance gain. The results conclusion on this is mixed in the sense that cleansing helps improve the recall on vital documents and Wikipedia entities, but reduces the recall on Twitter entities and the relative category of relevance ranking. Vital and relevant documents show a difference in retrieval nonperformance documents are easier to filter than relevant.  
 
 
 
Despite an aggressive attempt to filter as many vital-relevant documents as possible,  we observe that there are still documents that we miss. While some are possible to retrieve with some modifications, some others are not. There are some document that indicate that an information filtering system does not seem to get them no matter how rich representation of entities they use. These circumstances under which this happens are many. We found that some documents have no content at all, subjectivity(it is not clear why some are judged vital). However, the main circumstances under which vital  documents can defy filtering is: outgoing link mentions, 
 
venue-event, entity - related entity, organization - main area of operation, entity - group, artist - artist's work,  party-politician, and world knowledge.  
 
In this paper, we examined the filtering stage of the entity-centric
 
stream filtering and ranking by holding the later stages of the pipeline fixed. In
 
particular, we studied the cleansing step, different techniques to
 
construct entity profiles, and the effects of entity type (Wikipedia
 
or Twitter) and document category (news, social, or other). We attempted to address
 
the following research questions: 1) does cleansing affect filtering
 
and subsequent performance? 2) what is the most effective way of
 
entity profiling? 3) is filtering different for Wikipedia and Twitter
 
entities? 4) are some types of documents easily filterable and others
 
not? 5) does a gain in recall at the filtering step translate to a gain in
 
max-F at the end of the pipeline? and 6) what are the
 
circumstances under which vital documents cannot be retrieved?
 
 
Cleansing may remove (parts of) the contents of documents, making
 
them irretrievable. However, because of the introduction of false
 
positives, gaining recall by filtering the raw corpus instead of the
 
cleansed one and developing richer entity profiles does not necessarily translate to overall
 
performance gains. The overall conclusion on this is mixed in the
 
sense that cleansing has helped to improve the recall on vital
 
documents and Wikipedia entities, but at the same time reduces the
 
recall on Twitter entities and on the relevant category of the
 
relevance ratings. Vital and relevant documents show a difference in
 
retrieval performance, where vital documents appear to be easier to filter than
 
relevant ones. Note that in the context of the CCR task, the vital documents are
 
the most important.
 
 
 
Despite an exhaustive attempt to identify as many vital-relevant
 
documents as possible,  we observe that there are still documents that
 
we miss. While some can clearly be retrieved by modifying the
 
filtering procedure, some relevant and even vital documents can be
 
considered irretrievable. The circumstances under
 
which this happens are many. A few documents have no content, or it is
 
unclear why they have been judged vital. However, the main
 
circumstances under which vital documents 
 
can defy filtering include: outgoing link mentions,
 
venue-event, entity - related entity, organization - main area of
 
operation, entity - group, artist - artist's work,  party-politician,
 
and world knowledge.
 
 
 
%ACKNOWLEDGMENTS are optional