HCDA/cikm-paper Changeset - 2049a41bcedf · Centrum Wiskunde & Informatica (CWI)

Arjen de Vries (arjen) - 11 years ago 2014-06-12 05:43:29
arjen.de.vries@cwi.nl

working on conclusions

1 file changed with 40 insertions and 8 deletions:

mypaper-final.tex

+%
- The use of name-variant partial for filtering is an aggressive attempt to retrieve as many relevant documents as possible at the cost of retrieving irrelevant documents. However, we still miss about  2363(10\%) of the vital-relevant documents.  Why are these documents missed? If they are not mentioned by partial names of name variants, what are they mentioned by? Table \ref{tab:miss} shows the documents that we miss with respect to cleansed and raw corpus.  The upper part shows the number of documents missing from cleansed and raw versions of the corpus. The lower part of the table shows the intersections and exclusions in each corpus.
+ The use of name-variant partial for filtering is an exhaustive attempt to retrieve as many relevant documents as possible, at the cost of also retrieving irrelevant documents. However, we still miss 2363 (about 10\%) of the vital-relevant documents. Why are these documents missed? If they are not mentioned by partial names of name variants, what are they mentioned by? Table \ref{tab:miss} shows the documents that we miss with respect to the cleansed and raw corpora. The upper part shows the number of documents missing from the cleansed and raw versions of the corpus. The lower part of the table shows the intersections and exclusions in each corpus.
 \begin{table}
 \caption{The number of documents missing  from raw and cleansed extractions. }
 \section{Conclusions} \label{sec:conc}
-In this paper, we examined the filtering stage of the entity-centric stream filtering and ranking  by holding the later stages of fixed. In particular, we studied the cleansing step, different entity profiles, type of entities(Wikipedia or Twitter), categories of documents(news, social, or others) and the relevance ratings. We attempted to address the following research questions: 1) does cleansing affect filtering and subsequent performance? 2) what is the most effective way of entity profiling? 3) is filtering different for Wikipedia and Twitter entities? 4) are some type of documents easily filterable and others not? 5) does a gain in recall at filtering step translate to a gain in F-measure at the end of the pipeline? and 6) what are the circumstances under which vital documents can not be retrieved?
-Cleansing does remove parts or entire contents of documents making them irretrievable. However, because of the introduction of false positives, recall gains by  raw corpus and some  richer entity profiles do not necessarily translate to overall performance gain. The results conclusion on this is mixed in the sense that cleansing helps improve the recall on vital documents and Wikipedia entities, but reduces the recall on Twitter entities and the relative category of relevance ranking. Vital and relevant documents show a difference in retrieval nonperformance documents are easier to filter than relevant.
-Despite an aggressive attempt to filter as many vital-relevant documents as possible,  we observe that there are still documents that we miss. While some are possible to retrieve with some modifications, some others are not. There are some document that indicate that an information filtering system does not seem to get them no matter how rich representation of entities they use. These circumstances under which this happens are many. We found that some documents have no content at all, subjectivity(it is not clear why some are judged vital). However, the main circumstances under which vital  documents can defy filtering is: outgoing link mentions,
-venue-event, entity - related entity, organization - main area of operation, entity - group, artist - artist's work,  party-politician, and world knowledge.
+In this paper, we examined the filtering stage of entity-centric
+stream filtering and ranking by holding the later stages fixed. In
+particular, we studied the cleansing step, different techniques to
+construct entity profiles, and the effects of entity type (Wikipedia
+or Twitter) and document category (news, social, or other). We attempted to address
+the following research questions: 1) does cleansing affect filtering
+and subsequent performance? 2) what is the most effective way of
+entity profiling? 3) is filtering different for Wikipedia and Twitter
+entities? 4) are some types of documents easily filterable and others
+not? 5) does a gain in recall at the filtering step translate to a gain in
+max-F at the end of the pipeline? and 6) what are the
+circumstances under which vital documents cannot be retrieved?
+
+Cleansing may remove (parts of) the contents of documents, making
+them irretrievable. However, because of the introduction of false
+positives, gaining recall by filtering the raw corpus instead of the
+cleansed one and by developing richer entity profiles does not necessarily translate to overall
+performance gains. The overall conclusion on this is mixed, in the
+sense that cleansing helps to improve the recall on vital
+documents and Wikipedia entities, but at the same time reduces the
+recall on Twitter entities and on the relevant category of
+relevance ratings. Vital and relevant documents show a difference in
+retrieval performance, where vital documents appear to be easier to filter than
+relevant ones. Note that in the context of the CCR task, the vital documents are
+the most important.
+
+Despite an exhaustive attempt to identify as many vital-relevant
+documents as possible, we observe that there are still documents that
+we miss. While some can clearly be retrieved by modifying the
+filtering procedure, some relevant and even vital documents can be
+considered irretrievable. The circumstances under
+which this happens are many. A few documents have no content, or it is
+unclear why they have been judged vital. However, the main
+circumstances under which vital documents
+can defy filtering include: outgoing link mentions,
+venue--event, entity--related entity, organization--main area of
+operation, entity--group, artist--artist's work, party--politician,
+and world knowledge.
 %ACKNOWLEDGMENTS are optional
