From 2049a41bcedf8135b5274df372347b10dd740d89 2014-06-12 05:43:29
From: Arjen P. de Vries
Date: 2014-06-12 05:43:29
Subject: [PATCH] working on conclusions
---
diff --git a/mypaper-final.tex b/mypaper-final.tex
index 2570fd89e80c91e50152cafd878b5ed28c6f1fcb..097ed36ba4598b0cc7c96b266406af01a1a5e4f5 100644
--- a/mypaper-final.tex
+++ b/mypaper-final.tex
@@ -1058,7 +1058,7 @@ and name-variants bring in new relevant documents that can not be retrieved by c
 %
- The use of name-variant partial for filtering is an aggressive attempt to retrieve as many relevant documents as possible at the cost of retrieving irrelevant documents. However, we still miss about 2363(10\%) of the vital-relevant documents. Why are these documents missed? If they are not mentioned by partial names of name variants, what are they mentioned by? Table \ref{tab:miss} shows the documents that we miss with respect to cleansed and raw corpus. The upper part shows the number of documents missing from cleansed and raw versions of the corpus. The lower part of the table shows the intersections and exclusions in each corpus.
+ The use of name-variant partial for filtering is an exhaustive attempt to retrieve as many relevant documents as possible at the cost of retrieving irrelevant documents. However, we still miss about 2363(10\%) of the vital-relevant documents. Why are these documents missed? If they are not mentioned by partial names of name variants, what are they mentioned by? Table \ref{tab:miss} shows the documents that we miss with respect to cleansed and raw corpus. The upper part shows the number of documents missing from cleansed and raw versions of the corpus. The lower part of the table shows the intersections and exclusions in each corpus.
 \begin{table}
 \caption{The number of documents missing from raw and cleansed extractions.
 }
@@ -1115,13 +1115,45 @@ We observed that there are vital-relevant documents that we miss from raw only,
 \section{Conclusions}
 \label{sec:conc}
-In this paper, we examined the filtering stage of the entity-centric stream filtering and ranking by holding the later stages of fixed. In particular, we studied the cleansing step, different entity profiles, type of entities(Wikipedia or Twitter), categories of documents(news, social, or others) and the relevance ratings. We attempted to address the following research questions: 1) does cleansing affect filtering and subsequent performance? 2) what is the most effective way of entity profiling? 3) is filtering different for Wikipedia and Twitter entities? 4) are some type of documents easily filterable and others not? 5) does a gain in recall at filtering step translate to a gain in F-measure at the end of the pipeline? and 6) what are the circumstances under which vital documents can not be retrieved?
-
-Cleansing does remove parts or entire contents of documents making them irretrievable. However, because of the introduction of false positives, recall gains by raw corpus and some richer entity profiles do not necessarily translate to overall performance gain. The results conclusion on this is mixed in the sense that cleansing helps improve the recall on vital documents and Wikipedia entities, but reduces the recall on Twitter entities and the relative category of relevance ranking. Vital and relevant documents show a difference in retrieval nonperformance documents are easier to filter than relevant.
-
-
-Despite an aggressive attempt to filter as many vital-relevant documents as possible, we observe that there are still documents that we miss. While some are possible to retrieve with some modifications, some others are not. There are some document that indicate that an information filtering system does not seem to get them no matter how rich representation of entities they use. These circumstances under which this happens are many. We found that some documents have no content at all, subjectivity(it is not clear why some are judged vital). However, the main circumstances under which vital documents can defy filtering is: outgoing link mentions,
-venue-event, entity - related entity, organization - main area of operation, entity - group, artist - artist's work, party-politician, and world knowledge.
+In this paper, we examined the filtering stage of the entity-centric
+stream filtering and ranking by holding the later stages fixed. In
+particular, we studied the cleansing step, different techniques to
+construct entity profiles, and the effects of entity type (Wikipedia
+or Twitter) and document category (news, social, or other). We attempted to address
+the following research questions: 1) does cleansing affect filtering
+and subsequent performance? 2) what is the most effective way of
+entity profiling? 3) is filtering different for Wikipedia and Twitter
+entities? 4) are some types of documents easily filterable and others
+not? 5) does a gain in recall at the filtering step translate to a gain in
+max-F at the end of the pipeline? and 6) what are the
+circumstances under which vital documents cannot be retrieved?
+
+Cleansing may remove (parts of) the contents of documents, making
+them irretrievable. However, because of the introduction of false
+positives, gaining recall by filtering the raw corpus instead of the
+cleansed one and developing richer entity profiles does not necessarily translate to overall
+performance gains. The overall conclusion on this is mixed in the
+sense that cleansing helps to improve the recall on vital
+documents and Wikipedia entities, but at the same time reduces the
+recall on Twitter entities and the relative category of
+relevance ranking. Vital and relevant documents show a difference in
+retrieval performance, where vital documents appear to be easier to filter than
+relevant ones. Note that in the context of the CCR task, the vital documents are
+most important.
+
+
+Despite an exhaustive attempt to identify as many vital-relevant
+documents as possible, we observe that there are still documents that
+we miss. While some can clearly be retrieved by modifying the
+filtering procedure, some relevant and even vital documents can be
+considered irretrievable. The circumstances under
+which this happens are many. A few documents have no content, or it is
+unclear why they have been judged vital. However, the main
+circumstances under which vital documents
+can defy filtering include: outgoing link mentions,
+venue-event, entity - related entity, organization - main area of
+operation, entity - group, artist - artist's work, party-politician,
+and world knowledge.
 %ACKNOWLEDGMENTS are optional