Changeset - d9b84600c510
Arjen de Vries (arjen) - 11 years ago 2014-06-12 05:48:34
arjen.de.vries@cwi.nl
conclusions "done"
1 file changed with 7 insertions and 4 deletions:
mypaper-final.tex
 
@@ -1086,94 +1086,97 @@ Raw & 276 & 4951 & 5227 \\
 
One would assume that the set of document-entity pairs extracted from the cleansed corpus is a subset of those extracted from the raw corpus. We find that this is not the case. There are 217 unique document-entity pairs that are retrieved from the cleansed corpus but not from the raw corpus; 57 of them are vital. Similarly, there are 3081 document-entity pairs that are missing from cleansed but present in raw; 1065 of them are vital. Examining the content of the documents reveals that the cause is a missing part of the text in the corresponding document. All the documents that we miss from the raw corpus are social: tweets, blogs, and posts from other social media. To meet the format of the raw data (binary byte array), some of them must have been converted after collection, losing part or all of their content on the way. The same holds for the documents that we miss from cleansed: part or all of the content is lost during the cleansing process (the removal of HTML tags and non-English documents). In both cases, the mention of the entity happened to be in the part of the text that was cut out during the transformation. 
 
 
 
 
 The interesting relevance judgments are those that we miss from both the raw and the cleansed extractions. These are 2146 unique document-entity pairs, 219 of which have vital relevance judgments. The entities in the missed vital annotations comprise 28 Wikipedia and 7 Twitter entities, making a total of 35. The great majority (86.7\%) of the documents are social. This suggests that social documents (tweets and blogs) talk about entities without mentioning them by name more often than news and other documents do. This is, of course, in line with intuition. 
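
The three sets of missed pairs discussed above can be stated compactly. As a notational sketch only (the symbols $A$, $R$, and $C$ are introduced here for illustration and are not used elsewhere in the paper), assume that $A$ is the set of annotated document-entity pairs, and that $R \subseteq A$ and $C \subseteq A$ are the annotated pairs retrieved from the raw and the cleansed corpus, respectively. Under that assumption, the reported counts read
\[
|C \setminus R| = 217, \qquad |R \setminus C| = 3081, \qquad |A \setminus (R \cup C)| = 2146,
\]
so that $C \not\subseteq R$, and the pairs in $A \setminus (R \cup C)$ are the ones missed by both extractions.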
 
   
 
 
 
%%%%%%%%%%%%%%%%%%%%%%
 
 
We observed that there are vital-relevant documents that we miss from raw only, and similarly from cleansed only. The reason for this is the transformation from one format to another. The most interesting documents are those that we miss from both the raw and the cleansed corpus. We first identified the KB entities that have a vital relevance judgment and whose documents cannot be retrieved (35 in total), and then manually examined the content of these documents to find out why they are missed. 
 
 
 
 
 
 We  observed  that among the missing documents, different document ids can have the same content, and be judged multiple times for a given entity.  %In the vital annotation, there are 88 news, and 409 weblog. 
 
 Avoiding duplicates, we randomly selected 35 documents, one for each entity. Of these documents, 13 are news and 22 are social. Below, we classify the situations in which a document can be vital for an entity without mentioning the entity in any of the forms captured by the different entity profiles we used for filtering. 
 
 
\paragraph*{Outgoing link mentions} A post (tweet) with an outgoing link which mentions the entity.
 
\paragraph*{Event place - Event} A document that talks about an event is vital to the location entity where the event takes place. For example, Maha Music Festival takes place in Lewis and Clark\_Landing, and a document talking about the festival is vital for the park. There are also cases where an event's address places the event in a park, and because of that the document becomes vital to the park. In effect, the entity is mentioned through an address that belongs to a larger space. 
 
\paragraph*{Entity - related entity} A document about an important figure, such as an artist or athlete, can be vital to another such figure. This is especially true if the two are contending for the same title, or if one has taken a title or award from the other. 
 
\paragraph*{Organization - main activity} A document that talks about an area in which a company is active can be vital for the organization. For example, Atacocha is a mining company, and a news item on mining waste was annotated vital. 
 
\paragraph*{Entity - group} If an entity belongs to a certain group (class), a news item about the group can be vital for the individual members. FrankandOak is named an innovative company, and a news item that talks about the group of innovative companies is relevant for it. Another example is a big event to which an entity is related, such as film awards for actors. 
 
\paragraph*{Artist - work} Documents that discuss the work of artists can be relevant to the artists. Such cases include books or films being vital for the author of the book or the director (or an actor) of the film. Robocop is a film whose screenplay is by Joshua Zetumer. A blog that talks about the film was judged vital for Joshua Zetumer. 
 
\paragraph*{Politician - constituency} A major political event in a certain constituency is vital for the politician from that constituency. 
 
 A good example is a weblog that talks about two North Dakota counties being drought disasters. The news is vital for Joshua Boschee, a politician and member of the North Dakota Democratic party.  
 
\paragraph*{Head - organization} A document that talks about an organization of which the entity is the head can be vital for the entity. Jasper\_Schneider is the USDA Rural Development state director for North Dakota, and an article about problems of primary health centers in North Dakota is judged vital for him. 
 
\paragraph*{World Knowledge} Some documents can only be linked to an entity with the help of world knowledge. For example, ``refreshments, treats, gift shop specials, `bountiful, fresh and fabulous holiday decor,' a demonstration of simple ways to create unique holiday arrangements for any home; free and open to the public'' is judged relevant to Hjemkomst\_Center. This is a social media post, and unless one knows the person posting it, there is no way to tell from the text that it concerns the center. Similarly, ``learn about the gray wolf's hunting and feeding behaviors and watch the wolves have their evening meal of a full deer carcass; \$15 for members, \$20 for nonmembers'' is judged vital to Red\_River\_Zoo.  
 
\paragraph*{No document content} A small number of documents were found to have no content.
 
\paragraph*{Disagreement} For a few remaining documents, the authors disagree with the assessors as to why these are vital to the entity.
 
 
 
 
\section{Conclusions} \label{sec:conc}
 
In this paper, we examined the filtering stage of entity-centric
stream filtering and ranking by holding the later stages of the pipeline fixed. In
particular, we studied the cleansing step, different techniques to
construct entity profiles, and the effects of entity type (Wikipedia
or Twitter) and document category (news, social, or other). We attempted to address
the following research questions: 1) does cleansing affect filtering
and subsequent performance? 2) what is the most effective way of
entity profiling? 3) is filtering different for Wikipedia and Twitter
entities? 4) are some types of documents easily filterable and others
not? 5) does a gain in recall at the filtering step translate to a gain in
max-F at the end of the pipeline? and 6) what are the
circumstances under which vital documents cannot be retrieved?
 
 
Cleansing may remove (parts of) the contents of documents, making
them irretrievable. However, because of the introduction of false
positives, gaining recall by filtering the raw corpus instead of the
cleansed one, as well as developing richer entity profiles, does not necessarily translate to overall
performance gains. The conclusion is mixed in the
sense that cleansing improves the recall on vital
documents and Wikipedia entities, but at the same time reduces the
recall on Twitter entities and the relative category of
relevance ranking. Vital and relevant documents show a difference in
retrieval performance, where vital documents appear to be easier to filter than
relevant ones. (Notice that in the context of the CCR task, the vital documents are
most important.) The bottom line is that improving the filtering step
with respect to recall has shown that current entity-oriented
retrieval approaches need to be improved to better classify and rank the ``new''
documents that make it into the working set.
 
 
 
Despite an exhaustive attempt to identify as many vital-relevant
documents as possible, we observe that there are still documents that
we miss. While some can clearly be retrieved by modifying the
filtering procedure, some relevant and even vital documents can be
considered irretrievable. The circumstances under
which this happens are many. A few documents have no content, or it is
unclear why they have been judged vital. However, the main
circumstances under which vital documents
can defy filtering include: outgoing link mentions,
venue - event, entity - related entity, organization - main area of
operation, entity - group, artist - artist's work, party - politician,
and world knowledge.
 
 
 
%ACKNOWLEDGMENTS are optional
 
%\section{Acknowledgments}
 
 
%
 
% The following two commands are all you need in the
 
% initial runs of your .tex file to
 
% produce the bibliography for the citations in your paper.
 
\bibliographystyle{abbrv}
 
\bibliography{sigproc}  % sigproc.bib is the name of the Bibliography in this case
 
% You must have a proper ".bib" file
 
%  and remember to run:
 
% latex bibtex latex latex
 
% to resolve all references
 
%
 
% ACM needs 'a single self-contained file'!
 
%
 
%APPENDICES are optional
 
%\balancecolumns
 
 
 
\end{document}