Changeset - 027586725c55
[Not reviewed]
Merge
Arjen de Vries (arjen) - 2014-06-12 06:55:07
arjen.de.vries@cwi.nl
Merge branch 'master' of https://scm.cwi.nl/IA/cikm-paper
1 file changed with 18 insertions and 18 deletions:
0 comments (0 inline, 0 general)
mypaper-final.tex
 
@@ -129,7 +129,7 @@ relevance of the document-entity pair under consideration.
 We analyze how these factors (and the design choices made in their
 corresponding system components) affect filtering performance.
 We identify and characterize the relevant documents that do not pass
-the filtering stage by examing their contents. This way, we
+the filtering stage by examining their contents. This way, we
 estimate a practical upper-bound of recall for entity-centric stream
 filtering.
 
@@ -208,7 +208,7 @@ of recall on entity-based filtering.
 The rest of the paper  is organized as follows. Section \ref{sec:desc}
 describes the dataset and section \ref{sec:fil} defines the task. In
-section  \ref{sec:lit}, we discuss related literature folowed by a
+section  \ref{sec:lit}, we discuss related literature followed by a
 discussion of our method in \ref{sec:mthd}. Following that,  we
 present the experimental results in \ref{sec:expr}, and discuss and
 analyze them in \ref{sec:analysis}. Towards the end, we discuss the
 
@@ -300,7 +300,7 @@ that can be useful for initial KB-dossier are annotated as
 
%broken down by source categories and relevance rank% (vital, or
 
%relevant).  
 
In total, the dataset contains 24162 unique entity-document
 
pairs, vital or relevant; 9521 of these have been labelled as vital,
 
pairs, vital or relevant; 9521 of these have been labeled as vital,
 
and the remaining 17424 as relevant.
 
All documents are categorized into 8 source categories: 0.98\%
 
arxiv(a), 0.034\% classified(c), 0.34\% forum(f), 5.65\% linking(l),
 
@@ -334,7 +334,7 @@ grouped as others.
  its performance. The  filtering stage of entity-based stream
  filtering and ranking can be likened to the adaptive filtering task
  of the filtering track. The persistent information needs are the KB
- entities, and the relevance judgments are the small number of postive
+ entities, and the relevance judgments are the small number of positive
  examples.
 
 Stream filtering is then the task to, given a stream of documents of news items, blogs
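
A minimal sketch of the stream-filtering loop this hunk describes, in Python; the paper publishes no code, so all names here (`matches`, `filter_stream`, `entity_profiles`) are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only: a document passes the filter for an entity
# when the document text contains any surface form in that entity's profile.

def matches(text, profile):
    lowered = text.lower()
    return any(name.lower() in lowered for name in profile)

def filter_stream(stream, entity_profiles):
    # stream: iterable of (doc_id, document_text) pairs
    # entity_profiles: dict mapping entity id -> set of surface forms
    for doc_id, text in stream:
        for entity, profile in entity_profiles.items():
            if matches(text, profile):
                yield doc_id, entity

Later hunks vary what goes into the profile (canonical names, name variants, and their partial tokens) and measure the recall of the pairs such a loop yields.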
 
@@ -359,18 +359,18 @@ Stream filtering is then the task to, given a stream of documents of news items,
 \end{enumerate}
 
 The TREC filtering and the filtering as part of the entity-centric
-stream filtering and ranking pipepline have different purposes. The
+stream filtering and ranking pipeline have different purposes. The
 TREC filtering track's goal is the binary classification of documents:
-for each incoming docuemnt, it decides whether the incoming document
-is relevant or not for a given profile. The docuemnts are either
+for each incoming document, it decides whether the incoming document
+is relevant or not for a given profile. The documents are either
 relevant or not. In our case, the documents have relevance rankings and
 the goal of the filtering stage is to filter as many potentially
 relevant documents as possible, but as few irrelevant documents as
-possible not to obfuscate the later stages of the piepline.  Filtering
+possible, so as not to obfuscate the later stages of the pipeline.  Filtering
 as part of the pipeline needs that delicate balance between retrieving
-relavant documents and irrrelevant documensts. Bcause of this,
+relevant documents and avoiding irrelevant documents. Because of this,
 filtering in this case can only be studied by binding it to the later
-stages of the entity-centric pipeline. This bond influnces how we do
+stages of the entity-centric pipeline. This bond influences how we do
 evaluation.
 
 To achieve this, we use recall percentages in the filtering stage for
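
The evaluation this hunk trails off into (recall percentages for the filtering stage, and the best F-score over all relevance cut-offs for the later stages, as the next hunk's context line shows) amounts to something like the following sketch; the function names are assumptions.

# Sketch of the two evaluation measures; names are hypothetical.

def filtering_recall(passed_pairs, annotated_pairs):
    # Both arguments are sets of (doc_id, entity) tuples; recall is the
    # fraction of annotated vital/relevant pairs that survive filtering.
    if not annotated_pairs:
        return 0.0
    return len(passed_pairs & annotated_pairs) / len(annotated_pairs)

def max_f(precision_recall_pairs):
    # Max-F: the best F1 over all relevance cut-offs.
    return max((2 * p * r / (p + r) for p, r in precision_recall_pairs
                if p + r > 0), default=0.0)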
 
@@ -382,8 +382,8 @@ F-score obtained over all relevance cut-offs.
 \section{Literature Review} \label{sec:lit}
 
-There has been a great deal of interest  as of late on entity-based filtering and ranking.  The Text Analysis Conference   started  Knwoledge Base Population with the goal of developing methods and technologies to fascilitate the creation and population of KBs \cite{ji2011knowledge}. The most relevant track in KBP is entity-linking: given an entity and
-a document containing a mention of the entity, identify the mention in the document and link it to the its profile in a KB.  Many studies have attempted to address this task \cite{dalton2013neighborhood, dredze2010entity, davis2012named}. 
+There has been a great deal of interest as of late in entity-based filtering and ranking. The Text Analysis Conference started Knowledge Base Population (KBP) with the goal of developing methods and technologies to facilitate the creation and population of KBs \cite{ji2011knowledge}. The most relevant track in KBP is entity linking: given an entity and
+a document containing a mention of the entity, identify the mention in the document and link it to its profile in a KB. Many studies have attempted to address this task \cite{dalton2013neighborhood, dredze2010entity, davis2012named}. 
 
 A more recent manifestation of that is the introduction of TREC KBA in 2012. Following that, a number of research works have been done on the topic \cite{frank2012building, ceccarelli2013learning, taneva2013gem, wang2013bit, balog2013multi}. These works are based on the KBA 2012 task and dataset, and they address the whole problem of entity filtering and ranking. TREC KBA continued in 2013, but the task underwent some changes. The main changes between 2012 and 2013 are in the number of entities, the type of entities, the corpus, and the relevance rankings.
 
 
@@ -559,7 +559,7 @@ these, 24162 unique document-entity pairs are vital (9521) or relevant
 The upper part of Table \ref{tab:name} shows the recall performances on the cleansed version and the lower part on the raw version. The recall performances for all entity types increase substantially in the raw version. Recall increases on Wikipedia entities vary from 8.2 to 12.8 percentage points, and on Twitter entities from 6.8 to 26.2. Over all entities, the increase ranges from 8.0 to 13.6. These increases are substantial: to put them into perspective, an 11.8-point increase in recall on all entities amounts to retrieving 2864 more unique document-entity pairs. %This suggests that cleansing has removed some documents that we could otherwise retrieve. 
 
 \subsection{Entity Profiles}
-If we look at the recall performances for the raw corpus,   filtering documents by canonical names achieves a recall of  59\%.  Adding other name variants  improves the recall to 79.8\%, an increase of 20.8\%. This means  20.8\% of documents mentioned the entities by other names  rather than by their canonical names. Canonical partial  achieves a recall of 72\%  and name-variant partial achives 90.2\%. This says that 18.2\% of documents mentioned the entities by  partial names of other non-canonical name variants. 
+If we look at the recall performances for the raw corpus, filtering documents by canonical names achieves a recall of 59\%. Adding other name variants improves the recall to 79.8\%, an increase of 20.8 percentage points. This means 20.8\% of documents mentioned the entities by names other than their canonical names. Canonical partial achieves a recall of 72\% and name-variant partial achieves 90.2\%. This says that 18.2\% of documents mentioned the entities by partial names of non-canonical name variants. 
 
 %\begin{table*}
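
The four profile variants this hunk compares (canonical, canonical-partial, name-variant, name-variant-partial) can be made concrete with a sketch like the one below; the tokenization and the list-based inputs are assumptions for illustration, not the paper's exact method.

# Building the four entity-profile variants; illustrative only.

def partial_names(names):
    # All single tokens of each name, e.g. "Lewis and Clark Landing"
    # -> {"lewis", "and", "clark", "landing"}.
    return {tok for name in names for tok in name.lower().split()}

def build_profile(canonical, variants, kind):
    # canonical: str; variants: list of alternative surface forms.
    if kind == "canonical":
        return {canonical.lower()}
    if kind == "canonical-partial":
        return partial_names([canonical])
    if kind == "name-variant":
        return {n.lower() for n in [canonical] + variants}
    if kind == "name-variant-partial":
        return partial_names([canonical] + variants)
    raise ValueError(kind)

Each variant feeds the same matching predicate; recall grows as the profile admits more surface forms, since partial matching relaxes exact-name matching and name variants extend the canonical name.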
 
@@ -794,7 +794,7 @@ Finally,  we select  the 5 most frequent n-grams for each context.
 
 Features we use include similarity features such as cosine and Jaccard; document-entity features such as whether the document mentions the entity in its title or body, and the frequency of mention; and related-entity features such as PageRank scores.
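
The three feature families named above might look like the following sketch; the draft paragraph does not fully specify the feature set, so the names and details here are placeholders.

# Placeholder sketch of the three feature families named above.
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def jaccard(a, b):
    # Jaccard overlap between two term sets.
    return len(a & b) / len(a | b) if a | b else 0.0

def features(doc_title, doc_body, kb_profile_text, entity_name, pagerank_score):
    doc = Counter((doc_title + " " + doc_body).lower().split())
    kb = Counter(kb_profile_text.lower().split())
    name = entity_name.lower()
    return {
        "cosine": cosine(doc, kb),                      # similarity
        "jaccard": jaccard(set(doc), set(kb)),          # similarity
        "mention_in_title": name in doc_title.lower(),  # document-entity
        "mention_in_body": name in doc_body.lower(),    # document-entity
        "mention_frequency": doc_body.lower().count(name),
        "pagerank": pagerank_score,                     # related-entity
    }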
 
   
 
 Here, we present results showing how the choices in corpus, entity types, and entity profiles impact these later stages of the pipeline. In tables \ref{tab:class-vital} and \ref{tab:class-vital-relevant}, we show the performances in max-F. 
 
 \begin{table*}
 \caption{vital performance under different name variants (upper part from cleansed, lower part from raw)}
 
@@ -1019,10 +1019,10 @@ removes the related links and adverts which may contain a mention of
 the entities. One example we saw was that the cleansing removed an
 image with the text of an entity name which was actually relevant. And
 that it removes social documents can be explained by the fact that
-most of the missing of the missing  docuemnts from cleansed are
-social. And all the docuemnts that are missing from raw corpus
+most of the missing documents from the cleansed corpus are
+social. And all the documents that are missing from the raw corpus are
 social. So in both cases social documents seem to suffer from the text
-transformation and cleasing processes. 
+transformation and cleansing processes. 
 
 %%%% NEEDS WORK:
 
 
@@ -1120,7 +1120,7 @@ We observed that there are vital-relevant documents that we miss from raw only,
 Avoiding duplicates, we randomly selected 35 documents, one for each entity. The documents are 13 news and 22 social. Below, we classify the situations under which a document can be vital for an entity without mentioning the entity under any of the entity profiles we used for filtering. 
 
 \paragraph*{Outgoing link mentions} A post (tweet) with an outgoing link which mentions the entity.
-\paragraph*{Event place - Event} A document that talks about an event is vital to the location entity where it takes place.  For example Maha Music Festival takes place in Lewis and Clark\_Landing, and a document talking about the festival is vital for the park. There are also cases where an event's address places the event in a park and due to that the document becomes vital to the park. This is basically being mentioned by address which belongs to alarger space. 
+\paragraph*{Event place - Event} A document that talks about an event is vital to the location entity where it takes place. For example, the Maha Music Festival takes place in Lewis and Clark\_Landing, and a document talking about the festival is vital for the park. There are also cases where an event's address places the event in a park and, due to that, the document becomes vital to the park. This is basically being mentioned by an address which belongs to a larger space. 
 \paragraph*{Entity - related entity} A document about an important figure such as an artist or athlete can be vital to another. This is especially true if the two are contending for the same title, or one has snatched a title or award from the other. 
 \paragraph*{Organization - main activity} A document that talks about an area in which the company is active is vital for the organization. For example, Atacocha is a mining company, and a news item on mining waste was annotated vital. 
 \paragraph*{Entity - group} If an entity belongs to a certain group (class), a news item about the group can be vital for the individual members. FrankandOak is named an innovative company, and a news item that talks about the group of innovative companies is relevant for it. Other examples are big events to which an entity is related, such as film awards for actors. 