Changeset - 027586725c55
[Not reviewed]
Merge
Arjen de Vries (arjen) - 2014-06-12 06:55:07
arjen.de.vries@cwi.nl
Merge branch 'master' of https://scm.cwi.nl/IA/cikm-paper
1 file changed with 18 insertions and 18 deletions:
0 comments (0 inline, 0 general)
mypaper-final.tex
 
@@ -129,7 +129,7 @@ relevance of the document-entity pair under consideration.
 We analyze how these factors (and the design choices made in their
 corresponding system components) affect filtering performance.
 We identify and characterize the relevant documents that do not pass
-the filtering stage by examing their contents. This way, we
+the filtering stage by examining their contents. This way, we
 estimate a practical upper-bound of recall for entity-centric stream
 filtering.
 
@@ -208,7 +208,7 @@ of recall on entity-based filtering.
 The rest of the paper  is organized as follows. Section \ref{sec:desc}
 describes the dataset and section \ref{sec:fil} defines the task. In
-section  \ref{sec:lit}, we discuss related literature folowed by a
+section  \ref{sec:lit}, we discuss related literature followed by a
 discussion of our method in \ref{sec:mthd}. Following that,  we
 present the experimental results in \ref{sec:expr}, and discuss and
 analyze them in \ref{sec:analysis}. Towards the end, we discuss the
 
@@ -300,7 +300,7 @@ that can be useful for initial KB-dossier are annotated as
 
%broken down by source categories and relevance rank% (vital, or
 
%relevant).  
 
In total, the dataset contains 24162 unique entity-document
 
pairs, vital or relevant; 9521 of these have been labelled as vital,
 
pairs, vital or relevant; 9521 of these have been labeled as vital,
 
and the remaining 17424 as relevant.
 
All documents are categorized into 8 source categories: 0.98\%
 
arxiv(a), 0.034\% classified(c), 0.34\% forum(f), 5.65\% linking(l),
 
@@ -334,7 +334,7 @@ grouped as others.
  its performance. The  filtering stage of entity-based stream
  filtering and ranking can be likened to the adaptive filtering task
  of the filtering track. The persistent information needs are the KB
- entities, and the relevance judgments are the small number of postive
+ entities, and the relevance judgments are the small number of positive
  examples.
 
 Stream filtering is then the task to, given a stream of documents of news items, blogs
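
A minimal sketch of the stream-filtering loop this hunk describes, in Python; the paper publishes no code, so all names here (`matches`, `filter_stream`, `entity_profiles`) are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only: a document passes the filter for an entity
# when the document text contains any surface form in that entity's profile.

def matches(text, profile):
    lowered = text.lower()
    return any(name.lower() in lowered for name in profile)

def filter_stream(stream, entity_profiles):
    # stream: iterable of (doc_id, document_text) pairs
    # entity_profiles: dict mapping entity id -> set of surface forms
    for doc_id, text in stream:
        for entity, profile in entity_profiles.items():
            if matches(text, profile):
                yield doc_id, entity

Later hunks vary what goes into the profile (canonical names, name variants, and their partial tokens) and measure the recall of the pairs such a loop yields.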
 
@@ -359,18 +359,18 @@ Stream filtering is then the task to, given a stream of documents of news items,
 \end{enumerate}
 
 The TREC filtering and the filtering as part of the entity-centric
-stream filtering and ranking pipepline have different purposes. The
+stream filtering and ranking pipeline have different purposes. The
 TREC filtering track's goal is the binary classification of documents:
-for each incoming docuemnt, it decides whether the incoming document
-is relevant or not for a given profile. The docuemnts are either
+for each incoming document, it decides whether the incoming document
+is relevant or not for a given profile. The documents are either
 relevant or not. In our case, the documents have relevance rankings and
 the goal of the filtering stage is to filter as many potentially
 relevant documents as possible, but as few irrelevant documents as
-possible not to obfuscate the later stages of the piepline.  Filtering
+possible, so as not to obfuscate the later stages of the pipeline.  Filtering
 as part of the pipeline needs that delicate balance between retrieving
-relavant documents and irrrelevant documensts. Bcause of this,
+relevant documents and avoiding irrelevant documents. Because of this,
 filtering in this case can only be studied by binding it to the later
-stages of the entity-centric pipeline. This bond influnces how we do
+stages of the entity-centric pipeline. This bond influences how we do
 evaluation.
 
 To achieve this, we use recall percentages in the filtering stage for
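
The evaluation this hunk trails off into (recall percentages for the filtering stage, and the best F-score over all relevance cut-offs for the later stages, as the next hunk's context line shows) amounts to something like the following sketch; the function names are assumptions.

# Sketch of the two evaluation measures; names are hypothetical.

def filtering_recall(passed_pairs, annotated_pairs):
    # Both arguments are sets of (doc_id, entity) tuples; recall is the
    # fraction of annotated vital/relevant pairs that survive filtering.
    if not annotated_pairs:
        return 0.0
    return len(passed_pairs & annotated_pairs) / len(annotated_pairs)

def max_f(precision_recall_pairs):
    # Max-F: the best F1 over all relevance cut-offs.
    return max((2 * p * r / (p + r) for p, r in precision_recall_pairs
                if p + r > 0), default=0.0)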
 
@@ -382,8 +382,8 @@ F-score obtained over all relevance cut-offs.
 \section{Literature Review} \label{sec:lit}
 
-There has been a great deal of interest  as of late on entity-based filtering and ranking.  The Text Analysis Conference   started  Knwoledge Base Population with the goal of developing methods and technologies to fascilitate the creation and population of KBs \cite{ji2011knowledge}. The most relevant track in KBP is entity-linking: given an entity and
-a document containing a mention of the entity, identify the mention in the document and link it to the its profile in a KB.  Many studies have attempted to address this task \cite{dalton2013neighborhood, dredze2010entity, davis2012named}. 
+There has been a great deal of interest as of late in entity-based filtering and ranking. The Text Analysis Conference started Knowledge Base Population (KBP) with the goal of developing methods and technologies to facilitate the creation and population of KBs \cite{ji2011knowledge}. The most relevant track in KBP is entity linking: given an entity and
+a document containing a mention of the entity, identify the mention in the document and link it to its profile in a KB. Many studies have attempted to address this task \cite{dalton2013neighborhood, dredze2010entity, davis2012named}. 
 
 A more recent manifestation of that is the introduction of TREC KBA in 2012. Following that, a number of research works have been done on the topic \cite{frank2012building, ceccarelli2013learning, taneva2013gem, wang2013bit, balog2013multi}. These works are based on the KBA 2012 task and dataset, and they address the whole problem of entity filtering and ranking. TREC KBA continued in 2013, but the task underwent some changes. The main changes between 2012 and 2013 are in the number of entities, the type of entities, the corpus, and the relevance rankings.
 
 
@@ -559,7 +559,7 @@ these, 24162 unique document-entity pairs are vital (9521) or relevant
 The upper part of Table \ref{tab:name} shows the recall performances on the cleansed version and the lower part on the raw version. The recall performances for all entity types increase substantially in the raw version. Recall increases on Wikipedia entities vary from 8.2 to 12.8 percentage points, and on Twitter entities from 6.8 to 26.2. Over all entities, the increase ranges from 8.0 to 13.6. These increases are substantial: to put them into perspective, an 11.8-point increase in recall on all entities amounts to retrieving 2864 more unique document-entity pairs. %This suggests that cleansing has removed some documents that we could otherwise retrieve. 
 
 \subsection{Entity Profiles}
-If we look at the recall performances for the raw corpus,   filtering documents by canonical names achieves a recall of  59\%.  Adding other name variants  improves the recall to 79.8\%, an increase of 20.8\%. This means  20.8\% of documents mentioned the entities by other names  rather than by their canonical names. Canonical partial  achieves a recall of 72\%  and name-variant partial achives 90.2\%. This says that 18.2\% of documents mentioned the entities by  partial names of other non-canonical name variants. 
+If we look at the recall performances for the raw corpus, filtering documents by canonical names achieves a recall of 59\%. Adding other name variants improves the recall to 79.8\%, an increase of 20.8 percentage points. This means 20.8\% of documents mentioned the entities by names other than their canonical names. Canonical partial achieves a recall of 72\% and name-variant partial achieves 90.2\%. This says that 18.2\% of documents mentioned the entities by partial names of non-canonical name variants. 
 
 %\begin{table*}
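
The four profile variants this hunk compares (canonical, canonical-partial, name-variant, name-variant-partial) can be made concrete with a sketch like the one below; the tokenization and the list-based inputs are assumptions for illustration, not the paper's exact method.

# Building the four entity-profile variants; illustrative only.

def partial_names(names):
    # All single tokens of each name, e.g. "Lewis and Clark Landing"
    # -> {"lewis", "and", "clark", "landing"}.
    return {tok for name in names for tok in name.lower().split()}

def build_profile(canonical, variants, kind):
    # canonical: str; variants: list of alternative surface forms.
    if kind == "canonical":
        return {canonical.lower()}
    if kind == "canonical-partial":
        return partial_names([canonical])
    if kind == "name-variant":
        return {n.lower() for n in [canonical] + variants}
    if kind == "name-variant-partial":
        return partial_names([canonical] + variants)
    raise ValueError(kind)

Each variant feeds the same matching predicate; recall grows as the profile admits more surface forms, since partial matching relaxes exact-name matching and name variants extend the canonical name.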
 
@@ -794,7 +794,7 @@ Finally,  we select  the 5 most frequent n-grams for each context.
 
 Features we use include similarity features such as cosine and Jaccard; document-entity features such as whether the document mentions the entity in its title or body, and the frequency of mention; and related-entity features such as PageRank scores.
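
The three feature families named above might look like the following sketch; the draft paragraph does not fully specify the feature set, so the names and details here are placeholders.

# Placeholder sketch of the three feature families named above.
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def jaccard(a, b):
    # Jaccard overlap between two term sets.
    return len(a & b) / len(a | b) if a | b else 0.0

def features(doc_title, doc_body, kb_profile_text, entity_name, pagerank_score):
    doc = Counter((doc_title + " " + doc_body).lower().split())
    kb = Counter(kb_profile_text.lower().split())
    name = entity_name.lower()
    return {
        "cosine": cosine(doc, kb),                      # similarity
        "jaccard": jaccard(set(doc), set(kb)),          # similarity
        "mention_in_title": name in doc_title.lower(),  # document-entity
        "mention_in_body": name in doc_body.lower(),    # document-entity
        "mention_frequency": doc_body.lower().count(name),
        "pagerank": pagerank_score,                     # related-entity
    }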
 
   
 
 Here, we present results showing how the choices in corpus, entity types, and entity profiles impact these later stages of the pipeline. In tables \ref{tab:class-vital} and \ref{tab:class-vital-relevant}, we show the performances in max-F. 
 
 \begin{table*}
 \caption{vital performance under different name variants (upper part from cleansed, lower part from raw)}
 
@@ -1019,10 +1019,10 @@ removes the related links and adverts which may contain a mention of
 the entities. One example we saw was that the cleansing removed an
 image with the text of an entity name which was actually relevant. And
 that it removes social documents can be explained by the fact that
-most of the missing of the missing  docuemnts from cleansed are
-social. And all the docuemnts that are missing from raw corpus
+most of the missing documents from the cleansed corpus are
+social. And all the documents that are missing from the raw corpus are
 social. So in both cases social documents seem to suffer from the text
-transformation and cleasing processes. 
+transformation and cleansing processes. 
 
 %%%% NEEDS WORK:
 
 
@@ -1120,7 +1120,7 @@ We observed that there are vital-relevant documents that we miss from raw only,
 Avoiding duplicates, we randomly selected 35 documents, one for each entity. The documents are 13 news and 22 social. Below, we classify the situations under which a document can be vital for an entity without mentioning the entity under any of the entity profiles we used for filtering. 
 
 \paragraph*{Outgoing link mentions} A post (tweet) with an outgoing link which mentions the entity.
-\paragraph*{Event place - Event} A document that talks about an event is vital to the location entity where it takes place.  For example Maha Music Festival takes place in Lewis and Clark\_Landing, and a document talking about the festival is vital for the park. There are also cases where an event's address places the event in a park and due to that the document becomes vital to the park. This is basically being mentioned by address which belongs to alarger space. 
+\paragraph*{Event place - Event} A document that talks about an event is vital to the location entity where it takes place. For example, the Maha Music Festival takes place in Lewis and Clark\_Landing, and a document talking about the festival is vital for the park. There are also cases where an event's address places the event in a park and, due to that, the document becomes vital to the park. This is basically being mentioned by an address which belongs to a larger space. 
 \paragraph*{Entity - related entity} A document about an important figure such as an artist or athlete can be vital to another. This is especially true if the two are contending for the same title, or one has snatched a title or award from the other. 
 \paragraph*{Organization - main activity} A document that talks about an area in which the company is active is vital for the organization. For example, Atacocha is a mining company, and a news item on mining waste was annotated vital. 
 \paragraph*{Entity - group} If an entity belongs to a certain group (class), a news item about the group can be vital for the individual members. FrankandOak is named an innovative company, and a news item that talks about the group of innovative companies is relevant for it. Other examples are big events to which an entity is related, such as film awards for actors. 