diff --git a/mypaper-final.tex b/mypaper-final.tex
index 61bd5686a27402d0906a8032318542d754c6f54e..7fb00c4c656d67b782b1d11b69f111de1240104d 100644
--- a/mypaper-final.tex
+++ b/mypaper-final.tex
@@ -129,7 +129,7 @@ relevance of the document-entity pair under consideration.
 We analyze how these factors (and the design choices made in their corresponding system components) affect filtering performance. We identify and characterize the relevant documents that do not pass
-the filtering stage by examing their contents. This way, we
+the filtering stage by examining their contents. This way, we
 estimate a practical upper-bound of recall for entity-centric stream filtering.
@@ -208,7 +208,7 @@ of recall on entity-based filtering.
 The rest of the paper is organized as follows. Section \ref{sec:desc} describes the dataset and section \ref{sec:fil} defines the task. In
-section \ref{sec:lit}, we discuss related literature folowed by a
+section \ref{sec:lit}, we discuss related literature followed by a
 discussion of our method in \ref{sec:mthd}. Following that, we present the experimental results in \ref{sec:expr}, and discuss and analyze them in \ref{sec:analysis}. Towards the end, we discuss the
@@ -300,7 +300,7 @@ that can be useful for initial KB-dossier are annotated as
 %broken down by source categories and relevance rank% (vital, or
 %relevant).
 In total, the dataset contains 24162 unique entity-document
-pairs, vital or relevant; 9521 of these have been labelled as vital,
+pairs, vital or relevant; 9521 of these have been labeled as vital,
 and the remaining 17424 as relevant. All documents are categorized into 8 source categories: 0.98\% arxiv(a), 0.034\% classified(c), 0.34\% forum(f), 5.65\% linking(l),
@@ -334,7 +334,7 @@ grouped as others.
 its performance. The filtering stage of entity-based stream filtering and ranking can be likened to the adaptive filtering task of the filtering track. The persistent information needs are the KB
- entities, and the relevance judgments are the small number of postive
+ entities, and the relevance judgments are the small number of positive
 examples. Stream filtering is then the task of, given a stream of documents of news items, blogs
@@ -359,18 +359,18 @@ Stream filtering is then the task to, given a stream of documents of news items,
 \end{enumerate}
 The TREC filtering and the filtering as part of the entity-centric
-stream filtering and ranking pipepline have different purposes. The
+stream filtering and ranking pipeline have different purposes. The
 TREC filtering track's goal is the binary classification of documents:
-for each incoming docuemnt, it decides whether the incoming document
-is relevant or not for a given profile. The docuemnts are either
+for each incoming document, it decides whether the incoming document
+is relevant or not for a given profile. The documents are either
 relevant or not. In our case, the documents have relevance rankings, and the goal of the filtering stage is to filter as many potentially relevant documents as possible, but as few irrelevant documents as
-possible not to obfuscate the later stages of the piepline. Filtering
+possible, so as not to obfuscate the later stages of the pipeline. Filtering
 as part of the pipeline needs that delicate balance between retrieving
-relavant documents and irrrelevant documensts. Bcause of this,
+relevant documents and irrelevant documents. Because of this,
 filtering in this case can only be studied by binding it to the later
-stages of the entity-centric pipeline. This bond influnces how we do
+stages of the entity-centric pipeline. This bond influences how we do
 evaluation. To achieve this, we use recall percentages in the filtering stage for
@@ -382,8 +382,8 @@ F-score obtained over all relevance cut-offs.
 \section{Literature Review}
 \label{sec:lit}
-There has been a great deal of interest as of late on entity-based filtering and ranking. The Text Analysis Conference started Knwoledge Base Population with the goal of developing methods and technologies to fascilitate the creation and population of KBs \cite{ji2011knowledge}. The most relevant track in KBP is entity-linking: given an entity and
-a document containing a mention of the entity, identify the mention in the document and link it to the its profile in a KB. Many studies have attempted to address this task \cite{dalton2013neighborhood, dredze2010entity, davis2012named}.
+There has been a great deal of interest of late in entity-based filtering and ranking. The Text Analysis Conference started Knowledge Base Population with the goal of developing methods and technologies to facilitate the creation and population of KBs \cite{ji2011knowledge}. The most relevant track in KBP is entity-linking: given an entity and
+a document containing a mention of the entity, identify the mention in the document and link it to its profile in a KB. Many studies have attempted to address this task \cite{dalton2013neighborhood, dredze2010entity, davis2012named}.
 A more recent manifestation of that is the introduction of TREC KBA in 2012. Following that, there have been a number of studies on the topic \cite{frank2012building, ceccarelli2013learning, taneva2013gem, wang2013bit, balog2013multi}. These works are based on the KBA 2012 task and dataset, and they address the whole problem of entity filtering and ranking. TREC KBA continued in 2013, but the task underwent some changes. The main changes between 2012 and 2013 are in the number of entities, the type of entities, the corpus, and the relevance rankings.
@@ -559,7 +559,7 @@ these, 24162 unique document-entity pairs are vital (9521) or relevant
 The upper part of Table \ref{tab:name} shows the recall performances on the cleansed version and the lower part on the raw version. The recall performances for all entity types increase substantially in the raw version. Recall increases on Wikipedia entities vary from 8.2 to 12.8 points, and on Twitter entities from 6.8 to 26.2. Over all entities, the increase ranges from 8.0 to 13.6. To put this into perspective, an 11.8-point increase in recall on all entities corresponds to retrieving 2864 more unique document-entity pairs. %This suggests that cleansing has removed some documents that we could otherwise retrieve.
 \subsection{Entity Profiles}
-If we look at the recall performances for the raw corpus, filtering documents by canonical names achieves a recall of 59\%. Adding other name variants improves the recall to 79.8\%, an increase of 20.8\%. This means 20.8\% of documents mentioned the entities by other names rather than by their canonical names. Canonical partial achieves a recall of 72\% and name-variant partial achives 90.2\%. This says that 18.2\% of documents mentioned the entities by partial names of other non-canonical name variants.
+If we look at the recall performances for the raw corpus, filtering documents by canonical names achieves a recall of 59\%. Adding other name variants improves the recall to 79.8\%, an increase of 20.8\%. This means 20.8\% of documents mentioned the entities by other names rather than by their canonical names. Canonical partial achieves a recall of 72\% and name-variant partial achieves 90.2\%. This says that 18.2\% of documents mentioned the entities by partial names of non-canonical name variants.
 %\begin{table*}
@@ -794,7 +794,7 @@ Finally, we select the 5 most frequent n-grams for each context.
- Features we use incude similarity features such as cosine and jaccard, document-entity features such as docuemnt mentions entity in title, in body, frequency of mention, etc., and related entity features such as page rank scores. In total we sue The features consist of similarity measures between the KB entiities profile text, document-entity features such as
+ Here, we present results showing how the choices of corpus, entity types, and entity profiles impact these later stages of the pipeline. In Tables \ref{tab:class-vital} and \ref{tab:class-vital-relevant}, we show the performance in max-F.
 \begin{table*}
 \caption{Vital performance under different name variants (upper part from cleansed, lower part from raw)}
@@ -1019,10 +1019,10 @@ removes the related links and adverts which may contain a mention of
 the entities. One example we saw was that the cleansing removed an image with text containing an entity name, which was actually relevant. That it removes social documents can be explained by the fact that
-most of the missing of the missing docuemnts from cleansed are
-social. And all the docuemnts that are missing from raw corpus
+most of the documents missing from the cleansed corpus are
+social, and all the documents missing from the raw corpus are
 social. So in both cases, social documents seem to suffer from text
-transformation and cleasing processes.
+transformation and cleansing processes.
 %%%% NEEDS WORK:
@@ -1120,7 +1120,7 @@ We observed that there are vital-relevant documents that we miss from raw only,
 Avoiding duplicates, we randomly selected 35 documents, one for each entity. The documents are 13 news and 22 social. Below, we classify the situations under which a document can be vital for an entity without mentioning the entity by any of the entity profiles we used for filtering.
 \paragraph*{Outgoing link mentions} A post (tweet) with an outgoing link which mentions the entity.
-\paragraph*{Event place - Event} A document that talks about an event is vital to the location entity where it takes place. For example Maha Music Festival takes place in Lewis and Clark\_Landing, and a document talking about the festival is vital for the park. There are also cases where an event's address places the event in a park and due to that the document becomes vital to the park. This is basically being mentioned by address which belongs to alarger space.
+\paragraph*{Event place - Event} A document that talks about an event is vital to the location entity where it takes place. For example, Maha Music Festival takes place in Lewis and Clark\_Landing, and a document talking about the festival is vital for the park. There are also cases where an event's address places the event in a park, and due to that the document becomes vital to the park. This is basically being mentioned by an address which belongs to a larger space.
 \paragraph*{Entity - related entity} A document about an important figure such as an artist or athlete can be vital to another. This is especially true if the two are contending for the same title, or if one has snatched a title or award from the other.
 \paragraph*{Organization - main activity} A document that talks about an area in which the company is active is vital for the organization. For example, Atacocha is a mining company and a news item on mining waste was annotated vital.
 \paragraph*{Entity - group} If an entity belongs to a certain group (class), a news item about the group can be vital for the individual members. FrankandOak is named an innovative company, and a news item that talks about the group of innovative companies is relevant for it. Another example is a big event to which an entity is related, such as film awards for actors.
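To make the entity-profile comparison in the Entity Profiles subsection concrete, here is a minimal sketch of the four filtering strategies whose recall it reports (canonical, canonical partial, name variant, and name-variant partial). The function and parameter names, the tokenization, and the substring matching are illustrative assumptions, not the exact KBA implementation.

\begin{verbatim}
# Illustrative sketch of the four entity-profile matching strategies;
# tokenization and case folding are assumptions, not the exact system.

def name_tokens(names):
    """Partial names: the individual tokens of each full name."""
    return {tok for name in names for tok in name.lower().split()}

def matches(doc_text, canonical, variants, strategy):
    """True if doc_text mentions the entity under the given strategy."""
    text = doc_text.lower()
    if strategy == "canonical":             # exact canonical name
        return canonical.lower() in text
    if strategy == "canonical-partial":     # any token of the canonical name
        return any(tok in text for tok in name_tokens([canonical]))
    all_names = [canonical] + list(variants)
    if strategy == "name-variant":          # any full name variant
        return any(name.lower() in text for name in all_names)
    if strategy == "name-variant-partial":  # any token of any name variant
        return any(tok in text for tok in name_tokens(all_names))
    raise ValueError("unknown strategy: " + strategy)
\end{verbatim}

Under this reading, the step from 59\% (canonical) to 90.2\% (name-variant partial) recall corresponds to documents that only ever mention an alias or a partial name, which exact canonical matching cannot retrieve.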
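The later pipeline stages are evaluated by the maximum F-score over all relevance cut-offs (max-F). A minimal sketch of that sweep, assuming hypothetical inputs of per-document confidence scores from the pipeline and a set of judged-relevant documents, could look like this:

\begin{verbatim}
def max_f(scored_docs, relevant):
    """Best F1 over all confidence cut-offs.

    scored_docs: list of (doc_id, score) pairs from the ranking stage.
    relevant:    set of doc_ids judged vital (or vital + relevant).
    """
    best = 0.0
    for cutoff in sorted({score for _, score in scored_docs}):
        retrieved = {d for d, s in scored_docs if s >= cutoff}
        true_pos = len(retrieved & relevant)
        if not true_pos:
            continue
        precision = true_pos / len(retrieved)
        recall = true_pos / len(relevant)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
\end{verbatim}

Note how the recall ceiling from the filtering stage carries through: documents that never pass the filter cannot appear among the scored documents, so max-F in the later stages is bounded by filtering recall.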