HCDA/cikm-paper Changeset - c8b8892c5424 · Centrum Wiskunde & Informatica (CWI)

@@ -210,11 +210,10 @@ compromising precision, describing and classifying relevant documents

that are not amenable to filtering, and estimating the upper bound

of recall on entity-based filtering.

The rest of the paper is is organized as follows:

The rest of the paper is organized as follows. Section \ref{sec:desc} describes the dataset and Section \ref{sec:fil} defines the task. In Section \ref{sec:rev}, we discuss related literature, followed by a discussion of our method in Section \ref{sec:mth}. Following that, we present the experimental results in Section \ref{sec:expr}, and discuss and analyze them in Section \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in Section \ref{sec:impact}, and examine and categorize unfilterable documents in Section \ref{sec:unfil}. Finally, we present our conclusions in Section \ref{sec:conc}.

\textbf{TODO!!}

 \section{Data Description}

 \section{Data Description}\label{sec:desc}

We base this analysis on the TREC-KBA 2013 dataset%

\footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}}

that consists of three main parts: a time-stamped stream corpus, a set of

@@ -307,7 +306,7 @@ relevant annotations are social (social and weblog) (63.13\%). News

93\% of all annotations.  The rest make up about 7\% and are all

grouped as others.

 \section{Stream Filtering}

 \section{Stream Filtering}\label{sec:fil}

 The TREC Filtering track defines filtering as a ``system that sifts

 through stream of incoming information to find documents that are

@@ -368,7 +367,7 @@ pipeline performance we use the official TREC KBA evaluation metric

and scripts \cite{frank2013stream} to report max-F, the maximum

F-score obtained over all relevance cut-offs.
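
As a sketch of this metric, assuming the balanced F-score and writing $P(c)$ and $R(c)$ for precision and recall at relevance cut-off $c$, with $C$ the set of cut-offs (this notation is ours for illustration and is not taken from the official scripts):
\[
  \mbox{max-F} = \max_{c \in C} F(c), \qquad F(c) = \frac{2\,P(c)\,R(c)}{P(c) + R(c)} .
\]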

\section{Literature Review}

\section{Literature Review} \label{sec:rev}

Entity-based filtering and ranking has attracted a great deal of interest lately. One manifestation of that is the introduction of TREC KBA in 2012. Since then, a number of studies have addressed the topic \cite{frank2012building, ceccarelli2013learning, taneva2013gem, wang2013bit, balog2013multi}. These works are based on the KBA 2012 task and dataset, and they address the whole problem of entity filtering and ranking. TREC KBA continued in 2013, but the task underwent some changes. The main changes between 2012 and 2013 are in the number of entities, the type of entities, the corpus, and the relevance ratings.

The number of entities increased from 29 to 141, including 20 Twitter entities. The TREC KBA 2012 corpus is 1.9 TB after XZ compression and has 400M documents. By contrast, the KBA 2013 corpus is 6.45 TB after XZ compression and GPG encryption. A version with all non-English documents removed is 4.5 TB and consists of 1 billion documents. The 2013 corpus subsumed the 2012 corpus and added others from spinn3r, namely mainstream news, forum, arxiv, classified, reviews and meme-tracker. A more important difference, however, is a change in the definitions of the relevance ratings vital and relevant. While in KBA 2012 a document was judged vital if it had citation-worthy content for a given entity, in 2013 it must also have freshness, that is, its content must trigger an edit of the given entity's KB entry.

@@ -379,7 +378,7 @@ All of the studies used filtering as their first step to generate a smaller set

Moreover, there has been neither a study at this scale nor a study into what types of documents defy filtering and why. In this paper, we conduct a manual examination of the documents that are missed and classify them into different categories. We also estimate the general upper bound of recall using the different entity profiles, and choose the best profile, that is, the one that results in increased overall performance as measured by F-measure.

\section{Method}

\section{Method}\label{sec:mth}

All analyses in this paper are carried out on the documents that have

relevance assessments associated with them. For this purpose, we

extracted those documents from the big corpus. We experiment with all

@@ -508,7 +507,7 @@ these, 24162 unique document-entity pairs are vital (9521) or relevant

\section{Experiments and Results}

\section{Experiments and Results}\label{sec:expr}

 We conducted experiments to study the effect of cleansing, different entity profiles, types of entities, categories of documents, relevance ratings (vital or relevant), and the impact on classification. In the following subsections, we present and describe the results for each of these categories.

 \subsection{Cleansing: raw or cleansed}

@@ -714,7 +713,7 @@ entities, and 79.5\% on Twitter entities.

Section \ref{sec:analysis} discusses the most plausible explanations for these findings.

%% TODO: PERHAPS SUMMARY OF DISCUSSION HERE

\section{Impact on classification}

\section{Impact on classification}\label{sec:impact}

In the overall experimental setup, classification, ranking, and

evaluation are kept constant. Following the settings of

\cite{balog2013multi}, we use

@@ -1004,7 +1003,7 @@ Across document categories, we observe a pattern in recall of others, followed b

and name-variants bring in new relevant documents that cannot be retrieved by canonicals. The remaining two deltas are very small, suggesting that partial names of name variants do not bring in new relevant documents.

\section{Unfilterable documents}

\section{Unfilterable documents}\label{sec:unfil}

\subsection{Missing vital-relevant documents \label{miss}}

@@ -1062,11 +1061,11 @@ We observed that there are vital-relevant documents that we miss from raw only,

\paragraph*{head - organization} A document that talks about an organization of which the entity is the head can be vital for the entity. For example, Jasper\_Schneider is the USDA Rural Development state director for North Dakota, and an article about problems of primary health centers in North Dakota is judged vital for him.

\paragraph*{World Knowledge} Some connections are impossible to establish without world knowledge. For example, ``refreshments, treats, gift shop specials, `bountiful, fresh and fabulous holiday decor,' a demonstration of simple ways to create unique holiday arrangements for any home; free and open to the public'' is judged relevant to Hjemkomst\_Center. This is a social media post, and unless one knows the person posting it, nothing in the text reveals the connection to the entity. Similarly, ``learn about the gray wolf's hunting and feeding behaviors and watch the wolves have their evening meal of a full deer carcass; \$15 for members, \$20 for nonmembers'' is judged vital to Red\_River\_Zoo.

\paragraph*{No document content} A small number of documents were found to have no content.

\paragraph*{Disagreement} For a few remaining documents, the authors disagree with the assessors as to why these are vital to the entity.

\section{Conclusions}

\section{Conclusions} \label{sec:conc}

In this paper, we examined the filtering stage of entity-centric stream filtering and ranking by holding the later stages of the pipeline fixed. In particular, we studied the cleansing step, different entity profiles, types of entities (Wikipedia or Twitter), categories of documents (news, social, or others), and the relevance ratings. We attempted to address the following research questions: 1) does cleansing affect filtering and subsequent performance? 2) what is the most effective way of entity profiling? 3) is filtering different for Wikipedia and Twitter entities? 4) are some types of documents easily filterable and others not? 5) does a gain in recall at the filtering step translate to a gain in F-measure at the end of the pipeline? and 6) what are the circumstances under which vital documents cannot be retrieved?

Cleansing does remove parts or the entire contents of documents, making them irretrievable. However, because of the introduction of false positives, the recall gains from the raw corpus and from some richer entity profiles do not necessarily translate to an overall performance gain. The conclusion on this is mixed, in the sense that cleansing helps improve the recall on vital documents and Wikipedia entities, but reduces the recall on Twitter entities and on the relevant category of the relevance ratings. Vital and relevant documents differ in retrieval performance: vital documents are easier to filter than relevant ones.