HCDA/cikm-paper Changeset - 3034dd468026 · Centrum Wiskunde & Informatica (CWI)

@@ -120,36 +120,26 @@ web-scale corpus of news and other relevant information sources that

may contain entity mentions into a working set of documents that should

be more manageable for the subsequent stages.

Nevertheless, this step has a large impact on the maximum recall that

can be attained. Therefore, in this study, we focus on just

this filtering stage and conduct an in-depth analysis of the main design

decisions here: how to cleanse the noisy text obtained online,

the methods to create entity profiles, the

types of entities of interest, document type, and the grade of

relevance of the document-entity pair under consideration.

We analyze how these factors (and the design choices made in their

corresponding system components) affect filtering performance.

We identify and characterize the relevant documents that do not pass

the filtering stage by examining their contents. This way, we

estimate a practical upper bound of recall for entity-centric stream

filtering.

\end{abstract}

% A category with the (minimum) three required fields

\category{H.4}{Information Filtering}{Miscellaneous}

%A category including the fourth, optional field follows...

%\category{D.2.8}{Software Engineering}{Metrics}[complexity measures, performance measures]

\terms{Theory}

\keywords{Information Filtering; Cumulative Citation Recommendation; knowledge maintenance; Stream Filtering;  emerging entities} % NOT required for Proceedings

@@ -216,51 +206,24 @@ document category (social, news, etc) on the filtering components'

performance. The main contributions of the

paper are an in-depth analysis of the factors that affect entity-based

stream filtering, the identification of optimal entity profiles without

compromising precision, a description and classification of relevant documents

that are not amenable to filtering, and an estimate of the upper bound

of recall on entity-based filtering.

The rest of the paper is organized as follows:

\textbf{TODO!!}

 \section{Data Description}

We base this analysis on the TREC-KBA 2013 dataset%

\footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}}

that consists of three main parts: a time-stamped stream corpus, a set of

KB entities to be curated, and a set of relevance judgments. A CCR

system has to identify, for each KB entity, which documents in the

stream corpus are to be considered by the human curator.

\subsection{Stream corpus} The stream corpus comes in two versions:

raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB

respectively, after xz-compression and GPG encryption. The raw data

is a dump of raw HTML pages. The cleansed version is the raw data

after its HTML tags are stripped off and only English documents

@@ -268,25 +231,24 @@ identified with Chromium Compact Language Detector

\footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}

are included. The stream corpus is organized in hourly folders, each

of which contains many chunk files. Each chunk file contains between

hundreds and hundreds of thousands of serialized thrift objects. One

thrift object is one document. A document could be a blog article, a

news article, or a social media post (including tweets). The stream

corpus comes from three sources: TREC KBA 2012 (social, news and

linking)\footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}},

arxiv\footnote{\url{http://arxiv.org/}}, and

spinn3r\footnote{\url{http://spinn3r.com/}}.

Table \ref{tab:streams} shows the sources, the number of hourly

directories, and the number of chunk files.

\begin{table}

\caption{Number of documents and chunk files per sub-stream}

\begin{center}

 \begin{tabular}{l*{4}{l}l}

 documents     &   chunk files    &    Sub-stream \\

\hline

126,952         &11,851         &arxiv \\

394,381,405      &   688,974        & social \\

134,933,117       &  280,658       &  news \\

5,448,875         &12,946         &linking \\

@@ -449,76 +411,67 @@ Redirect  &49 \\

 Birth Name &6\\

 Nickname &1 \\

 Alias &1 \\

 Alternative Names &4\\

\hline

\end{tabular}

\end{center}

\label{tab:sources}

\end{table}

The collection contains a total of 121 Wikipedia entities.

Every entity has a corresponding DBpedia label. Only 82 entities have

a name string and only 49 entities have redirect strings. (Most of the

entities have only one string, except for a few cases with multiple

redirect strings; Buddy\_MacKay has the highest number (12) of

redirect strings.) Six entities have birth names, one entity has a

nickname, one entity has an alias, and four entities have alternative

names.

We combine the different name variants we extracted to form a set of

strings for each KB entity. For Twitter entities, we use the display

names that we collected. We consider the names of the entities that

are part of the URL as canonical. For example, for the entity\\

\url{http://en.wikipedia.org/wiki/Benjamin_Bronfman}\\

Benjamin Bronfman is the canonical name.

From the combined name variants and

the canonical names, we create four sets of profiles for each

entity: canonical (cano), canonical partial (cano-part), all name

variants combined (all), and partial names of all name

variants (all-part). We refer to the last two profiles as name-variant

and name-variant partial. The names in parentheses are used in table

captions. Table \ref{tab:profile} shows example profiles.

\begin{table*}

\caption{Example entity profiles (left column Wikipedia, right column Twitter)}

\begin{center}

\begin{tabular}{l*{3}{c}}

 &Wikipedia&Twitter \\

\hline

 &Benjamin\_Bronfman& roryscovel\\

  cano&[Benjamin Bronfman] &[roryscovel]\\

  cano-part &[Benjamin, Bronfman]&[roryscovel]\\

  all&[Ben Brewer, Benjamin Zachary Bronfman] &[Rory Scovel] \\

  all-part& [Ben, Brewer, Benjamin, Zachary, Bronfman]&[Rory, Scovel]\\

   \hline

\end{tabular}

\end{center}

\label{tab:breakdown}

\label{tab:profile}

\end{table*}

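The construction of the four profiles can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names are hypothetical, the canonical name is assumed to be the last path segment of the entity URL, and the all profile is assumed to hold only the extracted name variants, as in the example profile table.

```python
from urllib.parse import unquote, urlparse


def canonical_name(entity_url):
    """Take the last URL path segment as the canonical name (underscores become spaces)."""
    segment = urlparse(entity_url).path.rstrip("/").split("/")[-1]
    return unquote(segment).replace("_", " ")


def partial_names(names):
    """Split names into whitespace-separated tokens, dropping repeats but keeping order."""
    tokens = []
    for name in names:
        for token in name.split():
            if token not in tokens:
                tokens.append(token)
    return tokens


def build_profiles(entity_url, name_variants):
    """Return the four profiles (cano, cano-part, all, all-part) for one entity."""
    cano = [canonical_name(entity_url)]
    # Assumption: 'all' holds the extracted name variants only, without the canonical name.
    all_names = list(dict.fromkeys(name_variants))
    return {
        "cano": cano,
        "cano-part": partial_names(cano),
        "all": all_names,
        "all-part": partial_names(all_names),
    }


profiles = build_profiles(
    "http://en.wikipedia.org/wiki/Benjamin_Bronfman",
    ["Ben Brewer", "Benjamin Zachary Bronfman"],
)
for profile_name, strings in profiles.items():
    print(profile_name, strings)
```

For a Twitter entity, the same sketch would be called with the account URL and the collected display name as the only name variant.
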
\subsection{Annotation Corpus}

The annotation set is a combination of the annotations from before the Training Time Range (TTR) and Evaluation Time Range (ETR) and consists of 68,405 annotations. Its breakdown into training and test sets is shown in Table \ref{tab:breakdown}.

\begin{table}

\caption{Number of annotation documents with respect to different categories (relevance rating, training and testing)}

\begin{center}

\begin{tabular}{l*{3}{c}r}

 &&Vital&Relevant  &Total \\

\hline