HCDA/cikm-paper Changeset - e9656e726fbf · Centrum Wiskunde & Informatica (CWI)

@@ -27,13 +27,13 @@

% For tracking purposes - this is V3.1SP - APRIL 2009

\documentclass{acm_proc_article-sp}

\usepackage{booktabs}

\usepackage{multirow}

\usepackage{todonotes}

\usepackage{url}

\begin{document}

\title{Entity-Centric Stream Filtering and Ranking: Filtering and Unfilterable Documents

%SUGGESTION:

@@ -62,13 +62,13 @@

% the seventh etc. author(s) as the argument for the

% \additionalauthors command.

% These 'additional authors' will be output/set for you

% without further effort on your part as the last section in

% the body of your article BEFORE References or any Appendices.

\numberofauthors{2} %  in this sample file, there are a *total*

\numberofauthors{8} %  in this sample file, there are a *total*

% of EIGHT authors. SIX appear on the 'first-page' (for formatting

% reasons) and the remaining two appear in the \additionalauthors section.

% \author{

% % You can go ahead and credit any number of authors here,

% % e.g. one 'row of three' or two rows (consisting of one row of three

@@ -128,13 +128,13 @@ types of entities of interest, document type, and the grade of

relevance of the document-entity pair under consideration.

We analyze how these factors (and the design choices made in their

corresponding system components) affect filtering performance.

We identify and characterize the relevant documents that do not pass

the filtering stage by examining their contents. This way, we

estimate a practical upper-bound of recall for entity-centric stream

filtering.

\end{abstract}

% A category with the (minimum) three required fields

\category{H.4}{Information Filtering}{Miscellaneous}

%A category including the fourth, optional field follows...

@@ -212,37 +212,37 @@ of recall on entity-based filtering.

The rest of the paper is organized as follows:

\textbf{TODO!!}

 \section{Data Description}

We base this analysis on the TREC-KBA 2013 dataset%

\footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}}

that consists of three main parts: a time-stamped stream corpus, a set of

KB entities to be curated, and a set of relevance judgments. A CCR

system now has to identify for each KB entity which documents in the

stream corpus are to be considered by the human curator.

\subsection{Stream corpus} The stream corpus comes in two versions:

raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB

respectively,  after xz-compression and GPG encryption. The raw data

is a  dump of  raw HTML pages. The cleansed version is the raw data

after its HTML tags are stripped off and only English documents

identified with the Chromium Compact Language Detector

\footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}

are included.  The stream corpus is organized in hourly folders each

of which contains many  chunk files. Each chunk file contains between

hundreds and hundreds of thousands of serialized  thrift objects. One

thrift object is one document. A document could be a blog article, a

news article, or a social media post (including tweets).  The stream

corpus comes from three sources: TREC KBA 2012 (social, news and

linking) \footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}},

arXiv\footnote{\url{http://arxiv.org/}}, and

spinn3r\footnote{\url{http://spinn3r.com/}}.

Table \ref{tab:streams} shows the sources, the number of hourly

directories, and the number of chunk files.

\begin{table}

\caption{Number of documents and chunk files per sub-stream}

\begin{center}

 \begin{tabular}{l*{4}{l}l}

 documents     &   chunk files    &    Sub-stream \\

@@ -385,22 +385,22 @@ relevance assessments associated to them. For this purpose, we

extracted those documents from the big corpus. We experiment with all

KB entities. For each KB entity, we extract different name variants

from DBpedia and Twitter.

\subsection{Entity Profiling}

We build entity profiles for the KB entities of interest. We have two

types: Twitter and Wikipedia. Both types of entities were deliberately selected

by the track organisers to occur only sparsely and to be less documented.

For the Wikipedia entities, we fetch different name variants

from DBpedia: name, label, birth name, alternative names,

redirects, nickname, and alias.

These extraction results are summarized in Table

\ref{tab:sources}.

For the Twitter entities, we visit

their respective Twitter pages and fetch their display names.

\begin{table}

\caption{Number of different DBpedia name variants}

\begin{center}

 \begin{tabular}{l*{4}{c}l}

 Name variant& No. of strings  \\

@@ -417,35 +417,35 @@ Redirect  &49 \\

\end{tabular}

\end{center}

\label{tab:sources}

\end{table}

The collection contains a total of 121 Wikipedia entities.

Every entity has a corresponding DBpedia label.  Only 82 entities have

a name string and only 49 entities have redirect strings. (Most of the

entities have only one string, except for a few cases with multiple

redirect strings; Buddy\_MacKay has the highest number (12) of

redirect strings.)

We combine the different name variants we extracted to form a set of

strings for each KB entity. For Twitter entities, we used the display

names that we collected. We consider the names of the entities that

are part of the URL as canonical. For example, for the entity\\

\url{http://en.wikipedia.org/wiki/Benjamin_Bronfman}\\

Benjamin Bronfman is the canonical name of the entity.

An example is given in Table \ref{tab:profile}.

From the combined name variants and

the canonical names, we  created four sets of profiles for each

entity: canonical (cano), canonical partial (cano-part), all name

variants combined (all), and partial names of all name

variants (all-part). We refer to the last two profiles as name-variant

and name-variant partial. The names in parentheses are used in table

captions.
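
The four profiles can be restated compactly. The following is a sketch under one assumption that the text does not state explicitly: a partial name is taken to be any single whitespace-separated token of a name string. The symbols $c(e)$ (canonical name of entity $e$), $V(e)$ (set of all collected name variants of $e$), and $\mathrm{tok}(\cdot)$ (set of tokens of a string) are introduced here for illustration only.

\[
\begin{array}{rcl}
P_{\mbox{\scriptsize cano}}(e) & = & \{ c(e) \} \\
P_{\mbox{\scriptsize cano-part}}(e) & = & \mathrm{tok}(c(e)) \\
P_{\mbox{\scriptsize all}}(e) & = & V(e) \\
P_{\mbox{\scriptsize all-part}}(e) & = & \bigcup_{v \in V(e)} \mathrm{tok}(v)
\end{array}
\]

Under this reading, $\mathrm{tok}(\mbox{Benjamin Bronfman}) = \{\mbox{Benjamin}, \mbox{Bronfman}\}$, and a one-word canonical name yields identical cano and cano-part profiles.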

\begin{table*}

\caption{Example entity profiles (upper part Wikipedia, lower part Twitter)}

\begin{center}

\begin{tabular}{l*{3}{c}}

 &Wikipedia&Twitter \\

@@ -497,17 +497,17 @@ The annotation set is a combination of the annotations from before the Training

%Most (more than 80\%) of the annotation documents are in the test set.

The 2013 training and test data contain 68405

annotations, of which 50688 are unique document-entity pairs.   Out of

these, 24162 unique document-entity pairs are vital (9521) or relevant

(17424).

\section{Experiments and Results}

 We conducted experiments to study the effect of cleansing, different entity profiles, entity types, document categories, and relevance ratings (vital or relevant), as well as the impact on classification. In the following subsections, we present and describe the results for each of these factors.

@@ -614,41 +614,41 @@ If we look at the recall performances for the raw corpus,   filtering documents

 \subsection{ Relevance Rating: vital and relevant}

When comparing recall for vital and relevant, we observe that

canonical names are more effective for vital than for relevant

entities, in particular for the Wikipedia entities.

%For example, the recall for news is 80.1 and for social is 76, while the corresponding recall in relevant is 75.6 and 63.2 respectively.

We conclude that the most relevant documents mention the

entities by their common name variants.

%  \subsection{Difference by document categories}

%  Generally, there is greater variation in relevant rank than in vital. This is specially true in most of the Delta's for Wikipedia. This  maybe be explained by news items referring to  vital documents by a some standard name than documents that are relevant. Twitter entities show greater deltas than Wikipedia entities in both vital and relevant. The greater variation can be explained by the fact that the canonical name of Twitter entities retrieves very few documents. The deltas that involve canonical names of Twitter entities, thus, show greater deltas.

% If we look in recall performances, In Wikipedia entities, the order seems to be others, news and social. This means that others achieve a higher recall than news than social.  However, in Twitter entities, it does not show such a strict pattern. In all, entities also, we also see almost the same pattern of other, news and social.

\subsection{Recall across document categories: others, news and social}

The recall for Wikipedia entities in Table \ref{tab:name} ranges from

61.8\% (canonicals) to 77.9\% (name-variants).  Table

\ref{tab:source-delta} shows how recall is distributed across document

categories. For Wikipedia entities, across all entity profiles, the others

category has the highest recall, followed by news and then social.  While the

recall for news ranges from 76.4\% to 98.4\%, the recall for social

documents ranges from 65.7\% to 86.8\%. For Twitter entities, however,

the pattern is different. For canonicals (and their partials), social

documents achieve higher recall than news.

%This indicates that social documents refer to Twitter entities by their canonical names (user names) more than news do. In name- variant partial, news achieve better results than social. The difference in recall between canonicals and name-variants show that news do not refer to Twitter entities by their user names, they refer to them by their display names.

Overall, across all entity types and all entity profiles, documents

in the others category achieve a higher recall than news, and news documents, in turn, achieve higher recall than social documents.

% This suggests that social documents are the hardest  to retrieve.  This  makes sense since social posts such as tweets and blogs are short and are more likely to point to other resources, or use short informal names.

%%NOTE TABLE REMOVED:\\\\

@@ -669,65 +669,65 @@ in the others category achieve a higher recall than news, and news documents, in

% of all name variants and partials of canonicals (93\%). delta. Both

% of them are for news category.  For Wikipedia entities, the highest

% delta observed is 19.5\% in cano\_part - cano followed by 17.5\% in

% all\_part in relevant.

  \subsection{Entity Types: Wikipedia and Twitter}

Table \ref{tab:name} summarizes the differences between Wikipedia and

Twitter entities.  Wikipedia entities' canonical representation

achieves a recall of 70\%, while canonical partial achieves a recall of 86.1\%. This is an

increase in recall of 16.1 percentage points. By contrast, the increase in recall of

name-variant partial over name-variant is 8.3 percentage points.
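
To make the comparison explicit, recall can be read here in the standard way, as the fraction of relevant document-entity pairs that the filter retrieves. The set symbols $R$ (relevant pairs) and $S$ (retrieved pairs) are notation introduced here for illustration, not taken from the track definition.

\[
\mathrm{recall} = \frac{|R \cap S|}{|R|}, \qquad
86.1 - 70.0 = 16.1, \qquad
\frac{16.1}{70.0} \approx 0.23 .
\]

The middle term is the absolute gain of canonical partial over canonical; the last ratio expresses the same gain as a relative improvement of about 23\% over the canonical profile.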

%This high increase in recall when moving from canonical names to their

%partial names, in comparison to the lower increase when moving from

%all name variants to their partial names can be explained by

%saturation: documents have already been extracted by the different

%name variants and thus using their partial names do not bring in many

%new relevant documents.

For Wikipedia entities, canonical

partial achieves better recall than name-variant in both the cleansed and

the raw corpus.  %In the raw extraction, the difference is about 3.7.

For Twitter entities, the recall of canonical matching is very low.%

\footnote{Canonical

and canonical partial are the same for Twitter entities because they

are one-word strings. For example, in \url{https://twitter.com/roryscovel},

``roryscovel'' is the canonical name and its partial is identical.}

%The low recall is because the canonical names of Twitter entities are

%not really names; they are usually arbitrarily created user names. It

%shows that  documents  refer to them by their display names, rarely

%by their user name, which is reflected in the name-variant recall

%(67.9\%). The use of name-variant partial increases the recall to

%88.2\%.

Tables \ref{tab:name} and \ref{tab:source-delta} show a higher recall

for Wikipedia than for Twitter entities. Generally, at both

aggregate and document category levels, we observe that recall

increases as we move from canonicals to canonical partial, to

name-variant, and to name-variant partial. The only case where this

does not hold is in the transition from Wikipedia's canonical partial

to name-variant. At the aggregate level (as can be inferred from Table

\ref{tab:name}), the difference in performance between  canonical  and

name-variant partial is 31.9\% on all entities, 20.7\% on Wikipedia

entities, and 79.5\% on Twitter entities.

Section \ref{sec:analysis} discusses the most plausible explanations for these findings.

%% TODO: PERHAPS SUMMARY OF DISCUSSION HERE

\section{Impact on classification}

In the overall experimental setup, classification, ranking, and

evaluation are kept constant. Following the settings of \cite{balog2013multi},

we use the Random Forest classifier from

WEKA\footnote{\url{http://www.cs.waikato.ac.nz/~ml/weka/}}.

However, we use a smaller set of features, which we

found to be more effective. We determined this

by running the classification algorithm with both our smaller

feature set and their features. Our feature

implementations achieved better results.  We use 13 features

in total, which are listed below.

\paragraph*{Google's Cross Lingual Dictionary (GCLD)}

This is a mapping of strings to Wikipedia concepts and vice versa

\cite{spitkovsky2012cross}.

(1) the probability with which a string is used as anchor text to

@@ -958,13 +958,15 @@ There is a trade-off between using a richer entity-profile and retrieval of irre

%%%%%%%%%%%%

In vital ranking, across all entity profiles and types of corpus, Wikipedia's canonical partial achieves better performance than any other Wikipedia entity profile. For vital-relevant documents too, Wikipedia's canonical partial achieves the best result. In the raw corpus, however, it achieves slightly less than name-variant partial. For Twitter entities, the name-variant partial profile achieves the highest F-score across all entity profiles and types of corpus.

There are three interesting observations:

1) Cleansing impacts Twitter

entities and relevant documents.  This is validated by the

observation that recall gains in Twitter entities and the relevant

categories in the raw corpus also translate into overall performance

gains. This observation implies that cleansing removes more relevant and

social documents than it does vital and news documents. That it removes relevant

documents more than vital ones can be explained by the fact that cleansing

@@ -976,13 +978,13 @@ most of the missing documents from cleansed are

social. And all the documents that are missing from the raw corpus are

social. So in both cases social documents seem to suffer from text

transformation and cleansing processes.

%%%% NEEDS WORK:

2) Taking both performance measures (recall at filtering and overall F-score

during evaluation) into account, there is a clear trade-off between using a richer entity profile and retrieving irrelevant documents. The richer the profile, the more relevant documents it retrieves, but also the more irrelevant documents. To put this into perspective, let us compare the number of documents that are retrieved with canonical partial and with name-variant partial. Using the raw corpus, the former retrieves a total of 2547487 documents and achieves a recall of 72.2\%. By contrast, the latter retrieves a total of 4735318 documents and achieves a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18\%. The rest of the documents, that is 67.9\%, are newly introduced irrelevant documents.
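
The figures in this comparison follow directly from the two document counts and the two recall values. The worked arithmetic below only restates those numbers and introduces no new measurements.

\[
\frac{4735318 - 2547487}{2547487} = \frac{2187831}{2547487} \approx 0.859, \qquad
90.2 - 72.2 = 18.0, \qquad
85.9 - 18.0 = 67.9 .
\]

The first term is the relative growth of the retrieved set, the second is the recall gain in percentage points, and the third is the difference of the two, from which the 67.9\% figure is obtained.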

Wikipedia's canonical partial is the best entity profile for Wikipedia entities. It is interesting that the retrieval of thousands of vital-relevant document-entity pairs by name-variant partial does not translate into an increase in overall performance. It is even more interesting because, to the best of our knowledge, none of the participants considered canonical partial as a contending profile for stream filtering. With this understanding, there is no need to fetch different name variants from DBpedia, which saves time and computational resources.

%%%%%%%%%%%%