Changeset - 7babd96c03f9
[Not reviewed]
0 1 0
Gebrekirstos Gebremeskel - 11 years ago 2014-06-12 05:21:03
destinycome@gmail.com
updated
1 file changed with 70 insertions and 68 deletions:
0 comments (0 inline, 0 general)
mypaper-final.tex
 
@@ -196,29 +196,29 @@ sub-components of the pipeline.
 
In this paper, we therefore fix the subsequent steps of the pipeline,
 
and zoom in on \emph{only} the filtering step; and conduct an in-depth analysis of its
 
main components.  In particular, we study the effect of cleansing,
 
entity profiling, type of entity filtered for (Wikipedia or Twitter), and
 
document category (social, news, etc.) on the filtering components'

performance. The main contributions of the paper are an in-depth analysis of the factors that affect entity-based stream filtering, the identification of optimal entity profiles without compromising precision, a description and classification of relevant documents that are not amenable to filtering, and an estimate of the upper bound of recall for entity-based filtering.
 
 
The rest of the paper is organized as follows. Section \ref{sec:desc} describes the dataset and Section \ref{sec:fil} defines the task. In Section \ref{sec:lit}, we discuss related literature, followed by a discussion of our method in Section \ref{sec:mthd}. Following that, we present the experimental results in Section \ref{sec:expr}, and discuss and analyze them in Section \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in Section \ref{sec:impact}, and examine and categorize unfilterable documents in Section \ref{sec:unfil}. Finally, we present our conclusions in Section \ref{sec:conc}.
 
 
 
 \section{Data Description}\label{sec:desc}
 
We base this analysis on the TREC-KBA 2013 dataset%
 
\footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}}
 
that consists of three main parts: a time-stamped stream corpus, a set of
 
KB entities to be curated, and a set of relevance judgments. A CCR
 
system now has to identify for each KB entity which documents in the
 
stream corpus are to be considered by the human curator.
 
 
\subsection{Stream corpus} The stream corpus comes in two versions:
 
raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB
 
@@ -717,28 +717,30 @@ In the overall experimental setup, classification, ranking, and
 
evaluation are kept constant. Following \cite{balog2013multi}
 
settings, we use
 
WEKA's\footnote{\url{http://www.cs.waikato.ac.nz/~ml/weka/}} Classification
 
Random Forest. However, we use a smaller number of features, which we found to be more effective. We determined the effectiveness of the features by running the classification algorithm with our smaller set of feature implementations and with their features; our feature implementations achieved better results. In total we use 13 features, listed below.
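The paper's setup uses WEKA's Random Forest classifier. As a rough illustration only, an analogous setup can be sketched in Python with scikit-learn's \texttt{RandomForestClassifier}; the feature matrix and labels below are random placeholders standing in for the 13 features and the vital/non-vital labels, not the paper's data.

```python
# Hypothetical sketch of the classification setup, using scikit-learn's
# RandomForestClassifier as a stand-in for WEKA's Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 13))            # 200 document-entity pairs, 13 features
y = rng.integers(0, 2, size=200)     # placeholder labels: 1 = vital, 0 = not

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # per-pair scores, usable for ranking
print(scores.shape)
```

The per-class probabilities from the forest can then serve as ranking scores in the later stages of the pipeline.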
 
  
 
\paragraph*{Google's Cross Lingual Dictionary (GCLD)}
 
 
This is a mapping of strings to Wikipedia concepts and vice versa
 
\cite{spitkovsky2012cross}. 
 
The GCLD corpus estimates two probabilities:
 
(1) the probability with which a string is used as anchor text to
 
a Wikipedia entity 
 
%thus distributing the probability mass over the different entities that it is used as anchor text;
 
and (2) the 
 
probability that indicates the strength of co-reference of an anchor with respect to other anchors to  a given Wikipedia entity.  We use the product of both for each string.
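The combination of the two GCLD probabilities into a single feature value can be sketched as follows; the lookup table below is purely illustrative, not real GCLD data, and the key/field names are our own.

```python
# Hypothetical sketch of the GCLD feature: for each (string, entity) pair the
# corpus provides two probabilities; the feature is their product.
# The dictionary entries are made-up illustrative values.
gcld = {
    ("obama", "Barack_Obama"): {"p_anchor": 0.62, "p_coref": 0.91},
    ("barack obama", "Barack_Obama"): {"p_anchor": 0.97, "p_coref": 0.99},
}

def gcld_feature(string, entity):
    entry = gcld.get((string.lower(), entity))
    if entry is None:
        return 0.0  # string never observed as an anchor for this entity
    return entry["p_anchor"] * entry["p_coref"]

print(gcld_feature("Obama", "Barack_Obama"))  # product of the two probabilities
```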
 
 
\paragraph*{jac} 
 
  Jaccard similarity between the document and the entity's Wikipedia page
 
\paragraph*{cos} 
 
  Cosine similarity between the document and the entity's Wikipedia page
 
\paragraph*{kl} 
 
  KL-divergence between the document and the entity's Wikipedia page
 
  
 
  \paragraph*{PPR}
 
For each entity, we computed a PPR score from
 
a Wikipedia snapshot  and we kept the top 100  entities along
 
with the corresponding scores.
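The three textual similarity features above (jac, cos, kl) can be sketched over bag-of-words representations as follows; the tokenization and smoothing choices here are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of the jac, cos, and kl features between a document
# and an entity's Wikipedia page, both represented as token lists.
import math
from collections import Counter

def jaccard(doc_tokens, page_tokens):
    a, b = set(doc_tokens), set(page_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(doc_tokens, page_tokens):
    da, db = Counter(doc_tokens), Counter(page_tokens)
    dot = sum(da[t] * db[t] for t in da)
    na = math.sqrt(sum(v * v for v in da.values()))
    nb = math.sqrt(sum(v * v for v in db.values()))
    return dot / (na * nb) if na and nb else 0.0

def kl_divergence(doc_tokens, page_tokens, eps=1e-9):
    # KL(doc || page) over the shared vocabulary, with epsilon smoothing
    da, db = Counter(doc_tokens), Counter(page_tokens)
    na, nb = sum(da.values()), sum(db.values())
    kl = 0.0
    for t in set(da) | set(db):
        p = max(da[t] / na if na else 0.0, eps)
        q = max(db[t] / nb if nb else 0.0, eps)
        kl += p * math.log(p / q)
    return kl

doc = "obama visits berlin today".split()
page = "barack obama is the president".split()
print(jaccard(doc, page), cosine(doc, page), kl_divergence(doc, page))
```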
 
@@ -932,116 +934,116 @@ In vital-relevant category (Table \ref{tab:class-vital-relevant}), the performan
 
% \end{table*}
 
 
 
 
 
   
 
  
 
\section{Analysis and Discussion}\label{sec:analysis}
 
 
 
We conducted experiments to study the impact on recall of the different components of the filtering stage of the entity-based filtering and ranking pipeline. Specifically, we studied the impact of cleansing, entity profiles, relevance ratings, and document categories. We also measured the impact of these different factors and choices on the later stages of our own pipeline.
 
 
Experimental results show that cleansing can remove all or part of the content of documents, making them difficult to retrieve. These documents can, however, be retrieved from the raw version. The use of the raw corpus brings in documents that cannot be retrieved from the cleansed corpus. This holds for all entity profiles and all entity types. The recall difference between the cleansed and raw corpus ranges from 6.8\% to 26.2\%. In actual document-entity pairs, these increases amount to thousands, which we consider substantial. However, the recall increases do not always translate into an improved overall F-score. In the vital relevance ranking for both Wikipedia and aggregate entities, the cleansed version performs better than the raw version. For Twitter entities, the raw corpus performs better except in the case of all name-variant, though the difference is negligible. For vital-relevant, however, the raw corpus performs better across all entity profiles and entity types except for partial canonical names of Wikipedia entities.
 

	
 
The use of different profiles also shows a big difference in recall. While for Wikipedia entities canonical partial performs better than name-variant, there is a steady increase in recall from canonical, to canonical partial, to name-variant, and to name-variant partial. This pattern is also observed across the document categories. However, here too, the relationship between the recall gained by moving from a less rich profile to a richer one and the overall performance as measured by F-score is not linear.
 

	
 
 
 
%%%%%%%%%%%%
 
 
 
In the vital ranking, across all entity profiles and corpus types, Wikipedia's canonical partial achieves better performance than any other Wikipedia entity profile. For vital-relevant documents too, Wikipedia's canonical partial achieves the best result; in the raw corpus, it achieves slightly less than name-variant partial. For Twitter entities, the name-variant partial profile achieves the highest F-score across all entity profiles and corpus types.
 
 
 
There are three interesting observations:

1) Cleansing affects Twitter entities and relevant documents. This is validated by the observation that the recall gains for Twitter entities and the relevant categories in the raw corpus also translate into overall performance gains. This implies that cleansing removes relevant and social documents more than it does vital and news documents. That it removes relevant documents more than vital ones can be explained by the fact that cleansing removes related links and adverts, which may contain a mention of the entities. In one example we saw, cleansing removed an image with text containing an entity name, and the document was actually relevant. That it removes social documents can be explained by the fact that most of the documents missing from the cleansed corpus are social, and all of the documents missing from the raw corpus are social. So in both cases, social documents seem to suffer from the text transformation and cleansing processes.
 
 
%%%% NEEDS WORK:
 
 
Taking both performances (recall at filtering and overall F-score during evaluation) into account, there is a clear trade-off between using a richer entity profile and retrieving irrelevant documents. The richer the profile, the more relevant documents it retrieves, but also the more irrelevant ones. To put this into perspective, let us compare the number of documents retrieved with canonical partial and with name-variant partial. Using the raw corpus, the former retrieves a total of 2547487 documents and achieves a recall of 72.2\%, while the latter retrieves a total of 4735318 documents and achieves a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18\%. The rest of the documents, that is 67.9\%, are newly introduced irrelevant documents.
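The headline trade-off figures quoted above can be checked with a few lines of arithmetic:

```python
# Quick check of the trade-off numbers (raw corpus, canonical partial
# vs. name-variant partial), using the document counts quoted in the text.
canonical_partial_docs = 2_547_487
name_variant_partial_docs = 4_735_318

extra_docs = name_variant_partial_docs - canonical_partial_docs
increase_pct = round(100 * extra_docs / canonical_partial_docs, 1)
recall_gain = round(90.2 - 72.2, 1)

print(increase_pct)  # 85.9 (% more documents extracted)
print(recall_gain)   # 18.0 (points of recall gained)
```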
 

	
 
Perhaps surprisingly, Wikipedia's canonical partial is the best entity profile for Wikipedia entities. Here, the retrieval of thousands of vital-relevant document-entity pairs by name-variant partial does not materialize into an increase in overall performance. Notice that none of the participants in TREC KBA considered canonical partial a viable strategy, though. We conclude that, at least for our system, the remainder of the pipeline needs a different approach to handle the correct scoring of the additional documents -- documents that are necessary if we do not want to accept a low recall in the filtering step.
 
%With this understanding, there  is actually no
 
%need to go and fetch different names variants from DBpedia, a saving
 
%of time and computational resources.
 
 
 
 
%%%%%%%%%%%%
 
 
 
 
 
The deltas between entity profiles, relevance ratings, and document categories reveal four differences between Wikipedia and Twitter entities. 1) For Wikipedia entities, the difference between canonical partial and canonical is higher (16.1\%) than between name-variant partial and name-variant (8.3\%). This can be explained by saturation: documents have already been extracted by the name-variants, so using their partials does not bring in many new relevant documents. 2) Twitter entities are mentioned by name-variant or name-variant partial, as seen in the high recall they achieve compared to the low recall achieved by canonical names (or their partials). This indicates that documents (especially news and others) almost never use user names to refer to Twitter entities; name-variant partials are the best entity profiles for Twitter entities. 3) Comparatively speaking, however, social documents refer to Twitter entities by their user names more often than news and other documents do, suggesting a difference in adherence to naming standards. 4) Wikipedia entities achieve higher recall and higher overall performance.
 
 
The high recall and subsequently higher overall performance of Wikipedia entities can be due to two reasons. 1) Wikipedia entities are better described than Twitter entities. The fact that we can retrieve different name variants from DBpedia is a sign of this relatively rich description, and rich description plays a role in both filtering and the computation of features such as similarity measures in later stages of the pipeline. By contrast, we have only two names for Twitter entities: their user names and their display names, which we collect from their Twitter pages. 2) There is no DBpedia-like resource for Twitter entities from which alternative names can be collected.
 