Changeset - e9656e726fbf
Gebrekirstos Gebremeskel - 2014-06-12 04:24:01
destinycome@gmail.com
updated
1 file changed with 137 insertions and 135 deletions:
mypaper-final.tex
 
@@ -27,13 +27,13 @@
 
% For tracking purposes - this is V3.1SP - APRIL 2009
 
 
\documentclass{acm_proc_article-sp}
 
\usepackage{booktabs}
 
\usepackage{multirow}
 
\usepackage{todonotes}
 
\usepackage{url}
 
 
 
\begin{document}
 
 
\title{Entity-Centric Stream Filtering and Ranking: Filtering and Unfilterable Documents 
 
}
 
%SUGGESTION:
 
@@ -62,13 +62,13 @@
 
% the seventh etc. author(s) as the argument for the
 
% \additionalauthors command.
 
% These 'additional authors' will be output/set for you
 
% without further effort on your part as the last section in
 
% the body of your article BEFORE References or any Appendices.
 
 
\numberofauthors{2} %  in this sample file, there are a *total*
 
\numberofauthors{8} %  in this sample file, there are a *total*
 
% of EIGHT authors. SIX appear on the 'first-page' (for formatting
 
% reasons) and the remaining two appear in the \additionalauthors section.
 
%
 
% \author{
 
% % You can go ahead and credit any number of authors here,
 
% % e.g. one 'row of three' or two rows (consisting of one row of three
 
@@ -128,13 +128,13 @@ types of entities of interest, document type, and the grade of
 
relevance of the document-entity pair under consideration.
 
We analyze how these factors (and the design choices made in their
 
corresponding system components) affect filtering performance.
 
We identify and characterize the relevant documents that do not pass
 
the filtering stage by examining their contents. This way, we
 
estimate a practical upper-bound of recall for entity-centric stream
 
filtering.
 
 
 
\end{abstract}
 
% A category with the (minimum) three required fields
 
\category{H.4}{Information Filtering}{Miscellaneous}
 
 
%A category including the fourth, optional field follows...
 
@@ -212,37 +212,37 @@ of recall on entity-based filtering.
 
 
The rest of the paper is organized as follows: 
 
 
\textbf{TODO!!}
 
 
 \section{Data Description}
 
We base this analysis on the TREC-KBA 2013 dataset%
 
\footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}}
 
that consists of three main parts: a time-stamped stream corpus, a set of
 
KB entities to be curated, and a set of relevance judgments. A CCR
 
system now has to identify for each KB entity which documents in the
 
stream corpus are to be considered by the human curator.
 

	
 
\subsection{Stream corpus} The stream corpus comes in two versions:
 
raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB
 
respectively,  after xz-compression and GPG encryption. The raw data
 
is a  dump of  raw HTML pages. The cleansed version is the raw data
 
after its HTML tags are stripped off and only English documents
 
identified with the Chromium Compact Language Detector
 
\footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}
 
are included.  The stream corpus is organized in hourly folders each
 
of which contains many  chunk files. Each chunk file contains between
 
hundreds and hundreds of thousands of serialized  thrift objects. One
 
thrift object is one document. A document could be a blog article, a
 
news article, or a social media post (including tweets).  The stream
 
corpus comes from three sources: TREC KBA 2012 (social, news and
 
linking) \footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}},
 
arXiv\footnote{\url{http://arxiv.org/}}, and
 
spinn3r\footnote{\url{http://spinn3r.com/}}.
 
Table \ref{tab:streams} shows the sources, the number of hourly
 
directories, and the number of chunk files.
 
 
\begin{table}
 
\caption{Number of documents and chunk files per sub-stream}
 
\begin{center}
 
 
 \begin{tabular}{l*{4}{l}l}
 
 documents     &   chunk files    &    Sub-stream \\
 
@@ -385,22 +385,22 @@ relevance assessments associated to them. For this purpose, we
 
extracted those documents from the big corpus. We experiment with all
 
KB entities. For each KB entity, we extract different name variants
 
from DBpedia and Twitter.
 
\
 
 
\subsection{Entity Profiling}
 
We build entity profiles for the KB entities of interest. We have two
 
types: Twitter and Wikipedia. Both types of entities were deliberately
selected by the track organisers to occur only sparsely and to be less well documented.
 
For the Wikipedia entities, we fetch different name variants
 
from DBpedia: name, label, birth name, alternative names,
 
redirects, nickname, or alias. 
 
These extraction results are summarized in Table
 
\ref{tab:sources}.
 
For the Twitter entities, we visit
 
their respective Twitter pages and fetch their display names. 
 
 
\begin{table}
 
\caption{Number of different DBpedia name variants}
 
\begin{center}
 
 
 \begin{tabular}{l*{4}{c}l}
 
 Name variant& No. of strings  \\
 
@@ -417,35 +417,35 @@ Redirect  &49 \\
 
\end{tabular}
 
\end{center}
 
\label{tab:sources}
 
\end{table}
 
 
 
The collection contains a total number of 121 Wikipedia entities.
 
Every entity has a corresponding DBpedia label.  Only 82 entities have
 
a name string and only 49 entities have redirect strings. (Most of the
 
entities have only one string, except for a few cases with multiple
 
redirect strings; Buddy\_MacKay has the highest number of redirect strings, 12.) 
 

	
 
We combine the different name variants we extracted to form a set of
 
strings for each KB entity. For Twitter entities, we used the display
 
names that we collected. We consider the names of the entities that
 
are part of the URL as canonical. For example, in the entity\\
 
\url{http://en.wikipedia.org/wiki/Benjamin_Bronfman}\\
 
Benjamin Bronfman is a canonical name of the entity. 
 
An example is given in Table \ref{tab:profile}.
 

	
 
From the combined name variants and
 
the canonical names, we  created four sets of profiles for each
 
entity: canonical (cano), canonical partial (cano-part), all name
variants combined (all), and partial names of all name
variants (all-part). We refer to the last two profiles as name-variant
and name-variant partial. The names in parentheses are used in table
 
captions.
 

	
 
 
 
 
\begin{table*}
 
\caption{Example entity profiles (upper part Wikipedia, lower part Twitter)}
 
\begin{center}
 
\begin{tabular}{l*{3}{c}}
 
 &Wikipedia&Twitter \\
 
@@ -497,17 +497,17 @@ The annotation set is a combination of the annotations from before the Training
 
 
 
 
 
 
 
%Most (more than 80\%) of the annotation documents are in the test set.
 
The 2013 training and test data contain 68405
 
annotations, of which 50688 are unique document-entity pairs.   Out of
 
these, 24162 unique document-entity pairs are vital (9521) or relevant
 
(17424).
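
Since the vital and relevant counts sum to more than the number of unique pairs, some pairs must carry both labels. Assuming that both counts refer to unique document-entity pairs drawn from the same 24162 pairs, the size of the overlap follows from inclusion-exclusion:
\[
|V \cap R| = |V| + |R| - |V \cup R| = 9521 + 17424 - 24162 = 2783,
\]
where $V$ and $R$ (notation introduced here only for illustration) denote the sets of pairs judged vital and relevant, respectively.
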
 
 
 
 
 
 
\section{Experiments and Results}
 
 We conducted experiments to study  the effect of cleansing, different entity profiles, types of entities, category of documents, relevance ranks (vital or relevant), and the impact on classification.  In the following subsections, we present the results in different categories, and describe them.
 
 
 
@@ -614,41 +614,41 @@ If we look at the recall performances for the raw corpus,   filtering documents
 
 
 
 
 
 
 
 
 
 \subsection{ Relevance Rating: vital and relevant}
 
 
 
When comparing recall for vital and relevant, we observe that
 
canonical names are more effective for vital than for relevant
 
entities, in particular for the Wikipedia entities. 
 
%For example, the recall for news is 80.1 and for social is 76, while the corresponding recall in relevant is 75.6 and 63.2 respectively.
 
We conclude that the most relevant documents mention the
 
entities by their common name variants.
 
 
%  \subsection{Difference by document categories}
 
%  
 
 
 
%  Generally, there is greater variation in relevant rank than in vital. This is specially true in most of the Delta's for Wikipedia. This  maybe be explained by news items referring to  vital documents by a some standard name than documents that are relevant. Twitter entities show greater deltas than Wikipedia entities in both vital and relevant. The greater variation can be explained by the fact that the canonical name of Twitter entities retrieves very few documents. The deltas that involve canonical names of Twitter entities, thus, show greater deltas.  
 
%  
 
 
% If we look in recall performances, In Wikipedia entities, the order seems to be others, news and social. This means that others achieve a higher recall than news than social.  However, in Twitter entities, it does not show such a strict pattern. In all, entities also, we also see almost the same pattern of other, news and social. 
 
 
 
 
  
 
\subsection{Recall across document categories: others, news and social}
 
The recall for Wikipedia entities in Table \ref{tab:name} ranged from
 
61.8\% (canonicals) to 77.9\% (name-variants).  Table
 
\ref{tab:source-delta} shows how recall is distributed across document
 
categories. For Wikipedia entities, across all entity profiles, the others
category has the highest recall, followed by news and then by social.  While the
recall for news ranges from 76.4\% to 98.4\%, the recall for social
documents ranges from 65.7\% to 86.8\%. For Twitter entities, however,
the pattern is different: with canonicals (and their partials), social
documents achieve higher recall than news.
 
 
%This indicates that social documents refer to Twitter entities by their canonical names (user names) more than news do. In name- variant partial, news achieve better results than social. The difference in recall between canonicals and name-variants show that news do not refer to Twitter entities by their user names, they refer to them by their display names.
 
Overall, across all entity types and all entity profiles, documents
 
 
in the others category achieve a higher recall than news, and news documents, in turn, achieve higher recall than social documents. 
 
 
% This suggests that social documents are the hardest  to retrieve.  This  makes sense since social posts such as tweets and blogs are short and are more likely to point to other resources, or use short informal names.
 
 
 
%%NOTE TABLE REMOVED:\\\\
 
@@ -669,65 +669,65 @@ in the others category achieve a higher recall than news, and news documents, in
 
% of all name variants and partials of canonicals (93\%). delta. Both
 
% of them are for news category.  For Wikipedia entities, the highest
 
% delta observed is 19.5\% in cano\_part - cano followed by 17.5\% in
 
% all\_part in relevant. 
 
  
 
  \subsection{Entity Types: Wikipedia and Twitter}
 
Table \ref{tab:name} summarizes the differences between Wikipedia and
 
Twitter entities.  Wikipedia entities' canonical representation
 
achieves a recall of 70\%, while canonical partial achieves a recall of 86.1\%. This is an
 
increase in recall of 16.1\%. By contrast, the increase in recall of
 
name-variant partial over name-variant is 8.3\%.
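
Both increases are absolute differences in recall, that is, percentage points rather than relative gains. For the canonical profiles of Wikipedia entities, for instance (the symbol $\Delta_{\mathrm{cano}}$ is introduced here only for illustration):
\[
\Delta_{\mathrm{cano}} = 86.1\% - 70.0\% = 16.1\%.
\]
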
 
%This high increase in recall when moving from canonical names to their
 
%partial names, in comparison to the lower increase when moving from
 
%all name variants to their partial names can be explained by
 
%saturation: documents have already been extracted by the different
 
%name variants and thus using their partial names do not bring in many
 
%new relevant documents.
 
For Wikipedia entities, canonical
 
partial achieves better recall than name-variant in both the cleansed and
 
the raw corpus.  %In the raw extraction, the difference is about 3.7.
 
For Twitter entities, recall of canonical matching is very low.%
\footnote{Canonical
and canonical partial are the same for Twitter entities because they
are one-word strings. For example, in \url{https://twitter.com/roryscovel},
``roryscovel'' is the canonical name and its partial is identical.}
 
%The low recall is because the canonical names of Twitter entities are
 
%not really names; they are usually arbitrarily created user names. It
 
%shows that  documents  refer to them by their display names, rarely
 
%by their user name, which is reflected in the name-variant recall
 
%(67.9\%). The use of name-variant partial increases the recall to
 
%88.2\%.
 
 
 
 
Tables \ref{tab:name} and \ref{tab:source-delta} show a higher recall
 
for Wikipedia than for Twitter entities. Generally, at both
 
aggregate and document category levels, we observe that recall
 
increases as we move from canonicals to canonical partial, to
 
name-variant, and to name-variant partial. The only case where this
 
does not hold is in the transition from Wikipedia's canonical partial
 
to name-variant. At the aggregate level (as can be inferred from Table
 
\ref{tab:name}), the difference in performance between  canonical  and
 
name-variant partial is 31.9\% on all entities, 20.7\% on Wikipedia
 
entities, and 79.5\% on Twitter entities. 
 

	
 
Section \ref{sec:analysis} discusses the most plausible explanations for these findings.
 
 
%% TODO: PERHAPS SUMMARY OF DISCUSSION HERE
 

	
 
 
\section{Impact on classification}
 
In the overall experimental setup, classification, ranking, and
 
evaluation are kept constant. Following the settings of \cite{balog2013multi},
we use the Random Forest classifier of
WEKA\footnote{\url{http://www.cs.waikato.ac.nz/~ml/weka/}}.
However, we use a smaller set of features, which we
found to be more effective. We determined the effectiveness of the
features by running the classification algorithm once with our smaller
feature set and once with their features. Our feature
implementations achieved better results. In total we use
13 features, which are listed below.
 
 
  
 
\paragraph*{Google's Cross Lingual Dictionary (GCLD)}
 
 
This is a mapping of strings to Wikipedia concepts and vice versa
 
\cite{spitkovsky2012cross}. 
 
(1) the probability with which a string is used as anchor text to
 
@@ -958,13 +958,15 @@ There is a trade-off between using a richer entity-profile and retrieval of irre
 
%%%%%%%%%%%%
 
 
 
In vital ranking, across all entity profiles and types of corpus, Wikipedia's canonical partial achieves better performance than any other Wikipedia entity profile. For vital-relevant documents too, Wikipedia's canonical partial achieves the best result, except in the raw corpus, where it achieves a little less than name-variant partial. For Twitter entities, the name-variant partial profile achieves the highest F-score across all entity profiles and types of corpus.  
 
 
 
There are three interesting observations: 

1) Cleansing impacts Twitter
 
entities and relevant documents.  This  is validated by the
 
observation that recall  gains in Twitter entities and the relevant
 
categories in the raw corpus also translate into overall performance
 
gains. This observation implies that cleansing removes more relevant and
social documents than it does vital and news documents. That it removes relevant
 
documents more than vital can be explained by the fact that cleansing
 
@@ -976,13 +978,13 @@ most of the documents missing from the cleansed corpus are
social, and all the documents that are missing from the raw corpus are
social. So in both cases social documents seem to suffer from text
transformation and cleansing processes. 
 
 
%%%% NEEDS WORK:
 
 
2) Taking both performance measures (recall at filtering and overall F-score
during evaluation) into account, there is a clear trade-off between using a richer entity profile and retrieving irrelevant documents. The richer the profile, the more relevant documents it retrieves, but also the more irrelevant documents. To put this into perspective, let us compare the number of documents retrieved with canonical partial and with name-variant partial. Using the raw corpus, the former retrieves a total of 2547487 documents and achieves a recall of 72.2\%. By contrast, the latter retrieves a total of 4735318 documents and achieves a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18\%. The rest of the documents, that is 67.9\%, are newly introduced irrelevant documents. 
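
The figures in this comparison can be reproduced as follows. The last step follows the text's own reckoning: it subtracts a recall gain in percentage points from a relative growth in the number of retrieved documents, so it is best read as a rough indication rather than an exact share:
\[
\frac{4735318 - 2547487}{2547487} = \frac{2187831}{2547487} \approx 0.859, \qquad
90.2\% - 72.2\% = 18.0\%, \qquad
85.9\% - 18.0\% = 67.9\%.
\]
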
 
 
Wikipedia's canonical partial is the best entity profile for Wikipedia entities. It is interesting to see that the retrieval of thousands of vital-relevant document-entity pairs by name-variant partial does not translate into an increase in overall performance. It is even more interesting since, to the best of our knowledge, canonical partial was not considered a contending profile for stream filtering by any of the participants. With this understanding, there is actually no need to fetch different name variants from DBpedia, which saves time and computational resources.
 
 
 
%%%%%%%%%%%%