Changeset - 4d481f8d3ab8
Gebrekirstos Gebremeskel - 2014-06-10 14:05:56
destinycome@gmail.com
updated
1 file changed with 23 insertions and 28 deletions:
mypaper-final.tex
 
@@ -139,13 +139,13 @@ documents (news, blog, tweets) can influence filtering.
 
 
 
 The rest of the paper is organized as follows: 
 
 
 
 
 
 
 
 
 \section{Data and Task description}
 
 \section{Data and Problem description}
 
We use the TREC KBA-CCR-2013 dataset \footnote{http://trec-kba.org/trec-kba-2013.shtml} and problem setting. The dataset consists of a time-stamped stream corpus, a set of KB entities, and a set of relevance judgments. 
 
\subsection{Stream corpus} The stream corpus comes in two versions: raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB respectively, after xz-compression and GPG encryption. The raw data is a dump of raw HTML pages. The cleansed version is the raw data after its HTML tags are stripped off and non-English documents removed. The stream corpus is organized into hourly folders, each of which contains many chunk files. Each chunk file contains from a few hundred up to hundreds of thousands of serialized thrift objects. One thrift object is one document. A document can be a blog article, a news article, or a social media post (including tweets). The stream corpus comes from three sources: TREC KBA 2012 (social, news and linking) \footnote{http://trec-kba.org/kba-stream-corpus-2013.shtml}, arxiv\footnote{http://arxiv.org/}, and spinn3r\footnote{http://spinn3r.com/}. Table \ref{tab:streams} shows the sources, the number of hourly directories, and the number of chunk files. 
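For concreteness, the sketch below shows how one might iterate over a single chunk file with the Python \texttt{streamcorpus} tools that accompany the corpus. This is a minimal sketch, not our pipeline: the chunk path is hypothetical and the file is assumed to be already GPG-decrypted and xz-decompressed.

\begin{verbatim}
# Minimal sketch using the Python `streamcorpus` package distributed with
# the KBA corpus; the chunk path is hypothetical and the file is assumed
# to be already GPG-decrypted and xz-decompressed.
from streamcorpus import Chunk

def iter_documents(chunk_path):
    # A chunk file holds many serialized thrift objects;
    # one thrift object is one document (a StreamItem).
    for item in Chunk(chunk_path):
        if item.body is None:
            continue
        # clean_visible is the tag-stripped text (cleansed version);
        # raw holds the original HTML dump.
        text = item.body.clean_visible or item.body.raw
        yield item.stream_id, item.source, text

for stream_id, source, text in iter_documents('news-chunk.sc'):
    pass  # hand the document to the filtering step
\end{verbatim}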
 
 
\begin{table*}
 
\caption{Stream corpus sources: number of hourly directories and chunk files per source}
 
\begin{center}
 
@@ -176,25 +176,28 @@ We use TREC KBA-CCR-2013 dataset \footnote{http://http://trec-kba.org/trec-kba-2
 
\subsection{Relevance judgments}
 
 
TREC-KBA provided relevance judgments for training and testing. Relevance judgments are given to document-entity pairs. Documents with citation-worthy content for a given entity are annotated as \emph{vital}, while documents with tangentially relevant content, or documents that lack freshness but whose content can be useful for an initial KB dossier, are annotated as \emph{relevant}. Documents with no relevant content are labeled \emph{neutral} and spam is labeled \emph{garbage}. The inter-annotator agreement on vital was 70\% in 2012 and 76\% in 2013. This is due to the more refined definition of vital and the distinction made between vital and relevant. 
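For illustration only, a relevance judgment can be viewed as a labeled document-entity pair; the record below is a simplification and not the exact TREC-KBA file format.

\begin{verbatim}
# Illustrative simplification of a relevance judgment: a labeled
# document-entity pair (the real TREC-KBA files carry more fields).
from collections import namedtuple

Judgment = namedtuple('Judgment', ['stream_id', 'entity', 'label'])
LABELS = ('vital', 'relevant', 'neutral', 'garbage')

def is_target(judgment, include_relevant=True):
    # The two target sets used later: "vital" and "vital or relevant".
    targets = ('vital', 'relevant') if include_relevant else ('vital',)
    return judgment.label in targets
\end{verbatim}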
 
 
 
 
 \subsection{Research questions}
 
 Given a stream of documents (news items, blogs and social media) on one hand and KB entities (Wikipedia, Twitter) on the other, we study the cleansing step, entity-profile construction, the category of the stream items, the type of entities (Wikipedia or Twitter), and the impact on classification. In particular, we strive to answer the following questions:
 
 \subsection{Problem description}
 
 Given a stream of documents (news items, blogs and social media) on one hand and KB entities (Wikipedia, Twitter) on the other, we study the factors and choices that affect filtering performance. Specifically, we conduct an in-depth analysis of the cleansing step, the entity-profile construction, the document category of the stream items, and the type of entities (Wikipedia or Twitter). We also study the impact of these choices on classification performance. Finally, we manually examine the relevant documents that defy filtering. We strive to answer the following research questions:
 
 
 
 \begin{enumerate}
 
 \item Does cleansing affect filtering and subsequent performance?
 
 \item What is the most effective way of representing an entity profile?
 
  \item Is filtering different for Wikipedia and Twitter entities?
 
 \item Are some types of documents easier to filter than others?
 
  \item Does a gain in recall at filtering step translate to a gain in F-measure at the end of the pipeline?
 
 \item What are the vital (relevant) documents that are not retrievable by a system?
 
  \item Are there vital (relevant) documents that are not filterable by a reasonable system?
 
\end{enumerate}
 
 
\subsection{Evaluation}
 
We adopt the evaluation measures of the TREC filtering track: results are reported in F-measure and Scaled Utility (SU). 
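For reference, with $TP$, $FP$ and $FN$ counting true-positive, false-positive and false-negative document-entity pairs, the precision, recall and (harmonic-mean) F-measure reported in the result tables are
\begin{equation}
P = \frac{TP}{TP+FP}, \qquad R = \frac{TP}{TP+FN}, \qquad F = \frac{2PR}{P+R}.
\end{equation}
SU is the scaled utility of the TREC filtering track, a normalized linear utility that rewards retrieved relevant documents and penalizes retrieved non-relevant ones.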
 
 
\subsection{Literature Review}
 
There has been a great deal of interest of late in entity-based filtering and ranking. One manifestation of this is the introduction of TREC KBA in 2012. Following that, a number of research works have been done on the topic \cite{frank2012building, ceccarelli2013learning, taneva2013gem, wang2013bit, balog2013multi}. These works are based on the KBA 2012 task and dataset and address the whole problem of entity filtering and ranking. TREC KBA continued in 2013, but the task underwent some changes. The main changes between 2012 and 2013 are in the number of entities, the type of entities, the corpus and the relevance rankings.
 
 
The number of entities increased from 29 to 141, and now includes 20 Twitter entities. The TREC KBA 2012 corpus was 1.9TB after xz-compression and had 400M documents. By contrast, the KBA 2013 corpus is 6.45TB after xz-compression and GPG encryption. A version with all non-English documents removed is 4.5TB and consists of 1 billion documents. The 2013 corpus subsumes the 2012 corpus and adds other sources from spinn3r, namely mainstream news, forum, arxiv, classifieds, reviews and memetracker. A more important difference, however, is that the definition of the relevance ranking changed, specifically the definitions of vital and relevant. While in KBA 2012 a document was judged vital if it had citation-worthy content, in 2013 it must also have freshness, that is, the content must trigger an editing of the KB entry. 
 
 
While the tasks of 2012 and 2013 are fundamentally the same, the approaches varied due to the size of the corpus. In 2013, all participants used filtering to reduce the size of the big corpus. They used different ways of filtering: many of them used two or more different name variants from DBpedia, such as labels, names, redirects, birth names, aliases, nicknames, same-as and alternative names \cite{wang2013bit, dietzumass, liu2013related, zhangpris}. Although all of the participants used DBpedia name variants, none of them used all of them. A few other participants used bold words in the first paragraph of the Wikipedia entities' profiles and anchor texts from other Wikipedia pages \cite{bouvierfiltering, niauniversity}. Very few participants used a Boolean AND built from the tokens of the canonical names \cite{illiotrec2013}.  
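To make these profile types concrete, the sketch below shows the kind of matching they imply: a profile is a set of name-variant strings (for example DBpedia labels, redirects and aliases), optionally reduced to a Boolean AND over the tokens of the canonical name. The example profile is hypothetical and the DBpedia lookup that would populate it is omitted.

\begin{verbatim}
# Sketch of name-variant matching; the profile below is hypothetical
# and the DBpedia lookup that would populate it is omitted.
import re

def tokens(text):
    return set(re.findall(r'\w+', text.lower()))

def matches_any_variant(doc_text, variants):
    # "name variants" style profile: any full variant string occurs.
    low = doc_text.lower()
    return any(v.lower() in low for v in variants)

def matches_canonical_and(doc_text, canonical_name):
    # Boolean AND profile: every token of the canonical name occurs.
    return tokens(canonical_name) <= tokens(doc_text)

profile = {'canonical': 'John Doe Smith',   # hypothetical entity
           'variants': {'John Doe Smith', 'J. D. Smith', 'Johnny Smith'}}
doc = 'Yesterday Johnny Smith, also known as John Doe Smith, said ...'
print(matches_any_variant(doc, profile['variants']))      # True
print(matches_canonical_and(doc, profile['canonical']))   # True
\end{verbatim}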
 
@@ -435,20 +438,20 @@ When we talk at an aggregate-level (both Twitter and Wikipedia entities), we obs
 
 
Tables \ref{tab:name} and \ref{tab:source-delta} show that recall for Wikipedia entities is higher than for Twitter entities. This indicates that Wikipedia entities are easier to match in documents than Twitter entities. This can be due to two reasons: 1) Wikipedia entities are better described than Twitter entities. The fact that we can retrieve different name variants from DBpedia is itself a sign of this richer description. By contrast, we have only two names for Twitter entities: their user names and their display names, which we collect from their Twitter pages. 2) Twitter entities are more obscure, which is presumably why they are not in Wikipedia in the first place. Another point is that Twitter entities are mentioned by their display names more than by their user names. We also observed that social documents mention Twitter entities by their user names more than news documents do, suggesting a difference in convention between news and social documents. 
 
 
 
 
   \subsection{Impact on classification}
 
 In the overall experimental setup, the goal is to keep the classification step constant. Here, we present results showing how filtering affects performance. In tables \ref{tab:class-vital} and \ref{tab:class-vital-relevant}, we show the performances in F-measure and SU. 
 
 In the overall experimental setup, the classification, ranking, and evaluation steps are kept constant. Here, we present results showing how the choices in corpus, entity types, and entity profiles impact these later stages of the pipeline. Tables \ref{tab:class-vital} and \ref{tab:class-vital-relevant} show the performances in F-measure and SU. 
 
\begin{table*}
 
\caption{vital performance under different name variants , upper part from cleansed, lower part from raw}
 
\caption{Vital performance under different name variants (upper part from cleansed, lower part from raw)}
 
\begin{center}
 
\begin{tabular}{ll@{\quad}lllllll}
 
\hline
 
\multicolumn{1}{l}{\rule{0pt}{12pt}}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano partial }&\multicolumn{1}{l}{\rule{0pt}{12pt}all }&\multicolumn{1}{l}{\rule{0pt}{12pt}all\_part}\\[5pt]
 
\hline
 
Cleansed&\multicolumn{1}{l}{\rule{0pt}{12pt}}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano partial }&\multicolumn{1}{l}{\rule{0pt}{12pt}all }&\multicolumn{1}{l}{\rule{0pt}{12pt}all\_part}\\[5pt]
 
 
 
 
   all-entities &F& 0.241&0.261&0.259&0.265\\
 
	      &SU&0.259  &0.258 &0.263 &0.262 \\	
 
   Wikipedia &F&0.252&0.274& 0.265&0.271\\
 
	      &SU& 0.261& 0.259&  0.265&0.264 \\
 
@@ -472,18 +475,18 @@ The tables in \ref{tab:name} and \ref{tab:source-delta} show, recall for Wikiped
 
\end{center}
 
\label{tab:class-vital}
 
\end{table*}
 
  
 
  
 
  \begin{table*}
 
\caption{vital or relevant performances under different name variants , upper part from cleansed, lower part from raw}
 
\caption{Vital-relevant performance under different name variants (upper part from cleansed, lower part from raw)}
 
\begin{center}
 
\begin{tabular}{ll@{\quad}lllllll}
 
\hline
 
\multicolumn{1}{l}{\rule{0pt}{12pt}}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano partial }&\multicolumn{1}{l}{\rule{0pt}{12pt}all }&\multicolumn{1}{l}{\rule{0pt}{12pt}all\_part}\\[5pt]
 
\hline
 
Cleansed&\multicolumn{1}{l}{\rule{0pt}{12pt}}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano}&\multicolumn{1}{l}{\rule{0pt}{12pt}cano partial }&\multicolumn{1}{l}{\rule{0pt}{12pt}all }&\multicolumn{1}{l}{\rule{0pt}{12pt}all\_part}\\[5pt]
 
 
 
 
   all-entities &F& 0.497&0.560&0.579&0.607\\
 
	      &SU&0.468  &0.484 &0.483 &0.492 \\	
 
   Wikipedia &F&0.546&0.618&0.599&0.617\\
 
   &SU&0.494  &0.513 &0.498 &0.508 \\
 
@@ -506,23 +509,18 @@ The tables in \ref{tab:name} and \ref{tab:source-delta} show, recall for Wikiped
 
\end{center}
 
\label{tab:class-vital-relevant}
 
\end{table*}
 
 
 
 
We have looked into the effect of cleansing on filtering. Further, we have looked into the retrieval effectiveness of different entity profiles. We have also looked into whether the source categories have an effect on filtering. 
 
 
On Wikipedia entities, except for the canonical profile, the cleansed version achieves better results than the raw version. However, on Twitter entities, the raw corpus achieves a better score in all profiles (except for the partial names of all name variants). For all entities (both Wikipedia and Twitter), we see that in three profiles the cleansed version achieves better results; only in canonical partial does raw perform better. This result is interesting because we saw in previous sections that the raw corpus achieves a higher recall. In the case of partial names of name variants, for example, 10\% more relevant documents are retrieved. This suggests that a gain in recall does not necessarily mean a gain in F-measure here. One explanation is that the raw corpus brings in many false positives from, among others, related links and adverts.  
 
 
Let's look at cleansed versus raw for Wikipedia entities: the cleansed version gives better results than the raw version in three of the profiles (except canonical). However, for Twitter entities, the raw corpus achieves a better score in all profiles (except the name variants profile). For all entities, we see that in three profiles the cleansed version achieves better results (only in canonical partial does raw perform better). 
 
For Wikipedia entities, canonical partial names seem to achieve the highest performance. For Twitter, the partial names of name variants achieve better results. For vital-relevant, raw achieves better results in three cases (except cano partial). For Twitter entities, the raw corpus achieves better results. In terms of entity profiles, Wikipedia's canonical partial names achieve the best F-score. For Twitter, as before, partial names of canonical names perform best. 
 
 
In terms of profiles, Wikipedia's canonical partial names seem to achieve the highest performance. For Twitter, the partial names of name variants achieve  better results. 
 
 
For vital plus relevant:
 
In three cases, raw achieves better results (except in cano partial). For Twitter entities, the raw corpus achieves better results. In terms of entity profiles, Wikipedia's canonical partial names achieve the best F-score. For Twitter, as before, partial names of canonical names perform best. 
 
 
It seems the raw corpus has a larger effect on the performance of Twitter entities. An increase in recall does not necessarily mean an increase in F-measure. 
 
It seems the raw corpus has a larger effect on the performance of Twitter entities. An increase in recall does not necessarily mean an increase in F-measure. The fact that canonical partial names achieve better results is interesting. We know that partial names were used as a baseline in TREC KBA 2012, but none of the KBA participants actually used partial names for filtering.
 
 
 
\subsection{Missing relevant documents \label{miss}}
 
There is a trade-off between using a richer entity profile and retrieving irrelevant documents. The richer the profile, the more relevant documents it retrieves, but also the more irrelevant documents. To put this into perspective, let's compare the number of documents that are retrieved with partial names of name variants and with partial names of canonical names. Using the raw corpus, the partial names of canonical names extract a total of 2547487 documents and achieve a recall of 72.2\%. By contrast, the partial names of name variants extract a total of 4735318 documents and achieve a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18\%. The rest of the increase, that is 67.9\%, consists of newly introduced irrelevant documents. There is an advantage in excluding irrelevant documents from filtering because they confuse the later stages of the pipeline. 
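As a back-of-the-envelope check (not part of the pipeline), the relative increase quoted above follows directly from the two document counts:

\begin{verbatim}
# Back-of-the-envelope check of the trade-off figures quoted above.
cano_partial_docs    = 2547487   # partial names of canonical names (raw)
variant_partial_docs = 4735318   # partial names of name variants (raw)

extra = variant_partial_docs - cano_partial_docs
print(extra)                                        # 2187831 extra documents
print(round(100.0 * extra / cano_partial_docs, 1))  # 85.9 (% more documents)
print(round(100.0 * (0.902 - 0.722), 1))            # 18.0 (recall points gained)
\end{verbatim}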
 
 
 The use of partial names of name variants for filtering is an aggressive attempt to retrieve as many relevant documents as possible at the cost of retrieving irrelevant ones. However, we still miss about 2363 (10\%) of the vital-relevant documents. Why are these documents missed? If they are not mentioned by partial names of name variants, what are they mentioned by? Table \ref{tab:miss} shows the documents that we miss with respect to the cleansed and raw corpus. The upper part shows the number of documents missing from the cleansed and raw versions of the corpus. The lower part of the table shows the intersections and exclusions in each corpus.  
 
@@ -554,12 +552,13 @@ Raw & 276 & 4951 & 5227 \\
 
converting. In both cases the mention of the entity happened to be in the part of the text that is cut out during conversion. 
 
 
 
 
 The interesting set of relevance judgments are those that we miss from both the raw and cleansed extractions. These are 2146 unique document-entity pairs, 219 of which have vital relevance judgments. The missed vital annotations involve 28 Wikipedia and 7 Twitter entities, 35 in total. Looking into document categories shows that the great majority (86.7\%) of the documents are social. This suggests that social documents (tweets and blogs) can talk about the entities without mentioning them by name. This is, of course, in line with intuition. 
 
   
 
   
 
   Vital documents show higher recall than relevant ones. This is not surprising, as vital documents are more likely to mention the entities than relevant ones. Across document categories, we observe a recall pattern of others first, followed by news, and then social. Social documents are the hardest to retrieve. This can be explained by the fact that social documents (tweets, blogs) are more likely to point to a resource without mentioning the entities. By contrast, news documents mention the entities they talk about. 
 
   
 
   
 
%    
 
%    \begin{table*}
 
% \caption{Breakdown of missing documents by sources for cleansed, raw and cleansed-and-raw}
 
% \begin{center}\begin{tabular}{l*{9}r}
 
@@ -651,28 +650,24 @@ We also observed that although docuemnts have different document ids, several of
 
   
 
  
 
\section{Analysis and Discussion}
 
 
 
We conducted experiments to study  the impacts on recall of 
 
different components of the filtering step of the CCR pipeline. Specifically 
 
different components of the filtering step of the entity-based filtering and ranking pipeline. Specifically 
 
we conducted experiments to study the impacts of cleansing, 
 
entity profile, relevance rankings, categories of documents, and documents that are missed.
 
 
Experimental results using the TREC-KBA task show that cleansing removes documents, or parts of documents, making them difficult to retrieve. These documents can otherwise be retrieved from the raw version. The use of the raw corpus brings in documents that cannot be retrieved from the cleansed corpus. This is true for all entity profiles and for all entity types. The recall increase is between 6.8\% and 26.2\%. This increase, in actual document-entity pairs, is in the thousands. 
 
 
The use of different profiles also shows a big difference in percentage recall. Except in the case of Wikipedia, where the use of canonical\_partial achieves better recall than name variants, there is a steady increase in recall from canonical names to partial canonical names, to name variants, and to partial names of name variants. The difference between partial names of name variants and canonical names is 30.8\%, and between partial names of name variants and partial names of canonical names it is 18.0\%.  
 
entity profile, relevance rankings, categories of documents, and documents that are missed. We also measured their impact on classification performance. 
 
 
Does this increase in recall as we move from a leaner profile to a richer profile translate to an increase in classification performance? The results show that it does not. In most profiles, for both Wikipedia and all entities, the cleansed version performs better than the raw version. For Twitter entities, the raw corpus achieves better results except in the case of all name variants. However, the differences in performance are so small that they can be ignored. The highest performance for Wikipedia entities is achieved with partial names of canonical names, rather than partial names of all name variants, which retrieve 18.0\% more documents. 
 
Experimental results using the TREC-KBA problem setting and dataset show that cleansing removes all or part of some documents' content, making them difficult to retrieve. These documents can otherwise be retrieved from the raw version. The use of the raw corpus brings in documents that cannot be retrieved from the cleansed corpus. This is true for all entity profiles and for all entity types. The recall increase is between 6.8\% and 26.2\%. This increase, in actual document-entity pairs, is in the thousands. 
 
 
The use of different profiles also shows a big difference in percentage recall. Except in the case of Wikipedia, where canonical partial names achieve better recall than name variants, there is a steady increase in recall from canonical names to partial canonical names, to name variants, and to partial names of name variants. The difference between partial names of name variants and canonical names is 30.8\%, and between partial names of name variants and partial names of canonical names it is 18.0\%.  
 
 
However, for vital plus relevant, the raw corpus performs better except with partial canonical names. In all cases, Wikipedia's canonical partial names achieve better performance than any other profile. This is interesting because the retrieval of thousands of additional document-entity pairs did not translate to an increase in classification performance. 
 
Does this increase in recall as we move from a leaner profile to a richer profile translate to an increase in classification performance? The results show that it does not. In most profiles, for both Wikipedia and all entities, the cleansed version performs better than the raw version. For Twitter entities, the raw corpus achieves better results except in the case of all name variants. However, the differences in performance are so small that they can be ignored. The highest performance for Wikipedia entities is achieved with partial names of canonical names, rather than partial names of all name variants, which retrieve 18.0\% more documents. 
 
 
One reason why an increase in recall does not translate to an increase in F-measure later is the retrieval of many false positives, which confuse the classifier. A good profile for Wikipedia entities seems to be canonical partial names, suggesting that there is actually no need to go and fetch different name variants.
 
 
For Twitter entities, the use of partial names of their display names is a good entity profile. 
 
However, for vital plus relevant, the raw corpus performs better except with partial canonical names. In all cases, Wikipedia's canonical partial names achieve better performance than any other profile. This is interesting because the retrieval of thousands of additional document-entity pairs did not translate to an increase in classification performance. One reason why an increase in recall does not translate to an increase in F-measure later is the retrieval of many false positives, which confuse the classifier. A good profile for Wikipedia entities seems to be canonical partial names, suggesting that there is actually no need to go and fetch different name variants. For Twitter entities, partial names of their display names are a good entity profile. 
 
  
 
 
 
 
 
\section{Conclusions}