Changeset - 2c1313a39690
Gebrekirstos Gebremeskel - 11 years ago 2014-06-10 20:24:09
destinycome@gmail.com
updated
1 file changed with 22 insertions and 25 deletions:
mypaper-final.tex
 
@@ -132,18 +132,15 @@ documents (news, blog, tweets) can influence filtering.
 
  
 
 
 
 
 
In this paper, we hold the subsequent steps of the pipeline fixed, zoom in on the filtering step, and conduct an in-depth analysis of its main components. In particular, we study cleansing, different entity profiles, the type of entities (Wikipedia or Twitter), and the document categories (social, news, etc.). The main contributions of the paper are an in-depth analysis of the factors that affect entity-based stream filtering, the identification of optimal entity profiles that do not compromise precision, a description and classification of relevant documents that are not amenable to filtering, and an estimate of the upper bound of recall for entity-based filtering.
 
 
 
The rest of the paper is organized as follows:
 
 
 
 
 
 
 
 
\section{Data Description}

We use the TREC-KBA 2013 dataset\footnote{http://trec-kba.org/trec-kba-2013.shtml}. The dataset consists of a time-stamped stream corpus, a set of KB entities, and a set of relevance judgments.
 
\subsection{Stream corpus} The stream corpus comes in two versions: raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB respectively, after xz-compression and GPG encryption. The raw data is a dump of raw HTML pages. The cleansed version is the raw data after the HTML tags are stripped and non-English documents removed. The stream corpus is organized in hourly folders, each of which contains many chunk files. Each chunk file contains between a few hundred and hundreds of thousands of serialized thrift objects; one thrift object is one document. A document can be a blog article, a news article, or a social media post (including tweets). The stream corpus comes from three sources: TREC KBA 2012 (social, news and linking)\footnote{http://trec-kba.org/kba-stream-corpus-2013.shtml}, arxiv\footnote{http://arxiv.org/}, and spinn3r\footnote{http://spinn3r.com/}. Table \ref{tab:streams} shows the sources, the number of hourly directories, and the number of chunk files.
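To make the corpus layout concrete, the following minimal sketch walks the hourly folders and iterates over the serialized thrift documents in each chunk file. It assumes the streamcorpus Python package as the chunk reader, a locally decrypted and decompressed copy of the corpus, and an illustrative corpus path and .sc file suffix; none of these tooling details are specified above.

\begin{verbatim}
import os
from streamcorpus import Chunk  # thrift chunk reader (assumed tooling)

CORPUS_ROOT = "/data/kba-2013"  # hypothetical local path to the corpus

def iter_documents(corpus_root):
    """Yield (hour, stream_item) pairs by walking hourly chunk folders."""
    for hour_dir in sorted(os.listdir(corpus_root)):      # e.g. 2012-10-07-14
        hour_path = os.path.join(corpus_root, hour_dir)
        if not os.path.isdir(hour_path):
            continue
        for chunk_name in sorted(os.listdir(hour_path)):  # many chunks per hour
            if not chunk_name.endswith(".sc"):            # assumed suffix
                continue
            for stream_item in Chunk(os.path.join(hour_path, chunk_name)):
                # one thrift StreamItem is one document (news, blog, social, ...)
                yield hour_dir, stream_item

for hour, doc in iter_documents(CORPUS_ROOT):
    # doc.source gives the document category; doc.body holds raw/cleansed text
    print(hour, doc.stream_id, doc.source)
    break
\end{verbatim}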
 
 
\begin{table*}
 
@@ -179,7 +176,7 @@ TREC-KBA provided relevance judgments for training and testing. Relevance judgme
 
 
 
 
 \subsection{Problem description}
 
 \subsection{Stream Filtering}
 
Given a stream of documents (news items, blogs and social media) on one hand and KB entities (Wikipedia, Twitter) on the other, we study the factors and choices that affect filtering performance. Specifically, we conduct an in-depth analysis of the cleansing step, the entity-profile construction, the document category of the stream items, and the type of entities (Wikipedia or Twitter). We also study the impact of these choices on classification performance. Finally, we conduct a manual examination of the relevant documents that defy filtering. We strive to answer the following research questions:
 
 
 
 \begin{enumerate}
 
@@ -198,11 +195,11 @@ The TREC filtering track
 
\subsection{Literature Review}
 
There has been a great deal of interest in entity-based filtering and ranking recently. One manifestation of that is the introduction of TREC KBA in 2012. Following that, a number of research works have addressed the topic \cite{frank2012building, ceccarelli2013learning, taneva2013gem, wang2013bit, balog2013multi}. These works are based on the KBA 2012 task and dataset and address the whole problem of entity filtering and ranking. TREC KBA continued in 2013, but the task underwent some changes. The main changes between 2012 and 2013 are in the number of entities, the type of entities, the corpus, and the relevance ratings.
 
 
The number of entities increased from 29 to 141, and now includes 20 Twitter entities. The TREC KBA 2012 corpus is 1.9TB after xz-compression and has 400M documents. By contrast, the KBA 2013 corpus is 6.45TB after xz-compression and GPG encryption. A version with all non-English documents removed is 4.5TB and consists of 1 billion documents. The 2013 corpus subsumes the 2012 corpus and adds other sources from spinn3r, namely mainstream news, forum, arxiv, classified, reviews and memetracker. A more important difference, however, is a change in the definitions of the relevance ratings vital and relevant. While in KBA 2012 a document was judged vital if it had citation-worthy content for a given entity, in 2013 it must have freshness, that is, the content must trigger an edit of the given entity's KB entry.
 
 
While the tasks of 2012 and 2013 are fundamentally the same, the approaches varied due to the size of the corpus. In 2013, all participants used filtering to reduce the size of the big corpus. They used different ways of filtering: many of them used two or more different name variants from DBpedia such as labels, names, redirects, birth names, aliases, nicknames, same-as and alternative names \cite{wang2013bit, dietzumass, liu2013related, zhangpris}. Although most of the participants used DBpedia name variants, none of them used all of them. A few other participants used bold words in the first paragraph of the Wikipedia entity's profile and anchor texts from other Wikipedia pages \cite{bouvierfiltering, niauniversity}. One participant used a Boolean \emph{and} built from the tokens of the canonical names \cite{illiotrec2013}.
 
 
All of the studies mentioned used filtering as their first step to generate a smaller set of documents, and many systems suffered from poor recall, which strongly affected overall performance \cite{frank2012building}. Although systems used different entity profiles to filter the stream and achieved different performance levels, there is no study of the factors and choices that affect the filtering step itself. Of course, filtering has been extensively examined in the TREC Filtering track \cite{robertson2002trec}. However, those studies were isolated in the sense that they were intended to optimize recall. What we have here is a different scenario: documents have relevance ratings, so we want to study filtering in connection with relevance to the entities, which can be done by coupling filtering to the later stages of the pipeline. To the best of our knowledge this is new, and the TREC KBA problem setting and datasets offer a good opportunity to examine this aspect of filtering.
 
 
Moreover, there has not been a chance to study filtering at this scale, nor a study of what type of documents defy filtering and why. In this paper, we conduct a manual examination of the documents that are missed and classify them into different categories. We also estimate the general upper bound of recall using the different entity profiles and choose the best profile, the one that results in an increased overall performance as measured by F-measure.
 
 
@@ -211,7 +208,7 @@ We work with the subset of stream corpus documents  for whom there exist  annota
 
 
 
\subsection{Entity Profiling}
 
We build profiles for the KB entities of interest. We have two types of entities: Twitter and Wikipedia. Both are selected, on purpose, to be sparse and less documented. For the Twitter entities, we visit their respective Twitter pages and manually fetch their display names. For the Wikipedia entities, we fetch different name variants from DBpedia, namely name, label, birth name, alternative names, redirects, nickname, and alias. The extraction results are shown in Table \ref{tab:sources}.
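As an illustration, name variants of a Wikipedia entity can be pulled from DBpedia's public SPARQL endpoint. The sketch below uses the SPARQLWrapper package, and the selected properties mirror the fields listed above; the exact property choice and the example URI are assumptions for illustration, not a record of the extraction actually performed.

\begin{verbatim}
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo:  <http://dbpedia.org/ontology/>
SELECT DISTINCT ?variant WHERE {
  { <%(uri)s> rdfs:label ?variant }
  UNION { <%(uri)s> foaf:name ?variant }
  UNION { <%(uri)s> dbo:birthName ?variant }
  UNION { <%(uri)s> dbo:alias ?variant }
  UNION { ?redirect dbo:wikiPageRedirects <%(uri)s> ;
                    rdfs:label ?variant }
  FILTER (lang(?variant) = "en")
}
"""

def name_variants(dbpedia_uri):
    """Return the set of English name variants DBpedia knows for an entity."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(QUERY % {"uri": dbpedia_uri})
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return {row["variant"]["value"] for row in rows}

# Illustrative usage with one of the sparse KB entities mentioned in the paper
print(name_variants("http://dbpedia.org/resource/Benjamin_Bronfman"))
\end{verbatim}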
 
\begin{table*}
 
\caption{Number of retrieved name variants for different sources }
 
\begin{center}
 
@@ -236,7 +233,7 @@ Redirect  &49&96 \\
 
 
We have a total of 121 Wikipedia entities. Not every entity has a value for every name variant. Every entity has a DBpedia label. Only 82 entities have a name string and only 49 entities have redirect strings. Most of the entities have only one string, but some have 2, 3, 4 or 5. One entity, Buddy\_MacKay, has the highest number of redirect strings (12). 6 entities have birth names, only 1 (Chuck Pankow) has a nickname, ``The Peacemaker'', only 1 entity has an alias, and only 4 have alternative names.
 
 
We combined the different name variants we extracted to form a set of strings for each KB entity. Specifically, we merged the names, labels, redirects, birth names, nicknames and alternative names of each entity. For Twitter entities, we used the display names that we collected. We consider the names of the entities that are part of the URL as canonical names. For example, in http://en.wikipedia.org/wiki/Benjamin\_Bronfman, Benjamin Bronfman is the canonical name of the entity. From the combined name variants and the canonical names, we created four profiles for each entity: canonical names (cano), partial names of canonical names (cano\_part), all name variants (all), and partial names of all name variants (all\_part).
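The profile construction just described can be sketched as follows. Splitting on whitespace to obtain partial names, lowercased substring matching against document text, and the variant strings in the usage example are illustrative assumptions rather than the exact procedure used.

\begin{verbatim}
def build_profiles(canonical, variants):
    """Build the four entity profiles used for filtering.

    canonical : name taken from the entity URL, e.g. "Benjamin Bronfman"
    variants  : set of DBpedia name variants / Twitter display names
    """
    def partials(names):
        # split multi-word names into single tokens (assumed tokenization)
        return {tok for name in names for tok in name.split()}

    all_names = set(variants) | {canonical}
    return {
        "cano":      {canonical},
        "cano_part": partials({canonical}),
        "all":       all_names,
        "all_part":  partials(all_names),
    }

def matches(profile, text):
    """Case-insensitive substring filter: does any profile string occur?"""
    text = text.lower()
    return any(name.lower() in text for name in profile)

# Illustrative variant strings, not the actual DBpedia extraction
profiles = build_profiles("Benjamin Bronfman",
                          {"Ben Brewer", "Benjamin Zachary Bronfman"})
print(matches(profiles["cano_part"], "... singer Bronfman appeared ..."))  # True
\end{verbatim}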
 
\subsection{Annotation Corpus}
 
 
The annotation set is a combination of the annotations from before the Training Time Range (TTR) and the Evaluation Time Range (ETR) and consists of 68405 annotations. Its breakdown into training and test sets is shown in Table \ref{tab:breakdown}.
 
@@ -274,12 +271,12 @@ The annotation set is a combination of the annotations from before the Training
 
 
 
 
Most (more than 80\%) of the annotation documents are in the test set. In both the training and test data for 2013, there are 68405 annotations, of which 50688 are unique document-entity pairs. Out of these 50688, 24162 unique document-entity pairs are vital or relevant, of which 9521 are vital and 17424 are relevant.
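For reference, counts of this kind can be derived by collapsing the judgment file to unique document-entity pairs, as in the sketch below. The tab-separated column positions, the numeric rating coding (2 = vital, 1 = relevant), the dedup rule (keep the highest rating per pair), and the file name are all assumptions made for illustration, not the documented format.

\begin{verbatim}
import csv
from collections import defaultdict

# Assumed layout of the judgment file (tab-separated); indices and rating
# coding are illustrative assumptions only.
STREAM_ID_COL, TARGET_COL, RATING_COL = 2, 3, 5
VITAL, RELEVANT = 2, 1

def count_unique_pairs(judgment_path):
    """Collapse annotations to unique (document, entity) pairs, keeping the
    highest rating per pair, and count vital / relevant pairs."""
    best = defaultdict(lambda: -2)
    with open(judgment_path) as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue
            key = (row[STREAM_ID_COL], row[TARGET_COL])
            best[key] = max(best[key], int(row[RATING_COL]))
    n_vital = sum(1 for r in best.values() if r == VITAL)
    n_relevant = sum(1 for r in best.values() if r == RELEVANT)
    return len(best), n_vital, n_relevant

# pairs, vital, relevant = count_unique_pairs("kba-2013-judgments.tsv")
\end{verbatim}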
 
 
 
 
 
\section{Experiments and Results}
 
We conducted experiments to study the effect of cleansing, different entity profiles, types of entities, document categories, relevance ratings (vital or relevant), and the impact on classification. In the following subsections, we present and discuss the results.
 
 
 
 \subsection{Cleansing: raw or cleansed}
 
\begin{table}
 
@@ -411,7 +408,7 @@ The results of the breakdown by document categories is presented in a multi-dime
 
 
 
  
 
\subsection{Recall across document categories (others, news and social)}
 
The recall for Wikipedia entities in Table \ref{tab:name} ranges from 61.8\% (canonical names) to 77.9\% (partial names of all name variants). We looked at how this recall is distributed across the three document categories. In Table \ref{tab:source-delta}, Wikipedia column, we see, across all entity profiles, that others have the highest recall, followed by news; social documents achieve the lowest recall. While the news recall ranges from 76.4\% to 98.4\%, the recall for social documents ranges from 65.7\% to 86.8\%. This pattern holds across all name variants for Wikipedia entities. Notice that the others category stands for arxiv (scientific documents), classifieds, forums and linking.
 
 
For Twitter entities, however, the pattern is different. With canonical names (and their partials), social documents achieve higher recall than news. This suggests that social documents refer to Twitter entities by their canonical names (user names) more than news does. With partial names of all name variants, news achieves better results than social. The difference in recall between canonical names and partial names of all name variants shows that news does not refer to Twitter entities by their user names; it refers to them by their display names.
 
@@ -422,26 +419,26 @@ Overall, across all entities types and all entity profiles, others achieve bette
 
 
 
 
We computed four percentage increases in recall (deltas) between the different entity profiles (see Table \ref{tab:source-delta2}). The first delta is the difference in recall between partial names of canonical names and canonical names. The second is the delta between all name variants and canonical names. The third is the difference between partial names of all name variants and partial names of canonical names, and the fourth between partial names of all name variants and all name variants. We believe these four deltas have a clear interpretation. The delta between all name variants and canonical names shows the percentage of documents that the new name variants retrieve, but the canonical names do not. Similarly, the delta between partial names of all name variants and partial names of canonical names shows the percentage of document-entity pairs that can be gained by the partial names of the name variants.
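With $R(\cdot)$ denoting the recall obtained with a given entity profile (notation introduced here for compactness; assumes the amsmath package), the four deltas are:
\begin{align*}
\Delta_1 &= R(\text{cano\_part}) - R(\text{cano}), &
\Delta_2 &= R(\text{all}) - R(\text{cano}),\\
\Delta_3 &= R(\text{all\_part}) - R(\text{cano\_part}), &
\Delta_4 &= R(\text{all\_part}) - R(\text{all}).
\end{align*}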
 
 
For most of the deltas, news, followed by social, followed by others, shows the greatest difference. This suggests that news refers to entities by different names, rather than by a certain standard name. This is counter-intuitive, since one would expect news to mention entities by some consistent name(s), thereby reducing the difference. The deltas for Wikipedia entities between canonical partials and canonicals, and between all name variants and canonicals, are high, suggesting that partial names and the other name variants bring in new documents that cannot be retrieved by canonical names. The remaining two deltas are very small, suggesting that partial names of all name variants do not bring in new relevant documents. For Twitter entities, name variants do bring in new documents.
 
 
% The  biggest delta  observed is in Twitter entities between partials of all name variants and partials of canonicals (93\%). delta. Both of them are for news category.  For Wikipedia entities, the highest delta observed is 19.5\% in cano\_part - cano followed by 17.5\% in all\_part in relevant.  
 
  
 
\subsection{Entity Types: Wikipedia and Twitter}

Table \ref{tab:name} shows the difference between Wikipedia and Twitter entities. Wikipedia entities' canonical names achieve a recall of 70\%, and partial names of canonical names achieve a recall of 86.1\%, an increase in recall of 16.1\%. By contrast, the increase in recall of partial names of all name variants over all name variants is 8.3\%. The high increase in recall when moving from canonical names to their partial names, compared to the lower increase when moving from all name variants to their partial names, can be explained by saturation: documents have already been retrieved by the different name variants, and thus using their partial names does not bring in many new relevant documents. One interesting observation is that, for Wikipedia entities, partial names of canonical names achieve better recall than all name variants. This holds in both the cleansed and raw extractions. %In the raw extraction, the difference is about 3.7.
 
 
For Twitter entities, however, the picture is different. Canonical names and their partial names perform the same, and the recall is very low. Canonical names and partial canonical names are the same for Twitter entities because they are one-word names. For example, in https://twitter.com/roryscovel, ``roryscovel'' is the canonical name and its partial name is the same. The low recall is because the canonical names of Twitter entities are not really names; they are usually arbitrarily created user names. It shows that documents do not refer to Twitter entities by their user names; they refer to them by their display names, which is reflected in the recall (67.9\%). The use of partial names of display names increases the recall to 88.2\%.
 
 
At an aggregate level (both Twitter and Wikipedia entities), we observe two important patterns. 1) Recall increases as we move from canonical names to canonical partial names, to all name variants, and to partial names of all name variants. But we saw that this is not the case for Wikipedia entities; the influence, therefore, comes from Twitter entities. 2) Using canonical names retrieves the smallest number of vital or relevant documents, and partial names of all name variants retrieve the largest number. The difference in performance is 31.9\% on all entities, 20.7\% on Wikipedia entities, and 79.5\% on Twitter entities. This is a significant performance difference.
 
 
 
Tables \ref{tab:name} and \ref{tab:source-delta} show that recall for Wikipedia entities is higher than for Twitter entities. This indicates that Wikipedia entities are easier to match in documents than Twitter entities. This can be due to two reasons: 1) Wikipedia entities are better described than Twitter entities; the fact that we can retrieve different name variants from DBpedia is a measure of this relatively rich description. By contrast, we have only two names for Twitter entities: their user names and their display names, which we collect from their Twitter pages. 2) Twitter entities lack a rich profile, such as DBpedia, from which we can collect alternative names. The results in the tables also show that Twitter entities are mentioned by their display names more than by their user names. However, social documents mention Twitter entities by their user names more than news does, suggesting a difference in convention between news and social documents.
 
 
 
 
   \subsection{Impact on classification}
 
In the overall experimental setup, classification, ranking, and evaluation are kept constant. Here, we present results showing how the choices in corpus, entity types, and entity profiles impact these later stages of the pipeline. In Tables \ref{tab:class-vital} and \ref{tab:class-vital-relevant}, we show the performances in F-measure and SU.
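As a reference point for how scores of this kind are produced, the sketch below computes a per-entity F-measure and macro-averages it over entities. It is a simplified stand-in: the official KBA scorer additionally sweeps a confidence cutoff and reports the maximum, which is not reproduced here, and the scaled utility (SU) computation is omitted.

\begin{verbatim}
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall for one entity; 0 if undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f(per_entity_counts):
    """Average F over entities; per_entity_counts maps entity -> (tp, fp, fn)."""
    scores = [f_measure(*counts) for counts in per_entity_counts.values()]
    return sum(scores) / len(scores) if scores else 0.0

# Toy usage with made-up counts for two entities
print(macro_f({"Benjamin_Bronfman": (8, 2, 4), "roryscovel": (3, 1, 6)}))
\end{verbatim}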
 
\begin{table*}
 
\caption{Vital performance under different name variants (upper part from cleansed, lower part from raw)}
 
\begin{center}
 
@@ -517,7 +514,7 @@ On Wikipedia entities, except in the canonical profile, the cleansed version ach
 
 
For Wikipedia entities, canonical partial names seem to achieve the highest performance; for Twitter, the partial names of all name variants achieve better results. In the vital-relevant setting, raw achieves better results in three cases (except for canonical partials). For Twitter entities, the raw corpus achieves better results. In terms of entity profiles, Wikipedia's canonical partial names achieve the best F-score; for Twitter, as before, partial names of canonical names.
 
 
It seems that the raw corpus has a greater effect on performance for Twitter entities. An increase in recall does not necessarily mean an increase in F-measure. The fact that canonical partial names achieve better results is interesting: partial names were used as a baseline in TREC KBA 2012, but none of the KBA participants actually used partial names for filtering.
 
 
 
\subsection{Missing relevant documents \label{miss}}
 
@@ -555,7 +552,7 @@ converting.  In both cases the mention of the entity happened to be on the part
 
The interesting set of relevance judgments are those that we miss from both the raw and cleansed extractions. These are 2146 unique document-entity pairs, 219 of which have vital relevance judgments. The total number of entities in the missed vital annotations is 28 Wikipedia and 7 Twitter, making a total of 35. Looking into document categories shows that the great majority (86.7\%) of the documents are social. This suggests that social documents (tweets and blogs) can talk about the entities without mentioning them by name. This is, of course, in line with intuition.
 
   
 
   
 
Vital documents show higher recall than relevant ones. This is not surprising, as it is more likely that vital documents mention the entities than relevant ones. Across document categories, we observe a recall pattern of others, followed by news, and then social. Social documents are the hardest to retrieve. This can be explained by the fact that social documents (tweets, blogs) are more likely to point to a resource without mentioning the entities; by contrast, news documents mention the entities they talk about.
 
   
 
   
 
%    
 
@@ -645,7 +642,7 @@ However, it is interesting to look into the actual content of the documents to g
 
% \label{tab:miss from both}
 
% \end{table*}
 
 
We also observed that although documents have different document ids, several of them have the same content. In the vital annotations, there are only three (88 news and 409 weblog). Of the 35 vital document-entity pairs we examined, 13 are news and 22 are social.
 
 
   
 
  
 
@@ -655,7 +652,7 @@ We also observed that although docuemnts have different document ids, several of
 
We conducted experiments to study the impact on recall of the different components of the filtering step of an entity-based filtering and ranking pipeline. Specifically, we studied the impacts of cleansing, entity profiles, relevance ratings, document categories, and the documents that are missed. We also measured their impact on classification performance.
 
 
Experimental results using the TREC-KBA problem setting and dataset show that cleansing removes all or parts of document contents, making them difficult to retrieve. These documents can, however, be retrieved from the raw version. The use of the raw corpus brings in documents that cannot be retrieved from the cleansed corpus. This is true for all entity profiles and for all entity types. The recall increase is between 6.8\% and 26.2\%; in actual document-entity pairs, this increase is in the thousands.
 