Changeset - 43f0469d19e6
[Not reviewed]
Merge
0 2 0
Gebrekirstos Gebremeskel - 11 years ago 2014-06-12 06:30:51
destinycome@gmail.com
updated
2 files changed with 53 insertions and 5 deletions:
0 comments (0 inline, 0 general)
mypaper-final.tex
Show inline comments
 
@@ -184,53 +184,53 @@ occur (news, blogs, or tweets) cause further variations.
 
% final results in comparison to other systems.  A typical TREC KBA
 
% poster presentation or talk explains the system pipeline and reports
 
% the final results. The systems may employ similar (even the same)
 
% steps  but the choices they make at every step are usually
 
% different. 
 
In such a situation, it becomes hard to identify the factors that result in improved performance, and there is a lack of insight across the different approaches. This makes it hard to know whether the improvement in performance of a particular approach is due to preprocessing, filtering, classification, scoring, or any other sub-component of the pipeline.
 
 
 
In this paper, we therefore fix the subsequent steps of the pipeline and zoom in on \emph{only} the filtering step, conducting an in-depth analysis of its main components. In particular, we study the effect of cleansing, entity profiling, the type of entity filtered for (Wikipedia or Twitter), and document category (social, news, etc.) on the performance of the filtering component. The main contributions of the paper are an in-depth analysis of the factors that affect entity-based stream filtering, the identification of optimal entity profiles that do not compromise precision, a description and classification of relevant documents that are not amenable to filtering, and an estimate of the upper bound of recall for entity-based filtering.
 
 
The rest of the paper is organized as follows. Section \ref{sec:desc} describes the dataset and Section \ref{sec:fil} defines the task. In Section \ref{sec:lit}, we discuss related literature, followed by a discussion of our method in Section \ref{sec:mthd}. Following that, we present the experimental results in Section \ref{sec:expr}, and discuss and analyze them in Section \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in Section \ref{sec:impact} and examine and categorize unfilterable documents in Section \ref{sec:unfil}. Finally, we present our conclusions in Section \ref{sec:conc}.
 
 
 
 
 \section{Data Description}\label{sec:desc}
 
We base this analysis on the TREC-KBA 2013 dataset%
 
\footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}}
 
that consists of three main parts: a time-stamped stream corpus, a set of
 
KB entities to be curated, and a set of relevance judgments. A CCR
 
system then has to identify, for each KB entity, which documents in the stream corpus are to be considered by the human curator.
 
 
\subsection{Stream corpus} The stream corpus comes in two versions: raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB, respectively, after xz-compression and GPG encryption. The raw data is a dump of raw HTML pages. The cleansed version is the raw data with HTML tags stripped and with only English documents, as identified by the Chromium Compact Language Detector\footnote{\url{https://code.google.com/p/chromium-compact-language-detector/}}, retained. The stream corpus is organized in hourly folders, each of which contains many chunk files. Each chunk file contains from hundreds up to hundreds of thousands of serialized thrift objects; one thrift object is one document. A document can be a blog article, a news article, or a social media post (including tweets).
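As a concrete illustration of this layout, the following minimal sketch iterates over the chunk files of one hourly folder and counts documents per source. It assumes the chunks have already been decrypted and decompressed and that the TREC-provided \texttt{streamcorpus} Python package is available; the folder name is illustrative.

\begin{verbatim}
# Minimal sketch (illustrative, not our full pipeline): count documents
# per source in one hourly folder, assuming decrypted/decompressed chunks
# and the TREC-provided `streamcorpus` Python package.
import os
from collections import Counter
from streamcorpus import Chunk   # reads serialized thrift StreamItems

hour_dir = "2012-10-05-14"        # illustrative hourly folder name
counts = Counter()
for name in os.listdir(hour_dir):
    if not name.endswith(".sc"):  # one chunk file
        continue
    for item in Chunk(path=os.path.join(hour_dir, name), mode="rb"):
        counts[item.source] += 1  # one thrift object = one document
print(counts)
\end{verbatim}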
 
The stream corpus comes from three sources: TREC KBA 2012 (social, news and linking)\footnote{\url{http://trec-kba.org/kba-stream-corpus-2012.shtml}},
 
@@ -346,49 +346,53 @@ Stream filtering is then the task to, given a stream of documents of news items,
 
 
The TREC filtering track and the filtering that is part of an entity-centric stream filtering and ranking pipeline have different purposes. The TREC filtering track's goal is the binary classification of documents: for each incoming document, it decides whether the document is relevant or not for a given profile. In our case, the documents have graded relevance, and the goal of the filtering stage is to pass on as many potentially relevant documents as possible, while letting through as few irrelevant documents as possible so as not to obfuscate the later stages of the pipeline. Filtering as part of the pipeline therefore needs a delicate balance between retrieving relevant documents and excluding irrelevant ones. Because of this, filtering can only be studied by binding it to the later stages of the entity-centric pipeline. This bond influences how we do evaluation.
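To make this balance concrete, here is a minimal, hypothetical sketch of such a recall-oriented filtering pass; the plain case-insensitive substring matching and the example profile are illustrative assumptions, not necessarily what our pipeline uses.

\begin{verbatim}
# Hypothetical sketch of the filtering stage: a document is passed on to
# the later stages as soon as any surface form in an entity's profile
# matches; matching here is plain case-insensitive substring lookup.
def filter_stream(documents, profiles):
    """documents: iterable of (doc_id, text); profiles: {entity: [variants]}."""
    for doc_id, text in documents:
        lowered = text.lower()
        for entity, variants in profiles.items():
            if any(v.lower() in lowered for v in variants):
                yield entity, doc_id  # candidate for classification/scoring

# Illustrative profile with a canonical name and one name variant.
profiles = {"dbpedia:Some_Entity": ["Some Entity", "S. Entity"]}
\end{verbatim}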
 
 
To study filtering in this coupled setting, we report recall percentages at the filtering stage for the different choices of entity profiles, but we use the overall pipeline performance to select the best entity profiles. To generate the overall pipeline performance, we use the official TREC KBA evaluation metric and scripts \cite{frank2013stream} to report max-F, the maximum F-score obtained over all relevance cut-offs.
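In other words, writing $P(\theta)$ and $R(\theta)$ for the precision and recall of the documents returned at cut-off $\theta$, the reported score is
\[
\mbox{max-F} = \max_{\theta}\;\frac{2\,P(\theta)\,R(\theta)}{P(\theta)+R(\theta)}.
\]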
 
 
\section{Literature Review} \label{sec:lit}
 
 
 
There has been a great deal of interest of late in entity-based filtering and ranking. The Text Analysis Conference started the Knowledge Base Population (KBP) track with the goal of developing methods and technologies to facilitate the creation and population of KBs \cite{ji2011knowledge}. The most relevant track in KBP is entity linking: given an entity and a document containing a mention of the entity, identify the mention in the document and link it to its profile in a KB. Many studies have attempted to address this task \cite{dalton2013neighborhood, dredze2010entity, davis2012named}.
 
 
A more recent manifestation of this interest is the introduction of TREC KBA in 2012. Following that, a number of research works have been done on the topic \cite{frank2012building, ceccarelli2013learning, taneva2013gem, wang2013bit, balog2013multi}. These works are based on the KBA 2012 task and dataset and address the whole problem of entity filtering and ranking. TREC KBA continued in 2013, but the task underwent some changes. The main changes between 2012 and 2013 are in the number of entities, the type of entities, the corpus, and the relevance ratings.
 
 
The number of entities increased from 29 to 141, and now includes 20 Twitter entities. The TREC KBA 2012 corpus is 1.9TB after xz-compression and contains 400M documents. By contrast, the KBA 2013 corpus is 6.45TB after xz-compression and GPG encryption; a version with all non-English documents removed is 4.5TB and consists of 1 billion documents. The 2013 corpus subsumes the 2012 corpus and adds other sources from spinn3r, namely mainstream news, forum, arxiv, classified, reviews and meme-tracker. A more important difference, however, is a change in the definitions of the relevance ratings vital and relevant. While in KBA 2012 a document was judged vital if it had citation-worthy content for a given entity, in 2013 it must also have freshness, that is, its content must trigger an edit of the given entity's KB entry.
 
 
While the tasks of 2012 and 2013 are fundamentally the same, the approaches varied due to the size of the corpus. In 2013, all participants used filtering to reduce the size of the big corpus. They filtered in different ways: many of them used two or more different name variants from DBpedia, such as labels, names, redirects, birth names, aliases, nicknames, same-as and alternative names \cite{wang2013bit,dietzumass,liu2013related, zhangpris}. Although most of the participants used DBpedia name variants, none of them used all the name variants. A few other participants used the bold words in the first paragraph of the Wikipedia entity's profile and anchor texts from other Wikipedia pages \cite{bouvierfiltering, niauniversity}. One participant used a Boolean \emph{and} query built from the tokens of the canonical names \cite{illiotrec2013}.
 
 
All of these studies used filtering as a first step to generate a smaller set of documents, and many systems suffered from poor recall, which strongly affected their overall performance \cite{frank2012building}. Although the systems used different entity profiles to filter the stream and achieved different performance levels, there is no study of the factors and choices that affect the filtering step itself. Of course, filtering has been extensively examined in the TREC Filtering track \cite{robertson2002trec}. However, those studies were isolated in the sense that they were intended to optimize recall. What we have here is a different scenario: documents have graded relevance, so we want to study filtering in connection with relevance to the entities, which can be done by coupling filtering to the later stages of the pipeline. To the best of our knowledge this is new, and the TREC KBA problem setting and datasets offer a good opportunity to examine this aspect of filtering.
 
 
Moreover, there has been no opportunity to study, at this scale, what types of documents defy filtering and why. In this paper, we conduct a manual examination of the documents that are missed and classify them into different categories. We also estimate the general upper bound of recall for the different entity profiles and choose the profile that results in the best overall performance as measured by F-measure.
 
 
\section{Method}\label{sec:mthd}
 
All analyses in this paper are carried out on the documents that have relevance assessments associated with them. For this purpose, we
 
extracted those documents from the big corpus. We experiment with all
 
KB entities. For each KB entity, we extract different name variants
 
from DBpedia and Twitter.
 
 
 
\subsection{Entity Profiling}
 
We build entity profiles for the KB entities of interest. There are two types: Twitter and Wikipedia entities. Both types have been selected on purpose by the track organisers to occur only sparsely and to be less documented.
 
For the Wikipedia entities, we fetch different name variants
 
from DBpedia: name, label, birth name, alternative names,
 
redirects, nickname, or alias. 
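A minimal sketch of how such name variants can be collected from DBpedia is shown below; the public SPARQL endpoint and the particular property names are illustrative assumptions and do not necessarily match the exact properties used in our runs.

\begin{verbatim}
# Hypothetical sketch: gather candidate name variants for one Wikipedia
# entity from the public DBpedia SPARQL endpoint. The property names
# (birthName, alias, wikiPageRedirects, ...) are illustrative.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?v WHERE {
  { <URI> rdfs:label ?v } UNION
  { <URI> foaf:name ?v } UNION
  { <URI> dbo:birthName ?v } UNION
  { <URI> dbo:alias ?v } UNION
  { ?r dbo:wikiPageRedirects <URI> . ?r rdfs:label ?v }
}
"""

def name_variants(resource):  # e.g. the Wikipedia page title of a KB entity
    uri = "http://dbpedia.org/resource/" + resource
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(QUERY.replace("URI", uri))
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return sorted({row["v"]["value"] for row in rows})
\end{verbatim}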
sigproc.bib
Show inline comments
 
@@ -120,24 +120,68 @@
 
  year={2013}
 
}
 
 
@article{niauniversity,
 
  title={University of Florida Knowledge Base Acceleration Notebook},
 
  author={Nia, Morteza Shahriari and Grant, Christan and Peng, Yang and Wang, Daisy Zhe and Petrovic, Milenko},
 
  journal={Proceedings of the 22nd TREC},
 
  year={2013}
 
}
 
 
@article{frank2013stream,
 
  title={Evaluating Stream Filtering for Entity Profile Updates for TREC 2013},
 
  author={Frank, John R and Bauer, J and Kleiman-Weiner, Max and Roberts, Daniel A and Tripuraneni, Nilesh and Zhang, Ce and R{\'e}, Christopher and Voorhees, Ellen and Soboroff, Ian},
 
  journal={Proceedings of the 22nd TREC},
 
  year={2013}
 
}
 
 
@inproceedings{spitkovsky2012cross,
 
  title={A Cross-Lingual Dictionary for English Wikipedia Concepts.},
 
  author={Spitkovsky, Valentin I and Chang, Angel X},
 
  booktitle={LREC},
 
  pages={3168--3175},
 
  year={2012}
 
}
 
 
@inproceedings{ji2011knowledge,
 
  title={Knowledge base population: Successful approaches and challenges},
 
  author={Ji, Heng and Grishman, Ralph},
 
  booktitle={Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1},
 
  pages={1148--1158},
 
  year={2011},
 
  organization={Association for Computational Linguistics}
 
}
 
 
@techreport{singh12:wiki-links,
 
      author    = {Sameer Singh and Amarnag Subramanya and Fernando Pereira and Andrew McCallum},
 
      title     = {Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to {Wikipedia}},
 
      institution = {University of Massachusetts, Amherst},
 
      number    = {UM-CS-2012-015},
 
      year      = {2012}
 
}
 
 
@inproceedings{dredze2010entity,
 
  title={Entity disambiguation for knowledge base population},
 
  author={Dredze, Mark and McNamee, Paul and Rao, Delip and Gerber, Adam and Finin, Tim},
 
  booktitle={Proceedings of the 23rd International Conference on Computational Linguistics},
 
  pages={277--285},
 
  year={2010},
 
  organization={Association for Computational Linguistics}
 
}
 
 
@inproceedings{dalton2013neighborhood,
 
  title={A neighborhood relevance model for entity linking},
 
  author={Dalton, Jeffrey and Dietz, Laura},
 
  booktitle={Proceedings of the 10th Conference on Open Research Areas in Information Retrieval},
 
  pages={149--156},
 
  year={2013},
 
  organization={LE CENTRE DE HAUTES ETUDES INTERNATIONALES D'INFORMATIQUE DOCUMENTAIRE}
 
}
 
 
@inproceedings{davis2012named,
 
  title={Named entity disambiguation in streaming data},
 
  author={Davis, Alexandre and Veloso, Adriano and da Silva, Altigran S and Meira Jr, Wagner and Laender, Alberto HF},
 
  booktitle={Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1},
 
  pages={815--824},
 
  year={2012},
 
  organization={Association for Computational Linguistics}
 
}
 
\ No newline at end of file
0 comments (0 inline, 0 general)