HCDA/cikm-paper Changeset - 43f0469d19e6 · Centrum Wiskunde & Informatica (CWI)

Changeset - 43f0469d19e6

Gebrekirstos Gebremeskel - 11 years ago 2014-06-12 06:30:51
destinycome@gmail.com

updated

2 files changed with 53 insertions and 5 deletions:

mypaper-final.tex

sigproc.bib

mypaper-final.tex

@@ @@ -202,17 +202,17 @@ performance. The main contribution of the @@
 paper are an in-depth analysis of the factors that affect entity-based
 stream filtering, identifying optimal entity profiles without
 compromising precision, describing and classifying relevant documents
 that are not amenable to filtering , and estimating the upper-bound
 of recall on entity-based filtering.
-<<<<<<< HEAD
 The rest of the paper  is organized as follows. Section \ref{sec:desc} describes the dataset and section \ref{sec:fil} defines the task. In section  \ref{sec:lit}, we discuss related litrature folowed by a discussion of our method in \ref{sec:mthd}. Following that,  we present the experimental resulsy in \ref{sec:expr}, and discuss and analyze them in \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in section \ref{sec:impact}, examine and categorize unfilterable docuemnts in section \ref{sec:unfil}. Finally, we present our conclusions in \ref{sec:conc}.
 =======
 The rest of the paper  is organized as follows. Section \ref{sec:desc} describes the dataset and section \ref{sec:fil} defines the task. In section  \ref{sec:lit}, we discuss related litrature folowed by a discussion of our method in \ref{sec:mthd}. Following that,  we present the experimental resulsy in \ref{sec:expr}, and discuss and analyze them in \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in section \ref{sec:impact}, examine and categorize unfilterable documents in section \ref{sec:unfil}. Finally, we present our conclusions in \ref{}{sec:conc}.
 >>>>>>> 51b8586f2e1def3777b3e65737b7ab32c2ff0981
 The rest of the paper is organized as follows. Section \ref{sec:desc} describes the dataset and Section \ref{sec:fil} defines the task. In Section \ref{sec:lit}, we discuss related literature, followed by a discussion of our method in Section \ref{sec:mthd}. Following that, we present the experimental results in Section \ref{sec:expr}, and discuss and analyze them in Section \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in Section \ref{sec:impact}, and examine and categorize unfilterable documents in Section \ref{sec:unfil}. Finally, we present our conclusions in Section \ref{sec:conc}.
  \section{Data Description}\label{sec:desc}
 We base this analysis on the TREC-KBA 2013 dataset%
 \footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}}
 that consists of three main parts: a time-stamped stream corpus, a set of
 performance to select the best entity profiles. To generate the overall
 pipeline performance we use the official TREC KBA evaluation metric
 and scripts \cite{frank2013stream} to report max-F, the maximum
 F-score obtained over all relevance cut-offs.
 \section{Literature Review} \label{sec:lit}
 There has been a great deal of interest  as of late on entity-based filtering and ranking. One manifestation of that is the introduction of TREC KBA in 2012. Following that, there have been a number of research works done on the topic \cite{frank2012building, ceccarelli2013learning, taneva2013gem, wang2013bit, balog2013multi}.  These works are based on KBA 2012 task and dataset  and they address the whole problem of entity filtering and ranking.  TREC KBA continued in 2013, but the task underwent some changes. The main change between  the 2012 and 2013 are in the number of entities, the type of entities, the corpus and the relevance rankings.
 There has been a great deal of interest of late in entity-based filtering and ranking. The Text Analysis Conference started Knowledge Base Population (KBP) with the goal of developing methods and technologies to facilitate the creation and population of KBs \cite{ji2011knowledge}. The most relevant track in KBP is entity linking: given an entity and
 a document containing a mention of the entity, identify the mention in the document and link it to its profile in a KB. Many studies have attempted to address this task \cite{dalton2013neighborhood, dredze2010entity, davis2012named}.
 A more recent manifestation of this interest is the introduction of TREC KBA in 2012. Following that, a number of studies have addressed the topic \cite{frank2012building, ceccarelli2013learning, taneva2013gem, wang2013bit, balog2013multi}. These works are based on the KBA 2012 task and dataset, and they address the whole problem of entity filtering and ranking. TREC KBA continued in 2013, but the task underwent some changes. The main changes between 2012 and 2013 are in the number of entities, the type of entities, the corpus, and the relevance ratings.
 The number of entities increased from 29 to 141, and the set included 20 Twitter entities. The TREC KBA 2012 corpus is 1.9 TB after XZ compression and has 400M documents. By contrast, the KBA 2013 corpus is 6.45 TB after XZ compression and GPG encryption. A version with all non-English documents removed is 4.5 TB and consists of 1 billion documents. The 2013 corpus subsumed the 2012 corpus and added others from spinn3r, namely mainstream news, forum, arxiv, classified, reviews, and meme-tracker. A more important difference, however, is a change in the definitions of the relevance ratings vital and relevant. While in KBA 2012 a document was judged vital if it had citation-worthy content for a given entity, in 2013 it must have freshness, that is, the content must trigger an edit of the given entity's KB entry.
 While the tasks of 2012 and 2013 are fundamentally the same, the approaches varied due to the size of the corpus. In 2013, all participants used filtering to reduce the size of the big corpus. They filtered in different ways: many of them used two or more different name variants from DBpedia, such as labels, names, redirects, birth names, aliases, nicknames, same-as, and alternative names \cite{wang2013bit,dietzumass,liu2013related, zhangpris}. Although most of the participants used DBpedia name variants, none of them used all the name variants. A few other participants used bold words in the first paragraph of the entity's Wikipedia profile and anchor texts from other Wikipedia pages \cite{bouvierfiltering, niauniversity}. One participant used a Boolean \emph{and} built from the tokens of the canonical names \cite{illiotrec2013}.
 All of the studies used filtering as their first step to generate a smaller set of documents. Many systems suffered from poor recall, which strongly affected their overall performance \cite{frank2012building}. Although systems used different entity profiles to filter the stream and achieved different performance levels, there is no study of the factors and choices that affect the filtering step itself. Of course, filtering has been extensively examined in TREC Filtering \cite{robertson2002trec}. However, those studies were isolated in the sense that they were intended to optimize recall. What we have here is a different scenario: documents have relevance ratings. Thus, we want to study filtering in connection with relevance to the entities, which can be done by coupling filtering to the later stages of the pipeline. To the best of our knowledge this is new, and the TREC KBA problem setting and datasets offer a good opportunity to examine this aspect of filtering.

sigproc.bib

@@ @@ -138,6 +138,50 @@ @@
   title={A Cross-Lingual Dictionary for English Wikipedia Concepts.},
   author={Spitkovsky, Valentin I and Chang, Angel X},
   booktitle={LREC},
   pages={3168--3175},
   year={2012}
+}
 @inproceedings{ji2011knowledge,
   title={Knowledge base population: Successful approaches and challenges},
   author={Ji, Heng and Grishman, Ralph},
   booktitle={Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1},
   pages={1148--1158},
   year={2011},
   organization={Association for Computational Linguistics}
+}
 @techreport{singh12:wiki-links,
       author    = {Sameer Singh and Amarnag Subramanya and Fernando Pereira and Andrew McCallum},
       title     = {Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to {Wikipedia}},
       institution = {University of Massachusetts, Amherst},
       number    = {UM-CS-2012-015},
       year      = {2012}
+}
 @inproceedings{dredze2010entity,
   title={Entity disambiguation for knowledge base population},
   author={Dredze, Mark and McNamee, Paul and Rao, Delip and Gerber, Adam and Finin, Tim},
   booktitle={Proceedings of the 23rd International Conference on Computational Linguistics},
   pages={277--285},
   year={2010},
   organization={Association for Computational Linguistics}
+}
 @inproceedings{dalton2013neighborhood,
   title={A neighborhood relevance model for entity linking},
   author={Dalton, Jeffrey and Dietz, Laura},
   booktitle={Proceedings of the 10th Conference on Open Research Areas in Information Retrieval},
   pages={149--156},
   year={2013},
   organization={LE CENTRE DE HAUTES ETUDES INTERNATIONALES D'INFORMATIQUE DOCUMENTAIRE}
+}
 @inproceedings{davis2012named,
   title={Named entity disambiguation in streaming data},
   author={Davis, Alexandre and Veloso, Adriano and da Silva, Altigran S and Meira Jr, Wagner and Laender, Alberto HF},
   booktitle={Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1},
   pages={815--824},
   year={2012},
   organization={Association for Computational Linguistics}
+}
@@ \ No newline at end of file @@
