Changeset - 2594949a76bb
Arjen de Vries (arjen) - 11 years ago 2014-06-12 03:02:11
arjen.de.vries@cwi.nl
trying...
1 file changed with 5 insertions and 1 deletion:
mypaper-final.tex
 
@@ -376,13 +376,17 @@ While the tasks of 2012 and 2013 are fundamentally the same, the approaches  var
 
 
All of the studies used filtering as a first step to generate a smaller set of documents, and many systems suffered from poor recall, which strongly affected their overall performance \cite{frank2012building}. Although systems used different entity profiles to filter the stream and achieved different performance levels, there is no study of the factors and choices that affect the filtering step itself. Of course, filtering has been examined extensively in the TREC Filtering track \cite{robertson2002trec}. However, those studies were isolated in the sense that they were intended to optimize recall. Here we face a different scenario: documents carry relevance ratings, so we want to study filtering in relation to relevance to the entities, which can be done by coupling filtering to the later stages of the pipeline. To the best of our knowledge this is new, and the TREC KBA problem setting and data sets offer a good opportunity to examine this aspect of filtering.
 
 
Moreover, there has been no study at this scale of which types of documents defy filtering, and why. In this paper, we conduct a manual examination of the documents that are missed and classify them into different categories. We also estimate a general upper bound on recall using the different entity profiles, and choose the profile that yields the best overall performance as measured by F-measure.
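For reference, we take F-measure to be the standard balanced $F_1$, the harmonic mean of precision $P$ and recall $R$; this is stated only to fix notation (the official TREC KBA evaluation may additionally maximize over confidence cutoffs):
\[
F_1 = \frac{2 \cdot P \cdot R}{P + R}.
\]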
 
 
\section{Method}
 
 
All analyses in this paper are carried out on the documents that have relevance assessments associated with them. For this purpose, we extracted those documents from the full corpus. We experiment with all KB entities. For each KB entity, we extract different name variants from DBpedia and Twitter.
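As an illustration of this extraction step, the sketch below filters a stream corpus down to the assessed documents. It is a minimal sketch under stated assumptions: the relevance assessments are taken to be a tab-separated judgments file with comment lines starting with \#, the corpus is taken to be flattened into newline-delimited JSON, and the field names and column positions are illustrative rather than the actual KBA distribution format.

\begin{verbatim}
import json

def load_assessed_ids(judgments_path):
    # Collect the ids of all documents that received a relevance
    # rating. The column holding the document id is an assumption.
    assessed = set()
    with open(judgments_path) as f:
        for line in f:
            if line.startswith('#') or not line.strip():
                continue
            assessed.add(line.rstrip('\n').split('\t')[2])
    return assessed

def extract_assessed_docs(corpus_paths, assessed_ids):
    # Yield only the documents whose id carries an assessment.
    for path in corpus_paths:
        with open(path) as f:
            for line in f:
                doc = json.loads(line)
                if doc.get('stream_id') in assessed_ids:
                    yield doc
\end{verbatim}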
 
 
 
\subsection{Entity Profiling}
 
We build profiles for the KB entities of interest. There are two types of entities: Twitter and Wikipedia. Both types were deliberately selected to be sparse and less well-documented. For the Twitter entities, we visit their respective Twitter pages and manually fetch their display names. For the Wikipedia entities, we fetch different name variants from DBpedia: name, label, birth name, alternative names, redirects, nickname, and alias. The extraction results are shown in Table \ref{tab:sources}.
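The DBpedia lookup can be sketched as a SPARQL query against the public endpoint, here issued through the SPARQLWrapper Python library. The mapping of our name-variant fields onto specific DBpedia properties (dbo:birthName, dbo:alias, dbp:nickname, redirects via dbo:wikiPageRedirects) is an assumption for illustration and may not match the extraction actually used.

\begin{verbatim}
from SPARQLWrapper import SPARQLWrapper, JSON

# Prefixes rdfs:, foaf:, dbo: and dbp: are predefined on the endpoint.
QUERY = """
SELECT DISTINCT ?variant WHERE {{
  {{ <{uri}> rdfs:label ?variant }}
  UNION {{ <{uri}> foaf:name ?variant }}
  UNION {{ <{uri}> dbo:birthName ?variant }}
  UNION {{ <{uri}> dbo:alias ?variant }}
  UNION {{ <{uri}> dbp:nickname ?variant }}
  UNION {{ ?r dbo:wikiPageRedirects <{uri}> . ?r rdfs:label ?variant }}
  FILTER (lang(?variant) = "en")
}}
"""

def name_variants(entity_uri):
    # Return the set of English name variants DBpedia records
    # for the given entity URI.
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(QUERY.format(uri=entity_uri))
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return {b["variant"]["value"]
            for b in results["results"]["bindings"]}

# e.g. name_variants("http://dbpedia.org/resource/Barack_Obama")
\end{verbatim}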
 
\begin{table}
 
\caption{Number of different DBpedia name variants}