Changeset - 51b8586f2e1d
Arjen de Vries (arjen) - 11 years ago 2014-06-12 05:12:35
arjen.de.vries@cwi.nl
missing sec label fixed
1 file changed with 3 insertions and 3 deletions:
mypaper-final.tex
 
@@ -202,13 +202,13 @@ performance. The main contribution of the
 
paper are an in-depth analysis of the factors that affect entity-based
 
stream filtering, identifying optimal entity profiles without
 
compromising precision, describing and classifying relevant documents
 
that are not amenable to filtering, and estimating the upper-bound
 
of recall on entity-based filtering.
 
 
The rest of the paper  is organized as follows. Section \ref{sec:desc} describes the dataset and section \ref{sec:fil} defines the task. In section  \ref{sec:lit}, we discuss related litrature folowed by a discussion of our method in \ref{sec:mthd}. Following that,  we present the experimental resulsy in \ref{sec:expr}, and discuss and analyze them in \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in section \ref{sec:impact}, examine and categorize unfilterable docuemnts in section \ref{sec:unfil}. Finally, we present our conclusions in \ref{}{sec:conc}.
 
The rest of the paper  is organized as follows. Section \ref{sec:desc} describes the dataset and section \ref{sec:fil} defines the task. In section  \ref{sec:lit}, we discuss related litrature folowed by a discussion of our method in \ref{sec:mthd}. Following that,  we present the experimental resulsy in \ref{sec:expr}, and discuss and analyze them in \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in section \ref{sec:impact}, examine and categorize unfilterable documents in section \ref{sec:unfil}. Finally, we present our conclusions in \ref{}{sec:conc}.
 
 
 
 \section{Data Description}\label{sec:desc}
 
We base this analysis on the TREC-KBA 2013 dataset%
 
\footnote{\url{http://trec-kba.org/trec-kba-2013.shtml}}
 
that consists of three main parts: a time-stamped stream corpus, a set of
 
@@ -1045,15 +1045,15 @@ The high recall and subsequent higher overall performance of Wikipedia entities
 
In the experimental results, we also observed that recall scores in the vital category are higher than in the relevant category. This observation confirms one commonly held assumption: mention frequency is related to relevance. This assumption is why term frequency is used as an indicator of document relevance in many information retrieval systems. The more often a document mentions an entity explicitly by name, the more likely the document is vital to the entity.
 
 
Across document categories, recall is highest for others, followed by news, and then by social. Social documents are the hardest to retrieve. This can be explained by the fact that social documents (tweets and blogs) are more likely to point to a resource where the entity is mentioned, to mention the entities with some short abbreviation, or to talk about them without mentioning them, relying on context instead. By contrast, news documents mention the entities they talk about using the common name variants more than social documents do. However, the greater difference in percentage recall between the different entity profiles in the news category indicates that news documents refer to a given entity with different names, rather than by one standard name. By contrast, others show the least variation in referring to news. Social documents fall in between the two. The deltas, for Wikipedia entities, between canonical partials and canonicals, and between name-variants and canonicals, are high, an indication that canonical partials
 
and name-variants bring in new relevant documents that cannot be retrieved by canonicals. The remaining two deltas are very small, suggesting that partial names of name variants do not bring in new relevant documents.
 
 
 
\section{Unfilterable documents}\label{sec:unfil}
 
%\section{Unfilterable documents}\label{sec:unfil}
 
 
\subsection{Missing vital-relevant documents \label{miss}}
 
\section{Missing vital-relevant documents}\label{sec:unfil}
 
 
% 
 
 
 The use of name-variant partial for filtering is an aggressive attempt to retrieve as many relevant documents as possible, at the cost of also retrieving irrelevant documents. However, we still miss about 2363 (10\%) of the vital-relevant documents. Why are these documents missed? If they do not mention the entities by partial names of name variants, how do they refer to them? Table \ref{tab:miss} shows the documents that we miss with respect to the cleansed and raw corpus. The upper part shows the number of documents missing from the cleansed and raw versions of the corpus. The lower part of the table shows the intersections and exclusions in each corpus.
 
 
\begin{table}