From 7babd96c03f9ff71359881709d1d9f70b4636196 2014-06-12 05:21:03
From: Gebrekirstos Gebremeskel
Date: 2014-06-12 05:21:03
Subject: [PATCH] updated

---
diff --git a/mypaper-final.tex b/mypaper-final.tex
index 6b5a6254620784570a75dfe24eca0a7e8e124e0d..13077f056ee452137910c91969c62f7564d2a96f 100644
--- a/mypaper-final.tex
+++ b/mypaper-final.tex
@@ -205,11 +205,11 @@ compromising precision, describing and classifying relevant documents that are not amenable to filtering , and estimating the upper-bound of recall on entity-based filtering.
-<<<<<<< HEAD
-The rest of the paper is organized as follows. Section \ref{sec:desc} describes the dataset and section \ref{sec:fil} defines the task. In section \ref{sec:lit}, we discuss related litrature folowed by a discussion of our method in \ref{sec:mthd}. Following that, we present the experimental resulsy in \ref{sec:expr}, and discuss and analyze them in \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in section \ref{sec:impact}, examine and categorize unfilterable docuemnts in section \ref{sec:unfil}. Finally, we present our conclusions in \ref{sec:conc}.
-=======
-The rest of the paper is organized as follows. Section \ref{sec:desc} describes the dataset and section \ref{sec:fil} defines the task. In section \ref{sec:lit}, we discuss related litrature folowed by a discussion of our method in \ref{sec:mthd}. Following that, we present the experimental resulsy in \ref{sec:expr}, and discuss and analyze them in \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in section \ref{sec:impact}, examine and categorize unfilterable documents in section \ref{sec:unfil}. Finally, we present our conclusions in \ref{}{sec:conc}.
->>>>>>> 51b8586f2e1def3777b3e65737b7ab32c2ff0981
+The rest of the paper is organized as follows. Section \ref{sec:desc} describes the dataset and Section \ref{sec:fil} defines the task. In Section \ref{sec:lit}, we discuss related literature, followed by a discussion of our method in Section \ref{sec:mthd}. We then present the experimental results in Section \ref{sec:expr} and analyze them in Section \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in Section \ref{sec:impact}, and examine and categorize unfilterable documents in Section \ref{sec:unfil}. Finally, we present our conclusions in Section \ref{sec:conc}.
 
 \section{Data Description}\label{sec:desc}
 
@@ -726,10 +726,12 @@ features we used are 13 and are listed below.
 
 \paragraph*{Google's Cross Lingual Dictionary (GCLD)}
-This is a mapping of strings to Wikipedia concepts and vice versa
-\cite{spitkovsky2012cross}.
+The GCLD resource \cite{spitkovsky2012cross} provides two probabilities: (1) the probability with which a string is used as anchor text for a given Wikipedia entity,
+%thus distributing the probability mass over the different entities that it is used as anchor text;
+and (2) a probability indicating the strength of co-reference of an anchor relative to the other anchors of that entity. We use the product of the two probabilities for each string (both are sketched in the formulas below).
 
 \paragraph*{jac} Jaccard similarity between the document and the entity's Wikipedia page
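+In symbols (notation ours; a sketch of the two estimates rather than the exact formulation of \cite{spitkovsky2012cross}): for a string $s$ and a Wikipedia entity $e$,
+\[
+\mathrm{gcld}(s,e) \;=\; P(e \mid s) \cdot P(s \mid e),
+\]
+and, for the jac feature, assuming bag-of-words term sets $T(d)$ and $T(W_e)$ of a document $d$ and the entity's Wikipedia page $W_e$,
+\[
+\mathrm{jac}(d,e) \;=\; \frac{|T(d) \cap T(W_e)|}{|T(d) \cup T(W_e)|}.
+\]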
@@ -941,49 +943,49 @@ In vital-relevant category (Table \ref{tab:class-vital-relevant}), the performan
 We conducted experiments to study the impacts on recall of different components of the filtering stage of entity-based filtering and ranking pipeline. Specifically we conducted experiments to study the impacts of cleansing,
-entity profiles, relevance ratings, categories of documents, entity
-profiles. We also measured impact of the different factors and
+entity profiles, relevance ratings, and categories of
+documents. We also measured the impact of the different factors and
 choices on later stages of the pipeline of our own system.
 
-Experimental results show that cleansing can remove entire or parts of
-the content of documents making them difficult to retrieve. These
-documents can, otherwise, be retrieved from the raw version. The use
-of the raw corpus brings in documents that can not be retrieved from
-the cleansed corpus. This is true for all entity profiles and for all
-entity types. The recall difference between the cleansed and raw
-ranges from 6.8\% t 26.2\%. These increases, in actual
-document-entity pairs, is in thousands. We believe this is a
-substantial increase. However, the recall increases do not always
-translate to improved F-score in overall performance. In the vital
-relevance ranking for both Wikipedia and aggregate entities, the
-cleansed version performs better than the raw version. In Twitter
-entities, the raw corpus achieves better except in the case of all
-name-variant, though the difference is negligible. However, for
-vital-relevant, the raw corpus performs better across all entity
-profiles and entity types except in partial canonical names of
-Wikipedia entities.
-
-The use of different profiles also shows a big difference in
-recall. While in Wikipedia the use of canonical
-partial achieves better than name-variant, there is a steady increase
-in recall from canonical to canonical partial, to name-variant, and
-to name-variant partial. This pattern is also observed across the
-document categories. However, here too, the relationship between
-the gain in recall as we move from less richer profile to a more
-richer profile and overall performance as measured by F-score is not
-linear.
-
+Experimental results show that cleansing can remove all or part of
+the content of a document, making it difficult to retrieve; such
+documents can, however, still be retrieved from the raw version. The
+use of the raw corpus brings in documents that cannot be retrieved
+from the cleansed corpus. This holds for all entity profiles and all
+entity types. The recall difference between the cleansed and raw
+corpus ranges from 6.8\% to 26.2\%. In actual document-entity pairs,
+these increases amount to thousands of pairs, which we consider a
+substantial gain. However, the recall increases do not always
+translate into an improved overall F-score. In the vital relevance
+ranking for both Wikipedia and aggregate entities, the cleansed
+version performs better than the raw version. For Twitter entities,
+the raw corpus performs better except in the case of all
+name-variant, though the difference is negligible. For
+vital-relevant, however, the raw corpus performs better across all
+entity profiles and entity types except for partial canonical names
+of Wikipedia entities.
+
+The choice of profile also makes a big difference in recall. While
+for Wikipedia entities canonical partial performs better than
+name-variant, there is a steady increase in recall from canonical to
+canonical partial, to name-variant, and to name-variant partial.
+This pattern also holds across the document categories. Here too,
+however, the relationship between the recall gained by moving from a
+leaner profile to a richer one and overall performance as measured by
+F-score is not linear.
+
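+To make the four profiles concrete, the following sketch shows one way they can be built and applied (function and variable names are ours for illustration, not those of our actual system; we assume here that partial names are the whitespace-separated tokens of full names):
+\begin{verbatim}
+# Sketch only: build the four entity profiles and filter documents
+# by exact string containment.
+def build_profiles(canonical, name_variants):
+    def partials(names):
+        # Partial names: whitespace-separated tokens of each name.
+        return {tok for name in names for tok in name.split()}
+    return {
+        "canonical": {canonical},
+        "canonical_partial": {canonical} | partials([canonical]),
+        "name_variant": {canonical} | set(name_variants),
+        "name_variant_partial": {canonical} | set(name_variants)
+                                | partials([canonical] + name_variants),
+    }
+
+def matches(document_text, profile):
+    # A document is retrieved if any profile string occurs in it.
+    text = document_text.lower()
+    return any(name.lower() in text for name in profile)
+\end{verbatim}
+On this reading, each richer profile is a superset of the leaner one, which is consistent with the monotone increase in recall reported above.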
 %%%%%%%%%%%%
-In vital ranking, across all entity profiles and types of corpus,
-Wikipedia's canonical partial achieves better performance than any
-other Wikipedia entity profiles. In vital-relevant documents too,
-Wikipedia's canonical partial achieves the best result. In the raw
-corpus, it achieves a little less than name-variant partial. For
-Twitter entities, the name-variant partial profile achieves the
-highest F-score across all entity profiles and types of corpus.
+In vital ranking, across all entity profiles and corpus types,
+Wikipedia's canonical partial achieves better performance than any
+other Wikipedia entity profile. For vital-relevant documents too,
+Wikipedia's canonical partial achieves the best result; on the raw
+corpus it scores only slightly below name-variant partial. For
+Twitter entities, the name-variant partial profile achieves the
+highest F-score across all entity profiles and corpus types.
 
 There are 3 interesting observations:
 
@@ -1007,32 +1009,32 @@ transformation and cleasing processes.
 %%%% NEEDS WORK:
 Taking both performance (recall at filtering and overall F-score
-during evaluation) into account, there is a clear trade-off between
-using a richer entity-profile and retrieval of irrelevant
-documents. The richer the profile, the more relevant documents it
-retrieves, but also the more irrelevant documents. To put it into
-perspective, lets compare the number of documents that are retrieved
-with canonical partial and with name-variant partial. Using the raw
-corpus, the former retrieves a total of 2547487 documents and achieves
-a recall of 72.2\%. By contrast, the later retrieves a total of
-4735318 documents and achieves a recall of 90.2\%. The total number of
-documents extracted increases by 85.9\% for a recall gain of 18\%. The
-rest of the documents, that is 67.9\%, are newly introduced irrelevant
-documents.
-
-Perhaps surprising, Wikipedia's canonical partial is the best entity profile for Wikipedia
-entities. Here, the retrieval of
-thousands vital-relevant document-entity pairs by name-variant partial
-does not materialize into an increase in over all performance. Notice
-that none of the participants in TREC KBA considered canonical partial
-as a viable strategy though. We conclude that, at least for our
-system, the remainder of the pipeline needs a different approach to
-handle the correct scoring of the additional documents -- that are
-necessary if we do not want to accept a low recall of the filtering
-step.
-%With this understanding, there is actually no
-%need to go and fetch different names variants from DBpedia, a saving
-%of time and computational resources.
+during evaluation) into account, there is a clear trade-off between
+using a richer entity profile and retrieving irrelevant
+documents. The richer the profile, the more relevant documents it
+retrieves, but also the more irrelevant ones. To put this into
+perspective, let us compare the numbers of documents retrieved with
+canonical partial and with name-variant partial. Using the raw
+corpus, the former retrieves a total of 2547487 documents and
+achieves a recall of 72.2\%. By contrast, the latter retrieves a
+total of 4735318 documents and achieves a recall of 90.2\%. The total
+number of documents extracted thus increases by 85.9\% for a recall
+gain of 18 percentage points; the remaining 67.9\% are newly
+introduced irrelevant documents.
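+To make the arithmetic explicit (our restatement of the figures above):
+\[
+\frac{4735318 - 2547487}{2547487} \approx 0.859,
+\]
+an 85.9\% increase in extracted documents against a recall gain of $90.2\% - 72.2\% = 18$ percentage points; the remaining $85.9\% - 18\% = 67.9\%$ is the share we count as newly introduced irrelevant documents.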
+
+Perhaps surprisingly, Wikipedia's canonical partial is the best
+entity profile for Wikipedia entities. Here, the retrieval of
+thousands of vital-relevant document-entity pairs by name-variant
+partial does not materialize into an increase in overall
+performance. Notice, though, that none of the participants in TREC
+KBA considered canonical partial a viable strategy. We conclude
+that, at least for our system, the remainder of the pipeline needs a
+different approach to correctly score the additional documents --
+documents that are necessary if we do not want to accept a low
+recall in the filtering step.
+%With this understanding, there is actually no
+%need to go and fetch different name variants from DBpedia, a saving
+%of time and computational resources.
 %%%%%%%%%%%%