From 7babd96c03f9ff71359881709d1d9f70b4636196 2014-06-12 05:21:03
From: Gebrekirstos Gebremeskel
Date: 2014-06-12 05:21:03
Subject: [PATCH] updated

---
diff --git a/mypaper-final.tex b/mypaper-final.tex
index 6b5a6254620784570a75dfe24eca0a7e8e124e0d..13077f056ee452137910c91969c62f7564d2a96f 100644
--- a/mypaper-final.tex
+++ b/mypaper-final.tex
@@ -205,11 +205,11 @@ compromising precision, describing and classifying relevant documents that are not amenable to filtering , and estimating the upper-bound of recall on entity-based filtering.
-<<<<<<< HEAD
-The rest of the paper is organized as follows. Section \ref{sec:desc} describes the dataset and section \ref{sec:fil} defines the task. In section \ref{sec:lit}, we discuss related litrature folowed by a discussion of our method in \ref{sec:mthd}. Following that, we present the experimental resulsy in \ref{sec:expr}, and discuss and analyze them in \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in section \ref{sec:impact}, examine and categorize unfilterable docuemnts in section \ref{sec:unfil}. Finally, we present our conclusions in \ref{sec:conc}.
-=======
-The rest of the paper is organized as follows. Section \ref{sec:desc} describes the dataset and section \ref{sec:fil} defines the task. In section \ref{sec:lit}, we discuss related litrature folowed by a discussion of our method in \ref{sec:mthd}. Following that, we present the experimental resulsy in \ref{sec:expr}, and discuss and analyze them in \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in section \ref{sec:impact}, examine and categorize unfilterable documents in section \ref{sec:unfil}. Finally, we present our conclusions in \ref{}{sec:conc}.
->>>>>>> 51b8586f2e1def3777b3e65737b7ab32c2ff0981
+The rest of the paper is organized as follows. Section \ref{sec:desc} describes the dataset and Section \ref{sec:fil} defines the task. In Section \ref{sec:lit}, we discuss related literature, followed by a discussion of our method in Section \ref{sec:mthd}. We then present the experimental results in Section \ref{sec:expr} and analyze them in Section \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in Section \ref{sec:impact}, and examine and categorize unfilterable documents in Section \ref{sec:unfil}. Finally, we present our conclusions in Section \ref{sec:conc}.
 
 \section{Data Description}\label{sec:desc}
 
@@ -726,10 +726,12 @@ features we used are 13 and are listed below.
 
 \paragraph*{Google's Cross Lingual Dictionary (GCLD)}
-This is a mapping of strings to Wikipedia concepts and vice versa
-\cite{spitkovsky2012cross}.
+The GCLD resource \cite{spitkovsky2012cross} provides two probabilities: (1) the probability with which a string is used as anchor text for a given Wikipedia entity,
+%thus distributing the probability mass over the different entities that it is used as anchor text;
+and (2) a probability indicating the strength of co-reference of an anchor relative to the other anchors of that entity. We use the product of the two probabilities for each string (both are sketched in the formulas below).
 
 \paragraph*{jac} Jaccard similarity between the document and the entity's Wikipedia page
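+In symbols (notation ours; a sketch of the two estimates rather than the exact formulation of \cite{spitkovsky2012cross}): for a string $s$ and a Wikipedia entity $e$,
+\[
+\mathrm{gcld}(s,e) \;=\; P(e \mid s) \cdot P(s \mid e),
+\]
+and, for the jac feature, assuming bag-of-words term sets $T(d)$ and $T(W_e)$ of a document $d$ and the entity's Wikipedia page $W_e$,
+\[
+\mathrm{jac}(d,e) \;=\; \frac{|T(d) \cap T(W_e)|}{|T(d) \cup T(W_e)|}.
+\]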
@@ -941,49 +943,49 @@ In vital-relevant category (Table \ref{tab:class-vital-relevant}), the performan
 We conducted experiments to study the impacts on recall of different components of the filtering stage of entity-based filtering and ranking pipeline. Specifically we conducted experiments to study the impacts of cleansing,
-entity profiles, relevance ratings, categories of documents, entity
-profiles. We also measured impact of the different factors and
+entity profiles, relevance ratings, and categories of
+documents. We also measured the impact of the different factors and
 choices on later stages of the pipeline of our own system.
 
-Experimental results show that cleansing can remove entire or parts of
-the content of documents making them difficult to retrieve. These
-documents can, otherwise, be retrieved from the raw version. The use
-of the raw corpus brings in documents that can not be retrieved from
-the cleansed corpus. This is true for all entity profiles and for all
-entity types. The recall difference between the cleansed and raw
-ranges from 6.8\% t 26.2\%. These increases, in actual
-document-entity pairs, is in thousands. We believe this is a
-substantial increase. However, the recall increases do not always
-translate to improved F-score in overall performance. In the vital
-relevance ranking for both Wikipedia and aggregate entities, the
-cleansed version performs better than the raw version. In Twitter
-entities, the raw corpus achieves better except in the case of all
-name-variant, though the difference is negligible. However, for
-vital-relevant, the raw corpus performs better across all entity
-profiles and entity types except in partial canonical names of
-Wikipedia entities.
-
-The use of different profiles also shows a big difference in
-recall. While in Wikipedia the use of canonical
-partial achieves better than name-variant, there is a steady increase
-in recall from canonical to canonical partial, to name-variant, and
-to name-variant partial. This pattern is also observed across the
-document categories. However, here too, the relationship between
-the gain in recall as we move from less richer profile to a more
-richer profile and overall performance as measured by F-score is not
-linear.
-
+Experimental results show that cleansing can remove all or part of
+the content of a document, making it difficult to retrieve; such
+documents can, however, still be retrieved from the raw version. The
+use of the raw corpus brings in documents that cannot be retrieved
+from the cleansed corpus. This holds for all entity profiles and all
+entity types. The recall difference between the cleansed and raw
+corpus ranges from 6.8\% to 26.2\%. In actual document-entity pairs,
+these increases amount to thousands of pairs, which we consider a
+substantial gain. However, the recall increases do not always
+translate into an improved overall F-score. In the vital relevance
+ranking for both Wikipedia and aggregate entities, the cleansed
+version performs better than the raw version. For Twitter entities,
+the raw corpus performs better except in the case of all
+name-variant, though the difference is negligible. For
+vital-relevant, however, the raw corpus performs better across all
+entity profiles and entity types except for partial canonical names
+of Wikipedia entities.
+
+The choice of profile also makes a big difference in recall. While
+for Wikipedia entities canonical partial performs better than
+name-variant, there is a steady increase in recall from canonical to
+canonical partial, to name-variant, and to name-variant partial.
+This pattern also holds across the document categories. Here too,
+however, the relationship between the recall gained by moving from a
+leaner profile to a richer one and overall performance as measured by
+F-score is not linear.
+
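+To make the four profiles concrete, the following sketch shows one way they can be built and applied (function and variable names are ours for illustration, not those of our actual system; we assume here that partial names are the whitespace-separated tokens of full names):
+\begin{verbatim}
+# Sketch only: build the four entity profiles and filter documents
+# by exact string containment.
+def build_profiles(canonical, name_variants):
+    def partials(names):
+        # Partial names: whitespace-separated tokens of each name.
+        return {tok for name in names for tok in name.split()}
+    return {
+        "canonical": {canonical},
+        "canonical_partial": {canonical} | partials([canonical]),
+        "name_variant": {canonical} | set(name_variants),
+        "name_variant_partial": {canonical} | set(name_variants)
+                                | partials([canonical] + name_variants),
+    }
+
+def matches(document_text, profile):
+    # A document is retrieved if any profile string occurs in it.
+    text = document_text.lower()
+    return any(name.lower() in text for name in profile)
+\end{verbatim}
+On this reading, each richer profile is a superset of the leaner one, which is consistent with the monotone increase in recall reported above.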
 %%%%%%%%%%%%
-In vital ranking, across all entity profiles and types of corpus,
-Wikipedia's canonical partial achieves better performance than any
-other Wikipedia entity profiles. In vital-relevant documents too,
-Wikipedia's canonical partial achieves the best result. In the raw
-corpus, it achieves a little less than name-variant partial. For
-Twitter entities, the name-variant partial profile achieves the
-highest F-score across all entity profiles and types of corpus.
+In vital ranking, across all entity profiles and corpus types,
+Wikipedia's canonical partial achieves better performance than any
+other Wikipedia entity profile. For vital-relevant documents too,
+Wikipedia's canonical partial achieves the best result; on the raw
+corpus it scores only slightly below name-variant partial. For
+Twitter entities, the name-variant partial profile achieves the
+highest F-score across all entity profiles and corpus types.
 
 There are 3 interesting observations:
 
@@ -1007,32 +1009,32 @@ transformation and cleasing processes.
 %%%% NEEDS WORK:
 Taking both performance (recall at filtering and overall F-score
-during evaluation) into account, there is a clear trade-off between
-using a richer entity-profile and retrieval of irrelevant
-documents. The richer the profile, the more relevant documents it
-retrieves, but also the more irrelevant documents. To put it into
-perspective, lets compare the number of documents that are retrieved
-with canonical partial and with name-variant partial. Using the raw
-corpus, the former retrieves a total of 2547487 documents and achieves
-a recall of 72.2\%. By contrast, the later retrieves a total of
-4735318 documents and achieves a recall of 90.2\%. The total number of
-documents extracted increases by 85.9\% for a recall gain of 18\%. The
-rest of the documents, that is 67.9\%, are newly introduced irrelevant
-documents.
-
-Perhaps surprising, Wikipedia's canonical partial is the best entity profile for Wikipedia
-entities. Here, the retrieval of
-thousands vital-relevant document-entity pairs by name-variant partial
-does not materialize into an increase in over all performance. Notice
-that none of the participants in TREC KBA considered canonical partial
-as a viable strategy though. We conclude that, at least for our
-system, the remainder of the pipeline needs a different approach to
-handle the correct scoring of the additional documents -- that are
-necessary if we do not want to accept a low recall of the filtering
-step.
-%With this understanding, there is actually no
-%need to go and fetch different names variants from DBpedia, a saving
-%of time and computational resources.
+during evaluation) into account, there is a clear trade-off between
+using a richer entity profile and retrieving irrelevant
+documents. The richer the profile, the more relevant documents it
+retrieves, but also the more irrelevant ones. To put this into
+perspective, let us compare the numbers of documents retrieved with
+canonical partial and with name-variant partial. Using the raw
+corpus, the former retrieves a total of 2547487 documents and
+achieves a recall of 72.2\%. By contrast, the latter retrieves a
+total of 4735318 documents and achieves a recall of 90.2\%. The total
+number of documents extracted thus increases by 85.9\% for a recall
+gain of 18 percentage points; the remaining 67.9\% are newly
+introduced irrelevant documents.
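+To make the arithmetic explicit (our restatement of the figures above):
+\[
+\frac{4735318 - 2547487}{2547487} \approx 0.859,
+\]
+an 85.9\% increase in extracted documents against a recall gain of $90.2\% - 72.2\% = 18$ percentage points; the remaining $85.9\% - 18\% = 67.9\%$ is the share we count as newly introduced irrelevant documents.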
+
+Perhaps surprisingly, Wikipedia's canonical partial is the best
+entity profile for Wikipedia entities. Here, the retrieval of
+thousands of vital-relevant document-entity pairs by name-variant
+partial does not materialize into an increase in overall
+performance. Notice, though, that none of the participants in TREC
+KBA considered canonical partial a viable strategy. We conclude
+that, at least for our system, the remainder of the pipeline needs a
+different approach to correctly score the additional documents --
+documents that are necessary if we do not want to accept a low
+recall in the filtering step.
+%With this understanding, there is actually no
+%need to go and fetch different name variants from DBpedia, a saving
+%of time and computational resources.
 %%%%%%%%%%%%