diff --git a/mypaper-final.tex b/mypaper-final.tex
index 3ae5c2d8eea2a98c1910ded9789b9866e529c80b..6b5a6254620784570a75dfe24eca0a7e8e124e0d 100644
--- a/mypaper-final.tex
+++ b/mypaper-final.tex
@@ -205,7 +205,11 @@ compromising precision, describing and classifying relevant documents that are
 not amenable to filtering , and estimating the upper-bound of recall
 on entity-based filtering.
 
-The rest of the paper is organized as follows. Section \ref{sec:desc} describes the dataset and section \ref{sec:fil} defines the task. In section \ref{sec:lit}, we discuss related litrature folowed by a discussion of our method in \ref{sec:mthd}. Following that, we present the experimental resulsy in \ref{sec:expr}, and discuss and analyze them in \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in section \ref{sec:impact}, examine and categorize unfilterable docuemnts in section \ref{sec:unfil}. Finally, we present our conclusions in \ref{sec:conc}.
+The rest of the paper is organized as follows. Section \ref{sec:desc} describes the dataset and Section \ref{sec:fil} defines the task. In Section \ref{sec:lit}, we discuss related literature, followed by a description of our method in Section \ref{sec:mthd}. We then present the experimental results in Section \ref{sec:expr} and analyze them in Section \ref{sec:analysis}. Towards the end, we discuss the impact of filtering choices on classification in Section \ref{sec:impact} and examine and categorize the missing vital-relevant documents in Section \ref{sec:unfil}. Finally, we present our conclusions in Section \ref{sec:conc}.
 
 \section{Data Description}\label{sec:desc}
 
@@ -934,25 +938,52 @@ In vital-relevant category (Table \ref{tab:class-vital-relevant}), the performan
 
 \section{Analysis and Discussion}\label{sec:analysis}
 
-We conducted experiments to study the impacts on recall of 
+We conducted experiments to study the impacts on recall of
 different components of the filtering stage of entity-based filtering
 and ranking pipeline. Specifically we conducted experiments to study the impacts of cleansing,
-entity profiles, relevance ratings, categories of documents, entity profiles. We also measured impact of the different factors and choices on later stages of the pipeline.
-
-Experimental results show that cleansing can remove entire or parts of the content of documents making them difficult to retrieve. These documents can, otherwise, be retrieved from the raw version. The use of the raw corpus brings in documents that can not be retrieved from the cleansed corpus. This is true for all entity profiles and for all entity types. The recall difference between the cleansed and raw ranges from 6.8\% t 26.2\%. These increases, in actual document-entity pairs, is in thousands. We believe this is a substantial increase. However, the recall increases do not always translate to improved F-score in overall performance. In the vital relevance ranking for both Wikipedia and aggregate entities, the cleansed version performs better than the raw version. In Twitter entities, the raw corpus achieves better except in the case of all name-variant, though the difference is negligible. However, for vital-relevant, the raw corpus performs better across all entity profiles and entity types
-except in partial canonical names of Wikipedia entities.
-
-The use of different profiles also shows a big difference in recall. Except in the case of Wikipedia where the use of canonical partial achieves better than name-variant, there is a steady increase in recall from canonical to canonical partial, to name-variant, and to name-variant partial. This pattern is also observed across the document categories. However, here too, the relationship between the gain in recall as we move from less richer profile to a more richer profile and overall performance as measured by F-score is not linear.
-
-
-%%%%% MOVED FROM LATER ON - CHECK FLOW
-
-There is a trade-off between using a richer entity-profile and retrieval of irrelevant documents. The richer the profile, the more relevant documents it retrieves, but also the more irrelevant documents. To put it into perspective, lets compare the number of documents that are retrieved with canonical partial and with name-variant partial. Using the raw corpus, the former retrieves a total of 2547487 documents and achieves a recall of 72.2\%. By contrast, the later retrieves a total of 4735318 documents and achieves a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18\%. The rest of the documents, that is 67.9\%, are newly introduced irrelevant documents.
+entity profiles, relevance ratings, and categories of documents. We
+also measured the impact of these factors and choices on the later
+stages of our own pipeline.
+
+Experimental results show that cleansing can remove all or part of
+the content of documents, making them difficult to retrieve. These
+documents can, however, be retrieved from the raw version. The use
+of the raw corpus brings in documents that cannot be retrieved from
+the cleansed corpus. This is true for all entity profiles and for all
+entity types. The recall difference between the cleansed and raw
+corpora ranges from 6.8\% to 26.2\%. These increases, in actual
+document-entity pairs, are in the thousands, which we consider a
+substantial gain. However, the recall increases do not always
+translate into an improved overall F-score. In the vital relevance
+ranking, for both Wikipedia and aggregate entities, the cleansed
+version performs better than the raw version. For Twitter entities,
+the raw corpus performs better except in the case of all
+name-variant, though the difference is negligible. However, for
+vital-relevant ranking, the raw corpus performs better across all
+entity profiles and entity types, except for partial canonical names
+of Wikipedia entities.
+
+The use of different profiles also shows a big difference in
+recall. While for Wikipedia entities canonical partial achieves
+higher recall than name-variant, there is otherwise a steady increase
+in recall from canonical, to canonical partial, to name-variant, and
+to name-variant partial. This pattern is also observed across the
+document categories. However, here too, the relationship between the
+gain in recall as we move from a less rich profile to a richer one
+and the overall performance as measured by F-score is not linear.
+
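+This is to be expected: the F-score is the harmonic mean of precision
+$P$ and recall $R$,
+\[
+F = \frac{2PR}{P + R},
+\]
+so a gain in recall that comes at the cost of precision does not
+necessarily improve the overall score.
+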
 %%%%%%%%%%%%
-In vital ranking, across all entity profiles and types of corpus, Wikipedia's canonical partial achieves better performance than any other Wikipedia entity profiles. In vital-relevant documents too, Wikipedia's canonical partial achieves the best result. In the raw corpus, it achieves a little less than name-variant partial. For Twitter entities, the name-variant partial profile achieves the highest F-score across all entity profiles and types of corpus.
+In vital ranking, across all entity profiles and types of corpus,
+Wikipedia's canonical partial achieves better performance than any
+other Wikipedia entity profile. For vital-relevant documents too,
+Wikipedia's canonical partial achieves the best result; in the raw
+corpus, it scores only slightly below name-variant partial. For
+Twitter entities, the name-variant partial profile achieves the
+highest F-score across all entity profiles and types of corpus.
 
 There are 3 interesting observations:
 
@@ -970,15 +1001,38 @@ image with a text of an entity name which was actually relevant. And that it
 removes social documents can be explained by the fact that most of
 the missing of the missing docuemnts from cleansed are social. And
 all the docuemnts that are missing from raw corpus
-social. So in both cases socuial seem to suffer from text
+are social. So in both cases, social documents seem to suffer from text
 transformation and cleasing processes.
 
-%%%% NEEDS WORK:
-2) Taking both performance (recall at filtering and overall F-score
-during evaluation) into account, there is a clear trade-off between using a richer entity-profile and retrieval of irrelevant documents. The richer the profile, the more relevant documents it retrieves, but also the more irrelevant documents. To put it into perspective, lets compare the number of documents that are retrieved with canonical partial and with name-variant partial. Using the raw corpus, the former retrieves a total of 2547487 documents and achieves a recall of 72.2\%. By contrast, the later retrieves a total of 4735318 documents and achieves a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18\%. The rest of the documents, that is 67.9\%, are newly introduced irrelevant documents.
-
-Wikipedia's canonical partial is the best entity profile for Wikipedia entities. This is interesting to see that the retrieval of of thousands vital-relevant document-entity pairs by name-variant partial does not translate to an increase in over all performance. It is even more interesting since canonical partial was not considered as contending profile for stream filtering by any of participant to the best of our knowledge. With this understanding, there is actually no need to go and fetch different names variants from DBpedia, a saving of time and computational resources.
+Taking both performance measures (recall at filtering and overall
+F-score during evaluation) into account, there is a clear trade-off
+between using a richer entity profile and retrieving irrelevant
+documents. The richer the profile, the more relevant documents it
+retrieves, but also the more irrelevant ones. To put this into
+perspective, let us compare the number of documents retrieved with
+canonical partial and with name-variant partial. Using the raw
+corpus, the former retrieves a total of 2547487 documents and
+achieves a recall of 72.2\%. By contrast, the latter retrieves a total
+of 4735318 documents and achieves a recall of 90.2\%. The total number
+of documents extracted thus increases by 85.9\% for a recall gain of
+18\%. The rest of the documents, that is 67.9\%, are newly introduced
+irrelevant documents.
+
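+Spelled out, the increase in retrieved documents and the recall gain
+follow directly from these counts:
+\[
+\frac{4735318 - 2547487}{2547487} \approx 85.9\%,
+\qquad
+90.2\% - 72.2\% = 18\%.
+\]
+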
+Perhaps surprisingly, Wikipedia's canonical partial is the best
+entity profile for Wikipedia entities. Here, the retrieval of
+thousands of additional vital-relevant document-entity pairs by
+name-variant partial does not translate into an increase in overall
+performance. Notice that, to the best of our knowledge, none of the
+participants in TREC KBA considered canonical partial as a viable
+strategy. We conclude that, at least for our system, the remainder of
+the pipeline needs a different approach to correctly score the
+additional documents that are necessary if we do not want to accept a
+low recall at the filtering step.
+%With this understanding, there is actually no
+%need to go and fetch different names variants from DBpedia, a saving
+%of time and computational resources.
 
 %%%%%%%%%%%%
 
@@ -998,9 +1052,9 @@ Across document categories, we observe a pattern in recall of others, followed b
 and name-variants bring in new relevant documents that can not be
 retrieved by canonicals. The rest of the two deltas are very small,
 suggesting that partial names of name variants do not bring in new relevant documents.
 
-\section{Unfilterable documents}\label{sec:unfil}
+%\section{Unfilterable documents}\label{sec:unfil}
 
-\subsection{Missing vital-relevant documents \label{miss}}
+\section{Missing vital-relevant documents}\label{sec:unfil}
 
 %