diff --git a/mypaper-final.tex b/mypaper-final.tex index 46d4f173a323cba3dd16cef14498faef72475eef..50ec63fba60cc61b284fdac6e6fdaf6e0166d24e 100644 --- a/mypaper-final.tex +++ b/mypaper-final.tex @@ -940,25 +940,52 @@ In vital-relevant category (Table \ref{tab:class-vital-relevant}), the performan \section{Analysis and Discussion}\label{sec:analysis} -We conducted experiments to study the impacts on recall of +We conducted experiments to study the impacts on recall of different components of the filtering stage of the entity-based filtering and ranking pipeline. Specifically, we conducted experiments to study the impacts of cleansing, -entity profiles, relevance ratings, categories of documents, entity profiles. We also measured impact of the different factors and choices on later stages of the pipeline. - -Experimental results show that cleansing can remove entire or parts of the content of documents making them difficult to retrieve. These documents can, otherwise, be retrieved from the raw version. The use of the raw corpus brings in documents that can not be retrieved from the cleansed corpus. This is true for all entity profiles and for all entity types. The recall difference between the cleansed and raw ranges from 6.8\% t 26.2\%. These increases, in actual document-entity pairs, is in thousands. We believe this is a substantial increase. However, the recall increases do not always translate to improved F-score in overall performance. In the vital relevance ranking for both Wikipedia and aggregate entities, the cleansed version performs better than the raw version. In Twitter entities, the raw corpus achieves better except in the case of all name-variant, though the difference is negligible. However, for vital-relevant, the raw corpus performs better across all entity profiles and entity types -except in partial canonical names of Wikipedia entities. - -The use of different profiles also shows a big difference in recall. 
Except in the case of Wikipedia where the use of canonical partial achieves better than name-variant, there is a steady increase in recall from canonical to canonical partial, to name-variant, and to name-variant partial. This pattern is also observed across the document categories. However, here too, the relationship between the gain in recall as we move from less richer profile to a more richer profile and overall performance as measured by F-score is not linear. - - -%%%%% MOVED FROM LATER ON - CHECK FLOW - -There is a trade-off between using a richer entity-profile and retrieval of irrelevant documents. The richer the profile, the more relevant documents it retrieves, but also the more irrelevant documents. To put it into perspective, lets compare the number of documents that are retrieved with canonical partial and with name-variant partial. Using the raw corpus, the former retrieves a total of 2547487 documents and achieves a recall of 72.2\%. By contrast, the later retrieves a total of 4735318 documents and achieves a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18\%. The rest of the documents, that is 67.9\%, are newly introduced irrelevant documents. +entity profiles, relevance ratings, and categories of +documents. We also measured the impact of the different factors and +choices on the later stages of our own pipeline. + +Experimental results show that cleansing can remove all or parts of +the content of documents, making them difficult to retrieve. These +documents can, however, still be retrieved from the raw version. The use +of the raw corpus brings in documents that cannot be retrieved from +the cleansed corpus. This is true for all entity profiles and for all +entity types. The recall difference between the cleansed and raw corpora +ranges from 6.8\% to 26.2\%. These increases, in actual +document-entity pairs, number in the thousands. We believe this is a +substantial increase. 
However, the recall increases do not always +translate into improved overall F-score. In the vital +relevance ranking for both Wikipedia and aggregate entities, the +cleansed version performs better than the raw version. For Twitter +entities, the raw corpus performs better except in the case of all +name-variant, though the difference is negligible. However, for +vital-relevant, the raw corpus performs better across all entity +profiles and entity types except in partial canonical names of +Wikipedia entities. + +The use of different profiles also shows a large difference in +recall. While for Wikipedia entities canonical +partial performs better than name-variant, there is otherwise a steady increase +in recall from canonical to canonical partial, to name-variant, and +to name-variant partial. This pattern is also observed across the +document categories. However, here too, the relationship between +the gain in recall as we move from a less rich profile to a +richer one and the overall performance as measured by F-score is not +linear. + %%%%%%%%%%%% -In vital ranking, across all entity profiles and types of corpus, Wikipedia's canonical partial achieves better performance than any other Wikipedia entity profiles. In vital-relevant documents too, Wikipedia's canonical partial achieves the best result. In the raw corpus, it achieves a little less than name-variant partial. For Twitter entities, the name-variant partial profile achieves the highest F-score across all entity profiles and types of corpus. +In vital ranking, across all entity profiles and types of corpus, +Wikipedia's canonical partial achieves better performance than any +other Wikipedia entity profile. In vital-relevant documents too, +Wikipedia's canonical partial achieves the best result. In the raw +corpus, it scores a little below name-variant partial. 
For +Twitter entities, the name-variant partial profile achieves the +highest F-score across all entity profiles and types of corpus. Cleansing impacts Twitter @@ -974,23 +1001,38 @@ image with a text of an entity name which was actually relevant. And that it removes social documents can be explained by the fact that most of the missing documents from the cleansed corpus are social. And all the documents that are missing from the raw corpus are -social. So in both cases socuial seem to suffer from text +social. So in both cases, social documents seem to suffer from the text transformation and cleansing processes. %%%% NEEDS WORK: Taking both performance (recall at filtering and overall F-score -during evaluation) into account, there is a clear trade-off between using a richer entity-profile and retrieval of irrelevant documents. The richer the profile, the more relevant documents it retrieves, but also the more irrelevant documents. To put it into perspective, lets compare the number of documents that are retrieved with canonical partial and with name-variant partial. Using the raw corpus, the former retrieves a total of 2547487 documents and achieves a recall of 72.2\%. By contrast, the later retrieves a total of 4735318 documents and achieves a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18\%. The rest of the documents, that is 67.9\%, are newly introduced irrelevant documents. - -Wikipedia's canonical partial is the best entity profile for Wikipedia -entities. This is interesting to see that the retrieval of of +during evaluation) into account, there is a clear trade-off between +using a richer entity-profile and retrieving irrelevant +documents. The richer the profile, the more relevant documents it +retrieves, but also the more irrelevant documents. To put it into +perspective, let us compare the number of documents that are retrieved +with canonical partial and with name-variant partial. 
Using the raw +corpus, the former retrieves a total of 2547487 documents and achieves +a recall of 72.2\%. By contrast, the latter retrieves a total of +4735318 documents and achieves a recall of 90.2\%. The total number of +documents extracted increases by 85.9\% for a recall gain of 18\%. The +rest of the documents, that is 67.9\%, are newly introduced irrelevant +documents. + +Perhaps surprisingly, Wikipedia's canonical partial is the best entity profile for Wikipedia +entities. Here, the retrieval of thousands of vital-relevant document-entity pairs by name-variant partial does not materialize into an increase in overall performance. Note, however, +that none of the participants in TREC KBA considered canonical partial +as a viable strategy. We conclude that, at least for our +system, the remainder of the pipeline needs a different approach to +handle the correct scoring of the additional documents, which are +necessary if we do not want to accept a low recall in the filtering +step. +%With this understanding, there is actually no +%need to go and fetch different names variants from DBpedia, a saving +%of time and computational resources. %%%%%%%%%%%%
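As a quick sanity check on the trade-off figures quoted above, the relative growth in retrieved documents and the recall gain can be recomputed from the reported counts. This is a sketch only; the counts (2547487 and 4735318 documents, 72.2\% and 90.2\% recall) are taken directly from the paragraph above, and the symbol \(\Delta R\) for the recall gain is introduced here for illustration:

```latex
% Relative increase in retrieved documents when moving from the
% canonical-partial profile to the name-variant-partial profile
% (raw corpus), using the counts reported in the text:
%   4735318 - 2547487 = 2187831 additional documents.
\[
  \frac{4\,735\,318 - 2\,547\,487}{2\,547\,487} \approx 0.859,
  \qquad
  \Delta R = 90.2\% - 72.2\% = 18\%.
\]
% An 85.9% growth in extracted documents buys an 18-point recall gain,
% matching the figures stated in the paragraph.
```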