diff --git a/mypaper-final.tex b/mypaper-final.tex
index c4ebf2a9fe4dfe3074dd0657ba457406fcab3742..85e822dd90f168c4f9f7392195108964715da90a 100644
--- a/mypaper-final.tex
+++ b/mypaper-final.tex
@@ -951,34 +951,34 @@
 documents can, otherwise, be retrieved from the raw version. The use
 of the raw corpus brings in documents that can not be retrieved from
 the cleansed corpus. This is true for all entity profiles and for all
 entity types. The recall difference between the cleansed and raw
-ranges from 6.8\% t 26.2\%. These increases, in actual
+ranges from 6.8\% to 26.2\%. These increases, in actual
 document-entity pairs, is in thousands. We believe this is a
 substantial increase. However, the recall increases do not always
-translate to improved F-score in overall performance. In the vital
+translate to improved max-F on the overall system performance. In the vital
 relevance ranking for both Wikipedia and aggregate entities, the
 cleansed version performs better than the raw version. In Twitter
 entities, the raw corpus achieves better except in the case of all
-name-variant, though the difference is negligible. However, for
+name-variant, though the difference is negligible. However, for
 vital-relevant, the raw corpus performs better across all entity
-profiles and entity types except in partial canonical names of
+profiles and entity types except for the case of partial canonical names of
 Wikipedia entities.

-The use of different profiles also shows a big difference in
-recall. While in Wikipedia the use of canonical
-partial achieves better than name-variant, there is a steady increase
-in recall from canonical to canonical partial, to name-variant, and
-to name-variant partial. This pattern is also observed across the
-document categories. However, here too, the relationship between
-the gain in recall as we move from less richer profile to a more
-richer profile and overall performance as measured by F-score is not
-linear.
+The use of different entity profiles can have a large effect on
+recall. While in the case of Wikipedia entities the use of canonical
+partial achieves better recall than using name-variants, there seems a
+steady increase in recall from canonical to canonical partial, to
+name-variant, and to name-variant partial, a pattern that is observed
+across the document categories. However, here too, the relationship between
+the gain in recall as we move from less richer profile to a
+richer profile and the overall CCR performance as measured by max-F is
+not simply positive.

 %%%%%%%%%%%%

 In vital ranking, across all entity profiles and types of corpus,
-Wikipedia's canonical partial achieves better performance than any
+Wikipedia's canonical partial representation achieves better performance than any
 other Wikipedia entity profiles. In vital-relevant documents too,
 Wikipedia's canonical partial achieves the best result. In the raw
 corpus, it achieves a little less than name-variant partial. For
@@ -986,22 +986,20 @@
 Twitter entities, the name-variant partial profile achieves the
 highest F-score across all entity profiles and types of corpus.

-There are 3 interesting observations:
-
-1) cleansing impacts Twitter
-entities and relevant documents. This is validated by the
-observation that recall gains in Twitter entities and the relevant
+Cleansing impacts Twitter
+entities and relevant documents. This is validated by the
+observation that recall gains in Twitter entities and the relevant
 categories in the raw corpus also translate into overall performance
-gains. This observation implies that cleansing removes relevant and
-social documents than it does vital and news. That it removes relevant
-documents more than vital can be explained by the fact that cleansing
+gains. This observation implies that cleansing removes more relevant and
+social documents than it does vital and news, which may be
+explained by the fact that cleansing
 removes the related links and adverts which may contain a mention of
 the entities. One example we saw was the the cleansing removed an
-image with a text of an entity name which was actually relevant. And
-that it removes social documents can be explained by the fact that
-most of the missing of the missing docuemnts from cleansed are
-social. And all the docuemnts that are missing from raw corpus
-social. So in both cases social seem to suffer from text
+image with a text of an entity name which was actually relevant. The
+removal of predominantly social documents can be explained by the fact that
+all of the missing documents from the raw corpus and the majority of
+the missing documents from the cleansed corpus belong to
+the social category. In both cases, especially the social channel seems to suffer from text
 transformation and cleasing processes.

 %%%% NEEDS WORK: