Changeset - a07620ff7ab1
[Not reviewed]
0 1 0
Arjen de Vries (arjen) - 11 years ago 2014-06-12 06:03:38
arjen.de.vries@cwi.nl
more in discussion
1 file changed with 25 insertions and 27 deletions:
0 comments (0 inline, 0 general)
mypaper-final.tex
 
@@ -951,34 +951,34 @@ documents can, otherwise, be retrieved from the raw version. The use
 
of the raw corpus brings in documents that can not be retrieved from
 
the cleansed corpus. This is true for all entity profiles and for all
 
entity types. The  recall difference between the cleansed and raw
 
ranges from  6.8\% t 26.2\%. These increases, in actual
 
ranges from  6.8\% to 26.2\%. These increases, in actual
 
document-entity pairs, are in the thousands. We believe this is a
 
substantial increase. However, the recall increases do not always
 
translate to improved F-score in overall performance.  In the vital
 
translate to improved max-F on the overall system performance.  In the vital
 
relevance ranking for both Wikipedia and aggregate entities, the
 
cleansed version performs better than the raw version.  In Twitter
 
entities, the raw corpus achieves better except in the case of all
 
name-variant, though the difference is negligible.  However, for
 
name-variant, though the difference is negligible. However, for
 
vital-relevant, the raw corpus performs  better across all entity
 
profiles and entity types except in partial canonical names of
 
profiles and entity types except for the case of partial canonical names of
 
Wikipedia entities.
 

	
 
The use of different profiles also shows a big difference in
 
recall. While in Wikipedia the use of canonical
 
partial achieves better than name-variant, there is a steady increase
 
in recall from canonical to canonical partial, to name-variant, and
 
to name-variant partial. This pattern is also observed across the
 
document categories.  However, here too, the relationship between
 
the gain in recall as we move from less richer profile to a more
 
richer profile and overall performance as measured by F-score  is not
 
linear.
 
The use of different entity profiles can have a large effect on
 
recall. While in the case of Wikipedia entities the use of canonical
 
partial achieves better recall than using name-variants, there seems a
 
steady increase in recall from canonical to canonical partial, to
 
name-variant, and to name-variant partial, a pattern that is observed
 
across the document categories.  However, here too, the relationship between
 
the gain in recall as we move from less richer profile to a
 
richer profile and the overall CCR performance as measured by max-F is
 
not simply positive.
 

	
 
 
%%%%%%%%%%%%
 
 
 
In vital ranking, across all entity profiles and types of corpus,
 
Wikipedia's canonical partial  achieves better performance than any
 
Wikipedia's canonical partial representation achieves better performance than any
 
other Wikipedia entity profiles. In vital-relevant documents too,
 
Wikipedia's canonical partial achieves the best result. In the raw
 
corpus, it achieves a little less than name-variant partial. For
 
@@ -986,22 +986,20 @@ Twitter entities, the name-variant partial profile achieves the
 
highest F-score across all entity profiles and types of corpus.
 
 
 
There are 3 interesting observations: 
 
 
1) cleansing impacts Twitter
 
entities and relevant documents.  This  is validated by the
 
observation that recall  gains in Twitter entities and the relevant
 
Cleansing impacts Twitter
 
entities and relevant documents.  This is validated by the
 
observation that recall gains in Twitter entities and the relevant
 
categories in the raw corpus also translate into overall performance
 
gains. This observation implies that cleansing removes relevant and
 
social documents than it does vital and news. That it removes relevant
 
documents more than vital can be explained by the fact that cleansing
 
gains. This observation implies that cleansing removes more relevant and
 
social documents than it does vital and news, which may be
 
explained by the fact that cleansing
 
removes the related links and adverts which may contain a mention of
 
the entities. One example we saw was that the cleansing removed an
 
image with a text of an entity name which was actually relevant. And
 
that it removes social documents can be explained by the fact that
 
most of the missing of the missing  docuemnts from cleansed are
 
social. And all the docuemnts that are missing from raw corpus
 
social. So in both cases social seem to suffer from text
 
image with a text of an entity name which was actually relevant. The
 
removal of predominantly social documents can be explained by the fact that
 
all of the missing documents from the raw corpus and the majority of
 
the missing documents from the cleansed corpus belong to 
 
the social category. In both cases, especially the social channel seems to suffer from text
 
transformation and cleansing processes. 
 
 
%%%% NEEDS WORK: