Changeset - 60d8e26cf4e5
Arjen de Vries (arjen) - 2014-06-12 04:59:56
arjen.de.vries@cwi.nl
analysis improvements
1 file changed with 66 insertions and 24 deletions:
mypaper-final.tex
 
@@ -940,25 +940,52 @@ In vital-relevant category (Table \ref{tab:class-vital-relevant}), the performan
 
\section{Analysis and Discussion}\label{sec:analysis}
 
 
 
We conducted experiments to study  the impacts on recall of 
 
We conducted experiments to study the impacts on recall of 
 
different components of the filtering stage of the entity-based filtering and ranking pipeline. Specifically, 
 
we conducted experiments to study the impacts of cleansing, 
 
entity profiles, relevance ratings, and categories of documents. We also measured the impact of the different factors and choices on later stages of the pipeline. 
 
 
Experimental results show that cleansing can remove all or part of the content of documents, making them difficult to retrieve. These documents can, otherwise, be retrieved from the raw version. The use of the raw corpus brings in documents that cannot be retrieved from the cleansed corpus. This is true for all entity profiles and for all entity types. The recall difference between the cleansed and raw corpora ranges from 6.8\% to 26.2\%. In actual document-entity pairs, these increases number in the thousands. We believe this is a substantial increase. However, the recall increases do not always translate to an improved F-score in overall performance. In the vital relevance ranking for both Wikipedia and aggregate entities, the cleansed version performs better than the raw version. For Twitter entities, the raw corpus performs better except in the case of all name-variant, though the difference is negligible. However, for vital-relevant, the raw corpus performs better across all entity profiles and entity types 
 
except in partial canonical names of Wikipedia entities. 
 
 
The use of different profiles also shows a big difference in recall. Except in the case of Wikipedia, where canonical partial achieves better recall than name-variant, there is a steady increase in recall from canonical to canonical partial, to name-variant, and to name-variant partial. This pattern is also observed across the document categories. However, here too, the relationship between the gain in recall as we move from a less rich profile to a richer profile and overall performance as measured by F-score is not linear. 
 
 
 
%%%%% MOVED FROM LATER ON - CHECK FLOW
 
 
There is a trade-off between using a richer entity profile and retrieval of irrelevant documents. The richer the profile, the more relevant documents it retrieves, but also the more irrelevant documents. To put it into perspective, let us compare the number of documents that are retrieved with canonical partial and with name-variant partial. Using the raw corpus, the former retrieves a total of 2547487 documents and achieves a recall of 72.2\%. By contrast, the latter retrieves a total of 4735318 documents and achieves a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18\%. The rest of the documents, that is 67.9\%, are newly introduced irrelevant documents. 
 
entity profiles, relevance ratings, and categories of documents. We

also measured the impact of the different factors and

choices on later stages of the pipeline of our own system. 
 
 
Experimental results show that cleansing can remove all or part of

the content of documents, making them difficult to retrieve. These

documents can, otherwise, be retrieved from the raw version. The use

of the raw corpus brings in documents that cannot be retrieved from

the cleansed corpus. This is true for all entity profiles and for all

entity types. The recall difference between the cleansed and raw

corpora ranges from 6.8\% to 26.2\%. In actual document-entity pairs,

these increases number in the thousands. We believe this is a

substantial increase. However, the recall increases do not always

translate to an improved F-score in overall performance. In the vital

relevance ranking for both Wikipedia and aggregate entities, the

cleansed version performs better than the raw version. For Twitter

entities, the raw corpus performs better except in the case of all

name-variant, though the difference is negligible. However, for

vital-relevant, the raw corpus performs better across all entity

profiles and entity types except in partial canonical names of

Wikipedia entities.
 

	
 
The use of different profiles also shows a big difference in

recall. While in Wikipedia the use of canonical

partial achieves better recall than name-variant, there is otherwise a steady increase

in recall from canonical to canonical partial, to name-variant, and

to name-variant partial. This pattern is also observed across the

document categories. However, here too, the relationship between

the gain in recall as we move from a less rich profile to a

richer profile and overall performance as measured by F-score is not

linear.
 

	
 
 
%%%%%%%%%%%%
 
 
 
In vital ranking, across all entity profiles and types of corpus, Wikipedia's canonical partial achieves better performance than any other Wikipedia entity profile. In vital-relevant documents too, Wikipedia's canonical partial achieves the best result, although in the raw corpus it achieves a little less than name-variant partial. For Twitter entities, the name-variant partial profile achieves the highest F-score across all entity profiles and types of corpus.  
 
In vital ranking, across all entity profiles and types of corpus,
 
Wikipedia's canonical partial achieves better performance than any

other Wikipedia entity profile. In vital-relevant documents too,

Wikipedia's canonical partial achieves the best result, although in the raw

corpus it achieves a little less than name-variant partial. For
 
Twitter entities, the name-variant partial profile achieves the
 
highest F-score across all entity profiles and types of corpus.
 
 
 
Cleansing impacts Twitter
 
@@ -974,23 +1001,38 @@ image with a text of an entity name which was actually relevant. And
 
that it removes social documents can be explained by the fact that
 
most of the documents missing from the cleansed corpus are

social. And all the documents that are missing from the raw corpus are
 
social. So in both cases socuial seem to suffer from text
 
social. So in both cases social documents seem to suffer from text

transformation and cleansing processes. 
 
 
%%%% NEEDS WORK:
 
 
Taking both performance measures (recall at filtering and overall F-score
 
during evaluation) into account, there is a clear trade-off between using a richer entity profile and retrieval of irrelevant documents. The richer the profile, the more relevant documents it retrieves, but also the more irrelevant documents. To put it into perspective, let us compare the number of documents that are retrieved with canonical partial and with name-variant partial. Using the raw corpus, the former retrieves a total of 2547487 documents and achieves a recall of 72.2\%. By contrast, the latter retrieves a total of 4735318 documents and achieves a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18\%. The rest of the documents, that is 67.9\%, are newly introduced irrelevant documents. 
 
 
Wikipedia's canonical partial is the best entity profile for Wikipedia
 
entities. It is interesting to see that the retrieval of
 
during evaluation) into account, there is a clear trade-off between
 
using a richer entity profile and retrieval of irrelevant

documents. The richer the profile, the more relevant documents it

retrieves, but also the more irrelevant documents. To put it into

perspective, let us compare the number of documents that are retrieved

with canonical partial and with name-variant partial. Using the raw

corpus, the former retrieves a total of 2547487 documents and achieves

a recall of 72.2\%. By contrast, the latter retrieves a total of
 
4735318 documents and achieves a recall of 90.2\%. The total number of
 
documents extracted increases by 85.9\% for a recall gain of 18\%. The
 
rest of the documents, that is 67.9\%, are newly introduced irrelevant
 
documents.
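
% Worked check of the figures quoted above. Assumption: the relative
% growth is taken against the canonical partial document count, and the
% remainder is the plain difference of the two percentages, as in the text.
The figures in this comparison can be reproduced with a short
calculation. This is only a sketch of the arithmetic, assuming that the
growth in retrieved documents is measured relative to the canonical
partial count; subtracting the recall gain from the growth is the
simplification made in the text, not an exact count of irrelevant
documents.
\[
\frac{4735318 - 2547487}{2547487} = \frac{2187831}{2547487} \approx 0.859,
\qquad
90.2\% - 72.2\% = 18.0\%,
\qquad
85.9\% - 18.0\% = 67.9\%.
\]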
 

	
 
Perhaps surprisingly, Wikipedia's canonical partial is the best entity profile for Wikipedia

entities. Here, the retrieval of

thousands of vital-relevant document-entity pairs by name-variant partial
 
does not translate to an increase in overall performance. It is even

more interesting since canonical partial was not considered as a

contending profile for stream filtering by any participant, to the

best of our knowledge. With this understanding, there is actually no

need to go and fetch different name variants from DBpedia, a saving

of time and computational resources.
 
does not materialize into an increase in overall performance. Notice,

though, that none of the participants in TREC KBA considered canonical partial

as a viable strategy. We conclude that, at least for our

system, the remainder of the pipeline needs a different approach to

handle the correct scoring of the additional documents, which are

necessary if we do not want to accept a low recall of the filtering

step.
 
%With this understanding, there  is actually no
 
%need to go and fetch different names variants from DBpedia, a saving
 
%of time and computational resources.
 
 
 
%%%%%%%%%%%%