HCDA/cikm-paper Changeset - 60d8e26cf4e5 · Centrum Wiskunde & Informatica (CWI)

@@ -940,25 +940,52 @@ In vital-relevant category (Table \ref{tab:class-vital-relevant}), the performan

\section{Analysis and Discussion}\label{sec:analysis}

We conducted experiments to study the impacts on recall of

different components of the filtering stage of an entity-based filtering and ranking pipeline. Specifically,

we conducted experiments to study the impacts of cleansing,

entity profiles, relevance ratings, and categories of documents. We also measured the impact of the different factors and choices on later stages of the pipeline.

Experimental results show that cleansing can remove all or part of the content of documents, making them difficult to retrieve. These documents can, otherwise, be retrieved from the raw version. The use of the raw corpus brings in documents that cannot be retrieved from the cleansed corpus. This is true for all entity profiles and for all entity types. The recall difference between the cleansed and raw corpora ranges from 6.8\% to 26.2\%. These increases, in actual document-entity pairs, are in the thousands. We believe this is a substantial increase. However, the recall increases do not always translate to an improved F-score in overall performance. In the vital relevance ranking for both Wikipedia and aggregate entities, the cleansed version performs better than the raw version. For Twitter entities, the raw corpus performs better except in the case of all name-variant, though the difference is negligible. However, for vital-relevant, the raw corpus performs better across all entity profiles and entity types

except in partial canonical names of Wikipedia entities.

The use of different profiles also shows a big difference in recall. Except in the case of Wikipedia, where canonical partial achieves better recall than name-variant, there is a steady increase in recall from canonical to canonical partial, to name-variant, and to name-variant partial. This pattern is also observed across the document categories. However, here too, the relationship between the gain in recall as we move from a less rich profile to a richer profile and overall performance as measured by F-score is not linear.

%%%%% MOVED FROM LATER ON - CHECK FLOW

There is a trade-off between using a richer entity profile and retrieving irrelevant documents. The richer the profile, the more relevant documents it retrieves, but also the more irrelevant documents. To put this into perspective, let us compare the number of documents retrieved with canonical partial and with name-variant partial. Using the raw corpus, the former retrieves a total of 2547487 documents and achieves a recall of 72.2\%. By contrast, the latter retrieves a total of 4735318 documents and achieves a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18 percentage points. The rest of the documents, that is 67.9\%, are newly introduced irrelevant documents.

entity profiles, relevance ratings, and categories of documents.

We also measured the impact of the different factors and

choices on later stages of the pipeline of our own system.

Experimental results show that cleansing can remove all or part of

the content of documents, making them difficult to retrieve. These

documents can, otherwise, be retrieved from the raw version. The use

of the raw corpus brings in documents that cannot be retrieved from

the cleansed corpus. This is true for all entity profiles and for all

entity types. The recall difference between the cleansed and raw

corpora ranges from 6.8\% to 26.2\%. These increases, in actual

document-entity pairs, are in the thousands. We believe this is a

substantial increase. However, the recall increases do not always

translate to an improved F-score in overall performance. In the vital

relevance ranking for both Wikipedia and aggregate entities, the

cleansed version performs better than the raw version. For Twitter

entities, the raw corpus performs better except in the case of all

name-variant, though the difference is negligible. However, for

vital-relevant, the raw corpus performs better across all entity

profiles and entity types except in partial canonical names of

Wikipedia entities.

The use of different profiles also shows a big difference in

recall. While for Wikipedia entities canonical

partial achieves better recall than name-variant, there is otherwise a steady increase

in recall from canonical to canonical partial, to name-variant, and

to name-variant partial. This pattern is also observed across the

document categories. However, here too, the relationship between

the gain in recall as we move from a less rich profile to a

richer profile and overall performance as measured by F-score is not

linear.

%%%%%%%%%%%%

In vital ranking, across all entity profiles and types of corpus, Wikipedia's canonical partial achieves better performance than any other Wikipedia entity profile. In vital-relevant documents too, Wikipedia's canonical partial achieves the best result. In the raw corpus, it achieves a little less than name-variant partial. For Twitter entities, the name-variant partial profile achieves the highest F-score across all entity profiles and types of corpus.

In vital ranking, across all entity profiles and types of corpus,

Wikipedia's canonical partial achieves better performance than any

other Wikipedia entity profile. In vital-relevant documents too,

Wikipedia's canonical partial achieves the best result. In the raw

corpus, it achieves a little less than name-variant partial. For

Twitter entities, the name-variant partial profile achieves the

highest F-score across all entity profiles and types of corpus.

Cleansing impacts Twitter

@@ -974,23 +1001,38 @@ image with a text of an entity name which was actually relevant. And

that it removes social documents can be explained by the fact that

most of the documents missing from the cleansed corpus are

social. And all the documents that are missing from the raw corpus are

social. So in both cases social documents seem to suffer from text

transformation and cleansing processes.

%%%% NEEDS WORK:

Taking both performance (recall at filtering and overall F-score

during evaluation) into account, there is a clear trade-off between using a richer entity profile and retrieving irrelevant documents. The richer the profile, the more relevant documents it retrieves, but also the more irrelevant documents. To put this into perspective, let us compare the number of documents retrieved with canonical partial and with name-variant partial. Using the raw corpus, the former retrieves a total of 2547487 documents and achieves a recall of 72.2\%. By contrast, the latter retrieves a total of 4735318 documents and achieves a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18 percentage points. The rest of the documents, that is 67.9\%, are newly introduced irrelevant documents.

Wikipedia's canonical partial is the best entity profile for Wikipedia

entities. It is interesting to see that the retrieval of

during evaluation) into account, there is a clear trade-off between

using a richer entity-profile and retrieval of irrelevant

documents. The richer the profile, the more relevant documents it

retrieves, but also the more irrelevant documents. To put this into

perspective, let us compare the number of documents that are retrieved

with canonical partial and with name-variant partial. Using the raw

corpus, the former retrieves a total of 2547487 documents and achieves

a recall of 72.2\%. By contrast, the latter retrieves a total of

4735318 documents and achieves a recall of 90.2\%. The total number of

documents extracted increases by 85.9\% for a recall gain of 18 percentage points. The

rest of the documents, that is 67.9\%, are newly introduced irrelevant

documents.
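
As a worked restatement of the figures above (using only the numbers quoted in the text; the symbols $N_{\mathrm{cp}}$ and $N_{\mathrm{nvp}}$, for the documents retrieved with canonical partial and name-variant partial, are introduced here purely for illustration), the growth in retrieved documents is
\[
\frac{N_{\mathrm{nvp}} - N_{\mathrm{cp}}}{N_{\mathrm{cp}}}
  = \frac{4735318 - 2547487}{2547487}
  = \frac{2187831}{2547487}
  \approx 85.9\%,
\]
while recall rises by $90.2\% - 72.2\% = 18.0$ percentage points. The 67.9\% quoted above corresponds to the difference $85.9\% - 18.0\% = 67.9\%$; since the two percentages are taken over different bases, it is best read as a rough indication of the share of added documents that are not relevant rather than as an exact count.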

Perhaps surprisingly, Wikipedia's canonical partial is the best entity profile for Wikipedia

entities. Here, the retrieval of

thousands of vital-relevant document-entity pairs by name-variant partial

does not translate to an increase in overall performance. It is even

more interesting since canonical partial was not considered as a

contending profile for stream filtering by any participant, to the

best of our knowledge. With this understanding, there is actually no

need to go and fetch different name variants from DBpedia, a saving

of time and computational resources.

does not materialize into an increase in overall performance. Notice

that none of the participants in TREC KBA considered canonical partial

as a viable strategy though. We conclude that, at least for our

system, the remainder of the pipeline needs a different approach to

handle the correct scoring of the additional documents -- that are

necessary if we do not want to accept a low recall of the filtering

step.

%With this understanding, there  is actually no

%need to go and fetch different names variants from DBpedia, a saving

%of time and computational resources.

%%%%%%%%%%%%