Changeset - 7fa970bcee0e
Gebrekirstos Gebremeskel - 11 years ago 2014-06-10 14:10:12
destinycome@gmail.com
updated
1 file changed with 5 insertions and 5 deletions:
mypaper-final.tex
 
@@ -144,7 +144,7 @@ documents (news, blog, tweets) can influence filtering.
 
 
 \section{Data and Problem description}
 
We use the TREC KBA-CCR-2013 dataset\footnote{http://trec-kba.org/trec-kba-2013.shtml} and problem setting. The dataset consists of a time-stamped stream corpus, a set of KB entities, and a set of relevance judgments. 
 
\subsection{Stream corpus} The stream corpus comes in two versions: raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB respectively, after xz-compression and GPG encryption. The raw data is a dump of raw HTML pages. The cleansed version is the raw data after its HTML tags are stripped off and non-English documents are removed. The stream corpus is organized in hourly folders, each of which contains many chunk files. Each chunk file contains between hundreds and hundreds of thousands of serialized thrift objects. One thrift object is one document. A document can be a blog article, a news article, or a social media post (including tweets). The stream corpus comes from three sources: TREC KBA 2012 (social, news and linking)\footnote{http://trec-kba.org/kba-stream-corpus-2013.shtml}, arXiv\footnote{http://arxiv.org/}, and spinn3r\footnote{http://spinn3r.com/}. Table \ref{tab:streams} shows the sources, the number of hourly directories, and the number of chunk files. 
 
 
\begin{table*}
 
\caption{retrieved documents to different sources }
 
@@ -175,12 +175,12 @@ We use TREC KBA-CCR-2013 dataset \footnote{http://http://trec-kba.org/trec-kba-2
 
 The KB entities consist of 20 Twitter entities and 121 Wikipedia entities. The selected entities are deliberately sparse. The entities consist of 71 people, 1 organization, and 24 facilities.  
 
\subsection{Relevance judgments}
 
 
TREC-KBA provided relevance judgments for training and testing. Relevance judgments are assigned to document-entity pairs. Documents with citation-worthy content for a given entity are annotated as \emph{vital}, while documents with tangentially relevant content, documents that lack freshness, or documents whose content can be useful for an initial KB dossier are annotated as \emph{relevant}. Documents with no relevant content are labeled \emph{neutral} and spam is labeled \emph{garbage}. The inter-annotator agreement on vital was 70\% in 2012 and 76\% in 2013. The improvement is attributed to the more refined definition of vital and the sharper distinction made between vital and relevant. 
 
 
 
 
 \subsection{Problem description}
 
 Given a stream of documents consisting of news items, blogs and social media on one hand and KB entities (Wikipedia, Twitter) on the other, we study the factors and choices that affect filtering performance. Specifically, we conduct an in-depth analysis of the cleansing step, the entity-profile construction, the document category of the stream items, and the type of entities (Wikipedia or Twitter). We also study the impact of these choices on classification performance. Finally, we conduct a manual examination of the relevant documents that defy filtering. We strive to answer the following research questions:
 
 
 
 \begin{enumerate}
 
  \item Does cleansing affect filtering and subsequent performance
 
@@ -429,14 +429,14 @@ In most of the  deltas, news followed by social followed by others show greater
 
% The  biggest delta  observed is in Twitter entities between partials of all name variants and partials of canonicals (93\%). delta. Both of them are for news category.  For Wikipedia entities, the highest delta observed is 19.5\% in cano\_part - cano followed by 17.5\% in all\_part in relevant.  
 
  
 
  \subsection{Entity Type: Wikipedia and Twitter}
 
Table \ref{tab:name} shows the difference between Wikipedia and Twitter entities. Wikipedia entities' canonical names achieve a recall of 70\%, and partial names of canonical names achieve a recall of 86.1\%. This is an increase in recall of 16.1 percentage points. By contrast, the increase in recall of partial names of all name variants over just all name variants is only 8.3 percentage points. The large increase in recall when moving from canonical names to their partial names, compared with the smaller increase when moving from all name variants to their partial names, can be explained by saturation. That is, most documents have already been retrieved by the different name variants, so using partial names brings in few new documents. One interesting observation is that, on Wikipedia entities, partial names of canonical names achieve better results than name variants. This holds in both cleansed and raw extractions. %In the raw extraction, the difference is about 3.7. 
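
As a rough check on the figures quoted above, the two recall gains for Wikipedia entities can be written out explicitly. The symbols $R_{c}$, $R_{cp}$, $R_{a}$ and $R_{ap}$ are shorthand introduced here for illustration only (recall of canonical names, partial canonical names, all name variants, and partial names of all name variants); they are not notation used elsewhere in the paper, and the gains are read as differences in percentage points.
\[
R_{cp} - R_{c} = 86.1 - 70.0 = 16.1, \qquad R_{ap} - R_{a} = 8.3 .
\]
Under this reading, the gain from adding partial names is roughly twice as large for canonical names as for all name variants, which is consistent with the saturation explanation.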
 
 
For Twitter entities, however, the picture is different. Canonical names and their partial names perform the same, and the recall is very low. The two are identical for Twitter entities because canonical names are one-word names. For example, in https://twitter.com/roryscovel, ``roryscovel'' is the canonical name and its partial name is also the same. Their recall is very low because the canonical names of Twitter entities are not really names; they are usually arbitrarily created user names. This shows that documents do not refer to Twitter entities by their user names. They refer to them by their display names, which is reflected in the recall (67.9\%). The use of partial names of all name variants increases the recall to 88.2\%.
 
 
At the aggregate level (both Twitter and Wikipedia entities), we observe two important patterns. 1) Recall increases as we move from canonical names to canonical partial names, to all name variants, and to partial names of name variants. We saw, however, that this ordering does not hold for Wikipedia entities. The influence, therefore, comes from Twitter entities. 2) Canonical names retrieve the fewest vital or relevant documents, and partial names of all name variants retrieve the most. The gap between the two is 31.9 percentage points on all entities, 20.7 on Wikipedia entities, and 79.5 on Twitter entities. This is a substantial performance difference. 
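
To make the aggregate comparison concrete, the gap between the weakest and the strongest profile can be sketched as a simple recall difference. This assumes, as an illustration rather than the paper's stated definition, that recall is the fraction of annotated vital or relevant documents that a profile retrieves; $D$, $\mathit{ret}(p)$, $R_{p}$ and $\Delta$ are hypothetical symbols introduced only here, with $c$ standing for canonical names and $ap$ for partial names of all name variants.
\[
R_{p} = \frac{|\mathit{ret}(p) \cap D|}{|D|}, \qquad \Delta = R_{ap} - R_{c} .
\]
If the figures quoted in the text all come from the same extraction, they fit together as follows: for Wikipedia entities, $R_{c} = 70.0$ and $\Delta = 20.7$ would imply $R_{ap} \approx 70.0 + 20.7 = 90.7$; for Twitter entities, $R_{ap} = 88.2$ and $\Delta = 79.5$ would imply $R_{c} \approx 88.2 - 79.5 = 8.7$, in line with the very low recall of Twitter user names.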
 
 
 
Tables \ref{tab:name} and \ref{tab:source-delta} show that recall for Wikipedia entities is higher than for Twitter entities. This indicates that Wikipedia entities are easier to match in documents than Twitter entities. This can be due to two reasons: 1) Wikipedia entities are better described than Twitter entities. The fact that we can retrieve different name variants from DBpedia is a measure of relative description. By contrast, we have only two names for Twitter entities: their user names and their display names, which we collect from their Twitter pages. 2) Twitter entities are more obscure, which is why they are not in Wikipedia in the first place. Another point is that Twitter entities are mentioned by their display names more often than by their user names. We also observed that social documents mention Twitter entities by their user names more often than news documents do, suggesting that naming conventions differ between news and social documents. 
 
 
 