From 430e35705d08bb7bb2b1308c35cdee76e2f72a11 2014-06-10 11:11:59 From: Gebrekirstos Gebremeskel Date: 2014-06-10 11:11:59 Subject: [PATCH] updated --- diff --git a/mypaper-final.tex b/mypaper-final.tex index 9f3bb830407d156c07a75fda5aff64290ceee19e..e4c179f93b32dd0aeb11dbbb78ca700df73f7d0b 100644 --- a/mypaper-final.tex +++ b/mypaper-final.tex
@@ -143,8 +143,8 @@ documents (news, blog, tweets) can influence filtering.
\section{Data and Task description}
-We use the TREC KBA-CCR-2013 dataset \footnote{http://http://trec-kba.org/trec-kba-2013.shtml}. The dataset consists of a time-stamped stream corpus, a set of KB entities, and a set of relevance judgments.
-\subsection{Stream corpus} The stream corpus comes in two versions: raw and cleaned. The raw and cleansed versions are 6.45TB and 4.5TB respectively, after xz-compression and GPG encryption. The raw data is a dump of raw HTML pages. The cleansed version is the raw data after its HTML tags are stripped off. The stream corpus is organized in hourly folders each of which contain many chunk files. Each chunk file contains between hundreds and hundreds of thousands serialized thrift objects. One thrift object is one document. A document could be a blog article, a news article, or a social media post (including tweet). The stream corpus comes from three sources: TREC KBA 2012 (social, news and linking) \footnote{http://trec-kba.org/kba-stream-corpus-2013.shtml}, arxiv\footnote{http://arxiv.org/}, and spinn3r\footnote{http://spinn3r.com/}. Table \ref{tab:streams} shows the sources and the number of hourly directories, and number of chunk files.
+We use the TREC KBA-CCR-2013 dataset\footnote{http://trec-kba.org/trec-kba-2013.shtml} and its problem setting. The dataset consists of a time-stamped stream corpus, a set of KB entities, and a set of relevance judgments.
+\subsection{Stream corpus} The stream corpus comes in two versions: raw and cleansed. The raw and cleansed versions are 6.45TB and 4.5TB respectively, after xz-compression and GPG encryption. The raw data is a dump of raw HTML pages. The cleansed version is the raw data after its HTML tags are stripped off and non-English documents removed. The stream corpus is organized in hourly folders, each of which contains many chunk files. Each chunk file contains between hundreds and hundreds of thousands of serialized thrift objects. One thrift object is one document. A document can be a blog article, a news article, or a social media post (including tweets). The stream corpus comes from three sources: TREC KBA 2012 (social, news and linking)\footnote{http://trec-kba.org/kba-stream-corpus-2013.shtml}, arxiv\footnote{http://arxiv.org/}, and spinn3r\footnote{http://spinn3r.com/}. Table \ref{tab:streams} shows the sources, the number of hourly directories, and the number of chunk files.
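+To make the corpus layout concrete, the following minimal sketch shows how the stream corpus could be traversed. It is illustrative rather than part of our pipeline: it assumes the chunk files have already been GPG-decrypted and xz-decompressed, that the \texttt{streamcorpus} Python library (or an equivalent thrift reader) is available, and the paths are hypothetical.
+\begin{verbatim}
+import os
+from streamcorpus import Chunk  # assumed thrift reader from the TREC KBA tooling
+
+def iter_stream_items(corpus_root):
+    """Yield one thrift StreamItem (= one document) per chunk entry."""
+    for hour_dir in sorted(os.listdir(corpus_root)):      # hourly folders
+        hour_path = os.path.join(corpus_root, hour_dir)
+        for chunk_name in sorted(os.listdir(hour_path)):  # many chunk files per hour
+            # assumes *.sc chunks, already decrypted and decompressed
+            for item in Chunk(path=os.path.join(hour_path, chunk_name)):
+                yield item
+
+for item in iter_stream_items("/data/kba-2013/cleansed"):
+    # the cleansed version carries tag-stripped text; the raw version carries HTML
+    text = item.body.clean_visible or item.body.raw
+\end{verbatim}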
\begin{table*}
\caption{Retrieved documents from the different sources}
@@ -172,10 +172,10 @@ We use the TREC KBA-CCR-2013 dataset \footnote{http://http://trec-kba.org/trec-k
\subsection{KB entities}
- The KB entities consist of 20 Twitter entities and 121 Wikipedia entities. The selected entities are, on purpose, sparse. The entities consist of 71 people, 1 organizations, and 24 facilities.
+ The KB entities consist of 20 Twitter entities and 121 Wikipedia entities. The selected entities are, on purpose, sparse. The entities consist of 71 people, 1 organization, and 24 facilities.
\subsection{Relevance judgments}
-TREC-KBA provided relevance judgments for training and testing. Relevance judgments are given to a document-entity pairs. Documents with citation-worthy content to a given entity are annotated as \emph{vital}, while documents with tangentially relevant content or that lack freshliness or with content that can be useful for initial KB-dossier are annotated as \emph{relevant}. Documents with no relevant content are labeled \emph{neutral} and spam is labeled as \emph{garbage}.
+TREC-KBA provided relevance judgments for training and testing. Relevance judgments are given to document-entity pairs. Documents with citation-worthy content for a given entity are annotated as \emph{vital}, while documents with tangentially relevant content, documents that lack freshness, or documents whose content can be useful for an initial KB-dossier are annotated as \emph{relevant}. Documents with no relevant content are labeled \emph{neutral} and spam is labeled \emph{garbage}.
The inter-annotator agreement on vital in 2012 was 70\% while in 2013 it was 76\%. This is due to the more refined definition of vital and the distinction made between vital and relevant.
@@ -278,7 +278,7 @@ Most (more than 80\%) of the annotation documents are in the test set. Some anno
\section{Experiments and Results}
We conducted experiments to study the effect of cleansing, different entity profiles, types of entities, category of documents, relevance ranks (vital or relevant), and the impact on classification. For ease of understanding, we present the results in two categories: cleansing, and document categories. In each case we study the number of annotated documents that are retrieved.
- \subsection{Cleansing: raw vs. cleansed}
+ \subsection{Cleansing: raw or cleansed}
\begin{table}
\caption{Central or relevant documents that are retrieved under different name variants, upper part from cleansed, lower part from raw}
\begin{center}
@@ -312,12 +312,6 @@ The upper part of Table \ref{tab:name} shows the recall performances on the clea
\subsection{Entity Profiles}
If we look at the recall performances for the raw corpus, filtering documents by canonical names achieves a recall of 59\%. Adding other name variants improves the recall to 79.8\%, an increase of 20.8\%. This means 20.8\% of documents mentioned the entities by other names rather than by their canonical names. Canonical partial names achieve a recall of 72\% and the partial names of all name variants achieve 90.2\%. This says that 18.2\% of documents mentioned the entities by partial names of other non-canonical name variants.
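+For concreteness, the four entity profiles used above (canonical names, partial canonical names, all name variants, and partial names of all variants) could be built as in the sketch below. The code is illustrative; the entity and its variant list are hypothetical, and partial names are taken to be the individual tokens of each name, as in our setup.
+\begin{verbatim}
+def build_profiles(canonical, variants):
+    """Return the four entity profiles as sets of matching strings."""
+    def partials(names):
+        # partial names: the individual tokens of each name,
+        # e.g. "Rory Scovel" -> {"Rory", "Scovel"}
+        return {token for name in names for token in name.split()}
+
+    all_names = {canonical} | set(variants)
+    return {
+        "cano":      {canonical},          # canonical name only
+        "cano_part": partials({canonical}),
+        "all":       all_names,            # canonical plus name variants
+        "all_part":  partials(all_names),
+    }
+
+# for a Twitter entity the canonical (user) name is a single word, so its
+# partials equal the name itself, as discussed for Twitter entities below
+profiles = build_profiles("roryscovel", ["Rory Scovel"])
+# a document is retrieved by a profile if its text contains any string in it
+\end{verbatim}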
-\subsection{Entity Type( Wikipedia vs. Twitter)}
-Recall performances on Wikipedia entities show that canonical names achieve a recall of 70\%, and partial names of canonical names achieve a recall of 86.1\%. This is an increase in recall of 16.1\%. By contrast, the increase in recall of partial names of all name variants over just all name variants is 8.3. The high increase in recall when moving from canonical names to their partial names, in comparison to the lower increase when moving from all name variants to their partial names can be explained by saturation. This is to mean that documents have already been extracted by the different name variants and thus using partial name does not bring in many new documents. One interesting observation is that, on Wikipedia entities, partial names of canonical names achieve better results than name variants. This holds in both cleansed and raw extractions. %In the raw extraction, the difference is about 3.7.
-
-In Twitter entities, however, it is different. Both canonical and their partial names perform the same and the recall is very low. Canonical names and partial canonical names are the same for Twitter entities because they are one word names. For example in https://twitter.com/roryscovel, ``roryscovel`` is the canonical name and its partial name is also the same. That they perform very low is because the canonical names of Twitter entities are not really names; they are usually arbitrarily created user names. It shows that people do not refer to Twitter entities by their user names. They refer to them by their display names, which is reflected in the recall (67.9\%). The use of partial names of all name variants increases the recall to 88.2\%.
-
-When we talk at an aggregate-level (both Twitter and Wikipedia entities), we observe two important patterns. 1) we see that recall increases as we move from canonical names to canonical partial names, to all name variants, and to partial names of name variants. But we saw that that is not the case in Wikipedia entities. The influence, therefore, comes from Twitter entities. 2) Using canonical names retrieves the least number of vital or relevant documents, and the partial names of all name variants retrieves the most number of documents. The difference in performance is 31.9\% on all entities, 20.7\% on Wikipedia entities, and 79.5\% on Twitter entities. This is a significant performance difference.
\subsection{Breakdown of results by document source category}
@@ -395,11 +389,11 @@ The results of the different entity profiles on the raw corpus are broken down
The 8 document source categories are regrouped into three for two reasons: 1) some groups are very similar to each other. Mainstream-news and news are similar. The reason they exist separately, in the first place, is because they were collected from two different sources, by different groups and at different times. We call them news from now on. The same is true with weblog and social, and we call them social from now on. 2) some groups have so few annotations that treating them independently does not make much sense. The majority of vital or relevant annotations are social (social and weblog) (63.13\%). News (mainstream + news) makes up 30\%. Thus, news and social make up about 93\% of all annotations. The rest make up about 7\% and are all grouped as others.
-The results of the breakdown by document categopries is presented in a multi-dimensional table shown in \ref{tab:source-delta}. There are three outer columns for all entities, Wikipedia and Twitter. Each of the outer columns consist of the document categories of other,news and social. The rows consist of Vital, relevant and total each of which have the four entity profiles.
+The results of the breakdown by document categories are presented in the multi-dimensional Table \ref{tab:source-delta}. There are three outer columns for all entities, Wikipedia and Twitter. Each of the outer columns consists of the document categories others, news and social. The rows consist of vital, relevant and total, each of which has the four entity profiles.
- \subsection{Vital vs. relevant}
+ \subsection{Relevance Rating: Vital and Relevant}
When comparing the recall performances in vital and relevant, we observe that canonical names achieve better recall in vital than in relevant. This is especially true with Wikipedia entities. For example, the recall for news is 80.1 and for social is 76, while the corresponding recall in relevant is 75.6 and 63.2 respectively.
We can generally see that the recall in vital is better than the recall in relevant, suggesting that vital documents are more likely to mention the entities and, when they do, to use some of their common name variants.
@@ -415,9 +409,9 @@ The results of the breakdown by document categopries is presented in a multi-dim
\subsection{Document category: others, news and social}
-The recall for Wikipedia entities in \ref{tab:name} ranged from 61.8\% (cannonical names) to 77.9\% (partial names of name variants. We looked at how these recall is distributted across the three document categories. In Table \ref{tab:source-delta}, Wikipedia column, we see, across all entity profiles, that others have a higher recall followed by news. Social documents achieve the lowest recall. While the news recall ranged from 76.4\% to 98.4\%, the recall for social documents ranged from 65.7\% to 86.8\%. Others achieve higher than news and news achieve higher than social. This pattern holds across all name variants in Wikipedia entities. Notice that the others category stands for arxiv (scientific documents), classifieds, forums and linking.
+The recall for Wikipedia entities in Table \ref{tab:name} ranged from 61.8\% (canonical names) to 77.9\% (partial names of name variants). We looked at how this recall is distributed across the three document categories. In Table \ref{tab:source-delta}, Wikipedia column, we see, across all entity profiles, that others have a higher recall followed by news. Social documents achieve the lowest recall. While the news recall ranged from 76.4\% to 98.4\%, the recall for social documents ranged from 65.7\% to 86.8\%. Others achieve higher recall than news, and news achieve higher recall than social. This pattern holds across all name variants in Wikipedia entities. Notice that the others category stands for arxiv (scientific documents), classifieds, forums and linking.
-In Twitter entities, however, the pattern is different. In cannonical names (and their partials), social documents achieve higher recall than news . This suggests that social documents refer to Twitter entities by their cannonical names (user names) more than news. In partial names of all name variants, news achieve better results than social. The difference in recall between cannonical and partial names of all name variants shows that news do not refer to Twitter entties by their user names, they refer to them with their display names.
+In Twitter entities, however, the pattern is different. In canonical names (and their partials), social documents achieve higher recall than news. This suggests that social documents refer to Twitter entities by their canonical names (user names) more than news documents do. In partial names of all name variants, news achieve better results than social. The difference in recall between canonical and partial names of all name variants shows that news do not refer to Twitter entities by their user names; they refer to them by their display names.
Overall, across all entity types and all entity profiles, others achieve better recall than news, and news, in turn, achieve higher recall than social documents. This suggests that social documents are the hardest to retrieve. This of course makes sense since social posts are short and are more likely to point to other resources, or use short informal names.
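+As an illustration of this breakdown, the regrouping of the eight source categories into news, social, and others, and the per-category recall, could be computed as in the sketch below (illustrative only; the category labels and data structures are hypothetical).
+\begin{verbatim}
+from collections import defaultdict
+
+# regrouping described above: mainstream news + news -> news,
+# weblog + social -> social, everything else -> others
+GROUP = {"mainstream": "news", "news": "news",
+         "weblog": "social", "social": "social"}
+
+def recall_by_group(annotations, retrieved):
+    """annotations: (doc_id, entity, source) triples judged vital/relevant;
+    retrieved: set of (doc_id, entity) pairs matched by an entity profile."""
+    hits, totals = defaultdict(int), defaultdict(int)
+    for doc_id, entity, source in annotations:
+        group = GROUP.get(source, "others")
+        totals[group] += 1
+        if (doc_id, entity) in retrieved:
+            hits[group] += 1
+    return {g: 100.0 * hits[g] / totals[g] for g in totals}
+\end{verbatim}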
@@ -425,14 +419,21 @@
-We computed four percentage increases in recall (deltas) between the difefrent entity profiles (see \ref{tab:source-delta2}. The first delta is the recall percentage between partial names of canonical names and canonical names. The second is the delta between name variants and canonical names. The third is the difference between partial names of name variants and partial names of canonical names and the fourth between partial names of name variants and name variants. we believe these four deltas offer a clear meaning. The delta between all name variants and canonical names shows the percentage of documents that the new name variants retrieve, but the canonical name does not. Similarly, the delta between partial names of name variants and partial names of canonical names shows the percentage of document-entity pairs that can be gained by the partial names of the name variants.
+We computed four percentage increases in recall (deltas) between the different entity profiles (see Table \ref{tab:source-delta2}). The first delta is the recall difference between partial names of canonical names and canonical names. The second is the delta between name variants and canonical names. The third is the difference between partial names of name variants and partial names of canonical names, and the fourth between partial names of name variants and name variants. We believe these four deltas offer a clear meaning. The delta between all name variants and canonical names shows the percentage of documents that the new name variants retrieve, but the canonical name does not. Similarly, the delta between partial names of name variants and partial names of canonical names shows the percentage of document-entity pairs that can be gained by the partial names of the name variants.
-In most of the deltas, news followed by social followed by others show greater difference. This suggests s that news refer to entities by different names, rather than by a certain standard name. This is counter-intuitive since one would expect news to mention entities by some consistent name(s) thereby reducing the difference. The deltas, for Wikipedia entities, between canonical partials and canonicals, and all name variants and canonicals are high suggesting that partial names and all other name variants bring in new docuemnts that can not be retrieved by canonical names. The rest of the two deltas are very small suggesting that partial names of all name variants do not bring in new relevant docuemnts. In Twitter entities, name variants bring in new documents.
+In most of the deltas, news, followed by social and then others, shows the greater difference. This suggests that news refer to entities by different names, rather than by a certain standard name. This is counter-intuitive since one would expect news to mention entities by some consistent name(s), thereby reducing the difference. For Wikipedia entities, the deltas between canonical partials and canonicals, and between all name variants and canonicals, are high, suggesting that partial names and all other name variants bring in new documents that cannot be retrieved by canonical names. The other two deltas are very small, suggesting that partial names of all name variants do not bring in new relevant documents. In Twitter entities, name variants bring in new documents.
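+The four deltas are simple differences between the recall percentages of the profiles. As a worked sketch, using the aggregate raw-corpus recall figures quoted earlier:
+\begin{verbatim}
+# aggregate recall (percent) per profile on the raw corpus
+recall = {"cano": 59.0, "cano_part": 72.0, "all": 79.8, "all_part": 90.2}
+
+deltas = {
+    "cano_part - cano":     recall["cano_part"] - recall["cano"],      # 13.0
+    "all - cano":           recall["all"] - recall["cano"],            # 20.8
+    "all_part - cano_part": recall["all_part"] - recall["cano_part"],  # 18.2
+    "all_part - all":       recall["all_part"] - recall["all"],        # 10.4
+}
+\end{verbatim}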
% The biggest delta observed is in Twitter entities between partials of all name variants and partials of canonicals (93\%). delta. Both of them are for news category. For Wikipedia entities, the highest delta observed is 19.5\% in cano\_part - cano followed by 17.5\% in all\_part in relevant.
-\subsection{Wikipedia versus Twitter}
-The tables in \ref{tab:name} and \ref{tab:source-delta} show, recall for Wikipedia entities are higher than for Twitter. This indicates that Wikipedia entities are easier to match in docuemnts than Twitter. This can be due to two reasons: 1) Wikipedia entities are relatively well described rthan Twitter entities. The fact that we can retrieve difeffernt name variants from DBpedia is a measure of relative description. By contrast, we have only two names for Twitter entities:their user names and theur display names wh9ich we collect from their Twitter pages. 2) DbPedia entities are less obscure and that is why they are not in Wikipedia anyways. Another point is that mentioned by their display names more than they are by their user names. We also observed that social docuemnts mention Twitter entities by their user names more than news suggesting a disnction between the standard in news and social documents.
+ \subsection{Entity Type: Wikipedia and Twitter}
+Table \ref{tab:name} shows the difference between Wikipedia and Twitter entities. Wikipedia entities' canonical names achieve a recall of 70\%, and partial names of canonical names achieve a recall of 86.1\%, an increase of 16.1\%. By contrast, the increase in recall of partial names of all name variants over just all name variants is 8.3\%. The high increase in recall when moving from canonical names to their partial names, compared to the lower increase when moving from all name variants to their partial names, can be explained by saturation: documents have already been extracted by the different name variants, and thus using partial names does not bring in many new documents. One interesting observation is that, on Wikipedia entities, partial names of canonical names achieve better results than name variants. This holds in both cleansed and raw extractions. %In the raw extraction, the difference is about 3.7.
+
+In Twitter entities, however, it is different. Canonical names and their partial names perform the same, and the recall is very low. Canonical names and partial canonical names are the same for Twitter entities because they are one-word names. For example, in https://twitter.com/roryscovel, ``roryscovel'' is the canonical name and its partial name is the same. They perform very poorly because the canonical names of Twitter entities are not really names; they are usually arbitrarily created user names. It shows that documents do not refer to Twitter entities by their user names. They refer to them by their display names, which is reflected in the recall (67.9\%). The use of partial names of all name variants increases the recall to 88.2\%.
+
+At an aggregate level (both Twitter and Wikipedia entities), we observe two important patterns. 1) Recall increases as we move from canonical names to canonical partial names, to all name variants, and to partial names of name variants. But we saw that this is not the case in Wikipedia entities. The influence, therefore, comes from Twitter entities.
2) Using canonical names retrieves the fewest vital or relevant documents, and the partial names of all name variants retrieve the most. The difference in performance is 31.9\% on all entities, 20.7\% on Wikipedia entities, and 79.5\% on Twitter entities. This is a significant performance difference.
+
+
+Tables \ref{tab:name} and \ref{tab:source-delta} show that recall for Wikipedia entities is higher than for Twitter entities. This indicates that Wikipedia entities are easier to match in documents than Twitter entities. This can be due to two reasons: 1) Wikipedia entities are better described than Twitter entities. The fact that we can retrieve different name variants from DBpedia is a measure of this relative description. By contrast, we have only two names for Twitter entities: their user names and their display names, which we collect from their Twitter pages. 2) Twitter entities are more obscure, which is presumably why they are not in Wikipedia in the first place. Another point is that Twitter entities are mentioned by their display names more than they are by their user names. We also observed that social documents mention Twitter entities by their user names more than news documents do, suggesting a distinction between the naming conventions of news and social documents.
@@ -522,9 +523,9 @@ It seems, the raw corpus has more effect on Twitter entities performances. An in
\subsection{Missing relevant documents \label{miss}}
-There is a trade-off between using a richer entity-profile and retrieval of irrelevant documents. The richer the profile, the more relevant documents it retrieves, but also the more irrelevant documents. To put into perspective, lets compare the number of documents that are retrieved with partial names of name variants and partial names of canonical names. Using the raw corpus, the partial names of canonical names extracts a total of 2547487 and achieves a recall of 72.2\%. By contrast, the partial names of name variants extracts a total 4735318 documents and achieves a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18\%. The rest of the documents, that is 67.9\%, are newly introduced irrelevant documents. There is an advantage in excluding irrelevant documents from filtering because they confuse the later stages of the pipeline.
+There is a trade-off between using a richer entity profile and the retrieval of irrelevant documents. The richer the profile, the more relevant documents it retrieves, but also the more irrelevant documents. To put this into perspective, let us compare the number of documents that are retrieved with partial names of name variants and with partial names of canonical names. Using the raw corpus, the partial names of canonical names extract a total of 2547487 documents and achieve a recall of 72.2\%. By contrast, the partial names of name variants extract a total of 4735318 documents and achieve a recall of 90.2\%. The total number of documents extracted increases by 85.9\% for a recall gain of 18\%. The rest of the documents, that is 67.9\%, are newly introduced irrelevant documents. There is an advantage in excluding irrelevant documents from filtering because they confuse the later stages of the pipeline.
- The use of the partial names of name variants for filtering is, therefore, an aggressive attempt to retrieve as many relevant documents as possible at the cost retrieving irrelevant documents. However, we still miss about 2363(10\%) documents. Why are these documents missed? If they are not mentioned by partial names of name variants, what are they mentioned by? Table \ref{tab:miss} shows the documents that miss with respect to cleansed and raw corpus. The upper part shows the number of documents missing from cleansed and raw versions of the corpus. The lower part of the table shows the intersections and exclusions in each corpus.
+ The use of the partial names of name variants for filtering is an aggressive attempt to retrieve as many relevant documents as possible at the cost of retrieving irrelevant documents. However, we still miss about 2363 (10\%) of the vital-relevant documents. Why are these documents missed? If they are not mentioned by partial names of name variants, what are they mentioned by? Table \ref{tab:miss} shows the documents that we miss with respect to the cleansed and raw corpus. The upper part shows the number of documents missing from the cleansed and raw versions of the corpus. The lower part of the table shows the intersections and exclusions in each corpus.
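+The trade-off figures follow directly from the extraction counts; as a small worked check with the numbers above:
+\begin{verbatim}
+docs_cano_part, docs_all_part = 2547487, 4735318   # documents extracted
+recall_cano_part, recall_all_part = 72.2, 90.2     # recall (percent)
+
+extra_docs = 100.0 * (docs_all_part - docs_cano_part) / docs_cano_part
+recall_gain = recall_all_part - recall_cano_part
+print(round(extra_docs, 1), round(recall_gain, 1))  # 85.9 18.0
+\end{verbatim}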
\begin{table}
\caption{The number of documents that are missing from raw and cleansed extractions.}
@@ -549,11 +550,11 @@ Raw & 276 & 4951 & 5227 \\
\label{tab:miss}
\end{table}
- It is normal to assume that the set of document-entity pairs extracted from cleansed are a sub-set of those that are extracted from the raw corpus. We find that that is not the case. There are 217 unique Entity-document pairs that are retrieved from the cleansed corpus, but not from the raw. 57 of them are vital. Similarly, there are 3081 document-entity pairs that are missing from cleansed, but are present in raw. 1065 of them are vital. Examining the content of the documents reveals that it is due to a missing part of text from a corresponding document. All the documents that we miss from the raw corpus are social, particularly from the category social (not from weblogs). These are document such as tweets and other posts from other social media. To meet the format of the raw data, some of them must have been converted later after collection and on the way lost a part or all of their content. It is similar for the documents that we miss from cleansed: a part or the content is lost in
+ It is natural to assume that the set of document-entity pairs extracted from cleansed is a subset of those extracted from the raw corpus. We find that this is not the case. There are 217 unique entity-document pairs that are retrieved from the cleansed corpus but not from the raw; 57 of them are vital. Similarly, there are 3081 document-entity pairs that are missing from cleansed but present in raw; 1065 of them are vital. Examining the content of the documents reveals that this is due to a missing part of text in the corresponding document. All the documents that we miss from the raw corpus are social, particularly from the category social (not from weblogs). These are documents such as tweets and other posts from social media. To meet the format of the raw data, some of them must have been converted after collection and on the way lost part or all of their content. It is similar for the documents that we miss from cleansed: part or all of the content is lost in
converting. In both cases the mention of the entity happened to be in the part of the text that was cut out during conversion.
- The interesting set of of relevance judgments are those that we miss from both raw and cleansed extractions. These are 2146 unique document-entity pairs, 219 of them are with vital relevance judgments. The total number of entities in the missed vital annotations is 28 Wikipedia and 7 Twitter, making a total of 35. Looking into document categories shows that the great majority (86.7\%) of the documents are social. This suggests that social (tweets and blogs) can talk about the entities without mentioning them by name. This is, of course, inline with intuition.
+ The interesting set of relevance judgments are those that we miss from both the raw and cleansed extractions. These are 2146 unique document-entity pairs; 219 of them have vital relevance judgments. The total number of entities in the missed vital annotations is 28 Wikipedia and 7 Twitter, making a total of 35. Looking into document categories shows that the great majority (86.7\%) of the documents are social. This suggests that social documents (tweets and blogs) can talk about the entities without mentioning them by name. This is, of course, in line with intuition.
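+The intersections and exclusions in Table \ref{tab:miss} are plain set operations over (document, entity) pairs; as a sketch (the variable names are hypothetical):
+\begin{verbatim}
+def compare_extractions(cleansed, raw, judged):
+    """cleansed, raw: sets of (doc_id, entity) pairs extracted from each
+    corpus version; judged: the vital/relevant annotated pairs."""
+    only_cleansed = (cleansed - raw) & judged   # e.g. the 217 pairs
+    only_raw      = (raw - cleansed) & judged   # e.g. the 3081 pairs
+    missed_both   = judged - (cleansed | raw)   # e.g. the 2146 pairs
+    return only_cleansed, only_raw, missed_both
+\end{verbatim}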
@@ -578,23 +579,23 @@ converting. In both cases the mention of the entity happened to be on the part
% \label{tab:miss-category}
% \end{table*}
-However, it is interesting to look into the actual content of the documents to gain an insight into the ways a document can talk about an entity without mentioning the entity . We collected 35 documents, one for each entity, for manual examination. Here below we present the reasons.
-\paragraph{Outgoing link mentions} a post (tweet) with an outgoing link which mentions the entity.
-\paragraph{Event place - Event} a document that talks about an event is vital to the location entity where it takes place. For example Maha Music Festival takes place in Lewis and Clark\_Landing, and a document talking about the festival is vital for the park. There are also cases where an event's address places the event in a park and due to the the document becomes vital to the park. This basically being mentioned by address
-\paragraph{Entity -related entity} A document about an important figure such as artist, athlete can be vital to another. This is specially true if the two are contending for the same title, one has snatched a title, and award from the other.
-\paragraph{Organization - main activity} A document that talks about about an area on which the company is active is vital for the organization. For example, Atacocha is a mining company and and an news item on mining waste was annotated vital.
-\paragraph{Entity - class} If an entity belongs to a certain class (group) and a news item about the class can be vital for the individual members. FrankandOak is named innovative company and a news item that talks about a class of innovative companies is relevant for a it. Other examples are: a big event of which an entity is related such an Film awards for actors.
-\paragraph{Artist - work} documents that discuss the work of artists can be relevant to the artists. Such cases include books or films being vital for the book author or the director (actor) of the film. robocop is film whose screenplay is by Joshua Zetumer. An blog that talks about the film was judged vital for Joshua Zetumer.
+However, it is interesting to look into the actual content of the documents to gain insight into the ways a document can talk about an entity without mentioning the entity. We collected 35 documents, one for each entity, for manual examination. Below we present the reasons.
+\paragraph{Outgoing link mentions} A post (tweet) has an outgoing link that mentions the entity.
+\paragraph{Event place - Event} A document that talks about an event is vital to the location entity where it takes place. For example, Maha Music Festival takes place in Lewis and Clark\_Landing, and a document talking about the festival is vital for the park. There are also cases where an event's address places the event in a park, and due to that the document becomes vital to the park. This is basically being mentioned by an address which belongs to a larger space.
+\paragraph{Entity - related entity} A document about an important figure such as an artist or athlete can be vital to another. This is especially true if the two are contending for the same title, or one has snatched a title or award from the other.
+\paragraph{Organization - main activity} A document that talks about an area in which the company is active is vital for the organization. For example, Atacocha is a mining company, and a news item on mining waste was annotated vital.
+\paragraph{Entity - group} If an entity belongs to a certain group (class), a news item about the group can be vital for the individual members. FrankandOak is named an innovative company, and a news item that talks about the group of innovative companies is relevant for it. Other examples are big events to which an entity is related, such as film awards for actors.
+\paragraph{Artist - work} Documents that discuss the work of artists can be relevant to the artists. Such cases include books or films being vital for the book author or the director (actor) of the film. Robocop is a film whose screenplay is by Joshua Zetumer. A blog that talks about the film was judged vital for Joshua Zetumer.
\paragraph{Politician - constituency} A major political event in a certain constituency is vital for the politician from that constituency. A good example is a weblog that talks about two North Dakota counties being drought disasters. The news is vital for Joshua Boschee, a politician and member of the North Dakota Democratic Party.
-\paragraph{head -organization} a document that talks about an organization of which the entity is the head can be vital for the entity. Jasper\_Schneider is USDA Rural Development state director for North Dakota and an article about problems of primary health centers in North Dakota is judged vital for him.
+\paragraph{Head - organization} A document that talks about an organization of which the entity is the head can be vital for the entity. Jasper\_Schneider is the USDA Rural Development state director for North Dakota, and an article about problems of primary health centers in North Dakota is judged vital for him.
\paragraph{World Knowledge} Some things are impossible to know without world knowledge. For example, ``refreshments, treats, gift shop specials, "bountiful, fresh and fabulous holiday decor," a demonstration of simple ways to create unique holiday arrangements for any home; free and open to the public'' is judged relevant to Hjemkomst\_Center. This is a social media post, and unless one knows the person posting it, there is no way that this text shows that. Similarly, ``learn about the gray wolf's hunting and feeding behaviors and watch the wolves have their evening meal of a full deer carcass; \$15 for members, \$20 for nonmembers'' is judged vital to Red\_River\_Zoo.
\paragraph{No document content} Some documents were found to have no content.
\paragraph{Not clear why} It is not clear why some documents are annotated vital for some entities.
-Although they have different document ids, many of the documents have the same content.
In the vital annotation, there are only three (88 mainstream, social, 401 weblog). In the 35 document vital document-entity pairs we examined, 22 are social, and 13 are news.
+
% To gain more insight, I sampled for each 35 entities, one document-entity pair and looked into the contents. The results are in \ref{tab:miss from both}
%
% \begin{table*}
@@ -645,7 +646,7 @@
% \label{tab:miss from both}
% \end{table*}
-
+We also observed that although documents have different document ids, several of them have the same content. In the vital annotations, there are only three (88 news and 409 weblog). Of the 35 vital document-entity pairs we examined, 13 are news and 22 are social.
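+Duplicate contents under distinct document ids, as observed above, can be detected with a simple content hash; a minimal illustrative sketch:
+\begin{verbatim}
+import hashlib
+from collections import defaultdict
+
+def duplicate_groups(docs):
+    """docs: iterable of (doc_id, text); returns groups of doc ids
+    whose text is byte-identical despite different ids."""
+    by_hash = defaultdict(list)
+    for doc_id, text in docs:
+        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
+        by_hash[digest].append(doc_id)
+    return [ids for ids in by_hash.values() if len(ids) > 1]
+\end{verbatim}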