From fe74b611603ffc8760c53b9805c5d891a40bfc5d 2014-06-12 02:54:53 From: Gebrekirstos Gebremeskel Date: 2014-06-12 02:54:53 Subject: [PATCH] updated --- diff --git a/mypaper-final.tex b/mypaper-final.tex index 4a875e5c1ea876e5986e082bec336dd3b634d6b9..8f32858ba5e05b086746204d99dc0ea941d59592 100644 --- a/mypaper-final.tex +++ b/mypaper-final.tex @@ -498,7 +498,59 @@ The tables in \ref{tab:name} and \ref{tab:source-delta} show recall for Wikipedi \section{Impact on classification} - In the overall experimental setup, classification, ranking, and evaluationn are kept constant. Following \cite{balog2013multi} settings, we use WEKA's\footnote{http://www.cs.waikato.ac.nz/∼ml/weka/} Classification Random Forest. Features we use incude similarity features such as cosine and jaccard, document-entity features such as docuemnt mentions entity in title, in body, frequency of mention, etc., and related entity features such as page rank scores. In total we sue The features consist of similarity measures between the KB entiities profile text, document-entity features such as + In the overall experimental setup, classification, ranking, and evaluation are kept constant. Following the settings of \cite{balog2013multi}, we use WEKA's\footnote{http://www.cs.waikato.ac.nz/$\sim$ml/weka/} Random Forest classifier. However, we use a smaller feature set, which we found to be more effective: we ran the classification algorithm with both our feature implementations and theirs, and ours achieved better results. In total we use 13 features, listed below. + +\paragraph{Google's Cross Lingual Dictionary (GCLD)} + +This is a mapping of strings to Wikipedia concepts and vice versa +\cite{spitkovsky2012cross}. 
+We use the probability with which a string is used as anchor text +linking to a Wikipedia entity. + +\paragraph{jac} + Jaccard similarity between the document and the entity's Wikipedia page +\paragraph{cos} + Cosine similarity between the document and the entity's Wikipedia page +\paragraph{kl} + KL-divergence between the document and the entity's Wikipedia page + + \paragraph{PPR} +For each entity, we computed a personalised PageRank (PPR) score from +a Wikipedia snapshot and kept the top 100 entities along +with the corresponding scores. + + +\paragraph{Surface Form (sForm)} +For each Wikipedia entity, we gathered DBpedia name variants: +redirects, labels, and names. + + +\paragraph{Context (contxL, contxR)} +From the WikiLink corpus \cite{singh12:wiki-links}, we collected +all left and right contexts (2 sentences to the left and 2 sentences +to the right) and generated n-grams (unigrams up to 4-grams) +for each left and right context. +Finally, we selected the 5 most frequent n-grams for each context. + +\paragraph{FirstPos} + Term position of the first occurrence of the target entity in the document + body +\paragraph{LastPos} + Term position of the last occurrence of the target entity in the document body + +\paragraph{LengthBody} Term count of the document body +\paragraph{LengthAnchor} Term count of the document anchor + +\paragraph{FirstPosNorm} + Term position of the first occurrence of the target entity in the document + body, normalised by the document length +\paragraph{MentionsBody} + Number of occurrences of the target entity in the document body 
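The similarity and positional features above can be sketched with simple token-based implementations. This is a minimal illustration under assumed conventions (whitespace tokenization, raw term counts, -1 for absent mentions); the paper does not specify these details, and the function and key names are hypothetical:

```python
import math
from collections import Counter

def jaccard(doc_tokens, entity_tokens):
    # Jaccard similarity over the token sets of document and entity page
    a, b = set(doc_tokens), set(entity_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(doc_tokens, entity_tokens):
    # Cosine similarity over raw term-frequency vectors
    a, b = Counter(doc_tokens), Counter(entity_tokens)
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def positional_features(body_tokens, mention):
    # FirstPos/LastPos/FirstPosNorm/MentionsBody/LengthBody for one
    # entity mention, treating the mention as a single token
    positions = [i for i, t in enumerate(body_tokens) if t == mention]
    n = len(body_tokens)
    return {
        "FirstPos": positions[0] if positions else -1,
        "LastPos": positions[-1] if positions else -1,
        "FirstPosNorm": positions[0] / n if positions and n else -1,
        "MentionsBody": len(positions),
        "LengthBody": n,
    }
```

A multi-token entity name would require matching token spans rather than single tokens, and the KL-divergence feature would additionally need smoothed language models; both are omitted here for brevity.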
Here, we present results showing how the choices of corpus, entity types, and entity profiles impact these later stages of the pipeline. Tables \ref{tab:class-vital} and \ref{tab:class-vital-relevant} show the performances in max-F. \begin{table*} \caption{Vital performance under different name variants (upper part from cleansed, lower part from raw)}