diff --git a/mypaper-final.tex b/mypaper-final.tex
index 4a5c5135b464d6ce878882b4d6ef597dc24b914b..4a875e5c1ea876e5986e082bec336dd3b634d6b9 100644
--- a/mypaper-final.tex
+++ b/mypaper-final.tex
@@ -111,17 +111,18 @@ knowledge base curators, who need to continuously screen the media
 for updates regarding the knowledge base entries they manage.
 Automatic system support for this entity-centric information
 processing problem requires complex pipe\-lines involving both natural language
-processing and information retrieval components. The default pipeline
+processing and information retrieval components. The pipeline
+encountered in a variety of systems that approach this problem
 involves four stages: filtering, classification, ranking (or scoring),
-and evaluation. Filtering is an initial step, that reduces the
+and evaluation. Filtering is only an initial step that reduces the
 web-scale corpus of news and other relevant information sources that
 may contain entity mentions into a working set of documents that
 should be more manageable for the subsequent stages.
-This step has a large impact on the recall that can be achieved.
-Keeping the subsequent steps constant, we therefore zoom in into the
-filtering stage, and conduct an in-depth analysis of the main design
-decisions here:
-cleansing noisy web data, the methods to create entity profiles, the
+Nevertheless, this step has a large impact on the recall that can be
+maximally attained. Therefore, in this study, we focus on just
+this filtering stage and conduct an in-depth analysis of the main design
+decisions here: how to cleanse the noisy text obtained online,
+the methods to create entity profiles, the
 types of entities of interest, document type, and the grade of
 relevance of the document-entity pair under consideration. We analyze
 how these factors (and the design choices made in their
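
The four-stage pipeline the patched paragraph describes (filtering → classification → ranking → evaluation) can be sketched as below; this is a minimal illustration, not the paper's implementation, and every function, field, and data value is a hypothetical stand-in. It also makes the paragraph's key point concrete: filtering is an upper bound on the recall any later stage can attain.

```python
# Sketch of the four-stage entity-centric pipeline: filtering,
# classification, ranking, evaluation. All names are hypothetical.

def filter_stream(documents, entity_profiles):
    """Stage 1: reduce the web-scale corpus to a working set of
    documents that mention a target entity (here: naive string match)."""
    return [d for d in documents
            if any(name in d["text"] for name in entity_profiles)]

def classify(working_set):
    """Stage 2: label each retained document; a trivial stand-in
    relevance rule is used here."""
    return [{**d, "relevant": len(d["text"]) > 20} for d in working_set]

def rank(classified):
    """Stage 3: order documents by the classifier's score."""
    return sorted(classified, key=lambda d: d["relevant"], reverse=True)

def evaluate(ranked, judgments):
    """Stage 4: recall of the judged-relevant documents that survived
    filtering -- documents dropped in stage 1 can never be recovered."""
    retained = {d["id"] for d in ranked}
    relevant = {doc_id for doc_id, rel in judgments.items() if rel}
    return len(retained & relevant) / len(relevant) if relevant else 1.0

# Toy run with made-up documents and judgments.
docs = [
    {"id": 1, "text": "Barack Obama visited Berlin today."},
    {"id": 2, "text": "Stock markets rallied on Friday."},
    {"id": 3, "text": "Obama gave a speech on climate policy."},
]
profiles = ["Obama"]
recall = evaluate(rank(classify(filter_stream(docs, profiles))),
                  {1: True, 2: False, 3: True})
print(recall)  # both judged-relevant documents pass the filter -> 1.0
```

Shrinking the profile (e.g. dropping "Obama" variants) would discard relevant documents in stage 1 and cap recall below 1.0 no matter how the later stages are tuned, which is why the study isolates the filtering stage.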