8.8.08

Searching, and Searching, and Searching

via Tree of Life. An article in Cell concerning Text Mining.
It is difficult to benchmark the efficiency of IR (information
retrieval) engines, especially their recall, because the
complete set of documents relevant to almost any search is
inherently ill defined. Nevertheless, estimates show that the
most popular search engines, such as Google, have both
precision and recall below 0.3 (Shafi and Rather, 2005). In
other words, every time we do a search, more than 70% of
the documents in the output are irrelevant, whereas more
than 70% of all relevant documents never appear in the
engine's output.
Wow! That's why, even with Google, I can't find the info I'm looking for. Normally I use Google to get to places I already know I want to go; whenever I try to use it to discover new material, it's an exercise in tedium.
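
To make those two numbers concrete for myself, here is a minimal sketch of how precision and recall are computed for a single search. The function name and the document IDs are made up for illustration (they are not from the article), and it assumes we somehow know the full set of relevant documents, which the quote points out is rarely true in practice.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    retrieved: documents the engine returned
    relevant:  documents that are actually relevant (assumed known here)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned

    # Precision: fraction of returned documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: fraction of all relevant documents that were returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: the engine returns 10 documents, 3 of them relevant,
# while 10 relevant documents exist in total (7 never show up).
retrieved = [f"doc{i}" for i in range(10)]
relevant = ["doc0", "doc1", "doc2"] + [f"missed{i}" for i in range(7)]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.1f} recall={r:.1f}")
```

With these made-up numbers both values come out at 0.3: 70% of what was returned is irrelevant, and 70% of what is relevant was never returned. The article's estimate is that real engines do worse than that.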

The article defines:

IR: Information Retrieval
NER: Named Entity Recognition
IE: Information Extraction
QA: Question Answering
TS: Text Summarization

and points to some text-mining web resources:

BLIMP (Biomedical Literature-Mining Publications)
Alexander Morgan's compilation of BioNLP resources and references
Resource links compiled by Dietrich Rebholz-Schuhmann
Text-mining resources compiled by Robert Futrelle
A list of links to current NER, IR, and IE engines
Marti Hearst's What Is Text Mining?
