Available Indexes

Hi Dorthea, well, that's really another blog post. We aren't doing stemming or lemmatization, partly because it can increase recall at the expense of precision, which may not be desirable when searching the full text of 10 million books. Also, stemming/lemmatization is language-specific, and we try to avoid language-specific processing because the OCR for all 400+ languages is in the one field.
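
As a rough illustration of that recall/precision trade-off, here is a toy sketch in Python. The suffix list, the `toy_stem` and `search` functions, and the four sample documents are all made up for illustration; this is a naive suffix stripper, not a real stemmer such as Porter or Snowball.

```python
# Toy suffix stripper for illustration only; NOT a real stemming algorithm.
SUFFIXES = ("ization", "ations", "ation", "ing", "ers", "er", "es", "s")


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    token = token.lower()
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def search(query, docs, normalize):
    """Return ids of docs with a token equal to the query after normalization."""
    q = normalize(query)
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if any(normalize(tok) == q for tok in text.split())
    )


# Hypothetical mini-collection.
DOCS = {
    1: "the organ was played",
    2: "organs of the body",
    3: "the organization met",
    4: "party organizers arrived",
}

exact = search("organs", DOCS, str.lower)   # match on lowercased tokens only
stemmed = search("organs", DOCS, toy_stem)  # match on toy stems

print("exact:  ", exact)
print("stemmed:", stemmed)
```

With stemming the query also finds document 1 ("organ"), which is the recall gain, but it picks up document 3 ("organization") as well, which is probably not what someone searching for "organs" wanted. Across millions of books, that kind of false match adds up.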

You might want to look at Jacques Savoy's work. He has an article on Hungarian:
Savoy, J.: Searching Strategies for the Hungarian Language. Information Processing & Management, 44(1), 2008, pp. 310-324. http://members.unine.ch/jacques.savoy/Papers/HuIPM.pdf

He also has a review for several European languages that I can't find at the moment. You might look around on his web pages: http://members.unine.ch/jacques.savoy/clef/index.html

There is also the Solr language analysis page, which talks about stemming for Finnish:
http://wiki.apache.org/solr/LanguageAnalysis#Finnish. You will also want to consider decompounding.

See:
V. Hollink, J. Kamps, C. Monz, and M. de Rijke. Monolingual Document Retrieval for European Languages. Information Retrieval 7(1-2), pages 33-52, 2004. http://staff.science.uva.nl/~vhollink/InformationRetrieval.pdf

I think I have a few other articles; I'll take a look.

Tom

You are browsing an archive of the HathiTrust website. In July 2023, we launched a new site at www.hathitrust.org.