Submitted by Tom Burton-West on December 9, 2011
re:agglutinative languages
Hi Dorthea, well, that's really another blog post. We aren't doing stemming or lemmatization, partly because it can increase recall at the expense of precision, which may not be desirable when searching the full text of 10 million books. Also, stemming and lemmatization are language-specific, and we try to avoid language-specific processing because the OCR for all 400+ languages is in a single field.
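To make the recall/precision trade-off concrete, here is a minimal sketch in Python. It is only an illustration, not the search configuration discussed here: `toy_stem`, `DOCS`, and `search` are hypothetical names, and the suffix list is a deliberately crude stand-in for a real stemmer.

```python
# Toy illustration: stemming conflates word forms, which raises recall
# but can also pull in unrelated documents (lower precision).

# Hypothetical three-document collection, keyed by document id.
DOCS = {
    1: "the organ was played",
    2: "donated organs save lives",
    3: "the organization grew",
}

# Assumed suffix list, longest first; a real stemmer is far more careful.
SUFFIXES = ("ization", "ations", "ation", "ing", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def search(query, stem):
    """Return sorted ids of documents containing the (optionally stemmed) query term."""
    normalize = toy_stem if stem else (lambda w: w)
    target = normalize(query.lower())
    hits = []
    for doc_id, text in DOCS.items():
        tokens = {normalize(tok) for tok in text.lower().split()}
        if target in tokens:
            hits.append(doc_id)
    return sorted(hits)


exact_hits = search("organ", stem=False)
stemmed_hits = search("organ", stem=True)
```

Without stemming, the query "organ" matches only document 1. With the toy stemmer it also matches document 2 ("organs", arguably a useful recall gain) and document 3 ("organization", which is probably not what the searcher wanted). Across millions of books, that second kind of match adds up.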
You might want to look at Jacques Savoy's work. He has an article on Hungarian:
Savoy, J.: Searching Strategies for the Hungarian Language. Information Processing & Management, 44(1), 2008, p. 310-324. http://members.unine.ch/jacques.savoy/Papers/HuIPM.pdf
He also has a review for several European languages that I can't find at the moment. You might look around on his web pages: http://members.unine.ch/jacques.savoy/clef/index.html
There is also the Solr language analysis page, which talks about stemming for Finnish:
http://wiki.apache.org/solr/LanguageAnalysis#Finnish. You may also want to consider decompounding.
See:
V. Hollink, J. Kamps, C. Monz, and M. de Rijke. Monolingual Document Retrieval for European Languages. Information Retrieval 7(1-2), pages 33-52, 2004. http://staff.science.uva.nl/~vhollink/InformationRetrieval.pdf
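For readers unfamiliar with decompounding, the idea can be sketched as a dictionary-based split of a compound into known parts. This is a minimal sketch under stated assumptions, not the method from the paper above or anything Solr ships: `decompound` and `LEXICON` are hypothetical names, the two-word Finnish lexicon is a toy, and real decompounders also have to handle linking elements and ambiguous splits.

```python
# Toy dictionary-based decompounder: split a compound into lexicon words,
# preferring the longest matching prefix at each step and backtracking
# when a choice leaves a remainder that cannot be split.

# Assumed toy lexicon: "kirja" (book) and "kauppa" (shop).
LEXICON = {"kirja", "kauppa"}


def decompound(word, lexicon, min_len=3):
    """Return lexicon words that concatenate to `word`, or [word] if no split exists."""

    def split(rest):
        if not rest:
            return []  # consumed the whole word successfully
        # Try the longest candidate prefix first, down to min_len characters.
        for end in range(len(rest), min_len - 1, -1):
            head = rest[:end]
            if head in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None  # no lexicon word fits here; caller backtracks

    parts = split(word)
    return parts if parts else [word]


compound_parts = decompound("kirjakauppa", LEXICON)
unknown_parts = decompound("talo", LEXICON)
```

Here "kirjakauppa" (bookshop) splits into "kirja" and "kauppa", so a query for either part could match the compound, while a word with no parts in the lexicon is returned unchanged. As with stemming, the lexicon is language-specific, which is the difficulty when many languages share one field.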
I think I have a few other articles; I'll take a look.
Tom