Hello Tomás,

Yes, we are indexing Japanese content with Solr. CJK processing in Solr is complicated, and Japanese in particular is interesting because it mixes multiple scripts (Kanji, Hiragana, Katakana, and Romaji).

Until fairly recently there was a bug/feature in the Lucene Query parser that caused CJK to be searched as a phrase regardless of the tokenizer used (See: https://issues.apache.org/jira/browse/LUCENE-2458)

We index materials in over 400 languages, so indexing each language separately is not scalable. For this reason we couldn't use the CJK Tokenizer out-of-the-box. If I am reading the code correctly, it also looks like the current CJK Tokenizer only distinguishes between Latin and non-Latin scripts, so if you gave it Thai, Devanagari, or Arabic, it would also generate overlapping bigrams for those scripts, which is probably not desirable.
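To make the distinction concrete, here is a minimal Python sketch of script-aware bigramming: overlapping bigrams are produced only for runs of Han, Hiragana, or Katakana, and everything else is split on whitespace. This is not the CJK Tokenizer's code; the function names and the (incomplete) code point ranges are assumptions chosen for illustration.

```python
import itertools


def is_cjk(ch):
    """Rough check covering only three Unicode blocks (an illustrative assumption)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs (Han)
        or 0x3040 <= cp <= 0x309F   # Hiragana
        or 0x30A0 <= cp <= 0x30FF   # Katakana
    )


def tokenize(text):
    """Overlapping bigrams for CJK runs; whitespace splitting for all other text."""
    tokens = []
    for cjk_run, group in itertools.groupby(text, key=is_cjk):
        run = "".join(group)
        if cjk_run:
            if len(run) == 1:
                tokens.append(run)
            else:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            tokens.extend(run.split())
    return tokens


print(tokenize("東京都 library"))
print(tokenize("ภาษาไทย"))
```

With this approach the Thai string is left as a single token rather than being cut into bigrams, which is the behavior the paragraph above argues for.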

For a number of reasons we are currently using the ICUTokenizer. Unfortunately the ICUTokenizer currently tokenizes Kanji into unigrams. This results in lots of false drops. (See this for an interesting example of the problem: http://www.basistech.com/knowledge-center/products/n-gram-vs-morphologic...)
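A toy illustration of why unigrams produce false drops, assuming a simple boolean AND over tokens (real Solr scoring and the ICUTokenizer are more involved, and this example is not taken from the linked article). The query 大学 ("university") matches a document about a "big school" when both are split into single characters, but not when they are split into overlapping bigrams.

```python
def unigrams(text):
    """One token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def bigrams(text):
    """Overlapping two-character tokens."""
    chars = unigrams(text)
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def matches_all(query_tokens, doc_tokens):
    """Boolean AND: every query token must occur somewhere in the document."""
    return set(query_tokens) <= set(doc_tokens)


query = "大学"        # "university"
doc = "大きな学校"    # "big school"; contains 大 and 学, but not the word 大学

print(matches_all(unigrams(query), unigrams(doc)))  # unigram index: false drop
print(matches_all(bigrams(query), bigrams(doc)))    # bigram index: no match
```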

We have opened an issue to provide an option for the ICUTokenizer to output bigrams but have not had time to work on it. Robert Muir did a bunch of work on it and I believe he will be doing additional work on it in the near future, but I haven't checked in with him for a while (https://issues.apache.org/jira/browse/LUCENE-2906).

There are also some facilities for mapping Hiragana to Katakana (https://issues.apache.org/jira/browse/SOLR-814 and https://issues.apache.org/jira/browse/SOLR-822), but I don't yet sufficiently understand the implications to know whether this is a good idea. See these two articles on orthographic variation in Japanese:
Halpern: http://www.kanji.org/cjk/reference/japvar.htm
Kummer, Womser-Hacker and Kando: http://www.nii.ac.jp/TechReports/05-011E.pdf
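For readers unfamiliar with what such a mapping does, here is a minimal Python sketch. It relies on the fact that the Hiragana letters U+3041 to U+3096 sit exactly 0x60 code points below their Katakana counterparts in Unicode. It is not the code from the SOLR issues above, and the function name is a made-up one for illustration.

```python
# Hiragana U+3041..U+3096 map to Katakana U+30A1..U+30F6 (offset 0x60).
HIRAGANA_TO_KATAKANA = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}


def hiragana_to_katakana(text):
    """Fold Hiragana letters to Katakana; all other characters pass through."""
    return text.translate(HIRAGANA_TO_KATAKANA)


print(hiragana_to_katakana("ひらがな"))
print(hiragana_to_katakana("らーめん"))
```

Folding both scripts to one form lets a query in either script match text in the other, but it also erases a distinction that can carry meaning, which is part of why the implications need study.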

With our current configuration, to get decent results for Japanese the user has to manually segment Kanji with spaces and put quotes around multi-character words. We hope to change this in a future upgrade.
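As a sketch of what that manual workaround amounts to, the hypothetical helper below takes words the user has already segmented and quotes every multi-character word, so that its characters are searched as a phrase rather than as independent unigrams. The function name and the exact query syntax are assumptions for illustration, not part of our configuration.

```python
def build_query(words):
    """Join pre-segmented words, quoting any word longer than one character."""
    parts = []
    for word in words:
        if len(word) > 1:
            parts.append('"' + word + '"')  # phrase query keeps characters adjacent
        else:
            parts.append(word)
    return " ".join(parts)


print(build_query(["東京", "大学", "図"]))
```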

Tom
