Hi Tom,

Thank you for your detailed comments.

To answer your last question first: I believe the ICUTokenizer labels kanji and hanja as Han and therefore treats them the same as the regular Hanzi.

I have done some thinking about how to provide searching across Traditional and Simplified Chinese, but since mapping from one character set to the other is lossy and/or inaccurate, I've been trying to determine how best to give a higher ranking to documents that match the character set the user enters. I'll try to write up my current thoughts on this and post another reply tomorrow.

I am not surprised by the character variant issue. Are there available resources for mapping character variants?

I don't sufficiently understand the issues of Japanese orthographic variation. There was some interesting work on phonetic similarity done by Kummer, Womser-Hacker and Kando, but I haven't seen any follow-up. Apparently there is also an increasing trend among users of Japanese search engines to use Katakana for words normally written in Kanji: http://www.citeulike.org/user/tbw/tag/japanese

Would mapping hiragana to katakana make any sense?
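
To make the question concrete, here is a rough sketch of the kind of folding I have in mind. It relies only on the fact that the basic hiragana letters (U+3041 to U+3096) sit exactly 0x60 below their katakana counterparts in Unicode. The function name is just something I made up for illustration, and this is not how any particular analyzer does it.

```python
# Hypothetical helper for illustration only; not taken from any library.
HIRAGANA_FIRST = 0x3041  # small a
HIRAGANA_LAST = 0x3096   # small ke
KANA_OFFSET = 0x60       # distance from a hiragana letter to its katakana twin


def hiragana_to_katakana(text: str) -> str:
    """Replace each basic hiragana letter with the matching katakana letter.

    Everything else (Han characters, katakana, Latin, punctuation) is
    passed through unchanged.
    """
    folded = []
    for ch in text:
        code = ord(ch)
        if HIRAGANA_FIRST <= code <= HIRAGANA_LAST:
            folded.append(chr(code + KANA_OFFSET))
        else:
            folded.append(ch)
    return "".join(folded)


if __name__ == "__main__":
    print(hiragana_to_katakana("ひらがな"))
```

Applied at both index and query time, this would let a hiragana query match katakana text and vice versa, but it would do nothing for the Kanji versus Katakana variation mentioned above, so I am not sure how much it would buy us.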

Thank you again for your comments.


