Available Indexes


This is a nice writeup. You've certainly got the basic idea for dealing with Chinese, but there are some other things you're going to want to consider as well. Chinese users will expect to find text regardless of whether they enter the query in Simplified or Traditional Chinese. Someone searching for 中国 is going to want to find 中國, and vice versa. In addition to SC/TC, there are other character variants that you will want to handle, especially in a historical archive. For example, 敎 is an older form of 教: users will expect to search for 教本 and get results for 敎本. Variants are really important, actually: even across locales that use Traditional Chinese, the "standard" character may differ.

Japanese will bring an interesting wrinkle to this as well: kanji usage has changed over time, and there is an expectation that variants are interchangeable, as in Chinese. Naturally, the accepted "modern" form in Japan may differ from the one used in Korea or Greater China. Additionally, you have searching interchangeably across hiragana and katakana to contend with and, with OCR that hasn't been corrected, confusion between similar-looking hiragana and katakana characters: I've seen cases where katakana ヘ and hiragana へ get confused, so that you end up with a sequence of hiragana with a katakana ヘ smack in the middle.

Presumably, if you continue down the n-gram route for ideographs, you'll do that for kanji and hanja as well?
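To make the folding idea concrete, here is a minimal sketch in Python of what I mean. It is not how your index works, just an illustration under some assumptions: `VARIANT_FOLD` is a tiny hand-built table I made up for the examples above (a real one would be far larger and would have to deal with one-to-many mappings), and `fold` and `bigrams` are hypothetical helper names. Katakana is folded to hiragana using the fixed 0x60 offset between the two Unicode blocks.

```python
# Hypothetical, hand-built variant table covering only the examples above.
# A real table would be much larger and would need one-to-many handling.
VARIANT_FOLD = {
    "國": "国",  # Traditional form folded to the Simplified form
    "敎": "教",  # older variant folded to the modern form
}

KATAKANA_START = 0x30A1  # ァ
KATAKANA_END = 0x30F6    # ヶ
KANA_OFFSET = 0x60       # distance between the katakana and hiragana blocks


def fold_char(ch):
    """Map one character to its canonical search form."""
    code = ord(ch)
    if KATAKANA_START <= code <= KATAKANA_END:
        # Katakana letters sit exactly 0x60 above their hiragana counterparts.
        return chr(code - KANA_OFFSET)
    return VARIANT_FOLD.get(ch, ch)


def fold(text):
    """Fold every character so that variant spellings compare equal."""
    return "".join(fold_char(ch) for ch in text)


def bigrams(text):
    """Return overlapping character bigrams of the folded text."""
    folded = fold(text)
    if len(folded) < 2:
        return [folded] if folded else []
    return [folded[i:i + 2] for i in range(len(folded) - 1)]
```

If the same folding is applied at index time and at query time, 中國 and 中国 produce identical tokens, and a stray katakana ヘ in a run of hiragana no longer blocks a match. The cost is that distinctions some users care about are erased, so you may want to keep the unfolded text in a separate field.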
You are browsing an archive of the HathiTrust website. In July 2023, we launched a new site at www.hathitrust.org.