Available Indexes


Yes, this may seem surprising, but there are a lot of words out there. Still, billions of words from only 555,000 documents is more than I would expect. For comparison, I recently indexed a ClueWeb09 SubsetB collection of 50 million mostly English pages and collected only around 200 million unique words that contained only Latin letters and digits (i.e., I ignored words with non-ASCII characters). I would suggest that n-gram sequences are the major contributor to the number of unique words in your case.
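As a rough illustration of why n-grams can inflate the unique-term count, here is a minimal Python sketch. It assumes word n-grams joined by a space and a whitespace tokenizer that keeps only tokens made of ASCII letters and digits; the names `tokenize` and `unique_terms` and the two sample documents are hypothetical, and this is not how any particular index is actually built.

```python
import re

# Assumption: a "word" is a whitespace-separated token made only of
# ASCII letters and digits; anything else is ignored entirely.
ASCII_WORD = re.compile(r"[A-Za-z0-9]+")

def tokenize(text):
    """Return lowercased tokens, dropping any token with other characters."""
    return [t.lower() for t in text.split() if ASCII_WORD.fullmatch(t)]

def unique_terms(docs, max_n=1):
    """Count distinct terms when word n-grams up to max_n are indexed as terms."""
    terms = set()
    for doc in docs:
        tokens = tokenize(doc)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                terms.add(" ".join(tokens[i:i + n]))
    return len(terms)

# Hypothetical toy collection, only to show the effect.
docs = ["the cat sat on the mat", "the dog sat on the log"]
unigrams = unique_terms(docs, max_n=1)
with_bigrams = unique_terms(docs, max_n=2)
print(unigrams, with_bigrams)
```

Even on two short documents, adding bigrams more than doubles the number of distinct terms, and the gap widens quickly on real text because most word pairs occur only a handful of times.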