Too Many Words Again!

After Mike McCandless increased the limit of unique words in a Lucene/Solr index segment from 2.4 billion words to around 274 billion words, we thought we didn't need to worry about having too many words (See http://www.hathitrust.org/blogs/large-scale-search/too-many-words). We recently discovered that we were wrong!

At the end of August we started to see the percentage of user queries taking more than 30 seconds climb.  At first we weren't too concerned.  Seven new nodes had just been added to the Isilon cluster that we use to store our Solr indexes, and the internal data redistribution within the cluster was producing a high I/O load.  Since slow queries are the ones with the highest disk I/O requirements, we expected performance to suffer during the redistribution.  However, we got nervous when the percentage of user queries taking more than 30 seconds exceeded 10% and kept climbing.  We also noticed that the initial query we run each day when Solr is first started did not return within 60 seconds for the 3 shards on our server number 1.  Our sysadmins investigated, and we discovered we were getting Java Out Of Memory (OOM) errors for the shards on that server:

" java.lang.OutOfMemoryError:  GC overhead limit exceeded"

That particular error message generally means that the Java Garbage Collector is spending too much time trying to collect garbage without recovering much memory.  The usual solution is to increase the amount of memory allocated to the JVM.
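For reference, the JVM heap ceiling is set with the standard -Xmx flag on the java command line that starts each Tomcat; the change we describe below is essentially going from something like

"java -Xmx16g ..."

to

"java -Xmx20g ..."

(the flags shown are illustrative only; our actual Tomcat startup scripts contain other settings as well).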

We also got an interesting Solr exception, but we assume it was triggered by the memory problems:

"org.apache.solr.common.SolrException: Impossible Exception"

We noticed that the OOM errors were only occurring on server number 1, and when we checked the size of the indexes, we found the index sizes ranged from 405GB to 412GB.  None of our other servers had indexes over 400GB.  We temporarily rolled back to the last index built before the OOM errors appeared in the logs and increased the memory allocated to the JVM for the Tomcats running the 3 Solr instances on each of our production servers from 16GB to 20GB.

We then did some investigation using our test server to try to determine what was going on. We pointed our test server at a snapshot of those indexes and turned on Garbage Collection logging:

"java  -Xloggc:/l/local/apache-tomcat-dev/logs/GClog -XX:+PrintGCDetails -XX:+PrintGCTimeStamps ..."

We discovered that with 20GB, even though we weren't getting OOM errors, we were getting frequent "stop-the-world" full garbage collections that took around 30 seconds but did not recover much memory.  We increased the memory allocated to the JVM to 32 GB and that reduced the frequency of the full garbage collections.  We then pointed the test server at the most recent index (which had 10 days of index growth since the indexes we rolled back to in production) and found that 32GB was enough to prevent the OOMs and the frequent full garbage collections.  We wanted to put the current index into production, so as a temporary measure we increased the memory allocated to the JVM to 32 GB on the production servers and resumed serving the current (updated nightly) index.
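One quick way to spot those pauses (a rough sketch; the exact log format depends on the JVM version and flags in use) is to pull the full-collection lines out of the log enabled above:

grep "Full GC" /l/local/apache-tomcat-dev/logs/GClog

Each matching line includes the pause time in seconds, so the long stop-the-world collections stand out immediately.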

We then set up some tests on our test server and monitored memory use with jconsole.  What we saw was that when the first query hit the server, total memory use jumped from around 1 GB to around 18 GB and never dropped below 18 GB again.  As more queries hit the server, memory use increased to about 24 GB before garbage collection kicked in.  We still got some full GCs with pauses of about 20 seconds. (Of the 5 full GCs listed in jconsole below, the first two took less than a second and occurred when we first started up.)

[Figure: Memory Use (jconsole graph of heap usage over time)]

In order to determine what was taking up the 18GB of memory, we started the server again, hit it with the first query and then used jmap to get a list of the objects using the most memory.  These are in descending order of memory use.  (The chart below is the reformatted output for the top 20 objects.  See http://www.hathitrust.org/node/300 for the original.)

jmap -histo <PID>

Rank   Class name                             Instances (millions)   GB
1      [C  (Java UTF16 character arrays)      86                     4.5
2      java.lang.String                       86                     3.4
3      [I  (Java integer arrays)              1                      1.7
4      org.apache.lucene.index.TermInfo       29                     1.1
5      org.apache.lucene.index.TermInfo       29                     1.1
6      org.apache.lucene.index.TermInfo       29                     1.1
7      org.apache.lucene.index.Term           29                     0.9
8      org.apache.lucene.index.Term           29                     0.9
9      org.apache.lucene.index.Term           29                     0.9

Rank   Class name                             Instances              GB
10     [J                                     253                    0.69
11     [Lorg.apache.lucene.index.Term;        2                      0.23
12     [Lorg.apache.lucene.index.TermInfo;    2                      0.23
13     [Lorg.apache.lucene.index.Term;        2                      0.23
14     [Lorg.apache.lucene.index.TermInfo;    2                      0.23
15     [Lorg.apache.lucene.index.Term;        2                      0.23
16     [Lorg.apache.lucene.index.TermInfo;    2                      0.23
17     java.lang.ref.SoftReference            380304                 0.02
18     [B                                     33739                  0.01
19     <constMethodKlass>                     47027                  0.01
20     [Ljava.util.HashMap$Entry;             7833                   0.01

       Total                                                         17.9

What the above shows is that the data structure that represents the "index to the index" (the Lucene tii file) in memory is taking up most of the memory.  (See Lucene File Formats for details on the tii and tis files.)

If you look at rows 4-6 and 7-9 you will see 3 sets of TermInfo and Term objects, one set for each index.  There are about 29 million of these for each index. By default the tii file holds every 128th item in the tis file, so the total number of terms in each index is (29 million x 128) or about 3.5 billion.  This is consistent with what we know about our indexes. [1]

Each of the 3 shards has 2 segments.  Looking at row 11 you see a pair of arrays of "org.apache.lucene.index.Term" objects, and row 12 is a pair of arrays of TermInfo objects.  There are 3 of these sets, one for each shard/index.

At the top of the list, item #1, at 4.5 GB, consists of character arrays, and item #2, at 3.4 GB, consists of Strings.  We asked on the mailing list to confirm that these other data structures were related to the in-memory representation of the tii index, and Mike McCandless replied:

http://www.lucidimagination.com/search/document/2ffbf96375cd6618/solr_memory_use_jmap_and_terminfos_tii

"Every indexed term that's loaded into RAM creates 4 objects (TermInfo, Term, String, char[]), as you see in your profiler output. And each object has a number of fields, the header required by the JRE, GC cost, etc." [2]

 

Even though the tii files took only about 2.2 GB on disk (about 750MB per index), once they are read into memory they occupy about 18 GB.
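As a rough back-of-the-envelope check (treating the jmap numbers above as exact, which they are not):

3 indexes x ~29 million loaded terms  ≈  87 million terms held in RAM
~18 GB / 87 million terms             ≈  200 bytes per loaded term

which is plausible once the four objects per term, their fields, and the per-object JVM headers are counted.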

In Solr 1.4 and above there is a feature that lets you configure an “index divisor” for Solr.  If you set this to 2, Solr will only load every other entry from the tii file into memory, halving the memory used by the in-memory tii representation.  The downside is that once you have a file pointer into the tis file and seek to it, in the worst case you have to scan twice as many entries. [3]

Here is how Solr is configured to set it to 2:

<!-- To set the termInfosIndexDivisor, do this: -->

<indexReaderFactory class="org.apache.solr.core.StandardIndexReaderFactory">
  <int name="termInfosIndexDivisor">2</int>
</indexReaderFactory>

We upgraded Solr on our test server to 1.4.1 (in production we are currently using a pre-1.4 development release) and ran some tests with different settings.

We set the termInfosIndexDivisor first to 2 and then to 4, and ran a query against all 3 shards to cause the tii files to be loaded into memory.  We then ran jmap to get a histogram dump of the heap.  The table below shows the total memory use for the top 20 classes for each configuration, including a base configuration where we don’t set the index divisor.

 

Configuration                        Total memory use for top 20 classes (GB)
Base (current production config)     17.9
Index divisor = 2                    9.6
Index divisor = 4                    6.1

We ran some preliminary tests and so far have seen no significant impact on response time for index divisors of 2, 4, 8, and 16, with base memory use dropping to as low as a little over 1 GB [4].  We plan to do a few more tests to decide which divisor to use and then to work on JVM tuning (we should be able to eliminate the long stop-the-world collections with the proper settings).  Once we get that done, we plan to upgrade our production Solr instances to 1.4.1 and reduce the memory allocated to the JVM from 32 GB to some level possibly as low as 8 GB.  That will leave even more memory for the OS disk cache.  When we finish the tests and settle on a configuration and JVM settings, we will report them in this blog.
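As a preview of that JVM tuning work (a sketch only; these flags are illustrative, not our final settings), one common way to shorten stop-the-world pauses on a heap this large is to switch the old generation to the concurrent mark-sweep collector, along the lines of:

"java -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 ..."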

Next up: "Adventures in Garbage Collection" and "Why Are There So Many Words?"

 


[1] In February, when we had about 550,000 documents per index, we had about 2.5 billion terms (see: http://www.hathitrust.org/blogs/large-scale-search/too-many-words).  However, those indexes were optimized down to one segment, and the number of segments has a dramatic effect on the total size of the tii data structures in memory.  The current indexes hold around 670,000 documents each but contain two segments.  The tii and tis files contain the unique terms per segment, so the reason the total number of terms in memory increased by about 40% (2.5 billion to 3.5 billion) while the number of documents increased only about 20% is that many common terms are duplicated between the two segments.  If we optimized down to 1 segment we would probably have closer to 2.8 billion terms rather than 3.5 billion.  In looking at the logs, we realized that what triggered the OOM was not just the size of the index, but also that at the end of August we had switched from optimizing to 2 segments to optimizing to 3 segments. (We have switched back to 2 segments in production until we resolve the memory issues.)

[2] "Every indexed term that's loaded into RAM creates 4 objects (TermInfo, Term, String, char[])."  See: http://www.lucidimagination.com/search/document/5762939e49a25c15/solr_memory_use_jmap_and_terminfos_tii#2ffbf96375cd6618.  See also the Lucene file formats doc: http://lucene.apache.org/java/2_9_3/fileformats.html#Term%20Dictionary and http://chbits.blogspot.com/2010/07/lucenes-ram-usage-for-searching.html.  Note that the flexible indexing branch (i.e. 4.x) is much more efficient in how it handles the in-memory representation of the tii file.  Mike loaded one of our files into the flex branch and compared it to the 3.x branch: "In 3.x, just loading terms index takes 73 seconds, and consumes 3974 MB of RAM. In 4.0, this takes 2.2 sec to open and consumes 401 MB RAM (keeping the terms index interval @ 128). That's nearly 10X less RAM and 33X faster open time."

 

 [3] The reason that it’s a linear scan of the tis file instead of a binary search has to do with decompression of the entries:

“The entry for each term in the terms dict stores a long file offset pointer, into the .frq file, and another long for the .prx file. But, these longs are delta-coded, so as you scan you have to sum up these deltas to get the absolute file pointers. The terms index (once loaded into RAM) has absolute longs, too. So when looking up a term, we first bin search to the nearest indexed term less than what you seek, then seek to that spot in the terms dict, then scan, summing the deltas.”

Mike McCandless http://lucene.472066.n3.nabble.com/Understanding-Lucene-s-File-Format-tc1514994.html#a1514994
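To make that concrete, here is a simplified sketch in Java of the "binary search the in-memory index, then scan the dictionary summing deltas" pattern Mike describes. The class and field names are hypothetical; this is not Lucene source code, and it ignores the .prx pointer and the other per-term data.

// Hypothetical sketch -- not Lucene source code.
import java.util.Arrays;

class TermsDictLookupSketch {
    // In-memory terms index (models the .tii file): every 128th term, sorted,
    // with an absolute .frq file pointer stored for each indexed term.
    String[] indexedTerms;
    long[] indexedFreqPointers;

    // Terms dictionary (models the .tis file): all terms, sorted, with
    // delta-coded .frq pointers (each entry stores the difference from the previous one).
    String[] dictTerms;
    long[] dictFreqDeltas;

    static final int INDEX_INTERVAL = 128; // every 128th term is indexed

    /** Returns the absolute .frq pointer for target, or -1 if the term is absent. */
    long lookup(String target) {
        // 1. Binary search the small in-memory index for the nearest indexed term <= target.
        int i = Arrays.binarySearch(indexedTerms, target);
        if (i < 0) i = -i - 2;           // "not found": convert to (insertion point - 1)
        if (i < 0) return -1;            // target sorts before the first indexed term

        // 2. Seek to the corresponding spot in the terms dictionary and start from
        //    the absolute pointer stored in the index.
        int pos = i * INDEX_INTERVAL;
        long freqPointer = indexedFreqPointers[i];

        // 3. Linear scan, summing deltas, until we reach or pass the target
        //    (at most INDEX_INTERVAL entries; with an index divisor of N, up to N * INDEX_INTERVAL).
        while (pos < dictTerms.length) {
            int cmp = dictTerms[pos].compareTo(target);
            if (cmp == 0) return freqPointer;   // found: the pointer is now absolute
            if (cmp > 0) return -1;             // passed it: term is not in the dictionary
            pos++;
            if (pos < dictTerms.length) {
                freqPointer += dictFreqDeltas[pos];
            }
        }
        return -1;
    }
}

With termInfosIndexDivisor set to N, only every Nth entry of the index is kept in memory, so step 1 uses roughly 1/N of the RAM while step 3 may scan up to N times as many dictionary entries: exactly the trade-off described above.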

[4] We suspect that the reason we aren't seeing a significant performance impact even with a divisor of 16 (which means 16 times as many entries might have to be scanned in the worst case) is that the amount of data that has to be scanned fits into a read buffer (so there is only one disk seek). The difference between a linear scan of 128 entries and 2048 entries (16 * 128) in memory is probably lost in the noise.
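To put a rough number on that (with the admittedly loose assumption that tis entries take about the same space on disk as tii entries): each tii file is roughly 750MB for about 29 million entries, or on the order of 25-30 bytes per entry, so

2048 entries x ~30 bytes  ≈  60 KB

which can easily come in with a single read; the extra cost is an in-memory scan rather than additional seeks.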

 

Comments

I might be jumping the gun on your next blog post here. I tried searching for words with 5 random characters (hhjrt, wnttl, ghhhd and so on) and got a fair number of hits each time. I tried doing this 10 times and so far haven't found any combination that did not result in a match.

I know that the source of the many terms is OCR errors and that it is very problematic to perform correct spell checking on them. Nevertheless, all the nonsense words damage the search as (guessing here) no one tries to search for "wntt" when they need the word "want". Wouldn't a bad spelling correction be better than no correction?

Creating a merged dictionary for all the languages, applying it to words of, guessing again, length 6 or less and forcing the corrector to give some suggestion (maybe together with the original term if the probability for a correct guess is below a certain threshold) should give a significantly reduced number of unique terms and an increase in correct matches at the cost of a significant increase in false positives.

As false positives can be reduced by adding more search terms, while the missing correct matches due to OCR errors are harder to dig out, wouldn't this be a fair trade off?

Hi Toke,

Are you suggesting a query-time fix to try to improve search results or some kind of analysis fix that would prevent "bad" words from getting into the index in the first place?

As far as preventing "bad" words from getting into the index, it's a really hard problem with 400+ languages. We don't want to accidentally remove a "good" word, so if we used a dictionary-based approach we would need very comprehensive dictionaries. Some problems with a dictionary based approach are that we have works in a wide variety of disciplines (meaning we would need some kind of academic/technical dictionaries), a wide variety of time periods (just the variations in English between 1400 and now are quite large), and many works contain proper names and place names either in the vernacular or transliterated.

I do plan to go into lots more detail in a future blog post. However, we are always looking for ideas on how to reduce the amount of dirty OCR in the index.

Tom

There's an underlying premise to my suggestion: Out of the 2 billion+ words that you index, most of them are OCR garbage. If the word "elephant" is indexed as "eleplant", it is not found. There is a huge number of false negatives.

I am suggesting preventing the bad words in the first place plus a mirrored query parser. Let me try and describe an extreme and probably unusable solution:

Take a dictionary of current use English words only. Force a suggestion for all the OCR words, even when the probability of a match is insanely low. Index the words that the dictionary suggests. Do the same check for the query before a search is performed.

You say that you do not want to eliminate good words, but the point is that they are not eliminated. If "mxyzptlk" is OCR'ed properly, but corrected to "myrtle" by the spell checker, the record will still be found, as the query parser also corrects "mxyzptlk" to "myrtle". For display, the original OCR'ed text is used.

This extreme solution probably introduces so many false positives that it is unusable. However, it can be improved gradually by adding dictionaries: the more dictionaries, the fewer false positives, while the number of false negatives will always be (a lot) lower than with the current solution.

Interesting, thanks. As someone with a much, much smaller index, but also someone with much, much less hardware and RAM available, I'm wondering if I should make use of the index divisor to reduce memory use too.

Hi Jonathan, It should reduce memory use, but might make performance worse. You probably would want to run some performance comparison tests to be sure the trade-off is worth it. Let me know what you find out.

Our case is probably an extreme one. Even though we have a large index, we have an exceptionally large number of unique terms in proportion to the index size. This is due partially to having 400+ languages, but primarily due to dirty OCR and how it interacts with our filtering/analysis process. (More details in a blog post to follow).

In our case, our performance bottleneck is disk I/O because of the size of our indexes. So if we increase the index divisor to 16, the potentially 16x longer scan of the tis file may have an impact, but it's not observable due to the dominance of disk I/O in our performance. (We suspect that even 128 * 16, or 2048, entries from the tis file take only one disk seek, and once the data is in memory, the in-memory linear scan is extremely fast compared to the disk I/O required to get the *frq and *prx data into memory.)

If you have a small enough index and/or high enough request rate (qps) so that you are CPU bound rather than I/O bound, you may notice a performance impact. Best way to find out is to run some tests.

BTW: there is also an indexing-time setting (termIndexInterval, set in indexDefaults) that sets the ratio of the tii to the tis file, but I really like the search-time setting since we can tweak it without re-indexing. Also, the flex branch (trunk/4.x) is much more efficient in handling the tii in-memory data structures. Some more detailed discussion about the trade-offs and various options here: http://lucene.472066.n3.nabble.com/Solr-memory-use-jmap-and-TermInfos-tii-tc1455421.html#a1455421

Tom

('Hapax legomenon' means 'word which appears once'. I love the phrase.) A common strategy for finding misspellings is to pull all facets with count=1. In your case, this is also a good winnow for OCR-burps. If you OCR something twice, do you get the same OCR-burps? Why do this? Because you can build a large database of such burps, and you get a training set for a classifier! Once you have this, you can start attacking single-count facets. If nobody has done this before, it would make a good grad student project. Also, doing the OCR twice is a good experimental design.

Thanks for the suggestion Lance.

I assume facet count=1 is the same as all tokens in the OCR field that occur only once (tf=1, df=1). We were thinking about looking at the Lucene index pruning contribution and removing hapax. However, the problem with actually removing the hapax is that in large corpora about 50% of the unique tokens are hapax, so we would remove lots of real words. Because these words have a very high idf, if such a word occurs in a query it would bring the document containing it to the top, so removing these really hurts retrieval for the queries that include a hapax. As you suggest, a large proportion of the OCR errors will occur only once, so in some ways this might be a good way to train a classifier. However, there are still many, many OCR errors that occur more than once. The promised blog post about OCR errors is partially written, but I keep getting diverted by other issues.

