Bug #1059 (closed)

Opened 11 years ago

Closed 10 years ago

OutOfMemory exception thrown after 2 hours of importing

Reported by:	jamoore	Owned by:	jamoore
Priority:	blocker	Cc:	cxallan, bwzloranger, jrswedlow
Sprint:	n.a.
Total Remaining Time:	n.a.

Description

After 2 hours of importing a screen, the JBoss server hung with an OOM. The thread dump shows no hung threads. A heap dump shows hundreds of megabytes tied up in char[] arrays whose reference chains ultimately lead to the JBoss class loader. Possible cause:

https://jira.jboss.org/jira/browse/JBAS-4593

As in the JIRA ticket, there are more than 10,000 JMX InvocationContexts still in memory. That issue, however, was supposedly fixed in JBoss 4.2.2.

Attachments (1)

gc_testing.zip (367.4 KB) - added by jmoore 11 years ago. Attaching a summary of various gc tests. From README.txt:

Running various gc parameters under jboss/blitz &/or jprofiler. (Note: ConcMarkSweep doesn't work well under jprofiler)

gc1: standard gc under jboss. When requesting a full gc, boom
gc2: parallel gc under jboss. Same.
gc3: concurrent gc. Couldn't request full gc (jprofiler doesn't support) but ran and ran
gc4: standard gc under blitz. Same as gc1
gc5: concurrent gc under blitz. jprofiler could request full. Ran and ran.
gc6: going back to standard gc to see how long (without jprofiler) things run. No stop. Seems to be an issue with heavy load (also caused by profiler)
gc7: trying gc6 with profiler to confirm. No full gc requested. Still OOM.
gc8: Retrying gc7 under jprofiler looking at allocations. Disconnected jprofiler before OOM, and blitz recovered.
gc9: Adding NewRatio=8 to gc8 to test a hypothesis. Does work better, survives multiple gcs. gen2 filled up and was completely cleared each time.
gc10: Doubled NewRatio. Works even better. Full gcs got memory down into the 200M range. (Dependent on how close to the limit)
-- may have spoken too soon: the increase from 600M to 700M (absolute) max hung somewhat. no throughput, etc.
-- in general the throughput is getting worse, and the last gc @ 700M only got down in the 300M range.
-- despite the large mountain peaks, does keep going
gc11: Trying -XX:+ScavengeBeforeFullGC rather than NewRatio. Without a call to gc, was able to survive a full gc @ 700M down to ~300M
-- calling several myself was not significantly helpful
gc12: Trying -XX:NewSize=big. Seems to work well. A steady (expected) ratchet effect. Memory growth->Clearing
gc13: Trying gc12 with NewRatio=8

Change History (7)

comment:1 Changed 11 years ago by jmoore

So, this almost certainly has nothing to do with JBAS-4593 and is not a traditional memory leak at all. What's most likely happening is that, with indexing and a screen import running at the same time, so many short-lived objects are being created that the garbage collector cannot keep up. This explains why simple profiling shows no memory increase after a small import and a garbage collection.

Some other symptoms:

The import succeeds without indexing
Indexing (without import) fails when run long enough under a profiler, because of the added strain.
"Failure" in this case consists of OOMs being thrown, often if not always when a GC is started from the profiler: all memory values rocket to the maximum and the GCs consume 100% of processing time
This particular type of failure (without import) can be fixed via:
- -XX:+UseConcMarkSweepGC
- -XX:NewRatio=8
- -XX:NewSize=512M

All of these better handle the large number of short-lived objects. Attempts to do the same under import still failed.
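To compare GC settings like the ones above, it can help to measure how much collector activity a burst of short-lived allocations causes. The following is only a minimal sketch: the class name GcPressureCheck and the churn() workload are made up for illustration and are not part of the server. It uses the standard java.lang.management beans to report the JVM flags in effect and the number and duration of collections.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of the server code base.
public class GcPressureCheck {

    /** Total collections reported by all collectors (-1 means "undefined" and is skipped). */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds across all collectors. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }

    /** Allocates many short-lived char[] buffers; the checksum keeps the work from being optimized away. */
    static long churn(int iterations, int size) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            char[] buf = new char[size];   // becomes garbage at the end of the iteration
            buf[i % size] = 'x';
            checksum += buf[i % size];
        }
        return checksum;
    }

    public static void main(String[] args) {
        // Shows which -X/-XX options this JVM was actually started with.
        System.out.println("JVM flags: " + ManagementFactory.getRuntimeMXBean().getInputArguments());

        long countBefore = totalCollections();
        long millisBefore = totalCollectionMillis();
        long start = System.nanoTime();

        long checksum = churn(2_000_000, 1024);

        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("checksum:    " + checksum);
        System.out.println("collections: " + (totalCollections() - countBefore));
        System.out.println("gc time ms:  " + (totalCollectionMillis() - millisBefore));
        System.out.println("elapsed ms:  " + elapsedMillis);
    }
}
```

Running the same class under different flag combinations gives a rough comparison of how much time each configuration spends collecting. It does not reproduce the real indexing or import load.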

Changed 11 years ago by jmoore

Attachment gc_testing.zip added

comment:2 Changed 11 years ago by jmoore

Cc jason added

Possible solutions to this include:

Move indexing to another process. Cons: May require too much memory for many users, since Hibernate will be running twice. Also complicates deployment in the JBoss case.
Throttle indexing and/or import. Cons: To get this in for milestone:3.0-Beta3.1 would require pushing things back. The OmeroThrottling infrastructure is in place, but the checks are not being performed.
Ship with improved GC settings. Cons: We still have not found the optimal settings, but this can most likely be done. Also complicates deployment, since the optimal settings depend on the resources available on the server machine.

comment:3 Changed 11 years ago by jmoore

Milestone changed from 3.0-Beta3.1 to 3.0-Beta4

Pushing. We may have to release a point release for the server-only if users run into this issue.

comment:4 Changed 10 years ago by jmoore

The primary method for working around this is to get the full text indexer into a separate process. In addition, some (very high) hard-throttling limits will be in place on a per-thread basis (e.g. one thread can't load/write 100K objects in a single method call). Testing will then have to show what needs to be changed to make the importer's huge imports more successful.
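As an illustration of what a per-thread hard limit could look like, here is a minimal sketch. The class PerThreadObjectLimit and its methods are hypothetical and are not the OmeroThrottling API; the sketch only assumes that a service method can report how many objects it has loaded or written so far.

```java
// Hypothetical sketch of a per-thread hard limit; not the actual OmeroThrottling code.
public class PerThreadObjectLimit {

    private final long maxObjectsPerCall;

    // One counter per thread, so concurrent calls do not interfere with each other.
    private final ThreadLocal<long[]> counter = ThreadLocal.withInitial(() -> new long[1]);

    public PerThreadObjectLimit(long maxObjectsPerCall) {
        this.maxObjectsPerCall = maxObjectsPerCall;
    }

    /** Call at the start of each service method to reset this thread's counter. */
    public void beginCall() {
        counter.get()[0] = 0;
    }

    /** Call whenever objects are loaded or written; fails once the limit is exceeded. */
    public void record(long objects) {
        long[] c = counter.get();
        c[0] += objects;
        if (c[0] > maxObjectsPerCall) {
            throw new IllegalStateException(
                "Thread exceeded limit of " + maxObjectsPerCall + " objects in a single call");
        }
    }

    /** Number of objects recorded by the current thread since beginCall(). */
    public long current() {
        return counter.get()[0];
    }
}
```

With a limit of 100,000, a method would call beginCall() on entry and record(n) after each batch it loads or writes. The exception then stops a single runaway call before it fills the heap.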

comment:5 Changed 10 years ago by jmoore

r3723, r3759, r3760 (with workaround), et al. move indexing out to its own process.
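The general idea of running the indexer in its own JVM, with its own heap, can be sketched as follows. This is not what the revisions above actually do: the IndexerLauncher class, the main class name example.IndexerMain, and the 512m heap size are all assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical launcher; the real deployment of the indexer process may differ entirely.
public class IndexerLauncher {

    /** Builds a command line that starts mainClass in a separate JVM with its own heap limit. */
    static List<String> buildCommand(String classpath, String mainClass, String maxHeap) {
        List<String> cmd = new ArrayList<>();
        cmd.add(System.getProperty("java.home") + File.separator + "bin" + File.separator + "java");
        cmd.add("-Xmx" + maxHeap);   // the child heap is independent of the server's heap
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add(mainClass);
        return cmd;
    }

    /** Starts the child process, forwarding its output to this process's console. */
    static Process launch(List<String> command) throws IOException {
        return new ProcessBuilder(command).inheritIO().start();
    }

    public static void main(String[] args) {
        // "example.IndexerMain" is a placeholder; substitute a real main class before calling launch().
        List<String> cmd = buildCommand(System.getProperty("java.class.path"), "example.IndexerMain", "512m");
        System.out.println(String.join(" ", cmd));
    }
}
```

Because the child JVM has its own heap and its own garbage collector, allocation churn from indexing no longer competes with the import for the server's memory. The cost is a second Hibernate instance, as noted in comment:2.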

comment:6 Changed 10 years ago by jmoore

Resolution set to fixed
Status changed from new to closed

With the indexer in a separate process and the numerous improvements to the importer, this seems to be solved. Obviously, other memory issues will pop up again, but closing for now.
