Task #12012 (closed)
Opened 10 years ago
Closed 10 years ago
BUG: Importing a large number of files fails
Reported by: | spli | Owned by: | dlindner |
---|---|---|---|
Priority: | critical | Milestone: | 5.0.1 |
Component: | Insight | Version: | 5.0.0-rc3 |
Keywords: | n.a. | Cc: | rleigh, java@… |
Resources: | n.a. | Referenced By: | n.a. |
References: | n.a. | Remaining Time: | n.a. |
Sprint: | n.a. |
Description (last modified by jamoore)
OMERO.clients-5.0.0-ice35-b19.mac.zip, gretzky: OMERO.web 5.0.0-ice33-b19.
Import a dataset with a large number of files, e.g. /ome/team/simon/digits/train-png.tar.bz (42,000 files split into 10 directories).
The import failed after importing 2492 files. The server appears to have open file handles to a log file for each image:
omero@gretzky:~$ lsof -u omero | grep /nightshade/OMERO/omero-c6100-dec-5-2013-valuable/OMERO-merge-develop-FS/ManagedRepository/ |wc -l 2503
e.g.
... /nightshade/OMERO/omero-c6100-dec-5-2013-valuable/OMERO-merge-develop-FS/ManagedRepository/user-4_5/2014-02/24/17-38-55.156.log ...
Insight has managed to get itself stuck at 100% cpu with an Abnormal Termination:
java.lang.Exception: Abnormal termination due to an uncaught exception. java.lang.StackOverflowError at java.awt.AWTEventMulticaster.componentHidden(AWTEventMulticaster.java:172) ...1022 copies of this line cut... at java.awt.AWTEventMulticaster.componentHidden(AWTEventMulticaster.java:172) Abnormal termination due to an uncaught exception. java.lang.StackOverflowError at java.awt.AWTEventMulticaster.componentHidden(AWTEventMulticaster.java:172) ...1022 copies of this line cut... at java.awt.AWTEventMulticaster.componentHidden(AWTEventMulticaster.java:172) Exception in thread "AWT-EventQueue-0" at org.openmicroscopy.shoola.env.ui.UserNotifierImpl.showErrorDialog(UserNotifierImpl.java:191) at org.openmicroscopy.shoola.env.ui.UserNotifierImpl.notifyError(UserNotifierImpl.java:291) at org.openmicroscopy.shoola.env.AbnormalExitHandler.doTermination(AbnormalExitHandler.java:147) at org.openmicroscopy.shoola.env.AbnormalExitHandler.terminate(AbnormalExitHandler.java:85) at org.openmicroscopy.shoola.env.AWTExceptionHanlder.handle(AWTExceptionHanlder.java:99) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at java.awt.EventDispatchThread.handleException(EventDispatchThread.java:412) at java.awt.EventDispatchThread.processException(EventDispatchThread.java:318) at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:312) at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:211) at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:201) at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:196) at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:188) at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)
Whilst this dialog is displayed Insight appears to continue attempting to upload files, judging by omeroinsight.log:
... 2014-02-24 17:42:14,583 INFO [ ome.formats.importer.ImportCandidates] ( Threa d-18) 1 file(s) parsed into 1 group(s) with 1 call(s) to setId in 1ms. (32ms tot al) [0 unknowns] 2014-02-24 17:42:14,583 INFO [ ome.formats.importer.ImportConfig] ( Threa d-18) OMERO Version: 5.0.0-ice35-b19 2014-02-24 17:42:14,584 INFO [ ome.formats.importer.ImportConfig] ( Threa d-18) Bioformats version: 5.0.0-ice35-b19 revision: cd2a5db date: 24 February 20 14 2014-02-24 17:42:14,584 DEBUG [ o.o.shoola.util.NetworkChecker] ( Threa d-18) Cached networkup: true 2014-02-24 17:42:14,608 INFO [ ome.formats.importer.ImportConfig] ( Thread-18) OMERO Version: 5.0.0-ice35-b19 2014-02-24 17:42:14,608 INFO [ ome.formats.importer.ImportConfig] ( Thread-18) Bioformats version: 5.0.0-ice35-b19 revision: cd2a5db date: 24 February 2014 2014-02-24 17:42:14,608 INFO [ ome.formats.importer.ImportCandidates] ( Thread-18) Depth: 4 Metadata Level: MINIMUM 2014-02-24 17:42:14,623 INFO [ loci.formats.ImageReader] ( Thread-18) APNGReader initializing /Users/simon/omero-test/tmp/train-png/1/00001447.png 2014-02-24 17:42:14,640 INFO [ ome.formats.importer.ImportCandidates] ( Thread-18) 1 file(s) parsed into 1 group(s) with 1 call(s) to setId in 1ms. (32ms total) [0 unknowns] 2014-02-24 17:42:14,640 INFO [ ome.formats.importer.ImportConfig] ( Thread-18) OMERO Version: 5.0.0-ice35-b19 2014-02-24 17:42:14,640 INFO [ ome.formats.importer.ImportConfig] ( Thread-18) Bioformats version: 5.0.0-ice35-b19 revision: cd2a5db date: 24 February 2014 2014-02-24 17:42:14,640 DEBUG [ o.o.shoola.util.NetworkChecker] ( Thread-18) Cached networkup: true 2014-02-24 17:42:14,661 INFO [ ome.formats.importer.ImportConfig] ( Thread-18) OMERO Version: 5.0.0-ice35-b19 2014-02-24 17:42:14,661 INFO [ ome.formats.importer.ImportConfig] ( Thread-18) Bioformats version: 5.0.0-ice35-b19 revision: cd2a5db date: 24 February 2014 2014-02-24 17:42:14,661 INFO [ ome.formats.importer.ImportCandidates] ( Thread-18) Depth: 4 Metadata Level: MINIMUM 2014-02-24 17:42:14,677 INFO [ loci.formats.ImageReader] ( Thread-18) APNGReader initializing /Users/simon/omero-test/tmp/train-png/1/00039986.png ...
Though there is no sign of any imports happening server-side.
Blitz-0.log is full of errors:
grep ERROR OMERO-CURRENT/var/log/Blitz-0.log ... 2014-02-24 18:06:44,215 ERROR [ omero.util.ServantHolder] (l.Server-2) servantsPerSession reached for 51689bc8-e36f-430d-9f81-11d0182904de: 10000 2014-02-24 18:06:47,298 ERROR [ omero.util.ServantHolder] (l.Server-4) servantsPerSession reached for 51689bc8-e36f-430d-9f81-11d0182904de: 10000 2014-02-24 18:06:47,360 ERROR [ omero.util.ServantHolder] (l.Server-2) servantsPerSession reached for 51689bc8-e36f-430d-9f81-11d0182904de: 10000 2014-02-24 18:06:49,916 ERROR [ omero.util.ServantHolder] (l.Server-3) servantsPerSession reached for 51689bc8-e36f-430d-9f81-11d0182904de: 10000 2014-02-24 18:06:50,043 ERROR [ omero.util.ServantHolder] (l.Server-2) servantsPerSession reached for 51689bc8-e36f-430d-9f81-11d0182904de: 10000 ...
Change History (21)
comment:1 Changed 10 years ago by rleigh
comment:2 Changed 10 years ago by spli
File handles were not cleaned up after 15 minutes. However after Insight was started again and a new import kicked off they were closed.
Note I've seen the same issue with a different server on the 5.1 branch... handles remained open after over 1 hour but seemed to be cleared when a new import started.
comment:3 Changed 10 years ago by jamoore
- Description modified (diff)
- Milestone changed from Unscheduled to 5.0.1
- Priority changed from major to critical
I missed an obvious point earlier: there's no reason that a single fileset should have that many servants / filehandles left open. i.e. there is an additional problem.
However, what I said about the failure to call close on abnormal termination still holds: that will leave an import unclosed which is "A Bad Thing (TM)". We should A) try to do whatever cleanup is possible in such cases (or avoid abnormal terminiations??) and perhaps B) check for closed imports on connect. This will have to be fairly careful to also test the connected client count.
comment:4 Changed 10 years ago by jamoore
rleigh: how did you trigger the same issue?
comment:5 Changed 10 years ago by rleigh
josh: I imported a few directories of images on gretzky with insight. All the imports started failing after there were around ~800 import logs left open. Verified with lsof at the time by spli.
comment:6 Changed 10 years ago by jamoore
Roger: in your case there were only import log files, right? And Simon: in yours there were also data files?
comment:7 Changed 10 years ago by rleigh
IIRC there were files other than import log files also open, but only a handful in comparison to the log files.
comment:8 Changed 10 years ago by spli
I've just reproduced this on ome-ci-c6-07 (5.0.0-ice34-b19, Java 1.7 as opposed to 1.6 on gretzky). Also failed at the same point, after uploading 2492 files (note the importer reports 2496 files as imported, though only 2492 are visible in the web client). 2128 open file handles on the server, almost all to log files.
comment:9 Changed 10 years ago by jamoore
Ok knowing that it's only the log files is somewhat re-assuring: See https://trello.com/c/P4wOlFcd/86-close-sifted-logs
The timeout for these files is currently 30 minutes. You can set it in etc/logback.xml to something lower if you'd like to test. (How long did your whole import take?)
As for what happened to the 4 imported but unseen files, that would require more investigation.
comment:10 Changed 10 years ago by spli
Around 45 minutes, then Insight crashed with around 1600 files still to go.
comment:11 Changed 10 years ago by jamoore
Can you try reproducing with the CLI?
comment:12 Changed 10 years ago by spli
bin/omero import -s gretzky.openmicroscopy.org.uk -c -d 2113 train-png/0/000* worked (4132 files imported)
comment:13 Changed 10 years ago by cblackburn
See Updates to slf4j/logback on openmicroscopy
and Updates to slf4j/logback on bioformats
These should allow a time-out to be set for the SiftingAppender or for markers to be investigated as a way of closing the import log files.
comment:14 Changed 10 years ago by jamoore
Moving the pure logback issue to #12021 (yes lucky bug number!). This issue should be solely used for the insight crash on the large number of files.
comment:15 Changed 10 years ago by bpindelski
- Owner set to bpindelski
comment:16 Changed 10 years ago by bpindelski
- Component changed from Import to Insight
comment:17 Changed 10 years ago by bpindelski
After analysing the heap taken after Insight threw an OOM exception, it can be seen that the biggest memory consumers are 30800 instances of org.openmicroscopy.shoola.agents.fsimporter.util.FileImportComponent? (643 MB heap space) and 72800 instances of org.openmicroscopy.shoola.env.data.util.StatusLabel? (513 MB heap space). I wonder if the first solution to this problem would require introducing some kind of paging system to the importer window?
comment:18 Changed 10 years ago by bpindelski
The method where this excessive allocation happens is org.openmicroscopy.shoola.agents.fsimporter.util.FileImportComponent.insertFiles(Map<File, StatusLabel>).
comment:19 Changed 10 years ago by bpindelski
One solution suggested during the team meeting was to hard-limit the number of files handled by the importer window. This would require adding a setting to one of Insight's config XML files and using it during import. Then a dialog would be displayed and the import aborted.
comment:20 Changed 10 years ago by dlindner
- Owner changed from bpindelski to dlindner
Limit number of files to import to 2000
comment:21 Changed 10 years ago by dlindner
- Resolution set to fixed
- Status changed from new to closed
This is exactly the issue I triggered last Thursday: lsof showed hundreds of open import logs.