Task #10464 (closed)
Opened 11 years ago
Closed 10 years ago
Bug: HDF5 isn't threadsafe by default (Tables openTable() / addData() sometimes returns None, occasionally Ice::ObjectNotExistException)
Reported by: | spli | Owned by: | spli |
---|---|---|---|
Priority: | critical | Milestone: | 5.0.3 |
Component: | Services | Version: | 4.4.8 |
Keywords: | n.a. | Cc: | python-team@…, analysis@… |
Resources: | n.a. | Referenced By: | n.a. |
References: | n.a. | Remaining Time: | 0.0d |
Sprint: | n.a. |
Description
Another mysterious intermittent bug. Occasionally openTable(ofile) returns None in the following (I know name isn't required for getObject if the id is given, but that should be irrelevant):
attrs = {'id': long(tableId)} if tableName: attrs['name'] = tableName ofile = conn.getObject("OriginalFile", attributes = attrs) table = res.openTable(ofile)
Retrying it immediately succeeds (table is a valid handle).
Looking through Tables-0.log for the table OriginalFile? ID (1688):
2013-02-24 20:34:41,226 INFO [ omero.tables.TablesI] (Dummy- 3 ) getTable: 1688 {'omero.session.uuid': '138230fc-0ceb-4736-bab2-1d15b4c5f4d 4', 'omero.client.uuid': '32cb3552-6cc3-4526-be4d-27a51304a7f6'} 2013-02-24 20:34:41,257 INFO [ omero.remote] (Dummy- 11 ) Meth: TablesI.getTable 2013-02-24 20:34:41,258 INFO [ omero.tables.TablesI] (Dummy- 11 ) getTable: 1688 {'omero.session.uuid': '138230fc-0ceb-4736-bab2-1d15b4c5f4d 4', 'omero.client.uuid': '32cb3552-6cc3-4526-be4d-27a51304a7f6'} 2013-02-24 20:34:41,276 INFO [ omero.tables.HdfStorage] (Dummy- 11 ) Size: 0 - Attaching Table-FDDF8540-1E9F-404B-BABF-C87FDDA6335E to /Users/s imon/var/omero.data/Files/Dir-001/1688 2013-02-24 20:34:41,276 INFO [ omero.remote] (Dummy- 11 ) Rslt: FDDF8540-1E9F-404B-BABF-C87FDDA6335E -t @ Tables-0.TablesAdapter 2013-02-24 20:34:41,277 INFO [ omero.tables.HdfStorage] (Dummy- 3 ) Size: 1 - Attaching Table-CDD126B0-9BA6-49E5-939F-5AF655DB335E to /Users/s imon/var/omero.data/Files/Dir-001/1688 2013-02-24 20:34:41,278 INFO [ omero.remote] (Dummy- 3 ) Rslt: CDD126B0-9BA6-49E5-939F-5AF655DB335E -t @ Tables-0.TablesAdapter
Now it looks like it's been opened twice, so the first openTable() call should've worked.
Corresponding section of Blitz-0.log:
2013-02-24 20:34:36,173 INFO [ ome.services.blitz.fire.Registry] (l.Serv er-4) Found 2 table services(s) : [Tables-0 -t @ Tables-0.TablesAdapter, Tables -t:tcp -h 192.168.1.84 -p 65353:tcp -h 192.168.56.1 -p 65353] 2013-02-24 20:34:41,225 INFO [ ome.services.util.ServiceHandler] (l.Server-2) Executor.doWork -- ome.services.blitz.impl.SharedResourcesI.checkOriginalFilePermissions(1688) 2013-02-24 20:34:41,226 INFO [ ome.services.util.ServiceHandler] (l.Server-2) Args: [null, InternalSF@50258881] 2013-02-24 20:34:41,227 INFO [ ome.security.basic.EventHandler] (l.Server-2) Auth: user=2,group=3,event=null(User),sess=138230fc-0ceb-4736-bab2-1d15b4c5f4d4 2013-02-24 20:34:41,239 INFO [ org.perf4j.TimingLogger] (l.Server-2) start[1361738081226] time[13] tag[omero.call.success.ome.services.blitz.impl.SharedResourcesI$4.doWork] 2013-02-24 20:34:41,239 INFO [ ome.services.util.ServiceHandler] (l.Server-2) Rslt: ome.model.core.OriginalFile:Id_1688 2013-02-24 20:34:41,240 INFO [ ome.services.blitz.fire.Registry] (l.Server-2) Found 2 table services(s) : [Tables-0 -t @ Tables-0.TablesAdapter, Tables -t:tcp -h 192.168.1.84 -p 65353:tcp -h 192.168.56.1 -p 65353] 2013-02-24 20:34:41,268 INFO [ ome.services.util.ServiceHandler] (l.Server-3) Meth: interface ome.api.IAdmin.canUpdate 2013-02-24 20:34:41,268 INFO [ ome.services.util.ServiceHandler] (l.Server-3) Args: [ome.model.core.OriginalFile:Id_1688] 2013-02-24 20:34:41,272 INFO [ ome.services.util.ServiceHandler] (l.Server-5) Meth: interface ome.api.IAdmin.canUpdate 2013-02-24 20:34:41,272 INFO [ ome.services.util.ServiceHandler] (l.Server-5) Args: [ome.model.core.OriginalFile:Id_1688] 2013-02-24 20:34:41,273 INFO [ ome.security.basic.EventHandler] (l.Server-5) Auth: user=2,group=3,event=null(User),sess=138230fc-0ceb-4736-bab2-1d15b4c5f4d4 2013-02-24 20:34:41,274 INFO [ ome.security.basic.EventHandler] (l.Server-3) Auth: user=2,group=3,event=null(User),sess=138230fc-0ceb-4736-bab2-1d15b4c5f4d4 2013-02-24 20:34:41,275 INFO [ org.perf4j.TimingLogger] (l.Server-5) start[1361738081272] time[3] tag[omero.call.success.ome.logic.AdminImpl.canUpdate] 2013-02-24 20:34:41,275 INFO [ ome.services.util.ServiceHandler] (l.Server-5) Rslt: true 2013-02-24 20:34:41,277 INFO [ org.perf4j.TimingLogger] (l.Server-3) start[1361738081268] time[9] tag[omero.call.success.ome.logic.AdminImpl.canUpdate] 2013-02-24 20:34:41,277 INFO [ ome.services.util.ServiceHandler] (l.Server-3) Rslt: true
This error seems to only occur when opening multiple tables, but is frequent enough to warrant using this workaround:
def openRetry(ofile, n): for i in xrange(n): t = self.res.openTable(ofile) if t: return t print 'Failed to open table %d (attempt %d)' % \ (ofile.getId().val, i + 1) raise Exception('Failed to open table %d' % ofile.getId().val)
On rare occasions an Ice::ObjectNotExistException is thrown, this is on a local server which should rule out network issues, the Table definitely exists and is accessble by the current user. Tables-0.log:
2013-02-24 00:32:27,853 INFO [ omero.remote] (Dummy- 11 ) Meth: TablesI.getTable 2013-02-24 00:32:27,853 INFO [ omero.tables.TablesI] (Dummy- 11 ) getTable: 1603 {'omero.session.uuid': 'b88e52e2-cea5-4619-9355-0a08026a9a2 d', 'omero.client.uuid': '4d5818ed-1dbe-4719-9ad4-9caccbd52fef'} 2013-02-24 00:32:27,905 INFO [ omero.remote] (Dummy- 11 ) Excp: exception ::Ice::ObjectNotExistException { id = { name = b88e52e2-cea5-4619-9355-0a08026a9a2d c 2013-02-24 00:32:27,906 WARNI [ omero.remote] (Dummy- 11 ) <function getTable at 0x11167faa0> raised a non-ServerError (<class 'Ice.O bjectNotExistException'>): exception ::Ice::ObjectNotExistException { id = { name = b88e52e2-cea5-4619-9355-0a08026a9a2d category = session-4d5818ed-1dbe-4719-9ad4-9caccbd52fef } facet = operation = getAdminService }
Attachments (2)
Change History (16)
comment:1 Changed 11 years ago by spli
- Owner set to spli
comment:2 Changed 11 years ago by spli
- Priority changed from minor to critical
- Version set to 4.4.8
comment:3 Changed 10 years ago by spli
newTable is failing during Pychrm feature calculation:
-! 02/26/14 12:05:25.972 warning: Proxy keep alive failed. Traceback (most recent call last): File "./script", line 267, in <module> runScript() File "./script", line 255, in runScript message += processImages(client, scriptParams) + '\n' File "./script", line 191, in processImages prefixChannel=prefixChannel) File "./script", line 112, in extractFeatures ftb.createTable(ftall.names, ft.version) File "build/bdist.linux-x86_64/egg/OmeroPychrm/PychrmStorage.py", line 132, in createTable File "build/bdist.linux-x86_64/egg/OmeroPychrm/TableConnection.py", line 380, in createNewTable File "build/bdist.linux-x86_64/egg/OmeroPychrm/TableConnection.py", line 240, in newTable File "/opt/hudson/workspace/ANALYSIS-OMERO-5.1-latest-ice34/OMERO.server-5.1.0-m0-214-d2ad149-ice34-b48/lib/python/omero_SharedResources_ice.py", line 243, in newTable return _M_omero.grid.SharedResources._op_newTable.invoke(self, ((repoId, path), _ctx)) Ice.ConnectionLostException: Ice.ConnectionLostException: recv() returned zero
comment:4 Changed 10 years ago by spli
- Milestone changed from Unscheduled to 5.0.3
Try and reproduce with the latest 5 or assume it's fixed. Could it be related to the problem in #8266 ?
comment:5 Changed 10 years ago by jamoore
The categories here certainly don't look as corrupt as they did in the Processor case, but there could be a relationship. There are probably a few minor changes that can be made to SharedResourcesI to test, but only if a repeated invocation eventually shows a failure. Moving to 5.0.3 makes good sense.
comment:6 Changed 10 years ago by spli
#12372 suggest this is still a problem, and that it occurs fairly often.
comment:7 Changed 10 years ago by spli
I can reliably kill OMERO.tables on trout, though it's not clear whether this is a pytables problem, or a combination of OMERO and pytables
# tmp-tables.py import omero, omero.gateway client = omero.client('trout.openmicroscopy.org') client.createSession('username', 'password') conn = omero.gateway.BlitzGateway(client_obj=client) res = conn.getSharedResources() def r(fid, n): t = res.openTable(omero.model.OriginalFileI(fid)) assert t col = t.getHeaders()[0] col.values = [n] t.addData([col]) t.close() def run(total): t = res.newTable(0, 'test.h5') t.initialize([omero.grid.LongColumn('lc')]) f = t.getOriginalFile() t.close() for n in xrange(total): print n r(f.getId().val, n) conn.deleteObjectDirect(f) def run2(totalopen, totalnew): for n in xrange(totalnew): print 'New: %d' % n run(totalopen)
In IPython:
: execfile('tmp-tables.py') : run2(100,100)
Set this running in a few (I was using 5) terminals simultaneously.
You may get Ice Connection lost errors:
UnknownLocalException: exception ::Ice::UnknownLocalException { unknown = TcpTransceiver.cpp:221: Ice::ConnectionLostException: connection lost: recv() returned zero } > /Users/simon/omero-test/OMERO-51/lib/python/omero_Tables_ice.py(1067)addData() 1066 def addData(self, cols, _ctx=None): -> 1067 return _M_omero.grid.Table._op_addData.invoke(self, ((cols, ), _ctx)) 1068
Or the more serious looking error in master.err:
-! 06/20/14 15:34:58.155 OMERO.Glacier2: warning: dispatch exception: TcpTransceiver.cpp:221: Ice::ConnectionLostException: connection lost: recv() returned zero identity: 70F68E66-6E99-4D3F-9EFD-597777749AB9 facet: operation: addData remote host: 10.31.1.79 remote port: 54076 *** glibc detected *** python: double free or corruption (fasttop): 0x00007f00550446f0 *** ======= Backtrace: ========= /lib64/libc.so.6[0x39d9e76166] ...
See the attached master.err for the full backtrace
comment:8 Changed 10 years ago by spli
- Summary changed from Bug: Tables openTable() sometimes returns None, occasionally Ice::ObjectNotExistException to Bug: Tables openTable() / addData() sometimes returns None, occasionally Ice::ObjectNotExistException
comment:9 Changed 10 years ago by spli
- Remaining Time set to 2
- Status changed from new to accepted
comment:10 Changed 10 years ago by spli
I can reproduce this locally
$ sw_vers ProductName: Mac OS X ProductVersion: 10.9.3 BuildVersion: 13D65 $ java -version java version "1.8.0_05" Java(TM) SE Runtime Environment (build 1.8.0_05-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode) $ python --version Python 2.7.7 $ pip freeze Cython==0.20.1 Genshi==0.7 Pillow==2.4.0 PyGithub==1.25.0 argparse==1.2.1 backports.ssl-match-hostname==3.4.0.2 flake8==2.1.0 gnureadline==6.3.3 ipython==2.1.0 matplotlib==1.3.1 mccabe==0.2.1 nose==1.3.3 numexpr==2.4 numpy==1.8.1 pep8==1.5.7 psutil==2.1.1 pyflakes==0.8.1 pyparsing==2.0.2 python-dateutil==2.2 scipy==0.14.0 six==1.7.2 sleekxmpp==1.3.1 tables==3.1.1 tornado==3.2.2 virtualenv==1.11.6 wsgiref==0.1.2 yaclifw==0.1.1
Testing script: https://gist.github.com/manics/b03acf51254b16c30d3b (ran python tables-crash.py 100 100 from five terminals concurrently) causes a segfault in OMERO.tables, Python crash log attached
comment:11 Changed 10 years ago by spli
- Cc python-team@… added
- Summary changed from Bug: Tables openTable() / addData() sometimes returns None, occasionally Ice::ObjectNotExistException to Bug: HDF5 isn't threadsafe by default (Tables openTable() / addData() sometimes returns None, occasionally Ice::ObjectNotExistException)
I think this is a problem with the HDF5 library not being threadsafe by default.
brew uninstall hdf5 brew install hdf5 --enable-threadsafe
fixes things locally (python tables-crash.py 100 100 in 5 windows completes successfully).
--enable-threadsafe isn't enabled by default because it conflicts with the C++ and fortran bindings (amongst others)
- Homebrew: https://github.com/Homebrew/homebrew-science/blob/038dab515f0961de7a846a6e215708928118115a/hdf5.rb#L39
- Fedora/EPEL: http://pkgs.fedoraproject.org/cgit/hdf5.git/tree/hdf5.spec?h=el6&id=a24d683a1f871865935d0f364734becac932cedd#n123
- rpmforge (version on trout via spacewalk): https://github.com/repoforge/rpms/blob/b2ddc7a635b914c81b0a038168a6faebf30915aa/specs/hdf5/hdf5.spec#L62
comment:12 Changed 10 years ago by spli
Possible fix which should avoid a need for a custom compilation options with HDF5: https://github.com/openmicroscopy/openmicroscopy/pull/2688
comment:13 Changed 10 years ago by jamoore
Note: I've been stuck on:
2014-06-30 07:37:04,920 INFO [ omero.remote] (Dummy-3 ) Excp: Expected bytes, got path Traceback (most recent call last): File "/opt/ome1/dist/lib/python/omero/util/decorators.py", line 61, in exc_handler rv = func(*args, **kwargs) File "/opt/ome1/dist/lib/python/omero/util/decorators.py", line 39, in handler rv = func(*args, **kwargs) File "/opt/ome1/dist/lib/python/omero/tables.py", line 915, in getTable storage = HDFLIST.getOrCreate(file_path) File "/opt/ome1/dist/lib/python/omero/util/decorators.py", line 81, in with_lock return func(*args, **kwargs) File "/opt/ome1/dist/lib/python/omero/tables.py", line 141, in getOrCreate return HdfStorage(hdfpath) # Adds itself. File "/opt/ome1/dist/lib/python/omero/tables.py", line 184, in __init__ self.__hdf_file = HDFLIST.addOrThrow(file_path, self) File "/opt/ome1/dist/lib/python/omero/util/decorators.py", line 81, in with_lock return func(*args, **kwargs) File "/opt/ome1/dist/lib/python/omero/tables.py", line 125, in addOrThrow hdffile = hdfstorage.openfile("a") File "/opt/ome1/dist/lib/python/omero/tables.py", line 217, in openfile title="OMERO HDF Measurement Storage", rootUEP="/") File "/usr/lib/python2.7/dist-packages/tables/_past.py", line 35, in oldfunc return obj(*args, **kwargs) File "/usr/lib/python2.7/dist-packages/tables/file.py", line 318, in open_file return File(filename, mode, title, root_uep, filters, **kwargs) File "/usr/lib/python2.7/dist-packages/tables/file.py", line 791, in __init__ self._g_new(filename, mode, **params) File "hdf5extension.pyx", line 327, in tables.hdf5extension.File._g_new (tables/hdf5extension.c:3498) File "utilsextension.pyx", line 569, in tables.utilsextension.encode_filename (tables/utilsextension.c:4140) TypeError: Expected bytes, got path
which turns out to be a missing str(self.__hdf_path). I'll open a PR against your PR. Testing is now progressing.
comment:14 Changed 10 years ago by spli
- Remaining Time changed from 2 to 0
- Resolution set to fixed
- Status changed from accepted to closed
mtbcarroll came across a suspiciously similarly sounding issue with table.initialize
And see also #11116 for a similar issue with table.read.