Notice: In order to edit this ticket you need to be either: a Product Owner, The owner or the reporter of the ticket, or, in case of a Task not yet assigned, a team_member"

Task #10464 (closed)

Opened 8 years ago

Closed 6 years ago

Bug: HDF5 isn't threadsafe by default (Tables openTable() / addData() sometimes returns None, occasionally Ice::ObjectNotExistException)

Reported by: spli Owned by: spli
Priority: critical Milestone: 5.0.3
Component: Services Version: 4.4.8
Keywords: n.a. Cc: python-team@…, analysis@…
Resources: n.a. Referenced By: n.a.
References: n.a. Remaining Time: 0.0d
Sprint: n.a.

Description

Another mysterious intermittent bug. Occasionally openTable(ofile) returns None in the following (I know name isn't required for getObject if the id is given, but that should be irrelevant):

attrs = {'id': long(tableId)}
if tableName:
    attrs['name'] = tableName
ofile = conn.getObject("OriginalFile", attributes = attrs)
table = res.openTable(ofile)

Retrying it immediately succeeds (table is a valid handle).

Looking through Tables-0.log for the table OriginalFile? ID (1688):

2013-02-24 20:34:41,226 INFO  [                    omero.tables.TablesI] (Dummy-
3   ) getTable: 1688 {'omero.session.uuid': '138230fc-0ceb-4736-bab2-1d15b4c5f4d
4', 'omero.client.uuid': '32cb3552-6cc3-4526-be4d-27a51304a7f6'}
2013-02-24 20:34:41,257 INFO  [                            omero.remote] (Dummy-
11  )  Meth: TablesI.getTable
2013-02-24 20:34:41,258 INFO  [                    omero.tables.TablesI] (Dummy-
11  ) getTable: 1688 {'omero.session.uuid': '138230fc-0ceb-4736-bab2-1d15b4c5f4d
4', 'omero.client.uuid': '32cb3552-6cc3-4526-be4d-27a51304a7f6'}
2013-02-24 20:34:41,276 INFO  [                 omero.tables.HdfStorage] (Dummy-
11  ) Size: 0 - Attaching Table-FDDF8540-1E9F-404B-BABF-C87FDDA6335E to /Users/s
imon/var/omero.data/Files/Dir-001/1688
2013-02-24 20:34:41,276 INFO  [                            omero.remote] (Dummy-
11  )  Rslt: FDDF8540-1E9F-404B-BABF-C87FDDA6335E -t @ Tables-0.TablesAdapter
2013-02-24 20:34:41,277 INFO  [                 omero.tables.HdfStorage] (Dummy-
3   ) Size: 1 - Attaching Table-CDD126B0-9BA6-49E5-939F-5AF655DB335E to /Users/s
imon/var/omero.data/Files/Dir-001/1688
2013-02-24 20:34:41,278 INFO  [                            omero.remote] (Dummy-
3   )  Rslt: CDD126B0-9BA6-49E5-939F-5AF655DB335E -t @ Tables-0.TablesAdapter

Now it looks like it's been opened twice, so the first openTable() call should've worked.

Corresponding section of Blitz-0.log:

2013-02-24 20:34:36,173 INFO  [        ome.services.blitz.fire.Registry] (l.Serv
er-4) Found 2 table services(s) : [Tables-0 -t @ Tables-0.TablesAdapter, Tables 
-t:tcp -h 192.168.1.84 -p 65353:tcp -h 192.168.56.1 -p 65353]
2013-02-24 20:34:41,225 INFO  [        ome.services.util.ServiceHandler] (l.Server-2)  Executor.doWork -- ome.services.blitz.impl.SharedResourcesI.checkOriginalFilePermissions(1688)
2013-02-24 20:34:41,226 INFO  [        ome.services.util.ServiceHandler] (l.Server-2)  Args:    [null, InternalSF@50258881]
2013-02-24 20:34:41,227 INFO  [         ome.security.basic.EventHandler] (l.Server-2)  Auth:    user=2,group=3,event=null(User),sess=138230fc-0ceb-4736-bab2-1d15b4c5f4d4
2013-02-24 20:34:41,239 INFO  [                 org.perf4j.TimingLogger] (l.Server-2) start[1361738081226] time[13] tag[omero.call.success.ome.services.blitz.impl.SharedResourcesI$4.doWork]
2013-02-24 20:34:41,239 INFO  [        ome.services.util.ServiceHandler] (l.Server-2)  Rslt:    ome.model.core.OriginalFile:Id_1688
2013-02-24 20:34:41,240 INFO  [        ome.services.blitz.fire.Registry] (l.Server-2) Found 2 table services(s) : [Tables-0 -t @ Tables-0.TablesAdapter, Tables -t:tcp -h 192.168.1.84 -p 65353:tcp -h 192.168.56.1 -p 65353]
2013-02-24 20:34:41,268 INFO  [        ome.services.util.ServiceHandler] (l.Server-3)  Meth:    interface ome.api.IAdmin.canUpdate
2013-02-24 20:34:41,268 INFO  [        ome.services.util.ServiceHandler] (l.Server-3)  Args:    [ome.model.core.OriginalFile:Id_1688]
2013-02-24 20:34:41,272 INFO  [        ome.services.util.ServiceHandler] (l.Server-5)  Meth:    interface ome.api.IAdmin.canUpdate
2013-02-24 20:34:41,272 INFO  [        ome.services.util.ServiceHandler] (l.Server-5)  Args:    [ome.model.core.OriginalFile:Id_1688]
2013-02-24 20:34:41,273 INFO  [         ome.security.basic.EventHandler] (l.Server-5)  Auth:    user=2,group=3,event=null(User),sess=138230fc-0ceb-4736-bab2-1d15b4c5f4d4
2013-02-24 20:34:41,274 INFO  [         ome.security.basic.EventHandler] (l.Server-3)  Auth:    user=2,group=3,event=null(User),sess=138230fc-0ceb-4736-bab2-1d15b4c5f4d4
2013-02-24 20:34:41,275 INFO  [                 org.perf4j.TimingLogger] (l.Server-5) start[1361738081272] time[3] tag[omero.call.success.ome.logic.AdminImpl.canUpdate]
2013-02-24 20:34:41,275 INFO  [        ome.services.util.ServiceHandler] (l.Server-5)  Rslt:    true
2013-02-24 20:34:41,277 INFO  [                 org.perf4j.TimingLogger] (l.Server-3) start[1361738081268] time[9] tag[omero.call.success.ome.logic.AdminImpl.canUpdate]
2013-02-24 20:34:41,277 INFO  [        ome.services.util.ServiceHandler] (l.Server-3)  Rslt:    true

This error seems to only occur when opening multiple tables, but is frequent enough to warrant using this workaround:

def openRetry(ofile, n):
    for i in xrange(n):
        t = self.res.openTable(ofile)
        if t:
            return t
        print 'Failed to open table %d (attempt %d)' % \
            (ofile.getId().val, i + 1)
    raise Exception('Failed to open table %d' % ofile.getId().val)

On rare occasions an Ice::ObjectNotExistException is thrown, this is on a local server which should rule out network issues, the Table definitely exists and is accessble by the current user. Tables-0.log:

2013-02-24 00:32:27,853 INFO  [                            omero.remote] (Dummy-
11  )  Meth: TablesI.getTable
2013-02-24 00:32:27,853 INFO  [                    omero.tables.TablesI] (Dummy-
11  ) getTable: 1603 {'omero.session.uuid': 'b88e52e2-cea5-4619-9355-0a08026a9a2
d', 'omero.client.uuid': '4d5818ed-1dbe-4719-9ad4-9caccbd52fef'}
2013-02-24 00:32:27,905 INFO  [                            omero.remote] (Dummy-
11  )  Excp: exception ::Ice::ObjectNotExistException
{   
    id =
    {   
        name = b88e52e2-cea5-4619-9355-0a08026a9a2d
        c
2013-02-24 00:32:27,906 WARNI [                            omero.remote] (Dummy-
11  ) <function getTable at 0x11167faa0> raised a non-ServerError (<class 'Ice.O
bjectNotExistException'>): exception ::Ice::ObjectNotExistException
{   
    id =
    {   
        name = b88e52e2-cea5-4619-9355-0a08026a9a2d
        category = session-4d5818ed-1dbe-4719-9ad4-9caccbd52fef
    }
    facet =
    operation = getAdminService
}

Attachments (2)

master.err (43.3 KB) - added by spli 6 years ago.
trout master.err
python2.7_2014-06-25-160313_ls27151.crash (66.1 KB) - added by spli 6 years ago.
OS X python tables crash log

Download all attachments as: .zip

Change History (16)

comment:1 Changed 8 years ago by spli

  • Owner set to spli

comment:2 Changed 7 years ago by spli

  • Priority changed from minor to critical
  • Version set to 4.4.8

mtbcarroll came across a suspiciously similarly sounding issue with table.initialize

  File "/tmp/ve/lib/python2.7/site-packages/pyslid/features.py", line 724, in link
    table.initialize(columns)
AttributeError: 'NoneType' object has no attribute 'initialize'

And see also #11116 for a similar issue with table.read.

comment:3 Changed 7 years ago by spli

newTable is failing during Pychrm feature calculation:

-! 02/26/14 12:05:25.972 warning: Proxy keep alive failed.
Traceback (most recent call last):
  File "./script", line 267, in <module>
    runScript()
  File "./script", line 255, in runScript
    message += processImages(client, scriptParams) + '\n'
  File "./script", line 191, in processImages
    prefixChannel=prefixChannel)
  File "./script", line 112, in extractFeatures
    ftb.createTable(ftall.names, ft.version)
  File "build/bdist.linux-x86_64/egg/OmeroPychrm/PychrmStorage.py", line 132, in createTable
  File "build/bdist.linux-x86_64/egg/OmeroPychrm/TableConnection.py", line 380, in createNewTable
  File "build/bdist.linux-x86_64/egg/OmeroPychrm/TableConnection.py", line 240, in newTable
  File "/opt/hudson/workspace/ANALYSIS-OMERO-5.1-latest-ice34/OMERO.server-5.1.0-m0-214-d2ad149-ice34-b48/lib/python/omero_SharedResources_ice.py", line 243, in newTable
    return _M_omero.grid.SharedResources._op_newTable.invoke(self, ((repoId, path), _ctx))
Ice.ConnectionLostException: Ice.ConnectionLostException:
recv() returned zero

comment:4 Changed 6 years ago by spli

  • Milestone changed from Unscheduled to 5.0.3

Try and reproduce with the latest 5 or assume it's fixed. Could it be related to the problem in #8266 ?

comment:5 Changed 6 years ago by jamoore

The categories here certainly don't look as corrupt as they did in the Processor case, but there could be a relationship. There are probably a few minor changes that can be made to SharedResourcesI to test, but only if a repeated invocation eventually shows a failure. Moving to 5.0.3 makes good sense.

comment:6 Changed 6 years ago by spli

#12372 suggest this is still a problem, and that it occurs fairly often.

comment:7 Changed 6 years ago by spli

I can reliably kill OMERO.tables on trout, though it's not clear whether this is a pytables problem, or a combination of OMERO and pytables

# tmp-tables.py

import omero, omero.gateway
client = omero.client('trout.openmicroscopy.org')
client.createSession('username', 'password')
conn = omero.gateway.BlitzGateway(client_obj=client)
res = conn.getSharedResources()


def r(fid, n):
    t = res.openTable(omero.model.OriginalFileI(fid))
    assert t

    col = t.getHeaders()[0]
    col.values = [n]
    t.addData([col])
    t.close()

def run(total):
    t = res.newTable(0, 'test.h5')
    t.initialize([omero.grid.LongColumn('lc')])
    f = t.getOriginalFile()
    t.close()

    for n in xrange(total):
        print n
        r(f.getId().val, n)
    conn.deleteObjectDirect(f)

def run2(totalopen, totalnew):
    for n in xrange(totalnew):
        print 'New: %d' % n
        run(totalopen)

In IPython:

: execfile('tmp-tables.py')
: run2(100,100)

Set this running in a few (I was using 5) terminals simultaneously.

You may get Ice Connection lost errors:

UnknownLocalException: exception ::Ice::UnknownLocalException
{
    unknown = TcpTransceiver.cpp:221: Ice::ConnectionLostException:
connection lost: recv() returned zero
}
> /Users/simon/omero-test/OMERO-51/lib/python/omero_Tables_ice.py(1067)addData()
   1066         def addData(self, cols, _ctx=None):
-> 1067             return _M_omero.grid.Table._op_addData.invoke(self, ((cols, ), _ctx))
   1068

Or the more serious looking error in master.err:

-! 06/20/14 15:34:58.155 OMERO.Glacier2: warning: dispatch exception: TcpTransceiver.cpp:221: Ice::ConnectionLostException:
   connection lost: recv() returned zero
   identity: 70F68E66-6E99-4D3F-9EFD-597777749AB9
   facet: 
   operation: addData
   remote host: 10.31.1.79 remote port: 54076
*** glibc detected *** python: double free or corruption (fasttop): 0x00007f00550446f0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x39d9e76166]
...

See the attached master.err for the full backtrace

Changed 6 years ago by spli

trout master.err

comment:8 Changed 6 years ago by spli

  • Summary changed from Bug: Tables openTable() sometimes returns None, occasionally Ice::ObjectNotExistException to Bug: Tables openTable() / addData() sometimes returns None, occasionally Ice::ObjectNotExistException

comment:9 Changed 6 years ago by spli

  • Remaining Time set to 2
  • Status changed from new to accepted

comment:10 Changed 6 years ago by spli

I can reproduce this locally

$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.9.3
BuildVersion:   13D65

$ java -version
java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

$ python --version
Python 2.7.7

$ pip freeze
Cython==0.20.1
Genshi==0.7
Pillow==2.4.0
PyGithub==1.25.0
argparse==1.2.1
backports.ssl-match-hostname==3.4.0.2
flake8==2.1.0
gnureadline==6.3.3
ipython==2.1.0
matplotlib==1.3.1
mccabe==0.2.1
nose==1.3.3
numexpr==2.4
numpy==1.8.1
pep8==1.5.7
psutil==2.1.1
pyflakes==0.8.1
pyparsing==2.0.2
python-dateutil==2.2
scipy==0.14.0
six==1.7.2
sleekxmpp==1.3.1
tables==3.1.1
tornado==3.2.2
virtualenv==1.11.6
wsgiref==0.1.2
yaclifw==0.1.1

Testing script: https://gist.github.com/manics/b03acf51254b16c30d3b (ran python tables-crash.py 100 100 from five terminals concurrently) causes a segfault in OMERO.tables, Python crash log attached

Changed 6 years ago by spli

OS X python tables crash log

comment:11 Changed 6 years ago by spli

  • Cc python-team@… added
  • Summary changed from Bug: Tables openTable() / addData() sometimes returns None, occasionally Ice::ObjectNotExistException to Bug: HDF5 isn't threadsafe by default (Tables openTable() / addData() sometimes returns None, occasionally Ice::ObjectNotExistException)

I think this is a problem with the HDF5 library not being threadsafe by default.

brew uninstall hdf5
brew install hdf5 --enable-threadsafe

fixes things locally (python tables-crash.py 100 100 in 5 windows completes successfully).

--enable-threadsafe isn't enabled by default because it conflicts with the C++ and fortran bindings (amongst others)

See also http://www.hdfgroup.org/hdf5-quest.html#tsafe

Last edited 6 years ago by spli (previous) (diff)

comment:12 Changed 6 years ago by spli

Possible fix which should avoid a need for a custom compilation options with HDF5: https://github.com/openmicroscopy/openmicroscopy/pull/2688

Last edited 6 years ago by spli (previous) (diff)

comment:13 Changed 6 years ago by jamoore

Note: I've been stuck on:

2014-06-30 07:37:04,920 INFO  [                            omero.remote] (Dummy-3   )  Excp: Expected bytes, got path
Traceback (most recent call last):
  File "/opt/ome1/dist/lib/python/omero/util/decorators.py", line 61, in exc_handler
    rv = func(*args, **kwargs)
  File "/opt/ome1/dist/lib/python/omero/util/decorators.py", line 39, in handler
    rv = func(*args, **kwargs)
  File "/opt/ome1/dist/lib/python/omero/tables.py", line 915, in getTable
    storage = HDFLIST.getOrCreate(file_path)
  File "/opt/ome1/dist/lib/python/omero/util/decorators.py", line 81, in with_lock
    return func(*args, **kwargs)
  File "/opt/ome1/dist/lib/python/omero/tables.py", line 141, in getOrCreate
    return HdfStorage(hdfpath) # Adds itself.
  File "/opt/ome1/dist/lib/python/omero/tables.py", line 184, in __init__
    self.__hdf_file = HDFLIST.addOrThrow(file_path, self)
  File "/opt/ome1/dist/lib/python/omero/util/decorators.py", line 81, in with_lock
    return func(*args, **kwargs)
  File "/opt/ome1/dist/lib/python/omero/tables.py", line 125, in addOrThrow
    hdffile = hdfstorage.openfile("a")
  File "/opt/ome1/dist/lib/python/omero/tables.py", line 217, in openfile
    title="OMERO HDF Measurement Storage", rootUEP="/")
  File "/usr/lib/python2.7/dist-packages/tables/_past.py", line 35, in oldfunc
    return obj(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tables/file.py", line 318, in open_file
    return File(filename, mode, title, root_uep, filters, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tables/file.py", line 791, in __init__
    self._g_new(filename, mode, **params)
  File "hdf5extension.pyx", line 327, in tables.hdf5extension.File._g_new (tables/hdf5extension.c:3498)
  File "utilsextension.pyx", line 569, in tables.utilsextension.encode_filename (tables/utilsextension.c:4140)
TypeError: Expected bytes, got path

which turns out to be a missing str(self.__hdf_path). I'll open a PR against your PR. Testing is now progressing.

comment:14 Changed 6 years ago by spli

  • Remaining Time changed from 2 to 0
  • Resolution set to fixed
  • Status changed from accepted to closed
Note: See TracTickets for help on using tickets. You may also have a look at Agilo extensions to the ticket.

1.3.13-PRO © 2008-2011 Agilo Software all rights reserved (this page was served in: 0.77734 sec.)

We're Hiring!