Task #10662

Opened 11 years ago

Closed 9 years ago

ome-mage / emdb instability

Reported by: khgillen Owned by: wmoore
This has existed as Mantis ticket: https://mantis.lifesci.dundee.ac.uk/view.php?id=97773 since the server was provisioned.

Server provisioning and deployment took place in Dec 2012. Since then, we have had intermittent outages that appear to correct themselves, with each downtime usually lasting around 5 to 15 minutes.

Checks at a 5m resolution:

Uptime: 93.50%

Downtime: 1d 23h 30m

Number of Downtimes
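As a sanity check on these figures, here is a minimal sketch. It assumes the 1d 23h 30m value is the total downtime that goes with the 93.50% uptime; the report does not label it. On that assumption the two numbers imply a reporting window of roughly 30 days.

```python
from datetime import timedelta

uptime_fraction = 0.9350  # "Uptime: 93.50%" from the report
# Assumption: the unlabeled "1d 23h 30m" is the total downtime.
downtime = timedelta(days=1, hours=23, minutes=30)

# downtime = (1 - uptime) * window  =>  window = downtime / (1 - uptime)
window = downtime / (1 - uptime_fraction)
window_days = window.total_seconds() / 86400

# At a 5-minute check resolution, how many failed checks that downtime spans.
failed_checks = downtime // timedelta(minutes=5)

print("implied window: %.1f days" % window_days)
print("failed 5-minute checks: %d" % failed_checks)
```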

Traceback (most recent call last):

File "/mage/staging/OMERO-CURRENT/lib/python/django/core/handlers/base.py", line 92, in get_response

response = callback(request, *callback_args, **callback_kwargs)

File "/mage/staging/OMERO-CURRENT/lib/python/omeroweb/webemdb/views.py", line 820, in index

conn = getConnection(request)

File "/mage/staging/OMERO-CURRENT/lib/python/omeroweb/webemdb/views.py", line 1126, in getConnection

logger.debug('emdb connection: %s server %s' % (conn._sessionUuid, blitz.host))

AttributeError: 'NoneType' object has no attribute '_sessionUuid'
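The traceback shows the debug log line in getConnection dereferencing conn while it is None. A minimal sketch of a guard follows. It is not the actual webemdb code: get_connection_safely, ConnectionUnavailable, the logger name and the create_connection callable are all hypothetical stand-ins.

```python
import logging

logger = logging.getLogger("webemdb-sketch")  # hypothetical logger name


class ConnectionUnavailable(Exception):
    """Raised when no server connection could be created (hypothetical)."""


def get_connection_safely(create_connection, host):
    """Call create_connection() and log session details only on success.

    create_connection stands in for whatever returns the gateway object,
    or None on failure; it is not the real webemdb getConnection.
    """
    conn = create_connection()
    if conn is None:
        # Fail with a clear message instead of an AttributeError on None.
        logger.error("emdb connection failed for server %s", host)
        raise ConnectionUnavailable("could not connect to %s" % host)
    logger.debug("emdb connection: %s server %s", conn._sessionUuid, host)
    return conn
```

A view could catch ConnectionUnavailable and return an error page, rather than letting the request fail with an unhandled AttributeError.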
Another stacktrace:

GET:<QueryDict: {}>,
POST:<QueryDict: {}>,
META:{'DOCUMENT_ROOT': '/var/www/html',

'HTTP_ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'HTTP_ACCEPT_ENCODING': 'gzip, deflate',
'HTTP_CONNECTION': 'keep-alive',
'HTTP_HOST': 'emdb.openmicroscopy.org.uk',
'HTTP_USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.29.13 (KHTML, like Gecko) Version/6.0.4 Safari/536.29.13',
'PATH': '/sbin:/usr/sbin:/bin:/usr/bin',
'PATH_INFO': u'/webemdb/',
'PATH_TRANSLATED': '/mage/staging/OMERO-CURRENT/var/omero.fcgi/webemdb/',
'REMOTE_PORT': '59698',
'REQUEST_URI': '/webemdb/',
'SCRIPT_FILENAME': '/mage/staging/OMERO-CURRENT/var/omero.fcgi',
'SCRIPT_URI': 'http://emdb.openmicroscopy.org.uk/webemdb/',
'SCRIPT_URL': '/webemdb/',
'SERVER_ADMIN': 'root@localhost',
'SERVER_NAME': 'emdb.openmicroscopy.org.uk',
'SERVER_PORT': '80',
'SERVER_SOFTWARE': 'Apache/2.2.15 (CentOS)',
'wsgi.errors': <flup.server.fcgi_base.TeeOutputStream object at 0x4238990>,
'wsgi.input': <flup.server.fcgi_base.InputStream object at 0x5363650>,
'wsgi.multiprocess': True,
'wsgi.multithread': False,
'wsgi.run_once': False,
'wsgi.url_scheme': 'http',
'wsgi.version': (1, 0)}>

We seem to be getting logging statements from the code below when the connection fails.

Just turned debug on so we get the stack traces from here too:

$ bin/omero config set omero.web.debug true

def _createConnection (server_id, sUuid=None, username=None, passwd=None, host=None, port=None, retry=True, group=None, try_super=False, secure=False, anonymous=False, useragent=None):
    """
    Attempts to create a L{omero.gateway.BlitzGateway} connection.
    Tries to join an existing session for the specified user, using sUuid.
    @param server_id:   Way of referencing the server, used in connection dict keys. Int or String
    @param sUuid:       Session ID - used for attempts to join sessions etc without password
    @param username:    User name to log on with
    @param passwd:      Password
    @param host:        Host name
    @param port:        Port number
    @param retry:       Boolean. If True, retry once after purging the connector pool
    @param group:       String? TODO: parameter is ignored.
    @param try_super:   If True, try to log on as super user, 'system' group
    @param secure:      If True, use an encrypted connection
    @param anonymous:   Boolean
    @param useragent:   Log which python clients use this connection. E.g. 'OMERO.webadmin'
    @return:            The connection, or None if connecting fails and retry is False
    @rtype:             L{omero.gateway.BlitzGateway}
    """
    try:
        blitzcon = client_wrapper(username, passwd, host=host, port=port, group=None, try_super=try_super, secure=secure, anonymous=anonymous, useragent=useragent)
        # Join the existing session identified by sUuid, as the docstring describes.
        blitzcon.connect(sUuid=sUuid)
        blitzcon.server_id = server_id
        blitzcon.user = UserProxy(blitzcon)
        if blitzcon._anonymous and hasattr(blitzcon.c, 'onEventLogs'):
            logger.debug('Connecting weblitz_cache to eventslog')
            def eventlistener (e):
                return webgateway_cache.eventListener(server_id, e)
            # Register the listener so that it is actually called.
            blitzcon.c.onEventLogs(eventlistener)
        return blitzcon
    except Exception:
        if not retry:
            return None
        logger.error("Critical error during connect, retrying after _purge")
        # Free pooled connectors before the single retry, as the log message states.
        _purge(force=True)
        return _createConnection(server_id, sUuid, username, passwd, retry=False, host=host, port=port, group=None, try_super=try_super, secure=secure, anonymous=anonymous, useragent=useragent)

def _purge (force=False):
    if force or len(connectors) > CONNECTOR_POOL_SIZE:
        # Copy the keys so the dict can be modified inside the loop.
        keys = list(connectors.keys())
        # Drop a CONNECTOR_POOL_KEEP fraction of the pooled connectors.
        for i in range(int(len(connectors)*CONNECTOR_POOL_KEEP)):
            connectors.pop(keys[i])
        logger.info('reached connector_pool_size (%d), size after purge: (%d)' %
                    (CONNECTOR_POOL_SIZE, len(connectors)))
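
To illustrate what the purge above does to the pool, here is a self-contained sketch using a plain dict. The CONNECTOR_POOL_SIZE and CONNECTOR_POOL_KEEP values are assumptions chosen for illustration, not the deployed settings, and purge is a hypothetical standalone variant that takes the pool as an argument.

```python
CONNECTOR_POOL_SIZE = 4     # assumed value, for illustration only
CONNECTOR_POOL_KEEP = 0.75  # assumed value, for illustration only


def purge(connectors, force=False):
    """Pop int(len * CONNECTOR_POOL_KEEP) entries and return them.

    On Python 3.7+ dicts keep insertion order, so the earliest-inserted
    entries are the ones removed.
    """
    removed = []
    if force or len(connectors) > CONNECTOR_POOL_SIZE:
        keys = list(connectors.keys())
        for key in keys[:int(len(connectors) * CONNECTOR_POOL_KEEP)]:
            removed.append(connectors.pop(key))
    return removed


pool = {"session-%d" % i: object() for i in range(8)}
purge(pool)       # 8 > 4, so int(8 * 0.75) = 6 entries are dropped
print(len(pool))  # prints 2
```

Note that, as the loop is written, CONNECTOR_POOL_KEEP is the fraction of connectors removed, so a value of 0.75 leaves only a quarter of the pool in place.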

Can we close that ticket?
I will assume we can.

comment:5 Changed 9 years ago by wmoore

Yep - all fine now.

