Warning: Can't synchronize with repository "(default)" (/home/git/ome.git does not appear to be a Git repository.). Look in the Trac log for more information.
Notice: In order to edit this ticket you need to be either: a Product Owner, The owner or the reporter of the ticket, or, in case of a Task not yet assigned, a team_member"

Task #8266 (closed)

Opened 12 years ago

Closed 10 years ago

Last modified 10 years ago

Bug: "No Processor Available"

Reported by: wmoore Owned by: jamoore
Priority: critical Milestone: 5.0.2
Component: Scripting Version: 4.4.8
Keywords: BACKPORT-4.4 Cc: wmoore, cxallan, spli, jamoore, drussell-x
Resources: n.a. Referenced By: n.a.
References: n.a. Remaining Time: 0.0d
Sprint: n.a.

Description (last modified by jmoore)

There's been some investigation of this, so let's put everything we know in one place...

nightshade was suffering from this error for all scripts, E.g.

devteam:

looks like it's completely dead in the water:
2012-02-09 20:44:24,290 INFO  [                       omero.util.Server] (MainThread) ********************************************************************************
2012-02-09 20:44:24,297 INFO  [                       omero.util.Server] (MainThread) Starting
2012-02-09 20:44:24,415 INFO  [                    omero.util.Resources] (Thread-2  ) Starting
2012-02-09 20:45:22,610 INFO  [                             omero.utils] (MainThread) Failed to get session on attempt 1
2012-02-09 20:45:32,612 INFO  [                             omero.utils] (MainThread) Failed to get session on attempt 2
2012-02-09 20:45:42,790 INFO  [                             omero.utils] (MainThread) Failed to get session on attempt 3
2012-02-09 20:45:52,792 INFO  [                             omero.utils] (MainThread) Failed to get session on attempt 4
2012-02-09 20:46:02,794 INFO  [                             omero.utils] (MainThread) Failed to get session on attempt 5
2012-02-09 20:46:12,913 INFO  [              omero.processor.ProcessorI] (MainThread) Registering processor Processor -t @ Processor-0.ProcessorAdapter
2012-02-09 20:46:12,946 INFO  [                       omero.util.Server] (MainThread) Entering main loop

wait that's february.

This was fixed with

$ bin/omero admin ice "server stop Processor-0"

Possible workaround: add a maxExecutions parameter so that after a number of jobs the process restarts (via icegridnode). This would possible require some locking, but if 2 processors were running, this would likely not be a problem. This would only until processor.py is replaced with an MQ #7902.

Attachments (6)

Change History (65)

comment:2 Changed 12 years ago by jmoore

  • Priority changed from major to critical

comment:3 Changed 12 years ago by jmoore

  • Milestone changed from OMERO-Beta4.4 to OMERO-Beta4.4.x

I've opened a PR that may have a little effect on this issue: https://github.com/openmicroscopy/openmicroscopy/pull/220

It's unlikely to make it go away, but at least will have more information going forward.

My suspicion is that a lot of the issues are caused by network issues such that the main server is simply not able to contact the processor service. If that's the case, there's little we can do without a major re-architecting.

comment:4 Changed 11 years ago by jmoore

  • Cc spli added
  • Description modified (diff)

comment:5 Changed 11 years ago by spli

I don't think it's purely a network issue. I've just had a case where omero script params 2132 correctly lists the parameters but omero script params 2133 returns Could not get params: No processor available! [0 response(s)]. Repeated to check it wasn't a one off, was fixed after restarting server.

comment:6 Changed 11 years ago by jmoore

Simon, agreed. Sorry, should have added an update status to this. Chris has also seen behavior that's pointing towards Ice reaching some time of limit (file handles, memory usage, etc) which is causing it to simply hang.

But if you have 2 scripts, the second of which always shows the behavior, can you attach 2132 and 2133 (or point to their source or a server where they are installed)? Is there anything odd in the Processor-0.log when running 2133?

comment:7 Changed 11 years ago by spli

It was only repeatable until I restarted the server. Nothing in Processor-0.log. If it happens again I'll do a bit more digging.

comment:8 Changed 11 years ago by jmoore

And only for the one script, or once 2133 started failing did you not re-try 2132?

comment:9 Changed 11 years ago by spli

I tried 2133, 2132, 2133, 2132, .... 2132 worked everytime, 2133 didn't. So I concluded it probably wasn't a network issue :)

comment:10 Changed 11 years ago by spli

This still occurs randomly. I tried running a short test script Long_Running_Script_Test.py (a few seconds) in a loop for a day (while true; omero script launch ...) without the error showing up.

comment:11 Changed 11 years ago by jmoore

Simon: Colin has tracked down a similar hanging on DropBox? (#9936) to happen after approx. 70 imports, i.e. 70 calls to omero.client(). There may be a similar leakage in processor.py, though many of the methods certainly try to guarantee omero.client cleanup.

comment:12 Changed 11 years ago by cblackburn

Simon: I can rule out the issue in #9936 as being related to this ticket. In #9936 it's simply the bad placement of a locked wait and not related to the client state.

Changed 11 years ago by spli

comment:13 Changed 11 years ago by spli

I happened to catch this with DEBUG logging turned on, shortly after clearing the logs and restarting the server. Attached in case it's useful.

Using Long_Running_Script_test.py again (twice)

script launch 11301 Number_Of_Messages=660 Log_Interval=1 Keep_Alive_Interval=60

Actually ignore that.... I rebuilt the server and the script was removed as part of the process... error message seems a bit misleading though.

Last edited 11 years ago by spli (previous) (diff)

comment:14 in reply to: ↑ description Changed 11 years ago by khgillen

We are having this issue with the Demo Server at KIT:

Logs captured, and placed: smb://squig.openmicroscopy.org.uk/ome/team/log-new/20130418_demoserver_logs

"I have been playing with the OMERO images in OMERO and was wanting to share some the images, but when I attempt to download the images by creating an OME-TIFF it give a prompt saying that no processor is available and to try again later...this obviously impedes downloading and sharing the files. Any help you could offer would be great. I have attached two print screen photo showing the issue.

Respectfully,

Austin Callison"

Last edited 11 years ago by khgillen (previous) (diff)

comment:15 Changed 11 years ago by wmoore

Kenny - if the problem persists, you need to try the fix above:

$ bin/omero admin ice "server stop Processor-0"

Changed 11 years ago by spli

Changed 11 years ago by spli

comment:16 Changed 11 years ago by spli

Got this again, but this time I had DEBUG logging enabled on the server. Relevant logs attached. The no processor error occurred twice, then it started working:

$ omero script launch 15951 Scale=0.1 IDs=56063 Data_Type=Image
No processor error (Blitz.log: 2013-05-20 22:41:10,535)

$ omero script launch 15951 Scale=0.1 IDs=56063 Data_Type=Image
No processor error (Blitz.log: 2013-05-20 22:45:30,685)

$ omero script list
Scripts listed

$ omero script launch 15951 Scale=0.1 IDs=56063 Data_Type=Image
Script was launched

Processor.log at 2013-05-20 22:41:10

2013-05-20 22:41:10,509 INFO  [                            omero.remote] (Dummy-5   )  Meth: ProcessorI.willAccept
2013-05-20 22:41:10,542 DEBUG [              omero.processor.ProcessorI] (Dummy-5   ) Accepts called on: user:652 group:653 scriptjob:12403 - Valid: True
2013-05-20 22:41:10,542 INFO  [                            omero.remote] (Dummy-5   )  Rslt: True

comment:17 Changed 11 years ago by pwalczysko

The bug appeared today on ome-analysis-01 server. Simon ran omero admin ice server stop Processor-0 for me, which relieved the bug. The workflow was select a single image and run "Calculate features" script on it - it took very long first, then a message appeared in the script dialog.

Changed 11 years ago by spli

OMERO VM which exhibits the No Processor error

comment:18 Changed 11 years ago by spli

omero-vm-5.0.0-alpha1-2300-38b3bda-dirty-ice33-b150.ova is a VM that seemingly always gives the No Processor error after the first start. Restarting Processor-0 solves it. Since this is a VM maybe it's possible to debug it?

E.g.

$ OMERO.server/bin/omero script list
$ OMERO.server/bin/omero script params 4
  - No Processor error
$ OMERO.server/bin/omero admin ice server stop Processor-0
$ OMERO.server/bin/omero script params 4
  - Now works

Note the name of the image is incorrect (its actually ice34).

Last edited 11 years ago by spli (previous) (diff)

comment:19 Changed 11 years ago by jamoore

What's Processor-0.log look like at that point?

comment:20 Changed 11 years ago by spli

No entries other than startup message. I'ved attached Processor-0.log and Blitz-0.log anyway. Error occurs at 2013-08-21 15:42:27 UTC.

comment:21 Changed 11 years ago by jamoore

So that looks like the startup time for Blitz is just so long on the VM that the Processor croaks. Slightly different issue.

comment:22 Changed 11 years ago by spli

So what's actually failing? admin diagnostics still lists Processor-0 as active- is that completely separate from what Blitz thinks?

comment:23 Changed 11 years ago by jamoore

In this case, Processor-0 is hung, due likely to not exiting cleanly on startup. Perhaps the same type of failure (a swallowed exception??) is also leaving the running Processor-0 in a hung state.

comment:24 Changed 11 years ago by wmoore

Reported again at #11436

comment:25 Changed 11 years ago by jamoore

  • Cc jamoore added
  • Owner changed from jamoore to spli
  • Version set to 4.4.8

Note, Simon, that #11436 has a message I've never seen before: "serverStackTrace = ome.conditions.ValidationException?: Found wrong number of files: []" Perhaps you have?

comment:26 Changed 11 years ago by spli

No, I haven't come across that before. I can't see how that can occur unless the query at
ScriptI.java:504
returns nothing, but looking at
QueryBuilder.__query()
it's not obvious where it could go wrong.

comment:27 Changed 11 years ago by jamoore

I don't think __query() will be the culprit. Most likely it's something related to either permissions (i.e. the script wasn't properly added) or the script has since been invalidated (e.g. had its mimetype changed)

comment:28 Changed 11 years ago by spli

A bit more digging... the full error (as output to the sysadmin chatroom):

	omero-logmonitor-02@jabber.org.uk		ERROR: gretzky Processor-0:
2013-09-05 12:29:41,648 ERROR [              omero.processor.ProcessorI] (Dummy-12  ) File lookup failed: user=7, group=9, script=3790
Traceback (most recent call last):
  File "/home/omero/slave/workspace/OMERO-merge-develop/src/dist/lib/python/omero/processor.py", line 765, in willAccept
    file, handle = self.lookup(scriptContext)
  File "/home/omero/slave/workspace/OMERO-merge-develop/src/dist/lib/python/omero/processor.py", line 740, in lookup
    file = prx.validateScript(job, self.accepts_list)
  File "/home/omero/slave/workspace/OMERO-merge-develop/src/dist/lib/python/omero/processor.py", line 40, in handler
    return func(*args, **kwargs)
  File "/home/omero/slave/workspace/OMERO-merge-develop/src/dist/lib/python/omero_api_IScript_ice.py", line 162, in validateScript
    return _M_omero.api.IScript._op_validateScript.invoke(self, ((j, acceptsList), _ctx))
ValidationException: exception ::omero::ValidationException
{
    serverStackTrace = ome.conditions.ValidationException: Found wrong number of fi...

on Gretzky:

omero=# select * from joboriginalfilelink where parent = 3790;
 id  | permissions | version | child | creation_id | external_id | group_id | owner_id | update_id | parent
-----+-------------+---------+-------+-------------+-------------+----------+----------+-----------+--------
 659 |        -120 |         |     1 |      142848 |             |        9 |        7 |    142848 |   3790
(1 row)

refers to child 1.

omero=# select * from originalfile where id = 1;
 id |          atime          |          ctime          | permissions |                 hash                 |  mimetype  |          mtime          | name  | path | size | version | creation_id | external_id | group_id | owner_id | update_id | hasher | repo | params
----+-------------------------+-------------------------+-------------+--------------------------------------+------------+-------------------------+-------+------+------+---------+-------------+-------------+----------+----------+-----------+--------+------+--------
  1 | 2013-07-19 05:20:55.657 | 2013-07-19 05:20:55.657 |        -120 | cd5bcde5-1d85-4f84-a076-0e293d8c47c9 | Repository | 2013-07-19 05:20:55.657 | OMERO | /    |    0 |       0 |           3 |             |        1 |        0 |         3 |        |      |
(1 row)

which appears to be the repository root instead of the script. Similarly for the retries (job ids 3791, 3792).

Trying it today gives the correct result (originalfile.id=13).

comment:29 Changed 11 years ago by jamoore

Perhaps related error conditions for a service (threads, connections, etc): http://www.zeroc.com/forums/help-center/6112-intermittent-ice-connectionrefusedexception-exceptions.html

comment:30 Changed 11 years ago by spli

  • Milestone changed from OMERO-4.4.9 to OMERO-4.4.10

comment:31 Changed 11 years ago by spli

An update:

Openmicroscopy branch with additional logging, and all servers disabled apart from Blitz, Glacier2 and Processor:
https://github.com/manics/openmicroscopy/tree/no_processor_investigation

Manually triggered build for building a VM with the above that will probably exhibit the 'No Processor' bug on startup:
http://hudson.openmicroscopy.org.uk/view/Experimental/job/TEMPORARY-OMERO-merge-develop-virtualbox-8266_no_processor/

Note Ice network/protocol tracing is only enabled for the Blitz server since it causes the Python servers to hang, see #11561.

On Mac OS X it does work- however when the no processor error occurs there are no signs in the Processor logs of any ice network activity for the relevant period.

comment:32 Changed 10 years ago by spli

Other causes of this error include missing python dependencies required by a script. See #12040

comment:33 Changed 10 years ago by drussell-x

  • Cc drussell-x added

comment:34 Changed 10 years ago by spli

It'd be useful to have more stats on this, e.g. Ice and OS versions. Could the Error submission dialog from #10231 be re-enabled in addition to the message? It would also be useful to distinguish between "No Processor" with an unknown cause, and ones caused for known reasons such as missing script deps (#12040)

comment:35 Changed 10 years ago by jamoore

How do you mean, "more stats", Simon? From the cases we've seen?

comment:36 Changed 10 years ago by spli

Stats on how frequently it's occuring with current releases and OS distributions. For example, a quick search suggests the only Ice35 report was due to missing Python deps, but if everyone already knows about omero admin ice server stop Processor-0 then they wouldn't report it.

comment:37 Changed 10 years ago by jamoore

We can try to scavenger a list (and certainly tell the whole team to try to record it), but since it's an issue of a running process no longer replying, it could be tricky. How many were fixed by full restarts? We *could* try to teach the new bin/omero admin restart --service method how to contact us and say as much. ;)

comment:38 Changed 10 years ago by jamoore

  • Cc saloynton removed
  • Milestone changed from 5.1.0 to 5.0.2
  • Owner changed from spli to jamoore
  • Status changed from new to accepted

I think I've finally pinned this down (or at least one incarnation of it). When a category is generated via getCategoryForClient that ends with a \, this leads to bad parsing of the object name:

$ cat s.java 
public class s {
    public static void main(String[] args) {
        Ice.Communicator ic = Ice.Util.initialize();
        Ice.Identity id = ic.stringToIdentity(args[0]);
        System.out.println("Category:" + id.category);
        System.out.println("    Name:" + id.name);
    }
}
$ java s '"W:T\/t`jFayi;\/yOlCJ@\/UsermodeProcessor-4c46d188-8323-4998-b621-671771537fd9" -t -e 1.0:tcp -h 127.0.0.1 -p 58406'
Category:
    Name:"W:T/t`jFayi;/yOlCJ@/UsermodeProcessor-4c46d188-8323-4998-b621-671771537fd9" -t -e 1.0:tcp -h 127.0.0.1 -p 58406

but remove the trailing slash:

java s '"W:T\/t`jFayi;\/yOlCJ@/UsermodeProcessor-4c46d188-8323-4998-b621-671771537fd9" -t -e 1.0:tcp -h 127.0.0.1 -p 58406'
Category:"W:T/t`jFayi;/yOlCJ@
    Name:UsermodeProcessor-4c46d188-8323-4998-b621-671771537fd9" -t -e 1.0:tcp -h 127.0.0.1 -p 58406

and it works. Workaround is unknown, but likely involves using a method other than Ice.Communicator.stringToProxy. Note: this holds for Python and Java.

comment:40 Changed 10 years ago by jamoore

PR opened: https://github.com/openmicroscopy/openmicroscopy/pull/2528

Likely we can close this ticket, and open a follow-on to remove the workarounds depending on the ZeroC forum post.

comment:41 Changed 10 years ago by jamoore

Following the feedback from ZeroC, I'll update 5.1 to pass a Processor* in willAccept() and will leave the workarounds in place for 5.0.

comment:42 Changed 10 years ago by jamoore

  • Resolution set to fixed
  • Status changed from accepted to closed

Slightly different strategy taken for the develop PR: https://github.com/openmicroscopy/openmicroscopy/pull/2535

That shouldn't need any further modifying, so I won't file any follow-up ticket with the assumption that 5.0 users may be on older Ice versions until they upgrade to 5.1, so the minor overhead of re-creating sessions should be tolerable.

comment:43 Changed 10 years ago by spli

Do we need a cleanup ticket, e.g. to undo https://trac.openmicroscopy.org.uk/ome/ticket/10231 (if no processor still occurs we want to know)

comment:44 Changed 10 years ago by jamoore

Not sure, Simon. Perhaps Will or Jean-marie have a feeling. My understanding from the PRs in #10231 was that a "No processor available" would still be displayed, but just not cause an error report. Do we want to move to an error report so we can help people to debug? Possibly. Once we start allowing the launching of more scripts (not just the official ones) then we should *not* send error reports, or at least, not without checking the number of responses first.

comment:45 Changed 10 years ago by spli

Ideally we need separate error handling for server/grid errors and scripting problems

comment:46 Changed 10 years ago by jmoore <josh@…>

(In [b3c4b62fee5adfdf05fc80a9d7cbd96f322493f9/ome.git] on branch dev_4_4) Investigation for "No processor" (See #8266)

These logging changes and fail-fast modifications were useful
in pinpointing the cause of the "No processor" error. Specifically,
they show that the usermode processor used by test_inputs.py was
being called and was communicating with the server *but* that when
the server tried to access the proxy an ObjectNotFoundException
was being thrown (caused by poor parsing).

comment:47 Changed 10 years ago by jmoore <josh@…>

  • Remaining Time set to 0

(In [d1f283f3204b1abe3fc5f6aec1a20e32d0fa75fe/ome.git] on branch dev_4_4) Implement workaround clients.py (Fix #8266)

For each Glacier2.Router session whose category
ends in "\" (but not "
") we immediately close
the session and create another one.

See:
http://www.zeroc.com/forums/bug-reports/6262-periodic-objectnotexistexception-glacier2-categories-post27018.html#post27018

comment:48 Changed 10 years ago by jmoore <josh@…>

(In [e7e1d7096a6381f7bed8944ce530475c951d8fb3/ome.git] on branch dev_4_4) Implement workaround for client.java (See #8266)

comment:49 Changed 10 years ago by jmoore <josh@…>

(In [81d93235da6029c075d6a3697b321474675e406e/ome.git] on branch dev_4_4) Implement workaround for client.cpp (See #8266)

comment:50 Changed 10 years ago by jmoore <josh@…>

(In [85f7b6cfb73dece839924aac86c1f5ad8a9e3c2a/ome.git] on branch dev_4_4) Revert workaround (See #8266)

Though all tests passed, the workaround here broken
OMERO.web's multi-threaded usage.

comment:51 Changed 10 years ago by jmoore <josh@…>

(In [b8fbfd4ba2ed762ce8b972f84b97e6f9d73b21b8/ome.git] on branch dev_4_4) Modify the isAccept method to not use a string (Fix #8266)

On the stable 5.0 branch, we used a workaround to detect
Glacier2 client categories which would fail due to a bug
in the identityToString Ice methods. Rather than carry
along that workaround when ZeroC will fix the issue, we
make a breaking change here which accepts a Processor*
rather than a stringified version of the proxy.

comment:52 Changed 10 years ago by jmoore <josh@…>

(In [8346364fdb7058abdbba2540f4b0d3aef908c030/ome.git] on branch dev_4_4) Maintain backwards compatibility (See #8266)

By using a new method for the improved (i.e. non-breaking)
invocation, older processors can still talk to newer servers.
If a newer processor is deployed against an older server, than
a NoSuchMethodError? will be thrown.

Conflicts:

components/tools/OmeroPy/test/integration/scriptstest/test_inputs.py

comment:53 Changed 10 years ago by jmoore <josh@…>

(In [0c6e6ff30222957d36d6ab3e33c32c2b1dd568e0/ome.git] on branch develop) Investigation for "No processor" (See #8266)

These logging changes and fail-fast modifications were useful
in pinpointing the cause of the "No processor" error. Specifically,
they show that the usermode processor used by test_inputs.py was
being called and was communicating with the server *but* that when
the server tried to access the proxy an ObjectNotFoundException
was being thrown (caused by poor parsing).

comment:54 Changed 10 years ago by jmoore <josh@…>

(In [66d2006e56a80ab0b0dd1353665d82e46321f3cd/ome.git] on branch develop) Implement workaround clients.py (Fix #8266)

For each Glacier2.Router session whose category
ends in "\" (but not "
") we immediately close
the session and create another one.

See:
http://www.zeroc.com/forums/bug-reports/6262-periodic-objectnotexistexception-glacier2-categories-post27018.html#post27018

comment:55 Changed 10 years ago by jmoore <josh@…>

(In [9c036f11c4109daf4dcc24aa65957e20ca170c16/ome.git] on branch develop) Implement workaround for client.java (See #8266)

comment:56 Changed 10 years ago by jmoore <josh@…>

(In [b5687a3c6b8bfb4654f6f370a6fc3a1fb3f2836b/ome.git] on branch develop) Implement workaround for client.cpp (See #8266)

comment:57 Changed 10 years ago by jmoore <josh@…>

(In [2366726b5c700ab651d7f0400ae6c5ffbca6da75/ome.git] on branch develop) Revert workaround (See #8266)

Though all tests passed, the workaround here broken
OMERO.web's multi-threaded usage.

comment:58 Changed 10 years ago by jmoore <josh@…>

(In [131d3c7ac9f59afcfe7069a104bbf217ff41be63/ome.git] on branch develop) Modify the isAccept method to not use a string (Fix #8266)

On the stable 5.0 branch, we used a workaround to detect
Glacier2 client categories which would fail due to a bug
in the identityToString Ice methods. Rather than carry
along that workaround when ZeroC will fix the issue, we
make a breaking change here which accepts a Processor*
rather than a stringified version of the proxy.

comment:59 Changed 10 years ago by jmoore <josh@…>

(In [8aa99ff129a138885f10e07f13b62990e0142530/ome.git] on branch develop) Maintain backwards compatibility (See #8266)

By using a new method for the improved (i.e. non-breaking)
invocation, older processors can still talk to newer servers.
If a newer processor is deployed against an older server, than
a NoSuchMethodError? will be thrown.

Note: See TracTickets for help on using tickets. You may also have a look at Agilo extensions to the ticket.

1.3.13-PRO © 2008-2011 Agilo Software all rights reserved (this page was served in: 0.74433 sec.)

We're Hiring!