Bug #906 (closed)
Opened 16 years ago
Closed 16 years ago
Server deadlock on empty connection pool
Reported by: jburel
Owned by: jamoore
Priority: blocker
Cc: cxallan, atarkowska
Sprint: n.a.
Total Remaining Time: n.a.
Description (last modified by jmoore)
When too many methods are executing simultaneously, the connection pool can lock up, causing the server to freeze and all methods to fail. kill -9 is required to stop the server.
One symptom of the freeze is that the FullTextThread blocks while indexing, causing other FullTextThreads to print:
INFO [FullTextThread] Currently running; skipping
Attachments (2)
Change History (31)
comment:1 Changed 16 years ago by jburel
- Description modified (diff)
comment:2 Changed 16 years ago by jburel
- Cc jason removed
comment:3 Changed 16 years ago by jburel
- Description modified (diff)
comment:4 Changed 16 years ago by jmoore
- Cc calln added
- Component changed from General to Services
- Owner changed from jmoore, callan to jmoore
- Status changed from new to assigned
comment:5 Changed 16 years ago by jmoore
- Cc callan added; calln removed
Is there any chance at all of getting the log that led up to this? Apparently there's a race condition of sorts in which the full text thread continually tries to get into a protected piece of code.
comment:6 Changed 16 years ago by jmoore
Specifically, it looks like a single thread has gotten stuck and is therefore holding the lock on DetailsFieldBridge. If that is what happened, you should get lots of "Inactive; skipping" messages after trying to shut down, rather than "Currently running; skipping" messages.
How did you try to shut down? The only other possibility is that Spring failed to call "stop()" on the FullTextThread.
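The two log messages imply a simple non-reentrant guard around the indexing cycle: one flag for "a cycle is in progress" and one for "the thread has been stopped". A minimal sketch of that pattern, assuming nothing about the actual OMERO classes (GuardedIndexer and its method names are illustrative only):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the guard the two log messages suggest.
// Not actual OMERO code; names are illustrative.
public class GuardedIndexer {
    private final AtomicBoolean running = new AtomicBoolean(false);
    private volatile boolean active = true;

    /** Runs one indexing cycle, or reports why it was skipped. */
    public String runOnce(Runnable work) {
        if (!active) {
            return "Inactive; skipping";           // stop() was already called
        }
        if (!running.compareAndSet(false, true)) {
            return "Currently running; skipping";  // a previous cycle still holds the guard
        }
        try {
            work.run();
            return "Indexed";
        } finally {
            running.set(false);
        }
    }

    /** What a lifecycle callback (e.g. Spring's stop()) would invoke at shutdown. */
    public void stop() {
        active = false;
    }
}
```

Under this reading, a hung cycle leaves the guard set, so every later trigger logs "Currently running; skipping"; only after stop() does the message flip to "Inactive; skipping", which is why the two messages distinguish a stuck thread from a clean shutdown.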
comment:7 Changed 16 years ago by jburel
INFO [FullTextThread] Currently running; skipping
At that point, I had not tried to shut down the server. The same message was then displayed over and over. After a while, I tried Ctrl-C to stop the server, but it didn't work.
comment:8 Changed 16 years ago by jmoore
Understood. Skipping before shutting down means that the full-text thread is taking a VERY long time; a cycle usually takes 0.1 to 5 seconds at most. Did you upload a very large text file? Also, could you see if that JBoss log is still available, no matter how large it is?
comment:9 Changed 16 years ago by jmoore
comment:10 Changed 16 years ago by cxallan
- Summary changed from server Freeze to Server Deadlock While Indexing
- Version 3.0-M3 deleted
Deadlock data follows...
Server revision:
callan@mage ~/OMERO3 $ svn info
Path: .
URL: svn+ssh://cvs.openmicroscopy.org.uk/home/svn/omero/trunk
Repository Root: svn+ssh://cvs.openmicroscopy.org.uk/home/svn/omero
Repository UUID: 05709c45-44f0-0310-885b-81a1db45b4a6
Revision: 2365
Node Kind: directory
Schedule: normal
Last Changed Author: jburel
Last Changed Rev: 2365
Last Changed Date: 2008-05-11 17:18:47 +0100 (Sun, 11 May 2008)
PostgreSQL lock data:
omero3=# select * from pg_stat_activity ;
datid | datname | procpid | usesysid | usename | current_query | query_start | backend_start | client_addr | client_port
1018661 | omero3 | 32004 | 16384 | callan | <IDLE> | 2008-05-12 12:10:28.320578+01 | 2008-05-12 11:54:28.454397+01 | 127.0.0.1 | 60658
1018661 | omero3 | 32003 | 16384 | callan | <IDLE> | 2008-05-12 12:10:28.224922+01 | 2008-05-12 11:54:28.455642+01 | 127.0.0.1 | 45924
1018661 | omero3 | 32002 | 16384 | callan | <IDLE> | 2008-05-12 12:10:28.080963+01 | 2008-05-12 11:54:28.457481+01 | 127.0.0.1 | 40919
1018661 | omero3 | 24261 | 16384 | callan | <IDLE> | 2008-05-07 15:09:47.376005+01 | 2008-04-23 21:24:40.008894+01 | | -1
1018661 | omero3 | 32033 | 16384 | callan | <IDLE> | 2008-05-12 12:10:28.155431+01 | 2008-05-12 11:56:30.042641+01 | 127.0.0.1 | 49437
1018661 | omero3 | 32035 | 16384 | callan | <IDLE> | 2008-05-12 12:10:28.085952+01 | 2008-05-12 11:56:30.042656+01 | 127.0.0.1 | 59075
1018661 | omero3 | 32034 | 16384 | callan | <IDLE> | 2008-05-12 12:10:28.25795+01 | 2008-05-12 11:56:30.047677+01 | 127.0.0.1 | 39904
1018661 | omero3 | 32044 | 16384 | callan | <IDLE> | 2008-05-12 12:10:28.180709+01 | 2008-05-12 11:57:30.471054+01 | 127.0.0.1 | 38096
1018661 | omero3 | 32045 | 16384 | callan | <IDLE> | 2008-05-12 12:10:28.321916+01 | 2008-05-12 11:57:30.471104+01 | 127.0.0.1 | 58917
1018661 | omero3 | 32046 | 16384 | callan | <IDLE> | 2008-05-12 12:10:28.313033+01 | 2008-05-12 11:57:30.495012+01 | 127.0.0.1 | 45406
1018661 | omero3 | 32128 | 16384 | callan | <IDLE> | 2008-05-12 12:10:28.29668+01 | 2008-05-12 12:10:28.061232+01 | 127.0.0.1 | 40196
1018661 | omero3 | 32129 | 16384 | callan | <IDLE> | 2008-05-12 12:10:28.35524+01 | 2008-05-12 12:10:28.063689+01 | 127.0.0.1 | 37537
1018661 | omero3 | 32127 | 16384 | callan | <IDLE> | 2008-05-12 12:10:28.071263+01 | 2008-05-12 12:10:28.070155+01 | 127.0.0.1 | 48789
1018661 | omero3 | 32131 | 16384 | callan | <IDLE> | 2008-05-12 12:10:28.432773+01 | 2008-05-12 12:10:28.084959+01 | 127.0.0.1 | 49964
1018661 | omero3 | 32132 | 16384 | callan | <IDLE> | 2008-05-12 12:10:28.379633+01 | 2008-05-12 12:10:28.081523+01 | 127.0.0.1 | 38263
1018661 | omero3 | 32134 | 16384 | callan | <IDLE> | 2008-05-12 12:10:28.254855+01 | 2008-05-12 12:10:28.105267+01 | 127.0.0.1 | 55950
1018661 | omero3 | 32193 | 16384 | callan | <IDLE> | 2008-05-12 12:23:40.912671+01 | 2008-05-12 12:23:40.912608+01 | | -1
(17 rows)

omero3=# -- http://www.postgresql.org/docs/7.4/interactive/view-pg-locks.html
omero3=# SELECT pg_class.relname AS table,
omero3-#        pg_database.datname AS database,
omero3-#        transaction, pid, mode, granted
omero3-# FROM pg_locks, pg_class, pg_database
omero3-# WHERE pg_locks.relation = pg_class.oid
omero3-# AND pg_locks.database = pg_database.oid;
table | database | transaction | pid | mode | granted
pg_class | omero3 | 11882818 | 32195 | AccessShareLock | t
pg_class_oid_index | omero3 | 11882818 | 32195 | AccessShareLock | t
pg_locks | omero3 | 11882818 | 32195 | AccessShareLock | t
(3 rows)

omero3=# -- Christian Storm <storm AT iparadigms.com>
omero3=# select pg_stat_activity.datname,
omero3-#        pg_class.relname,
omero3-#        pg_locks.transaction,
omero3-#        pg_locks.mode,
omero3-#        pg_locks.granted,
omero3-#        pg_stat_activity.usename,
omero3-#        substr(pg_stat_activity.current_query,1,30),
omero3-#        pg_stat_activity.query_start,
omero3-#        age(now(),
omero3(#        pg_stat_activity.query_start) as "age",
omero3-#        pg_stat_activity.procpid
omero3-# from pg_stat_activity,pg_locks left outer join pg_class on (pg_locks.relation = pg_class.oid)
omero3-# where pg_locks.pid=pg_stat_activity.procpid
omero3-# order by query_start desc;
datname | relname | transaction | mode | granted | usename | substr | query_start | age | procpid
omero3 | pg_authid | 11882819 | AccessShareLock | t | callan | <IDLE> | 2008-05-12 12:24:06.354396+01 | 00:00:20.647701 | 32195
omero3 | pg_stat_activity | 11882819 | AccessShareLock | t | callan | <IDLE> | 2008-05-12 12:24:06.354396+01 | 00:00:20.647701 | 32195
omero3 | | 11882819 | ExclusiveLock | t | callan | <IDLE> | 2008-05-12 12:24:06.354396+01 | 00:00:20.647701 | 32195
omero3 | pg_database | 11882819 | AccessShareLock | t | callan | <IDLE> | 2008-05-12 12:24:06.354396+01 | 00:00:20.647701 | 32195
omero3 | pg_locks | 11882819 | AccessShareLock | t | callan | <IDLE> | 2008-05-12 12:24:06.354396+01 | 00:00:20.647701 | 32195
omero3 | pg_class | 11882819 | AccessShareLock | t | callan | <IDLE> | 2008-05-12 12:24:06.354396+01 | 00:00:20.647701 | 32195
(6 rows)
comment:11 Changed 16 years ago by cxallan
Further data from an earlier deadlock with the same server version:
omero3=# select * from pg_stat_activity ;
datid | datname | procpid | usesysid | usename | current_query | query_start | backend_start | client_addr | client_port
1018661 | omero3 | 8911 | 16384 | callan | <IDLE> | 2008-04-28 10:35:34.754785+01 | 2008-03-21 16:42:21.116586+00 | 127.0.0.1 | 59277
1018661 | omero3 | 8912 | 16384 | callan | <command string not enabled> | | 2008-03-21 16:42:21.116618+00 | 127.0.0.1 | 37319
1018661 | omero3 | 8913 | 16384 | callan | <IDLE> | 2008-05-07 12:22:27.041159+01 | 2008-03-21 16:42:21.116623+00 | 127.0.0.1 | 39331
1018661 | omero3 | 24261 | 16384 | callan | <IDLE> | 2008-05-07 15:09:47.376005+01 | 2008-04-23 21:24:40.008894+01 | | -1
1018661 | omero3 | 31386 | 16384 | callan | <IDLE> | 2008-05-12 11:46:44.681304+01 | 2008-05-12 08:37:27.119246+01 | 127.0.0.1 | 49153
1018661 | omero3 | 31387 | 16384 | callan | <IDLE> | 2008-05-12 11:46:44.831645+01 | 2008-05-12 08:37:27.125101+01 | 127.0.0.1 | 32792
1018661 | omero3 | 8987 | 16384 | callan | <command string not enabled> | | 2008-03-21 16:43:00.929327+00 | 127.0.0.1 | 52465
1018661 | omero3 | 8988 | 16384 | callan | <IDLE> | 2008-05-12 11:48:00.034243+01 | 2008-03-21 16:43:01.091478+00 | 127.0.0.1 | 58609
1018661 | omero3 | 8989 | 16384 | callan | <IDLE> | 2008-05-12 09:09:30.038081+01 | 2008-03-21 16:43:00.933228+00 | 127.0.0.1 | 34683
1018661 | omero3 | 31388 | 16384 | callan | <IDLE> | 2008-05-12 11:46:44.797632+01 | 2008-05-12 08:37:27.127697+01 | 127.0.0.1 | 38993
1018661 | omero3 | 31593 | 16384 | callan | <IDLE> | 2008-05-12 11:46:44.816352+01 | 2008-05-12 10:29:15.747345+01 | 127.0.0.1 | 37697
1018661 | omero3 | 31595 | 16384 | callan | <IDLE> | 2008-05-12 11:46:44.694687+01 | 2008-05-12 10:29:15.752309+01 | 127.0.0.1 | 54951
1018661 | omero3 | 31594 | 16384 | callan | <IDLE> | 2008-05-12 11:46:44.706963+01 | 2008-05-12 10:29:15.752355+01 | 127.0.0.1 | 60211
1018661 | omero3 | 28389 | 16384 | callan | <IDLE> | 2008-04-28 10:30:37.091669+01 | 2008-03-31 16:45:16.877403+01 | 127.0.0.1 | 35712
1018661 | omero3 | 28391 | 16384 | callan | <IDLE> | 2008-05-07 14:10:46.744597+01 | 2008-03-31 16:45:16.879851+01 | 127.0.0.1 | 60063
1018661 | omero3 | 28392 | 16384 | callan | <command string not enabled> | | 2008-03-31 16:45:16.883708+01 | 127.0.0.1 | 43758
1018661 | omero3 | 31685 | 16384 | callan | <IDLE> | 2008-05-12 11:46:44.664839+01 | 2008-05-12 11:06:21.466437+01 | 127.0.0.1 | 41580
1018661 | omero3 | 31686 | 16384 | callan | <IDLE> | 2008-05-12 11:46:44.622948+01 | 2008-05-12 11:06:21.466419+01 | 127.0.0.1 | 37106
1018661 | omero3 | 31687 | 16384 | callan | <IDLE> | 2008-05-12 11:46:44.536756+01 | 2008-05-12 11:06:21.470411+01 | 127.0.0.1 | 45767
1018661 | omero3 | 31785 | 16384 | callan | <IDLE> | 2008-05-12 11:46:44.518946+01 | 2008-05-12 11:46:44.513495+01 | 127.0.0.1 | 37486
1018661 | omero3 | 31786 | 16384 | callan | <IDLE> | 2008-05-12 11:46:44.777663+01 | 2008-05-12 11:46:44.513926+01 | 127.0.0.1 | 44982
1018661 | omero3 | 31787 | 16384 | callan | <IDLE> | 2008-05-12 11:46:44.869679+01 | 2008-05-12 11:46:44.513941+01 | 127.0.0.1 | 42208
1018661 | omero3 | 31796 | 16384 | callan | <IDLE> | 2008-05-12 11:46:44.698011+01 | 2008-05-12 11:46:44.628062+01 | 127.0.0.1 | 36093
1018661 | omero3 | 31798 | 16384 | callan | <IDLE> | 2008-05-12 11:46:44.886039+01 | 2008-05-12 11:46:44.638314+01 | 127.0.0.1 | 46696
1018661 | omero3 | 31800 | 16384 | callan | <IDLE> | 2008-05-12 11:46:44.65131+01 | 2008-05-12 11:46:44.649937+01 | 127.0.0.1 | 39196
1018661 | omero3 | 31861 | 16384 | callan | <IDLE> | 2008-05-12 11:48:21.690855+01 | 2008-05-12 11:48:21.690763+01 | | -1

omero3=# -- http://www.postgresql.org/docs/7.4/interactive/view-pg-locks.html
omero3=# SELECT pg_class.relname AS table,
omero3-#        pg_database.datname AS database,
omero3-#        transaction, pid, mode, granted
omero3-# FROM pg_locks, pg_class, pg_database
omero3-# WHERE pg_locks.relation = pg_class.oid
omero3-# AND pg_locks.database = pg_database.oid;
table | database | transaction | pid | mode | granted
pg_locks | omero3 | 11871408 | 31861 | AccessShareLock | t
pg_class_oid_index | omero3 | 11871408 | 31861 | AccessShareLock | t
pg_class | omero3 | 11871408 | 31861 | AccessShareLock | t
(3 rows)

omero3=# -- Christian Storm <storm AT iparadigms.com>
omero3=# select pg_stat_activity.datname,
omero3-#        pg_class.relname,
omero3-#        pg_locks.transaction,
omero3-#        pg_locks.mode,
omero3-#        pg_locks.granted,
omero3-#        pg_stat_activity.usename,
omero3-#        substr(pg_stat_activity.current_query,1,30),
omero3-#        pg_stat_activity.query_start,
omero3-#        age(now(),
omero3(#        pg_stat_activity.query_start) as "age",
omero3-#        pg_stat_activity.procpid
omero3-# from pg_stat_activity,pg_locks left outer join pg_class on (pg_locks.relation = pg_class.oid)
omero3-# where pg_locks.pid=pg_stat_activity.procpid
omero3-# order by query_start desc;
datname | relname | transaction | mode | granted | usename | substr | query_start | age | procpid
omero3 | pg_locks | 11871409 | AccessShareLock | t | callan | <IDLE> | 2008-05-12 11:48:26.245359+01 | 00:03:55.993249 | 31861
omero3 | pg_authid | 11871409 | AccessShareLock | t | callan | <IDLE> | 2008-05-12 11:48:26.245359+01 | 00:03:55.993249 | 31861
omero3 | pg_database | 11871409 | AccessShareLock | t | callan | <IDLE> | 2008-05-12 11:48:26.245359+01 | 00:03:55.993249 | 31861
omero3 | | 11871409 | ExclusiveLock | t | callan | <IDLE> | 2008-05-12 11:48:26.245359+01 | 00:03:55.993249 | 31861
omero3 | pg_class | 11871409 | AccessShareLock | t | callan | <IDLE> | 2008-05-12 11:48:26.245359+01 | 00:03:55.993249 | 31861
omero3 | pg_stat_activity | 11871409 | AccessShareLock | t | callan | <IDLE> | 2008-05-12 11:48:26.245359+01 | 00:03:55.993249 | 31861
(6 rows)
comment:12 Changed 16 years ago by cxallan
Deadlock number three:
omero3=# -- http://www.postgresql.org/docs/7.4/interactive/view-pg-locks.html
omero3=# SELECT pg_class.relname AS table,
omero3-#        pg_database.datname AS database,
omero3-#        transaction, pid, mode, granted
omero3-# FROM pg_locks, pg_class, pg_database
omero3-# WHERE pg_locks.relation = pg_class.oid
omero3-# AND pg_locks.database = pg_database.oid;
table | database | transaction | pid | mode | granted
pg_class_oid_index | omero3 | 11902738 | 32747 | AccessShareLock | t
pg_locks | omero3 | 11902738 | 32747 | AccessShareLock | t
pg_class | omero3 | 11902738 | 32747 | AccessShareLock | t
(3 rows)

omero3=# -- Christian Storm <storm AT iparadigms.com>
omero3=# select pg_stat_activity.datname,
omero3-#        pg_class.relname,
omero3-#        pg_locks.transaction,
omero3-#        pg_locks.mode,
omero3-#        pg_locks.granted,
omero3-#        pg_stat_activity.usename,
omero3-#        substr(pg_stat_activity.current_query,1,30),
omero3-#        pg_stat_activity.query_start,
omero3-#        age(now(),
omero3(#        pg_stat_activity.query_start) as "age",
omero3-#        pg_stat_activity.procpid
omero3-# from pg_stat_activity,pg_locks left outer join pg_class on (pg_locks.relation = pg_class.oid)
omero3-# where pg_locks.pid=pg_stat_activity.procpid
omero3-# order by query_start desc;
datname | relname | transaction | mode | granted | usename | substr | query_start | age | procpid
omero3 | pg_database | 11902739 | AccessShareLock | t | callan | <IDLE> | 2008-05-12 14:11:14.988843+01 | 00:00:01.365515 | 32747
omero3 | pg_locks | 11902739 | AccessShareLock | t | callan | <IDLE> | 2008-05-12 14:11:14.988843+01 | 00:00:01.365515 | 32747
omero3 | | 11902739 | ExclusiveLock | t | callan | <IDLE> | 2008-05-12 14:11:14.988843+01 | 00:00:01.365515 | 32747
omero3 | pg_authid | 11902739 | AccessShareLock | t | callan | <IDLE> | 2008-05-12 14:11:14.988843+01 | 00:00:01.365515 | 32747
omero3 | pg_stat_activity | 11902739 | AccessShareLock | t | callan | <IDLE> | 2008-05-12 14:11:14.988843+01 | 00:00:01.365515 | 32747
omero3 | pg_class | 11902739 | AccessShareLock | t | callan | <IDLE> | 2008-05-12 14:11:14.988843+01 | 00:00:01.365515 | 32747
(6 rows)
comment:13 Changed 16 years ago by cxallan
Information from Michael:
Chris 14:11: Were you importing or just working with data?
Michael Porter 14:11: Using Insight. In fact, I wasn't even doing anything fancy, just browsing the treeview.
comment:14 Changed 16 years ago by jmoore
cf. r2139
comment:15 Changed 16 years ago by jmoore
- Description modified (diff)
- Summary changed from Server Deadlock While Indexing to Server deadlock on empty connection pool
Finally reproduced on my box. All it takes to reproduce the server freeze is to start 50 threads at the same time, all of which call IAdmin.getEventContext(). c3p0 (the JDBC connection pool) then locks up. Looking into solutions.
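The freeze mechanism described here can be sketched with plain java.util.concurrent, with a Semaphore standing in for the JDBC pool (this is not OMERO or c3p0 code). Each "method" checks out one connection and then needs a second before it releases the first; once every worker holds one connection, the pool is empty and nobody can make progress. The 200 ms timeout exists only so the demo terminates instead of hanging like the real server:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of connection-pool starvation; not OMERO/c3p0 code.
// Call with workers <= poolSize; with workers > poolSize the demo would
// genuinely hang (some threads never get their first "connection"),
// which mirrors the server freeze in this ticket.
public class PoolStarvation {

    /** Returns how many workers starved while waiting for a second connection. */
    public static int starvedWorkers(int poolSize, int workers) {
        Semaphore pool = new Semaphore(poolSize);            // the "JDBC pool"
        CountDownLatch allHoldOne = new CountDownLatch(workers);
        CountDownLatch allTried = new CountDownLatch(workers);
        AtomicInteger starved = new AtomicInteger();
        ExecutorService exec = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            exec.submit(() -> {
                try {
                    pool.acquire();                          // first connection: succeeds
                    allHoldOne.countDown();
                    allHoldOne.await();                      // wait until the pool is drained
                    // second connection: the pool is empty and nobody releases
                    if (pool.tryAcquire(200, TimeUnit.MILLISECONDS)) {
                        pool.release();                      // not reached when workers == poolSize
                    } else {
                        starved.incrementAndGet();
                    }
                    allTried.countDown();
                    allTried.await();                        // hold connection #1 until all have tried
                    pool.release();                          // finally give back connection #1
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        exec.shutdown();
        try {
            exec.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return starved.get();
    }
}
```

With workers equal to the pool size, every worker holds one permit and starves on the second, which matches the observation below that fewer simultaneous methods than pool slots can be serviced when each method uses multiple SQL connections.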
comment:16 Changed 16 years ago by cxallan
This might be related to an issue that J-M noticed a little while ago where he could exhaust the number of sockets (or threads, I can't remember which) that the server has available. This was especially noticeable when many thumbnail retrieval threads were running at the same time inside Insight.
J-M, have you noticed this again since you revamped the way you're retrieving thumbnails?
comment:17 Changed 16 years ago by jburel
I minimized the number of calls for counting, but I have not made the change to the thumbnails yet.
That might be the reason why Michael, with his large dataset, has the problem.
I will put that at the top of my TODO list.
comment:18 Changed 16 years ago by jmoore
This is definitely a DoS, though at the moment I can only pinpoint it to SQL. Jean-Marie, if you know that this has anything to do with the number of threads, please let me know.
Status: this seems to be tied to the number of postgres connections that are allowed. Using the default 30, I could "simultaneously" call up to 35 IAdmin.getEventContext methods (more than 30 simply because of the timing). Reducing the number of postgres connections to 10 allowed only 6 simultaneous calls; only 6 because each method may use multiple SQL connections, for sessions, etc. (Interestingly, increasing postgres max_connections to 100 also maxed out at 30.)
The GOOD thing about reducing the number so low is that it forced c3p0 to realize that there was a deadlock, which allowed a regular server halt as opposed to using `kill -9`. I'll play with the parameters for c3p0 and see if I can't teach it to recognize deadlocks with lots of connections.
In the end, probably the only permanent cure is a throttle setting which roughly matches the max number of connections to postgres.
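The throttle idea above can be sketched as a fair semaphore sized to the database's connection limit, so callers queue inside the JVM instead of exhausting postgres backend slots. A hedged sketch only; MethodThrottle and its API are hypothetical, not the eventual OMERO implementation:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Hypothetical sketch of a service-method throttle; names are illustrative.
public class MethodThrottle {
    private final Semaphore permits;

    /** maxDbConnections should roughly match postgres max_connections. */
    public MethodThrottle(int maxDbConnections) {
        this.permits = new Semaphore(maxDbConnections, true); // fair: FIFO ordering
    }

    /** Blocks until a permit is free, then runs the service method. */
    public <T> T invoke(Callable<T> method) throws Exception {
        permits.acquire();        // queue here instead of emptying the pool
        try {
            return method.call();
        } finally {
            permits.release();
        }
    }
}
```

Sizing the permit count at (or slightly below) the connection limit means excess callers wait at the throttle rather than triggering c3p0's "APPARENT DEADLOCK" path; fairness keeps waiting methods from being starved indefinitely.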
09:48:10,205 INFO [ServiceHandler] Meth: getEventContext
09:48:10,205 INFO [ServiceHandler] Args: ()
09:48:10,247 INFO [EventHandler] Auth: user=100,group=2,event=12209(User)
09:48:10,281 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@3e925e
09:48:10,517 INFO [ServiceHandler] Meth: getEventContext
09:48:10,517 INFO [ServiceHandler] Args: ()
09:48:10,519 INFO [ServiceHandler] Meth: getEventContext
09:48:10,520 INFO [ServiceHandler] Args: ()
09:48:10,624 INFO [EventHandler] Auth: user=100,group=2,event=12210(User)
09:48:10,632 INFO [EventHandler] Auth: user=100,group=2,event=12211(User)
09:48:10,657 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@8ef279
09:48:10,659 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@831dff
09:48:11,030 INFO [ServiceHandler] Meth: getEventContext
09:48:11,030 INFO [ServiceHandler] Args: ()
09:48:11,326 INFO [ServiceHandler] Meth: getEventContext
09:48:11,326 INFO [ServiceHandler] Args: ()
09:48:11,679 INFO [ServiceHandler] Meth: getEventContext
09:48:11,680 INFO [ServiceHandler] Args: ()
09:48:11,820 INFO [EventHandler] Auth: user=100,group=2,event=12212(User)
09:48:11,825 INFO [EventHandler] Auth: user=100,group=2,event=12213(User)
09:48:11,826 INFO [EventHandler] Auth: user=100,group=2,event=12214(User)
09:48:11,910 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@ccf2f9
09:48:11,915 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@d28476
09:48:11,925 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@5b7bd1
09:48:12,036 INFO [ServiceHandler] Meth: getEventContext
09:48:12,036 INFO [ServiceHandler] Args: ()
09:48:12,077 INFO [ServiceHandler] Meth: getEventContext
09:48:12,077 INFO [ServiceHandler] Args: ()
09:48:12,118 INFO [ServiceHandler] Meth: getEventContext
09:48:12,118 INFO [ServiceHandler] Args: ()
09:48:12,152 INFO [ServiceHandler] Meth: getEventContext
09:48:12,152 INFO [ServiceHandler] Args: ()
09:48:12,259 INFO [EventHandler] Auth: user=100,group=2,event=12215(User)
09:48:12,285 INFO [EventHandler] Auth: user=100,group=2,event=12216(User)
09:48:12,300 INFO [EventHandler] Auth: user=100,group=2,event=12218(User)
09:48:12,305 INFO [EventHandler] Auth: user=100,group=2,event=12217(User)
09:48:12,385 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@d8b4ba
09:48:12,402 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@c2339
09:48:12,408 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@afb49d
09:48:12,417 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@bcdd50
09:48:12,527 INFO [ServiceHandler] Meth: getEventContext
09:48:12,527 INFO [ServiceHandler] Args: ()
09:48:12,534 INFO [ServiceHandler] Meth: getEventContext
09:48:12,534 INFO [ServiceHandler] Args: ()
09:48:12,603 INFO [ServiceHandler] Meth: getEventContext
09:48:12,603 INFO [ServiceHandler] Args: ()
09:48:12,639 INFO [ServiceHandler] Meth: getEventContext
09:48:12,639 INFO [ServiceHandler] Args: ()
09:48:12,710 INFO [ServiceHandler] Meth: getEventContext
09:48:12,711 INFO [ServiceHandler] Args: ()
09:48:12,724 INFO [EventHandler] Auth: user=100,group=2,event=12219(User)
09:48:12,739 INFO [EventHandler] Auth: user=100,group=2,event=12220(User)
09:48:12,799 INFO [EventHandler] Auth: user=100,group=2,event=12221(User)
09:48:12,832 INFO [EventHandler] Auth: user=100,group=2,event=12222(User)
09:48:12,833 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@834822
09:48:12,859 INFO [EventHandler] Auth: user=100,group=2,event=12223(User)
09:48:12,881 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@1ecf06
09:48:12,894 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@f709a2
09:48:12,905 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@90b8a5
09:48:12,922 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@82901c
09:48:13,054 INFO [ServiceHandler] Meth: getEventContext
09:48:13,054 INFO [ServiceHandler] Args: ()
09:48:13,056 INFO [ServiceHandler] Meth: getEventContext
09:48:13,056 INFO [ServiceHandler] Args: ()
09:48:13,061 INFO [ServiceHandler] Meth: getEventContext
09:48:13,061 INFO [ServiceHandler] Args: ()
09:48:13,062 INFO [ServiceHandler] Meth: getEventContext
09:48:13,062 INFO [ServiceHandler] Args: ()
09:48:13,064 INFO [ServiceHandler] Meth: getEventContext
09:48:13,064 INFO [ServiceHandler] Args: ()
09:48:13,066 INFO [ServiceHandler] Meth: getEventContext
09:48:13,066 INFO [ServiceHandler] Args: ()
09:48:13,180 INFO [EventHandler] Auth: user=100,group=2,event=12224(User)
09:48:13,275 INFO [EventHandler] Auth: user=100,group=2,event=12227(User)
09:48:13,293 INFO [EventHandler] Auth: user=100,group=2,event=12225(User)
09:48:13,381 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@3a58c1
09:48:13,382 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@a2d046
09:48:13,384 INFO [EventHandler] Auth: user=100,group=2,event=12226(User)
09:48:13,409 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@9cc91d
09:48:13,455 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@7c297a
09:48:27,224 WARN [ThreadPoolAsynchronousRunner] com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector@eaec4 -- APPARENT DEADLOCK!!! Creating emergency threads for unassigned pending tasks!
09:48:27,237 WARN [ThreadPoolAsynchronousRunner] com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector@eaec4 -- APPARENT DEADLOCK!!! Complete Status:
    Managed Threads: 3
    Active Threads: 3
    Active Tasks:
        com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask@683e5e (com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#0)
        com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask@d6ca90 (com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#2)
        com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask@8ce38e (com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#1)
    Pending Tasks:
        com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask@af8cbc
        com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask@40c470
        com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask@209ff6
        com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask@e5f80f
        com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask@d590
        com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask@c4e2fd
        com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask@127a7e
        com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask@2825d5
        com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask@cbd3f4
        com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask@c862a3
    Pool thread stack traces:
        Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#2,5,jboss]
            java.lang.Thread.sleep(Native Method)
            com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask.run(BasicResourcePool.java:1805)
            com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread.run(ThreadPoolAsynchronousRunner.java:547)
        Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#0,5,jboss]
            java.lang.Thread.sleep(Native Method)
            com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask.run(BasicResourcePool.java:1805)
            com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread.run(ThreadPoolAsynchronousRunner.java:547)
        Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#1,5,jboss]
            java.lang.Thread.sleep(Native Method)
            com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask.run(BasicResourcePool.java:1805)
            com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread.run(ThreadPoolAsynchronousRunner.java:547)
09:48:27,296 INFO [EventHandler] Auth: user=100,group=2,event=12228(User)
09:48:27,314 INFO [EventHandler] Auth: user=100,group=2,event=12229(User)
09:48:27,314 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@34369e
09:48:27,334 INFO [ServiceHandler] Rslt: ome.system.SimpleEventContext@3539d7
09:48:30,012 INFO [ServiceHandler] Meth: doWork
09:48:30,013 INFO [ServiceHandler] Args: [null, null, ome.tools.spring.InternalServiceFactory@17aae1]
09:48:30,041 INFO [EventHandler] Auth: user=0,group=0,event=12230(FullText)
09:48:30,070 INFO [FullTextIndexer] Indexed: ome.model.meta.Session:Id_702
09:48:30,660 INFO [ServiceHandler] Rslt: null
09:48:41,892 WARN [BasicResourcePool] com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask@d6ca90 -- Acquisition Attempt Failed!!! Clearing pending acquires. While trying to acquire a needed new resource, we failed to succeed more than the maximum number of allowed acquisition attempts (30). Last acquisition attempt exception:
org.postgresql.util.PSQLException: FATAL: sorry, too many clients already
    at org.postgresql.core.v3.ConnectionFactoryImpl.readStartupMessages(ConnectionFactoryImpl.java:443)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:98)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:65)
    at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:116)
    at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:30)
    at org.postgresql.jdbc3.Jdbc3Connection.<init>(Jdbc3Connection.java:24)
    at org.postgresql.Driver.makeConnection(Driver.java:369)
    at org.postgresql.Driver.connect(Driver.java:245)
    at com.mchange.v2.c3p0.DriverManagerDataSource.getConnection(DriverManagerDataSource.java:135)
    at com.mchange.v2.c3p0.WrapperConnectionPoolDataSource.getPooledConnection(WrapperConnectionPoolDataSource.java:182)
    at com.mchange.v2.c3p0.WrapperConnectionPoolDataSource.getPooledConnection(WrapperConnectionPoolDataSource.java:171)
    at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool$1PooledConnectionResourcePoolManager.acquireResource(C3P0PooledConnectionPool.java:137)
    at com.mchange.v2.resourcepool.BasicResourcePool.doAcquire(BasicResourcePool.java:1014)
    at com.mchange.v2.resourcepool.BasicResourcePool.access$800(BasicResourcePool.java:32)
    at com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask.run(BasicResourcePool.java:1810)
    at com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread.run(ThreadPoolAsynchronousRunner.java:547)
09:48:42,049 WARN [BasicResourcePool] com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask@683e5e -- Acquisition Attempt Failed!!! Clearing pending acquires. While trying to acquire a needed new resource, we failed to succeed more than the maximum number of allowed acquisition attempts (30). Last acquisition attempt exception:
org.postgresql.util.PSQLException: FATAL: sorry, too many clients already
    at org.postgresql.core.v3.ConnectionFactoryImpl.readStartupMessages(ConnectionFactoryImpl.java:443)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:98)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:65)
    at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:116)
    at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:30)
    at org.postgresql.jdbc3.Jdbc3Connection.<init>(Jdbc3Connection.java:24)
    at org.postgresql.Driver.makeConnection(Driver.java:369)
    at org.postgresql.Driver.connect(Driver.java:245)
    at com.mchange.v2.c3p0.DriverManagerDataSource.getConnection(DriverManagerDataSource.java:135)
    at com.mchange.v2.c3p0.WrapperConnectionPoolDataSource.getPooledConnection(WrapperConnectionPoolDataSource.java:182)
    at com.mchange.v2.c3p0.WrapperConnectionPoolDataSource.getPooledConnection(WrapperConnectionPoolDataSource.java:171)
    at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool$1PooledConnectionResourcePoolManager.acquireResource(C3P0PooledConnectionPool.java:137)
    at com.mchange.v2.resourcepool.BasicResourcePool.doAcquire(BasicResourcePool.java:1014)
    at com.mchange.v2.resourcepool.BasicResourcePool.access$800(BasicResourcePool.java:32)
    at com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask.run(BasicResourcePool.java:1810)
    at com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread.run(ThreadPoolAsynchronousRunner.java:547)
09:48:42,355 WARN [BasicResourcePool] com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask@8ce38e -- Acquisition Attempt Failed!!! Clearing pending acquires. While trying to acquire a needed new resource, we failed to succeed more than the maximum number of allowed acquisition attempts (30). Last acquisition attempt exception:
org.postgresql.util.PSQLException: FATAL: sorry, too many clients already
    at org.postgresql.core.v3.ConnectionFactoryImpl.readStartupMessages(ConnectionFactoryImpl.java:443)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:98)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:65)
    at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:116)
    at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:30)
    at org.postgresql.jdbc3.Jdbc3Connection.<init>(Jdbc3Connection.java:24)
    at org.postgresql.Driver.makeConnection(Driver.java:369)
    at org.postgresql.Driver.connect(Driver.java:245)
    at com.mchange.v2.c3p0.DriverManagerDataSource.getConnection(DriverManagerDataSource.java:135)
    at com.mchange.v2.c3p0.WrapperConnectionPoolDataSource.getPooledConnection(WrapperConnectionPoolDataSource.java:182)
    at com.mchange.v2.c3p0.WrapperConnectionPoolDataSource.getPooledConnection(WrapperConnectionPoolDataSource.java:171)
    at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool$1PooledConnectionResourcePoolManager.acquireResource(C3P0PooledConnectionPool.java:137)
    at com.mchange.v2.resourcepool.BasicResourcePool.doAcquire(BasicResourcePool.java:1014)
    at com.mchange.v2.resourcepool.BasicResourcePool.access$800(BasicResourcePool.java:32)
    at com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask.run(BasicResourcePool.java:1810)
    at com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread.run(ThreadPoolAsynchronousRunner.java:547)
09:48:56,643 WARN [BasicResourcePool] com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask@40c470 -- Acquisition Attempt Failed!!! Clearing pending acquires. While trying to acquire a needed new resource, we failed to succeed more than the maximum number of allowed acquisition attempts (30). Last acquisition attempt exception:
org.postgresql.util.PSQLException: FATAL: sorry, too many clients already
    at org.postgresql.core.v3.ConnectionFactoryImpl.readStartupMessages(ConnectionFactoryImpl.java:443)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:98)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:65)
    at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:116)
    at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:30)
    at org.postgresql.jdbc3.Jdbc3Connection.<init>(Jdbc3Connection.java:24)
    at org.postgresql.Driver.makeConnection(Driver.java:369)
    at org.postgresql.Driver.connect(Driver.java:245)
    at com.mchange.v2.c3p0.DriverManagerDataSource.getConnection(DriverManagerDataSource.java:135)
    at com.mchange.v2.c3p0.WrapperConnectionPoolDataSource.getPooledConnection(WrapperConnectionPoolDataSource.java:182)
    at com.mchange.v2.c3p0.WrapperConnectionPoolDataSource.getPooledConnection(WrapperConnectionPoolDataSource.java:171)
    at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool$1PooledConnectionResourcePoolManager.acquireResource(C3P0PooledConnectionPool.java:137)
    at com.mchange.v2.resourcepool.BasicResourcePool.doAcquire(BasicResourcePool.java:1014)
    at com.mchange.v2.resourcepool.BasicResourcePool.access$800(BasicResourcePool.java:32)
    at com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask.run(BasicResourcePool.java:1810)
    at com.mchange.v2.async.ThreadPerTaskAsynchronousRunner$TaskThread.run(ThreadPerTaskAsynchronousRunner.java:255)
09:48:56,646 WARN [BasicResourcePool] com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask@af8cbc -- Acquisition Attempt Failed!!! Clearing pending acquires. While trying to acquire a needed new resource, we failed to succeed more than the maximum number of allowed acquisition attempts (30). Last acquisition attempt exception:
org.postgresql.util.PSQLException: FATAL: sorry, too many clients already
    at org.postgresql.core.v3.ConnectionFactoryImpl.readStartupMessages(ConnectionFactoryImpl.java:443)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:98)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:65)
    at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:116)
    at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:30)
    at org.postgresql.jdbc3.Jdbc3Connection.<init>(Jdbc3Connection.java:24)
    at org.postgresql.Driver.makeConnection(Driver.java:369)
    at org.postgresql.Driver.connect(Driver.java:245)
    at com.mchange.v2.c3p0.DriverManagerDataSource.getConnection(DriverManagerDataSource.java:135)
    at com.mchange.v2.c3p0.WrapperConnectionPoolDataSource.getPooledConnection(WrapperConnectionPoolDataSource.java:182)
    at com.mchange.v2.c3p0.WrapperConnectionPoolDataSource.getPooledConnection(WrapperConnectionPoolDataSource.java:171)
    at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool$1PooledConnectionResourcePoolManager.acquireResource(C3P0PooledConnectionPool.java:137)
    at com.mchange.v2.resourcepool.BasicResourcePool.doAcquire(BasicResourcePool.java:1014)
    at com.mchange.v2.resourcepool.BasicResourcePool.access$800(BasicResourcePool.java:32)
    at com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask.run(BasicResourcePool.java:1810)
    at com.mchange.v2.async.ThreadPerTaskAsynchronousRunner$TaskThread.run(ThreadPerTaskAsynchronousRunner.java:255)
09:49:00,013 INFO [ServiceHandler] Meth: doWork
09:49:00,014 INFO [ServiceHandler] Args: [null, null, ome.tools.spring.InternalServiceFactory@d2a0dc]
09:49:00,058 INFO [EventHandler] Auth: user=0,group=0,event=12231(FullText)
09:49:00,087 INFO [ServiceHandler] Rslt: null
...
09:52:23,411 INFO [TransactionManagerService] Stopping recovery manager 09:52:23,474 INFO [Server] Shutdown complete Shutdown complete Halting VM
comment:19 Changed 16 years ago by jmoore
Confirmed in blitz, both with and without the indexing thread running.
Adding the following c3p0.properties allows scaling to hundreds of simultaneous IAdmin.getEventContext() calls without the need for kill -9: up to 500 from OmeroPy, at which point ConnectionLost exceptions start to be thrown, and above 200 (runs with 250, 350, and 400 threads passed) for RMI/Java before some currently unknown failure occurs:
c3p0.initialPoolSize=30
c3p0.minPoolSize=30
c3p0.maxPoolSize=63
Note: 63 is one less than the maximum number of connections (max_connections) allowed by my PostgreSQL installation.
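The sizing rule implied above — the pool must stay strictly below the database's connection cap, with headroom for other clients — can be sketched as a simple sanity check. This is illustrative only (the class name, the headroom value, and the hypothetical max_connections of 64 are assumptions, not OMERO code):

```java
// Sketch: validate c3p0-style pool sizes against a PostgreSQL
// max_connections limit. Values are illustrative, not OMERO defaults.
public class PoolConfigCheck {
    public static void main(String[] args) {
        int pgMaxConnections = 64;   // hypothetical postgresql.conf max_connections
        int reservedForOthers = 1;   // headroom for psql/superuser sessions
        int initialPoolSize = 30;
        int minPoolSize = 30;
        int maxPoolSize = 63;

        // If the pool can open more connections than the server allows, every
        // further attempt fails with "FATAL: sorry, too many clients already".
        if (maxPoolSize > pgMaxConnections - reservedForOthers) {
            throw new IllegalStateException("maxPoolSize " + maxPoolSize
                    + " exceeds available connections "
                    + (pgMaxConnections - reservedForOthers));
        }
        if (minPoolSize > maxPoolSize || initialPoolSize > maxPoolSize) {
            throw new IllegalStateException("min/initial pool size exceeds maxPoolSize");
        }
        System.out.println("pool config ok: max=" + maxPoolSize
                + " of " + pgMaxConnections + " server connections");
    }
}
```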
comment:20 Changed 16 years ago by jmoore
Upgrading to c3p0 0.9.1.2 as suggested in HHH-3189 doesn't seem to significantly increase the number of threads possible without c3p0.properties (<60), but the number of threads with c3p0.properties and version 0.9.1.2 is slightly higher: in Java, perhaps as many as 500 (never more) when warm, though these figures fluctuate widely.
comment:21 Changed 16 years ago by jmoore
r2373 has an initial commit with c3p0 changes.
Beginning to outline how to configure OMERO for numerous concurrent threads under wiki:ScalingOmero.
comment:22 Changed 16 years ago by jmoore
- Cc atarkowska added
How are we looking, everyone? Any more freezes? Ola hit another issue where the settings in c3p0.properties caused "too many clients" exceptions; they probably need to be tuned.
comment:23 Changed 16 years ago by jburel
I have not seen the problem since the modifications.
comment:24 Changed 16 years ago by jmoore
- Resolution set to fixed
- Status changed from assigned to closed
Closing, c3p0.properties configuration will be handled elsewhere.
comment:25 Changed 16 years ago by jmoore
- Resolution fixed deleted
- Status changed from closed to reopened
Reopening, since it looks like the current configuration for c3p0:
- assumes too many connections for many users
- would eventually still lead to a server freeze
comment:26 Changed 16 years ago by jmoore
r2451 changes the etc/c3p0.properties to better handle scaling. With a maxPoolSize of just 20 (the default), I can scale on my box to 300+ threads. With maxPoolSize at 50, I can scale to 1000+ threads (with a connectionTimeout of 5 seconds and a test time of 30 seconds).
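The effect of a bounded pool plus a checkout timeout can be sketched with a plain Semaphore standing in for c3p0: excess threads fail fast after the timeout instead of blocking forever, so the server never freezes. The class and the numbers below are illustrative assumptions, not c3p0 internals:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Sketch: a bounded "pool" where checkout fails fast after a timeout,
// standing in for c3p0's maxPoolSize plus a checkout timeout.
public class FailFastPool {
    private final Semaphore permits;

    FailFastPool(int maxPoolSize) {
        permits = new Semaphore(maxPoolSize, true); // fair: FIFO wait queue
    }

    /** Returns true if a connection slot was acquired within the timeout. */
    boolean checkout(long timeout, TimeUnit unit) throws InterruptedException {
        return permits.tryAcquire(timeout, unit);
    }

    void release() {
        permits.release();
    }

    public static void main(String[] args) throws InterruptedException {
        FailFastPool pool = new FailFastPool(2);
        boolean a = pool.checkout(1, TimeUnit.MILLISECONDS);  // slot 1 free
        boolean b = pool.checkout(1, TimeUnit.MILLISECONDS);  // slot 2 free
        boolean c = pool.checkout(50, TimeUnit.MILLISECONDS); // pool exhausted: times out
        System.out.println(a + " " + b + " " + c);
        pool.release(); // one slot returned
        System.out.println(pool.checkout(1, TimeUnit.MILLISECONDS));
    }
}
```

The key design point is that a timed-out checkout surfaces as an error to one caller rather than a permanent block on all callers, which is what the original freeze looked like.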
The commit also includes a "weakening" of the lock on FullTextThread.stop(): if the server is going down, there's no reason to wait more than a minute or two before killing the thread.
Chris, let me know how testing goes.
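The "weakening" idea can be illustrated with a bounded lock wait: stop() gives up after a timeout rather than blocking shutdown indefinitely behind a stuck indexing run. This is a sketch under stated assumptions (the class, method shapes, and millisecond timings are hypothetical, not the FullTextThread source; the real timeout would be on the order of minutes):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: a stop() that waits a bounded time for the worker's lock instead
// of blocking shutdown forever (illustrative, not the OMERO source).
public class BoundedStop {
    private final ReentrantLock running = new ReentrantLock();

    /** Tries to stop cleanly; gives up after the timeout so shutdown can proceed. */
    boolean stop(long timeout, TimeUnit unit) throws InterruptedException {
        if (running.tryLock(timeout, unit)) {
            try {
                return true; // worker idle: clean stop
            } finally {
                running.unlock();
            }
        }
        return false; // worker still busy: caller may interrupt/kill the thread
    }

    public static void main(String[] args) throws Exception {
        BoundedStop worker = new BoundedStop();
        // Simulate a stuck indexing run holding the lock from another thread.
        Thread stuck = new Thread(() -> {
            worker.running.lock();
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                worker.running.unlock();
            }
        });
        stuck.start();
        Thread.sleep(50); // let the stuck thread take the lock first
        System.out.println("clean stop: " + worker.stop(100, TimeUnit.MILLISECONDS));
        stuck.join(); // stuck worker finishes and releases the lock
        System.out.println("clean stop: " + worker.stop(100, TimeUnit.MILLISECONDS));
    }
}
```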
comment:27 Changed 16 years ago by jmoore
Also for reference:
http://www.mchange.com/projects/c3p0/index.html#configuration_properties
comment:28 Changed 16 years ago by jmoore
See also #996
comment:29 Changed 16 years ago by jmoore
- Resolution set to fixed
- Status changed from reopened to closed
Sorry, I pressed the submit button by mistake.