Skip Menu | Logged in as guest | Logout
 
Ticket metadata
Id: 4472
Status: resolved
Priority: 9/0
Queue: vdt-support

Fixed in: 1.10.1s
Fix scheduled: CUR

Owner: Tim Cartwright
Requestors: weigand@fnal.gov
Cc:
AdminCc:

New reminder:

Created: Tue Dec 02 10:45:04 2008
Starts: Not set
Started: Tue Dec 02 15:30:54 2008
Last Contact: Tue Jan 13 15:08:01 2009
Due: Not set
Closed: Tue Feb 03 09:31:00 2009
Updated: Tue Feb 03 09:31:03 2009 by cat



History Brief headersFull headers
Subject: VOMS server 1.8.8-2 stops functioning after 8 hours of inactivity
Date: Tue, 02 Dec 2008 10:40:37 -0600
To: vdt-support <vdt-support@OPENSCIENCEGRID.ORG>
From: John Weigand <weigand@fnal.gov>
Download (untitled) / with headers
text/plain 5.3k
VDT 1.10.1n

VOMS 1.8.8-2
MySQL 5.0.67

If there has been no activity against the VOMS proxy server (/usr/local/osg-voms/glite/sbin/edg-voms) for 8+ hours, the server stops processing proxy requests.

The server has to be re-cycled to become functional again.

For a large VO, this is not likely a problem. However for a small VO, 8 hours seems rather short.

From what I have read, the cause appears to be a timeout on the MySql connection (8 hours is the default). I believe the VOMS server uses hibernate and c3p0 to manage connection sessions. For what it is worth, these are some links I have found on this connection timeout problem.

http://www.codefin.net/2007/05/hibernate-and-mysql-connection-timeouts.html
http://onzoob.com/archive/read/troubleshooting-c3p0-connection-pool

At the back end of this is a voms-proxy-init that fails due to the connection problem. I have also listed the log entries for a good proxy and failed proxy since there is no real indication in the log that there is a problem.

One confusing aspect of logs is that there are 2 lines showing "... LOG_ERROR:STARTUP ..." which appear on both a good and failed proxy request.


John Weigand



############################################################
This is the failure that occurred after 9 hours of inactivty...

voms-proxy-init --voms oiv_test1
Enter GRID pass phrase:
Your identity: /DC=org/DC=doegrids/OU=People/CN=John Weigand 458491
Creating temporary proxy ............................................. Done
Contacting cms-xen3.fnal.gov:15000
[/DC=org/DC=doegrids/OU=Services/CN=http/cms-xen3.fnal.gov]
"oiv_test1"gss_assist_get_unwrap failure:
globus_gss_assist token :3: failure: Connection closed
Failed

Error: GSS authentication failure
globus_gss_assist token :3: failure: Connection closed
None of the contacted servers for oiv_test1 were capable
of returning a valid AC for the user.

###########################################################
/usr/local/osg-voms/glite/log/oiv_test1 log file show no error on failure.

==== GOOD ( after 8 hours of inactivity ) ===============================
Mon Dec 1 19:46:42 2008:cms-xen3.fnal.gov:vomsd[26644]:
msg="LOG_INFO:REQUEST:Listen (Server.cpp:361):Received connection from:
131.225.204.129:30143."
Mon Dec 1 19:46:42 2008:cms-xen3.fnal.gov:vomsd[26644]:
msg="LOG_INFO:REQUEST:Run (vomsd.cc:600):Starting Executor with pid = 1418"
Mon Dec 1 19:46:43 2008:cms-xen3.fnal.gov:vomsd[1418]:
msg="LOG_INFO:REQUEST:Run (vomsd.cc:612):Self :
/DC=org/DC=doegrids/OU=Services/CN=http/cms-xen3.fnal.gov"
Mon Dec 1 19:46:43 2008:cms-xen3.fnal.gov:vomsd[1418]:
msg="LOG_INFO:REQUEST:Run (vomsd.cc:613):Self CA :
/DC=org/DC=DOEGrids/OU=Certificate Authorities/CN=DOEGrids CA 1"
Mon Dec 1 19:46:43 2008:cms-xen3.fnal.gov:vomsd[1418]:
msg="LOG_INFO:REQUEST:Run (vomsd.cc:620):At: Mon Dec 1 19:46:43 2008.
Received Contact :"
Mon Dec 1 19:46:43 2008:cms-xen3.fnal.gov:vomsd[1418]:
msg="LOG_INFO:REQUEST:Run (vomsd.cc:621): user:
/DC=org/DC=doegrids/OU=Services/CN=cms-xen3.fnal.gov"
Mon Dec 1 19:46:43 2008:cms-xen3.fnal.gov:vomsd[1418]:
msg="LOG_INFO:REQUEST:Run (vomsd.cc:622): ca :
/DC=org/DC=DOEGrids/OU=Certificate Authorities/CN=DOEGrids CA 1"
Mon Dec 1 19:46:43 2008:cms-xen3.fnal.gov:vomsd[1418]:
msg="LOG_INFO:REQUEST:Run (vomsd.cc:623): serial: 5F11"
Mon Dec 1 19:46:43 2008:cms-xen3.fnal.gov:vomsd[1418]:
msg="LOG_ERROR:STARTUP:my_recv (globuswrap.c:112):trueres = 1."
Mon Dec 1 19:46:43 2008:cms-xen3.fnal.gov:vomsd[1418]:
msg="LOG_ERROR:STARTUP:my_recv (globuswrap.c:112):trueres = 1."
Mon Dec 1 19:46:43 2008:cms-xen3.fnal.gov:vomsd[1418]:
msg="LOG_INFO:REQUEST:Execute (vomsd.cc:737):Userid = "10""
Mon Dec 1 19:46:43 2008:cms-xen3.fnal.gov:vomsd[1418]:
msg="LOG_INFO:REQUEST:Execute (vomsd.cc:753):Next command : G/oiv_test1"
Mon Dec 1 19:46:43 2008:cms-xen3.fnal.gov:vomsd[1418]:
msg="LOG_INFO:RESULT:Execute (vomsd.cc:1022):Request Result:
/oiv_test1/Role=NULL/Capability=NULL"

==== FAILED ( after 9 hours of inactivity) ==============================
Tue Dec 2 04:46:40 2008:cms-xen3.fnal.gov:vomsd[26644]:
msg="LOG_INFO:REQUEST:Listen (Server.cpp:361):Received connection from:
131.225.204.129:1743."
Tue Dec 2 04:46:40 2008:cms-xen3.fnal.gov:vomsd[26644]:
msg="LOG_INFO:REQUEST:Run (vomsd.cc:600):Starting Executor with pid = 7067"
Tue Dec 2 04:46:41 2008:cms-xen3.fnal.gov:vomsd[7067]:
msg="LOG_INFO:REQUEST:Run (vomsd.cc:612):Self :
/DC=org/DC=doegrids/OU=Services/CN=http/cms-xen3.fnal.gov"
Tue Dec 2 04:46:41 2008:cms-xen3.fnal.gov:vomsd[7067]:
msg="LOG_INFO:REQUEST:Run (vomsd.cc:613):Self CA :
/DC=org/DC=DOEGrids/OU=Certificate Authorities/CN=DOEGrids CA 1"
Tue Dec 2 04:46:41 2008:cms-xen3.fnal.gov:vomsd[7067]:
msg="LOG_INFO:REQUEST:Run (vomsd.cc:620):At: Tue Dec 2 04:46:41 2008.
Received Contact :"
Tue Dec 2 04:46:41 2008:cms-xen3.fnal.gov:vomsd[7067]:
msg="LOG_INFO:REQUEST:Run (vomsd.cc:621): user:
/DC=org/DC=doegrids/OU=Services/CN=cms-xen3.fnal.gov"
Tue Dec 2 04:46:41 2008:cms-xen3.fnal.gov:vomsd[7067]:
msg="LOG_INFO:REQUEST:Run (vomsd.cc:622): ca :
/DC=org/DC=DOEGrids/OU=Certificate Authorities/CN=DOEGrids CA 1"
Tue Dec 2 04:46:41 2008:cms-xen3.fnal.gov:vomsd[7067]:
msg="LOG_INFO:REQUEST:Run (vomsd.cc:623): serial: 5F11"
Tue Dec 2 04:46:41 2008:cms-xen3.fnal.gov:vomsd[7067]:
msg="LOG_ERROR:STARTUP:my_recv (globuswrap.c:112):trueres = 1."
Tue Dec 2 04:46:41 2008:cms-xen3.fnal.gov:vomsd[7067]:
msg="LOG_ERROR:STARTUP:my_recv (globuswrap.c:112):trueres = 1."
John:

> If there has been no activity against the VOMS proxy server for 8+ hours, the
> server stops processing proxy requests. The server has to be re-cycled to
> become functional again.

Wow, that's pretty nasty! I will report this one to Vincenzo and ask him to
debug it.

One comment though:

> [...] the cause appears to be a timeout on the MySQL connection (8 hours is
> the default). I believe the VOMS server uses hibernate and c3p0 to manage
> connection sessions. For what it is worth, these are some links I have found
> on this connection timeout problem.

This is not quite accurate. Hibernate and c3p0 are Java database layers that
provide developers with higher-level data abstractions. The VOMS server is a
C/C++ program, not Java. I believe you are thinking of VOMS Admin, which is a
Java web application, but VOMS Admin is not involved in handling VOMS client
(voms-proxy-init) requests.

Now, perhaps there is a similar timeout issue happening on the MySQL client API
level, it would just be happening in C, not Java.

Anyway, I will pass this along to Vincenzo and keep you posted.

-- Tim
Download (untitled) / with headers
text/plain 138b
John:

I have filed the VOMS bug in the LCG Savannah. You can track that ticket at:

https://savannah.cern.ch/bugs/index.php?44936

-- Tim
Download (untitled) / with headers
text/plain 630b
John:

This is Vincenzo's response:

> The interesting part here is this:

>> Error: GSS authentication failure
>> globus_gss_assist token :3: failure: Connection closed

> What is the status of certificates/CRL on both machines?

> Also, note that if there is a problem with GSS, nothing should appear on the
> logs, since voms-proxy-init fails before getting authenticated.

> Is this reproducible on the usual machine?

> I tried on mine, and this problem does not seem to be there.

Can you please comment on the state of certificates and CRLs on your test
machine? Also, have you reproduced this behavior reliably?

-- Tim
Subject: Re: [vdt-support #4472] VOMS server 1.8.8-2 stops functioning after 8 hours of inactivity
Date: Thu, 04 Dec 2008 11:35:02 -0600
To: vdt-support@OPENSCIENCEGRID.ORG
From: John Weigand <weigand@fnal.gov>
Download (untitled) / with headers
text/plain 5.9k
embedded....

John Weigand




Tim Cartwright via RT wrote:
> John:
>
> This is Vincenzo's response:
>
>
>> The interesting part here is this:
>>
>
>
>>> Error: GSS authentication failure
>>> globus_gss_assist token :3: failure: Connection closed
>>>
>
>
>> What is the status of certificates/CRL on both machines?
>>
Nothing is wrong with either the CRLs or certificates on the machines.

If that were the case, I would not think that re-cycling the voms proxy
server would have any effect.

The test was run with the client (voms-proxy-init) and proxy server
(edg-voms) on the same machine.
>> Also, note that if there is a problem with GSS, nothing should appear on the
>> logs, since voms-proxy-init fails before getting authenticated.
>>
The GSS error is on the client side and obviously to stdout/err not to
any logs file.

I am not saying that there is a problem with GSS. That is what the
voms-proxy-init is saying.

The purpose of showing the server logs is to indicate that the proxy
request is being sent to the proxy server. It is doing some processing,
however it just stops at that point. The first log shows a good one.
The 2nd set shows how it just stops. What or if it actually sends
anything back is a personal thing between the server and client that I
have no clue about.

The output of the voms-proxy-init says it is contacting the server. And
the server log I show indicates it has sent something.

One guess is that this may be a timeout (very short at that) and the
actual error message is misleading.
>
>
>> Is this reproducible on the usual machine?
>>
!00% reproducible. I did it over several days. However, you do have to
wait 8 hours for the connections to drop.

>
>
>> I tried on mine, and this problem does not seem to be there.
>>
That is totally baffling. I have to 2 test nodes with VOMS on them and
both have the same problem. I just ran against the 2nd one and failed.
Once I recycled the voms server, I was successful. And, just in case
the issue of being on the same node for the client and server comes up
next. This test was done from a different node as the server.

Are you sure you are waiting 8 hours (or does your my.cnf set the
default longer?
Are you sure there is no activity for that period?
>
> Can you please comment on the state of certificates and CRLs on your test
> machine? Also, have you reproduced this behavior reliably?
>
> -- Tim
>
>
I also ran a script against the mysql database doing a 'show proceslist'
to check on connections. It ran every hour and you can see the
connections drop between 22:13 and 23:13 for the 3 test VOs my VOMS has.


###################################################
# Tue Dec 2 22:13:01 CST 2008
#--------------------------------------------------

Id User Host db Command Time State Info
1 oiv_test1_adm localhost voms_oiv_test1 Sleep
27582 NULL
2 oiv_test2_adm localhost voms_oiv_test2 Sleep
27688 NULL
3 oiv_test3_adm localhost voms_oiv_test3 Sleep
27687 NULL
3355 oiv_test1_adm localhost:33943 voms_oiv_test1 Sleep
83 NULL
3358 oiv_test2_adm localhost:33946 voms_oiv_test2 Sleep
70 NULL
3363 oiv_test3_adm localhost:33951 voms_oiv_test3 Sleep
64 NULL
3400 oiv_test3_adm localhost:33988 voms_oiv_test3 Sleep
93 NULL
3401 oiv_test1_adm localhost:33989 voms_oiv_test1 Sleep
14 NULL
3402 oiv_test1_adm localhost:33990 voms_oiv_test1 Sleep
14 NULL
3403 oiv_test3_adm localhost:33991 voms_oiv_test3 Sleep
85 NULL
3404 oiv_test2_adm localhost:33992 voms_oiv_test2 Sleep
2 NULL
3405 oiv_test2_adm localhost:33993 voms_oiv_test2 Sleep
2 NULL
3406 oiv_test3_adm localhost:33994 voms_oiv_test3 Sleep
72 NULL
3407 oiv_test3_adm localhost:33995 voms_oiv_test3 Sleep
72 NULL
3408 oiv_test2_adm localhost:33996 voms_oiv_test2 Sleep
2 NULL
3409 oiv_test1_adm localhost:33997 voms_oiv_test1 Sleep
14 NULL
3410 oiv_test1_adm localhost:33998 voms_oiv_test1 Sleep
14 NULL
3411 oiv_test2_adm localhost:33999 voms_oiv_test2 Sleep
2 NULL
3412 root localhost NULL Query 0 NULL show
processlist

###################################################
# Tue Dec 2 23:13:01 CST 2008
#--------------------------------------------------

Id User Host db Command Time State Info
3801 oiv_test1_adm localhost:34388 voms_oiv_test1 Sleep
83 NULL
3805 oiv_test2_adm localhost:34392 voms_oiv_test2 Sleep
70 NULL
3807 oiv_test3_adm localhost:34394 voms_oiv_test3 Sleep
64 NULL
3844 oiv_test3_adm localhost:34431 voms_oiv_test3 Sleep
92 NULL
3845 oiv_test1_adm localhost:34432 voms_oiv_test1 Sleep
14 NULL
3846 oiv_test3_adm localhost:34433 voms_oiv_test3 Sleep
82 NULL
3847 oiv_test2_adm localhost:34434 voms_oiv_test2 Sleep
2 NULL
3848 oiv_test3_adm localhost:34435 voms_oiv_test3 Sleep
69 NULL
3849 oiv_test3_adm localhost:34436 voms_oiv_test3 Sleep
69 NULL
3850 oiv_test2_adm localhost:34437 voms_oiv_test2 Sleep
2 NULL
3851 oiv_test2_adm localhost:34438 voms_oiv_test2 Sleep
2 NULL
3852 oiv_test1_adm localhost:34439 voms_oiv_test1 Sleep
14 NULL
3853 oiv_test1_adm localhost:34440 voms_oiv_test1 Sleep
14 NULL
3854 oiv_test1_adm localhost:34441 voms_oiv_test1 Sleep
14 NULL
3855 oiv_test2_adm localhost:34442 voms_oiv_test2 Sleep
2 NULL
3856 root localhost NULL Query 0 NULL show
processlist
Subject: Re: [vdt-support #4472] VOMS server 1.8.8-2 stops functioning after 8 hours of inactivity
Date: Thu, 04 Dec 2008 11:48:03 -0600
To: vdt-support@OPENSCIENCEGRID.ORG
From: Alain Roy <roy@cs.wisc.edu>
Download (untitled) / with headers
text/plain 202b
Tim--

How hard would it be for you to replicate this problem here on gk1?
Install VOMS today, and see if it's failing tomorrow? We could then
give Vincenzo access to the computer and logs.

-alain
Subject: [vdt-support #4472] SVN commit, rev 8437
To: vdt-support@cs.wisc.edu
From: cat@cs.wisc.edu
Download (untitled) / with headers
text/plain 228b
Commit comment:
Creating a new branch to test Vincenzo's fix to the VOMS-MySQL timeout problem.


Changed files:
A vdt/branches/vdt-1.10.1-voms-timeout/

To generate a diff:
svn diff -c 8437 file:///p/condor/workspaces/vdt/svn
Subject: [vdt-support #4472] SVN commit, rev 8438
To: vdt-support@cs.wisc.edu
From: cat@cs.wisc.edu
Download (untitled) / with headers
text/plain 202b
Commit comment:
Force test cache to cat-voms-timeout for testing.


Changed files:
U vdt/branches/vdt-1.10.1-voms-timeout/defs

To generate a diff:
svn diff -c 8438 file:///p/condor/workspaces/vdt/svn
Subject: [vdt-support #4472] SVN commit, rev 8468
To: vdt-support@cs.wisc.edu
From: cat@cs.wisc.edu
Download (untitled) / with headers
text/plain 321b
Commit comment:
Two small changes to make a new build that includes Vincenzo's working CVS tag
with the voms-mysql fix.


Changed files:
U vdt/branches/vdt-1.10.1-voms-timeout/VOMS/nmi/voms-mysql.cvs
U vdt/branches/vdt-1.10.1-voms-timeout/defs

To generate a diff:
svn diff -c 8468 file:///p/condor/workspaces/vdt/svn
CC: Tim Cartwright <cat@cs.wisc.edu>
Subject: Re: [vdt-support #4472] VOMS server 1.8.8-2 stops functioning after 8 hours of inactivity
Date: Fri, 12 Dec 2008 10:13:44 -0600
To: vdt-support@OPENSCIENCEGRID.ORG
From: John Weigand <weigand@fnal.gov>
Download (untitled) / with headers
text/plain 756b
This issued appears resolved.

I installed yesterday from the
http://vdt.cs.wisc.edu/test-cache/cat-voms-timeout cache at 14:27 and
tested this morning at 09:45.

Before and after running the tests, I viewed the connections in the
MySql database. It appears the solution was to remove the persistent
connection and open a database connection with each request. This would
definitely eliminate any connection timeout issues.

Performance degradation may or may not be a problem with this solution
as it would depend on the proxy request rate and the number of VOMS
servers using the same database as they do at FNAL. This I can make no
judgment on, but do not feel it will be an issue (hmmmm.... I guess I
just made a judgment).


John Weigand
CC: vdt-support@OPENSCIENCEGRID.ORG
Subject: Re: [vdt-support #4472] VOMS server 1.8.8-2 stops functioning after 8 hours of inactivity
Date: Fri, 12 Dec 2008 10:33:10 -0600
To: weigand@fnal.gov
From: Tim Cartwright <cat@cs.wisc.edu>
Download (untitled) / with headers
text/plain 995b
John, thanks for the update, just about to do my own test. But please
file this update in the Savannah ticket! That's where it needs to be
seen.

-- Tim

On 12 Dec 2008, at 10:13, John Weigand wrote:

> This issued appears resolved.
>
> I installed yesterday from the http://vdt.cs.wisc.edu/test-cache/cat-voms-timeout
> cache at 14:27 and tested this morning at 09:45.
>
> Before and after running the tests, I viewed the connections in the
> MySql database. It appears the solution was to remove the persistent
> connection and open a database connection with each request. This
> would definitely eliminate any connection timeout issues.
> Performance degradation may or may not be a problem with this
> solution as it would depend on the proxy request rate and the number
> of VOMS servers using the same database as they do at FNAL. This I
> can make no judgment on, but do not feel it will be an issue
> (hmmmm.... I guess I just made a judgment).
>
>
> John Weigand
Subject: [vdt-support #4472] SVN commit, rev 8595
To: vdt-support@cs.wisc.edu
From: cat@cs.wisc.edu
Download (untitled) / with headers
text/plain 215b
Commit comment:
Switched to final build tag for voms-mysql.


Changed files:
U vdt/branches/vdt-1.10.1-voms-timeout/VOMS/nmi/voms-mysql.cvs

To generate a diff:
svn diff -c 8595 file:///p/condor/workspaces/vdt/svn
Subject: [vdt-support #4472] SVN commit, rev 8596
To: vdt-support@cs.wisc.edu
From: cat@cs.wisc.edu
Download (untitled) / with headers
text/plain 210b
Commit comment:
Tag for first merge from this branch to vdt-1.10.1 branch.


Changed files:
A vdt/tags/vdt-1.10.1-voms-timeout-merge1/

To generate a diff:
svn diff -c 8596 file:///p/condor/workspaces/vdt/svn
Subject: [vdt-support #4472] SVN commit, rev 8597
To: vdt-support@cs.wisc.edu
From: cat@cs.wisc.edu
Download (untitled) / with headers
text/plain 437b
Commit comment:
Merge from vdt-1.10.1-voms-timeout branch:

svn merge -r 8437:8596 $SVN/vdt/branches/vdt-1.10.1-voms-timeout .

Revision 8596 = tags/vdt-1.10.1-voms-timeout-merge1.

Updated to VOMS-MySQL 3.1.0, which fixes the timeout problem; see RT #4472.


Changed files:
U vdt/branches/vdt-1.10.1/VOMS/nmi/voms-mysql.cvs
U vdt/branches/vdt-1.10.1/defs

To generate a diff:
svn diff -c 8597 file:///p/condor/workspaces/vdt/svn
Subject: [vdt-support #4472] New VOMS build ready for test
Date: Tue, 13 Jan 2009 15:06:45 -0600
To: vdt-support@OPENSCIENCEGRID.ORG
From: Tim Cartwright <cat@cs.wisc.edu>
Download (untitled) / with headers
text/plain 751b
John:

You submitted a bug report starting with the following:

> If there has been no activity against the VOMS proxy server (/usr/
> local/osg-voms/glite/sbin/edg-voms) for 8+ hours, the server stops
> processing proxy requests.

We recently got a final CVS tag from Vincenzo for this fix, and I now
have a build in place, ready to release. If you would like to test
it, please use the 1.10.99 cache:

pacman -get http://vdt.cs.wisc.edu/vdt_11099_cache:VOMS

We'll be running our automated tests on this new VOMS build, but we
don't have the facilities to test the timeout issue. Therefore, I
have done a manual test to confirm the fix.

If all goes well, this change will be released later this week or
early next week.

-- Tim
Subject: Re: [vdt-support #4472] New VOMS build ready for test
Date: Tue, 13 Jan 2009 15:16:27 -0600
To: vdt-support@OPENSCIENCEGRID.ORG
From: John Weigand <weigand@fnal.gov>
Download (untitled) / with headers
text/plain 851b
Will do

john

Tim Cartwright via RT wrote:
> John:
>
> You submitted a bug report starting with the following:
>
>
>> If there has been no activity against the VOMS proxy server (/usr/
>> local/osg-voms/glite/sbin/edg-voms) for 8+ hours, the server stops
>> processing proxy requests.
>>
>
> We recently got a final CVS tag from Vincenzo for this fix, and I now
> have a build in place, ready to release. If you would like to test
> it, please use the 1.10.99 cache:
>
> pacman -get http://vdt.cs.wisc.edu/vdt_11099_cache:VOMS
>
> We'll be running our automated tests on this new VOMS build, but we
> don't have the facilities to test the timeout issue. Therefore, I
> have done a manual test to confirm the fix.
>
> If all goes well, this change will be released later this week or
> early next week.
>
> -- Tim
>
>
>
Subject: [vdt-support #4472] SVN commit, rev 8598
To: vdt-support@cs.wisc.edu
From: cat@cs.wisc.edu
Download (untitled) / with headers
text/plain 241b
Commit comment:
Turned on 1.10.99 testing again, because the latest VOMS build needs testing
before release.


Changed files:
U tests/trunk/test-scripts/tests-to-run

To generate a diff:
svn diff -c 8598 file:///p/condor/workspaces/vdt/svn