| # | Tue Aug 21 14:39:37 2007 | osg@tick-indy.globalnoc.iu.edu - Ticket created | [Reply] | |||||||||
When replying, type your text above this line.
----------------------------------------------
This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 08/21/2007 at 19:35:17 with the following information:

FootPrints Ticket Description:

VDT Support,

Please respond to the following issue.

Thank You,
Tim Silvers
OSG Grid Operations Center
goc@opensciencegrid.org, 317-278-9699
web: http://www.opensciencegrid.org
rss: http://www.grid.iu.edu/news/

From: hs@nhn.ou.edu
Subject: grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working
Date: August 21, 2007 3:28:47 PM EDT
To: goc@opensciencegrid.org
Cc: adesmet@cs.wisc.edu
Reply-To: hs@nhn.ou.edu

Hi all,

GridEx jobs have just been resumed on OUHEP_ITB (osgitb1.nhn.ou.edu), and immediately the load went through the roof again and is now at 7.

And when I checked, I saw that the grid_manager_monitor is apparently still not working, since there is no such process, but rather one globus-job-manager process for each submitted GridEx job.

Can we please get this resolved? It's been like that since at least June, and we need this testbed for real testing and can't afford to have it bogged down with GridEx jobs like this.

Thanks,
Horst

Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT
Status: Support Agency
Originating VO Support Center: DOSAR
Destination VO Support Center: VDT
Originating Ticket Number:
Destination Ticket Number:

Thank You,
OSG Grid Operations Center
goc@opensciencegrid.org, 317-278-9699
info: http://www.opensciencegrid.org
rss: http://www.grid.iu.edu/news/ |
||||||||||||
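The symptom reported above (no grid monitor process, one globus-job-manager per job) can be checked with a small process summary. This is a minimal sketch, not part of the original ticket: the function name `summarize_gram_processes` is hypothetical, and it assumes the process names match those quoted in this thread. It reads a process listing on standard input and looks only at the first two words of each command line, so that the agent is counted whether it runs directly or through an interpreter, and so that the awk process itself is not counted.

```shell
#!/bin/sh
# Hypothetical helper: summarise GRAM-related processes from a process listing.
# Input: one command line per row on stdin, e.g. the output of `ps -eo args`.
summarize_gram_processes() {
    # Only the first two words are inspected, so a pattern that appears
    # later in some other command's arguments is ignored.
    awk '{ cmd = $1 " " $2 }
         cmd ~ /globus-job-manager/         { jm++ }
         cmd ~ /grid_manager_monitor_agent/ { mon++ }
         END { printf "jobmanagers=%d monitor_agents=%d\n", jm, mon }'
}

# Example use on the gatekeeper (not executed here):
#   ps -eo args | summarize_gram_processes
```

Many jobmanagers with zero monitor agents would match the situation Horst describes; what counts as "many" depends on the site.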
| # | Tue Aug 21 14:43:51 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" was assigned to you via round-robin on 2007-08-21 at 19:41:04. [Duplicate message snipped] |
||||||||||||
| # | Tue Aug 21 14:43:57 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
Open Science Grid ticket 4004 was updated on 08/21/2007 at 19:41:08 with the following information:

Greetings from the VDT support system!

This message was generated automatically in response to the creation of a ticket regarding: Open Science Grid: grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not worki... ISSUE=4004 PROJ=71

Your original request is copied below for reference. There is no need to reply to this message right now -- someone from the VDT team will respond to you as soon as possible.

If you wish to view your support ticket online, visit: http://vdt.cs.wisc.edu/rt/Ticket/Display.html?user=guest&pass=guest&id=2922

Your ticket has been assigned an ID as follows: [vdt-support #2922]

Please include the ticket ID in the subject line of all future email about this issue. To do so, you may reply to this message.

Thank you for your interest in the VDT.

[Duplicate message snipped] |
||||||||||||
| # | Tue Aug 21 15:04:36 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
Open Science Grid ticket 4004 was updated on 08/21/2007 at 19:59:07 with the following information:

Hi again,

some more info which may have gotten lost since the last time we talked about this. So this is a RHEL5 machine with an osg-0.7.0 gatekeeper:

[hs@osgitb1 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Client release 5 (Tikanga)
[hs@osgitb1 ~]$ uname -a
Linux osgitb1.nhn.ou.edu 2.6.18-8.1.8.el5 #1 SMP Mon Jun 25 17:06:19 EDT 2007 i686 i686 i386 GNU/Linux
[hs@osgitb1 ~]$ vdt-version
You have installed a subset of VDT version 1.8.0d:
    Apache HTTPD 2.2.4
    gLite CEMon Server (INFN release from 2006-05-19, plus RAW dialect) 1.7.1
    CA Certificates v29 (includes IGTF 1.16 CAs)
    CONDOR-DEVEL (Not an official part of the VDT)
    EDG Make Gridmap 2.9.0
    Fetch CRL 2.6.2
    Generic Information Provider 1.0.15 (Iowa 15-Feb-2006)
    Globus Toolkit, pre web-services, client 4.0.5
    Globus Toolkit, pre web-services, server 4.0.5
    Globus Toolkit, web-services, client 4.0.5
    Globus Toolkit, web-services, server 4.0.5
    GLUE Schema 1.2 draft 7
    GPT 3.2
    Gratia Condor Probe 0.26.2b-1
    GRATIA_METRIC_PROBE (Not an official part of the VDT)
    Java SDK 1.4.2_14
    Java 5 SDK 1.5.0_12
    KX509 20031111
    Logrotate 3.7
    MonALISA 1.6.16
    MyProxy 3.9
    MySQL 4.1.22
    MySQL Connector/J 5.0.6
    Pegaus Worker Package 2.0.1
    PPDG Cert Scripts 2.4
    PRIMA Authorization Module 0.6
    RLS, client 3.0.041021
    SRM V1 Client 1.25
    SRM V2 Client 2.2.0.2
    syslog-ng 2.0.4
    Apache Tomcat 5.0.28
    UberFTP 1.24
    Wget 1.10.2

Please let me know if I can provide you with any other helpful information.

Thanks a lot,
Horst |
||||||||||||
| # | Thu Aug 23 08:44:46 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
Open Science Grid ticket 4004 was updated on 08/23/2007 at 13:26:08 with the following information:

Hi again,

looks like osgitb1 has been completely removed from the GridEx list now. This is not helpful, since we need to figure out why the grid monitor isn't working, not ignore it altogether. Can we please put it back in and try to solve this?

Thanks,
Horst

------------- Begin Forwarded Message -------------

Date: Thu, 23 Aug 2007 00:01:05 -0500
From: Grid Exerciser <grid-ex@cs.wisc.edu>
Subject: OSG-ITB Grid Exerciser Results (2007-08-23)
To: adesmet@cs.wisc.edu, jfrey@cs.wisc.edu, OSG-INT@OPENSCIENCEGRID.ORG
List-Owner: <mailto:OSG-INT-request@LISTSERV.FNAL.GOV>

Information on reading this report can be found at http://www.cs.wisc.edu/condor/tools/exerciser/reading_report.html
General information on the Grid Exerciser can be found at http://www.cs.wisc.edu/condor/tools/exerciser/

The maximum simultaneous jobs to any given site are currently throttled to 10.
Job duration: 900 sec
Maximum job duration for timeout: 10800 sec
Current run submitted at Wed Aug 22 10:30:42 2007.

Grid Exerciser (Experimental) Results for OSG-ITB
from Wed Aug 22 00:01:01 2007 through Thu Aug 23 00:01:01 2007
Report generated on Thu Aug 23 00:01:05 2007

Status Summary

Site                                         Simul  Submit  Rec'd  Timout   Done  Errors  Run Time
Default
citgrid3.cacr.caltech.edu/jobmanager-condor     10      10   1234       0      0    1235         0
cms-xen1.fnal.gov/jobmanager-condor             10      74     84       0     66      11        16
cms-xen9.fnal.gov/jobmanager-condor             10      71     81       0     61      10        15
cmsitbsrv01.fnal.gov/jobmanager-condor          10     443    443       0    433       0       114
cmssrv09.fnal.gov/jobmanager-condor             10     198    198       0    188       0        50
feynman.uits.iupui.edu/jobmanager-condor        10      10      0       0      0     460         0
fgitb-gk.fnal.gov/jobmanager-condor             10     454    454       0    444       0       117
gk.phys.sinica.edu.tw/jobmanager-condor         10      10     50      40      0       0         0
gridtest01.racf.bnl.gov/jobmanager-condor       10      10      0      40      0       0         0
grow-itb.its.uiowa.edu/jobmanager-pbs           10      10      0      40      0       0         0
ligo-itb.aset.psu.edu/jobmanager-pbs            10      22    176       0     12     361         3
osg-itb.ligo.caltech.edu/jobmanager-condor      10     452    452       0    442       0       115
osg-vtb.ligo.caltech.edu/jobmanager-condor      10      98    148       0     88      50        23
osggate.clemson.edu/jobmanager-condor           10     222    222       0    212       0        55
osp-vtb00.nersc.gov/jobmanager-sge              10      10      0       0      0    1280         0
pc1805.nersc.gov/jobmanager-sge                 10      10      0      40      0       0         0
pdsfgrid1/jobmanager-sge                        10      10      0      40      0       0         0
t2dev-osg.uchicago.edu/jobmanager-condor        10      10      0       0      0    1280         0
tb10.grid.iu.edu/jobmanager-condor              10      10      0       0      0     461         0
testwulf.hpcc.ttu.edu/jobmanager-pbs            10     222    222       0    212       0        55
GRAND TOTAL (20 sites)                         200    2356   3764     200   2158    5148       567

Globus Error Summary

Globus Error Codes:       7    17    79   121  ****  Failur
citgrid3.cacr.caltec      0  1235     0     0     0  100.0%
cms-xen1.fnal.gov/jo      1     0     0    10     0   14.3%
cms-xen9.fnal.gov/jo      0     0     0    10     0   14.1%
cmsitbsrv01.fnal.gov      0     0     0     0     0    0.0%
cmssrv09.fnal.gov/jo      0     0     0     0     0    0.0%
feynman.uits.iupui.e      0     0     0     0   460  100.0%
fgitb-gk.fnal.gov/jo      0     0     0     0     0    0.0%
gk.phys.sinica.edu.t      0     0     0     0     0
gridtest01.racf.bnl.      0     0     0     0     0
grow-itb.its.uiowa.e      0     0     0     0     0
ligo-itb.aset.psu.ed      0     0    81     0   280   96.8%
osg-itb.ligo.caltech      0     0     0     0     0    0.0%
osg-vtb.ligo.caltech      0    50     0     0     0   36.2%
osggate.clemson.edu/      0     0     0     0     0    0.0%
osp-vtb00.nersc.gov/   1280     0     0     0     0  100.0%
pc1805.nersc.gov/job      0     0     0     0     0
pdsfgrid1/jobmanager      0     0     0     0     0
t2dev-osg.uchicago.e   1280     0     0     0     0  100.0%
tb10.grid.iu.edu/job      0     0     0     0   461  100.0%
testwulf.hpcc.ttu.ed      0     0     0     0     0    0.0%
TOTALS:                2561  1285    81    20  1201
PERCENT OF ERRORS:     49.7  25.0   1.6   0.4  23.3

**** These errors were not Globus errors. See below for details.

Error Details

citgrid3.cacr.caltech.edu/jobmanager-condor
    1235 Globus error 17: the job failed when the job manager attempted to run it
cms-xen1.fnal.gov/jobmanager-condor
    1 Globus error 7: authentication with the remote server failed
    10 Globus error 121: the job state file doesn't exist
    10 Grid Resource Back Up
    10 Detected Down Globus Resource
cms-xen9.fnal.gov/jobmanager-condor
    10 Globus error 121: the job state file doesn't exist
    10 Grid Resource Back Up
    10 Detected Down Globus Resource
cmsitbsrv01.fnal.gov/jobmanager-condor
    No errors
cmssrv09.fnal.gov/jobmanager-condor
    No errors
feynman.uits.iupui.edu/jobmanager-condor
    460 Unspecified gridmanager error
fgitb-gk.fnal.gov/jobmanager-condor
    No errors
gk.phys.sinica.edu.tw/jobmanager-condor
    No errors
gridtest01.racf.bnl.gov/jobmanager-condor
    10 Detected Down Globus Resource
grow-itb.its.uiowa.edu/jobmanager-pbs
    10 Detected Down Globus Resource
ligo-itb.aset.psu.edu/jobmanager-pbs
    81 Globus error 79: connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ...
    280 Unspecified gridmanager error
osg-itb.ligo.caltech.edu/jobmanager-condor
    No errors
osg-vtb.ligo.caltech.edu/jobmanager-condor
    50 Globus error 17: the job failed when the job manager attempted to run it
osggate.clemson.edu/jobmanager-condor
    No errors
osp-vtb00.nersc.gov/jobmanager-sge
    1280 Globus error 7: authentication with the remote server failed
pc1805.nersc.gov/jobmanager-sge
    10 Detected Down Globus Resource
pdsfgrid1/jobmanager-sge
    10 Detected Down Globus Resource
t2dev-osg.uchicago.edu/jobmanager-condor
    1280 Globus error 7: authentication with the remote server failed
tb10.grid.iu.edu/jobmanager-condor
    461 Unspecified gridmanager error
testwulf.hpcc.ttu.edu/jobmanager-pbs
    No errors

This report took 0.1 minutes to generate

------------- End Forwarded Message ------------- |
||||||||||||
| # | Fri Aug 24 00:07:09 2007 | roy - Given to roy | ||
| # | Fri Aug 24 00:07:09 2007 | roy - Cc jfrey@cs.wisc.edu added | ||
| # | Fri Aug 24 00:10:05 2007 | roy - Correspondence added | [Reply] | |
|
> From: hs@nhn.ou.edu
> Subject: grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working
> Date: August 21, 2007 3:28:47 PM EDT
> To: goc@opensciencegrid.org
> Cc: adesmet@cs.wisc.edu
> Reply-To: hs@nhn.ou.edu
>
> Hi all,
>
> GridEx jobs have just been resumed on OUHEP_ITB (osgitb1.nhn.ou.edu),
> and immediately the load went through the roof again and is now at 7.
>
> And when I checked, I saw that the grid_manager_monitor is apparently
> still not working, since there is no such process, but rather
> one globus-job-manager process for each submitted GridEx job.
>
> Can we please get this resolved? It's been like that since at least June,
> and we need this testbed for real testing and can't afford to have it
> bogged down with GridEx jobs like this.
>
> Thanks,

I've added Jaime Frey to this ticket in the hope that he can help us debug the problem.

Jaime, you can view the full ticket (which has a bit more than this email) at:

<http://vdt.cs.wisc.edu/rt/index.html?user=guest&pass=guest&q=2922>

Could you give us some advice on how to debug why the grid monitor is not working for Horst's site? These are grid exerciser jobs, and those are definitely using the grid monitor, so this is a bit mysterious.

Jaime, don't feel that you need to handle the entire ticket: it is assigned to me. But if you have any advice on where to begin looking for the problem, that would be greatly appreciated.

Thanks,
-alain |
||||
| # | Fri Aug 24 00:10:07 2007 | RT_System - Status changed from 'new' to 'open' | ||
| # | Fri Aug 24 00:34:47 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
Open Science Grid ticket 4004 was updated on 08/24/2007 at 05:11:07. [Duplicate message snipped] |
||||||||||||
| # | Fri Aug 24 14:16:29 2007 | jfrey@cs.wisc.edu - Correspondence added | [Reply] | |||||||||||
On Aug 24, 2007, at 12:10 AM, Alain Roy via RT wrote:

>> GridEx jobs have just been resumed on OUHEP_ITB (osgitb1.nhn.ou.edu),
>> and immediately the load went through the roof again and is now at 7.
>>
>> And when I checked, I saw that the grid_manager_monitor is apparently
>> still not working, since there is no such process, but rather
>> one globus-job-manager process for each submitted GridEx job.
>>
>> Can we please get this resolved? It's been like that since at least June,
>> and we need this testbed for real testing and can't afford to have it
>> bogged down with GridEx jobs like this.
>>
>> Thanks,
>
> I've added Jaime Frey to this ticket in this hope that he can help us
> debug the problem.
>
> Jaime, you can view the full ticket (which has a bit more than this
> email) at:
>
> <http://vdt.cs.wisc.edu/rt/index.html?user=guest&pass=guest&q=2922>
>
> Could you give us some advice on how to debug why the grid monitor is
> not working for Horst's site? These are grid exerciser jobs, and those
> are definitely using the grid monitor, so this is a bit mysterious.
>
> Jaime, don't feel that you need to handle the entire ticket: it is
> assigned to me. But if you have any advice on where to begin looking for
> the problem, that would be greatly appreciated.

If the grid monitor fails to report the status of jobs, the Condor gridmanager will fall back to running a limited number of jobmanager processes (no more than 10 by default).

The best way to start debugging grid monitor problems is to run it from the command line. Here are instructions:

Run the following command, substituting as appropriate:

    globusrun -s -r <resource>/jobmanager-fork '&(executable=$(GLOBUSRUN_GASS_URL)/<condor path>/sbin/grid_monitor.sh)(arguments="--dest-url="#$(GLOBUSRUN_GASS_URL)#"/tmp/job_status")'

That should all be on one line. If it's working correctly, it should print out something like this:

    2006-01-31 16:21:17 OK:
    2006-01-31 16:21:17 INFO: Forced agent start
    2006-01-31 16:21:17 INFO: Starting grid_manager_monitor_agent
    2006-01-31 16:21:17 INFO: Started grid_manager_monitor_agent as /tmp/grid_manager_monitor_agent.jfrey.18795.1000, pid 18797
    2006-01-31 16:21:17 INFO: grid_manager_monitor_agent already running.

and continue to run, printing out an 'OK' line every minute. Then, /tmp/job_status should appear on your machine and contain something like this:

    1138746108
    1138746108
    https://nostos.cs.wisc.edu:43462/8588/1137692629/ 8
    https://nostos.cs.wisc.edu:43962/8760/1137693090/ 8
    GRIDMONEOF

The file should be replaced with a fresh version about every minute.

+--------------------------------+-----------------------------------+
| Jaime Frey                     | I used to be a heavy gambler.     |
+--------------------------------+-----------------------------------+
| jfrey@cs.wisc.edu              | But now I just make mental bets.  |
| http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind.        | |
||||||||||||||
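The freshness check Jaime describes (a complete file, replaced about every minute) can be automated. The following is a minimal sketch under stated assumptions, not part of the Condor distribution: the function name `check_job_status` and the 180-second default threshold are hypothetical, and the file layout (two epoch timestamps, zero or more job lines, then `GRIDMONEOF`) is inferred only from the sample quoted in this ticket. It also assumes a `date` that supports `+%s`, as GNU and BSD `date` do.

```shell
#!/bin/sh
# Hypothetical helper: sanity-check a grid monitor status file.
# Prints one of: MISSING, TRUNCATED, BAD_TIMESTAMP, STALE, or "OK <job count>".
check_job_status() {
    file=$1
    max_age=${2:-180}   # seconds; assumed threshold, about three update cycles

    [ -r "$file" ] || { echo "MISSING"; return 1; }

    # The last line must be the end marker, otherwise the copy is incomplete.
    last=$(tail -n 1 "$file")
    [ "$last" = "GRIDMONEOF" ] || { echo "TRUNCATED"; return 1; }

    # The first line is taken to be the epoch time of the status scan.
    stamp=$(head -n 1 "$file")
    case $stamp in
        ''|*[!0-9]*) echo "BAD_TIMESTAMP"; return 1 ;;
    esac

    now=$(date +%s)
    if [ $((now - stamp)) -gt "$max_age" ]; then
        echo "STALE"
        return 1
    fi

    # Job lines are assumed to start with an https:// job contact.
    jobs=$(grep -c '^https://' "$file" || true)
    echo "OK $jobs"
}

# Example use (not executed here):
#   check_job_status /tmp/job_status
```

A `STALE` or `TRUNCATED` result on the submit side would be consistent with the gridmanager deciding that the grid monitor is not reporting and falling back to individual jobmanagers.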
| # | Fri Aug 24 14:21:37 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
Open Science Grid ticket 4004 was updated on 08/24/2007 at 19:14:11. [Duplicate message snipped] |
||||||||||||
| # | Tue Aug 28 10:46:46 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
Open Science Grid ticket 4004 was updated on 08/24/2007 at 19:17:10. [Duplicate message snipped] |
||||||||||||
| # | Tue Aug 28 13:21:23 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
Open Science Grid ticket 4004 was updated on 08/28/2007 at 18:02:08 with the following information:

Hi Jaime,

thanks for the info, sorry about the delay.

This seems to work fine. I ran this from ouhep5, my desktop, against osgitb1, the ITB gatekeeper in question, and got the following response:

-----
[hs@ouhep5 hs]$ globusrun -s -r osgitb1.nhn.ou.edu/jobmanager-fork '&(executable=$(GLOBUSRUN_GASS_URL)/usr/local/condor/sbin/grid_monitor.sh)(arguments="--dest-url="#$(GLOBUSRUN_GASS_URL)#"/tmp/job_status")'
/usr/local/opt/osg-0.7.0/apache/lib:/usr/local/opt/osg-0.7.0/MonaLisa/Service/VDTFarm/pgsql/lib:/usr/local/opt/osg-0.7.0/glite/lib:/usr/local/opt/osg-0.7.0/prima/lib:/usr/local/opt/osg-0.7.0/jdk1.5/jre/lib/i386:/usr/local/opt/osg-0.7.0/jdk1.5/jre/lib/i386/server:/usr/local/opt/osg-0.7.0/jdk1.5/jre/lib/i386/client:/usr/local/opt/osg-0.7.0/mysql/lib/mysql:/usr/local/opt/osg-0.7.0/globus/lib:/usr/local/opt/osg-0.7.0/berkeley-db/lib:/usr/local/opt/osg-0.7.0/expat/lib:/usr/local/opt/osg-0.7.0/apache/lib:/usr/local/opt/osg-0.7.0/MonaLisa/Service/VDTFarm/pgsql/lib:/usr/local/opt/osg-0.7.0/glite/lib:/usr/local/opt/osg-0.7.0/prima/lib:/usr/local/opt/osg-0.7.0/jdk1.5/jre/lib/i386:/usr/local/opt/osg-0.7.0/jdk1.5/jre/lib/i386/server:/usr/local/opt/osg-0.7.0/jdk1.5/jre/lib/i386/client:/usr/local/opt/osg-0.7.0/mysql/lib/mysql:/usr/local/opt/osg-0.7.0/berkeley-db/lib:/usr/local/opt/osg-0.7.0/expat/lib:
2007-08-28 12:32:29 OK:
2007-08-28 12:32:29 INFO: /usr/local/opt/osg-0.7.0/globus/tmp/gram_job_state/grid_manager_monitor_agent_log.354 missing
Unquoted string "break" may clash with future reserved word at /usr/local/opt/osg-0.7.0/globus/lib/perl/Globus/GRAM/JobManager/condor.pm line 55.
Useless use of a constant in void context at /usr/local/opt/osg-0.7.0/globus/lib/perl/Globus/GRAM/JobManager/condor.pm line 55.
// should probably be written as "" at /usr/local/opt/osg-0.7.0/globus/lib/perl/Globus/GRAM/JobManager/condor.pm line 393.
2007-08-28 12:32:29 INFO: Starting grid_manager_monitor_agent
2007-08-28 12:32:29 INFO: Started grid_manager_monitor_agent as /tmp/grid_manager_monitor_agent.usatlas1.24371.1000, pid 24399
2007-08-28 12:33:29 OK:
2007-08-28 12:34:29 OK:
2007-08-28 12:35:29 OK:
...
-----

And /tmp/job_status does look like you said:

----
1188323129
1188323129
GRIDMONEOF
----

And on osgitb1, I see this in /tmp:

----
[hs@osgitb1 ~]$ ls -ao /tmp/condor-lock.osgitb10.309009174424322/
total 12
drwxr-xr-x  2 condor 4096 Aug 28 10:56 .
drwxrwxrwt 12 root   4096 Aug 28 12:51 ..
-rw-------  1 condor    0 Aug 23 14:25 InstanceLock
prw-------  1 condor    0 Aug 28 10:56 procd_pipe.SCHEDD
prw-------  1 condor    0 Aug 23 15:06 procd_pipe.SCHEDD.watchdog
----

So I didn't get the "Forced agent start" line, but otherwise it looks okay, so it doesn't look like there's a problem on this end, right?

I just tried the same with osgitb1 as the client (so, from osgitb1 to osgitb1), and I get the same result. So both osg-0.6.0 on SLF3 (ouhep5) and osg-0.7.0 on RHEL5 (osgitb1) can start a grid_manager_monitor just fine.

What else can we try to debug this? Can you run a set of GridEx jobs by hand, and see what that does? What OS version and osg version is the normal GridEx running on?

Thanks a lot,
Horst |
||||||||||||
| # | Tue Aug 28 14:36:23 2007 | jfrey@cs.wisc.edu - Correspondence added | [Reply] | |||||||||
On Aug 28, 2007, at 1:21 PM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] The job status file that the grid monitor sends back to the client machine (/tmp/job_status when you run it from the command line) should contain a line for each job currently submitted to gram under the same unix uid on the gatekeeper. Your job status file has none. Can you try submitting a long sleep job to the gatekeeper via Condor-G before running the grid monitor from the command line? Then we know that at least one job should show up in the file the grid monitor sends back. If the file still has no jobs, then we know something's wrong. +--------------------------------+-----------------------------------+ | Jaime Frey | I used to be a heavy gambler. | +--------------------------------+-----------------------------------+| jfrey@cs.wisc.edu | But now I just make mental bets. | | http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind. | |
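The check Jaime describes (does the status file list at least one job?) can be sketched in a few lines of Python. This is only an illustration, assuming the layout seen in the /tmp/job_status samples in this thread: a first line with two numeric timestamps, then one line per job holding the job contact URL and a numeric state, then a closing GRIDMONEOF line. The function name `parse_job_status` is hypothetical and is not part of Condor or Globus.

```python
def parse_job_status(text):
    """Parse grid-monitor status text into (header_timestamps, jobs).

    Assumed layout (inferred from the samples in this ticket, not from
    any specification): header line, one "<contact> <state>" line per
    job, and a final GRIDMONEOF marker.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[-1] != "GRIDMONEOF":
        raise ValueError("status file is incomplete: no GRIDMONEOF marker")
    header = tuple(int(field) for field in lines[0].split())
    jobs = {}
    for ln in lines[1:-1]:
        # The numeric state is the last whitespace-separated field.
        contact, state = ln.rsplit(None, 1)
        jobs[contact] = int(state)
    return header, jobs


if __name__ == "__main__":
    with open("/tmp/job_status") as fh:
        header, jobs = parse_job_status(fh.read())
    print("jobs reported:", len(jobs))
```

An empty job dictionary corresponds to the situation described above: the monitor ran, but reported no jobs for that uid.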
||||||||||||
| # | Tue Aug 28 14:41:21 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 08/28/2007 at 19:38:08 with the following information: FootPrints Ticket Description: On Aug 28, 2007, at 1:21 PM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Tue Aug 28 14:46:21 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 08/28/2007 at 19:44:11 with the following information: FootPrints Ticket Description: Jaime/VDT, It looks like you wrote something, but the note got completely snipped, most likely because you wrote your response below the line that says: "When replying, type your text above this line.". I found your latest note on the VDT ticket and copied and pasted it below. But please write your notes *above* that line in future correspondence that involves the GOC ticket system. [I know it's a wee bit annoying but I have no control over that behavior]. Thanks! Arvind --------- Jaime's note: The job status file that the grid monitor sends back to the client machine (/tmp/job_status when you run it from the command line) should contain a line for each job currently submitted to gram under the same unix uid on the gatekeeper. Your job status file has none. Can you try submitting a long sleep job to the gatekeeper via Condor-G before running the grid monitor from the command line? Then we know that at least one job should show up in the file the grid monitor sends back. If the file still has no jobs, then we know something's wrong. +--------------------------------+-----------------------------------+ | Jaime Frey | I used to be a heavy gambler. | +--------------------------------+-----------------------------------+| jfrey@cs.wisc.edu | But now I just make mental bets. | | http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind. | Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Tue Aug 28 15:41:53 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 08/28/2007 at 20:26:08 with the following information: FootPrints Ticket Description: Hi Jaime, okay, I have some more info. After I started a sleep job from ouhep5 on osgitb1 via Condor-G, I ran the monitor again, and this time it crashed: ----- [hs@ouhep5 hs]$ globusrun -s -r osgitb1.nhn.ou.edu/jobmanager-fork '&(executable=$(GLOBUSRUN_GASS_URL)/usr/local/condor/sbin/grid_monitor.sh)(arguments="--dest-url="#$(GLOBUSRUN_GASS_URL)#"/tmp/job_status")' /usr/local/opt/osg-0.7.0/apache/lib:/usr/local/opt/osg-0.7.0/MonaLisa/Service/VDTFarm/pgsql/lib:/usr/local/opt/osg-0.7.0/glite/lib:/usr/local/opt/osg-0.7.0/prima/lib:/usr/local/opt/osg-0.7.0/jdk1.5/jre/lib/i386:/usr/local/opt/osg-0.7.0/jdk1.5/jre/lib/i386/server:/usr/local/opt/osg-0.7.0/jdk1.5/jre/lib/i386/client:/usr/local/opt/osg-0.7.0/mysql/lib/mysql:/usr/local/opt/osg-0.7.0/globus/lib:/usr/local/opt/osg-0.7.0/berkeley-db/lib:/usr/local/opt/osg-0.7.0/expat/lib:/usr/local/opt/osg-0.7.0/apache/lib:/usr/local/opt/osg-0.7.0/MonaLisa/Service/VDTFarm/pgsql/lib:/usr/local/opt/osg-0.7.0/glite/lib:/usr/local/opt/osg-0.7.0/prima/lib:/usr/local/opt/osg-0.7.0/jdk1.5/jre/lib/i386:/usr/local/opt/osg-0.7.0/jdk1.5/jre/lib/i386/server:/usr/local/opt/osg-0.7.0/jdk1.5/jre/lib/i386/client:/usr/local/opt/osg-0.7.0/mysql/lib/mysql:/usr/local/opt/osg-0.7.0/berkeley-db/lib:/usr/local/opt/osg-0.7.0/expat/lib: 2007-08-28 15:09:38 OK: 2007-08-28 15:09:38 INFO: Forced agent start 2007-08-28 15:09:38 INFO: Starting grid_manager_monitor_agent Unquoted string "break" may clash with future reserved word at /usr/local/opt/osg-0.7.0/globus/lib/perl/Globus/GRAM/JobManager/condor.pm line 55. Useless use of a constant in void context at /usr/local/opt/osg-0.7.0/globus/lib/perl/Globus/GRAM/JobManager/condor.pm line 55. // should probably be written as "" at /usr/local/opt/osg-0.7.0/globus/lib/perl/Globus/GRAM/JobManager/condor.pm line 393. Can't locate object method "new" via package "Globus::GRAM::JobManager::condor" at /usr/local/opt/osg-0.7.0/globus/lib/perl/Globus/GRAM/JobManager/condor.pm line 29. 2007-08-28 15:09:38 INFO: Started grid_manager_monitor_agent as /tmp/grid_manager_monitor_agent.usatlas1.5908.1000, pid 5930 2007-08-28 15:09:39 ERROR: 8: grid_manager_monitor_agent (pid 5930) exited with a 255 result (65280). ----- Then I started it again, and then it ran for a while, and produced some output: ----- [hs@ouhep5 hs]$ cat /tmp/job_status 1188332060 1188332060 https://osgitb1.nhn.ou.edu:63015/7990/1188332042/ 32 GRIDMONEOF ----- But then it crashed again with the same error. And when I submitted the monitor from osgitb1, it additionally gave me this error: ----- Can't locate object method "new" via package "Globus::GRAM::JobManager::condor" at /usr/local/opt/osg-0.7.0/globus/lib/perl/Globus/GRAM/JobManager/condor.pm line 29. ----- But the /tmp/job_status on osgitb1 also looked the same as on ouhep5, so it picked up the job, too. Does that tell you anything? Thanks a lot, Horst Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
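The two numbers in "exited with a 255 result (65280)" agree with each other if 65280 is a raw Unix wait status, which keeps the exit code in its high byte: 65280 = 255 × 256. An exit code of 255 is also what Perl commonly uses when a script dies on an uncaught error, which would fit the "Can't locate object method" message. A minimal sketch of the decoding, assuming the value in parentheses really is a POSIX-style wait status (the helper name is made up for illustration):

```python
def decode_wait_status(status):
    """Split a POSIX-style wait status into (exit_code, signal_number).

    The exit code sits in bits 8-15; the low 7 bits hold the number of
    the signal that killed the process, or 0 if it exited normally.
    """
    exit_code = (status >> 8) & 0xFF
    signal_number = status & 0x7F
    return exit_code, signal_number


if __name__ == "__main__":
    code, sig = decode_wait_status(65280)
    print("exit code:", code, "signal:", sig)
```

A signal number of 0 here means the agent exited by itself instead of being killed.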
||||||||||||
| # | Tue Aug 28 16:31:46 2007 | jfrey@cs.wisc.edu - Correspondence added | [Reply] | |||||||||
If the grid monitor is regularly crashing with errors like these, that would explain the behavior that was reported. When the grid monitor fails, the Condor gridmanager will restart up to 10 jobmanagers, which will increase the load on the CE. The error to investigate is this one: Can't locate object method "new" via package "Globus::GRAM::JobManager::condor" at /usr/local/opt/osg-0.7.0/globus/lib/perl/Globus/GRAM/JobManager/condor.pm line 29. This may be related to a problem I saw earlier this month. The grid monitor was failing at LTU because it was using the system-installed perl and the standard perl library path, which was missing a critical module. Globus was using an OSG-installed perl with its own library path, which had the module. -- Jaime On Aug 28, 2007, at 3:41 PM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] > -- > View ticket at <http://vdt.cs.wisc.edu/rt/Ticket/Display.html?user=guest&pass=guest&id=2922> > VDT Support, vdt-support@ivdgl.org +--------------------------------+-----------------------------------+ | Jaime Frey | I used to be a heavy gambler. | +--------------------------------+-----------------------------------+| jfrey@cs.wisc.edu | But now I just make mental bets. | | http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind. | |
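One way to test the "wrong perl, wrong library path" theory is to ask each perl interpreter on the gatekeeper whether it can load the module named in the error. The sketch below is an assumption-laden illustration, not a VDT or Condor tool: it relies only on perl's standard `-I`, `-M` and `-e` switches. The library directory is taken from the error message in this ticket, `/usr/bin/perl` is assumed to be the system perl, and the location of the OSG-installed perl is not given anywhere in the thread, so it has to be supplied by whoever runs this.

```python
import subprocess


def module_check_command(perl, module, lib_dirs=()):
    """Build a command that exits 0 only if `perl` can load `module`."""
    cmd = [perl]
    for directory in lib_dirs:
        cmd.append("-I" + directory)  # extra library search path
    cmd += ["-M" + module, "-e", "1"]  # load the module, then do nothing
    return cmd


def can_load(perl, module, lib_dirs=()):
    """Return (ok, stderr_text) for loading `module` with the given perl."""
    try:
        result = subprocess.run(
            module_check_command(perl, module, lib_dirs),
            capture_output=True,
            text=True,
        )
    except OSError as exc:  # e.g. the perl binary does not exist
        return False, str(exc)
    return result.returncode == 0, result.stderr.strip()


if __name__ == "__main__":
    globus_lib = "/usr/local/opt/osg-0.7.0/globus/lib/perl"  # from the error text
    module = "Globus::GRAM::JobManager::condor"
    # Assumption: /usr/bin/perl is the system-installed perl.
    for lib_dirs in ((), (globus_lib,)):
        ok, err = can_load("/usr/bin/perl", module, lib_dirs)
        print("lib dirs:", lib_dirs, "loads:", ok, err)
```

Running the same check with the OSG-installed perl and comparing the results would show whether the two interpreters see different module sets.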
||||||||||||
| # | Tue Aug 28 16:41:40 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 08/28/2007 at 21:32:14 with the following information: FootPrints Ticket Description: If the grid monitor is regularly crashing with errors like these, that would explain the behavior that was reported. When the grid monitor fails, the Condor gridmanager will restart up to 10 jobmanagers, which will increase the load on the CE. The error to investigate is this one: Can't locate object method "new" via package "Globus::GRAM::JobManager::condor" at /usr/local/opt/osg-0.7.0/globus/lib/perl/Globus/GRAM/JobManager/condor.pm line 29. This may be related to a problem I saw earlier this month. The grid monitor was failing at LTU because it was using the system-installed perl and the standard perl library path, which was missing a critical module. Globus was using an OSG-installed perl with its own library path, which had the module. -- Jaime On Aug 28, 2007, at 3:41 PM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Tue Aug 28 16:56:25 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 08/28/2007 at 21:50:09 with the following information: FootPrints Ticket Description: Hi Jaime, > The error to investigate is this one: > Can't locate object method "new" via package "Globus::GRAM::JobManager::condor" > at /usr/local/opt/osg-0.7.0/globus/lib/perl/Globus/GRAM/JobManager/condor.pm line 29. so how do we investigate this error? Anything we can do to help? This error only happened when I submitted the globusrun grid manager from osgitb1. Do you think the error when submitting from ouhep5 was the same, but wasn't printed because of the older osg version (0.6 vs. 0.7)? Thanks a lot, Horst Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Tue Aug 28 19:01:26 2007 | jfrey@cs.wisc.edu - Correspondence added | [Reply] | |||||||||
Your previous email said the error was printed in both cases (grid monitor job submitted from osgitb1 and ouhep5). -- Jaime On Aug 28, 2007, at 4:56 PM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] +--------------------------------+-----------------------------------+ | Jaime Frey | I used to be a heavy gambler. | +--------------------------------+-----------------------------------+| jfrey@cs.wisc.edu | But now I just make mental bets. | | http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind. | |
||||||||||||
| # | Tue Aug 28 19:11:54 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 08/29/2007 at 00:02:07 with the following information: FootPrints Ticket Description: Your previous email said the error was printed in both cases (grid monitor job submitted from osgitb1 and ouhep5). -- Jaime On Aug 28, 2007, at 4:56 PM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Tue Aug 28 19:51:25 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 08/29/2007 at 00:47:07 with the following information: FootPrints Ticket Description: Hi Jaime, > Your previous email said the error was printed in both cases (grid > monitor job submitted from osgitb1 and ouhep5). no it didn't. =) This is what it said: ========== After I started a sleep job from ouhep5 on osgitb1 via Condor-G, I ran the monitor again, and this time it crashed: ----- [hs@ouhep5 hs]$ globusrun -s -r osgitb1.nhn.ou.edu/jobmanager-fork ... [...] 2007-08-28 15:09:39 ERROR: 8: grid_manager_monitor_agent (pid 5930) exited with a 255 result (65280). ----- Then I started it again, and then it ran for a while, and produced some output: ----- [hs@ouhep5 hs]$ cat /tmp/job_status 1188332060 1188332060 https://osgitb1.nhn.ou.edu:63015/7990/1188332042/ 32 GRIDMONEOF ----- But then it crashed again with the same error. And when I submitted the monitor from osgitb1, it additionally gave me this error: ----- Can't locate object method "new" via package "Globus::GRAM::JobManager::condor" at /usr/local/opt/osg-0.7.0/globus/lib/perl/Globus/GRAM/JobManager/condor.pm line 29. ----- ========== Note the 'additionally'. :) So the "exited with a 255 result (65280)" error happened when I submitted the grid manager from both machines, but the "Can't locate object method" error only happened when I submitted the grid manager from osgitb1, i.e., the new osg-0.7.0 version. Unfortunately I don't completely understand the command I ran, so I'm not sure which parts of it were run on the client vs. the gatekeeper, so I'm not sure how else to debug this. Any hints greatly appreciated. :^) Thanks a lot, Horst Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Tue Aug 28 20:41:15 2007 | jfrey@cs.wisc.edu - Correspondence added | [Reply] | |||||||||
Look two lines above the 'ERROR: 8: grid_manager_monitor_agent (pid 5930) exited with a 255 result (65280).' in your previous email. You'll see the same 'additional' error message. So it appears to happen if you submit the grid monitor job from either machine (ouhep5 or osgitb1). -- Jaime On Aug 28, 2007, at 7:51 PM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] +--------------------------------+-----------------------------------+ | Jaime Frey | I used to be a heavy gambler. | +--------------------------------+-----------------------------------+| jfrey@cs.wisc.edu | But now I just make mental bets. | | http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind. | |
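The "look two lines above the ERROR" step is easy to miss by eye, because the Perl message carries no timestamp and can end up a few lines away from the ERROR line it belongs to. A small sketch of automating that scan, assuming the monitor's own lines always start with a "YYYY-MM-DD HH:MM:SS OK:/INFO:/ERROR:" prefix as they do in the output pasted in this ticket (the function name and the window size of three lines are arbitrary choices for illustration):

```python
import re

# Lines written by the grid monitor itself, as seen in this ticket.
STAMPED = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} (OK|INFO|ERROR):")


def context_before_errors(lines, window=3):
    """For each ERROR line, collect unstamped lines just before it.

    Unstamped lines are usually stderr from a child process (here, the
    Perl agent), which is where the real cause tends to be.
    """
    findings = []
    for index, line in enumerate(lines):
        match = STAMPED.match(line)
        if match and match.group(1) == "ERROR":
            start = max(0, index - window)
            stray = [ln for ln in lines[start:index] if not STAMPED.match(ln)]
            findings.append((line, stray))
    return findings


if __name__ == "__main__":
    import sys

    for error_line, stray in context_before_errors(sys.stdin.read().splitlines()):
        print(error_line)
        for ln in stray:
            print("    possibly related:", ln)
```

Fed the globusrun output from either client, this would pair the "exited with a 255 result" line with the "Can't locate object method" message that precedes it.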
||||||||||||
| # | Tue Aug 28 20:56:21 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 08/29/2007 at 01:44:07 with the following information: FootPrints Ticket Description: Look two lines above the 'ERROR: 8: grid_manager_monitor_agent (pid 5930) exited with a 255 result (65280).' in your previous email. You'll see the same 'additional' error message. So it appears to happen if you submit the grid monitor job from either machine (ouhep5 or osgitb1). -- Jaime On Aug 28, 2007, at 7:51 PM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Tue Aug 28 21:21:27 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 08/29/2007 at 01:59:07 with the following information: FootPrints Ticket Description: Hi Jaime, okay, so I'm blind. =) So I guess the two errors were just swapped? Presumably because of intermixing of stdout and stderr? So now that you showed me the light, what else can I do to help track this down and fix it, so that we can run GridEx jobs again? Thanks a lot, Horst Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Tue Aug 28 22:46:15 2007 | jfrey@cs.wisc.edu - Correspondence added | [Reply] | |||||||||
Would it be possible for me to get a login on the CE? I think that'd be the most efficient way to proceed. -- Jaime On Aug 28, 2007, at 9:21 PM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] +--------------------------------+-----------------------------------+ | Jaime Frey | I used to be a heavy gambler. | +--------------------------------+-----------------------------------+| jfrey@cs.wisc.edu | But now I just make mental bets. | | http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind. | |
||||||||||||
| # | Tue Aug 28 23:06:55 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 08/29/2007 at 03:47:08 with the following information: FootPrints Ticket Description: Would it be possible for me to get a login on the CE? I think that'd be the most efficient way to proceed. -- Jaime On Aug 28, 2007, at 9:21 PM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Wed Aug 29 11:21:51 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 08/29/2007 at 15:37:20 with the following information: FootPrints Ticket Description: Quick update: The latest issue Horst has reported about unexpected interaction between his existing Condor and RSV's condor-devel has been taken offline, on a separate thread. -Arvind Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Wed Aug 29 11:22:09 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 08/29/2007 at 15:23:08 with the following information: FootPrints Ticket Description:

Hi Jaime,

> Would it be possible for me to get a login on the CE? I think that'd
> be the most efficient way to proceed.

sure, if you give me your rsa public key, I can add that to the usatlas1 user on osgitb1.nhn.ou.edu, and then you can play around locally.

By the way, I think there's still something fishy with Condor-Devel/OSG-RSV setup on osg-0.7.0, even after the last update. Occasionally, just out of the blue, our regular Condor installation (6.8.4) loses track of the real jobs with 'condor_q -g', and picks up the RSV jobs instead.

These next two commands were issued back to back, with no delay in between, nor anything else:

=====
[hs@osgitb1 ~]$ condor_q -g

-- Schedd: osgitb1.nhn.ou.edu : <129.15.31.41:63325>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
3748.0   usatlas1        8/29 09:38   0+00:00:00 I  0   9.8  env

1 jobs; 1 idle, 0 running, 0 held
[hs@osgitb1 ~]$ condor_q -g

-- Schedd: osgitb1.nhn.ou.edu : <129.15.31.41:57699>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  56.0   mis             8/28 18:24   0+08:03:03 R  0   9.8  probe_wrapper.pl /
  57.0   mis             8/28 18:24   0+00:23:43 I  0   9.8  probe_wrapper.pl /
  58.0   mis             8/28 18:24   0+00:23:46 I  0   9.8  probe_wrapper.pl /
  59.0   mis             8/28 18:24   0+00:23:47 I  0   9.8  probe_wrapper.pl /
  60.0   mis             8/28 18:24   0+00:23:48 I  0   9.8  probe_wrapper.pl /
  61.0   mis             8/28 18:24   0+00:23:49 I  0   9.8  probe_wrapper.pl /
  62.0   mis             8/28 18:24   0+00:23:49 I  0   9.8  probe_wrapper.pl /
  63.0   mis             8/28 18:24   0+00:23:45 I  0   9.8  probe_wrapper.pl /
  64.0   mis             8/28 18:24   0+12:04:29 R  0   9.8  probe_wrapper.pl /
  65.0   mis             8/28 18:24   0+00:23:45 I  0   9.8  probe_wrapper.pl /
  66.0   mis             8/28 18:24   0+00:23:47 I  0   9.8  probe_wrapper.pl /
  67.0   mis             8/28 18:24   0+00:23:48 I  0   9.8  probe_wrapper.pl /
  68.0   mis             8/28 18:24   0+00:23:51 I  0   9.8  probe_wrapper.pl /
  69.0   mis             8/28 18:24   0+00:39:45 I  0   9.8  gratia-script-cons
  70.0   mis             8/28 18:24   0+00:40:15 I  0   9.8  sample-consumer /u

15 jobs; 13 idle, 2 running, 0 held
=====

I didn't change CONDOR_LOCATION or CONDOR_CONFIG or anything, so it should've never picked up the RSV jobs like this. It just suddenly started talking to the condor-devel daemon?

Then later I tried it again, and it still picked up the wrong ones:

=====
[hs@osgitb1 ~]$ condor_q -g

-- Schedd: osgitb1.nhn.ou.edu : <129.15.31.41:57699>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  56.0   mis             8/28 18:24   0+08:14:43 R  0   9.8  probe_wrapper.pl /
  57.0   mis             8/28 18:24   0+00:23:43 I  0   9.8  probe_wrapper.pl /
  58.0   mis             8/28 18:24   0+00:23:46 I  0   9.8  probe_wrapper.pl /
  59.0   mis             8/28 18:24   0+00:23:47 I  0   9.8  probe_wrapper.pl /
  60.0   mis             8/28 18:24   0+00:23:48 I  0   9.8  probe_wrapper.pl /
  61.0   mis             8/28 18:24   0+00:23:49 I  0   9.8  probe_wrapper.pl /
  62.0   mis             8/28 18:24   0+00:23:49 I  0   9.8  probe_wrapper.pl /
  63.0   mis             8/28 18:24   0+00:23:45 I  0   9.8  probe_wrapper.pl /
  64.0   mis             8/28 18:24   0+12:19:03 R  0   9.8  probe_wrapper.pl /
  65.0   mis             8/28 18:24   0+00:23:45 I  0   9.8  probe_wrapper.pl /
  66.0   mis             8/28 18:24   0+00:23:47 I  0   9.8  probe_wrapper.pl /
  67.0   mis             8/28 18:24   0+00:23:48 I  0   9.8  probe_wrapper.pl /
  68.0   mis             8/28 18:24   0+00:23:51 I  0   9.8  probe_wrapper.pl /
  69.0   mis             8/28 18:24   0+00:40:55 I  0   9.8  gratia-script-cons
  70.0   mis             8/28 18:24   0+00:41:32 R  0   9.8  sample-consumer /u

15 jobs; 12 idle, 3 running, 0 held
=====

But then I submitted another test job to osgitb1/jobmanager-condor, and then all of a sudden it picked up the right one again:

=====
[hs@osgitb1 ~]$ condor_q -g

-- Schedd: osgitb1.nhn.ou.edu : <129.15.31.41:63325>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
3749.0   usatlas1        8/29 09:57   0+00:00:00 I  0   9.8  env

1 jobs; 1 idle, 0 running, 0 held
=====

How can that be?

I'm not sure if this has anything to do with our grid manager problem, but it certainly shouldn't happen, right?

How do I know which condor daemon condor_q or condor_status talk to? I have no *CONDOR* env vars set, and it picks up the binaries in /usr/local/bin/, which are soft links to /usr/local/condor/bin/, which are the regular 6.8.4 installation, so there should be no way to suddenly talk to the condor-devel daemons, right?

Thanks a lot,

Horst

Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
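The question of which schedd answered each `condor_q` call can be checked from the banner line alone, since the two installations report different ports in the output above (63325 and 57699). The following is a minimal Python sketch, not part of the ticket: the function names are hypothetical, and it assumes the banner keeps the `-- Schedd: host : <ip:port>` or `-- Submitter: host : <ip:port>` shape seen in this thread. A schedd's port may change when the daemon restarts, so the expected port has to be taken from the installation that is currently running.

```python
import re

# Banner printed by condor_q before each job table, as seen in this thread:
#   -- Schedd: osgitb1.nhn.ou.edu : <129.15.31.41:63325>
#   -- Submitter: osgitb1.nhn.ou.edu : <129.15.31.41:63325> : osgitb1.nhn.ou.edu
BANNER = re.compile(r"--\s*(?:Schedd|Submitter):\s*(\S+)\s*:\s*<([\d.]+):(\d+)>")


def schedd_endpoints(condor_q_output):
    """Return a (name, ip, port) tuple for every schedd banner in the text."""
    return [(m.group(1), m.group(2), int(m.group(3)))
            for m in BANNER.finditer(condor_q_output)]


def answered_by_expected(condor_q_output, expected_port):
    """True only if at least one banner was found and all use expected_port."""
    endpoints = schedd_endpoints(condor_q_output)
    return bool(endpoints) and all(port == expected_port
                                   for _name, _ip, port in endpoints)
```

Feeding it the captured text of each `condor_q -g` run, with the port of the regular 6.8.4 schedd as `expected_port`, would flag the runs that were answered by the other daemon.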
| # | Wed Aug 29 11:31:51 2007 | jfrey@cs.wisc.edu - Correspondence added | [Reply] | |||||||||
Here is my public key: ssh-dss AAAAB3NzaC1kc3MAAAEBAKFQpOeVNX/16RfGAALXQ+pwosdlcMzBUtY0Dn6 +YgVgJXq/9mfCdGXuj5OzK2wjO5l1O71drcOjtYu9CvD0rrtzKp5B5xWZU/ pd4f2d11waSIgj0trEGHAMG+VQ69wjBwjR81YPOkI2HcEqbEGGWFS69iIu3yt/X/ 09wxwdOFpEmUKnjxCLD2PS/VlXydgLjdXq6+nUUz/ RFHv2Jbtbff8nSGW6SFdP424YwFazClMYhG8kKAtfSm0uL6bhzFs1ysOhRqHIYmu7w595brI BHsqdeZXwPlwOc4roLH9W99q7Zzozt9v+OTwNs9RWBa5/qnzZOA1mqms5GQySoqM/ +HsAAAAVAMhR5pJ+m9v/ O7RYqbbe0v2fqS2BAAABABIspAFuOztfIXFh6o2C0vwbVNo10rbTC7bcvzAHu5C/ SoemSqfiKSG9UdTWqM6u8Hw8k1StVK1GGcoh +wfUksT1r6PCykTC6uO5FqUIYWEVT8ILf0e/ +DjcuVSUw4jpGhs3hu28onqdKlZHqrnOc4q7ZjZ8+j8aGXnm/ xosrtWz7vhJV15TtKLdpc3hDcaBgdK95JYmBPDhrLRKExRHoOh0Emg07wzfxpr/ ECzXFiKf6DgO7LkeswgknXTrPahRbN2GUNmJKDWq8jVhvRASNendHaNmwjGcZnxBvmpuzuDG /YHcz6BCCqGZlWQakk3NiDnGX3je0mdWkeM0tzK1EYIAAAEAZUKgsTxjr +hrDwiPPQ5NzTO+3/ IgQYlQs3a6x2GsteSI8PDHbT6TKUPXu0wlgIdDhRzezJOIOPStU8geewAdFIzh1aI7E96L3R bvH+xNWk6Q7kGhcCWcqk7IDjL59YLn/JIQnq/5FQGgWgeUzP83jnhIJ/ SqwAnPPWBu0fLZ1UIOXcHDvAQcqondSKB9bEkz0tM44Be/ q3R8KkyZi1DOX4TBXtodoCFenLQGkaA/ NIJbyCjajzYhjuaMC40CHf4W1pagPhzxtT0uDWiRcUrG43EftAVFy27mnq7IrrbXYU +uGfBlPSOAWL0XQlyNE93aAxj8lgnHa3v0e5qyDjZvTw== jfrey@nostos.cs.wisc.edu Could you also add my X509 DN to the grid-mapfile for user usatlas1: /DC=org/DC=doegrids/OU=People/CN=James Frey 259919 -- Jaime On Aug 29, 2007, at 11:22 AM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > When replying, type your text above this line. > ---------------------------------------------- > This message is to let you know that Open Science Grid ticket 4004 > "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) > still not working" which is assigned to you, was updated on > 08/29/2007 at 15:23:08 with the following information: > > FootPrints Ticket Description: > Hi Jaime, > >> Would it be possible for me to get a login on the CE? I think that'd >>> be the most efficient way to proceed. 
> [Remainder of duplicate message snipped] |
||||||||||||
| # | Wed Aug 29 11:35:16 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 08/29/2007 at 16:32:08 with the following information: FootPrints Ticket Description: Here is my public key: ssh-dss AAAAB3NzaC1kc3MAAAEBAKFQpOeVNX/16RfGAALXQ+pwosdlcMzBUtY0Dn6 +YgVgJXq/9mfCdGXuj5OzK2wjO5l1O71drcOjtYu9CvD0rrtzKp5B5xWZU/ pd4f2d11waSIgj0trEGHAMG+VQ69wjBwjR81YPOkI2HcEqbEGGWFS69iIu3yt/X/ 09wxwdOFpEmUKnjxCLD2PS/VlXydgLjdXq6+nUUz/ RFHv2Jbtbff8nSGW6SFdP424YwFazClMYhG8kKAtfSm0uL6bhzFs1ysOhRqHIYmu7w595brI BHsqdeZXwPlwOc4roLH9W99q7Zzozt9v+OTwNs9RWBa5/qnzZOA1mqms5GQySoqM/ +HsAAAAVAMhR5pJ+m9v/ O7RYqbbe0v2fqS2BAAABABIspAFuOztfIXFh6o2C0vwbVNo10rbTC7bcvzAHu5C/ SoemSqfiKSG9UdTWqM6u8Hw8k1StVK1GGcoh +wfUksT1r6PCykTC6uO5FqUIYWEVT8ILf0e/ +DjcuVSUw4jpGhs3hu28onqdKlZHqrnOc4q7ZjZ8+j8aGXnm/ xosrtWz7vhJV15TtKLdpc3hDcaBgdK95JYmBPDhrLRKExRHoOh0Emg07wzfxpr/ ECzXFiKf6DgO7LkeswgknXTrPahRbN2GUNmJKDWq8jVhvRASNendHaNmwjGcZnxBvmpuzuDG /YHcz6BCCqGZlWQakk3NiDnGX3je0mdWkeM0tzK1EYIAAAEAZUKgsTxjr +hrDwiPPQ5NzTO+3/ IgQYlQs3a6x2GsteSI8PDHbT6TKUPXu0wlgIdDhRzezJOIOPStU8geewAdFIzh1aI7E96L3R bvH+xNWk6Q7kGhcCWcqk7IDjL59YLn/JIQnq/5FQGgWgeUzP83jnhIJ/ SqwAnPPWBu0fLZ1UIOXcHDvAQcqondSKB9bEkz0tM44Be/ q3R8KkyZi1DOX4TBXtodoCFenLQGkaA/ NIJbyCjajzYhjuaMC40CHf4W1pagPhzxtT0uDWiRcUrG43EftAVFy27mnq7IrrbXYU +uGfBlPSOAWL0XQlyNE93aAxj8lgnHa3v0e5qyDjZvTw== jfrey@nostos.cs.wisc.edu Could you also add my X509 DN to the grid-mapfile for user usatlas1: /DC=org/DC=doegrids/OU=People/CN=James Frey 259919 -- Jaime On Aug 29, 2007, at 11:22 AM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank 
You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Fri Aug 31 17:34:22 2007 | jfrey@cs.wisc.edu - Correspondence added | [Reply] | |||||||||
I've figured out part of what's going on, although I don't fully understand it yet. When the Globus job-manager calls out to its perl modules, the VDT sneaks in a ". <vdt>/setup.sh" before perl is launched. This sets a bunch of environment variables. The Grid Monitor calls in the job-manager's perl modules directly, missing the ". <vdt>/setup.sh". Without the environment variables that setup.sh sets, the perl modules fail with this error: Can't locate object method "new" via package "Globus::GRAM::JobManager::condor" at /usr/local/opt/osg-0.7.0/globus/lib/perl/Globus/GRAM/JobManager/condor.pm line 29. Specifically, LD_LIBRARY_PATH and PERL5LIB are the environment variables that determine the success or failure of the job-manager perl modules. Both need to be set appropriately for the code to work correctly. For LD_LIBRARY_PATH, it needs to contain the expat library directory installed by the VDT. If there are no gram job state files, then the failing codepath isn't hit, which is why the Grid Monitor may run for a while without dying. -- Jaime On Aug 29, 2007, at 11:35 AM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] |
||||||||||||
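The dependence on LD_LIBRARY_PATH and PERL5LIB described above can be probed by loading a Perl module under two environments, one with and one without the extra directories. The Python sketch below is an illustration only and is not taken from the VDT: the helper names are hypothetical, and the directories passed in are placeholders that must be replaced with whatever the local `setup.sh` actually sets.

```python
import os
import subprocess


def env_with_paths(base_env, ld_library_dirs=(), perl5lib_dirs=()):
    """Copy base_env, prepending directories to LD_LIBRARY_PATH and PERL5LIB."""
    env = dict(base_env)
    for var, dirs in (("LD_LIBRARY_PATH", ld_library_dirs),
                      ("PERL5LIB", perl5lib_dirs)):
        parts = list(dirs)
        if env.get(var):
            parts.append(env[var])  # keep any existing value after the new dirs
        if parts:
            env[var] = os.pathsep.join(parts)
    return env


def perl_module_loads(module, env):
    """Return True if `perl -M<module> -e 1` exits cleanly under env."""
    try:
        result = subprocess.run(["perl", "-M" + module, "-e", "1"], env=env,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except FileNotFoundError:
        return False  # no perl executable found on PATH
    return result.returncode == 0
```

Calling `perl_module_loads("XML::Parser", dict(os.environ))` and then again with an environment built by `env_with_paths(...)` would show whether the module only loads once the additional library directories are present.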
| # | Fri Aug 31 17:49:28 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 08/31/2007 at 22:35:07 with the following information: FootPrints Ticket Description: I've figured out part of what's going on, although I don't fully understand it yet. When the Globus job-manager calls out to its perl modules, the VDT sneaks in a ". <vdt>/setup.sh" before perl is launched. This sets a bunch of environment variables. The Grid Monitor calls in the job-manager's perl modules directly, missing the ". <vdt>/setup.sh". Without the environment variables that setup.sh sets, the perl modules fail with this error: Can't locate object method "new" via package "Globus::GRAM::JobManager::condor" at /usr/local/opt/osg-0.7.0/globus/lib/perl/Globus/GRAM/JobManager/condor.pm line 29. Specifically, LD_LIBRARY_PATH and PERL5LIB are the environment variables that determine the success or failure of the job-manager perl modules. Both need to be set appropriately for the code to work correctly. For LD_LIBRARY_PATH, it needs to contain the expat library directory installed by the VDT. If there are no gram job state files, then the failing codepath isn't hit, which is why the Grid Monitor may run for a while without dying. -- Jaime On Aug 29, 2007, at 11:35 AM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Sat Sep 01 01:59:39 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 09/01/2007 at 06:56:08 with the following information: FootPrints Ticket Description: Hi Jaime, interesting; well, that's progress. So how can we fix this, so that the Grid Monitor will work properly without dying? And why does this not happen anywhere else? Is it related to the fact that RHEL5 is using a newer perl version, which may not have the required methods? Wait -- when you set up OSG on osgitb1, you still get perl 5.8.8, even though the OSG comes with 5.8.0, right? So it looks like there's some mismatch even in the regular OSG setup, not just the Grid Monitor? I used 'perl -version' and 'perl -V' both before and after the OSG setup, but I don't really understand all the output, so you may want to try that yourself and see what you get. Thanks a lot, good night, and a good long weekend, Horst Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Tue Sep 04 16:05:50 2007 | jfrey@cs.wisc.edu - Correspondence added | [Reply] | |||||||||
Starting with version 4.0.5, Globus requires perl's XML::Parser module. This in turn requires libexpat. The VDT provides both and ensures they are found when the job-manager calls out to its perl modules. The grid monitor doesn't know about the VDT-installed libraries and uses the system libraries. If XML::Parser and libexpat are installed as part of the system, everything works. If they aren't, the grid monitor will fail. Alain Roy and I have developed a patch to $GL/lib/perl/Globus/GRAM/JobManager/fork.pm to point the grid monitor (and any other fork jobmanager jobs) to the VDT-installed libraries. I've attached the patch. Can you try applying it on osgitb1.nhn.ou.edu? Then I can test whether the patch makes the grid monitor work. -- Jaime On Sep 1, 2007, at 1:59 AM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] |
||||||||||||
| # | Tue Sep 04 16:30:48 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 09/04/2007 at 21:08:07 with the following information: FootPrints Ticket Description: Starting with version 4.0.5, Globus requires perl's XML::Parser module. This in turn requires libexpat. The VDT provides both and ensures they are found when the job-manager calls out to its perl modules. The grid monitor doesn't know about the VDT-installed libraries and uses the system libraries. If XML::Parser and libexpat are installed as part of the system, everything works. If they aren't, the grid monitor will fail. Alain Roy and I have developed a patch to $GL/lib/perl/Globus/GRAM/JobManager/fork.pm to point the grid monitor (and any other fork jobmanager jobs) to the VDT-installed libraries. I've attached the patch. Can you try applying it on osgitb1.nhn.ou.edu? Then I can test whether the patch makes the grid monitor work. -- Jaime On Sep 1, 2007, at 1:59 AM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Tue Sep 04 18:27:17 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 09/04/2007 at 23:20:07 with the following information: FootPrints Ticket Description:

Hi Jaime and Alain,

thanks, I just applied the patch, and it seems to work. At least the grid monitor isn't crashing anymore. It keeps happily running as

----
globusrun -s -r osgitb1.nhn.ou.edu/jobmanager-fork '&(executable=$(GLOBUSRUN_GASS_URL)/usr/local/condor/sbin/grid_monitor.sh)(arguments="--dest-url="#$(GLOBUSRUN_GASS_URL)#"/tmp/job_status")'
----

on ouhep5.

But now /tmp/job_status on the condor-g submit host doesn't seem to be picking up the running jobs anymore, it's empty:

----
[hs@ouhep5 hs]$ cat /tmp/job_status
1188947412 1188947412
GRIDMONEOF
----

Whereas I have 7 condor jobs running with

----
[hs@ouhep5 hs]$ globus-job-run osgitb1/jobmanager-condor -np 7 /bin/sleep 150
----

which are running fine:

----
[hs@osgitb1 ~]$ condor_q

-- Submitter: osgitb1.nhn.ou.edu : <129.15.31.41:63325> : osgitb1.nhn.ou.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
4652.0   usatlas1        9/4  18:10   0+00:02:23 R  0   9.8  sleep 150
4652.1   usatlas1        9/4  18:10   0+00:02:21 R  0   9.8  sleep 150
4652.2   usatlas1        9/4  18:10   0+00:02:19 R  0   9.8  sleep 150
4652.3   usatlas1        9/4  18:10   0+00:02:17 R  0   9.8  sleep 150
4652.4   usatlas1        9/4  18:10   0+00:02:15 R  0   9.8  sleep 150
4652.5   usatlas1        9/4  18:10   0+00:02:13 R  0   9.8  sleep 150
4652.6   usatlas1        9/4  18:10   0+00:02:11 R  0   9.8  sleep 150

7 jobs; 0 idle, 7 running, 0 held
----

Is that expected?

Thanks a lot,

Horst

Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Wed Sep 05 15:17:39 2007 | roy - Priority changed from (no value) to '4' | ||
| # | Thu Sep 06 14:33:24 2007 | jfrey@cs.wisc.edu - Correspondence added | [Reply] | |||||||||
The patch we gave you had a bug in it. I've attached a corrected patch. Can you apply this new patch to $GL/lib/perl/Globus/GRAM/JobManager/fork.pm? -- Jaime On Sep 4, 2007, at 6:27 PM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] |
||||||||||||
| # | Thu Sep 06 14:41:36 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 09/06/2007 at 19:35:07 with the following information: FootPrints Ticket Description: The patch we gave you had a bug in it. I've attached a corrected patch. Can you apply this new patch to $GL/lib/perl/Globus/GRAM/JobManager/fork.pm? -- Jaime On Sep 4, 2007, at 6:27 PM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Thu Sep 06 18:21:48 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 09/06/2007 at 22:50:08 with the following information: FootPrints Ticket Description: Hi Jaime, thanks, I just applied your new patch, and this seems to work now, since running the grid monitor by hand after submitting a jobmanager-condor job now shows up in /tmp/job_status: 1189106547 1189106547 https://osgitb1.nhn.ou.edu:63003/23116/1189105354/ 2 GRIDMONEOF So Alan D., could you please re-enable osgitb1 in the GridEx and see if it now behaves better? Thanks a lot, Horst Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Fri Sep 07 17:02:37 2007 | roy - Comments added | [Reply] | |||||||
Commit comment: We patch the fork job manager to set PERL5LIB and LD_LIBRARY_PATH so that the Condor grid monitor (which invokes some scripts directly, without using our wonderful hack below) can find XML/Parser and expat. Changed files: U vdt/branches/vdt-1.8/Globus-Base-Jobmanager-Common/Globus-Base-Jobmanager-Common.pacman To generate a diff: svn diff -c 6605 file:///p/vdt/workspace/svn |
||||||||||
| # | Fri Sep 07 17:09:11 2007 | roy - Correspondence added | [Reply] | |
|
> Hi Jaime, > > thanks, I just applied your new patch, and this seems to work now, > since running the grid monitor by hand after submitting a jobmanager-condor > job now shows up in /tmp/job_status: This is great! Thanks for patiently helping us to debug the problem, Horst. VDT 1.8.1 now applies this patch during installation, so I think the problem is solved. Thanks again, -alain |
||||
| # | Fri Sep 07 17:27:10 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 09/07/2007 at 22:11:07 with the following information: FootPrints Ticket Description: > Hi Jaime, > > thanks, I just applied your new patch, and this seems to work now, > since running the grid monitor by hand after submitting a jobmanager-condor > job now shows up in /tmp/job_status: This is great! Thanks for patiently helping us to debug the problem, Horst. VDT 1.8.1 now applies this patch during installation, so I think the problem is solved. Thanks again, -alain -- View ticket at <http://vdt.cs.wisc.edu/rt/Ticket/Display.html?user=guest&pass=guest&id=2922> VDT Support, vdt-support@ivdgl.org Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Sat Sep 08 01:22:28 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 09/08/2007 at 06:20:08 with the following information: FootPrints Ticket Description: Hi Jaime and Alain, well, unfortunately this didn't solve our GridEx problem. :( GridEx jobs are running again, and still no grid monitor process, but 30 globus-job-manager processes. Jaime, could you have a look at osgitb1 and see if you can figure out why the grid monitor is still not running? Or let me know where else I can look? Thanks a lot, Horst Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Mon Sep 10 10:25:51 2007 | jfrey@cs.wisc.edu - Correspondence added | [Reply] | |||||||||
Strange. Looking at the code in fork.pm again, I don't see how the patch could have worked on your system. It references a non-existent variable. Here's yet another revised patch. Can you apply it on osgitb1.nhn.ou.edu? Meanwhile, I'll talk with Alain to see if there's a difference between your fork.pm and the one we based the patch on. -- Jaime On Sep 8, 2007, at 1:22 AM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > When replying, type your text above this line. > ---------------------------------------------- > This message is to let you know that Open Science Grid ticket 4004 > "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) > still not working" which is assigned to you, was updated on > 09/08/2007 at 06:20:08 with the following information: > > FootPrints Ticket Description: > Hi Jaime and Alain, > > well, unfortunately this didn't solve our GridEx problem. :( > GridEx jobs are running again, and still no grid monitor process, > but 30 globus-job-manager processes. > > Jaime, could you have a look at osgitb1 and see if you can figure out > why the grid monitor is still not running? Or let me know where else > I can look? > > Thanks a lot, > > Horst > > Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT > Status: Support Agency > Originating VO Support Center: DOSAR > Destination VO Support Center: VDT > Originating Ticket Number: > Destination Ticket Number: > > Thank You, > OSG Grid Operations Center > goc@opensciencegrid.org, 317-278-9699 > info: http://www.opensciencegrid.org > rss: http://www.grid.iu.edu/news/ > > > -- > View ticket at <http://vdt.cs.wisc.edu/rt/Ticket/Display.html? > user=guest&pass=guest&id=2922> > VDT Support, vdt-support@ivdgl.org +--------------------------------+-----------------------------------+ | Jaime Frey | I used to be a heavy gambler. | +--------------------------------+-----------------------------------+| jfrey@cs.wisc.edu | But now I just make mental bets. 
| | http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind. | |
||||||||||||
| # | Mon Sep 10 10:30:47 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 09/10/2007 at 15:26:09 with the following information: FootPrints Ticket Description: Strange. Looking at the code in fork.pm again, I don't see how the patch could have worked on your system. It references a non-existent variable. Here's yet another revised patch. Can you apply it on osgitb1.nhn.ou.edu? Meanwhile, I'll talk with Alain to see if there's a difference between your fork.pm and the one we based the patch on. -- Jaime On Sep 8, 2007, at 1:22 AM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Mon Sep 10 17:43:40 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 09/10/2007 at 22:38:08 with the following information: FootPrints Ticket Description: Hi Jaime, good point, I wasn't paying attention to the patch, I just applied it. =) So I just applied your latest patch, and then I condor_rm'd all gridex jobs and cleaned out the gass cache, just to be sure, and the load went down to 0.5 or so, but as soon as the new jobs came in, the load went through the roof again, and still no running grid monitor. :( What else can we try? Thanks a lot, Horst Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Mon Sep 10 17:46:42 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 09/10/2007 at 22:44:09 with the following information: FootPrints Ticket Description: Wait -- I just looked again, and now, after 5 or 10 minutes, the grid monitor is now running, and the load went down to 1! So it seems like the patch did the trick after all! Why did it take a few minutes for this new batch of gridex jobs to start the grid monitor, though? Anyway, I'll monitor the situation till tomorrow morning, but looks like we solved it! Thanks a lot, Horst Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Mon Sep 10 18:06:29 2007 | jfrey@cs.wisc.edu - Correspondence added | [Reply] | |||||||||
Condor-G submits the real jobs in parallel with the grid monitor job. Once the grid monitor starts reporting back to Condor-G, Condor-G kills the jobmanagers of the real jobs. -- Jaime On Sep 10, 2007, at 5:46 PM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > When replying, type your text above this line. > ---------------------------------------------- > This message is to let you know that Open Science Grid ticket 4004 > "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) > still not working" which is assigned to you, was updated on > 09/10/2007 at 22:44:09 with the following information: > > FootPrints Ticket Description: > Wait -- I just looked again, and now, after 5 or 10 minutes, > the grid monitor is now running, and the load went down to 1! > > So it seems like the patch did the trick after all! > Why did it take a few minutes for this new batch of gridex jobs > to start the grid monitor, though? > > Anyway, I'll monitor the situation till tomorrow morning, > but looks like we solved it! > > Thanks a lot, > > Horst > > Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT > Status: Support Agency > Originating VO Support Center: DOSAR > Destination VO Support Center: VDT > Originating Ticket Number: > Destination Ticket Number: > > Thank You, > OSG Grid Operations Center > goc@opensciencegrid.org, 317-278-9699 > info: http://www.opensciencegrid.org > rss: http://www.grid.iu.edu/news/ > > > -- > View ticket at <http://vdt.cs.wisc.edu/rt/Ticket/Display.html? > user=guest&pass=guest&id=2922> > VDT Support, vdt-support@ivdgl.org +--------------------------------+-----------------------------------+ | Jaime Frey | I used to be a heavy gambler. | +--------------------------------+-----------------------------------+| jfrey@cs.wisc.edu | But now I just make mental bets. | | http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind. | |
||||||||||||
| # | Mon Sep 10 18:12:11 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 09/10/2007 at 23:08:08 with the following information: FootPrints Ticket Description: Condor-G submits the real jobs in parallel with the grid monitor job. Once the grid monitor starts reporting back to Condor-G, Condor-G kills the jobmanagers of the real jobs. -- Jaime On Sep 10, 2007, at 5:46 PM, osg@tick-indy.globalnoc.iu.edu via RT wrote: > [Duplicate message snipped] Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Mon Sep 10 19:27:35 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 09/11/2007 at 00:14:08 with the following information: FootPrints Ticket Description: Hi Jaime, I see. I guess I just never paid too much attention to this before, so I didn't notice the load spike when gridex jobs started on our old ITB gatekeeper. We had another load spike when LIGO started their work flow a little while ago, but it went down again, so it looks like all is well! Thanks a lot, Horst Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Wed Sep 12 14:33:14 2007 | roy - Correspondence added | [Reply] | |
|
> Wait -- I just looked again, and now, after 5 or 10 minutes, > the grid monitor is now running, and the load went down to 1! > > So it seems like the patch did the trick after all! OK, so this patch is now part of VDT 1.8.1. Thanks everyone! -alain |
||||
| # | Wed Sep 12 15:03:16 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 09/12/2007 at 19:35:08 with the following information: FootPrints Ticket Description: > Wait -- I just looked again, and now, after 5 or 10 minutes, > the grid monitor is now running, and the load went down to 1! > > So it seems like the patch did the trick after all! OK, so this patch is now part of VDT 1.8.1. Thanks everyone! -alain -- View ticket at <http://vdt.cs.wisc.edu/rt/Ticket/Display.html?user=guest&pass=guest&id=2922> VDT Support, vdt-support@ivdgl.org Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Support Agency Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Wed Sep 12 16:00:23 2007 | osg@tick-indy.globalnoc.iu.edu - Correspondence added | [Reply] | |||||||||
When replying, type your text above this line. ---------------------------------------------- This message is to let you know that Open Science Grid ticket 4004 "grid monitor for GridEx jobs on OUHEP_ITB (osgitb1.nhn.ou.edu) still not working" which is assigned to you, was updated on 09/12/2007 at 19:44:35 with the following information: FootPrints Ticket Description: Great! I am closing this GOC ticket. Alain, please close corresponding VDT ticket. Arvind Assignees: Operations Workgroup, Arvind Gopu, OSG Support Centers, VDT Status: Closed Originating VO Support Center: DOSAR Destination VO Support Center: VDT Originating Ticket Number: Destination Ticket Number: Thank You, OSG Grid Operations Center goc@opensciencegrid.org, 317-278-9699 info: http://www.opensciencegrid.org rss: http://www.grid.iu.edu/news/ |
||||||||||||
| # | Tue Sep 18 17:21:28 2007 | roy - Status changed from 'open' to 'resolved' | ||