
Monitoring HTCondor
Andrew Lahiff
STFC Rutherford Appleton Laboratory
European HTCondor Site Admins Meeting 2014
Introduction
• Two aspects of monitoring
– General overview of the system
• How many running/idle jobs? By user/VO? By schedd?
• How full is the farm?
• How many draining worker nodes?
– More detailed views
• What are individual jobs doing?
• What’s happening on individual worker nodes?
• Health of the different components of the HTCondor pool
• ...in addition to Nagios
Introduction
• Methods
– Command line utilities
– Ganglia
– Third-party applications
(which run command-line tools or use the Python API)
Command line
• Three useful commands
– condor_status
• Overview of the pool (including jobs, machines)
• Information about specific worker nodes
– condor_q
• Information about jobs in the queue
– condor_history
• Information about completed jobs
Overview of jobs
-bash-4.1$ condor_status -collector
Name                               Machine             RunningJobs  IdleJobs  HostsTotal
[email protected]  condor01.gridpp.rl        10608      8355       11347
[email protected]  condor02.gridpp.rl        10616      8364       11360
Overview of machines
-bash-4.1$ condor_status -total
              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
X86_64/LINUX  11183     95    10441        592        0           0         0
       Total  11183     95    10441        592        0           0         0
Jobs by schedd
-bash-4.1$ condor_status -schedd
Name                  Machine     TotalRunningJobs  TotalIdleJobs  TotalHeldJobs
arc-ce01.gridpp.rl.a  arc-ce01.g              2388           1990             13
arc-ce02.gridpp.rl.a  arc-ce02.g              2011           1995             31
arc-ce03.gridpp.rl.a  arc-ce03.g              4272           1994              9
arc-ce04.gridpp.rl.a  arc-ce04.g              1424           2385             12
arc-ce05.gridpp.rl.a  arc-ce05.g                 1              0              6
cream-ce01.gridpp.rl  cream-ce01               266              0              0
cream-ce02.gridpp.rl  cream-ce02               247              0              0
lcg0955.gridpp.rl.ac  lcg0955.gr                 0              0              0
lcgui03.gridpp.rl.ac  lcgui03.gr                 3              0              0
lcgui04.gridpp.rl.ac  lcgui04.gr                 0              0              0
lcgvm21.gridpp.rl.ac  lcgvm21.gr                 0              0              0

        TotalRunningJobs  TotalIdleJobs  TotalHeldJobs
 Total             10612           8364             71
Jobs by user, schedd
-bash-4.1$ condor_status -submitters
Name                          Machine             RunningJobs  IdleJobs  HeldJobs
[email protected]      arc-ce01.gridpp.rl            0         0         0
[email protected]    arc-ce01.gridpp.rl          540         0         1
group_ATLAS.atlas_pilot.tatl  arc-ce01.gridpp.rl          142         0         0
group_ATLAS.prodatls.patls00  arc-ce01.gridpp.rl           82         5         0
[email protected]     arc-ce01.gridpp.rl            1         0         0
group_CMS.cms_pilot.ttcms022  arc-ce01.gridpp.rl          214       390         0
group_CMS.cms_pilot.ttcms043  arc-ce01.gridpp.rl           68       100         0
[email protected]    arc-ce01.gridpp.rl           78       476         4
[email protected]    arc-ce01.gridpp.rl           12       910         0
group_CMS.prodcms_multicore.  arc-ce01.gridpp.rl           47       102         0
[email protected]    arc-ce01.gridpp.rl            0         0         0
group_LHCB.lhcb_pilot.tlhcb0  arc-ce01.gridpp.rl          992         0         2
group_NONLHC.snoplus.snoplus  arc-ce01.gridpp.rl            0         0         0
…
Jobs by user
Name                  RunningJobs  IdleJobs  HeldJobs
group_ALICE.alice.al            0         0         0
group_ALICE.alice.al         3500       368         5
group_ALICE.alice_pi            0         0         0
group_ATLAS.atlas.at            0         0         0
group_ATLAS.atlas.at            0         0         0
group_ATLAS.atlas_pi          414        12        10
group_ATLAS.atlas_pi            0         0         2
group_ATLAS.prodatls          354        36        11
group_CMS.cms.cmssgm            1         0         0
group_CMS.cms_pilot.          371      2223         0
group_CMS.cms_pilot.            0         0         1
group_CMS.cms_pilot.           68       200         0
group_CMS.prodcms.pc          188      1905        10
group_CMS.prodcms.pc          312      3410         0
group_CMS.prodcms_mu           47       102         0
…
condor_q
[[email protected] ~]# condor_q

-- Submitter: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:64454> : arc-ce01.gridpp.rl.ac.uk
 ID        OWNER     SUBMITTED    RUN_TIME   ST PRI SIZE  CMD
794717.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob)
794718.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob)
794719.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob)
794720.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob)
794721.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob)
794722.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob)
794723.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob)
794725.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob)
794726.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob)
…
3502 jobs; 0 completed, 0 removed, 1528 idle, 1965 running, 9 held, 0 suspended
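Third-party tools (as noted in the introduction) often wrap these command-line utilities. As a minimal sketch, the final summary line of condor_q can be parsed into counts for plotting or alarming; the helper name below is my own, not part of HTCondor:

```python
import re

def parse_condor_q_summary(line):
    """Parse the final condor_q summary line into a dict of counts."""
    counts = {"jobs": int(re.match(r"(\d+) jobs;", line).group(1))}
    # Each "N <state>" pair after the semicolon, e.g. "1528 idle"
    for n, state in re.findall(r"(\d+) (\w+)", line.split(";", 1)[1]):
        counts[state] = int(n)
    return counts

summary = "3502 jobs; 0 completed, 0 removed, 1528 idle, 1965 running, 9 held, 0 suspended"
print(parse_condor_q_summary(summary))
```

The same idea extends to condor_status output, or can be avoided entirely by using the Python bindings.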
Multi-core jobs
-bash-4.1$ condor_q -global -constraint 'RequestCpus > 1'

-- Schedd: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:39356>
 ID        OWNER     SUBMITTED    RUN_TIME   ST PRI SIZE  CMD
832677.0   pcms004   12/5 14:33   0+00:15:07 R  0   2.0   (gridjob)
832717.0   pcms004   12/5 14:37   0+00:12:02 R  0   0.0   (gridjob)
832718.0   pcms004   12/5 14:37   0+00:00:00 I  0   0.0   (gridjob)
832719.0   pcms004   12/5 14:37   0+00:00:00 I  0   0.0   (gridjob)
832893.0   pcms004   12/5 14:47   0+00:00:00 I  0   0.0   (gridjob)
832894.0   pcms004   12/5 14:47   0+00:00:00 I  0   0.0   (gridjob)
…
Multi-core jobs
• Custom print format
-bash-4.1$ condor_q -global -pr queue_mc.cpf

-- Schedd: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:39356>
 ID        OWNER     SUBMITTED    RUN_TIME   ST SIZE  CMD        CORES
832677.0   pcms004   12/5 14:33   0+00:00:00 R  2.0   (gridjob)      8
832717.0   pcms004   12/5 14:37   0+00:00:00 R  0.0   (gridjob)      8
832718.0   pcms004   12/5 14:37   0+00:00:00 I  0.0   (gridjob)      8
832719.0   pcms004   12/5 14:37   0+00:00:00 I  0.0   (gridjob)      8
832893.0   pcms004   12/5 14:47   0+00:00:00 I  0.0   (gridjob)      8
832894.0   pcms004   12/5 14:47   0+00:00:00 I  0.0   (gridjob)      8
…
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=ExperimentalCustomPrintFormats
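The queue_mc.cpf file itself is not shown in the slides. A plausible sketch, following the custom print format syntax documented on the wiki page above, might look like the following; the column widths and the exact attribute/PRINTAS choices are guesses, not the file actually used at RAL:

```
# queue_mc.cpf – hypothetical custom print format adding a CORES column
SELECT
   ClusterId     AS ' ID'        NOSUFFIX WIDTH 7
   ProcId        AS ' '          NOPREFIX PRINTF '.%-3d'
   Owner         AS 'OWNER'      WIDTH -14 PRINTAS OWNER
   QDate         AS ' SUBMITTED' WIDTH 11  PRINTAS QDATE
   RemoteUserCpu AS ' RUN_TIME'  WIDTH 12  PRINTAS CPU_TIME
   JobStatus     AS ST           PRINTAS JOB_STATUS
   ImageSize     AS SIZE         WIDTH 6   PRINTAS MEMORY_USAGE
   Cmd           AS CMD          PRINTAS JOB_DESCRIPTION
   RequestCpus   AS CORES
SUMMARY STANDARD
```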
Jobs with specific DN
-bash-4.1$ condor_q -global -constraint 'x509userproxysubject=="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1"'

-- Schedd: arc-ce03.gridpp.rl.ac.uk : <130.246.181.25:62763>
 ID        OWNER     SUBMITTED    RUN_TIME   ST PRI SIZE    CMD
678275.0   tatls015  12/2 17:57   2+06:07:15 R  0   2441.4  (arc_pilot)
681762.0   tatls015  12/3 03:13   1+21:12:31 R  0   2197.3  (arc_pilot)
705153.0   tatls015  12/4 07:36   0+16:49:12 R  0   2197.3  (arc_pilot)
705807.0   tatls015  12/4 08:16   0+16:09:27 R  0   2197.3  (arc_pilot)
705808.0   tatls015  12/4 08:16   0+16:09:27 R  0   2197.3  (arc_pilot)
706612.0   tatls015  12/4 09:16   0+15:09:37 R  0   2197.3  (arc_pilot)
706614.0   tatls015  12/4 09:16   0+15:09:26 R  0   2197.3  (arc_pilot)
…
Jobs killed
• Jobs which were removed
[[email protected] ~]# condor_history -constraint 'JobStatus == 3'
 ID        OWNER     SUBMITTED    RUN_TIME   ST COMPLETED  CMD
823881.0   alicesgm  12/5 01:01   1+06:13:22 X  ???        /var/spool/arc/grid03/CVuMDmBSwGlnCIXDjqi
831849.0   tlhcb005  12/5 13:19   0+18:52:26 X  ???        /var/spool/arc/grid09/gWmLDm5x7GlnCIXDjqi
832753.0   tlhcb005  12/5 14:38   0+17:07:07 X  ???        /var/spool/arc/grid00/5wqKDm7C9GlnCIXDjqi
819636.0   alicesgm  12/4 19:27   1+12:13:56 X  ???        /var/spool/arc/grid00/mlrNDmoErGlnCIXDjqi
825511.0   alicesgm  12/5 03:03   0+18:52:10 X  ???        /var/spool/arc/grid04/XpuKDmxLyGlnCIXDjqi
823799.0   alicesgm  12/5 00:56   1+05:58:15 X  ???        /var/spool/arc/grid03/DYuMDmzMwGlnCIXDjqi
820001.0   alicesgm  12/4 19:48   1+06:43:22 X  ???        /var/spool/arc/grid08/cmzNDmpYrGlnCIXDjqi
833589.0   alicesgm  12/5 16:01   0+14:06:34 X  ???        /var/spool/arc/grid09/HKSLDmqUAHlnCIXDjqi
778644.0   tlhcb005  12/2 05:56   4+00:00:10 X  ???        /var/spool/arc/grid00/pIJNDm6cvFlnCIXDjqi
…
Jobs killed
• Jobs removed for exceeding memory limit
[[email protected] ~]# condor_history -constraint 'JobStatus==3 &&
ResidentSetSize>1024*RequestMemory' -af ClusterId Owner ResidentSetSize RequestMemory
823953 alicesgm 3500000 3000
824438 alicesgm 3250000 3000
820045 alicesgm 3500000 3000
823881 alicesgm 3250000 3000
…
[[email protected] ~]# condor_history -constraint 'JobStatus==3 &&
ResidentSetSize>1024*RequestMemory' -af x509UserProxyVOName | sort | uniq -c
515 alice
5 cms
70 lhcb
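The factor of 1024 in the constraint bridges a unit mismatch: ResidentSetSize is reported in KiB while RequestMemory is in MiB. A minimal Python illustration of the same comparison, using the numbers from the first row above (the function name is my own):

```python
def exceeded_memory(resident_set_size_kb, request_memory_mb):
    """Replicates the condor_history constraint: RSS (KiB) vs request (MiB)."""
    return resident_set_size_kb > 1024 * request_memory_mb

# First row above: 3500000 KiB used vs 3000 MiB (= 3072000 KiB) requested
print(exceeded_memory(3500000, 3000))
```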
condor_who
• What jobs are currently running on a worker node?
[[email protected] ~]# condor_who
OWNER                    CLIENT                    SLOT  JOB        RUNTIME     PID    PROGRAM
[email protected]  arc-ce02.gridpp.rl.ac.uk  1_2   654753.0   0+00:01:54  15743  /usr/libexec/condor/co
[email protected]  arc-ce02.gridpp.rl.ac.uk  1_5   654076.0   0+00:56:50  21916  /usr/libexec/condor/co
[email protected]  arc-ce04.gridpp.rl.ac.uk  1_10  1337818.0  0+02:51:34  31893  /usr/libexec/condor/co
[email protected]  arc-ce04.gridpp.rl.ac.uk  1_7   1337776.0  0+03:06:51  32295  /usr/libexec/condor/co
[email protected]  arc-ce02.gridpp.rl.ac.uk  1_1   651508.0   0+05:02:45  17556  /usr/libexec/condor/co
[email protected]  arc-ce03.gridpp.rl.ac.uk  1_4   737874.0   0+05:44:24  5032   /usr/libexec/condor/co
[email protected]  arc-ce04.gridpp.rl.ac.uk  1_6   1336938.0  0+08:42:18  26911  /usr/libexec/condor/co
[email protected]  arc-ce01.gridpp.rl.ac.uk  1_8   826808.0   1+02:50:16  3485   /usr/libexec/condor/co
[email protected]  arc-ce03.gridpp.rl.ac.uk  1_3   722597.0   1+08:44:28  22966  /usr/libexec/condor/co
Startd history
• If STARTD_HISTORY is defined on your worker nodes, condor_history run on a node shows the jobs it has completed
[[email protected] ~]# condor_history
 ID        OWNER     SUBMITTED    RUN_TIME   ST COMPLETED   CMD
841989.0   tatls015  12/6 07:58   0+00:02:39 C  12/6 08:01  /var/spool/arc/grid03/PZ6NDmPQPHlnCIXDjqi
841950.0   tatls015  12/6 07:56   0+00:02:40 C  12/6 07:59  /var/spool/arc/grid03/mckKDm4OPHlnCIXDjqi
841889.0   tatls015  12/6 07:53   0+00:02:33 C  12/6 07:56  /var/spool/arc/grid01/X3bNDmTMPHlnCIXDjqi
841847.0   tatls015  12/6 07:50   0+00:02:35 C  12/6 07:54  /var/spool/arc/grid00/yHHODmfJPHlnCIXDjqi
841816.0   tatls015  12/6 07:48   0+00:02:36 C  12/6 07:51  /var/spool/arc/grid04/iizMDmVHPHlnCIXDjqi
841791.0   tatls015  12/6 07:45   0+00:02:33 C  12/6 07:48  /var/spool/arc/grid00/N3vKDmKEPHlnCIXDjqi
716804.0   alicesgm  12/4 18:28   1+13:15:07 C  12/6 07:44  /var/spool/arc/grid07/TUQNDmUJqGlnzEJDjqI
…
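Enabling this is a one-line addition to the worker-node configuration; the path below is an example, any location writable by the startd will do:

```
# Keep a per-node history file of completed jobs, readable via condor_history
STARTD_HISTORY = $(LOG)/startd_history
```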
Ganglia
• condor_gangliad
– Runs on a single host (can be any host)
– Gathers daemon ClassAds from the collector
– Publishes metrics to ganglia with host spoofing
• At RAL we have the following configuration on one host:
GANGLIAD_VERBOSITY = 2
GANGLIAD_PER_EXECUTE_NODE_METRICS = False
GANGLIAD = $(LIBEXEC)/condor_gangliad
GANGLIA_CONFIG = /etc/gmond.conf
GANGLIAD_METRICS_CONFIG_DIR = /etc/condor/ganglia.d
GANGLIA_SEND_DATA_FOR_ALL_HOSTS = true
DAEMON_LIST = MASTER, GANGLIAD
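Metrics in GANGLIAD_METRICS_CONFIG_DIR are defined as ClassAds. A sketch of one definition is shown below; the attribute values are illustrative, assuming the metric-file format used by condor_gangliad's shipped defaults:

```
[
  Name       = "JobsRunning";
  Value      = TotalRunningJobs;
  Desc       = "Number of running jobs reported by the schedd";
  TargetType = "Scheduler";
]
```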
Ganglia
• Small subset from schedd
Ganglia
• Small subset from central manager
Easy to make custom plots
Total running, idle, held jobs
Running jobs by schedd
Negotiator health
Negotiation cycle duration
Number of AutoClusters
Draining & multi-core slots
(Some) Third party tools
Job overview
• Condor Job Overview Monitor
http://sarkar.web.cern.ch/sarkar/doc/condor_jobview.html
Mimic
• Internal RAL application
htcondor-sysview
htcondor-sysview
• Hover mouse over a core to get job information
Nagios
• Most (all?) sites probably use Nagios or an alternative
• At RAL
– Process checks for condor_master on all nodes
– Central managers
• Check for at least 1 collector
• Check for the negotiator
• Checks for worker nodes:
– Number of startd ClassAds needs to be above a threshold
– Number of non-broken worker nodes needs to be above a threshold
– CEs
• Check for schedd
• Job submission test
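The worker-node checks above reduce to comparing a ClassAd count against thresholds and mapping the result to Nagios exit codes. A skeleton in Python follows; the thresholds and the counting command in the comment are site-specific examples, not RAL's actual plugin:

```python
# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL = 0, 1, 2

def check_threshold(count, warn_below, crit_below):
    """Map a startd-ClassAd count to a Nagios state and message."""
    if count < crit_below:
        return CRITICAL, "CRITICAL: only %d startd ads" % count
    if count < warn_below:
        return WARNING, "only %d startd ads" % count
    return OK, "OK: %d startd ads" % count

# In a real plugin the count would come from something like:
#   condor_status -af Name | wc -l
state, message = check_threshold(11183, warn_below=11000, crit_below=10000)
print(message)
```

A real plugin would sys.exit(state) so Nagios picks up the result.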
