
Support for sending hadd jobs to condor #15

Open · wants to merge 4 commits into master
Conversation

abbywarden (Contributor)

An additional flag (--submit) has been added to the mhadd command to send the hadd jobs to condor.

jreic (Contributor) commented Sep 16, 2022

Hi Abby,

This looks nice! Just a few comments for now:

  1. It looks like mhadd -s doesn't give any error message when trying to hadd a file where the output already exists--we should add a printout for this.

  2. The executable in the condor submission file seems to be hardcoded as

executable = /afs/hep.wisc.edu/home/acwarden/work/llp/mfv_1068p1/src/JMTucker/Tools/python/haddOnWorker.py,

it would be good to use the path relevant for whichever copy of the code is checked out (there is surely an env variable for this, though I don't recall it).
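
For example, something along these lines could work (just a sketch, assuming the standard CMSSW_BASE environment variable and the relative path taken from the hardcoded example above):

import os

# Sketch: build the worker-script path from the checked-out release area
# instead of hardcoding an AFS home directory. CMSSW_BASE is set by cmsenv
# and points at the root of the current working area.
cmssw_base = os.environ.get('CMSSW_BASE')
if not cmssw_base:
    raise RuntimeError('CMSSW_BASE is not set -- run cmsenv first')

executable = os.path.join(cmssw_base, 'src', 'JMTucker', 'Tools', 'python', 'haddOnWorker.py')
if not os.path.isfile(executable):
    raise RuntimeError('hadd worker script not found: %s' % executable)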

  3. I think it would be good for the printouts to mention that it is submitting to condor somewhere, e.g. when I run mhadd -s MiniTreeV28Bm/ for some random minitree files, the output is:

MiniTreeV28Bm/condor_ttbar_2017: expecting 3 files if all jobs succeeded hadding 3 files to MiniTreeV28Bm/ttbar_2017.root at 2022-09-16 13:04:01.971853

and it wasn't obvious to me that a condor job had even been submitted (in fact, you mention this in your TODOs in the Evernote). I'm also not sure where the output file goes? It didn't appear as MiniTreeV28Bm/ttbar_2017.root, even after the jobs finished. Do you think it is a question of transferring the output back to the submission node (which is my assumption--maybe the default behavior at Wisconsin is different from LPC), or that the output root file is saved to a different directory (which I thought I would see in hadd.py)?

  4. To be sure, which inputs have you tested with? Mainly histos jobs, or also MiniTree? We should probably make sure to test with all of our usual inputs (or if we expect it only to be useful for certain types of inputs, we could make it auto-fail for, say, MiniTrees).

  5. When hadding 501 input files, condor_q suggested that the jobs might have the same inputs, at least based on the limited printout I see:

158138.0   joeyr           9/16 13:20   0+00:00:00 I  0    0.0 haddOnWorker.py MiniTreeV28Bm/ttbar_2017.root MiniTreeV28Bm/condor_ttbar_2017/minitree_100.root MiniTreeV28Bm/condor_ttbar_2017/minitree_99.root MiniTreeV28Bm/condor_ttbar_2017/minitree_98.root MiniTreeV28Bm/

158139.0   joeyr           9/16 13:20   0+00:00:00 I  0    0.0 haddOnWorker.py MiniTreeV28Bm/ttbar_2017.root MiniTreeV28Bm/condor_ttbar_2017/minitree_100.root MiniTreeV28Bm/condor_ttbar_2017/minitree_99.root MiniTreeV28Bm/condor_ttbar_2017/minitree_98.root MiniTreeV28Bm/

158140.0   joeyr           9/16 13:20   0+00:00:00 I  0    0.0 haddOnWorker.py MiniTreeV28Bm/ttbar_2017.root MiniTreeV28Bm/condor_ttbar_2017/minitree_100.root MiniTreeV28Bm/condor_ttbar_2017/minitree_99.root MiniTreeV28Bm/condor_ttbar_2017/minitree_98.root MiniTreeV28Bm/

This would be important to check, since we could miss some of our input files (and duplicate the rest) otherwise! It's possible that I did something unconventional though, so we should double check against what you've seen...

  6. Maybe on a similar note, the output files I see after submitting the three jobs are only these:
-rw-r--r-- 1 joeyr us_cms      116 Sep 16 13:21 ttbar_2017.err
-rw-r--r-- 1 joeyr us_cms    15474 Sep 16 13:21 ttbar_2017.out
-rw-r--r-- 1 joeyr us_cms     5045 Sep 16 13:21 ttbar_2017.log

I think we should have a .err, .out, and log file for each condor job that runs (and a .root file too, of course!).

Let's look into these for now, and we can do a bit more testing after that. We should also think if the 200 input files per job (in haddcondor) is sufficient, or if we should split more heavily. It might require us to iteratively submit hadd jobs to condor though, which is a bit of a pain (especially with the way mhadd is set up). Maybe for simplicity it's okay to leave that 200 input files as-is.

Thanks!
Joey

abbywarden (Contributor, Author)

Hi Joey. Here is my next set of changes.

  1. I have now included print statements: the output is based on which log files are present in the crab/condor directory (and, in the condor submit case, if the root file is nonzero the jobs will be resubmitted, for a bit of protection?)

  2. The path to the executable is now labelled dir_executable and can be used universally.

  3. I now have it printing p.stdout.read() as jobs are submitted

  4. I've only tested it on histograms, not minitrees -- currently don't have any minitrees to mess around with

5 & 6. I may have said this previously, but because I was only popping 200 files at a time, the return value was only from the last(?) set of 200 ==> my conclusion (for now) was actually to simplify things and not have an upper limit to the files to hadd together so that max condor job submitted per sample is 1 ==> and 1 set of outputs.

abbywarden (Contributor, Author) commented Sep 21, 2022

I have now double-checked that the new print statements work when you have a mix of jobs that were hadded locally and hadded via condor.
Additionally, since minitrees are just copied over (they are only a single output file), they will not be submitted to condor; a sketch of that branch is below.
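
Something like this sketch (placeholder names, not the exact code):

import shutil

# Sketch: a single input file has nothing to merge, so just copy it to the
# target name instead of submitting a condor hadd job for it.
def copy_if_single_input(output_fn, input_files):
    # returns True if the single input was copied and no condor job is needed
    if len(input_files) != 1:
        return False
    print('only one input, copying %s -> %s' % (input_files[0], output_fn))
    shutil.copy2(input_files[0], output_fn)
    return True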

jreic (Contributor) commented Sep 21, 2022

Hi Abby, for me on the LPC, it doesn't seem like the input files are available for hadding when running things out-of-the-box. I wonder if it's a matter of bad input files on my side (very old ones!), or a difference between the LPC and the Wisconsin system. The condor jobs run without any obvious failures (!), but no output file is written.

For testing purposes, it might be useful to have the same set of input files copied over to the LPC to remove one possible failure point--from the printouts I've done, it looks like the input files simply aren't on the condor node, but I could be wrong!

else:
#assume that the last case to consider is the one in which files are copied over (which do not produce any log but the .root file does exist)
Contributor

For some reason, I find the following error if I have an existing file that I made with mhadd locally and try to run via mhadd -s:

Traceback (most recent call last):
  File "/uscms_data/d3/joeyr/abby_pr/mfv_946p1/bin/slc7_amd64_gcc630/mhadd", line 179, in <module>
    hadd_(is_crab, d, new_dir=x)
  File "/uscms_data/d3/joeyr/abby_pr/mfv_946p1/bin/slc7_amd64_gcc630/mhadd", line 153, in hadd_
    print colors.boldred('skipping existing file %s (for which num sources %i = njobs %i)' % (new_name, log_njobs, njobs))
UnboundLocalError: local variable 'log_njobs' referenced before assignment

It looks like log_njobs isn't defined when it gets to this line (at least in my case), so mhadd crashes here.
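
One possible shape for a fix (a sketch, not the actual mhadd code; the colors.boldred wrapping is dropped here):

# Sketch: make log_njobs an explicit, defaulted argument so the copied-file
# case (no crab/condor logs at all) cannot hit an UnboundLocalError.
def report_existing(new_name, njobs, log_njobs=None):
    if log_njobs is None:
        print('skipping existing file %s (no job logs found to count sources)' % new_name)
    else:
        print('skipping existing file %s (for which num sources %i = njobs %i)' % (new_name, log_njobs, njobs))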

abbywarden (Contributor, Author)

Ang is the main author of the most recent commit. We have restructured the submission to condor in a few ways:

  1. hadd_on_condor.py is its own file (which writes the bash file and the condor submission file), instead of adding a condor-submission-specific hadd function in hadd.py
  2. removed haddOnWorker.py as it was unnecessary in the new setup
  3. output files are the same regardless of how you hadd files together, so the print statements have been reverted to their original form.

Comment on lines +59 to +61
#requirements necessary for wisconsin machines -- comment out if not needed
# the CentOS == 7 requires proxy with 2048 bit and will fail with usual 1024
# voms-proxy-init -rfc -valid 144:00 -voms cms -bits 2048
Contributor

Which ones are to be commented out? Is it these?
TARGET.HAS_OSG_WN_CLIENT =?= TRUE
TARGET.OpSysMajorVer == 7

Rather than commenting in/out, you could do something like https://github.com/DisplacedVertices/cmssw-usercode/blob/master/Tools/python/ROOTTools.py#L1576-L1595 with the HOSTNAME
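
For example, a sketch of that approach (the hostname suffix here is just a guess at how to identify the Wisconsin machines):

import socket

# Sketch: only add the Wisconsin-specific requirements when actually running
# on a Wisconsin machine, instead of commenting lines in and out by hand.
requirements = []
if socket.gethostname().endswith('hep.wisc.edu'):
    requirements.append('TARGET.HAS_OSG_WN_CLIENT =?= TRUE')
    requirements.append('TARGET.OpSysMajorVer == 7')

submit_lines = []
if requirements:
    submit_lines.append('requirements = ' + ' && '.join(requirements))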

jreic (Contributor) commented Oct 4, 2022

Hi @abbywarden and @Ang-Li-95,

Thanks for the update to this!

I skimmed the code and didn't see any obvious issues. But when trying to test it out, I noticed one thing:

  • If the inputs are going to be invalid (e.g. if the input files are in nobackup on the LPC, the condor job will always fail), it would be nice to skip submitting the job and throw an error instead; a sketch of such a check is below. This is mostly for when we forget about this issue in the future and would otherwise spend time trying to debug the job failures later on!
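
A sketch of the kind of check I mean (the path patterns are assumptions about what the LPC condor workers cannot read):

import os

# Sketch: fail before submitting if any input lives in an area the LPC condor
# workers cannot see, instead of letting the job die quietly later.
def check_inputs_visible_to_condor(input_files):
    bad = [fn for fn in input_files
           if '/nobackup/' in os.path.realpath(fn) or os.path.realpath(fn).startswith('/uscms_data/')]
    if bad:
        raise ValueError('these inputs will not be readable by the condor job:\n  ' + '\n  '.join(bad))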

Beyond that: have you been able to validate that the mhadd -s and mhadd outputs are identical for (say) a histos job for a relevant signal sample, the same for a background sample, and (if available) for a minitree job? I think this would be a good sanity check, to make sure that the script properly handles histograms in our typical histos.root files, trees, and histograms in our minitree files.
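
For the comparison, a quick PyROOT sketch along these lines could help (it only looks at top-level keys; if the histograms live inside TDirectories, a full check would need to recurse into them):

import ROOT

# Sanity-check sketch: compare the top-level contents of a locally hadded file
# (mhadd) against the condor-hadded one (mhadd -s).
def compare_top_level(fn_local, fn_condor):
    f1, f2 = ROOT.TFile.Open(fn_local), ROOT.TFile.Open(fn_condor)
    names1 = set(k.GetName() for k in f1.GetListOfKeys())
    names2 = set(k.GetName() for k in f2.GetListOfKeys())
    if names1 != names2:
        print('key mismatch: %r' % (names1 ^ names2))
    for name in sorted(names1 & names2):
        h1, h2 = f1.Get(name), f2.Get(name)
        if isinstance(h1, ROOT.TH1) and abs(h1.Integral() - h2.Integral()) > 1e-9:
            print('%s: integrals differ (%s vs %s)' % (name, h1.Integral(), h2.Integral()))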
