
Support for sending hadd jobs to condor #15

Open · wants to merge 4 commits into master
Conversation

abbywarden (Contributor)

An additional flag (--submit) has been added to the mhadd command to send the hadd jobs to condor.

jreic (Contributor) commented Sep 16, 2022

Hi Abby,

This looks nice! Just a few comments for now:

  1. It looks like mhadd -s doesn't give any error message when trying to hadd a file where the output already exists--we should add a printout for this.

  2. The executable in the condor submission file seems to be hardcoded as

executable = /afs/hep.wisc.edu/home/acwarden/work/llp/mfv_1068p1/src/JMTucker/Tools/python/haddOnWorker.py,

it would be good to use the path relevant for whichever copy of the code is checked out (there is surely an env variable for this, though I don't recall it).
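
For example, something along these lines could work (just a sketch, assuming the standard CMSSW_BASE environment variable and the relative path taken from the hardcoded example above):

import os

# Sketch: build the worker-script path from the checked-out release area
# instead of hardcoding an AFS home directory. CMSSW_BASE is set by cmsenv
# and points at the root of the current working area.
cmssw_base = os.environ.get('CMSSW_BASE')
if not cmssw_base:
    raise RuntimeError('CMSSW_BASE is not set -- run cmsenv first')

executable = os.path.join(cmssw_base, 'src', 'JMTucker', 'Tools', 'python', 'haddOnWorker.py')
if not os.path.isfile(executable):
    raise RuntimeError('hadd worker script not found: %s' % executable)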

  3. I think it would be good for the printouts to mention that it is submitting to condor somewhere, e.g. when I run mhadd -s MiniTreeV28Bm/ for some random minitree files, the output is:

MiniTreeV28Bm/condor_ttbar_2017: expecting 3 files if all jobs succeeded hadding 3 files to MiniTreeV28Bm/ttbar_2017.root at 2022-09-16 13:04:01.971853

and it wasn't obvious to me that a condor job had even been submitted (in fact, you mention this in your TODOs in the Evernote). I'm also not sure where the output file goes? It didn't appear as MiniTreeV28Bm/ttbar_2017.root, even after the jobs finished. Do you think it is a question of transferring the output back to the submission node (which is my assumption--maybe the default behavior at Wisconsin is different from LPC), or that the output root file is saved to a different directory (which I thought I would see in hadd.py)?

  4. To be sure, which inputs have you tested with? Mainly histos jobs, or also MiniTree? We should probably make sure to test with all of our usual inputs (or if we expect it only to be useful for certain types of inputs, we could make it auto-fail for, say, MiniTrees).

  5. When hadding 501 input files, condor_q suggested that the jobs might have the same inputs, at least based on the limited printout I see:

158138.0   joeyr           9/16 13:20   0+00:00:00 I  0    0.0 haddOnWorker.py MiniTreeV28Bm/ttbar_2017.root MiniTreeV28Bm/condor_ttbar_2017/minitree_100.root MiniTreeV28Bm/condor_ttbar_2017/minitree_99.root MiniTreeV28Bm/condor_ttbar_2017/minitree_98.root MiniTreeV28Bm/

158139.0   joeyr           9/16 13:20   0+00:00:00 I  0    0.0 haddOnWorker.py MiniTreeV28Bm/ttbar_2017.root MiniTreeV28Bm/condor_ttbar_2017/minitree_100.root MiniTreeV28Bm/condor_ttbar_2017/minitree_99.root MiniTreeV28Bm/condor_ttbar_2017/minitree_98.root MiniTreeV28Bm/

158140.0   joeyr           9/16 13:20   0+00:00:00 I  0    0.0 haddOnWorker.py MiniTreeV28Bm/ttbar_2017.root MiniTreeV28Bm/condor_ttbar_2017/minitree_100.root MiniTreeV28Bm/condor_ttbar_2017/minitree_99.root MiniTreeV28Bm/condor_ttbar_2017/minitree_98.root MiniTreeV28Bm/

This would be important to check, since we could miss some of our input files (and duplicate the rest) otherwise! It's possible that I did something unconventional though, so we should double check against what you've seen...

  6. Maybe on a similar note, the output files I see after submitting the three jobs are only these:
-rw-r--r-- 1 joeyr us_cms      116 Sep 16 13:21 ttbar_2017.err
-rw-r--r-- 1 joeyr us_cms    15474 Sep 16 13:21 ttbar_2017.out
-rw-r--r-- 1 joeyr us_cms     5045 Sep 16 13:21 ttbar_2017.log

I think we should have a .err, .out, and log file for each condor job that runs (and a .root file too, of course!).

Let's look into these for now, and we can do a bit more testing after that. We should also think if the 200 input files per job (in haddcondor) is sufficient, or if we should split more heavily. It might require us to iteratively submit hadd jobs to condor though, which is a bit of a pain (especially with the way mhadd is set up). Maybe for simplicity it's okay to leave that 200 input files as-is.

Thanks!
Joey

abbywarden (Contributor, Author)

Hi Joey. Here is my next set of changes.

  1. I have now included print statements: the output is based on which log files are present in the crab/condor directory (and, in the condor submit case, if the root file is nonzero the jobs will be resubmitted, for a bit of protection?)

  2. The path to the executable is now labelled dir_executable and can be used universally.

  3. I now have it printing p.stdout.read() as jobs are submitted

  4. I've only tested it on histograms, not minitrees -- currently don't have any minitrees to mess around with

5 & 6. I may have said this previously, but because I was only popping 200 files at a time, the return value was only from the last(?) set of 200 ==> my conclusion (for now) was actually to simplify things and not have an upper limit to the files to hadd together so that max condor job submitted per sample is 1 ==> and 1 set of outputs.

abbywarden (Contributor, Author) commented Sep 21, 2022

I have now double-checked that the new print statements work when you have a mix of jobs that were hadded locally and hadded via condor.
Additionally, since minitrees are just copied over (they are only a single output file), they will not be submitted to condor; a sketch of that branch is below.
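
Something like this sketch (placeholder names, not the exact code):

import shutil

# Sketch: a single input file has nothing to merge, so just copy it to the
# target name instead of submitting a condor hadd job for it.
def copy_if_single_input(output_fn, input_files):
    # returns True if the single input was copied and no condor job is needed
    if len(input_files) != 1:
        return False
    print('only one input, copying %s -> %s' % (input_files[0], output_fn))
    shutil.copy2(input_files[0], output_fn)
    return True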

jreic (Contributor) commented Sep 21, 2022

Hi Abby, for me on the LPC, it doesn't seem like the input files are available for hadding when running things out-of-the-box. I wonder if it's a matter of bad input files on my side (very old ones!), or a difference between the LPC and the Wisconsin system. The condor jobs run without any obvious failures (!), but no output file is written.

For testing purposes, it might be useful to have the same set of input files copied over to the LPC to remove one possible failure point--from the printouts I've done, it looks like the input files simply aren't on the condor node, but I could be wrong!

else:
#assume that the last case to consider is the one in which files are copied over (which do not produce any log but the .root file does exist)
Contributor

For some reason, I find the following error if I have an existing file that I made with mhadd locally and try to run via mhadd -s:

Traceback (most recent call last):
  File "/uscms_data/d3/joeyr/abby_pr/mfv_946p1/bin/slc7_amd64_gcc630/mhadd", line 179, in <module>
    hadd_(is_crab, d, new_dir=x)
  File "/uscms_data/d3/joeyr/abby_pr/mfv_946p1/bin/slc7_amd64_gcc630/mhadd", line 153, in hadd_
    print colors.boldred('skipping existing file %s (for which num sources %i = njobs %i)' % (new_name, log_njobs, njobs))
UnboundLocalError: local variable 'log_njobs' referenced before assignment

It looks like log_njobs isn't defined when it gets to this line (at least in my case), so mhadd crashes here.
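
One possible shape for a fix (a sketch, not the actual mhadd code; the colors.boldred wrapping is dropped here):

# Sketch: make log_njobs an explicit, defaulted argument so the copied-file
# case (no crab/condor logs at all) cannot hit an UnboundLocalError.
def report_existing(new_name, njobs, log_njobs=None):
    if log_njobs is None:
        print('skipping existing file %s (no job logs found to count sources)' % new_name)
    else:
        print('skipping existing file %s (for which num sources %i = njobs %i)' % (new_name, log_njobs, njobs))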

abbywarden (Contributor, Author)

Ang is the main author of the most recent commit. We have restructured the submission to condor in a few ways:

  1. hadd_on_condor.py is its own file (which writes the bash file and the condor submission file), instead of adding a condor-submission-specific hadd function in hadd.py
  2. removed haddOnWorker.py as it was unnecessary in the new setup
  3. output files are the same regardless of how you hadd files together, so the print statements have been reverted to their original form.

Comment on lines +59 to +61
#requirements necessary for wisconsin machines -- comment out if not needed
# the CentOS == 7 requires proxy with 2048 bit and will fail with usual 1024
# voms-proxy-init -rfc -valid 144:00 -voms cms -bits 2048
Contributor

Which ones are to be commented out? Is it these?
TARGET.HAS_OSG_WN_CLIENT =?= TRUE
TARGET.OpSysMajorVer == 7

Rather than commenting in/out, you could do something like https://github.com/DisplacedVertices/cmssw-usercode/blob/master/Tools/python/ROOTTools.py#L1576-L1595 with the HOSTNAME
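
For example, a sketch of that approach (the hostname suffix here is just a guess at how to identify the Wisconsin machines):

import socket

# Sketch: only add the Wisconsin-specific requirements when actually running
# on a Wisconsin machine, instead of commenting lines in and out by hand.
requirements = []
if socket.gethostname().endswith('hep.wisc.edu'):
    requirements.append('TARGET.HAS_OSG_WN_CLIENT =?= TRUE')
    requirements.append('TARGET.OpSysMajorVer == 7')

submit_lines = []
if requirements:
    submit_lines.append('requirements = ' + ' && '.join(requirements))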

jreic (Contributor) commented Oct 4, 2022

Hi @abbywarden and @Ang-Li-95,

Thanks for the update to this!

I skimmed the code and didn't see any obvious issues. But when trying to test it out, I noticed one thing:

  • If the inputs are going to be invalid (e.g. if the input files are in nobackup on the LPC, the condor job will always fail), it would be nice to skip submitting the job and throw an error instead; a sketch of such a check is below. This is mostly for when we forget about this issue in the future and would otherwise spend time trying to debug the job failures later on!
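
A sketch of the kind of check I mean (the path patterns are assumptions about what the LPC condor workers cannot read):

import os

# Sketch: fail before submitting if any input lives in an area the LPC condor
# workers cannot see, instead of letting the job die quietly later.
def check_inputs_visible_to_condor(input_files):
    bad = [fn for fn in input_files
           if '/nobackup/' in os.path.realpath(fn) or os.path.realpath(fn).startswith('/uscms_data/')]
    if bad:
        raise ValueError('these inputs will not be readable by the condor job:\n  ' + '\n  '.join(bad))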

Beyond that: have you been able to validate that the mhadd -s and mhadd outputs are identical for (say) a histos job for a relevant signal sample, the same for a background sample, and (if available) for a minitree job? I think this would be a good sanity check, to make sure that the script properly handles histograms in our typical histos.root files, trees, and histograms in our minitree files.
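
For the comparison, a quick PyROOT sketch along these lines could help (it only looks at top-level keys; if the histograms live inside TDirectories, a full check would need to recurse into them):

import ROOT

# Sanity-check sketch: compare the top-level contents of a locally hadded file
# (mhadd) against the condor-hadded one (mhadd -s).
def compare_top_level(fn_local, fn_condor):
    f1, f2 = ROOT.TFile.Open(fn_local), ROOT.TFile.Open(fn_condor)
    names1 = set(k.GetName() for k in f1.GetListOfKeys())
    names2 = set(k.GetName() for k in f2.GetListOfKeys())
    if names1 != names2:
        print('key mismatch: %r' % (names1 ^ names2))
    for name in sorted(names1 & names2):
        h1, h2 = f1.Get(name), f2.Get(name)
        if isinstance(h1, ROOT.TH1) and abs(h1.Integral() - h2.Integral()) > 1e-9:
            print('%s: integrals differ (%s vs %s)' % (name, h1.Integral(), h2.Integral()))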
