Data differences running with multiprocessing #8221

Closed
stscijgbot-jp opened this issue Jan 24, 2024 · 15 comments · Fixed by spacetelescope/stcal#239

Issue JP-3521 was created on JIRA by Melanie Clarke:

I'm seeing some data differences between detector1 runs with maximum_cores='all' and the default serial processing.  I also see a warning about "leaked semaphore objects" after the run finishes when using multiple cores.

My test data is file jw02123001001_11101_00001_nrs2_uncal.fits.  I'm using fitsdiff to compare the output rateints files, with and without parallel processing in both steps.  I haven't tried turning it off in one and leaving it on in the other.  I see differences in the SCI, ERR, DQ, VAR_POISSON, and VAR_READNOISE extensions, in about 1% of the pixels, mostly at the 10-20% level.
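
For reference, the fitsdiff comparison described above can also be scripted with astropy's FITSDiff (the same machinery behind the fitsdiff command-line tool). This is only a minimal sketch; the serial/parallel file names and the tolerance are placeholders, not the exact files used here:

from astropy.io import fits

# Placeholder names for the rateints products from the serial and parallel runs.
serial_file = "serial/jw02123001001_11101_00001_nrs2_rateints.fits"
parallel_file = "parallel/jw02123001001_11101_00001_nrs2_rateints.fits"

# Compare all extensions; rtol sets the relative tolerance below which
# floating-point differences are ignored (0.0 reports every difference).
diff = fits.FITSDiff(serial_file, parallel_file, rtol=0.0)
print(diff.report())
print("Identical:", diff.identical)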

Comment by Melanie Clarke on JIRA:

Test data, including output jump files, is at: ███████████████████████████████████████████████████

The full command line used and software and CRDS versions are noted in the README.

Comment by Michael Regan on JIRA:

I've put a PR into STCAL that should fix this. The PR number is #239.

Comment by Melanie Clarke on JIRA:

Thanks - I will see if I can test against your branch.

Can we please also add a regression test to run detector1 with and without multiprocessing and make sure the results are the same?
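
One possible shape for such a test, sketched here under the assumption of a pytest-style regression suite: the input file, output locations, and the exact maximum_cores values accepted by a given jwst version are assumptions, not the final test.

import numpy as np
from astropy.io import fits
from jwst.pipeline import Detector1Pipeline

def run_detector1(uncal_file, cores, output_dir):
    # Run Detector1 with the requested core setting for the jump and ramp_fit steps.
    output_dir.mkdir(parents=True, exist_ok=True)
    Detector1Pipeline.call(
        uncal_file,
        output_dir=str(output_dir),
        save_results=True,
        steps={
            "jump": {"maximum_cores": cores},
            "ramp_fit": {"maximum_cores": cores},
        },
    )
    return output_dir / "jw02123001001_11101_00001_nrs2_rateints.fits"

def test_detector1_serial_vs_parallel(tmp_path):
    uncal = "jw02123001001_11101_00001_nrs2_uncal.fits"  # dataset from this issue
    serial = run_detector1(uncal, "1", tmp_path / "serial")
    parallel = run_detector1(uncal, "all", tmp_path / "parallel")

    # DQ should match exactly; the float extensions should agree within tolerance.
    np.testing.assert_array_equal(fits.getdata(serial, "DQ"),
                                  fits.getdata(parallel, "DQ"))
    for ext in ("SCI", "ERR", "VAR_POISSON", "VAR_READNOISE"):
        np.testing.assert_allclose(
            fits.getdata(serial, ext),
            fits.getdata(parallel, ext),
            err_msg=f"{ext} differs between serial and parallel runs",
        )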

Comment by Michael Regan on JIRA:

I added a regression test, and it was useful: it showed that the only remaining problem was due to the multiple flagging of pixels with both JUMP and DO_NOT_USE or SATURATION.

I've added the test to the PR.

Comment by Melanie Clarke on JIRA:

Testing with your branch and my test data, I see some changes to the version processed in parallel, but it still doesn't match the serial version.  I'm copying the data up to the same location now.

Thanks for the unit test!  Howard Bushouse - would it be useful to also add a regression test to the jwst suite, for a more general end-to-end integrity check on the multiprocessing steps in detector1?

Comment by Howard Bushouse on JIRA:

Melanie Clarke, adding a regression test sounds like a good idea. We could experiment with one of the existing regtests that runs Detector1 and see whether the dataset(s) currently in use show differences between the two methods; if so, we could just add an extra test to one or more of them to run jump and/or ramp_fit both ways. If none of the existing tests exhibit the behavior, we could create a new test from your dataset.

Comment by Melanie Clarke on JIRA:

That sounds perfect. Thanks!

Comment by Howard Bushouse on JIRA:

Fixed by spacetelescope/stcal#239

Comment by Melanie Clarke on JIRA:

Testing with my sample file, I now see identical jump files with and without parallel processing.

However, I do still see some very small differences between the rateints files. For this particular example, only 10 pixels in the SCI extension differ, at about the 0.001% level, so the differences are much less significant than before this fix, but they are still consistently present.

It looks like these changes were only in the jump code.  Can we please review the ramp fitting code as well for possible differences with multiprocessing?

Comment by Michael Regan on JIRA:

There were bugs in the jump code that were exposed in your comparison between serial and parallel processing. The bug fixes caused changes in the GROUPDQ. Pretty much any change in the GROUPDQ will lead to a change in the rateints and rate products.

Comment by Melanie Clarke on JIRA:

These are rateints files generated from identical jump files, produced with the latest stcal, so the only difference is that parallel processing is on for the ramp_fit step.  I just confirmed again that running the ramp fit step directly on the same jump file produces the same differences:

strun jwst.ramp_fitting.RampFitStep jw02123001001_11101_00001_nrs2_jump.fits

produces a slightly different file than:

strun jwst.ramp_fitting.RampFitStep jw02123001001_11101_00001_nrs2_jump.fits --maximum_cores=all
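
The same check can also be scripted from Python. A small sketch, assuming the RampFitStep interface and the jump file named above, and assuming the step returns the rate and per-integration (rateints) products:

import numpy as np
from jwst.ramp_fitting import RampFitStep

jump_file = "jw02123001001_11101_00001_nrs2_jump.fits"

# One serial run and one run using all available cores, on the same jump product.
rate_serial, rateints_serial = RampFitStep.call(jump_file)
rate_parallel, rateints_parallel = RampFitStep.call(jump_file, maximum_cores="all")

# Count SCI pixels that differ between the two rateints products
# (NaNs are zeroed so that matching NaN pixels are not counted as differences).
diff = rateints_parallel.data - rateints_serial.data
ndiff = np.count_nonzero(np.nan_to_num(diff))
print(ndiff, "differing SCI pixels out of", diff.size)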

Comment by Michael Regan on JIRA:

Melanie Clarke, I've looked at the rateints you created and it really just looks like a numerical precision issue. Here are the stats on the rateints pixels that have different slope values:

Number of pixels = 153755 (1.22%)
Maximum positive deviation = 1.5e-4
Minimum negative deviation = -2.2e-4
Median of non-zero pixel values = -8.5e-5
Mean of non-zero pixel values = -7.3e-5

So I don't think we have a problem here. Thanks for looking at this; I never ran this comparison for either jump or ramp_fit, and I should have.
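
For anyone reproducing these numbers, a sketch of how statistics like the above can be computed from the SCI extensions of the two rateints files; the file names are placeholders:

import numpy as np
from astropy.io import fits

# Placeholder names for the serial and parallel rateints products.
sci_serial = fits.getdata("rateints_serial.fits", "SCI")
sci_parallel = fits.getdata("rateints_parallel.fits", "SCI")

diff = sci_parallel - sci_serial
nonzero = diff[np.isfinite(diff) & (diff != 0)]

print(f"Number of pixels = {nonzero.size} ({100 * nonzero.size / diff.size:.2f}%)")
print(f"Maximum positive deviation = {nonzero.max():.1e}")
print(f"Minimum negative deviation = {nonzero.min():.1e}")
print(f"Median of non-zero differences = {np.median(nonzero):.1e}")
print(f"Mean of non-zero differences = {nonzero.mean():.1e}")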

Comment by Melanie Clarke on JIRA:

Thanks for checking, Michael Regan.  That's reassuring.

Comment by David Law on JIRA:

Adding a note that I just tested this for some MIRI MRS data and got pretty much the same results as Melanie Clarke did for NIRSpec.  Namely, differences in 0.004% of pixels at the level of 1 part in 1e5.  Seems consistent with numerical precision, and would not be scientifically relevant.

Given a pass from both MIRI and NIRSpec I'm closing this ticket as done.

Comment by Jo Taylor on JIRA:

For completeness, NIRISS tested this as part of B10.2 and found that this issue was fixed in v1.14.0.
Under v1.13.3, running Detector1 with maximum_cores=1 and "all" produced rate files with differences in 15% of pixels, with differences up to 100%.
Under v1.14.0, running Detector1 with maximum_cores=1 and "all" produced rate files with differences in just 2 pixels, on the order of 1e-5.
Dataset used: /ifs/jwst/wit/niriss/pipeline_testing/build_10.2/JP-3521/jw01515006001_03101_00001_nis_uncal.fits
