Data differences running with multiprocessing #8221

Closed
stscijgbot-jp opened this issue Jan 24, 2024 · 15 comments · Fixed by spacetelescope/stcal#239

Issue JP-3521 was created on JIRA by Melanie Clarke:

I'm seeing some data differences between detector1 runs with maximum_cores='all' and the default serial processing.  I also see a warning about "leaked semaphore objects" after the run finishes when using multiple cores.

My test data is file jw02123001001_11101_00001_nrs2_uncal.fits.  I'm using fitsdiff to compare the output rateints files, with and without parallel processing in both steps.  I haven't tried turning it off in one and leaving it on in the other.  I see differences in the SCI, ERR, DQ, VAR_POISSON, and VAR_READNOISE extensions, in about 1% of the pixels, mostly at the 10-20% level.
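
For reference, the fitsdiff comparison described above can also be scripted with astropy's FITSDiff (the same machinery behind the fitsdiff command-line tool). This is only a minimal sketch; the serial/parallel file names and the tolerance are placeholders, not the exact files used here:

from astropy.io import fits

# Placeholder names for the rateints products from the serial and parallel runs.
serial_file = "serial/jw02123001001_11101_00001_nrs2_rateints.fits"
parallel_file = "parallel/jw02123001001_11101_00001_nrs2_rateints.fits"

# Compare all extensions; rtol sets the relative tolerance below which
# floating-point differences are ignored (0.0 reports every difference).
diff = fits.FITSDiff(serial_file, parallel_file, rtol=0.0)
print(diff.report())
print("Identical:", diff.identical)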

Comment by Melanie Clarke on JIRA:

Test data, including output jump files, is at: ███████████████████████████████████████████████████

The full command line used and software and CRDS versions are noted in the README.

Comment by Michael Regan on JIRA:

I've put a PR into STCAL that should fix this. The PR number is #239.

Comment by Melanie Clarke on JIRA:

Thanks - I will see if I can test against your branch.

Can we please also add a regression test to run detector1 with and without multiprocessing and make sure the results are the same?
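
One possible shape for such a test, sketched here under the assumption of a pytest-style regression suite: the input file, output locations, and the exact maximum_cores values accepted by a given jwst version are assumptions, not the final test.

import numpy as np
from astropy.io import fits
from jwst.pipeline import Detector1Pipeline

def run_detector1(uncal_file, cores, output_dir):
    # Run Detector1 with the requested core setting for the jump and ramp_fit steps.
    output_dir.mkdir(parents=True, exist_ok=True)
    Detector1Pipeline.call(
        uncal_file,
        output_dir=str(output_dir),
        save_results=True,
        steps={
            "jump": {"maximum_cores": cores},
            "ramp_fit": {"maximum_cores": cores},
        },
    )
    return output_dir / "jw02123001001_11101_00001_nrs2_rateints.fits"

def test_detector1_serial_vs_parallel(tmp_path):
    uncal = "jw02123001001_11101_00001_nrs2_uncal.fits"  # dataset from this issue
    serial = run_detector1(uncal, "1", tmp_path / "serial")
    parallel = run_detector1(uncal, "all", tmp_path / "parallel")

    # DQ should match exactly; the float extensions should agree within tolerance.
    np.testing.assert_array_equal(fits.getdata(serial, "DQ"),
                                  fits.getdata(parallel, "DQ"))
    for ext in ("SCI", "ERR", "VAR_POISSON", "VAR_READNOISE"):
        np.testing.assert_allclose(
            fits.getdata(serial, ext),
            fits.getdata(parallel, ext),
            err_msg=f"{ext} differs between serial and parallel runs",
        )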

Comment by Michael Regan on JIRA:

I added a regression test, and it was useful: it showed that the only remaining problem was due to the multiple flagging of pixels with both JUMP and DO_NOT_USE or SATURATION.

I've added the test to the PR.

Comment by Melanie Clarke on JIRA:

Testing with your branch and my test data, I see some changes to the version processed in parallel, but it still doesn't match the serial version.  I'm copying the data up to the same location now.

Thanks for the unit test!  Howard Bushouse - would it be useful to also add a regression test to the jwst suite, for a more general end-to-end integrity check on the multiprocessing steps in detector1?

Comment by Howard Bushouse on JIRA:

Melanie Clarke, adding a regression test sounds like a good idea. We could experiment with one of the existing regtests that runs Detector1 and see whether the dataset(s) currently in use show differences between the two methods; if so, we could just add an extra test to one or more of them to run jump and/or ramp_fit both ways. If none of the existing tests exhibit the behavior, we could create a new test from your dataset.

Comment by Melanie Clarke on JIRA:

That sounds perfect. Thanks!

Comment by Howard Bushouse on JIRA:

Fixed by spacetelescope/stcal#239

Comment by Melanie Clarke on JIRA:

Testing with my sample file, I now see identical jump files with and without parallel processing.

However, I do still see some very small differences between the rateints files. For this particular example, only 10 pixels in the SCI extension differ, at about the 0.001% level, so the differences are much less significant than before this fix, but they are still consistently present.

It looks like these changes were only in the jump code.  Can we please review the ramp fitting code as well for possible differences with multiprocessing?

Comment by Michael Regan on JIRA:

There were bugs in the jump code that were exposed in your comparison between serial and parallel processing. The bug fixes caused changes in the GROUPDQ. Pretty much any change in the GROUPDQ will lead to a change in the rateints and rate products.

Comment by Melanie Clarke on JIRA:

These are rateints files generated from identical jump files, produced with the latest stcal, so the only difference is that parallel processing is on for the ramp_fit step.  I just confirmed again that running the ramp fit step directly on the same jump file produces the same differences:

strun jwst.ramp_fitting.RampFitStep jw02123001001_11101_00001_nrs2_jump.fits

produces a slightly different file than:

strun jwst.ramp_fitting.RampFitStep jw02123001001_11101_00001_nrs2_jump.fits --maximum_cores=all
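
The same check can also be scripted from Python. A small sketch, assuming the RampFitStep interface and the jump file named above, and assuming the step returns the rate and per-integration (rateints) products:

import numpy as np
from jwst.ramp_fitting import RampFitStep

jump_file = "jw02123001001_11101_00001_nrs2_jump.fits"

# One serial run and one run using all available cores, on the same jump product.
rate_serial, rateints_serial = RampFitStep.call(jump_file)
rate_parallel, rateints_parallel = RampFitStep.call(jump_file, maximum_cores="all")

# Count SCI pixels that differ between the two rateints products
# (NaNs are zeroed so that matching NaN pixels are not counted as differences).
diff = rateints_parallel.data - rateints_serial.data
ndiff = np.count_nonzero(np.nan_to_num(diff))
print(ndiff, "differing SCI pixels out of", diff.size)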

Comment by Michael Regan on JIRA:

Melanie Clarke, I've looked at the rateints you created and it really just looks like a numerical precision issue. Here are the stats on the rateints pixels that have different slope values:

Number of pixels = 153755 (1.22%)
Maximum positive deviation = 1.5e-4
Minimum negative deviation = -2.2e-4
Median of non-zero pixel values = -8.5e-5
Mean of non-zero pixel values = -7.3e-5

So I don't think we have a problem here. Thanks for looking at this; I never ran this comparison for either jump or ramp_fit, and I should have.
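
For anyone reproducing these numbers, a sketch of how statistics like the above can be computed from the SCI extensions of the two rateints files; the file names are placeholders:

import numpy as np
from astropy.io import fits

# Placeholder names for the serial and parallel rateints products.
sci_serial = fits.getdata("rateints_serial.fits", "SCI")
sci_parallel = fits.getdata("rateints_parallel.fits", "SCI")

diff = sci_parallel - sci_serial
nonzero = diff[np.isfinite(diff) & (diff != 0)]

print(f"Number of pixels = {nonzero.size} ({100 * nonzero.size / diff.size:.2f}%)")
print(f"Maximum positive deviation = {nonzero.max():.1e}")
print(f"Minimum negative deviation = {nonzero.min():.1e}")
print(f"Median of non-zero differences = {np.median(nonzero):.1e}")
print(f"Mean of non-zero differences = {nonzero.mean():.1e}")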

Comment by Melanie Clarke on JIRA:

Thanks for checking, Michael Regan.  That's reassuring.

Comment by David Law on JIRA:

Adding a note that I just tested this for some MIRI MRS data and got pretty much the same results as Melanie Clarke did for NIRSpec.  Namely, differences in 0.004% of pixels at the level of 1 part in 1e5.  Seems consistent with numerical precision, and would not be scientifically relevant.

Given a pass from both MIRI and NIRSpec I'm closing this ticket as done.

Comment by Jo Taylor on JIRA:

For completeness, NIRISS tested this as part of B10.2 and found that this issue was fixed in v1.14.0.
Under v1.13.3, running Detector1 with maximum_cores=1 and "all" produced rate files with differences in 15% of pixels, with differences up to 100%.
Under v1.14.0, running Detector1 with maximum_cores=1 and "all" produced rate files with differences in just 2 pixels, on the order of 1e-5.
Dataset used: /ifs/jwst/wit/niriss/pipeline_testing/build_10.2/JP-3521/jw01515006001_03101_00001_nis_uncal.fits
