[UPath I/O managers] special case handling of None outputs #18820
@@ -979,7 +979,7 @@ def build_memoized_plan(
 resources=resources,
 version=step_output_versions[step_output_handle],
 )
-if not io_manager.has_output(context):
+if not io_manager.has_output(context):  # TODO - this is the problem.
Need to update has_output to check for the tag so that assets are supported for memoization, if supporting memoization for assets is even a thing we do
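A rough sketch of what that check might look like, assuming the marker is available as a plain tag mapping. The tag key `example/output_is_none` and the standalone `has_output` signature below are hypothetical and are not taken from this PR or from Dagster's API:

```python
import tempfile
from pathlib import Path
from typing import Mapping

# Hypothetical tag key; the real key is not named in this thread.
NONE_MARKER_TAG = "example/output_is_none"


def has_output(path: Path, tags: Mapping[str, str]) -> bool:
    """Treat an output as present if it was marked as None or its file exists."""
    if tags.get(NONE_MARKER_TAG) == "true":
        # A None output never wrote a file, so the tag is the only evidence.
        return True
    return path.exists()


with tempfile.TemporaryDirectory() as tmp:
    missing = Path(tmp) / "out.pkl"  # never created
    marked_present = has_output(missing, {NONE_MARKER_TAG: "true"})
    unmarked_present = has_output(missing, {})
```

With a check like this, a memoized plan would not re-run a step just because a `None` output left no file behind.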
Putting the decision history for this PR's design here for future reference if we revive this effort:
Problem statement: The UPath I/O manager stores
We want to modify the UPath I/O manager so that it would not create files for
Solution 1: Use the existence of a file to determine if the asset was
Solution 1 seemed bad, so we came up with…
Solution 2: Mark that the output was
At load time, we can piggyback on the versioning code path to fetch the
Implementation 1: add this mark on the metadata for the
Issues:
Implementation 2: add this mark on the tags of the
This avoids issue 2 from above, but still has issues
More issues:
If this does get revived, I'd recommend considering going directly into a deprecation cycle for storing None outputs. These metadata/tag shenanigans felt like they would add a lot of code smell and lead to confusion down the road about what this code was for, and uncertainty about whether it could be deleted. The approach also relied heavily on assumptions about how AssetMaterializations are emitted for range-partition backfills. We require one AssetMaterialization per partition in the range, and that constraint may not hold in the future. We can defend against this with unit tests, but it is still a fragility in the system and a non-intuitive code dependency.
Summary & Motivation
Long term, we’d like to have the following system re: I/O managers
** returning any non-None value from an asset or op
** doing a parameter-based dependency (def my_asset(upstream))
** returning None from an asset or op
** therefore you should set up dependencies using deps (Nothing dependencies for op).
** If you do a parameter-based dependency on an asset that returns None, the expectation is that you have met the contract of handle_output in some other way
The DB I/O managers already error when `None` is returned, but the UPath I/O manager does store `None`s, which means that this setup will break if we go directly to not storing all returned `None`s. However, creating a lot of files that just store `None` causes other problems:
So as a workaround we will add special behavior to the UPath I/O manager so that `None` values will not be stored in the file system, but will still be loadable by the I/O manager in downstream assets. We will not add special behavior for ops because it conflicts with expectations for memoization. Once memoization is deprecated and removed we can add special casing for ops based on whether the expected output file exists. If it does not exist we can assume the output was `None` and provide `None`.
For assets:
In `execute_step.py` we will add a "marker" when a `None` value is handled. In the UPath I/O manager's `handle_output` we will not store the `None` in the file system. Then in `load_input`, if the corresponding output has the "marker", we will automatically return `None` rather than reading from the file system. The "marker" will be stored as tags, which are surfaced as part of the data versioning code in #19324, so this does not add any additional reads to the DB.
Making these changes will allow us to continue supporting `None` values with the UPath I/O manager with no change in behavior, but also solve issues 1 and 2 described above.
To Do:
Notes:
reference PR #15611
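For reference, the asset flow described in the summary can be sketched with a toy, standard-library-only model. This is not the UPath I/O manager's code: the class `ToyPathIOManager`, the helper `run_step` (standing in for the role of `execute_step.py`), the in-memory `tags` dict, and the tag key `example/output_is_none` are all hypothetical names chosen for illustration.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict

# Hypothetical tag key; the PR description does not name the real one.
NONE_MARKER_TAG = "example/output_is_none"


class ToyPathIOManager:
    """Toy stand-in for a path-based I/O manager (not Dagster's real class)."""

    def __init__(self, base_dir: Path) -> None:
        self.base_dir = base_dir
        # Stand-in for the tags recorded alongside each materialization.
        self.tags: Dict[str, Dict[str, str]] = {}

    def _path(self, asset_key: str) -> Path:
        return self.base_dir / f"{asset_key}.pkl"

    def handle_output(self, asset_key: str, obj: Any) -> None:
        if obj is None:
            # Do not create a file for a None output.
            return
        self._path(asset_key).write_bytes(pickle.dumps(obj))

    def load_input(self, asset_key: str) -> Any:
        if self.tags.get(asset_key, {}).get(NONE_MARKER_TAG) == "true":
            # Marker present: return None without touching the file system.
            return None
        return pickle.loads(self._path(asset_key).read_bytes())


def run_step(manager: ToyPathIOManager, asset_key: str, value: Any) -> None:
    """Record the marker for None values, then hand the value to the I/O manager."""
    manager.tags[asset_key] = {NONE_MARKER_TAG: "true"} if value is None else {}
    manager.handle_output(asset_key, value)


with tempfile.TemporaryDirectory() as tmp:
    manager = ToyPathIOManager(Path(tmp))
    run_step(manager, "returns_none", None)
    run_step(manager, "returns_list", [1, 2, 3])
    none_file_written = (Path(tmp) / "returns_none.pkl").exists()
    loaded_none = manager.load_input("returns_none")
    loaded_list = manager.load_input("returns_list")
```

In this sketch the downstream load of `returns_none` still yields `None` even though no file was written, which is the "no change in behavior" property the summary aims for.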
How I Tested These Changes