RFC: Split off pyarrow-* builds #1381

Closed
h-vetinari opened this issue May 3, 2024 · 15 comments · Fixed by conda-forge/pyarrow-feedstock#111
@h-vetinari
Member

h-vetinari commented May 3, 2024

The more things change, the more they stay the same...

Almost 4 years ago, the https://github.com/conda-forge/pyarrow-feedstock feedstock was archived and the builds were moved here in #146. The package split alluded to in #93 and clarified in #862 took a bit longer to materialize (in #875). With the impending #1376, we're now producing a very hefty 30(!) artefacts per CI job:

List of artefacts as of v16 + pyarrow{,-core,-all}
anaconda upload \
    /home/conda/feedstock_root/build_artifacts/linux-64/apache-arrow-proc-5.0.0-cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-16.0.0-hefa796f_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-acero-16.0.0-hbabe93e_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-flight-16.0.0-hc4f8a93_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-gandiva-16.0.0-hc1954e9_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libparquet-16.0.0-hacf5a1f_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-dataset-16.0.0-hbabe93e_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-flight-sql-16.0.0-he4f5ca8_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-substrait-16.0.0-h8508dc1_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-all-16.0.0-ha770c72_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-core-16.0.0-py38hc396e17_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-core-16.0.0-py39h38d04b8_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-core-16.0.0-py312h3f82784_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-core-16.0.0-py310hd207890_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-core-16.0.0-py311hd5e4297_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-16.0.0-py38hb563948_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-16.0.0-py39h8003fee_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-16.0.0-py312h8da182e_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-16.0.0-py310h17c5347_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-16.0.0-py311h781c19f_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-all-16.0.0-py38hb563948_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-all-16.0.0-py39h8003fee_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-all-16.0.0-py312h8da182e_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-all-16.0.0-py310h17c5347_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-all-16.0.0-py311h781c19f_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-tests-16.0.0-py38hc396e17_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-tests-16.0.0-py39h38d04b8_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-tests-16.0.0-py312h3f82784_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-tests-16.0.0-py310hd207890_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-tests-16.0.0-py311hd5e4297_0_cpu.conda

...which is also pushing conda-smithy and the rerender bots to their limits in terms of rendering time, cf. conda-forge/conda-forge-pinning-feedstock#5815. This also spurred some performance improvements in conda-build, but fundamentally the issue remains that this feedstock is getting very large. Quoting @beckermr from the pinning issue:

Splitting that feedstock up into more than one would potentially have other benefits in addition to easing the burden on our already overstressed tools.

At first I thought this was not going to work, but after a closer look, and especially with the split after #875, there's actually a pretty clean separation between the C++ libarrow* side and the Python pyarrow* bits. In short, I think there's no technical barrier to doing this.

Here are the pros and cons as I see them.

Cons:

  • We lose the 1:1 correspondence between pyarrow and libarrow, which made it easy to do invasive changes on the libarrow side.
    • For example, we've recently had issues with specific GCC versions that manifested as segfaults in the pyarrow test suite [1], which is much easier to debug if we see the effect of a changed compiler version directly in the PR, without having to first publish a libarrow build in order to then test pyarrow.
    • For example, if we were to enable ORC support for pyarrow on Windows (which will be possible as of orc 2.0.1), we'd have to ensure on the pyarrow side that we don't pull in a libarrow too old to have that support. In this case it works out because we've had ORC support enabled in libarrow for a long time.
  • More maintenance effort from keeping two recipes, times ~4 maintenance branches, running
  • Upstream arrow runs the conda recipes in their CI; this would need some work to follow suit (CC @kou @pitrou @assignUser)
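The version-compatibility concern in the ORC example can be mitigated with an explicit run-requirement floor in the split pyarrow recipe. A minimal sketch of what that could look like in conda-build's meta.yaml format (the package names exist on conda-forge, but the exact pins here are illustrative, not taken from the real feedstock):

```yaml
# Hypothetical excerpt from a split pyarrow recipe's meta.yaml;
# the pins are illustrative only.
requirements:
  host:
    - libarrow-all 16.0.0.*   # build against the matching C++ release
  run:
    # a version floor on the run-time libarrow ensures that features
    # enabled only in newer C++ builds (e.g. ORC on Windows) are present
    - libarrow >=16.0.0
```

With a floor like this, the solver can freely pick newer libarrow builds (which migrate often), while never selecting one that predates a feature the Python bindings rely on.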

Pros:

  • Less burden for our infrastructure (e.g. migrator timeouts)
  • More reliable bot PRs, and hence less maintenance effort
  • Much faster rerendering times
  • Less artefact churn for pyarrow (the libarrow bits are migrated very often without actually requiring a rebuild of pyarrow; in the current recipe, this always leads to pushing new pyarrow builds as well)
  • Shorter build times on both sides
  • More reliable builds for pyarrow (each Python version builds in its own run, instead of one job having 5x the opportunity to hit a flaky failure in the test suite)

Assuming we want to do this, we could unarchive the pyarrow feedstock and update it to v16. In that case, it would make sense to split the pyarrow bits off from #1376. I'm not sure we'd want to touch any of the older still-supported versions for this, but that would be possible as well (maybe from v13 onwards, as we're about to drop v12 once we migrate to v16).

Thoughts @conda-forge/arrow-cpp?

CC @conda-forge/core

Footnotes

  1. Arguably this is also related to the fact that we're not running the test suite for libarrow, which is hard because it unconditionally depends on a testbench that is very hard to package (it's incompatible with our pinnings).

@kkraus14
Contributor

kkraus14 commented May 6, 2024

cc @assignUser @raulcd

@assignUser

The more things change, the more they stay the same...

🔁 😀

It sounds like the current situation is putting undue strain on the common resources, so +1 from me. We will sync the recipes after the split as always. Any specific objections, @raulcd?

@xhochy
Member

xhochy commented May 6, 2024

I'm fine with splitting things up again. In the future, when we have rattler-build support, we can reevaluate again, but for now, I feel that missing automation and the massive rerendering times are a bigger strain than maintaining two separate feedstocks.

@jakirkham
Member

Ok with consensus here

Though there have been a few changes in conda-build, about to be released in 24.5.0 (conda/conda-build#5319), that should help with some of the performance issues. So it may be worth seeing how that goes.

@assignUser

So it may be worth seeing how that goes

I think the fact that changes to the tooling are necessary to make the current setup work (better) is a pretty good indication that changes are needed ^^

@h-vetinari
Member Author

OK, thanks for the inputs so far! Another open question: do we want to do this from v16 onwards, or for all currently supported versions? I'd start with v16 for now, and if we want we can still create extra branches on the pyarrow feedstock later.

@jakirkham
Member

Which branches/versions are having issues currently?

@h-vetinari
Member Author

Which branches/versions are having issues currently?

I don't know if there are open migrations where the bot is failing to open a PR, but the rerender times mainly went up with v14, where we split libarrow into more granular pieces.

@xhochy
Member

xhochy commented May 7, 2024

We should do the split for v16 now, and for v14/v15 later if we see the need for it. Splitting already-released versions is more likely to break something. Let's first get some experience with doing it for v16.

@xhochy
Member

xhochy commented May 7, 2024

Which branches/versions are having issues currently?

I don't know if there are open migrations where the bot is failing to open a PR, but the rerender times mainly went up with v14, where we split libarrow into more granular pieces.

Yes, we have open migrations like the ucx one.

@h-vetinari
Member Author

Does someone else from upstream arrow still want to comment before we move ahead? @kkraus14 @raulcd @kou @pitrou

Thanks @assignUser for chiming in already!

@h-vetinari
Member Author

The pyarrow feedstock has now been unarchived, and I have a PR (conda-forge/pyarrow-feedstock#111) that moves the Python bits over from #1376. Neither is merged yet, but that way it's hopefully easier to see what the split would look like.

@jdblischak
Member

Which branches/versions are having issues currently?

In my recent experience rerendering the currently supported branches (conda-forge/conda-forge-pinning-feedstock#5815 (comment)), I don't think it would be worth the effort to port 12.x or 13.x. Both of those branches rerender quickly. 14.x would be more of a judgement call: it takes longer to rerender, but stays safely under the 15-minute timeout limit.

@raulcd
Member

raulcd commented May 7, 2024

Splitting the feedstocks from 16.0.0 onwards sounds like a sensible approach to me. As for the older versions, I agree that unless required we should continue with the status quo.
As for the upstream jobs that run the conda recipes on Arrow CI, they have been failing for the last 2-3 months. The current nightly failures can be seen here.

[screenshot: current nightly conda recipe failures on Arrow CI]

We need to update those, but we haven't done so lately.

@h-vetinari
Member Author

OK, it seems everyone is in agreement then! :)

I'll get to merging then, thanks for the inputs everyone!
