Meeting notes
Giorgis
- Release 0.1.3
- Repo/package rename
- numbaWithOpenmp / numba => numba-pyomp
- llvmliteWithOpenmp / llvmlite => llvmlite-pyomp
Todd, Stuart, Bronis, Giorgis
- Recap on SC24 and PyOMP
- Release for non-constant step support in parallel for range (see the sketch after this list)
- Next technical task for PyOMP
- Todd suggests creating a nightly/merge build
- Stuart will disseminate Proteus to the Numba maintainers; Giorgis is open to discussing it
- [ACTION] Giorgis will release 0.1.3 with the update for numbaWithOpenmp
- [ACTION] Giorgis to find out what the diff is between upstream Numba and our fork
- Use upstream numba but our version of llvmlite
- The most recent Numba (0.61) supports Python 3.10+
- [ACTION] Giorgis to work with numba-cuda and numba-hip for PyOMP
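A minimal sketch of what the non-constant step release enables, assuming PyOMP's usual openmp_context import (the function name is illustrative):

    from numba import njit
    from numba.openmp import openmp_context as openmp

    @njit
    def strided_sum(a, step):
        s = 0
        # `step` is a runtime value rather than a compile-time constant,
        # which is the case the release adds support for.
        with openmp("parallel for reduction(+:s)"):
            for i in range(0, len(a), step):
                s += a[i]
        return s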
Todd, Stuart, Giorgis
- Discuss variable auto-privatization
- Check whether Tim has fixed the bugs (reduction mul; reproducing the expected data race when x is shared)
- Check on Todd's PRs (slicing, fix caching)
foo = <sth1>
with openmp("..."):
    bar = foo
    foo = <sth2>
foo = <sth3>
- Auto-privatization:
- "If a variable is defined/assigned within an openmp region and it is not used or defined after the region, then it is private": this DOESN'T WORK for the above snippet
- TO DO: "If a variable used inside a parallel region is read or written outside the parallel region, then it is shared"
- CURRENT IMPL: "If a value written to a variable could be read inside a parallel region, then that variable will be shared; if a value written to a variable inside a parallel region could be read following the parallel region, the variable will be shared. In all other cases it will be private." (see the sketch after the action items below)
- [ACTION] Giorgis to enumerate all possibilities (<18), map them to the definition, and simplify
- [ACTION] Giorgis to mail Tim about the bugs
- [ACTION] Giorgis to merge Todd's PRs and release PyOMP 0.1.2
- [ACTION] Giorgis to create and share short slides for the SC'24 workshop
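A minimal sketch of how the current implementation's rule classifies the variables in the snippet above, assuming PyOMP's openmp_context import (the function body is illustrative):

    from numba import njit
    from numba.openmp import openmp_context as openmp

    @njit
    def classify():
        foo = 1                  # written before the region and readable
                                 # inside it -> foo is shared
        with openmp("parallel"):
            bar = foo            # bar is written only inside the region and
            foo = 2              # never read after it -> bar is private
        foo = 3                  # overwrites the in-region value, but the
        return foo               # pre-region write that reaches the region
                                 # already made foo shared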
Todd, Tim, Stuart, Giorgis
- Publish the Zenodo video and the abstract on Overleaf
- Tim bug with reduction mul
- Tim bug with pi loop making x shared
- Todd's PRs (fix caching, slicing)
- Giorgis created the Zenodo entry; it is ready to ship
- Giorgis updated the abstract with acknowledgments and links
- Reduction mul works for Giorgis but not for Tim
- Todd will help in a consulting capacity
- [ACTION] Todd, Giorgis, Tim: rigorously define when a variable becomes automatically private/shared
- [ACTION] Giorgis to review the PRs in numbaWithOpenmp
Todd, Stuart, Giorgis
- Tim's slides
- pypi pyomp package
- [DONE] Todd gave Giorgis access to the PyPI package page
- [TODO] Create a wheel package for PyOMP; look at how Numba does it
- There may be naming issues with wheel packaging
- Caching
- CPU side works, GPU doesn't
- [ACTION] Giorgis to send an email to Tim replying to his comments
Todd, Stuart, Giorgis
- Tim's slides
- Caching
- Caching works for non-target regions but fails for target regions; we are introducing something unpicklable into the code.
- [Action] Todd & Giorgis create slides for workshop presentation
- [Action] Todd works on resolving caching issues for target regions
- [Action] Giorgis, Todd, Tim: involve more NERSC people and find one user among them
- [Action] Todd, Giorgis: create Numba OpenMP caching tests
- [Action] Todd to look at the OpenMP directives used in HeCBench to see what we support and what we don't
- [Action] Giorgis to email the HPPS workshop chairs about the presentation time slot and presentation requirements
Todd, Stuart, Giorgis
- Tim's question on implicit data movement
- Slicing arrays for GPU execution
- Increase our social media presence (Tim?)
- Create videos for PyOMP on a YouTube channel
- Post on Numba's Discourse forum
- PR on lazy init has the code needed for caching; revive it. How to test? Check numba/cuda/tests/cudapy/test_caching.py.
- [Action] Giorgis will create a short demo video
- [Action] Todd & Giorgis create slides for workshop presentation
- [Action] Giorgis, Todd: create a topic on Numba's Discourse (showcase category)
- [Action] Email Tim about the slides and data movement issues; ping him on adding PyOMP to the OpenMP website. How is he going to show code running?
- [Action] Todd will create a branch to test slicing of arrays and communicate with Giorgis
- [Action] Giorgis, Todd, Tim: involve more NERSC people and find one user from it
- [Action] Todd, Giorgis Create numba openmp caching tests
- [Action] Todd look at used OpenMP HeCBench directives to see what we support and what we don't
- [Action] Giorgis email HPPS workshop chairs for the presentation time slot and presentation requirements
Tim, Todd, Stuart, Giorgis
- Updates on tutorial slides
- Demo video presentation
- Presenters for the workshop and tutorial
- PyOMP Release
- Next dev
- AMD GPU
- Update llvmlite/numba
- [Action] Giorgis will create a short demo video
- Intro to PyOMP
- PyOMP installation
- Alternative ways to try (binder, containers)
- Hello world (parallel)
- Vector addition (CPU)
- Vector addition (GPU)
- [Tutorial] Tim will present the tutorial; Todd & Giorgis answer questions
- [Action] Todd & Giorgis create slides for workshop presentation
- [Action] Giorgis email HPPS workshop chairs for the presentation time slot and presentation requirements
- [Action] Giorgis provide comments and release the lock on tutorial slides
- Default handling of data
- Map/target data clauses
- Release 0.1.1 ~next week
- Update llvmlite/numba: wait for bump to newer LLVM version
Attendees: Todd, Giorgis
Still trying to find time to work on conda packaging (Giorgis) and looking into differences between C and PyOMP for CFD (Todd).
Attendees: Todd, Giorgis, Stuart, Tim
- Just finished supercomputing paper.
- Talked about using shutil.which (Python has no os.which) to find llvm-config and then using its location to find the bin directories (see the sketch after this list).
- Talked about how to use conda-forge for building for all the platforms. Would have to change package names...too much trouble.
- Could use the CI system that runs on different platforms to do the building.
- Put our scripts in pyomp repo. Even better would be to switch to using conda-build only and updating the build scripts in those conda recipes.
- Still have problem with conda-build and llvm openmp-runtime build using the clang that was just built.
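A minimal sketch of the llvm-config lookup idea (the error handling is illustrative):

    import os
    import shutil

    # Locate llvm-config on PATH and derive the LLVM bin directory from it.
    llvm_config = shutil.which("llvm-config")
    if llvm_config is None:
        raise RuntimeError("llvm-config not found on PATH")
    bin_dir = os.path.dirname(llvm_config)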
Attendees: Todd A. Anderson, Giorgis, Stuart
- Numba changes review:
  a) Tuple typing should be handled in Numba, related to the calling convention.
  b) find_top_level_loops: create a patch.
  c) uninstall_registry: probably try to be additive; implement a new approach.
  d) itercount: use an intrinsic in a function that looks inside the range object (base 708).
  e) optimize_final_module: revert the change.
  f) Run static constructors: upstream.
  g) add_intrinsics_openmp_pass: Andre's approach.
  h) initialize_all_targets: move this to some OpenMP run-once location.
  i) parent_state addition to many compilation routines: Stuart will think about it.
  j) WithLifting: Todd will try it in the original location to see if it works now; if not, then discussion.
  k) excinfo should start with a period in generated LLVM.
  l) enable_ssa: Todd will try without that option turned on.
  m) cpointer arg types: Stuart will think about how to do this without major Numba replacement.
Attendees: Todd A. Anderson, Giorgis
- Giorgis fixed the bug in the new version of the code in two ways:
  a) Making the variable introduced by the new OpenMP IR builder private (previously thought to be shared, but it isn't).
  b) Using a custom outliner; this will be the default approach.
- TODOs:
  a) Check target functionality on Windows.
  b) Do a diff of PyOMP versus Numba and report on the changes.
  c) Do a diff of the llvmlite changes that PyOMP is carrying.
- Ask Andre for time estimate on totally isolated LLVM passes.
- Need more PyOMP documentation.
Attendees: Todd A. Anderson, Stuart Archibald, Giorgis
- Use a "conda build config" file to get a matrix of Python and NumPy versions.
- Python 3.8 and 3.9; NumPy 1.17 - 1.21. See numba.readthedocs.io/en/stable/user/installing.html#version-support-information
- cfunc_wrapper has the right signature.
- For next version:
- Update to latest Numba.
- Get rid of privatization.
- Make sure that llvmdev conda build is building openmp runtime.
- Talk with Stan about legalities of having a package named Numba in a different channel.
- Make Python-for-HPC channel on anaconda?
Attendees: Todd A. Anderson, Stuart Archibald, Siu Kwan Lam, Giorgis
- More discussion of object lifetimes.
Attendees: Todd A. Anderson, Giorgis Georgakoudis, Daniel Anderson
- Giorgis ran a bunch of tests with the big-ugly directive:
  a) openmp for directive with reduction: fixed.
  b) Nested parallel for (test_openmp.py::2000): the inner parallel for index variable omp_iv1 is shared instead of private; omp_ub1 is shared but should be firstprivate.
  c) pi_task: the problem is that unlisted vars should be firstprivate for task but are becoming shared.
- How to convey how an object should be copied for firstprivate.
- Stuart mentioned a decorator on test functions to run them isolated in a test process: patch #8239. Remove @needs_subprocess from the Test class, then add in #8239.
- Giorgis will create a document discussing the options for firstprivate variables with respect to reference counting; we'll distribute it to Siu and ask him to attend a future meeting to discuss.
Attendees: Todd A. Anderson, Giorgis Georgakoudis, Daniel Anderson
- Todd gave an update on the target_data_nested test. He had to create a TypingAssign Numba IR node that does typing like a regular assignment but doesn't generate any executable code. He added code for slice copying to the IR copying code, and fixed the test so that all the arrays are integer arrays, which gets rid of the lowering error. Now it gives an error because the index variable inside the region is identified as an output and is therefore added to the signature when it shouldn't be.
- Giorgis gave an update on big-ugly directive support. He needed to add back the omp_lb code the way it was before to support code generation for the distribute clause.
- We'll go with the current approach and do a release after big ugly directive and pi task are working.
- For the next release, we're going to try to get rid of variable privatization. After the meeting, Todd had a thought that the STRUCT-based approach that we use for target map might not work for firstprivate on the CPU side. You could try to copy an array firstprivate struct and then duplicate the data pointer, but then the array decrefs would all operate on the same meminfo structure, which is going to get really confused when the reference count tries to go negative by the number of OpenMP threads.
- After we get the big combined target directive and the pi_task example working then we'll do a release.
- After that, we will make the LLVM pass into a plugin and then we can use Numba's llvmdev build and just have the LLVM pass plugin in llvmlite.
Attendees: Todd A. Anderson, Tim Mattson, Giorgis Georgakoudis, Stuart Archibald, Daniel Anderson
- Lots of discussion around the difference between C and Python arrays and how those interact with implicit behavior in target regions. We are going to see if at runtime OpenMP will generate an error if you have a target region with a map(tofrom:...) where the mapped array(s) already exist in the data environment from a previous target enter data directive. The main proposal seems to be that if the pyomp frontend implicitly generates a tofrom for a target region and the arrays have already been mapped, then it is a no-op (see the sketch below). There is some question as to how to get this behavior: 1) modify the OpenMP runtime (bad!), 2) have a pyomp runtime that wraps the OpenMP runtime, or 3) just do the checks in code generation (the most likely approach).
- Todd to make "target teams loop" alias to "target teams distribute parallel for simd".
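A hypothetical sketch of the situation under discussion, assuming PyOMP's openmp_context import and with-block syntax for the data directives (the function is illustrative):

    from numba import njit
    from numba.openmp import openmp_context as openmp

    @njit
    def update(a):
        # `a` is already in the device data environment when the inner
        # target region runs, so the implicit map(tofrom: a) generated for
        # that region should be treated as a no-op rather than an error.
        with openmp("target data map(tofrom: a)"):
            with openmp("target"):
                for i in range(a.shape[0]):
                    a[i] += 1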
Attendees: Todd A. Anderson, Stuart Archibald, Giorgis Georgakoudis
- Look at subTest in the tests directory for how to test device(0) and device(1) without code duplication.
- Todd gave update on changes.
- Giorgis to send all target examples to Todd who will add to his own and send to Daniel.
- Users try private first, then reductions, and if those can't work they fall back to shared vars with critical regions or atomics (see the sketch after these notes). Do we support atomics at this point?
- Daniel to see if caching works for openmp functions both for non-target and target.
- Is Intel generating an openmp target runtime for Intel GPUs?
- What is relationship between spirv uniform GPU backend and openmp target runtime?
for i in range(3):
    with self.subTest(f'somename {i} '):
        @njit
        def foo():
            device = i
        foo.compile(())
        foo.inspect_types()
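A minimal sketch of the escalation path described above, assuming the usual PyOMP imports (function names are illustrative):

    from numba import njit
    from numba.openmp import openmp_context as openmp

    @njit
    def sum_reduction(n):
        s = 0
        # Preferred: let a reduction clause combine the partial results.
        with openmp("parallel for reduction(+:s)"):
            for i in range(n):
                s += i
        return s

    @njit
    def sum_critical(n):
        s = 0
        # Fallback: a shared variable updated inside a critical region.
        with openmp("parallel for"):
            for i in range(n):
                with openmp("critical"):
                    s += i
        return s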
Attendees: Todd A. Anderson, Stuart Archibald, Giorgis Georgakoudis
- Make sure default DSA for teams is shared.
- The problem with changing a to arg.a so that sharing works correctly on the GPU is likely that the code copy isn't deep enough, and renaming in the outlined_ir affects the original IR.
- Todd needs to use minimal call convention on CPU-side as well.
Attendees: Todd A. Anderson, Stuart Archibald, Giorgis Georgakoudis
- Giorgis - teams working on GPU side.
- Problem with num_teams in separate teams directive. (Todd fixed this.)
- Talked about OpenMP runtime calls. The Numba version we are using may be too old; the current code may work for a newer Numba version. In the short term, we can use @lower on cuda/cudaimpl/registry.
- Giorgis will be getting the CPU side for target teams working then move on to distribute and parallel for.
- More discussion on whether the long-term solution is to outline every region so that we don't need to do variable renaming in Numba which leads to lots of problems.
- Todd will try to turn off the renaming and turn off Numba's constant value propagation. Can Stuart easily tell Todd where that is?
- Todd to create new target context with new minimal call convention that doesn't have the single parameter ret. That will be used for device(1) so that LLVM better handles the offloaded function.
6/2023 Attendees: Todd A. Anderson, Giorgis Georgakoudis, Stuart Archibald
- PyOMP will only support Python 3.8+ going forward from this point.
- Giorgis found a problem with the parallel for "extra code" detection in Python 3.8. Todd will fix that and assume that only Python 3.8 needs to work.
- Giorgis found that some Numba functions like numba_gil_ensure are not available in target code, so he linked in Numba's helperlib for the CPU target to get around that issue.
- Giorgis found some issues with a separate parallel for inside a target region and will send those code examples to Todd.
- There are issues with omp functions like omp_get_thread_num inside non-CPU target regions. We need an overload, or the older separate typing and lowering methods for the cuda target, for those functions, as mentioned by Stuart with the code example below.
Stuart's sketch from chat, reformatted (imports added; assumes a Python-level omp_get_num_threads stub is in scope for @overload to target):

    from numba.core import types
    from numba.extending import overload

    # Declare the external OpenMP runtime symbol: returns intp, no arguments.
    function = types.ExternalFunction('omp_get_num_threads', types.intp())

    @overload(omp_get_num_threads)
    def ol():
        def impl():
            return function()
        return impl
Attendees: Todd A. Anderson, Giorgis Georgakoudis, Stuart Archibald
- Todd is working on adding support for "target teams distribute parallel for" and "target teams distribute parallel for simd". Some initial version of that should be working in the next couple of days (see the user-side sketch after this list).
- To represent those in the LLVM IR directives, we can 1) make the main directive DIR.OMP.TARGET but then have QUAL.OMP.TEAMS, QUAL.OMP.DISTRIBUTE, QUAL.OMP.PARALLEL.LOOP, QUAL.OMP.SIMD, or 2) create a new directive DIR.OMP.TARGET.TEAMS.DISTRIBUTE.PARALLEL.LOOP.SIMD. Giorgis is fine with either for now, and #2 seems to match the structure of the code better at the moment, so that is what we are going forward with.
- Todd is extracting the loop processing code that does the bottom increment and handles lastprivate so that we can use the same code for "for", "parallel for", and all target variants that include "parallel for".
- We decided that for now Giorgis would try to do the code manipulation to add the outer chunking loop necessary to implement "distribute".
- Todd mentioned that he would try to get the Intel product team that would eventually maintain this in open source to engage early and as a first step add their Numba spirv backend and compile the outlined target functions using that backend to allow us to target multiple accelerator architectures.
- Stuart mentioned Tardis supernova and stumpy as two projects that might be interested in engaging with our target work early.
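A minimal user-side sketch of the combined construct being added (function and variable names are illustrative):

    from numba import njit
    from numba.openmp import openmp_context as openmp

    @njit
    def vadd(a, b, c):
        # One combined directive covers offload, teams, distribution, and
        # the worksharing loop.
        with openmp("target teams distribute parallel for"):
            for i in range(len(a)):
                c[i] = a[i] + b[i]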
Attendees: Todd, Giorgis, Tim Mattson
- Giorgis created Linux Docker containers for armv8: one that is a development environment and another that is a Jupyter notebook (armv8 covers Macs and Nvidia Grace).
- We need additional containers for x64 with GPU support.
- Giorgis did internal linkage for omp globals and that fixed the linking issue. Todd to go back to cfd application now that that bug is fixed.
- Discussion of available hardware for doing pyomp demo with GPU support.
- Tim: "SC paper would be great", but it's competitive. A tutorial there would also be nice, but again very competitive. SciPy is another possibility; eScience is another possible venue. Plan on an SC paper for April.
- We know that caching is broken for GPU codes.