-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support multithreading in groupreduce #2491
base: main
Are you sure you want to change the base?
Conversation
@@ -453,18 +453,22 @@ julia> combine(gd, :, AsTable(Not(:a)) => sum, renamecols=false) | |||
``` | |||
""" | |||
function combine(f::Base.Callable, gd::GroupedDataFrame; | |||
keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) | |||
keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true, | |||
nthreads::Int=NTHREADS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we should use a keyword argument name that is standard across packages, but for now I haven't found examples of similar cases elsewhere.
We should also anticipate what it will look like once we switch the default to the automatic optimal number of threads. Would it be nthreads=nothing
, nthreads=:auto
, nthreads=0
? We could also use another name: threads=true
looks good at it also works with e.g. threads=5
. Though using a different type forces recompilation, so maybe 0
is better as a default, even though it's not super logical.
Note that after #2481 method signatures will change so this PR will have merge conflicts. |
b676e63
to
bc4dc71
Compare
1a977d3
to
1933482
Compare
Keep the default to a single thread until we find a reliable way of predicting a reasonably optimal number of threads.
1933482
to
2d57734
Compare
I've rebased against master. Should be good for a review now. I just hope CI really used 2 threads but it's not visible in the logs AFAICT. EDIT: seeing Codecov, probably not. |
@quinnj - could you please have a quick look at the CI set-up? Thank you! |
Wait, I just did something completely stupid when setting the environment variable. I've committed a probable fix. |
OK that worked! |
ac55a99
to
011a9b8
Compare
I've added |
I was just thinking about |
Yes, I also thought about Adding a struct for global settings sounds overkill at this point since it's not clear we're going to add more settings in the future. I'd rather keep this minimal, so that it can be extended if needed in the future, e.g. using Preferences.jl for storage across sessions. |
src/other/utils.jl
Outdated
Use [`DataFrames.nthreads`](@ref) to access the value. | ||
""" | ||
function nthreads!(n::Int) | ||
n > 0 || throw(ArgumentError("n must be equal to or greater than 1 (got $n)")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we might also cap this at Threads.nthreads()
here so that later DataFrames.nthreads
does report the number of threads that is going to be actually used
I am OK with the current design as it is an implementation detail. Though probably having a Regarding the naming - I am OK with |
The |
I am OK to have them as a part of the public API. @StefanKarpinski - if Julia Base were to add an option to change number of available threads dynamically during the Julia session, would |
This can be closed given the other approach we took? Or we keep it open for the future? |
This PR is complementary with the other threaded grouping PRs, since without it optimized reductions are not multithreaded. We should check whether we can identify conditions under which multithreading is faster than single threading. |
I have totally forgotten about this PR. I thought you have merged it :). So what is the status here - are you working on the condition or I should review it as is? Thank you! |
Given that we now automatically spawn as many tasks as there are threads in other places, requiring users to set |
I agree
Is it possible to give some safe condition though? (I do not remember the details of our earlier discussion) |
Probably. I need to look at the data again. What's difficult is that there are several dimensions to cross: number of groups, number of rows, number of threads. And IIRC the overhead due to threading is larger when Julia was started with many threads. Finally, there's always the possibility that all CPUs are already busy with another threaded operation, so that spawning multiple tasks is counter-productive. Maybe that's not a serious issue. |
Should I mark it as 1.0, or keep it 1.x and we will merge it when we are comfortable? |
I think this doesn't have to happen before 1.0. |
start = 1 + ((tid - 1) * length(groups)) ÷ nt | ||
stop = (tid * length(groups)) ÷ nt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
avoid overflow on 32-bit machines, on 64-bit machine this is a no-op
start = 1 + ((tid - 1) * length(groups)) ÷ nt | |
stop = (tid * length(groups)) ÷ nt | |
start = 1 + ((tid - 1) * Int64(length(groups))) ÷ nt | |
stop = (tid * Int64(length(groups))) ÷ nt |
@nalimilan - what do we do with this PR? Close it (as it is likely outdated)? In general - for the future we should rather add a kwarg allowing to choose if users wants to use multithreading c.f. #2992. |
bump @nalimilan - what do you think we should do with this PR? |
I'm not sure. There are two aspects to address:
|
@nalimilan - following our discussion I recommend to do what is written in #2988 (comment). We would then implement this proposed change. There are the following things to be done to implement this change:
@nalimilan - can you please comment which parts of this list you would be willing to have a look at (I understand that 1, 2, and 4 must be covered in this PR, but 3 can be done in a separate follow-up PR that I can do if you prefer after this PR is finished but before 1.4 release) Additional maintenance tasks:
|
OK I'll rebase this PR, but I'd rather leave point 3 for another one. |
Sure. Thank you! |
(probably starting a new branch and making proper changes + force push rather than rebase would be easier, as I am not sure by how much we have changed the sources you touched) |
@nalimilan - what are your plans with this PR 😄? (I am asking because it is getting so old that there is a risk that it is better to re-implement it from scratch if we want to have it given the recent changes) |
Now that we have the So it would probably be worth experimenting more to find conditions under which we can be sure starting multiple tasks is faster. |
OK - so let us keep it open for now. |
I converted this PR to draft to avoid accidental merging. |
Keep the default to a single thread until we find a reliable way of predicting a reasonably optimal number of threads.