-
Notifications
You must be signed in to change notification settings - Fork 213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: avoid additional array allocation in host to device transfer #2966
base: main
Are you sure you want to change the base?
fix: avoid additional array allocation in host to device transfer #2966
Conversation
/intelci: run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like #2375 was just missing the wait_and_throw() :(
Please dont merge, until the additional check on local machines |
not sure, looks like in all places where the function is called there is additional wait_and_throw() |
/intelci: run |
/intelci: run |
/intelci: run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made comments help to me understand. Would it be worth just adding some? Otherwise good to go by me (if my comments in the review are correct)
@@ -130,33 +130,20 @@ static void pull_row_major_impl(const Policy& policy, | |||
|
|||
auto src_data = origin_data.get_data() + origin_offset * origin_dtype_size; | |||
auto dst_data = block_data.get_mutable_data(); | |||
|
|||
if (block_info.get_column_count() > 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just because I am not familiar, it looks like just removing the if and relying on the for loop (a simplification here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes!
dal::detail::check_mul_overflow(src_element_size_in_bytes, element_count); | ||
const std::int64_t src_stride_in_bytes = | ||
dal::detail::check_mul_overflow(src_element_size_in_bytes, src_stride); | ||
if (src_type == dst_type && src_size_in_bytes == dst_size_in_bytes && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If allocation exists, skip the convert_convert vector and uses memcpy_host2usm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue, that when we copy from host to device in different data types(float on host and allocated on device in double), we still just copy bytes, and in the scenario of usage different data types it makes sense to use strides
@@ -227,33 +227,50 @@ sycl::event convert_vector_device2host(sycl::queue& q, | |||
// To perform conversion, we gather data from device to host in temporary | |||
// contigious array and then run host conversion function | |||
|
|||
const std::int64_t element_size_in_bytes = dal::detail::get_data_type_size(src_type); | |||
const std::int64_t dst_element_size_in_bytes = dal::detail::get_data_type_size(dst_type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
collect more information about the destination to see if a shortcut can be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed good suggestion! Thank you @Alexandr-Solovev
Could you please share perf improvements?
/intelci: run |
Are you still seeing seg faults when experimenting locally or has this issue been resolved? |
/intelci: run |
Description
PR introduces fix for memcpy function in oneDAL. It saves memory and improve the performance of transferring data from host to device and back and gives an opportunity to use memcpy directly.
PR should start as a draft, then move to ready for review state after CI is passed and all applicable checkboxes are closed.
This approach ensures that reviewers don't spend extra time asking for regular requirements.
You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, PR with docs update doesn't require checkboxes for performance while PR with any change in actual code should have checkboxes and justify how this code change is expected to affect performance (or justification should be self-evident).
Checklist to comply with before moving PR from draft:
PR completeness and readability
Testing
Performance