
Discussion: Offload semantics on host data #2

oleksandr-pavlyk opened this issue Mar 23, 2021 · 2 comments

Comments

@oleksandr-pavlyk (Contributor) commented Mar 23, 2021

A SYCL-powered Python package is said to follow the "computation follows data" paradigm when its functions/methods infer the queue to which they submit kernels for execution from the (sycl::device, sycl::context) pairs associated with the input USM data, encapsulated in a sycl::queue.

Any ambiguity in such an inference process raises an error.
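For concreteness, here is a minimal sketch of that inference step. It is not an existing dpctl API; the function name infer_execution_queue is hypothetical, and it assumes inputs expose a .sycl_queue attribute (as dpctl's USM-backed arrays do) and that SyclDevice/SyclContext objects support equality comparison.

def infer_execution_queue(*arrays):
    # collect the queue associated with each USM-backed input
    queues = [a.sycl_queue for a in arrays]
    q0 = queues[0]
    for q in queues[1:]:
        # all inputs must agree on the (sycl::device, sycl::context) pair
        if (q.sycl_device, q.sycl_context) != (q0.sycl_device, q0.sycl_context):
            raise ValueError("ambiguous inputs: no common (device, context) pair")
    return q0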

This ticket is to discuss a possible resolution for the scenario where offloaded computation is desired for host data (on the C++ side this would correspond to wrapping host data in a sycl::buffer for use by SYCL kernels; a realistic use case for this workflow is the need to work on host data too large to fit into GPU memory in its entirety).

This scenario raises the question of how to specify which sycl::queue kernels will be submitted to.

The proposed solution is to introduce a target_offload(obj, queue=q) wrapper, so that the semantics become

cls.fit(target_offload(X_host, queue=q), y_host)

The target_offload function creates an instance of a data-only class that associates the specified queue with the X_host object.

# Cython sketch; SyclQueue refers to dpctl's Cython-level queue class
cdef class DataWithQueue:
    cdef readonly object base      # the wrapped host object
    cdef readonly SyclQueue queue  # queue kernels should be submitted to
    def __cinit__(self, base, SyclQueue queue):
        self.base = base
        self.queue = queue

def target_offload(host_obj, queue):
    return DataWithQueue(host_obj, queue=queue)

The responsibility is on the authors of cls.fit to recognize such inputs and infer the intent to offload from the arguments.
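For illustration only, a helper an estimator's fit could use to unwrap such inputs might look like the following; the name _unwrap_offload_target is hypothetical and not part of any existing API, and USM arrays are assumed to expose a .sycl_queue attribute.

def _unwrap_offload_target(arg):
    # returns (data, queue-or-None) for both wrapped host data and USM arrays
    if isinstance(arg, DataWithQueue):
        return arg.base, arg.queue
    return arg, getattr(arg, "sycl_queue", None)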

target_offload called on a USMArray usm_ary with the same queue keyword argument as usm_ary.queue simply returns usm_ary itself.

target_offload called on usm_ary with a queue different from usm_ary.queue raises an error, unless both queues share the same sycl::context, in which case the proposed interpretation is that the usm_ary pointer is used directly in kernels submitted to the specified queue (no explicit copy is needed).

# the following raises a hard error
cls.fit(target_offload(X_cpu_usm_array, queue=gpu_queue), Y_cpu_usm_array)

# the next line is equivalent to cls.fit(X_cpu_usm_array, Y_cpu_usm_array)
cls.fit(target_offload(X_cpu_usm_array, queue=cpu_queue), Y_cpu_usm_array)
# here queues X_usm_tile1.queue and q_tile2 have common multi-device context
target_offload(X_usm_tile1, queue=q_tile2)
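The dispatch inside target_offload could then be extended roughly as sketched below. This is hypothetical: the .sycl_queue attribute name and equality comparison of queue/context objects are assumptions, not confirmed API behavior.

def target_offload(obj, queue):
    obj_queue = getattr(obj, "sycl_queue", None)
    if obj_queue is None:
        # host data: associate it with the queue to offload to
        return DataWithQueue(obj, queue=queue)
    if obj_queue == queue:
        # same queue: return the USM array unchanged
        return obj
    if obj_queue.sycl_context == queue.sycl_context:
        # same context: the existing USM allocation is usable from `queue`
        return DataWithQueue(obj, queue=queue)
    raise ValueError("USM data is bound to a different sycl::context "
                     "than the requested queue")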
@michael-smirnov commented:

  1. What about the data location of results for such a call:

     result = foo(target_offload(host_data, queue=gpu_queue))

     Is the result supposed to be on the device associated with gpu_queue?

  2. Which package will contain this function? dpctl?
  3. What about restrictions on the objects passed as the first argument to that function? Are they expected to be data containers, or can they be of arbitrary type?
  4. Is this way of specifying the target device aligned with the other APIs that allocate data on the device or transfer data to it? Do all of those functions accept a queue parameter?

@napetrov commented Mar 25, 2021

I would also add a few cents on user ramp-up and the transformation of code from host-based to GPU-enabled.

It should be simple both to understand and to convert the code, so let's look at it from an end-to-end perspective.
Here is the host-only code:

import numpy as np

from sklearn.cluster import DBSCAN

X = np.array([[1., 2.], [2., 2.], [2., 3.],
              [8., 7.], [8., 8.], [25., 80.]], dtype=np.float32)
clustering = DBSCAN(eps=3, min_samples=2).fit(X)
    

Here is the current implementation, which uses a sycl_context to offload the operation to the GPU. It assumes that X is host data and that clustering holds results residing on the GPU:

import numpy as np
from daal4py.sklearn import patch_sklearn
from daal4py.oneapi import sycl_context
patch_sklearn()

from sklearn.cluster import DBSCAN

X = np.array([[1., 2.], [2., 2.], [2., 3.],
              [8., 7.], [8., 8.], [25., 80.]], dtype=np.float32)
with sycl_context("gpu"):
    clustering = DBSCAN(eps=3, min_samples=2).fit(X)

What would the full code example be for the new semantics? We would have to not only use target_offload but also create a queue, and we would have to explain to the user what he/she is doing.
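For discussion purposes, a hypothetical end-to-end version under the proposed semantics might look like the sketch below. Where target_offload would be imported from is exactly question 2 above, and dpctl.SyclQueue("gpu") is assumed for explicit queue creation.

import numpy as np
import dpctl
from daal4py.sklearn import patch_sklearn
# from <package TBD> import target_offload  -- its home package is question 2 above
patch_sklearn()

from sklearn.cluster import DBSCAN

# the user now creates the queue explicitly instead of entering sycl_context("gpu")
gpu_queue = dpctl.SyclQueue("gpu")

X = np.array([[1., 2.], [2., 2.], [2., 3.],
              [8., 7.], [8., 8.], [25., 80.]], dtype=np.float32)
# target_offload marks the host array for offload to gpu_queue
clustering = DBSCAN(eps=3, min_samples=2).fit(target_offload(X, queue=gpu_queue))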
