Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL] Embed bfloat16 devicelib into executable if necessary #16729

Open
wants to merge 42 commits into
base: sycl
Choose a base branch
from

Conversation

jinge90
Copy link
Contributor

@jinge90 jinge90 commented Jan 22, 2025

Currently, sycl bfloat16 conversion functions are implemented in 2 devicelib spvs(fallback version and native version).
The native version targets for any platform which supports "cl_intel_bfloat16_conversions" extension and fallback version is used for all other platforms.
SYCL runtime will select the bfloat16 spvs during execution time by checking bfloat16 extension.
The design requires us to ship 2 spv files together with sycl runtime which some users may dislike.
The PR uses sycl dynamic library mechanism to re-implement this behavior. These 2 bfloat16 lib files are regarded as dynamic library and embedded to final executable, so we don't need to ship any bfloat16 spv libs.
The PR consists following changes:

  1. Driver: pass the devicelib file location to sycl-post-link tool.
  2. sycl-post-link: analyze user's device image to see whether bfloat16 devicelib functions are used. If yes, add the 2 bfloat16 devicelib files as "required" dynamic library. All required bfloat16 devicelib functions are treated as "imported" symbols in user's device image and all functions in bfloat16 devicelib are "exported" symbols.
  3. Sycl runtime will load and link the required bfloat16 devicelib image and resolve the imported symbols.

Fallback and native version of bfloat16 devicelib files have exactly same exported functions, we add a new metadata("SYCL_DEVICELIB_BF16_TYPE") to indicate the version in them. SYCL runtime will check cl_intel_bfloat16_conversions extension and this metadata to decide which version will be linked.

@jinge90 jinge90 requested review from a team as code owners January 22, 2025 08:58
@jinge90 jinge90 marked this pull request as draft January 22, 2025 08:58
Signed-off-by: jinge90 <[email protected]>
@jinge90
Copy link
Contributor Author

jinge90 commented Feb 21, 2025

Hi, @asudarsa and @AlexeySachkov
Kind ping~.
Thanks very much.

@jopperm
Copy link
Contributor

jopperm commented Feb 24, 2025

I'm looking at this in the context of SYCL-RTC. Overall this change LGTM and I think we can adapt our equivalent of sycl-post-link to embed the bf16 libraries as device images as well.

However, is it a problem if multiple of these special images are present at runtime? This would be the case if the application compiles multiple kernel bundles at runtime using bf16, or even if the application contains normal kernels using bf16 and also runtime-compiles a single bf16-using bundle.

@jinge90
Copy link
Contributor Author

jinge90 commented Feb 24, 2025

I'm looking at this in the context of SYCL-RTC. Overall this change LGTM and I think we can adapt our equivalent of sycl-post-link to embed the bf16 libraries as device images as well.

However, is it a problem if multiple of these special images are present at runtime? This would be the case if the application compiles multiple kernel bundles at runtime using bf16, or even if the application contains normal kernels using bf16 and also runtime-compiles a single bf16-using bundle.

Hi, @jopperm
No matter how many user's device images are present, sycl-post-link tool will analyze all of them and once any user's device image calls bf16 devicelib functions, 2 actions will be taken:

  1. For the user device image using bf16 devicelib functions, the used function name will be added to its imported symbols
  2. bf16 devicelib image will be bundled into final executable and all functions included will be added to exported symbols.

When sycl runtime begins to launch a user kernel, it will resolves all imported symbols from the belonging user's device image, if the the imported symbols include bf16 devicelib functions, sycl runtime will link the user's device image with corresponding devicelib image.

Thanks very much.

@jopperm
Copy link
Contributor

jopperm commented Feb 24, 2025

Thanks, got it, after post-link, a single sycl_device_binaries contains N user images plus 1 fallback- and 1 native-bfloat16-devicelib-image. But is it problematic if multiple sycl_device_binaries, each containing the special bfloat16-images, are registered with the runtime?

@jinge90
Copy link
Contributor Author

jinge90 commented Feb 25, 2025

Thanks, got it, after post-link, a single sycl_device_binaries contains N user images plus 1 fallback- and 1 native-bfloat16-devicelib-image. But is it problematic if multiple sycl_device_binaries, each containing the special bfloat16-images, are registered with the runtime?

Hi, @jopperm
Do you mean there are multiple user device images and they all use bf16 devicelib functions? If so, the scenario is same, suppose we have 3 user kernels and we use device-code-split=per_kernel, then we will have 3 device images and each of them includes one kernel, the 3 kernels all use bf16 devicelib functions.
In sycl-post-link, it will scan all of these 3 device images and find bf16 devicelib functions are needed by them, so all of these 3 user device images will have a imported symbol property and corresponding bf16 devicelib function name will be added to each device image's imported symbol list. sycl-post-link will also embed fallback and native bf16 device image into final program, all of the bf16 functions are added to there exported symbol list, so there are 5 device images included(3 user device image + 2 bf16 devicelib image(native and fallback version)).
In execution time, when one user kernel is going to be executed, sycl runtime needs to build the program from corresponding device image first. Sycl runtime will collect deps for this device image and find it has imported symbols, so bf16 devicelib image will be linked with this device image to resolve the all imported symbols. After it is done, sycl runtime will build final kernel and run it. For each of the user kernel, the process is same, if it is to be launched, sycl runtime need to link corresponding device image with dependent bf16 devicelib image and first.
Thanks very much.

//
// Run clang-linker-wrapper test
//
// RUN: clang-linker-wrapper -sycl-post-link-options="SYCL_POST_LINK_OPTIONS" -llvm-spirv-options="LLVM_SPIRV_OPTIONS" "--host-triple=x86_64-pc-windows-msvc" "--linker-path=/usr/bin/ld" "--" HOST_LINKER_FLAGS "-dynamic-linker" HOST_DYN_LIB "-o" "a.out" HOST_LIB_PATH HOST_STAT_LIB %t5.o -sycl-device-libraries=libsycl-crt.new.o -sycl-device-library-location=%S/Inputs/SYCL/lib --dry-run 2>&1 | FileCheck -check-prefix=CHK-CMDS-DEVICE-LIB-DIR %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this extra file? Can we not just use a dummy file? Please see

// Generate .o file as SYCL device library file.

If you do not care about what's inside the file, then touch %t1.devicelib.o will suffice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @asudarsa
We need to pass "-sycl-device-library-location" to clang-linker-wrapper tool and clang-linker-wrapper will look for specified devicelib files in this specified "-sycl-device-library-location", so we put an empty devicelib file in Inputs/SYCL/lib
Thanks very much.

//
// Run clang-linker-wrapper test
//
// RUN: clang-linker-wrapper -sycl-post-link-options="SYCL_POST_LINK_OPTIONS" -llvm-spirv-options="LLVM_SPIRV_OPTIONS" "--host-triple=x86_64-pc-windows-msvc" "--linker-path=/usr/bin/ld" "--" HOST_LINKER_FLAGS "-dynamic-linker" HOST_DYN_LIB "-o" "a.out" HOST_LIB_PATH HOST_STAT_LIB %t5.o -sycl-device-libraries=libsycl-crt.new.o -sycl-device-library-location=%S/Inputs/SYCL/lib --dry-run 2>&1 | FileCheck -check-prefix=CHK-CMDS-DEVICE-LIB-DIR %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, a bit confused. Why name this '-crt' when you are interested only in bfloat16 device lib file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @asudarsa
Since the device library list can't be empty, I just randomly put libsycl-crt in the list to avoid compiler warning.
Thanks very much.

Copy link
Contributor

@asudarsa asudarsa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nits. Overall looks good. Adding @maksimsab to list of reviewers to get feedback on sycl-post-link related changes.

Thanks

@asudarsa asudarsa requested a review from maksimsab February 26, 2025 03:10
Signed-off-by: jinge90 <[email protected]>
@jinge90
Copy link
Contributor Author

jinge90 commented Feb 26, 2025

Thanks, got it, after post-link, a single sycl_device_binaries contains N user images plus 1 fallback- and 1 native-bfloat16-devicelib-image. But is it problematic if multiple sycl_device_binaries, each containing the special bfloat16-images, are registered with the runtime?

Hi, @jopperm Do you mean there are multiple user device images and they all use bf16 devicelib functions? If so, the scenario is same, suppose we have 3 user kernels and we use device-code-split=per_kernel, then we will have 3 device images and each of them includes one kernel, the 3 kernels all use bf16 devicelib functions. In sycl-post-link, it will scan all of these 3 device images and find bf16 devicelib functions are needed by them, so all of these 3 user device images will have a imported symbol property and corresponding bf16 devicelib function name will be added to each device image's imported symbol list. sycl-post-link will also embed fallback and native bf16 device image into final program, all of the bf16 functions are added to there exported symbol list, so there are 5 device images included(3 user device image + 2 bf16 devicelib image(native and fallback version)). In execution time, when one user kernel is going to be executed, sycl runtime needs to build the program from corresponding device image first. Sycl runtime will collect deps for this device image and find it has imported symbols, so bf16 devicelib image will be linked with this device image to resolve the all imported symbols. After it is done, sycl runtime will build final kernel and run it. For each of the user kernel, the process is same, if it is to be launched, sycl runtime need to link corresponding device image with dependent bf16 devicelib image and first. Thanks very much.

Hi, @jopperm
Does the explanation above remove your concern?
Thanks very much.

@jopperm
Copy link
Contributor

jopperm commented Feb 26, 2025

Does the explanation above remove your concern?

Not entirely, your explanation is clear, thank you, but I was asking something else. Let me try to rephrase. Assume you have libary libX containing kernels using bf16. Your application also contains kernels using bf16 and links to libX. At runtime, the library's and the app's sycl_device_binaries (= the result of post-link embedded into the binaries) are registered with the program manager. Both would contain the special bfloat images exporting the symbols of the conversion builtins. Does that interfere with linking either the app's or libX's device images importing these symbols?

@jinge90
Copy link
Contributor Author

jinge90 commented Feb 26, 2025

Does the explanation above remove your concern?

Not entirely, your explanation is clear, thank you, but I was asking something else. Let me try to rephrase. Assume you have libary libX containing kernels using bf16. Your application also contains kernels using bf16 and links to libX. At runtime, the library's and the app's sycl_device_binaries (= the result of post-link embedded into the binaries) are registered with the program manager. Both would contain the special bfloat images exporting the symbols of the conversion builtins. Does that interfere with linking either the app's or libX's device images importing these symbols?

Hi, @jopperm
Do you mean sycl dynamic library when referring to "libX"?
Thanks very much.

@jopperm
Copy link
Contributor

jopperm commented Feb 26, 2025

Do you mean sycl dynamic library when referring to "libX"?

Let's say it's a static lib. The situation I'm interested in is __sycl_register_lib is called more than once.

@jinge90
Copy link
Contributor Author

jinge90 commented Feb 26, 2025

__sycl_register_lib

Hi, @jopperm
I expect that the bf16 devicelib functions still work as expected but we may add duplicate device images for bf16 devicelib, this should be optimized. I will double check, thanks for pointing out this!

Copy link
Contributor

@maksimsab maksimsab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, add tests corresponding to sycl-post-link changes.

@@ -3,7 +3,7 @@
// RUN: %clangxx -### --target=x86_64-pc-windows-msvc -fsycl \
// RUN: -Xdevice-post-link -O0 %s 2>&1 \
// RUN: | FileCheck -check-prefix OPTIONS_POSTLINK_JIT_OLD %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add --sysroot=%S/Inputs/SYCL to the command line as there is no guarantee the base lib will be created with the compiler build (for example with the sanitizer based builds).

@@ -3,7 +3,7 @@
// RUN: %clangxx --target=x86_64-unknown-linux-gnu -fsycl -### \
// RUN: --no-offload-new-driver -Xdevice-post-link -O0 %s 2>&1 \
// RUN: | FileCheck -check-prefix OPTIONS_POSTLINK_JIT_OLD %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add --sysroot=%S/Inputs/SYCL here as well

Copy link
Contributor

@AlexeySachkov AlexeySachkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, but I agree with @maksimsab's point about adding tests. Some unit-tests for SYCL RT should also be added, you can look at https://github.com/intel/llvm/blob/sycl/sycl/unittests/program_manager/DynamicLinking/DynamicLinking.cpp tests for reference

// DeviceLib module doesn't include any entry point,it can be constructed
// using ctor without any entry point related parameter.
for (auto Fn : SYCLDeviceLibs) {
if (Name == std::string(Fn))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (Name == std::string(Fn))
if (Name == StringRef(Fn))

There is no need to allocate heap memory and make a string copy for a sake of a simple comparison

// using ctor without any entry point related parameter.
for (auto Fn : SYCLDeviceLibs) {
if (Name == std::string(Fn))
IsSYCLDeviceLib = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add break;, there is no need to scan other function names once we've found what we were looking for

@@ -1798,14 +1837,28 @@ ProgramManager::kernelImplicitLocalArgPos(const std::string &KernelName) const {
return {};
}

static bool skipEmptyImage(sycl_device_binary RawImg) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static bool skipEmptyImage(sycl_device_binary RawImg) {
static bool shouldSkipEmptyImage(sycl_device_binary RawImg) {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants