前情提要
在前两篇文章中,我们已经升级了基础的 ROCm LLVM 编译环境,并且解决了 HIP 程序的运行问题。现在,我们继续升级 ROCm 6.2.2 的其他组件,其中的大头就是 CK(composable kernel), MIOpen 和 MIGraphX。
升级 rocMLIR
在 [Tracking] ROCm packages 中提到了两个 patch,打上就可以解决找不到头文件的问题,正常编译。
升级 composable_kernel
CMake HIP Language support
升级完源代码后,在 CMake 阶段就会出现问题:
-- The HIP compiler identification is unknown
-- Detecting HIP compiler ABI info
-- Detecting HIP compiler ABI info - failed
-- Check for working HIP compiler: /nix/store/h0aa4jal4731bhrc5k865y927qndsjwa-rocm-llvm-clang-wrapper-6.2.2/bin/clang++
-- Check for working HIP compiler: /nix/store/h0aa4jal4731bhrc5k865y927qndsjwa-rocm-llvm-clang-wrapper-6.2.2/bin/clang++ - broken
...
clang++: error: cannot find ROCm device library; provide its path via '--rocm-path' or '--rocm-device-lib-path', or pass '-nogpulib' to build without ROCm device library
检查 CMakeLists.txt,发现问题出在 project(composable_kernel VERSION ${version} LANGUAGES CXX HIP)
这一行,在 commit 7b027d5 中进行了重构,使用了 CMake 新的 HIP language support 。在旧版本 6.0.2 中并没有这个功能,所以那时的编译环境没有问题。
找到了问题,但是解决没有那么简单。翻看 CMake 的源码,和 HIP 相关的部分主要在:
- https://gitlab.kitware.com/cmake/cmake/-/blob/v3.30.0/Modules/CMakeDetermineHIPCompiler.cmake?ref_type=tags
- https://gitlab.kitware.com/cmake/cmake/-/blob/v3.30.0/Modules/CMakeTestHIPCompiler.cmake?ref_type=tags
整体流程是,CMake 会从环境中寻找 HIP,然后检查 ABI,如果失败就尝试编译一个简单的 HIP 程序。这里 CMake 找到了 ROCm llvm 的 clang++,但是这里的 clang++ 并没有配套的环境,所以编译失败。我们希望 CMake 找到的应该是已经包装好的 hipcc 。
那我们直接给 CMake 指定 HIP 的编译器行不行?在参数中增加 -DCMAKE_HIP_COMPILER=hipcc
,结果如下:
CMake Error at /nix/store/irwcgpm8csj7br49jih6kdpqkaqya3vf-cmake-3.30.4/share/cmake-3.30/Modules/CMakeDetermineHIPCompiler.cmake:73 (message):
CMAKE_HIP_COMPILER is set to the hipcc wrapper:
hipcc
This is not supported. Use Clang directly, or let CMake pick a default.
Call Stack (most recent call first):
CMakeLists.txt:26 (project)
很遗憾不行, CMake 不支持使用包装好的 hipcc ,必须使用 clang++ 。
那么现在就有矛盾的地方了, CMake 只接受光秃秃的 clang++,这在一般的发行版中不是问题,因为 ROCm 库已经在系统中了, clang++ 可以执行。然而,在 NixOS 中,安装的库并不会直接暴露在系统环境中,二进制可执行文件需要通过环境变量到对应的 nix store 中寻找库。这就导致了 clang++ 找不到库,从而编译失败。那么这里的一个 workaround 就是在这个编译环境中,将 ROCm 库的路径添加到环境变量中,让 clang++ 找到库。
rocmtoolkit-merged = symlinkJoin {
name = "rocmtoolkit-merged";
paths = [
rocm-core
rocm-thunk
rocm-device-libs
roctracer
rocdbgapi
rocm-smi
hsa-amd-aqlprofile-bin
clr
];
postBuild = ''
rm -rf $out/nix-support
'';
};
propagatedBuildInputs = [ rocmtoolkit-merged ];
ROCM_PATH = "${rocmtoolkit-merged}";
DEVICE_LIB_PATH = "${rocmtoolkit-merged}/amdgpu/bitcode";
这里使用了 NixOS 的 symlinkJoin 来将多个库合并成一个,然后通过环境变量来传递给编译环境。笔者尝试过只将基础库 ${clr}
添加到环境变量 ROCM_PATH
中,但发现 CMake 还是无法识别 ABI 信息,所以最后还是将所有相关库都添加到环境变量中。
选择硬件架构
如果不给定硬件架构,composable kernel 默认会编译所有支持的架构,这会导致编译时间过长(笔者使用六核 Ryzen 3600 跑了一整天还没有编译完,官方文档说整个软件栈编译需要 500+ 核时,目前看来 composable kernel 这一部分占了很大比例)。因此,在调试阶段最好指定硬件架构来节省时间。
nix 文件中提供了 GPU_TARGETS
变量来指定硬件架构,但是笔者传入本地的硬件架构 gfx1030
后,发现编译时间依然很长。ps aux | grep clang
后发现,编译器依然在编译所有支持的架构。无奈之下只能回头检查 CMakeLists.txt
文件,发现其中有一段代码:
# https://github.com/ROCm/composable_kernel/blob/665934078ecc6743e049de61e67d28d2a0e5dfe9/library/src/tensor_operation_instance/gpu/CMakeLists.txt#L40
if(INSTANCES_ONLY)
set(INST_TARGETS ${DEFAULT_GPU_TARGETS})
else()
set(INST_TARGETS ${GPU_TARGETS})
endif()
编译器真正使用的架构列表是 INST_TARGETS
,而 INST_TARGETS
的值取决于 INSTANCES_ONLY
变量。如果 INSTANCES_ONLY
为真,则 INST_TARGETS
的值为 DEFAULT_GPU_TARGETS
,否则为 GPU_TARGETS
。之前笔者为了跳过硬件测试,将 INSTANCES_ONLY
设置为真,导致编译器只编译了 DEFAULT_GPU_TARGETS
中的架构。因此,在调试阶段,需要将 INSTANCES_ONLY
设置为假,并指定 GPU_TARGETS
,这样编译器才会编译指定的硬件架构。
这个设计的缘由应该是假设了两种情况:
INSTANCES_ONLY=true
: 开发者本地没有硬件,编译所有支持的架构,跳过测试INSTANCES_ONLY=false
: 开发者本地有硬件,编译指定架构,进行测试
但是文档中并没有提及这点,只是说了如果 GPU_TARGETS
为空,则默认使用所有支持的架构,而如果指定了 GPU_TARGETS
,读者就会默认只编译指定的架构,完全不知道 INSTANCES_ONLY
的影响。
找到问题,设置 INSTANCES_ONLY=false
后,再用 ps aux | grep clang
查看编译器参数,确认编译器只编译了 gfx1030
架构,编译时间也明显缩短了。
升级 MIOpen
回头修复 composable_kernel
照常升级好源代码版本后( MIOpen 的源代码空间占用很大,笔者为此扩容了根目录,期间文件系统损坏,不得不重装系统,详见NixOS 根目录重装),在 config 阶段, composable_kernel 的 cmake 文件报错:
CMake Error at /nix/store/50nxgvdjdcr2wdhbx5zn3df91zhbndbr-unpack-composable_kernel-6.2.2/lib/cmake/composable_kernel/composable_kernelConfig.cmake:24 (include):
include could not find requested file:
/nix/store/50nxgvdjdcr2wdhbx5zn3df91zhbndbr-unpack-composable_kernel-6.2.2/lib/cmake/composable_kernel/composable_kerneldevice_conv_operationsTargets.cmake
Call Stack (most recent call first):
CMakeLists.txt:305 (find_package)
CMake Error at /nix/store/50nxgvdjdcr2wdhbx5zn3df91zhbndbr-unpack-composable_kernel-6.2.2/lib/cmake/composable_kernel/composable_kernelConfig.cmake:24 (include):
include could not find requested file:
/nix/store/50nxgvdjdcr2wdhbx5zn3df91zhbndbr-unpack-composable_kernel-6.2.2/lib/cmake/composable_kernel/composable_kerneldevice_contraction_operationsTargets.cmake
Call Stack (most recent call first):
CMakeLists.txt:305 (find_package)
意思是说,在 composable kernel 的库中,找不到 composable_kerneldevice_conv_operationsTargets.cmake
和 composable_kerneldevice_contraction_operationsTargets.cmake
这两个文件。笔者到这个库的文件夹下查看:
nix/store/50nxgvdjdcr2wdhbx5zn3df91zhbndbr-unpack-composable_kernel-6.2.2/lib/
├── cmake
│ └── composable_kernel
│ ├── composable_kernelConfig.cmake
│ ├── composable_kernelConfigVersion.cmake
│ ├── composable_kerneldevice_gemm_operationsTargets.cmake
│ ├── composable_kerneldevice_gemm_operationsTargets-release.cmake
│ ├── composable_kerneldevice_other_operationsTargets.cmake
│ ├── composable_kerneldevice_other_operationsTargets-release.cmake
│ ├── composable_kerneldevice_reduction_operationsTargets.cmake
│ ├── composable_kerneldevice_reduction_operationsTargets-release.cmake
│ ├── composable_kernelutilityTargets.cmake
│ └── composable_kernelutilityTargets-release.cmake
├── libdevice_gemm_operations.a
├── libdevice_other_operations.a
├── libdevice_reduction_operations.a
└── libutility.a
3 directories, 14 files
的确没有这两个文件,而且相应的库文件也不存在,所以问题就在 composable kernel 的编译上,并没有生成这两个库文件。
还是硬件架构的问题
回头继续检查 cmake 文件,核心代码如下:
# https://github.com/ROCm/composable_kernel/blob/665934078ecc6743e049de61e67d28d2a0e5dfe9/library/src/tensor_operation_instance/gpu/CMakeLists.txt#L168
if(("${cmake_instance}" MATCHES "quantization") AND (DEFINED DTYPES) AND (NOT DTYPES MATCHES "int8"))
message("quantization instances will not be built!")
set(add_inst 0)
endif()
if(("${cmake_instance}" MATCHES "ONLY DL_KERNELS") AND (NOT DEFINED DL_KERNELS))
message("Found only dl instances, but DL_KERNELS is not set. Skipping.")
set(add_inst 0)
endif()
if(("${cmake_instance}" MATCHES "ONLY XDL_KERNELS") AND (NOT INST_TARGETS MATCHES "gfx9"))
message("Found only xdl instances, but gfx9 is not on the targets list. Skipping.")
set(add_inst 0)
endif()
if(("${cmake_instance}" MATCHES "ONLY WMMA_KERNELS") AND (NOT INST_TARGETS MATCHES "gfx11"))
message("Found only wmma instances, but gfx11 is not on the targets list. Skipping.")
set(add_inst 0)
endif()
if(("${cmake_instance}" MATCHES "ONLY XDL_AND_DL_KERNELS") AND (NOT DEFINED DL_KERNELS) AND (NOT INST_TARGETS MATCHES "gfx9"))
message("Found only xdl and dl instances, but gfx9 is not on the targets listand DL_KERNELS is not set. Skipping.")
set(add_inst 0)
endif()
if(("${cmake_instance}" MATCHES "ONLY XDL_AND_WMMA_KERNELS") AND (NOT INST_TARGETS MATCHES "gfx11") AND (NOT INST_TARGETS MATCHES "gfx9"))
message("Found only xdl and wmma instances, but gfx11 and gfx9 are not on the targets list. Skipping.")
set(add_inst 0)
endif()
if(("${cmake_instance}" MATCHES "XDL_DL_WMMA_KERNELS") AND (NOT INST_TARGETS MATCHES "gfx11") AND (NOT INST_TARGETS MATCHES "gfx9") AND (NOT DEFINED DL_KERNELS))
message("Found xdl, dl, and wmma instances, but none of those meet the target list. Skipping.")
set(add_inst 0)
endif()
if((add_inst EQUAL 1))
get_filename_component(target_dir ${subdir_path} NAME)
add_subdirectory(${target_dir})
if("${cmake_instance}" MATCHES "gemm")
list(APPEND CK_DEVICE_GEMM_INSTANCES $<TARGET_OBJECTS:device_${target_dir}_instance>)
elseif("${cmake_instance}" MATCHES "conv")
list(APPEND CK_DEVICE_CONV_INSTANCES $<TARGET_OBJECTS:device_${target_dir}_instance>)
elseif("${cmake_instance}" MATCHES "mha")
list(APPEND CK_DEVICE_MHA_INSTANCES $<TARGET_OBJECTS:device_${target_dir}_instance>)
elseif("${cmake_instance}" MATCHES "contr")
list(APPEND CK_DEVICE_CONTRACTION_INSTANCES $<TARGET_OBJECTS:device_${target_dir}_instance>)
elseif("${cmake_instance}" MATCHES "reduce")
list(APPEND CK_DEVICE_REDUCTION_INSTANCES $<TARGET_OBJECTS:device_${target_dir}_instance>)
else()
list(APPEND CK_DEVICE_OTHER_INSTANCES $<TARGET_OBJECTS:device_${target_dir}_instance>)
endif()
message("add_instance_directory ${subdir_path}")
else()
message("skip_instance_directory ${subdir_path}")
endif()
这里的 ${cmake_instance}
是读取了子文件夹的 CMakeLists.txt
文件,比如:
# ONLY XDL_KERNELS
set(BATCHED_GEMM_INSTANCES)
list(APPEND BATCHED_GEMM_INSTANCES device_batched_gemm_xdl_f16_f16_f16_gmk_gkn_gmn_instance.cpp
device_batched_gemm_xdl_f16_f16_f16_gmk_gnk_gmn_instance.cpp
device_batched_gemm_xdl_f16_f16_f16_gkm_gkn_gmn_instance.cpp
device_batched_gemm_xdl_f16_f16_f16_gkm_gnk_gmn_instance.cpp
device_batched_gemm_xdl_bf16_bf16_bf16_gmk_gkn_gmn_instance.cpp
device_batched_gemm_xdl_bf16_bf16_bf16_gmk_gnk_gmn_instance.cpp
device_batched_gemm_xdl_bf16_bf16_bf16_gkm_gkn_gmn_instance.cpp
device_batched_gemm_xdl_bf16_bf16_bf16_gkm_gnk_gmn_instance.cpp
device_batched_gemm_xdl_f32_f32_f32_gmk_gkn_gmn_instance.cpp
device_batched_gemm_xdl_f32_f32_f32_gmk_gnk_gmn_instance.cpp
device_batched_gemm_xdl_f32_f32_f32_gkm_gkn_gmn_instance.cpp
device_batched_gemm_xdl_f32_f32_f32_gkm_gnk_gmn_instance.cpp
device_batched_gemm_xdl_int8_int8_int8_gmk_gkn_gmn_instance.cpp
device_batched_gemm_xdl_int8_int8_int8_gmk_gnk_gmn_instance.cpp
device_batched_gemm_xdl_int8_int8_int8_gkm_gkn_gmn_instance.cpp
device_batched_gemm_xdl_int8_int8_int8_gkm_gnk_gmn_instance.cpp)
add_instance_library(device_batched_gemm_instance ${BATCHED_GEMM_INSTANCES})
第一行的注释标记了 kernel 类型, XDL
kernel 需要 gf9
系列架构, WMMA
kernel 需要 gfx11
系列架构。如果没有对应的 kernel
,则不会生成对应的 instance。笔者的 gfx1030
不属于这两个架构,所以需要在 CMake 中启用 DL_KERNEL
选项。
启用 DL_KERNEL
选项重新编译后,还是缺少库文件,回头检查 CK 的源码,发现 DL_KERNEL
并没有实现所有的算子,比如 conv1d_bwd_data 只有 XDL_KERNEL
的实现。无奈之下,暂时只能添加其他硬件架构,让 CK 生成完整的库文件来满足下游 MIOpen 的需求。这里笔者选择了 gfx90a
。
MIOpen 的编译
解决了 CK 的问题后重新编译,检查产生的动态链接库:
$ ldd -r /nix/store/60riwd6fsp2kb314n6hs3jhlfjkfpdla-miopen-6.2.2/lib/libMIOpen.so
linux-vdso.so.1 (0x000079b29f972000)
libzstd.so.1 => /nix/store/5hg6h4zjxc3ax7j4ywn6ksd509yl4pmd-zstd-1.5.6/lib/libzstd.so.1 (0x000079b26f12c000)
libdl.so.2 => /nix/store/3bvxjkkmwlymr0fssczhgi39c3aj1l7i-glibc-2.40-36/lib/libdl.so.2 (0x000079b29f967000)
libbz2.so.1 => /nix/store/6amvjdayglsifq4lb2lrj4q5gwf4dicd-bzip2-1.0.8/lib/libbz2.so.1 (0x000079b29f954000)
libhiprtc.so.6 => /nix/store/52ylf6idbizs4ms7wd80y6l2z9k6mgc2-clr-6.2.2/lib/libhiprtc.so.6 (0x000079b26ef4e000)
libamd_comgr.so.2 => /nix/store/wqarg6yfjm9m6rw82f7qnp7fvw9dxach-rocm-llvm-comgr-6.2.2/lib/libamd_comgr.so.2 (0x000079b266200000)
librocblas.so.4 => /nix/store/z73h13qlf5lp0r0dym87n7hzim8ixk87-rocblas-6.2.2/lib/librocblas.so.4 (0x000079b23a200000)
libsqlite3.so.0 => /nix/store/b4cqpglpjvkjhb8l4dh1ammpg8gk6h9p-sqlite-3.46.1/lib/libsqlite3.so.0 (0x000079b23a097000)
librt.so.1 => /nix/store/3bvxjkkmwlymr0fssczhgi39c3aj1l7i-glibc-2.40-36/lib/librt.so.1 (0x000079b29f94d000)
libroctx64.so.4 => /nix/store/ymihn5p7a9ix9jciqf3i3ljrdxxj3xiy-rocmtoolkit-merged/lib/libroctx64.so.4 (0x000079b29f946000)
libamdhip64.so.6 => /nix/store/52ylf6idbizs4ms7wd80y6l2z9k6mgc2-clr-6.2.2/lib/libamdhip64.so.6 (0x000079b238800000)
libunwind.so.1 => /nix/store/m1n213xhzip6sbrvzfzmv1rcgd5vckw2-rocm-llvm-runtimes-6.2.2/lib/libunwind.so.1 (0x000079b29f937000)
libm.so.6 => /nix/store/3bvxjkkmwlymr0fssczhgi39c3aj1l7i-glibc-2.40-36/lib/libm.so.6 (0x000079b239fb0000)
libpthread.so.0 => /nix/store/3bvxjkkmwlymr0fssczhgi39c3aj1l7i-glibc-2.40-36/lib/libpthread.so.0 (0x000079b29f932000)
libc.so.6 => /nix/store/3bvxjkkmwlymr0fssczhgi39c3aj1l7i-glibc-2.40-36/lib/libc.so.6 (0x000079b238607000)
/nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib64/ld-linux-x86-64.so.2 (0x000079b29f974000)
libhsa-runtime64.so.1 => /nix/store/aq0s19mgpqvq6qwwwcniq03vlx1phd3j-rocm-runtime-6.2.2/lib/libhsa-runtime64.so.1 (0x000079b238200000)
libnuma.so.1 => /nix/store/jljd230kkw0pxdzbvcc5vdgvvrqq0x60-numactl-2.0.18/lib/libnuma.so.1 (0x000079b29f921000)
libz.so.1 => /nix/store/ncjma3lhr1sf6mqaczl9mfhjmp6ix58w-zlib-1.3.1/lib/libz.so.1 (0x000079b26ef30000)
libncursesw.so.6 => /nix/store/c7qx9iq6lqfjd4dxrdd6723l6rvzsj5s-ncurses-6.4.20221231/lib/libncursesw.so.6 (0x000079b239f3a000)
libstdc++.so.6 => /nix/store/s94fwp43xhzkvw8l8nqslskib99yifzi-gcc-13.3.0-lib/lib/libstdc++.so.6 (0x000079b237e00000)
libgcc_s.so.1 => /nix/store/s94fwp43xhzkvw8l8nqslskib99yifzi-gcc-13.3.0-lib/lib/libgcc_s.so.1 (0x000079b26ef0b000)
librocprofiler-register.so.0 => /nix/store/l7h1x86mh3ajhcy7wqyky8h39v7sjcl9-rocprofiler-register-6.2.2/lib/librocprofiler-register.so.0 (0x000079b26eed9000)
libelf.so.1 => /nix/store/p2i9dd3fslfhsc593h2xfs4s75lzbpki-elfutils-0.191/lib/libelf.so.1 (0x000079b2661e3000)
libdrm.so.2 => /nix/store/sbm8cmk0i6xz6af63cgmhqr26gfdcjgq-libdrm-2.4.123/lib/libdrm.so.2 (0x000079b29f905000)
libdrm_amdgpu.so.1 => /nix/store/sbm8cmk0i6xz6af63cgmhqr26gfdcjgq-libdrm-2.4.123/lib/libdrm_amdgpu.so.1 (0x000079b2661d7000)
libatomic.so.1 => /nix/store/s94fwp43xhzkvw8l8nqslskib99yifzi-gcc-13.3.0-lib/lib/libatomic.so.1 (0x000079b239f2f000)
libfmt.so.10 => /nix/store/5796jkcv6398ygfzkbkd6g4lpfbig19r-fmt-10.2.1/lib/libfmt.so.10 (0x000079b2381db000)
libglog.so.1 => /nix/store/i159a630nxxwsb1yhqfjcmjsnwjw2lay-glog-0.6.0/lib/libglog.so.1 (0x000079b23819f000)
libgflags.so.2.2 => /nix/store/halb5g04abisgmk3fyhlp8v36mad9xdm-gflags-2.2.2/lib/libgflags.so.2.2 (0x000079b238170000)
liblzma.so.5 => /nix/store/wp5xd6g6v3mafvdh2185v19diqqf0cm2-xz-5.6.3/lib/liblzma.so.5 (0x000079b23813f000)
undefined symbol: LLVMInitializeX86TargetInfo (/nix/store/60riwd6fsp2kb314n6hs3jhlfjkfpdla-miopen-6.2.2/lib/libMIOpen.so)
undefined symbol: LLVMInitializeX86Target (/nix/store/60riwd6fsp2kb314n6hs3jhlfjkfpdla-miopen-6.2.2/lib/libMIOpen.so)
undefined symbol: LLVMInitializeX86TargetMC (/nix/store/60riwd6fsp2kb314n6hs3jhlfjkfpdla-miopen-6.2.2/lib/libMIOpen.so)
undefined symbol: LLVMInitializeX86AsmPrinter (/nix/store/60riwd6fsp2kb314n6hs3jhlfjkfpdla-miopen-6.2.2/lib/libMIOpen.so)
undefined symbol: LLVMInitializeX86AsmParser (/nix/store/60riwd6fsp2kb314n6hs3jhlfjkfpdla-miopen-6.2.2/lib/libMIOpen.so)
出现了 undefined symbol ,导致下游 onnxruntime 运行时报错。检查了依赖后,定位到 rocMLIR
上:
$ nm librockCompiler.a | grep LLVMInit
0000000000000000 T LLVMInitializeAMDGPUAsmParser
0000000000000000 T LLVMInitializeAMDGPUAsmPrinter
0000000000000000 T LLVMInitializeAMDGPUTarget
0000000000000000 T LLVMInitializeAMDGPUTargetMC
0000000000000000 T LLVMInitializeAMDGPUDisassembler
0000000000000000 T LLVMInitializeAMDGPUTargetInfo
nm: TFLiteUtils.cpp.o: no symbols
0000000000000000 T LLVMInitializeFunctionPassManager
nm: blake3_neon.c.o: no symbols
nm: AutoConvert.cpp.o: no symbols
U LLVMInitializeAMDGPUAsmParser
U LLVMInitializeAMDGPUAsmPrinter
U LLVMInitializeAMDGPUTarget
U LLVMInitializeAMDGPUTargetInfo
U LLVMInitializeAMDGPUTargetMC
U LLVMInitializeAMDGPUAsmParser
U LLVMInitializeAMDGPUAsmPrinter
U LLVMInitializeAMDGPUTarget
U LLVMInitializeAMDGPUTargetInfo
U LLVMInitializeAMDGPUTargetMC
U LLVMInitializeAMDGPUAsmParser
U LLVMInitializeAMDGPUAsmPrinter
U LLVMInitializeAMDGPUTarget
U LLVMInitializeAMDGPUTargetInfo
U LLVMInitializeAMDGPUTargetMC
U LLVMInitializeX86AsmParser
U LLVMInitializeX86AsmPrinter
U LLVMInitializeX86Target
U LLVMInitializeX86TargetInfo
U LLVMInitializeX86TargetMC
查找 LLVMInitialize
相关 symbol 后,定位到了 mlir/tools/rocmlir-lib/librockcompiler_deps.cmake:
### DO NOT EDIT!
### Generated by mlir/utils/jenkins/static-checks/get_fat_library_deps_list.pl
set(__mlir_libs
LLVMAMDGPUAsmParser
LLVMAMDGPUCodeGen
LLVMAMDGPUDesc
LLVMAMDGPUDisassembler
LLVMAMDGPUInfo
LLVMAMDGPUUtils
......
lldCommon
lldELF
)
问题就在这里, LLVMAMDGPU
相关的 symbol 包含进来了,但是 LLVMX86
相关的没有,导致 librockCompiler.a
中并没有相关的 symbol 。根据提示,检查 mlir/utils/jenkins/static-checks/get_fat_library_deps_list.pl
文件,最关键的部分在于:
foreach (@deps) {
last if /outputs:/;
if (m#external/llvm-project/llvm/lib/lib(\w+)\.a#) {
push @mlirLibs, $1;
} elsif (m#lib/lib(\w+)\.a#) {
push @rocmlirLibs, $1;
}
}
这个文件会自动枚举 external/llvm-project/llvm/lib
里面的静态链接库,加入依赖。猜测上游开发者编译的时候,应该只有 AMDGPU
,没有 X86
,所以这个脚本只添加了 AMDGPU
的依赖。回头检查 nix 配置:
cmakeFlags = [
"-DLLVM_TARGETS_TO_BUILD=AMDGPU;${llvmNativeTarget}"
"-DLLVM_ENABLE_ZSTD=ON"
"-DLLVM_ENABLE_ZLIB=ON"
"-DLLVM_ENABLE_TERMINFO=ON"
"-DLLVM_BUILD_LIBRARY_DIR=/build/source/build/external/llvm-project/llvm/lib"
"-DROCM_PATH=${clr}"
# Manually define CMAKE_INSTALL_<DIR>
# See: https://github.com/NixOS/nixpkgs/pull/197838
"-DCMAKE_INSTALL_BINDIR=bin"
"-DCMAKE_INSTALL_LIBDIR=lib"
"-DCMAKE_INSTALL_INCLUDEDIR=include"
] ++ lib.optionals buildRockCompiler [
"-DBUILD_FAT_LIBROCKCOMPILER=ON"
] ++ lib.optionals (!buildRockCompiler) [
"-DROCM_TEST_CHIPSET=gfx000"
];
这里 llvmNativeTarget
是 X86
,所以 librockCompiler.a
依赖 LLVMX86
相关的 symbol ,但是附带的 mlir/tools/rocmlir-lib/librockcompiler_deps.cmake
并没有相关依赖,所以 librockCompiler.a
中 LLVMX86
相关的 symbol 处于 undefined 状态。解决方案有两个:
- 修改
mlir/tools/rocmlir-lib/librockcompiler_deps.cmake
,添加LLVMX86
相关的依赖。 - 修改
cmakeFlags
,将llvmNativeTarget
设置为AMDGPU
。
这里先尝试第一种方案。添加依赖后重新编译,再次检查 librockCompiler.a
:
$ nm librockCompiler.a | grep LLVMInit
0000000000000000 T LLVMInitializeAMDGPUAsmParser
0000000000000000 T LLVMInitializeAMDGPUAsmPrinter
0000000000000000 T LLVMInitializeAMDGPUTarget
0000000000000000 T LLVMInitializeAMDGPUTargetMC
0000000000000000 T LLVMInitializeAMDGPUDisassembler
0000000000000000 T LLVMInitializeAMDGPUTargetInfo
0000000000000000 T LLVMInitializeX86AsmParser
0000000000000000 T LLVMInitializeX86AsmPrinter
0000000000000000 T LLVMInitializeX86Target
0000000000000000 T LLVMInitializeX86TargetMC
0000000000000000 T LLVMInitializeX86Disassembler
0000000000000000 T LLVMInitializeX86TargetInfo
nm: TFLiteUtils.cpp.o: no symbols
0000000000000000 T LLVMInitializeFunctionPassManager
nm: blake3_neon.c.o: no symbols
nm: AutoConvert.cpp.o: no symbols
U LLVMInitializeAMDGPUAsmParser
U LLVMInitializeAMDGPUAsmPrinter
U LLVMInitializeAMDGPUTarget
U LLVMInitializeAMDGPUTargetInfo
U LLVMInitializeAMDGPUTargetMC
U LLVMInitializeAMDGPUAsmParser
U LLVMInitializeAMDGPUAsmPrinter
U LLVMInitializeAMDGPUTarget
U LLVMInitializeAMDGPUTargetInfo
U LLVMInitializeAMDGPUTargetMC
U LLVMInitializeAMDGPUAsmParser
U LLVMInitializeAMDGPUAsmPrinter
U LLVMInitializeAMDGPUTarget
U LLVMInitializeAMDGPUTargetInfo
U LLVMInitializeAMDGPUTargetMC
U LLVMInitializeX86AsmParser
U LLVMInitializeX86AsmPrinter
U LLVMInitializeX86Target
U LLVMInitializeX86TargetInfo
U LLVMInitializeX86TargetMC
按预想中出现了 LLVMInitializeX86
的符号,说明问题的确在这里。
接下来尝试第二种方案,在 nix 配置中去除 llvmNativeTarget
后重新编译,检查结果如下:
[nix-shell:/nix/store/nfw1ki7cwgv4i394jmpp4m9kj18ncd5r-rocmlir-rock-6.2.2/lib]$ nm librockCompiler.a | grep LLVMInit
0000000000000000 T LLVMInitializeAMDGPUAsmParser
0000000000000000 T LLVMInitializeAMDGPUAsmPrinter
0000000000000000 T LLVMInitializeAMDGPUTarget
0000000000000000 T LLVMInitializeAMDGPUTargetMC
0000000000000000 T LLVMInitializeAMDGPUDisassembler
0000000000000000 T LLVMInitializeAMDGPUTargetInfo
nm: TFLiteUtils.cpp.o: no symbols
0000000000000000 T LLVMInitializeFunctionPassManager
nm: AutoConvert.cpp.o: no symbols
nm: blake3_neon.c.o: no symbols
U LLVMInitializeAMDGPUAsmParser
U LLVMInitializeAMDGPUAsmPrinter
U LLVMInitializeAMDGPUTarget
U LLVMInitializeAMDGPUTargetInfo
U LLVMInitializeAMDGPUTargetMC
U LLVMInitializeAMDGPUAsmParser
U LLVMInitializeAMDGPUAsmPrinter
U LLVMInitializeAMDGPUTarget
U LLVMInitializeAMDGPUTargetInfo
U LLVMInitializeAMDGPUTargetMC
U LLVMInitializeAMDGPUAsmParser
U LLVMInitializeAMDGPUAsmPrinter
U LLVMInitializeAMDGPUTarget
U LLVMInitializeAMDGPUTargetInfo
U LLVMInitializeAMDGPUTargetMC
没有了 X86
相关的 symbol,这应该就是上游开发者原本的编译配置。
MIGraphX 的编译
MIGraphX 的编译主要遇到了 头文件的问题,一旦启用 CK 支持,src/targets/gpu/include/migraphx/gpu/ck.hpp 中就会需要:
#include "ck/host/device_gemm_multiple_d.hpp"
#include "ck/host/device_batched_gemm_softmax_gemm.hpp"
但是 CK 安装的文件中并没有这些文件。一番搜索过后,在 codegen/include/ck/host 中找到了 device_gemm_multiple_d.hpp
,但是 device_batched_gemm_softmax_gemm.hpp
并没有找到。这里很奇怪
codegen
这个文件夹并没有加入主CMakeLists.txt
中,而是自己拥有一个独立CMakeLists.txt
- 尝试在
codegen
目录底下编译,还是没有device_batched_gemm_softmax_gemm.hpp
文件 - 这个文件夹是在 Add host lib #1134 这个 三月 issue 引进的,但是 src/targets/gpu/include/migraphx/gpu/ck.hpp 在去年就引入了
- 根据 requirements.txt 中记录的 commit id 去找对应的 CK 版本,
codegen
文件夹根本不存在
目前的线索就这些,暂时不启用 CK 编译,跳过这个问题。
修复 RocBLAS 的 Tensile 支持
之前编译 RocBLAS 的时候,增加 Tensile 支持会编译失败,为了尽快推进暂时禁止了这个选项,但是下游运行 onnxruntime 的时候会报 rocblas_status=14
错误,在 library/include/internal/rocblas-types.h 中有错误代码的定义:
/* ============================================================================================ */
/**
* @brief rocblas status codes definition
*/
typedef enum rocblas_status_
{
rocblas_status_success = 0, /**< Success */
rocblas_status_invalid_handle = 1, /**< Handle not initialized, invalid or null */
rocblas_status_not_implemented = 2, /**< Function is not implemented */
rocblas_status_invalid_pointer = 3, /**< Invalid pointer argument */
rocblas_status_invalid_size = 4, /**< Invalid size argument */
rocblas_status_memory_error = 5, /**< Failed internal memory allocation, copy or dealloc */
rocblas_status_internal_error = 6, /**< Other internal library failure */
rocblas_status_perf_degraded = 7, /**< Performance degraded due to low device memory */
rocblas_status_size_query_mismatch = 8, /**< Unmatched start/stop size query */
rocblas_status_size_increased = 9, /**< Queried device memory size increased */
rocblas_status_size_unchanged = 10, /**< Queried device memory size unchanged */
rocblas_status_invalid_value = 11, /**< Passed argument not valid */
rocblas_status_continue = 12, /**< Nothing preventing function to proceed */
rocblas_status_check_numerics_fail
= 13, /**< Will be set if the vector/matrix has a NaN/Infinity/denormal value */
rocblas_status_excluded_from_build
= 14, /**< Function is not available in build, likely a function requiring Tensile built without Tensile */
rocblas_status_arch_mismatch
= 15, /**< The function requires a feature absent from the device architecture */
} rocblas_status;
对应的问题就是有些需要 Tensile 的函数没有编译,所以还是需要解决 Tensile 的问题。
加上 Tensile 支持时编译 RocBLAS 会出现:
################################################################################
# Tensile Create Library
Tensile::WARNING: Did not detect SupportedISA: [(8, 0, 3), (9, 0, 0), (9, 0, 6), (9, 0, 8), (9, 0, 10), (9, 4, 0), (9, 4, 1), (9, 4, 2), (10, 1, 0), (10, 1, 1), (10, 1, 2), (10, 3, 0), (10, 3, 1), (11, 0, 0), (11, 0, 1), (11, 0, 2)]; cannot benchmark assembly kernels.
# Found hipcc version 6.2.41134-0
Tensile::WARNING: Architecture (8, 0, 3) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (9, 0, 0) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (9, 0, 6) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (9, 0, 8) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (9, 0, 10) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (9, 4, 0) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (9, 4, 1) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (9, 4, 2) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (10, 1, 0) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (10, 1, 1) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (10, 1, 2) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (10, 3, 0) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (10, 3, 1) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (11, 0, 0) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (11, 0, 1) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (11, 0, 2) not supported by ROCm 6.2.41134-0
后续编译的时候, Tensile 会报 kernel 缺少 VectorWidthA
key 的错误,让人很迷惑,只能先从这些 warning 下手。查找一番后,在 Tensile 的源码中找到了 相关代码:
def GetAsmCaps(isaVersion):
""" Determine assembler capabilities by testing short instructions sequences """
if globalParameters["AssemblerPath"] is not None:
derivedAsmCaps = {}
derivedAsmCaps["SupportedISA"] = tryAssembler(isaVersion, "")
derivedAsmCaps["HasExplicitCO"] = tryAssembler(isaVersion, "v_add_co_u32 v0,vcc,v0,1")
derivedAsmCaps["HasExplicitNC"] = tryAssembler(isaVersion, "v_add_nc_u32 v0,v0,1")
# Syntax of DirectToLds loads has changed: destination vgpr should be omitted
# Old syntax should be removed in a future update as it is no longer supported
derivedAsmCaps["HasDirectToLdsDest"] = tryAssembler(isaVersion, "buffer_load_dword v40, v36, s[24:27], s28 offen offset:0 lds") \
or tryAssembler(isaVersion, "buffer_load_b32 v40, v36, s[24:27], s28 offen offset:0 lds")
derivedAsmCaps["HasDirectToLdsNoDest"] = tryAssembler(isaVersion, "buffer_load_dword v36, s[24:27], s28 offen offset:0 lds") \
or tryAssembler(isaVersion, "buffer_load_b32 v36, s[24:27], s28 offen offset:0 lds")
derivedAsmCaps["HasAddLshl"] = tryAssembler(isaVersion, "v_add_lshl_u32 v47, v36, v34, 0x2")
derivedAsmCaps["HasLshlOr"] = tryAssembler(isaVersion, "v_lshl_or_b32 v47, v36, 0x2, v34")
derivedAsmCaps["HasSMulHi"] = tryAssembler(isaVersion, "s_mul_hi_u32 s47, s36, s34")
derivedAsmCaps["HasWMMA"] = tryAssembler(isaVersion, "v_wmma_f32_16x16x16_f16 v[0:7], v[8:15], v[16:23], v[0:7]")
derivedAsmCaps["HasMFMA"] = tryAssembler(isaVersion, "v_mfma_f32_32x32x2bf16 a[0:31], v32, v33, a[0:31]") \
or tryAssembler(isaVersion, "v_mfma_f32_32x32x1_2b_f32 a[0:31], v0, v1, a[0:31]")
derivedAsmCaps["HasMFMA_constSrc"] = tryAssembler(isaVersion, "v_mfma_f32_32x32x2bf16 a[0:31], v32, v33, 0") \
or tryAssembler(isaVersion, "v_mfma_f32_32x32x1_2b_f32 a[0:31], v0, v1, 0")
derivedAsmCaps["HasMFMA_vgpr"] = tryAssembler(isaVersion, "v_mfma_f32_32x32x2bf16 v[0:31], v32, v33, v[0:31]") \
or tryAssembler(isaVersion, "v_mfma_f32_32x32x1_2b_f32 v[0:31], v0, v1, v[0:31]")
derivedAsmCaps["HasMFMA_f64"] = tryAssembler(isaVersion, "v_mfma_f64_16x16x4f64 v[0:7], v[32:33], v[36:37], v[0:7]") \
or tryAssembler(isaVersion, "v_mfma_f64_16x16x4_f64 v[0:7], v[32:33], v[36:37], v[0:7]")
derivedAsmCaps["HasMFMA_bf16_original"] = tryAssembler(isaVersion, "v_mfma_f32_32x32x2bf16 a[0:31], v32, v33, a[0:31]")
derivedAsmCaps["HasMFMA_bf16_1k"] = tryAssembler(isaVersion, "v_mfma_f32_32x32x4bf16_1k a[0:31], v[32:33], v[36:37], a[0:31]")
derivedAsmCaps["HasMFMA_xf32"] = tryAssembler(isaVersion, "v_mfma_f32_32x32x4_xf32 a[0:15], v[32:33], v[36:37], a[0:15]")
derivedAsmCaps["HasMFMA_f8"] = tryAssembler(isaVersion, "v_mfma_f32_16x16x32_fp8_fp8 a[0:3], v[2:3], v[4:5], a[0:3]")
derivedAsmCaps["HasMFMA_b8"] = tryAssembler(isaVersion, "v_mfma_f32_16x16x32_bf8_bf8 a[0:3], v[2:3], v[4:5], a[0:3]")
derivedAsmCaps["HasMFMA_i8_908"] = tryAssembler(isaVersion, "v_mfma_i32_32x32x8i8 a[0:15], v2, v3, a[0:15]")
derivedAsmCaps["HasMFMA_i8_940"] = tryAssembler(isaVersion, "v_mfma_i32_32x32x16_i8 a[0:15], v[2:3], v[4:5], a[0:15]")
derivedAsmCaps["v_mac_f16"] = tryAssembler(isaVersion, "v_mac_f16 v47, v36, v34")
derivedAsmCaps["v_fma_f16"] = tryAssembler(isaVersion, "v_fma_f16 v47, v36, v34, v47, op_sel:[0,0,0,0]")
derivedAsmCaps["v_fmac_f16"] = tryAssembler(isaVersion, "v_fma_f16 v47, v36, v34")
derivedAsmCaps["v_pk_fma_f16"] = tryAssembler(isaVersion, "v_pk_fma_f16 v47, v36, v34, v47, op_sel:[0,0,0]")
derivedAsmCaps["v_pk_fmac_f16"] = tryAssembler(isaVersion, "v_pk_fma_f16 v47, v36, v34")
derivedAsmCaps["v_mad_mix_f32"] = tryAssembler(isaVersion, "v_mad_mix_f32 v47, v36, v34, v47, op_sel:[0,0,0] op_sel_hi:[1,1,0]")
derivedAsmCaps["v_fma_mix_f32"] = tryAssembler(isaVersion, "v_fma_mix_f32 v47, v36, v34, v47, op_sel:[0,0,0] op_sel_hi:[1,1,0]")
derivedAsmCaps["v_dot2_f32_f16"] = tryAssembler(isaVersion, "v_dot2_f32_f16 v20, v36, v34, v20")
derivedAsmCaps["v_dot2c_f32_f16"] = tryAssembler(isaVersion, "v_dot2c_f32_f16 v47, v36, v34") \
or tryAssembler(isaVersion, "v_dot2acc_f32_f16 v47, v36, v34")
derivedAsmCaps["v_dot4_i32_i8"] = tryAssembler(isaVersion, "v_dot4_i32_i8 v47, v36, v34")
derivedAsmCaps["v_dot4c_i32_i8"] = tryAssembler(isaVersion, "v_dot4c_i32_i8 v47, v36, v34")
derivedAsmCaps["VOP3v_dot4_i32_i8"] = tryAssembler(isaVersion, "v_dot4_i32_i8 v47, v36, v34, v47")
derivedAsmCaps["v_mac_f32"] = tryAssembler(isaVersion, "v_mac_f32 v20, v21, v22")
derivedAsmCaps["v_fma_f32"] = tryAssembler(isaVersion, "v_fma_f32 v20, v21, v22, v23")
derivedAsmCaps["v_fmac_f32"] = tryAssembler(isaVersion, "v_fmac_f32 v20, v21, v22")
derivedAsmCaps["v_fma_f64"] = tryAssembler(isaVersion, "v_fma_f64 v[20:21], v[22:23], v[24:25], v[20:21]")
derivedAsmCaps["v_mov_b64"] = tryAssembler(isaVersion, "v_mov_b64 v[20:21], v[22:23]")
derivedAsmCaps["HasAtomicAdd"] = tryAssembler(isaVersion, "buffer_atomic_add_f32 v0, v1, s[0:3], 0 offen offset:0")
derivedAsmCaps["HasGLCModifier"] = tryAssembler(isaVersion, "buffer_load_dwordx4 v[10:13], v[0], s[0:3], 0, offen offset:0, glc")
derivedAsmCaps["HasNTModifier"] = tryAssembler(isaVersion, "buffer_load_dwordx4 v[10:13], v[0], s[0:3], 0, offen offset:0, nt")
if tryAssembler(isaVersion, "s_waitcnt vmcnt(63)"):
derivedAsmCaps["MaxVmcnt"] = 63
elif tryAssembler(isaVersion, "s_waitcnt vmcnt(15)"):
derivedAsmCaps["MaxVmcnt"] = 15
else:
derivedAsmCaps["MaxVmcnt"] = 0
# TODO- Need to query the max cap, just like vmcnt as well?
derivedAsmCaps["MaxLgkmcnt"] = 15
derivedAsmCaps["KernargPreloading"] = tryAssembler(isaVersion, """
TestKernel:
s_endpgm
.amdhsa_kernel TestKernel
.amdhsa_next_free_vgpr 8
.amdhsa_next_free_sgpr 4
.amdhsa_group_segment_fixed_size 0 // lds bytes
.amdhsa_user_sgpr_kernarg_segment_ptr 1
.amdhsa_user_sgpr_kernarg_preload_length 3
.amdhsa_user_sgpr_kernarg_preload_offset 0
.amdhsa_accum_offset 4
.end_amdhsa_kernel
""")
derivedAsmCaps["SupportedSource"] = True
ignoreCacheCheck = globalParameters["IgnoreAsmCapCache"]
# disable cache checking for < rocm 5.3
compilerVer = globalParameters['HipClangVersion'].split(".")[:2]
compilerVer = [int(c) for c in compilerVer]
if len(compilerVer) >= 2:
ignoreCacheCheck = ignoreCacheCheck or \
compilerVer[0] < 5 or \
(compilerVer[0] == 5 and compilerVer[1] <= 2)
if not derivedAsmCaps["SupportedISA"] and CACHED_ASM_CAPS[isaVersion]["SupportedISA"]:
printWarning("Architecture {} not supported by ROCm {}".format(isaVersion, globalParameters['HipClangVersion']))
ignoreCacheCheck = True
# check if derived caps matches asm cap cache
if not ignoreCacheCheck:
exitFlag = False
# rocm<=6.0, ignore KernargPreloading
if compilerVer[0] <= 5 or (compilerVer[0] == 6 and compilerVer[1] == 0):
derivedAsmCapsCopy = deepcopy(derivedAsmCaps)
# copy KernargPreloading from CACHED_ASM_CAPS (to ignore this)
derivedAsmCapsCopy["KernargPreloading"] = CACHED_ASM_CAPS[isaVersion]["KernargPreloading"]
# compare with copied version (need to keep original value)
if derivedAsmCapsCopy != CACHED_ASM_CAPS[isaVersion]:
exitFlag = True
# rocm>=6
elif derivedAsmCaps != CACHED_ASM_CAPS[isaVersion]:
exitFlag = True
if exitFlag:
printExit("Cached asm caps differ from derived asm caps for {}".format(isaVersion))
return derivedAsmCaps
else:
printWarning("Assembler not present, asm caps loaded from cache are unverified")
return CACHED_ASM_CAPS[isaVersion]
触发这句 warning 的条件是 derivedAsmCaps["SupportedISA]
为空,那就要回去找 derivedAsmCaps["SupportedISA"] = tryAssembler(isaVersion, "")
这句代码,看看 tryAssembler
函数是怎么工作的。这个函数定义在同一个文件里:
def tryAssembler(isaVersion, asmString, debug=False, *options):
"""
Try to assemble the asmString for the specified target processor
Success is defined as assembler returning no error code or stderr/stdout
"""
options = list(options)
if globalParameters["PrintLevel"] >= 2:
debug = True
if isaVersion[0] >= 10:
options += ['-mwavefrontsize64']
assembler = globalParameters['AssemblerPath']
if assembler is None:
raise ValueError('No assembler available; set TENSILE_ROCM_ASSEMBLER_PATH to point to ROCm Clang.')
args = [assembler, '-x', 'assembler',
'-target', 'amdgcn-amdhsa',
'-mcpu='+gfxName(isaVersion),
*options,
'-']
result = subprocess.run(args, input=asmString.encode(), stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
output = result.stdout.decode()
if debug:
print("isaVersion: ", isaVersion)
print("asm_cmd:", ' '.join(args))
print("asmString: ", asmString)
print("output: ", output)
print("return code: ", result.returncode)
if output != "" or result.returncode != 0:
return False
return True
他是通过调用 globalParameters['AssemblerPath']
来获取汇编器的路径,然后调用 subprocess.run
来执行汇编命令,根据返回的状态码和输出信息来判断某个架构是否支持某个特性。那接下来就是找 globalParameters['AssemblerPath']
是如何赋值的:
if "TENSILE_ROCM_ASSEMBLER_PATH" in os.environ:
globalParameters["AssemblerPath"] = os.environ.get("TENSILE_ROCM_ASSEMBLER_PATH")
elif globalParameters["AssemblerPath"] is None and supportedCompiler(globalParameters["CxxCompiler"]):
if os.name == "nt":
globalParameters["AssemblerPath"] = locateExe(globalParameters["ROCmBinPath"], "clang++.exe")
else:
bin_path = "llvm/bin" if globalParameters["CxxCompiler"] == "hipcc" else "bin"
compiler = "clang++" if globalParameters["CxxCompiler"] == "hipcc" else "amdclang++"
globalParameters["AssemblerPath"] = locateExe(os.path.join(globalParameters["ROCmPath"], bin_path), compiler)
可以看到,globalParameters['AssemblerPath']
的值是从环境变量 TENSILE_ROCM_ASSEMBLER_PATH
中获取的,如果没有设置这个环境变量,那么就会根据操作系统来找汇编器。这里在 Linux 上应该找到的是 clang++。那我们在 RocBLAS 的安装环境下运行相应的汇编命令试试:
$ clang++ -x assembler -target amdgcn-amdhsa -mcpu=gfx803 test.asm -o test
ld.lld: error: /nix/store/m1n213xhzip6sbrvzfzmv1rcgd5vckw2-rocm-llvm-runtimes-6.2.2/lib/libunwind.so is incompatible with /tmp/nix-build-81386-0/test-e358e3.o
clang++: error: ld.lld command failed with exit code 1 (use -v to see invocation)
这里 test.asm 的内容是一条简单的语句: v_fma_f32 v20, v21, v22, v23
问题终于找到了!Tensile 调用 clang++ 来检查 GPU 架构的支持特性,但是这个编译环境的 clang++ 统统报错,所以 Tensile 认为所有架构都不支持任何特性,报出了 warning ,后续的编译自然会出问题。接下来就是解决 clang++ 报错的问题。添加 -v
参数可以看到详细的编译信息:
$ clang++ -x assembler -target amdgcn-amdhsa -mcpu=gfx803 test.asm -o test -v
clang version 18.0.0git
Target: amdgcn-unknown-amdhsa
Thread model: posix
InstalledDir: /nix/store/j226zjj87mfvss3407ndwxag18ly1vyp-rocm-llvm-clang-unwrapped-6.2.2/bin
"/nix/store/j226zjj87mfvss3407ndwxag18ly1vyp-rocm-llvm-clang-unwrapped-6.2.2/bin/clang++" -cc1as -triple amdgcn-unknown-amdhsa -filetype obj -main-file-name test.asm -target-cpu gfx803 -fdebug-compilation-dir=/home/jachinshen/Documents/nixpkgs/build/rocblas/source/build -dwarf-debug-producer "clang version 18.0.0git" -dwarf-version=5 -mrelocation-model pic -o /tmp/nix-build-81386-0/test-da61cc.o test.asm
"/nix/store/w653fib3igij3jb517rdihjry6cmdfhk-rocm-llvm-binutils-wrapper-6.2.2/bin/ld.lld" --no-undefined -shared -L/nix/store/52ylf6idbizs4ms7wd80y6l2z9k6mgc2-clr-6.2.2/lib -L/nix/store/wqarg6yfjm9m6rw82f7qnp7fvw9dxach-rocm-llvm-comgr-6.2.2/lib -L/nix/store/aq0s19mgpqvq6qwwwcniq03vlx1phd3j-rocm-runtime-6.2.2/lib -L/nix/store/901c80rlps5q05bnjk1sj4zaz5k736nc-python3-3.12.7/lib -L/nix/store/l7h1x86mh3ajhcy7wqyky8h39v7sjcl9-rocprofiler-register-6.2.2/lib -L/nix/store/m1n213xhzip6sbrvzfzmv1rcgd5vckw2-rocm-llvm-runtimes-6.2.2/lib -L/nix/store/g8fw14gp14dkm65n7idrj3a0vai511j9-rocm-llvm-llvm-6.2.2/lib -L/nix/store/ncjma3lhr1sf6mqaczl9mfhjmp6ix58w-zlib-1.3.1/lib -L/nix/store/c7qx9iq6lqfjd4dxrdd6723l6rvzsj5s-ncurses-6.4.20221231/lib -L/nix/store/pf3q2h8s1l49a9qvz9y99p4m8mpcy31m-rocm-llvm-lld-6.2.2/lib -L/nix/store/bd6lqyiqwk8f0im0crmpyaklk5qdw22g-msgpack-3.3.0/lib -L/nix/store/xwhr69r2dzv1v86siykzr4csjymgra2m-libxml2-2.13.4/lib -L/nix/store/52ylf6idbizs4ms7wd80y6l2z9k6mgc2-clr-6.2.2/lib -L/nix/store/wqarg6yfjm9m6rw82f7qnp7fvw9dxach-rocm-llvm-comgr-6.2.2/lib -L/nix/store/aq0s19mgpqvq6qwwwcniq03vlx1phd3j-rocm-runtime-6.2.2/lib -L/nix/store/901c80rlps5q05bnjk1sj4zaz5k736nc-python3-3.12.7/lib -L/nix/store/l7h1x86mh3ajhcy7wqyky8h39v7sjcl9-rocprofiler-register-6.2.2/lib -L/nix/store/m1n213xhzip6sbrvzfzmv1rcgd5vckw2-rocm-llvm-runtimes-6.2.2/lib -L/nix/store/g8fw14gp14dkm65n7idrj3a0vai511j9-rocm-llvm-llvm-6.2.2/lib -L/nix/store/ncjma3lhr1sf6mqaczl9mfhjmp6ix58w-zlib-1.3.1/lib -L/nix/store/c7qx9iq6lqfjd4dxrdd6723l6rvzsj5s-ncurses-6.4.20221231/lib -L/nix/store/pf3q2h8s1l49a9qvz9y99p4m8mpcy31m-rocm-llvm-lld-6.2.2/lib -L/nix/store/bd6lqyiqwk8f0im0crmpyaklk5qdw22g-msgpack-3.3.0/lib -L/nix/store/xwhr69r2dzv1v86siykzr4csjymgra2m-libxml2-2.13.4/lib -L/nix/store/3bvxjkkmwlymr0fssczhgi39c3aj1l7i-glibc-2.40-36/lib -L/nix/store/v7dbfh5n7az2lcap0z1cv4jq0bikya8p-gcc-13.3.0//lib -L/nix/store/c91k93z9yr1cpia2pf5dr226imglrkg5-gcc-13.3.0-libgcc/lib -L/nix/store/j226zjj87mfvss3407ndwxag18ly1vyp-rocm-llvm-clang-unwrapped-6.2.2/lib -dynamic-linker=/nix/store/3bvxjkkmwlymr0fssczhgi39c3aj1l7i-glibc-2.40-36/lib/ld-linux-x86-64.so.2 /tmp/nix-build-81386-0/test-da61cc.o -lunwind -rpath /nix/store/44cn078nngvm2401rda62b3553n120gi-rocblas-6.2.2/lib -plugin-opt=mcpu=gfx803 -o test
ld.lld: error: /nix/store/m1n213xhzip6sbrvzfzmv1rcgd5vckw2-rocm-llvm-runtimes-6.2.2/lib/libunwind.so is incompatible with /tmp/nix-build-81386-0/test-da61cc.o
clang++: error: ld.lld command failed with exit code 1 (use -v to see invocation)
可以看到编译过程分成了两步,第一步是编译出 .o 文件,第二步是链接,链接的时候发现 libunwind.so 和 .o 文件不兼容,所以报错。思考一番后恍然大悟:我们只编译了一条汇编语句,为什么还要链接呢?于是手动执行第一步,编译出 .o 文件:
$ "/nix/store/j226zjj87mfvss3407ndwxag18ly1vyp-rocm-llvm-clang-unwrapped-6.2.2/bin/clang++" -cc1as -triple amdgcn-unknown-amdhsa -filetype obj -main-file-name test.asm -target-cpu gfx803 -fdebug-compilation-dir=/home/jachinshen/Documents/nixpkgs/build/rocblas/source/build -dwarf-debug-producer "clang version 18.0.0git" -dwarf-version=5 -mrelocation-model pic -o /tmp/nix-build-81386-0/test-da61cc.o test.asm
没有报错,说明编译成功。仔细观察会发现,这里使用的是 clang-unwrapped
里的 clang++
而不是 clang-unwrapped-wrapper
里面的 clang++
。这里就又涉及到 nix 的编译体系了,clang-unwrapped
是一个编译器,而 clang-unwrapped-wrapper
是一个编译器包装器,它会在编译的时候自动添加一些环境变量,应该就是这层包装让 clang++
自动调用了链接器,但是我们这里只需要编译一下汇编代码,所以直接使用 clang-unwrapped
里的 clang++
就可以了。这里因为同名不好区分,方便起见笔者直接调用了 clang-18
来使用内部的 clang++
。修改之后重新编译,成功通过。
但是运行测试时,又出现来新的错误,定位到 clients/include/blas_ex/testing_gemm_ex_get_solutions.hpp:
template <typename Ti, typename To, typename Tc>
void testing_gemm_ex_get_solutions(const Arguments& arg)
{
......
#define GEMM_EX_ARGS \
handle, transA, transB, M, N, K, &h_alpha_Tc, dA, arg.a_type, lda, dB, arg.b_type, ldb, \
&h_beta_Tc, dC, arg.c_type, ldc, dDref, d_type, ldd, arg.compute_type, algo
#define rocblas_gemm_exM(...) rocblas_gemm_ex(__VA_ARGS__)
// Get number of solutions
rocblas_int size;
CHECK_ROCBLAS_ERROR(
rocblas_gemm_ex_get_solutions(GEMM_EX_ARGS, rocblas_gemm_flags_none, NULL, &size));
rocblas_int size_large = size * 2;
std::vector<rocblas_int> ary(size_large, -1);
if(size >= 2)
{
rocblas_int size_small = size / 2;
CHECK_ROCBLAS_ERROR(rocblas_gemm_ex_get_solutions(
GEMM_EX_ARGS, rocblas_gemm_flags_none, ary.data(), &size_small));
EXPECT_EQ(ary[size_small], -1);
}
CHECK_ROCBLAS_ERROR(
rocblas_gemm_ex_get_solutions(GEMM_EX_ARGS, rocblas_gemm_flags_none, ary.data(), &size));
EXPECT_EQ(ary[size], -1);
CHECK_ROCBLAS_ERROR(rocblas_gemm_ex_get_solutions(
GEMM_EX_ARGS, rocblas_gemm_flags_none, ary.data(), &size_large));
EXPECT_EQ(ary[size], -1);
......
}
gdb 调试发现存在 size=0
的情况,导致 ary[size]
访问越界,从而崩溃。看来这个测试没有考虑这种情况,手动增加了 size=0
时提前退出,避免崩溃。下一个错误定位到 clients/include/blas3/testing_gemm.hpp:
template <typename T>
void testing_gemm(const Arguments& arg)
{
auto rocblas_gemm_fn = arg.api == FORTRAN ? rocblas_gemm<T, true> : rocblas_gemm<T, false>;
......
if(arg.unit_check || arg.norm_check)
{
host_matrix<T> hC_gold(M, N, ldc);
hC_gold = hC_1;
// ROCBLAS rocblas_pointer_mode_host
if(arg.pointer_mode_host)
{
CHECK_ROCBLAS_ERROR(rocblas_set_pointer_mode(handle, rocblas_pointer_mode_host));
handle.pre_test(arg);
if(arg.api != INTERNAL)
{
CHECK_ROCBLAS_ERROR(rocblas_gemm_fn(
handle, transA, transB, M, N, K, &h_alpha, dA, lda, dB, ldb, &h_beta, dC, ldc));
}
else
{
// using arg.stride_x,y,d for offset testing
rocblas_stride offsetA = arg.stride_x;
rocblas_stride offsetB = arg.stride_y;
rocblas_stride offsetC = arg.stride_d;
constexpr rocblas_stride strideA = 0, strideB = 0, strideC = 0;
constexpr rocblas_int batch_count = 1;
CHECK_ROCBLAS_ERROR(rocblas_internal_gemm_template<T>(handle,
transA,
transB,
M,
N,
K,
&h_alpha,
(const T*)dA + offsetA,
-offsetA,
lda,
strideA,
(const T*)dB + offsetB,
-offsetB,
ldb,
strideB,
&h_beta,
(T*)dC + offsetC,
-offsetC,
ldc,
strideC,
batch_count));
}
handle.post_test(arg);
CHECK_HIP_ERROR(hC_1.transfer_from(dC));
}
}
......
}
运行 gdb 调试:
295 CHECK_ROCBLAS_ERROR(rocblas_gemm_fn(
(gdb) p rocblas_gemm_fn
$1 = (rocblas_status (*)(rocblas_handle, rocblas_operation, rocblas_operation, rocblas_int, rocblas_int, rocblas_int, const float *,
const float *, rocblas_int, const float *, rocblas_int, const float *, float *,
rocblas_int)) 0x7ffff7061320 <rocblas_sgemm(rocblas_handle, rocblas_operation, rocblas_operation, rocblas_int, rocblas_int, rocblas_int, float const*, float const*, rocblas_int, float const*, rocblas_int, float const*, float*, rocblas_int)>
(gdb) p arg
$2 = (const Arguments &) @0x5555703bbfc8: {static c_scan_value = -999, function = "gemm", '\000' <repeats 59 times>,
name = "alpha_beta_zero_NaN", '\000' <repeats 44 times>, category = "pre_checkin", '\000' <repeats 52 times>,
known_bug_platforms = '\000' <repeats 63 times>, alpha = -nan(0x8000000000000), alphai = 0, beta = -nan(0x8000000000000), betai = 0,
stride_a = 16384, stride_b = 8192, stride_c = 32768, stride_d = 32768, stride_x = 0, stride_y = 0, user_allocated_workspace = 0, M = 256,
N = 128, K = 64, KL = 128, KU = 128, lda = 256, ldb = 64, ldc = 256, ldd = 256, incx = 0, incy = 0, batch_count = 1, scan = -999, iters = 10,
cold_iters = 2, algo = 0, solution_index = 0, geam_ex_op = rocblas_geam_ex_operation_min_plus, flags = rocblas_gemm_flags_none,
a_type = rocblas_datatype_f32_r, b_type = rocblas_datatype_f32_r, c_type = rocblas_datatype_f32_r, d_type = rocblas_datatype_f32_r,
compute_type = rocblas_datatype_f32_r, composite_compute_type = 4294967295, initialization = rocblas_initialization::rand_int,
atomics_mode = rocblas_atomics_allowed, os_flags = ALL, gpu_arch = "\000\000\000", api = C, pad = 4096, math_mode = 0, flush_batch_count = 1,
flush_memory_size = 0, threads = 0, streams = 0, devices = 0 '\000', norm_check = 0 '\000', unit_check = 1 '\001', res_check = 0 '\000',
timing = 0 '\000', transA = 78 'N', transB = 78 'N', side = 42 '*', uplo = 42 '*', diag = 42 '*', pointer_mode_host = true,
pointer_mode_device = true, stochastic_rounding = false, c_noalias_d = false, outofplace = false, HMM = false, graph_test = false,
repeatability_check = false}
(gdb) p arg.api
$3 = C
(gdb) p rocblas_gemm_fn(handle, transA, transB, M, N, K, &h_alpha, dA, lda, dB, ldb, &h_beta, dC, ldc)
Memory access fault by GPU node-1 (Agent handle: 0x5555588f62c0) on address 0x9000. Reason: Page not present or supervisor privilege.
$4 = rocblas_status_success
触发了如下错误:
rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'N', transB: 'N', M: 256, N: 256, K: 64, alpha: 1, row_stride_a: 1, col_stride_a: 256, row_stride_b: 1, col_stride_b: 128, row_stride_c: 1, col_stride_c: 256, row_stride_d: 1, col_stride_d: 256, beta: 0, batch_count: 1, strided_batch: true, stride_a: 0, stride_b: 0, stride_c: 0, stride_d: 0, atomics_mode: atomics_allowed }
网上搜索后终于找到了解决方案:
- https://github.com/ROCm/Tensile/issues/2044:这是一个 fp16 的 bug
- https://gitlab.archlinux.org/archlinux/packaging/packages/rocblas/-/issues/2:Archlinux 打包时也遇到了这个问题
解决方案就在 Archlinux 仓库的 patch 中,主要就是去除 -mf16c
编译选项。重新编译后,问题解决。
ONNXRuntime 验证
把 ROCm 升级到 6.2.2 后,就可以测试 ONNXRuntime 了。笔者原本尝试自己编译,但是总有各种问题,后来灵光一闪,AMD 官方有提供编译好的 wheel 包,只要让内置的动态链接库指向自己编译的 ROCm 即可。具体代码位于 onnxruntime-rocm-nix-bin,通过 patchelf
修改 libonnxruntime_provider_rocm.so
的动态链接库,使其指向自己编译的 ROCm,然后就可以正常运行 ONNXRuntime 了。