Jachin Shen
Recap

In the previous two posts, we upgraded the base ROCm LLVM build environment and got HIP programs running. Now we continue upgrading the remaining ROCm 6.2.2 components, the biggest of which are CK (composable_kernel), MIOpen, and MIGraphX.

Upgrading rocMLIR

[Tracking] ROCm packages mentions two patches; with them applied, the missing-header errors disappear and the build completes normally.

Upgrading composable_kernel

CMake HIP Language support

After bumping the source version, problems appear already at the CMake stage:

-- The HIP compiler identification is unknown
-- Detecting HIP compiler ABI info
-- Detecting HIP compiler ABI info - failed
-- Check for working HIP compiler: /nix/store/h0aa4jal4731bhrc5k865y927qndsjwa-rocm-llvm-clang-wrapper-6.2.2/bin/clang++
-- Check for working HIP compiler: /nix/store/h0aa4jal4731bhrc5k865y927qndsjwa-rocm-llvm-clang-wrapper-6.2.2/bin/clang++ - broken
...
    clang++: error: cannot find ROCm device library; provide its path via '--rocm-path' or '--rocm-device-lib-path', or pass '-nogpulib' to build without ROCm device library

Checking CMakeLists.txt, the culprit is the line project(composable_kernel VERSION ${version} LANGUAGES CXX HIP). Commit 7b027d5 refactored the build to use CMake's new first-class HIP language support. Version 6.0.2 did not use this feature, which is why the old build environment worked fine.

The cause is found, but the fix is not so simple. Digging through the CMake sources, the HIP-related logic lives mainly in:

  1. https://gitlab.kitware.com/cmake/cmake/-/blob/v3.30.0/Modules/CMakeDetermineHIPCompiler.cmake?ref_type=tags
  2. https://gitlab.kitware.com/cmake/cmake/-/blob/v3.30.0/Modules/CMakeTestHIPCompiler.cmake?ref_type=tags

The overall flow: CMake looks for a HIP compiler in the environment, then checks its ABI; if that fails, it falls back to compiling a trivial HIP program. Here CMake finds the clang++ from ROCm LLVM, but that clang++ comes with no supporting environment, so the compile fails. What we actually want CMake to pick up is the fully wrapped hipcc.
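As a rough model of that detection flow (purely illustrative, not CMake's actual code; all names here are made up):

```python
def determine_hip_compiler(candidates, abi_ok, can_compile):
    """Toy model of CMakeDetermineHIPCompiler/CMakeTestHIPCompiler:
    take the first candidate; it is usable only if the ABI probe or
    the fallback test-compile succeeds."""
    compiler = candidates[0]          # e.g. the first clang++ CMake finds
    if abi_ok(compiler):
        return compiler, "abi-detected"
    if can_compile(compiler):         # fallback: build a trivial HIP program
        return compiler, "test-compiled"
    return compiler, "broken"         # what we saw: clang++ without ROCm libs

# A bare clang++ with no ROCm environment fails both probes:
print(determine_hip_compiler(["clang++"], lambda c: False, lambda c: False))
# ('clang++', 'broken')
```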

Can we just tell CMake which HIP compiler to use? Adding -DCMAKE_HIP_COMPILER=hipcc gives:

CMake Error at /nix/store/irwcgpm8csj7br49jih6kdpqkaqya3vf-cmake-3.30.4/share/cmake-3.30/Modules/CMakeDetermineHIPCompiler.cmake:73 (message):
  CMAKE_HIP_COMPILER is set to the hipcc wrapper:

   hipcc

  This is not supported.  Use Clang directly, or let CMake pick a default.
Call Stack (most recent call first):
  CMakeLists.txt:26 (project)

Unfortunately not: CMake refuses the hipcc wrapper and insists on using clang++ directly.

So now we have a conflict: CMake only accepts a bare clang++. On a typical distribution this is not a problem, because the ROCm libraries are already installed system-wide and clang++ just runs. On NixOS, however, installed libraries are not exposed in the global environment; binaries must locate their libraries in the nix store via environment variables. As a result clang++ cannot find the ROCm libraries and compilation fails. The workaround is to add the ROCm library paths to the build environment's variables so that clang++ can find them:

  rocmtoolkit-merged = symlinkJoin {
    name = "rocmtoolkit-merged";

    paths = [
      rocm-core
      rocm-thunk
      rocm-device-libs
      roctracer
      rocdbgapi
      rocm-smi
      hsa-amd-aqlprofile-bin
      clr
    ];

    postBuild = ''
      rm -rf $out/nix-support
    '';
  };

  propagatedBuildInputs = [ rocmtoolkit-merged ];

  ROCM_PATH = "${rocmtoolkit-merged}";
  DEVICE_LIB_PATH = "${rocmtoolkit-merged}/amdgpu/bitcode";

This uses Nix's symlinkJoin to merge the libraries into a single tree, which is then passed to the build environment via environment variables. I first tried putting only the base library ${clr} into ROCM_PATH, but CMake still failed to detect the ABI, so in the end all the related libraries went in.

Selecting the GPU architecture

If no architecture is given, composable_kernel builds for every supported architecture by default, which makes the build extremely long (on my six-core Ryzen 3600 it had not finished after a full day; the official docs say the whole software stack needs 500+ core-hours, and composable_kernel apparently accounts for a large share of that). During debugging it is therefore best to pin the architecture to save time.

The nix file exposes a GPU_TARGETS variable for this, but after I passed in my local architecture gfx1030, the build was still very slow. ps aux | grep clang showed the compiler still churning through every supported architecture. With no other choice, back to CMakeLists.txt, where this fragment turned up:

# https://github.com/ROCm/composable_kernel/blob/665934078ecc6743e049de61e67d28d2a0e5dfe9/library/src/tensor_operation_instance/gpu/CMakeLists.txt#L40
    if(INSTANCES_ONLY)
        set(INST_TARGETS ${DEFAULT_GPU_TARGETS})
    else()
        set(INST_TARGETS ${GPU_TARGETS})
    endif()

The architecture list the compiler actually uses is INST_TARGETS, whose value depends on INSTANCES_ONLY: if INSTANCES_ONLY is true, INST_TARGETS is DEFAULT_GPU_TARGETS; otherwise it is GPU_TARGETS. Earlier I had set INSTANCES_ONLY to true to skip the hardware tests, so the compiler was building everything in DEFAULT_GPU_TARGETS. For debugging, then, INSTANCES_ONLY must be set to false and GPU_TARGETS specified; only then does the compiler build just the requested architecture.
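The selection rule above can be condensed into a small sketch (a hypothetical Python re-implementation, not the actual build code; the DEFAULT_GPU_TARGETS list here is a made-up stand-in for the full upstream list):

```python
# Sketch of how composable_kernel picks the architectures to compile for.
DEFAULT_GPU_TARGETS = ["gfx908", "gfx90a", "gfx1030", "gfx1100"]

def inst_targets(instances_only, gpu_targets):
    """Mirror the CMake rule: INSTANCES_ONLY overrides GPU_TARGETS."""
    if instances_only:
        return DEFAULT_GPU_TARGETS  # the user's GPU_TARGETS is silently ignored
    return gpu_targets

# With INSTANCES_ONLY=ON, passing GPU_TARGETS=gfx1030 has no effect:
print(inst_targets(True, ["gfx1030"]))   # the full default list
print(inst_targets(False, ["gfx1030"]))  # ['gfx1030']
```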

The rationale behind this design presumably assumes two scenarios:

  1. INSTANCES_ONLY=true: the developer has no local hardware; build all supported architectures and skip the tests
  2. INSTANCES_ONLY=false: the developer has local hardware; build the specified architecture and run the tests

The documentation never mentions this, though. It only says that an empty GPU_TARGETS defaults to all supported architectures, so a reader who sets GPU_TARGETS naturally assumes only those architectures get built, completely unaware of the effect of INSTANCES_ONLY.

With the cause found, I set INSTANCES_ONLY=false, checked the compiler arguments again with ps aux | grep clang, and confirmed that only gfx1030 was being built; the build time dropped noticeably.

Upgrading MIOpen

Back to fixing composable_kernel

After bumping the source version as usual (MIOpen's source tree takes a lot of disk space; I had to enlarge the root partition for it, corrupted the filesystem in the process, and ended up reinstalling the system, as described in the NixOS root-partition reinstall post), the configure stage fails in composable_kernel's cmake files:

CMake Error at /nix/store/50nxgvdjdcr2wdhbx5zn3df91zhbndbr-unpack-composable_kernel-6.2.2/lib/cmake/composable_kernel/composable_kernelConfig.cmake:24 (include):
  include could not find requested file:

    /nix/store/50nxgvdjdcr2wdhbx5zn3df91zhbndbr-unpack-composable_kernel-6.2.2/lib/cmake/composable_kernel/composable_kerneldevice_conv_operationsTargets.cmake
Call Stack (most recent call first):
  CMakeLists.txt:305 (find_package)


CMake Error at /nix/store/50nxgvdjdcr2wdhbx5zn3df91zhbndbr-unpack-composable_kernel-6.2.2/lib/cmake/composable_kernel/composable_kernelConfig.cmake:24 (include):
  include could not find requested file:

    /nix/store/50nxgvdjdcr2wdhbx5zn3df91zhbndbr-unpack-composable_kernel-6.2.2/lib/cmake/composable_kernel/composable_kerneldevice_contraction_operationsTargets.cmake
Call Stack (most recent call first):
  CMakeLists.txt:305 (find_package)

That is, two files are missing from the composable_kernel package: composable_kerneldevice_conv_operationsTargets.cmake and composable_kerneldevice_contraction_operationsTargets.cmake. Looking inside the package directory:

nix/store/50nxgvdjdcr2wdhbx5zn3df91zhbndbr-unpack-composable_kernel-6.2.2/lib/
├── cmake
│   └── composable_kernel
│       ├── composable_kernelConfig.cmake
│       ├── composable_kernelConfigVersion.cmake
│       ├── composable_kerneldevice_gemm_operationsTargets.cmake
│       ├── composable_kerneldevice_gemm_operationsTargets-release.cmake
│       ├── composable_kerneldevice_other_operationsTargets.cmake
│       ├── composable_kerneldevice_other_operationsTargets-release.cmake
│       ├── composable_kerneldevice_reduction_operationsTargets.cmake
│       ├── composable_kerneldevice_reduction_operationsTargets-release.cmake
│       ├── composable_kernelutilityTargets.cmake
│       └── composable_kernelutilityTargets-release.cmake
├── libdevice_gemm_operations.a
├── libdevice_other_operations.a
├── libdevice_reduction_operations.a
└── libutility.a

3 directories, 14 files

Indeed, neither file is there, and the corresponding library files do not exist either. So the problem lies in the composable_kernel build itself: these two libraries were never generated.

Still the GPU architecture problem

Back to the cmake files again; the core logic is:

# https://github.com/ROCm/composable_kernel/blob/665934078ecc6743e049de61e67d28d2a0e5dfe9/library/src/tensor_operation_instance/gpu/CMakeLists.txt#L168
        if(("${cmake_instance}" MATCHES "quantization") AND (DEFINED DTYPES) AND (NOT DTYPES MATCHES "int8"))
            message("quantization instances will not be built!")
            set(add_inst 0)
        endif()
        if(("${cmake_instance}" MATCHES "ONLY DL_KERNELS") AND (NOT DEFINED DL_KERNELS))
            message("Found only dl instances, but DL_KERNELS is not set. Skipping.")
            set(add_inst 0)
        endif()
        if(("${cmake_instance}" MATCHES "ONLY XDL_KERNELS") AND (NOT INST_TARGETS MATCHES "gfx9"))
            message("Found only xdl instances, but gfx9 is not on the targets list. Skipping.")
            set(add_inst 0)
        endif()
        if(("${cmake_instance}" MATCHES "ONLY WMMA_KERNELS") AND (NOT INST_TARGETS MATCHES "gfx11"))
            message("Found only wmma instances, but gfx11 is not on the targets list. Skipping.")
            set(add_inst 0)
        endif()
        if(("${cmake_instance}" MATCHES "ONLY XDL_AND_DL_KERNELS") AND (NOT DEFINED DL_KERNELS) AND (NOT INST_TARGETS MATCHES "gfx9"))
            message("Found only xdl and dl instances, but gfx9 is not on the targets listand DL_KERNELS is not set. Skipping.")
            set(add_inst 0)
        endif()
        if(("${cmake_instance}" MATCHES "ONLY XDL_AND_WMMA_KERNELS") AND (NOT INST_TARGETS MATCHES "gfx11") AND (NOT INST_TARGETS MATCHES "gfx9"))
            message("Found only xdl and wmma instances, but gfx11 and gfx9 are not on the targets list. Skipping.")
            set(add_inst 0)
        endif()
        if(("${cmake_instance}" MATCHES "XDL_DL_WMMA_KERNELS") AND (NOT INST_TARGETS MATCHES "gfx11") AND (NOT INST_TARGETS MATCHES "gfx9") AND (NOT DEFINED DL_KERNELS))
            message("Found xdl, dl, and wmma instances, but none of those meet the target list. Skipping.")
            set(add_inst 0)
        endif()
        if((add_inst EQUAL 1))
            get_filename_component(target_dir ${subdir_path} NAME)
            add_subdirectory(${target_dir})
            if("${cmake_instance}" MATCHES "gemm")
                list(APPEND CK_DEVICE_GEMM_INSTANCES $<TARGET_OBJECTS:device_${target_dir}_instance>)
            elseif("${cmake_instance}" MATCHES "conv")
                 list(APPEND CK_DEVICE_CONV_INSTANCES $<TARGET_OBJECTS:device_${target_dir}_instance>)
            elseif("${cmake_instance}" MATCHES "mha")
                 list(APPEND CK_DEVICE_MHA_INSTANCES $<TARGET_OBJECTS:device_${target_dir}_instance>)
            elseif("${cmake_instance}" MATCHES "contr")
                 list(APPEND CK_DEVICE_CONTRACTION_INSTANCES $<TARGET_OBJECTS:device_${target_dir}_instance>)
            elseif("${cmake_instance}" MATCHES "reduce")
                 list(APPEND CK_DEVICE_REDUCTION_INSTANCES $<TARGET_OBJECTS:device_${target_dir}_instance>)
            else()
                 list(APPEND CK_DEVICE_OTHER_INSTANCES $<TARGET_OBJECTS:device_${target_dir}_instance>)
            endif()
            message("add_instance_directory ${subdir_path}")
        else()
            message("skip_instance_directory ${subdir_path}")
        endif()

Here ${cmake_instance} is the content of each subdirectory's CMakeLists.txt, for example:

# ONLY XDL_KERNELS
set(BATCHED_GEMM_INSTANCES)
list(APPEND BATCHED_GEMM_INSTANCES device_batched_gemm_xdl_f16_f16_f16_gmk_gkn_gmn_instance.cpp
                                      device_batched_gemm_xdl_f16_f16_f16_gmk_gnk_gmn_instance.cpp
                                      device_batched_gemm_xdl_f16_f16_f16_gkm_gkn_gmn_instance.cpp
                                      device_batched_gemm_xdl_f16_f16_f16_gkm_gnk_gmn_instance.cpp
                                      device_batched_gemm_xdl_bf16_bf16_bf16_gmk_gkn_gmn_instance.cpp
                                      device_batched_gemm_xdl_bf16_bf16_bf16_gmk_gnk_gmn_instance.cpp
                                      device_batched_gemm_xdl_bf16_bf16_bf16_gkm_gkn_gmn_instance.cpp
                                      device_batched_gemm_xdl_bf16_bf16_bf16_gkm_gnk_gmn_instance.cpp
                                      device_batched_gemm_xdl_f32_f32_f32_gmk_gkn_gmn_instance.cpp
                                      device_batched_gemm_xdl_f32_f32_f32_gmk_gnk_gmn_instance.cpp
                                      device_batched_gemm_xdl_f32_f32_f32_gkm_gkn_gmn_instance.cpp
                                      device_batched_gemm_xdl_f32_f32_f32_gkm_gnk_gmn_instance.cpp
                                      device_batched_gemm_xdl_int8_int8_int8_gmk_gkn_gmn_instance.cpp
                                      device_batched_gemm_xdl_int8_int8_int8_gmk_gnk_gmn_instance.cpp
                                      device_batched_gemm_xdl_int8_int8_int8_gkm_gkn_gmn_instance.cpp
                                      device_batched_gemm_xdl_int8_int8_int8_gkm_gnk_gmn_instance.cpp)
add_instance_library(device_batched_gemm_instance ${BATCHED_GEMM_INSTANCES})

The comment on the first line marks the kernel type: XDL kernels need the gfx9 family and WMMA kernels need the gfx11 family. If no matching kernel variant exists, the corresponding instance is not generated. My gfx1030 belongs to neither family, so the DL_KERNELS option has to be enabled in CMake.
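The gating rules can be modeled roughly like this (an illustrative Python sketch of the skip logic, not upstream code):

```python
def should_build(header, targets, dl_kernels):
    """Rough model of the skip rules in the instance CMakeLists.
    header     -- first-line comment of an instance dir, e.g. "# ONLY XDL_KERNELS"
    targets    -- the INST_TARGETS list joined as one string, e.g. "gfx1030"
    dl_kernels -- whether the DL_KERNELS option is enabled
    """
    has_gfx9 = "gfx9" in targets    # XDL (MFMA) kernels need gfx9
    has_gfx11 = "gfx11" in targets  # WMMA kernels need gfx11
    if "ONLY XDL_KERNELS" in header:
        return has_gfx9
    if "ONLY WMMA_KERNELS" in header:
        return has_gfx11
    if "ONLY DL_KERNELS" in header:
        return dl_kernels
    if "ONLY XDL_AND_DL_KERNELS" in header:
        return has_gfx9 or dl_kernels
    return True  # generic instances build for any target

# gfx1030 is neither gfx9 nor gfx11, so XDL-only instances are skipped,
# while DL instances come back once DL_KERNELS is on:
print(should_build("# ONLY XDL_KERNELS", "gfx1030", False))  # False
print(should_build("# ONLY DL_KERNELS", "gfx1030", True))    # True
```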

After enabling DL_KERNELS and rebuilding, library files were still missing. Going back to the CK sources, it turns out the DL kernels do not implement every operator; conv1d_bwd_data, for example, only has an XDL implementation. For now the only way out is to add another architecture so that CK generates the complete set of libraries downstream MIOpen needs. I chose gfx90a.

Building MIOpen

With the CK problems solved, rebuild and inspect the resulting shared library:

$ ldd -r /nix/store/60riwd6fsp2kb314n6hs3jhlfjkfpdla-miopen-6.2.2/lib/libMIOpen.so
linux-vdso.so.1 (0x000079b29f972000)
        libzstd.so.1 => /nix/store/5hg6h4zjxc3ax7j4ywn6ksd509yl4pmd-zstd-1.5.6/lib/libzstd.so.1 (0x000079b26f12c000)
        libdl.so.2 => /nix/store/3bvxjkkmwlymr0fssczhgi39c3aj1l7i-glibc-2.40-36/lib/libdl.so.2 (0x000079b29f967000)
        libbz2.so.1 => /nix/store/6amvjdayglsifq4lb2lrj4q5gwf4dicd-bzip2-1.0.8/lib/libbz2.so.1 (0x000079b29f954000)
        libhiprtc.so.6 => /nix/store/52ylf6idbizs4ms7wd80y6l2z9k6mgc2-clr-6.2.2/lib/libhiprtc.so.6 (0x000079b26ef4e000)
        libamd_comgr.so.2 => /nix/store/wqarg6yfjm9m6rw82f7qnp7fvw9dxach-rocm-llvm-comgr-6.2.2/lib/libamd_comgr.so.2 (0x000079b266200000)
        librocblas.so.4 => /nix/store/z73h13qlf5lp0r0dym87n7hzim8ixk87-rocblas-6.2.2/lib/librocblas.so.4 (0x000079b23a200000)
        libsqlite3.so.0 => /nix/store/b4cqpglpjvkjhb8l4dh1ammpg8gk6h9p-sqlite-3.46.1/lib/libsqlite3.so.0 (0x000079b23a097000)
        librt.so.1 => /nix/store/3bvxjkkmwlymr0fssczhgi39c3aj1l7i-glibc-2.40-36/lib/librt.so.1 (0x000079b29f94d000)
        libroctx64.so.4 => /nix/store/ymihn5p7a9ix9jciqf3i3ljrdxxj3xiy-rocmtoolkit-merged/lib/libroctx64.so.4 (0x000079b29f946000)
        libamdhip64.so.6 => /nix/store/52ylf6idbizs4ms7wd80y6l2z9k6mgc2-clr-6.2.2/lib/libamdhip64.so.6 (0x000079b238800000)
        libunwind.so.1 => /nix/store/m1n213xhzip6sbrvzfzmv1rcgd5vckw2-rocm-llvm-runtimes-6.2.2/lib/libunwind.so.1 (0x000079b29f937000)
        libm.so.6 => /nix/store/3bvxjkkmwlymr0fssczhgi39c3aj1l7i-glibc-2.40-36/lib/libm.so.6 (0x000079b239fb0000)
        libpthread.so.0 => /nix/store/3bvxjkkmwlymr0fssczhgi39c3aj1l7i-glibc-2.40-36/lib/libpthread.so.0 (0x000079b29f932000)
        libc.so.6 => /nix/store/3bvxjkkmwlymr0fssczhgi39c3aj1l7i-glibc-2.40-36/lib/libc.so.6 (0x000079b238607000)
        /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib64/ld-linux-x86-64.so.2 (0x000079b29f974000)
        libhsa-runtime64.so.1 => /nix/store/aq0s19mgpqvq6qwwwcniq03vlx1phd3j-rocm-runtime-6.2.2/lib/libhsa-runtime64.so.1 (0x000079b238200000)
        libnuma.so.1 => /nix/store/jljd230kkw0pxdzbvcc5vdgvvrqq0x60-numactl-2.0.18/lib/libnuma.so.1 (0x000079b29f921000)
        libz.so.1 => /nix/store/ncjma3lhr1sf6mqaczl9mfhjmp6ix58w-zlib-1.3.1/lib/libz.so.1 (0x000079b26ef30000)
        libncursesw.so.6 => /nix/store/c7qx9iq6lqfjd4dxrdd6723l6rvzsj5s-ncurses-6.4.20221231/lib/libncursesw.so.6 (0x000079b239f3a000)
        libstdc++.so.6 => /nix/store/s94fwp43xhzkvw8l8nqslskib99yifzi-gcc-13.3.0-lib/lib/libstdc++.so.6 (0x000079b237e00000)
        libgcc_s.so.1 => /nix/store/s94fwp43xhzkvw8l8nqslskib99yifzi-gcc-13.3.0-lib/lib/libgcc_s.so.1 (0x000079b26ef0b000)
        librocprofiler-register.so.0 => /nix/store/l7h1x86mh3ajhcy7wqyky8h39v7sjcl9-rocprofiler-register-6.2.2/lib/librocprofiler-register.so.0 (0x000079b26eed9000)
        libelf.so.1 => /nix/store/p2i9dd3fslfhsc593h2xfs4s75lzbpki-elfutils-0.191/lib/libelf.so.1 (0x000079b2661e3000)
        libdrm.so.2 => /nix/store/sbm8cmk0i6xz6af63cgmhqr26gfdcjgq-libdrm-2.4.123/lib/libdrm.so.2 (0x000079b29f905000)
        libdrm_amdgpu.so.1 => /nix/store/sbm8cmk0i6xz6af63cgmhqr26gfdcjgq-libdrm-2.4.123/lib/libdrm_amdgpu.so.1 (0x000079b2661d7000)
        libatomic.so.1 => /nix/store/s94fwp43xhzkvw8l8nqslskib99yifzi-gcc-13.3.0-lib/lib/libatomic.so.1 (0x000079b239f2f000)
        libfmt.so.10 => /nix/store/5796jkcv6398ygfzkbkd6g4lpfbig19r-fmt-10.2.1/lib/libfmt.so.10 (0x000079b2381db000)
        libglog.so.1 => /nix/store/i159a630nxxwsb1yhqfjcmjsnwjw2lay-glog-0.6.0/lib/libglog.so.1 (0x000079b23819f000)
        libgflags.so.2.2 => /nix/store/halb5g04abisgmk3fyhlp8v36mad9xdm-gflags-2.2.2/lib/libgflags.so.2.2 (0x000079b238170000)
        liblzma.so.5 => /nix/store/wp5xd6g6v3mafvdh2185v19diqqf0cm2-xz-5.6.3/lib/liblzma.so.5 (0x000079b23813f000)
undefined symbol: LLVMInitializeX86TargetInfo   (/nix/store/60riwd6fsp2kb314n6hs3jhlfjkfpdla-miopen-6.2.2/lib/libMIOpen.so)
undefined symbol: LLVMInitializeX86Target       (/nix/store/60riwd6fsp2kb314n6hs3jhlfjkfpdla-miopen-6.2.2/lib/libMIOpen.so)
undefined symbol: LLVMInitializeX86TargetMC     (/nix/store/60riwd6fsp2kb314n6hs3jhlfjkfpdla-miopen-6.2.2/lib/libMIOpen.so)
undefined symbol: LLVMInitializeX86AsmPrinter   (/nix/store/60riwd6fsp2kb314n6hs3jhlfjkfpdla-miopen-6.2.2/lib/libMIOpen.so)
undefined symbol: LLVMInitializeX86AsmParser    (/nix/store/60riwd6fsp2kb314n6hs3jhlfjkfpdla-miopen-6.2.2/lib/libMIOpen.so)

There are undefined symbols, which make downstream onnxruntime fail at runtime. Checking the dependencies traced the problem to rocMLIR:

$ nm librockCompiler.a | grep LLVMInit
0000000000000000 T LLVMInitializeAMDGPUAsmParser
0000000000000000 T LLVMInitializeAMDGPUAsmPrinter
0000000000000000 T LLVMInitializeAMDGPUTarget
0000000000000000 T LLVMInitializeAMDGPUTargetMC
0000000000000000 T LLVMInitializeAMDGPUDisassembler
0000000000000000 T LLVMInitializeAMDGPUTargetInfo
nm: TFLiteUtils.cpp.o: no symbols
0000000000000000 T LLVMInitializeFunctionPassManager
nm: blake3_neon.c.o: no symbols
nm: AutoConvert.cpp.o: no symbols
                 U LLVMInitializeAMDGPUAsmParser
                 U LLVMInitializeAMDGPUAsmPrinter
                 U LLVMInitializeAMDGPUTarget
                 U LLVMInitializeAMDGPUTargetInfo
                 U LLVMInitializeAMDGPUTargetMC
                 U LLVMInitializeAMDGPUAsmParser
                 U LLVMInitializeAMDGPUAsmPrinter
                 U LLVMInitializeAMDGPUTarget
                 U LLVMInitializeAMDGPUTargetInfo
                 U LLVMInitializeAMDGPUTargetMC
                 U LLVMInitializeAMDGPUAsmParser
                 U LLVMInitializeAMDGPUAsmPrinter
                 U LLVMInitializeAMDGPUTarget
                 U LLVMInitializeAMDGPUTargetInfo
                 U LLVMInitializeAMDGPUTargetMC
                 U LLVMInitializeX86AsmParser
                 U LLVMInitializeX86AsmPrinter
                 U LLVMInitializeX86Target
                 U LLVMInitializeX86TargetInfo
                 U LLVMInitializeX86TargetMC
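A symbol is genuinely missing from the archive only when it appears as U (referenced) somewhere but as T (defined) nowhere. That bookkeeping over nm output can be scripted (a small illustrative sketch, not part of the build):

```python
def unresolved(nm_output):
    """Return symbols referenced (U) but never defined (T) in nm output."""
    defined, referenced = set(), set()
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[-2] == "T":
            defined.add(parts[-1])
        elif len(parts) >= 2 and parts[-2] == "U":
            referenced.add(parts[-1])
    return referenced - defined

sample = """0000000000000000 T LLVMInitializeAMDGPUTarget
                 U LLVMInitializeAMDGPUTarget
                 U LLVMInitializeX86Target"""
print(unresolved(sample))  # {'LLVMInitializeX86Target'}
```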

Searching for the LLVMInitialize symbols led to mlir/tools/rocmlir-lib/librockcompiler_deps.cmake:

### DO NOT EDIT!
### Generated by mlir/utils/jenkins/static-checks/get_fat_library_deps_list.pl
set(__mlir_libs
LLVMAMDGPUAsmParser
LLVMAMDGPUCodeGen
LLVMAMDGPUDesc
LLVMAMDGPUDisassembler
LLVMAMDGPUInfo
LLVMAMDGPUUtils
......
lldCommon
lldELF
)

Here is the problem: the LLVMAMDGPU libraries are listed, but the LLVMX86 ones are not, so librockCompiler.a lacks those symbols. Following the hint in the header, the key part of mlir/utils/jenkins/static-checks/get_fat_library_deps_list.pl is:

foreach (@deps) {
  last if /outputs:/;
  if (m#external/llvm-project/llvm/lib/lib(\w+)\.a#) {
    push @mlirLibs, $1;
  } elsif (m#lib/lib(\w+)\.a#) {
    push @rocmlirLibs, $1;
  }
}
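The Perl loop classifies each static-library path by where it lives; a Python sketch of the same rule (illustrative only, not upstream code):

```python
import re

def classify(dep_paths):
    """Split static-library paths into LLVM-side and rocMLIR-side name lists,
    mirroring the two regexes in get_fat_library_deps_list.pl."""
    mlir_libs, rocmlir_libs = [], []
    for path in dep_paths:
        m = re.search(r"external/llvm-project/llvm/lib/lib(\w+)\.a", path)
        if m:
            mlir_libs.append(m.group(1))
            continue
        m = re.search(r"lib/lib(\w+)\.a", path)
        if m:
            rocmlir_libs.append(m.group(1))
    return mlir_libs, rocmlir_libs

# Only libraries actually present at generation time get recorded, which is
# why an AMDGPU-only upstream build never captured the LLVMX86* libraries:
paths = [
    "external/llvm-project/llvm/lib/libLLVMAMDGPUCodeGen.a",
    "build/lib/liblldELF.a",
]
print(classify(paths))  # (['LLVMAMDGPUCodeGen'], ['lldELF'])
```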

The script enumerates the static libraries under external/llvm-project/llvm/lib and records them as dependencies. Presumably the upstream developers built with only AMDGPU and no X86, so the script captured only the AMDGPU dependencies. Back to the nix configuration:

cmakeFlags = [
    "-DLLVM_TARGETS_TO_BUILD=AMDGPU;${llvmNativeTarget}"
    "-DLLVM_ENABLE_ZSTD=ON"
    "-DLLVM_ENABLE_ZLIB=ON"
    "-DLLVM_ENABLE_TERMINFO=ON"
    "-DLLVM_BUILD_LIBRARY_DIR=/build/source/build/external/llvm-project/llvm/lib"
    "-DROCM_PATH=${clr}"
    # Manually define CMAKE_INSTALL_<DIR>
    # See: https://github.com/NixOS/nixpkgs/pull/197838
    "-DCMAKE_INSTALL_BINDIR=bin"
    "-DCMAKE_INSTALL_LIBDIR=lib"
    "-DCMAKE_INSTALL_INCLUDEDIR=include"
  ] ++ lib.optionals buildRockCompiler [
    "-DBUILD_FAT_LIBROCKCOMPILER=ON"
  ] ++ lib.optionals (!buildRockCompiler) [
    "-DROCM_TEST_CHIPSET=gfx000"
  ];

Here llvmNativeTarget is X86, so librockCompiler.a depends on the LLVMX86 symbols, but the bundled mlir/tools/rocmlir-lib/librockcompiler_deps.cmake lists no such dependency, leaving the LLVMX86 symbols in librockCompiler.a undefined. There are two possible fixes:

  1. Patch mlir/tools/rocmlir-lib/librockcompiler_deps.cmake to add the LLVMX86 dependencies.
  2. Change cmakeFlags so that only AMDGPU is built, i.e. drop llvmNativeTarget from the target list.

Trying the first option: after adding the dependencies and rebuilding, check librockCompiler.a again:

$ nm librockCompiler.a | grep LLVMInit
0000000000000000 T LLVMInitializeAMDGPUAsmParser
0000000000000000 T LLVMInitializeAMDGPUAsmPrinter
0000000000000000 T LLVMInitializeAMDGPUTarget
0000000000000000 T LLVMInitializeAMDGPUTargetMC
0000000000000000 T LLVMInitializeAMDGPUDisassembler
0000000000000000 T LLVMInitializeAMDGPUTargetInfo
0000000000000000 T LLVMInitializeX86AsmParser
0000000000000000 T LLVMInitializeX86AsmPrinter
0000000000000000 T LLVMInitializeX86Target
0000000000000000 T LLVMInitializeX86TargetMC
0000000000000000 T LLVMInitializeX86Disassembler
0000000000000000 T LLVMInitializeX86TargetInfo
nm: TFLiteUtils.cpp.o: no symbols
0000000000000000 T LLVMInitializeFunctionPassManager
nm: blake3_neon.c.o: no symbols
nm: AutoConvert.cpp.o: no symbols
                 U LLVMInitializeAMDGPUAsmParser
                 U LLVMInitializeAMDGPUAsmPrinter
                 U LLVMInitializeAMDGPUTarget
                 U LLVMInitializeAMDGPUTargetInfo
                 U LLVMInitializeAMDGPUTargetMC
                 U LLVMInitializeAMDGPUAsmParser
                 U LLVMInitializeAMDGPUAsmPrinter
                 U LLVMInitializeAMDGPUTarget
                 U LLVMInitializeAMDGPUTargetInfo
                 U LLVMInitializeAMDGPUTargetMC
                 U LLVMInitializeAMDGPUAsmParser
                 U LLVMInitializeAMDGPUAsmPrinter
                 U LLVMInitializeAMDGPUTarget
                 U LLVMInitializeAMDGPUTargetInfo
                 U LLVMInitializeAMDGPUTargetMC
                 U LLVMInitializeX86AsmParser
                 U LLVMInitializeX86AsmPrinter
                 U LLVMInitializeX86Target
                 U LLVMInitializeX86TargetInfo
                 U LLVMInitializeX86TargetMC

The LLVMInitializeX86 symbols now appear as expected, confirming that this was indeed the problem.

Next, the second option: remove llvmNativeTarget from the nix configuration, rebuild, and check:

[nix-shell:/nix/store/nfw1ki7cwgv4i394jmpp4m9kj18ncd5r-rocmlir-rock-6.2.2/lib]$ nm librockCompiler.a | grep LLVMInit
0000000000000000 T LLVMInitializeAMDGPUAsmParser
0000000000000000 T LLVMInitializeAMDGPUAsmPrinter
0000000000000000 T LLVMInitializeAMDGPUTarget
0000000000000000 T LLVMInitializeAMDGPUTargetMC
0000000000000000 T LLVMInitializeAMDGPUDisassembler
0000000000000000 T LLVMInitializeAMDGPUTargetInfo
nm: TFLiteUtils.cpp.o: no symbols
0000000000000000 T LLVMInitializeFunctionPassManager
nm: AutoConvert.cpp.o: no symbols
nm: blake3_neon.c.o: no symbols
                 U LLVMInitializeAMDGPUAsmParser
                 U LLVMInitializeAMDGPUAsmPrinter
                 U LLVMInitializeAMDGPUTarget
                 U LLVMInitializeAMDGPUTargetInfo
                 U LLVMInitializeAMDGPUTargetMC
                 U LLVMInitializeAMDGPUAsmParser
                 U LLVMInitializeAMDGPUAsmPrinter
                 U LLVMInitializeAMDGPUTarget
                 U LLVMInitializeAMDGPUTargetInfo
                 U LLVMInitializeAMDGPUTargetMC
                 U LLVMInitializeAMDGPUAsmParser
                 U LLVMInitializeAMDGPUAsmPrinter
                 U LLVMInitializeAMDGPUTarget
                 U LLVMInitializeAMDGPUTargetInfo
                 U LLVMInitializeAMDGPUTargetMC

No X86 symbols remain; this is presumably the configuration the upstream developers originally built with.

Building MIGraphX

Building MIGraphX mainly ran into header-file problems: once CK support is enabled, src/targets/gpu/include/migraphx/gpu/ck.hpp needs:

#include "ck/host/device_gemm_multiple_d.hpp"
#include "ck/host/device_batched_gemm_softmax_gemm.hpp"

But the installed CK files contain neither header. After some searching, device_gemm_multiple_d.hpp turned up in codegen/include/ck/host, while device_batched_gemm_softmax_gemm.hpp was nowhere to be found. Several things are odd here:

  1. The codegen folder is not wired into the main CMakeLists.txt; it has its own standalone CMakeLists.txt
  2. Building inside the codegen directory still does not produce device_batched_gemm_softmax_gemm.hpp
  3. The folder was introduced in Add host lib #1134 back in March, yet src/targets/gpu/include/migraphx/gpu/ck.hpp was introduced last year
  4. Checking out the CK version at the commit id recorded in requirements.txt, the codegen folder does not exist at all

That is all the evidence so far; for now I leave CK support disabled and sidestep the problem.

Fixing rocBLAS's Tensile support

When I built rocBLAS earlier, enabling Tensile support broke the build, so to keep moving I temporarily disabled it. But downstream onnxruntime then fails with rocblas_status=14; the error codes are defined in library/include/internal/rocblas-types.h:

/* ============================================================================================ */
/**
 *   @brief rocblas status codes definition
 */
typedef enum rocblas_status_
{
    rocblas_status_success         = 0, /**< Success */
    rocblas_status_invalid_handle  = 1, /**< Handle not initialized, invalid or null */
    rocblas_status_not_implemented = 2, /**< Function is not implemented */
    rocblas_status_invalid_pointer = 3, /**< Invalid pointer argument */
    rocblas_status_invalid_size    = 4, /**< Invalid size argument */
    rocblas_status_memory_error    = 5, /**< Failed internal memory allocation, copy or dealloc */
    rocblas_status_internal_error  = 6, /**< Other internal library failure */
    rocblas_status_perf_degraded   = 7, /**< Performance degraded due to low device memory */
    rocblas_status_size_query_mismatch = 8, /**< Unmatched start/stop size query */
    rocblas_status_size_increased      = 9, /**< Queried device memory size increased */
    rocblas_status_size_unchanged      = 10, /**< Queried device memory size unchanged */
    rocblas_status_invalid_value       = 11, /**< Passed argument not valid */
    rocblas_status_continue            = 12, /**< Nothing preventing function to proceed */
    rocblas_status_check_numerics_fail
    = 13, /**< Will be set if the vector/matrix has a NaN/Infinity/denormal value */
    rocblas_status_excluded_from_build
    = 14, /**< Function is not available in build, likely a function requiring Tensile built without Tensile */
    rocblas_status_arch_mismatch
    = 15, /**< The function requires a feature absent from the device architecture */
} rocblas_status;

So the problem is exactly that some Tensile-backed functions were never built, which means the Tensile issue has to be fixed after all.
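For quick debugging it helps to map the raw rocblas_status integer back to its name. A minimal lookup built from the header above (only the codes relevant here are included; the explain helper is my own):

```python
# Names copied from rocblas-types.h.
ROCBLAS_STATUS = {
    0: "rocblas_status_success",
    2: "rocblas_status_not_implemented",
    6: "rocblas_status_internal_error",
    14: "rocblas_status_excluded_from_build",
}

def explain(code):
    return ROCBLAS_STATUS.get(code, "unknown rocblas_status %d" % code)

# The onnxruntime failure decodes to the "built without Tensile" case:
print(explain(14))  # rocblas_status_excluded_from_build
```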

Building rocBLAS with Tensile enabled produces:

################################################################################
# Tensile Create Library
Tensile::WARNING: Did not detect SupportedISA: [(8, 0, 3), (9, 0, 0), (9, 0, 6), (9, 0, 8), (9, 0, 10), (9, 4, 0), (9, 4, 1), (9, 4, 2), (10, 1, 0), (10, 1, 1), (10, 1, 2), (10, 3, 0), (10, 3, 1), (11, 0, 0), (11, 0, 1), (11, 0, 2)]; cannot benchmark assembly kernels.
# Found  hipcc version 6.2.41134-0
Tensile::WARNING: Architecture (8, 0, 3) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (9, 0, 0) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (9, 0, 6) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (9, 0, 8) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (9, 0, 10) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (9, 4, 0) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (9, 4, 1) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (9, 4, 2) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (10, 1, 0) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (10, 1, 1) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (10, 1, 2) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (10, 3, 0) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (10, 3, 1) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (11, 0, 0) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (11, 0, 1) not supported by ROCm 6.2.41134-0
Tensile::WARNING: Architecture (11, 0, 2) not supported by ROCm 6.2.41134-0

Later in the build, Tensile complains that kernels are missing the VectorWidthA key, which is baffling, so the warnings above are the only starting point. After some searching, the relevant code turned up in the Tensile sources:

def GetAsmCaps(isaVersion):
  """ Determine assembler capabilities by testing short instructions sequences """
  if globalParameters["AssemblerPath"] is not None:

    derivedAsmCaps = {}
    derivedAsmCaps["SupportedISA"]          = tryAssembler(isaVersion, "")
    derivedAsmCaps["HasExplicitCO"]         = tryAssembler(isaVersion, "v_add_co_u32 v0,vcc,v0,1")
    derivedAsmCaps["HasExplicitNC"]         = tryAssembler(isaVersion, "v_add_nc_u32 v0,v0,1")

    # Syntax of DirectToLds loads has changed: destination vgpr should be omitted
    # Old syntax should be removed in a future update as it is no longer supported
    derivedAsmCaps["HasDirectToLdsDest"]    = tryAssembler(isaVersion, "buffer_load_dword v40, v36, s[24:27], s28 offen offset:0 lds") \
                                           or tryAssembler(isaVersion, "buffer_load_b32 v40, v36, s[24:27], s28 offen offset:0 lds")
    derivedAsmCaps["HasDirectToLdsNoDest"]  = tryAssembler(isaVersion, "buffer_load_dword v36, s[24:27], s28 offen offset:0 lds") \
                                           or tryAssembler(isaVersion, "buffer_load_b32 v36, s[24:27], s28 offen offset:0 lds")

    derivedAsmCaps["HasAddLshl"]            = tryAssembler(isaVersion, "v_add_lshl_u32 v47, v36, v34, 0x2")
    derivedAsmCaps["HasLshlOr"]             = tryAssembler(isaVersion, "v_lshl_or_b32 v47, v36, 0x2, v34")
    derivedAsmCaps["HasSMulHi"]             = tryAssembler(isaVersion, "s_mul_hi_u32 s47, s36, s34")

    derivedAsmCaps["HasWMMA"]               = tryAssembler(isaVersion, "v_wmma_f32_16x16x16_f16 v[0:7], v[8:15], v[16:23], v[0:7]")
    derivedAsmCaps["HasMFMA"]               = tryAssembler(isaVersion, "v_mfma_f32_32x32x2bf16 a[0:31], v32, v33, a[0:31]") \
                                           or tryAssembler(isaVersion, "v_mfma_f32_32x32x1_2b_f32 a[0:31], v0, v1, a[0:31]")
    derivedAsmCaps["HasMFMA_constSrc"]      = tryAssembler(isaVersion, "v_mfma_f32_32x32x2bf16 a[0:31], v32, v33, 0") \
                                           or tryAssembler(isaVersion, "v_mfma_f32_32x32x1_2b_f32 a[0:31], v0, v1, 0")
    derivedAsmCaps["HasMFMA_vgpr"]          = tryAssembler(isaVersion, "v_mfma_f32_32x32x2bf16 v[0:31], v32, v33, v[0:31]") \
                                           or tryAssembler(isaVersion, "v_mfma_f32_32x32x1_2b_f32 v[0:31], v0, v1, v[0:31]")
    derivedAsmCaps["HasMFMA_f64"]           = tryAssembler(isaVersion, "v_mfma_f64_16x16x4f64 v[0:7], v[32:33], v[36:37], v[0:7]") \
                                           or tryAssembler(isaVersion, "v_mfma_f64_16x16x4_f64 v[0:7], v[32:33], v[36:37], v[0:7]")
    derivedAsmCaps["HasMFMA_bf16_original"] = tryAssembler(isaVersion, "v_mfma_f32_32x32x2bf16 a[0:31], v32, v33, a[0:31]")
    derivedAsmCaps["HasMFMA_bf16_1k"]       = tryAssembler(isaVersion, "v_mfma_f32_32x32x4bf16_1k a[0:31], v[32:33], v[36:37], a[0:31]")
    derivedAsmCaps["HasMFMA_xf32"]          = tryAssembler(isaVersion, "v_mfma_f32_32x32x4_xf32 a[0:15], v[32:33], v[36:37], a[0:15]")
    derivedAsmCaps["HasMFMA_f8"]            = tryAssembler(isaVersion, "v_mfma_f32_16x16x32_fp8_fp8 a[0:3], v[2:3], v[4:5], a[0:3]")
    derivedAsmCaps["HasMFMA_b8"]            = tryAssembler(isaVersion, "v_mfma_f32_16x16x32_bf8_bf8 a[0:3], v[2:3], v[4:5], a[0:3]")
    derivedAsmCaps["HasMFMA_i8_908"]        = tryAssembler(isaVersion, "v_mfma_i32_32x32x8i8 a[0:15], v2, v3, a[0:15]")
    derivedAsmCaps["HasMFMA_i8_940"]        = tryAssembler(isaVersion, "v_mfma_i32_32x32x16_i8 a[0:15], v[2:3], v[4:5], a[0:15]")

    derivedAsmCaps["v_mac_f16"]             = tryAssembler(isaVersion, "v_mac_f16 v47, v36, v34")

    derivedAsmCaps["v_fma_f16"]             = tryAssembler(isaVersion, "v_fma_f16 v47, v36, v34, v47, op_sel:[0,0,0,0]")
    derivedAsmCaps["v_fmac_f16"]            = tryAssembler(isaVersion, "v_fma_f16 v47, v36, v34")

    derivedAsmCaps["v_pk_fma_f16"]          = tryAssembler(isaVersion, "v_pk_fma_f16 v47, v36, v34, v47, op_sel:[0,0,0]")
    derivedAsmCaps["v_pk_fmac_f16"]         = tryAssembler(isaVersion, "v_pk_fma_f16 v47, v36, v34")

    derivedAsmCaps["v_mad_mix_f32"]         = tryAssembler(isaVersion, "v_mad_mix_f32 v47, v36, v34, v47, op_sel:[0,0,0] op_sel_hi:[1,1,0]")
    derivedAsmCaps["v_fma_mix_f32"]         = tryAssembler(isaVersion, "v_fma_mix_f32 v47, v36, v34, v47, op_sel:[0,0,0] op_sel_hi:[1,1,0]")

    derivedAsmCaps["v_dot2_f32_f16"]        = tryAssembler(isaVersion, "v_dot2_f32_f16 v20, v36, v34, v20")
    derivedAsmCaps["v_dot2c_f32_f16"]       = tryAssembler(isaVersion, "v_dot2c_f32_f16 v47, v36, v34") \
                                           or tryAssembler(isaVersion, "v_dot2acc_f32_f16 v47, v36, v34")

    derivedAsmCaps["v_dot4_i32_i8"]         = tryAssembler(isaVersion, "v_dot4_i32_i8 v47, v36, v34")
    derivedAsmCaps["v_dot4c_i32_i8"]        = tryAssembler(isaVersion, "v_dot4c_i32_i8 v47, v36, v34")
    derivedAsmCaps["VOP3v_dot4_i32_i8"]     = tryAssembler(isaVersion, "v_dot4_i32_i8 v47, v36, v34, v47")

    derivedAsmCaps["v_mac_f32"]             = tryAssembler(isaVersion, "v_mac_f32 v20, v21, v22")
    derivedAsmCaps["v_fma_f32"]             = tryAssembler(isaVersion, "v_fma_f32 v20, v21, v22, v23")
    derivedAsmCaps["v_fmac_f32"]            = tryAssembler(isaVersion, "v_fmac_f32 v20, v21, v22")

    derivedAsmCaps["v_fma_f64"]             = tryAssembler(isaVersion, "v_fma_f64 v[20:21], v[22:23], v[24:25], v[20:21]")

    derivedAsmCaps["v_mov_b64"]             = tryAssembler(isaVersion, "v_mov_b64 v[20:21], v[22:23]")

    derivedAsmCaps["HasAtomicAdd"]          = tryAssembler(isaVersion, "buffer_atomic_add_f32 v0, v1, s[0:3], 0 offen offset:0")
    derivedAsmCaps["HasGLCModifier"]        = tryAssembler(isaVersion, "buffer_load_dwordx4 v[10:13], v[0], s[0:3], 0, offen offset:0, glc")
    derivedAsmCaps["HasNTModifier"]         = tryAssembler(isaVersion, "buffer_load_dwordx4 v[10:13], v[0], s[0:3], 0, offen offset:0, nt")

    if tryAssembler(isaVersion, "s_waitcnt vmcnt(63)"):
      derivedAsmCaps["MaxVmcnt"] = 63
    elif tryAssembler(isaVersion, "s_waitcnt vmcnt(15)"):
      derivedAsmCaps["MaxVmcnt"] = 15
    else:
      derivedAsmCaps["MaxVmcnt"] = 0

    # TODO- Need to query the max cap, just like vmcnt as well?
    derivedAsmCaps["MaxLgkmcnt"] = 15

    derivedAsmCaps["KernargPreloading"] = tryAssembler(isaVersion, """
      TestKernel:
      s_endpgm
      .amdhsa_kernel TestKernel
      .amdhsa_next_free_vgpr 8
      .amdhsa_next_free_sgpr 4
      .amdhsa_group_segment_fixed_size 0 // lds bytes
      .amdhsa_user_sgpr_kernarg_segment_ptr 1
      .amdhsa_user_sgpr_kernarg_preload_length 3
      .amdhsa_user_sgpr_kernarg_preload_offset 0
      .amdhsa_accum_offset 4
      .end_amdhsa_kernel
      """)

    derivedAsmCaps["SupportedSource"] = True

    ignoreCacheCheck = globalParameters["IgnoreAsmCapCache"]

    # disable cache checking for < rocm 5.3
    compilerVer = globalParameters['HipClangVersion'].split(".")[:2]
    compilerVer = [int(c) for c in compilerVer]
    if len(compilerVer) >= 2:
      ignoreCacheCheck = ignoreCacheCheck or \
                         compilerVer[0] < 5 or \
                         (compilerVer[0] == 5 and compilerVer[1] <= 2) 
      
    if not derivedAsmCaps["SupportedISA"] and CACHED_ASM_CAPS[isaVersion]["SupportedISA"]:
      printWarning("Architecture {} not supported by ROCm {}".format(isaVersion, globalParameters['HipClangVersion']))
      ignoreCacheCheck = True

    # check if derived caps matches asm cap cache
    if not ignoreCacheCheck:
      exitFlag = False
      # rocm<=6.0, ignore KernargPreloading
      if compilerVer[0] <= 5 or (compilerVer[0] == 6 and compilerVer[1] == 0):
        derivedAsmCapsCopy = deepcopy(derivedAsmCaps)
        # copy KernargPreloading from CACHED_ASM_CAPS (to ignore this)
        derivedAsmCapsCopy["KernargPreloading"] = CACHED_ASM_CAPS[isaVersion]["KernargPreloading"]
        # compare with copied version (need to keep original value)
        if derivedAsmCapsCopy != CACHED_ASM_CAPS[isaVersion]:
          exitFlag = True
      # rocm>=6
      elif derivedAsmCaps != CACHED_ASM_CAPS[isaVersion]:
        exitFlag = True
      if exitFlag:
        printExit("Cached asm caps differ from derived asm caps for {}".format(isaVersion))
    return derivedAsmCaps
  else:
    printWarning("Assembler not present, asm caps loaded from cache are unverified")
    return CACHED_ASM_CAPS[isaVersion]

The condition that triggers this warning is `derivedAsmCaps["SupportedISA"]` being falsy, so we go back to the line `derivedAsmCaps["SupportedISA"] = tryAssembler(isaVersion, "")` and look at how the `tryAssembler` function works. It is defined in the same file:

def tryAssembler(isaVersion, asmString, debug=False, *options):
  """
  Try to assemble the asmString for the specified target processor
  Success is defined as assembler returning no error code or stderr/stdout
  """
  options = list(options)
  if globalParameters["PrintLevel"] >= 2:
    debug = True

  if isaVersion[0] >= 10:
    options += ['-mwavefrontsize64']

  assembler = globalParameters['AssemblerPath']
  if assembler is None:
    raise ValueError('No assembler available; set TENSILE_ROCM_ASSEMBLER_PATH to point to ROCm Clang.')
  args = [assembler, '-x', 'assembler',
          '-target', 'amdgcn-amdhsa',
          '-mcpu='+gfxName(isaVersion),
          *options,
          '-']

  result = subprocess.run(args, input=asmString.encode(), stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
  output = result.stdout.decode()

  if debug:
    print("isaVersion: ", isaVersion)
    print("asm_cmd:", ' '.join(args))
    print("asmString: ", asmString)
    print("output: ", output)
    print("return code: ", result.returncode)

  if output != "" or result.returncode != 0:
    return False
  return True

It obtains the assembler path from `globalParameters['AssemblerPath']`, runs the assembly command with `subprocess.run`, and decides from the return code and output whether a given architecture supports a given feature. The next step is to find out where `globalParameters['AssemblerPath']` gets assigned:

if "TENSILE_ROCM_ASSEMBLER_PATH" in os.environ:
  globalParameters["AssemblerPath"] = os.environ.get("TENSILE_ROCM_ASSEMBLER_PATH")
elif globalParameters["AssemblerPath"] is None and supportedCompiler(globalParameters["CxxCompiler"]):
  if os.name == "nt":
    globalParameters["AssemblerPath"] = locateExe(globalParameters["ROCmBinPath"], "clang++.exe")
  else:
    bin_path = "llvm/bin" if globalParameters["CxxCompiler"] == "hipcc" else "bin"
    compiler = "clang++" if globalParameters["CxxCompiler"] == "hipcc" else "amdclang++"
    globalParameters["AssemblerPath"] = locateExe(os.path.join(globalParameters["ROCmPath"], bin_path), compiler)

As we can see, `globalParameters['AssemblerPath']` is taken from the `TENSILE_ROCM_ASSEMBLER_PATH` environment variable; if that variable is not set, the assembler is located according to the operating system, and on Linux this should resolve to clang++. So let's run the corresponding assembly command inside the rocBLAS build environment:

$ clang++ -x assembler -target amdgcn-amdhsa -mcpu=gfx803 test.asm -o test
ld.lld: error: /nix/store/m1n213xhzip6sbrvzfzmv1rcgd5vckw2-rocm-llvm-runtimes-6.2.2/lib/libunwind.so is incompatible with /tmp/nix-build-81386-0/test-e358e3.o
clang++: error: ld.lld command failed with exit code 1 (use -v to see invocation)

Here test.asm contains a single instruction: `v_fma_f32 v20, v21, v22, v23`

The problem is finally found! Tensile calls clang++ to probe which features each GPU architecture supports, but in this build environment every clang++ invocation fails, so Tensile concludes that no architecture supports any feature, emits the warning, and the subsequent build naturally breaks. The next task is to fix the clang++ error. Adding the -v flag shows the detailed compilation steps:

$ clang++ -x assembler -target amdgcn-amdhsa -mcpu=gfx803 test.asm -o test -v
clang version 18.0.0git
Target: amdgcn-unknown-amdhsa
Thread model: posix
InstalledDir: /nix/store/j226zjj87mfvss3407ndwxag18ly1vyp-rocm-llvm-clang-unwrapped-6.2.2/bin
 "/nix/store/j226zjj87mfvss3407ndwxag18ly1vyp-rocm-llvm-clang-unwrapped-6.2.2/bin/clang++" -cc1as -triple amdgcn-unknown-amdhsa -filetype obj -main-file-name test.asm -target-cpu gfx803 -fdebug-compilation-dir=/home/jachinshen/Documents/nixpkgs/build/rocblas/source/build -dwarf-debug-producer "clang version 18.0.0git" -dwarf-version=5 -mrelocation-model pic -o /tmp/nix-build-81386-0/test-da61cc.o test.asm
 "/nix/store/w653fib3igij3jb517rdihjry6cmdfhk-rocm-llvm-binutils-wrapper-6.2.2/bin/ld.lld" --no-undefined -shared -L/nix/store/52ylf6idbizs4ms7wd80y6l2z9k6mgc2-clr-6.2.2/lib -L/nix/store/wqarg6yfjm9m6rw82f7qnp7fvw9dxach-rocm-llvm-comgr-6.2.2/lib -L/nix/store/aq0s19mgpqvq6qwwwcniq03vlx1phd3j-rocm-runtime-6.2.2/lib -L/nix/store/901c80rlps5q05bnjk1sj4zaz5k736nc-python3-3.12.7/lib -L/nix/store/l7h1x86mh3ajhcy7wqyky8h39v7sjcl9-rocprofiler-register-6.2.2/lib -L/nix/store/m1n213xhzip6sbrvzfzmv1rcgd5vckw2-rocm-llvm-runtimes-6.2.2/lib -L/nix/store/g8fw14gp14dkm65n7idrj3a0vai511j9-rocm-llvm-llvm-6.2.2/lib -L/nix/store/ncjma3lhr1sf6mqaczl9mfhjmp6ix58w-zlib-1.3.1/lib -L/nix/store/c7qx9iq6lqfjd4dxrdd6723l6rvzsj5s-ncurses-6.4.20221231/lib -L/nix/store/pf3q2h8s1l49a9qvz9y99p4m8mpcy31m-rocm-llvm-lld-6.2.2/lib -L/nix/store/bd6lqyiqwk8f0im0crmpyaklk5qdw22g-msgpack-3.3.0/lib -L/nix/store/xwhr69r2dzv1v86siykzr4csjymgra2m-libxml2-2.13.4/lib -L/nix/store/52ylf6idbizs4ms7wd80y6l2z9k6mgc2-clr-6.2.2/lib -L/nix/store/wqarg6yfjm9m6rw82f7qnp7fvw9dxach-rocm-llvm-comgr-6.2.2/lib -L/nix/store/aq0s19mgpqvq6qwwwcniq03vlx1phd3j-rocm-runtime-6.2.2/lib -L/nix/store/901c80rlps5q05bnjk1sj4zaz5k736nc-python3-3.12.7/lib -L/nix/store/l7h1x86mh3ajhcy7wqyky8h39v7sjcl9-rocprofiler-register-6.2.2/lib -L/nix/store/m1n213xhzip6sbrvzfzmv1rcgd5vckw2-rocm-llvm-runtimes-6.2.2/lib -L/nix/store/g8fw14gp14dkm65n7idrj3a0vai511j9-rocm-llvm-llvm-6.2.2/lib -L/nix/store/ncjma3lhr1sf6mqaczl9mfhjmp6ix58w-zlib-1.3.1/lib -L/nix/store/c7qx9iq6lqfjd4dxrdd6723l6rvzsj5s-ncurses-6.4.20221231/lib -L/nix/store/pf3q2h8s1l49a9qvz9y99p4m8mpcy31m-rocm-llvm-lld-6.2.2/lib -L/nix/store/bd6lqyiqwk8f0im0crmpyaklk5qdw22g-msgpack-3.3.0/lib -L/nix/store/xwhr69r2dzv1v86siykzr4csjymgra2m-libxml2-2.13.4/lib -L/nix/store/3bvxjkkmwlymr0fssczhgi39c3aj1l7i-glibc-2.40-36/lib -L/nix/store/v7dbfh5n7az2lcap0z1cv4jq0bikya8p-gcc-13.3.0//lib -L/nix/store/c91k93z9yr1cpia2pf5dr226imglrkg5-gcc-13.3.0-libgcc/lib 
-L/nix/store/j226zjj87mfvss3407ndwxag18ly1vyp-rocm-llvm-clang-unwrapped-6.2.2/lib -dynamic-linker=/nix/store/3bvxjkkmwlymr0fssczhgi39c3aj1l7i-glibc-2.40-36/lib/ld-linux-x86-64.so.2 /tmp/nix-build-81386-0/test-da61cc.o -lunwind -rpath /nix/store/44cn078nngvm2401rda62b3553n120gi-rocblas-6.2.2/lib -plugin-opt=mcpu=gfx803 -o test
ld.lld: error: /nix/store/m1n213xhzip6sbrvzfzmv1rcgd5vckw2-rocm-llvm-runtimes-6.2.2/lib/libunwind.so is incompatible with /tmp/nix-build-81386-0/test-da61cc.o
clang++: error: ld.lld command failed with exit code 1 (use -v to see invocation)

The compilation runs in two steps: first the source is assembled into a .o file, then it is linked, and the link step fails because libunwind.so is incompatible with the .o file. After some thought, the absurdity becomes clear: we are only assembling a single instruction, so why link at all? Running the first step by hand to produce the .o file:

$  "/nix/store/j226zjj87mfvss3407ndwxag18ly1vyp-rocm-llvm-clang-unwrapped-6.2.2/bin/clang++" -cc1as -triple amdgcn-unknown-amdhsa -filetype obj -main-file-name test.asm -target-cpu gfx803 -fdebug-compilation-dir=/home/jachinshen/Documents/nixpkgs/build/rocblas/source/build -dwarf-debug-producer "clang version 18.0.0git" -dwarf-version=5 -mrelocation-model pic -o /tmp/nix-build-81386-0/test-da61cc.o test.asm

No error, so the assembly step succeeds. A closer look shows that this invocation uses the clang++ from clang-unwrapped rather than the clang++ from the wrapper package. This again comes down to Nix's compiler machinery: clang-unwrapped is the bare compiler, while the wrapper injects extra environment at compile time, and it is presumably this wrapping that makes clang++ automatically invoke the linker. Since we only need to assemble a snippet of assembly code, the clang++ from clang-unwrapped is enough. Because the two binaries share the same name, for convenience the author invoked clang-18 to reach the inner clang++. After this change, the build goes through.
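The probe above can be condensed into a small assemble-only helper. This is a hypothetical sketch, not the change the author actually applied: it mirrors Tensile's `tryAssembler`, but passes `-c` so clang stops after producing the object file and never reaches the failing ld.lld step; the assembler name and gfx target in the usage below are placeholders.

```python
import subprocess

def build_probe_cmd(assembler, gfx_arch):
    """Build an assemble-only command line for the given gfx target."""
    return [assembler, "-x", "assembler",
            "-target", "amdgcn-amdhsa",
            f"-mcpu={gfx_arch}",
            "-c",                  # stop after assembling: no ld.lld invocation
            "-o", "/dev/null",
            "-"]                   # read the snippet from stdin

def try_assemble_only(assembler, gfx_arch, asm_string):
    """Return True if the snippet assembles cleanly (no output, rc 0)."""
    result = subprocess.run(build_probe_cmd(assembler, gfx_arch),
                            input=asm_string.encode(),
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    return result.returncode == 0 and result.stdout == b""
```

Alternatively, since the lookup code shown earlier honors `TENSILE_ROCM_ASSEMBLER_PATH`, exporting that variable to point at a clang++ that can complete the probe is another way to steer Tensile.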

But when running the tests, a new error appeared, traced to clients/include/blas_ex/testing_gemm_ex_get_solutions.hpp:

template <typename Ti, typename To, typename Tc>
void testing_gemm_ex_get_solutions(const Arguments& arg)
{
  ......

#define GEMM_EX_ARGS                                                                        \
    handle, transA, transB, M, N, K, &h_alpha_Tc, dA, arg.a_type, lda, dB, arg.b_type, ldb, \
        &h_beta_Tc, dC, arg.c_type, ldc, dDref, d_type, ldd, arg.compute_type, algo
#define rocblas_gemm_exM(...) rocblas_gemm_ex(__VA_ARGS__)

    // Get number of solutions
    rocblas_int size;
    CHECK_ROCBLAS_ERROR(
        rocblas_gemm_ex_get_solutions(GEMM_EX_ARGS, rocblas_gemm_flags_none, NULL, &size));

    rocblas_int              size_large = size * 2;
    std::vector<rocblas_int> ary(size_large, -1);

    if(size >= 2)
    {
        rocblas_int size_small = size / 2;
        CHECK_ROCBLAS_ERROR(rocblas_gemm_ex_get_solutions(
            GEMM_EX_ARGS, rocblas_gemm_flags_none, ary.data(), &size_small));
        EXPECT_EQ(ary[size_small], -1);
    }

    CHECK_ROCBLAS_ERROR(
        rocblas_gemm_ex_get_solutions(GEMM_EX_ARGS, rocblas_gemm_flags_none, ary.data(), &size));
    EXPECT_EQ(ary[size], -1);

    CHECK_ROCBLAS_ERROR(rocblas_gemm_ex_get_solutions(
        GEMM_EX_ARGS, rocblas_gemm_flags_none, ary.data(), &size_large));
    EXPECT_EQ(ary[size], -1);

  ......
}

Debugging with gdb showed that size can be 0, in which case ary[size] indexes out of bounds and the test crashes. The test apparently never considered this case, so an early return was added for size == 0 to avoid the crash. The next error was traced to clients/include/blas3/testing_gemm.hpp:

template <typename T>
void testing_gemm(const Arguments& arg)
{
    auto rocblas_gemm_fn = arg.api == FORTRAN ? rocblas_gemm<T, true> : rocblas_gemm<T, false>;

......

    if(arg.unit_check || arg.norm_check)
    {
        host_matrix<T> hC_gold(M, N, ldc);
        hC_gold = hC_1;

        // ROCBLAS rocblas_pointer_mode_host
        if(arg.pointer_mode_host)
        {
            CHECK_ROCBLAS_ERROR(rocblas_set_pointer_mode(handle, rocblas_pointer_mode_host));

            handle.pre_test(arg);
            if(arg.api != INTERNAL)
            {
                CHECK_ROCBLAS_ERROR(rocblas_gemm_fn(
                    handle, transA, transB, M, N, K, &h_alpha, dA, lda, dB, ldb, &h_beta, dC, ldc));
            }
            else
            {
                // using arg.stride_x,y,d for offset testing
                rocblas_stride           offsetA = arg.stride_x;
                rocblas_stride           offsetB = arg.stride_y;
                rocblas_stride           offsetC = arg.stride_d;
                constexpr rocblas_stride strideA = 0, strideB = 0, strideC = 0;
                constexpr rocblas_int    batch_count = 1;

                CHECK_ROCBLAS_ERROR(rocblas_internal_gemm_template<T>(handle,
                                                                      transA,
                                                                      transB,
                                                                      M,
                                                                      N,
                                                                      K,
                                                                      &h_alpha,
                                                                      (const T*)dA + offsetA,
                                                                      -offsetA,
                                                                      lda,
                                                                      strideA,
                                                                      (const T*)dB + offsetB,
                                                                      -offsetB,
                                                                      ldb,
                                                                      strideB,
                                                                      &h_beta,
                                                                      (T*)dC + offsetC,
                                                                      -offsetC,
                                                                      ldc,
                                                                      strideC,
                                                                      batch_count));
            }
            handle.post_test(arg);
            CHECK_HIP_ERROR(hC_1.transfer_from(dC));
        }
    }
......
}

Running it under gdb:

295                     CHECK_ROCBLAS_ERROR(rocblas_gemm_fn(
(gdb) p rocblas_gemm_fn
$1 = (rocblas_status (*)(rocblas_handle, rocblas_operation, rocblas_operation, rocblas_int, rocblas_int, rocblas_int, const float *, 
    const float *, rocblas_int, const float *, rocblas_int, const float *, float *, 
    rocblas_int)) 0x7ffff7061320 <rocblas_sgemm(rocblas_handle, rocblas_operation, rocblas_operation, rocblas_int, rocblas_int, rocblas_int, float const*, float const*, rocblas_int, float const*, rocblas_int, float const*, float*, rocblas_int)>
(gdb) p arg
$2 = (const Arguments &) @0x5555703bbfc8: {static c_scan_value = -999, function = "gemm", '\000' <repeats 59 times>, 
  name = "alpha_beta_zero_NaN", '\000' <repeats 44 times>, category = "pre_checkin", '\000' <repeats 52 times>, 
  known_bug_platforms = '\000' <repeats 63 times>, alpha = -nan(0x8000000000000), alphai = 0, beta = -nan(0x8000000000000), betai = 0, 
  stride_a = 16384, stride_b = 8192, stride_c = 32768, stride_d = 32768, stride_x = 0, stride_y = 0, user_allocated_workspace = 0, M = 256, 
  N = 128, K = 64, KL = 128, KU = 128, lda = 256, ldb = 64, ldc = 256, ldd = 256, incx = 0, incy = 0, batch_count = 1, scan = -999, iters = 10, 
  cold_iters = 2, algo = 0, solution_index = 0, geam_ex_op = rocblas_geam_ex_operation_min_plus, flags = rocblas_gemm_flags_none, 
  a_type = rocblas_datatype_f32_r, b_type = rocblas_datatype_f32_r, c_type = rocblas_datatype_f32_r, d_type = rocblas_datatype_f32_r, 
  compute_type = rocblas_datatype_f32_r, composite_compute_type = 4294967295, initialization = rocblas_initialization::rand_int, 
  atomics_mode = rocblas_atomics_allowed, os_flags = ALL, gpu_arch = "\000\000\000", api = C, pad = 4096, math_mode = 0, flush_batch_count = 1, 
  flush_memory_size = 0, threads = 0, streams = 0, devices = 0 '\000', norm_check = 0 '\000', unit_check = 1 '\001', res_check = 0 '\000', 
  timing = 0 '\000', transA = 78 'N', transB = 78 'N', side = 42 '*', uplo = 42 '*', diag = 42 '*', pointer_mode_host = true, 
  pointer_mode_device = true, stochastic_rounding = false, c_noalias_d = false, outofplace = false, HMM = false, graph_test = false, 
  repeatability_check = false}
(gdb) p arg.api
$3 = C
(gdb) p rocblas_gemm_fn(handle, transA, transB, M, N, K, &h_alpha, dA, lda, dB, ldb, &h_beta, dC, ldc)
Memory access fault by GPU node-1 (Agent handle: 0x5555588f62c0) on address 0x9000. Reason: Page not present or supervisor privilege.
$4 = rocblas_status_success

This triggers the following error:

rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'N', transB: 'N', M: 256, N: 256, K: 64, alpha: 1, row_stride_a: 1, col_stride_a: 256, row_stride_b: 1, col_stride_b: 128, row_stride_c: 1, col_stride_c: 256, row_stride_d: 1, col_stride_d: 256, beta: 0, batch_count: 1, strided_batch: true, stride_a: 0, stride_b: 0, stride_c: 0, stride_d: 0, atomics_mode: atomics_allowed }

After searching online, the solution finally turned up:

  • https://github.com/ROCm/Tensile/issues/2044: this is an fp16 bug
  • https://gitlab.archlinux.org/archlinux/packaging/packages/rocblas/-/issues/2: Arch Linux hit the same problem when packaging

The fix is the patch in the Arch Linux repository, which essentially removes the -mf16c compile flag. After rebuilding, the problem is solved.

Verifying ONNXRuntime

With ROCm upgraded to 6.2.2, it was finally possible to test ONNXRuntime. The author originally tried building it from source but kept hitting problems; then came a flash of insight: AMD already provides prebuilt wheel packages, so it is enough to make their bundled shared libraries point at the locally built ROCm. The code lives in onnxruntime-rocm-nix-bin: it uses patchelf to rewrite the dynamic library references of libonnxruntime_provider_rocm.so so that they point at the locally built ROCm, after which ONNXRuntime runs normally.
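A hedged sketch of that rewrite step: the ROCm path below is a placeholder, and `patchelf --set-rpath` is assumed to be the mechanism; the actual repository may instead patch individual DT_NEEDED entries with `--replace-needed`.

```python
import subprocess

def repoint_rocm_deps(lib_path, rocm_lib_dir, dry_run=False):
    """Rewrite the RPATH of a bundled library so its ROCm dependencies
    resolve against a locally built ROCm tree instead of /opt/rocm."""
    cmd = ["patchelf", "--set-rpath", rocm_lib_dir, lib_path]
    if not dry_run:
        subprocess.run(cmd, check=True)  # requires patchelf on PATH
    return cmd
```

Usage would look like `repoint_rocm_deps("libonnxruntime_provider_rocm.so", "/path/to/local/rocm/lib")`, with the second argument pointing at the nix store path of the locally built ROCm libraries.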