FPGA開発日記 (FPGA Development Diary)

Article index by category: https://msyksphinz.github.io/github_pages , English version: https://fpgadevdiary.hatenadiary.com/

Trying out a Gem5 installation and running a benchmark (AArch64 target, plus a first attempt at running SVE)

I want to try SVE on Gem5, so I am setting up an AArch64 environment first. GCC appears to be downloadable from the following location.

curl -L https://developer.arm.com/-/media/Files/downloads/gnu/11.3.rel1/binrel/arm-gnu-toolchain-11.3.rel1-x86_64-aarch64-none-linux-gnu.tar.xz | tar xJ

Gem5 for ARM is built as follows.

scons build/ARM/gem5.debug -j20
scons build/ARM/gem5.opt -j20

As a first test I would like to use memcpy, but how do I get an SVE version of memcpy generated?

memcpy (dest_data, source_data, data_num);

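For now, building the binary with the options below seems to have produced an ordinary memcpy call. In the disassembly of copy_data_scalar, the call goes through a PLT entry (bl 400280 <.plt+0x10>) rather than being expanded inline.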

$ aarch64-none-linux-gnu-gcc -DAARCH64 -march=armv8-a+sve main.c -o memcpy.aarch64 -static
$ aarch64-none-linux-gnu-objdump -d memcpy.aarch64 > memcpy.aarch64.dmp
0000000000400724 <copy_data_scalar>:
  400724:       a9bc7bfd        stp     x29, x30, [sp, #-64]!
  400728:       910003fd        mov     x29, sp
  40072c:       f90017e0        str     x0, [sp, #40]
  400730:       f90013e1        str     x1, [sp, #32]
  400734:       b9001fe2        str     w2, [sp, #28]
  400738:       9400745a        bl      41d8a0 <clock>
  40073c:       b9003fe0        str     w0, [sp, #60]
  400740:       b9801fe0        ldrsw   x0, [sp, #28]
  400744:       aa0003e2        mov     x2, x0
  400748:       f94013e1        ldr     x1, [sp, #32]
  40074c:       f94017e0        ldr     x0, [sp, #40]
  400750:       97fffecc        bl      400280 <.plt+0x10>
  400754:       94007453        bl      41d8a0 <clock>
  400758:       b9003be0        str     w0, [sp, #56]
  40075c:       b9403be1        ldr     w1, [sp, #56]
...
  4007b4:       b9403fe1        ldr     w1, [sp, #60]
  4007b8:       900002a0        adrp    x0, 454000 <__getauxval+0x50>
  4007bc:       9132c000        add     x0, x0, #0xcb0
  4007c0:       94001738        bl      4064a0 <_IO_printf>
  4007c4:       d503201f        nop
  4007c8:       a8c47bfd        ldp     x29, x30, [sp], #64
  4007cc:       d65f03c0        ret

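On the other hand, an A64FX variant of memcpy was also linked into the binary. Is this one using SVE? The listing does contain SVE instructions (cntb, whilelo, ptrue, and ld1b/st1b on z registers), although whether this variant is actually called depends on which memcpy implementation glibc selects at run time.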

000000000041bf40 <__memcpy_a64fx>:
  41bf40:   0420e3e7    cntb    x7
  41bf44:   eb07045f    cmp x2, x7, lsl #1
  41bf48:   54000148    b.hi    41bf70 <__memcpy_a64fx+0x30>  // b.pmore
  41bf4c:   25221ce1    whilelo p1.b, x7, x2
  41bf50:   25221fe0    whilelo p0.b, xzr, x2
  41bf54:   a400a020    ld1b    {z0.b}, p0/z, [x1]
  41bf58:   a401a421    ld1b    {z1.b}, p1/z, [x1, #1, mul vl]
  41bf5c:   e400e000    st1b    {z0.b}, p0, [x0]
  41bf60:   e401e401    st1b    {z1.b}, p1, [x0, #1, mul vl]
  41bf64:   d65f03c0    ret
  41bf68:   d503201f    nop
  41bf6c:   d503201f    nop
  41bf70:   eb070c5f    cmp x2, x7, lsl #3
  41bf74:   54000468    b.hi    41c000 <__memcpy_a64fx+0xc0>  // b.pmore
  41bf78:   8b020004    add x4, x0, x2
  41bf7c:   8b020025    add x5, x1, x2
  41bf80:   eb07085f    cmp x2, x7, lsl #2
  41bf84:   54000168    b.hi    41bfb0 <__memcpy_a64fx+0x70>  // b.pmore
  41bf88:   2518e3e0    ptrue   p0.b
  41bf8c:   a400a020    ld1b    {z0.b}, p0/z, [x1]
  41bf90:   a401a021    ld1b    {z1.b}, p0/z, [x1, #1, mul vl]
  41bf94:   a40ea0a2    ld1b    {z2.b}, p0/z, [x5, #-2, mul vl]
  41bf98:   a40fa0a3    ld1b    {z3.b}, p0/z, [x5, #-1, mul vl]
  41bf9c:   e400e000    st1b    {z0.b}, p0, [x0]
  41bfa0:   e401e001    st1b    {z1.b}, p0, [x0, #1, mul vl]
  41bfa4:   e40ee082    st1b    {z2.b}, p0, [x4, #-2, mul vl]
  41bfa8:   e40fe083    st1b    {z3.b}, p0, [x4, #-1, mul vl]
  41bfac:   d65f03c0    ret
  41bfb0:   2518e3e0    ptrue   p0.b
  41bfb4:   a400a020    ld1b    {z0.b}, p0/z, [x1]
  41bfb8:   a401a021    ld1b    {z1.b}, p0/z, [x1, #1, mul vl]
  41bfbc:   a402a022    ld1b    {z2.b}, p0/z, [x1, #2, mul vl]
  41bfc0:   a403a023    ld1b    {z3.b}, p0/z, [x1, #3, mul vl]
  41bfc4:   a40ca0a4    ld1b    {z4.b}, p0/z, [x5, #-4, mul vl]
  41bfc8:   a40da0a5    ld1b    {z5.b}, p0/z, [x5, #-3, mul vl]
  41bfcc:   a40ea0a6    ld1b    {z6.b}, p0/z, [x5, #-2, mul vl]
  41bfd0:   a40fa0a7    ld1b    {z7.b}, p0/z, [x5, #-1, mul vl]
  41bfd4:   e400e000    st1b    {z0.b}, p0, [x0]
  41bfd8:   e401e001    st1b    {z1.b}, p0, [x0, #1, mul vl]
  41bfdc:   e402e002    st1b    {z2.b}, p0, [x0, #2, mul vl]
  41bfe0:   e403e003    st1b    {z3.b}, p0, [x0, #3, mul vl]
  41bfe4:   e40ce084    st1b    {z4.b}, p0, [x4, #-4, mul vl]
  41bfe8:   e40de085    st1b    {z5.b}, p0, [x4, #-3, mul vl]
  41bfec:   e40ee086    st1b    {z6.b}, p0, [x4, #-2, mul vl]
  41bff0:   e40fe087    st1b    {z7.b}, p0, [x4, #-1, mul vl]
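To check which functions in a dump actually contain SVE instructions, rather than reading the listing by eye, a small script can scan the objdump output. The following is only a minimal sketch: the function name `count_sve_instructions` and the detection heuristic (z/p register operands plus a few counting mnemonics) are my own assumptions, not something objdump or Gem5 provides, and the heuristic may miss some SVE instructions.

```python
import re
from collections import Counter

# Function header in `objdump -d` output, e.g. "000000000041bf40 <__memcpy_a64fx>:"
FUNC_RE = re.compile(r"^[0-9a-f]+ <([^>]+)>:$")
# Instruction line: address, 32-bit encoding, mnemonic, operands.
INSN_RE = re.compile(r"^\s*[0-9a-f]+:\s+[0-9a-f]{8}\s+(\S+)\s*(.*)$")
# Heuristic (assumption): SVE vector registers (z0.b) or predicates (p0.b, p0/z).
SVE_OPERAND_RE = re.compile(r"\b(?:z\d+\.[bhsdq]|p\d+\.[bhsd]|p\d+/[zm])")
# A few SVE mnemonics whose operands are scalar registers only.
SVE_MNEMONICS = {"cntb", "cnth", "cntw", "cntd"}


def count_sve_instructions(dump_text):
    """Return {function name: number of SVE-looking instructions}."""
    counts = Counter()
    current = None
    for line in dump_text.splitlines():
        header = FUNC_RE.match(line.strip())
        if header:
            current = header.group(1)
            continue
        insn = INSN_RE.match(line)
        if insn and current is not None:
            mnemonic, operands = insn.groups()
            operands = operands.split("//")[0]  # drop objdump's trailing comment
            if mnemonic in SVE_MNEMONICS or SVE_OPERAND_RE.search(operands):
                counts[current] += 1
    return dict(counts)


# A few lines copied from the listings above, used as a small self-check.
SAMPLE_DUMP = """\
0000000000400724 <copy_data_scalar>:
  400724:       a9bc7bfd        stp     x29, x30, [sp, #-64]!
  400750:       97fffecc        bl      400280 <.plt+0x10>
000000000041bf40 <__memcpy_a64fx>:
  41bf40:   0420e3e7    cntb    x7
  41bf44:   eb07045f    cmp x2, x7, lsl #1
  41bf48:   54000148    b.hi    41bf70 <__memcpy_a64fx+0x30>  // b.pmore
  41bf4c:   25221ce1    whilelo p1.b, x7, x2
  41bf54:   a400a020    ld1b    {z0.b}, p0/z, [x1]
  41bf64:   d65f03c0    ret
"""

print(count_sve_instructions(SAMPLE_DUMP))
```

On the real file this would be called as `count_sve_instructions(open("memcpy.aarch64.dmp").read())`. Functions without any match, such as copy_data_scalar, simply do not appear in the result.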

Running it on Gem5, the simulation did go through.

$ ../../gem5/build/ARM/gem5.debug \
        --debug-flags=O3PipeView \
        --debug-file=memcpy.aarch64.out \
        ../../gem5/configs/example/se.py \
        --cpu-type=DerivO3CPU \
        --caches -c \
        memcpy.aarch64
**** REAL SIMULATION ****
build/ARM/sim/simulate.cc:194: info: Entering event queue @ 0.  Starting simulation...
build/ARM/sim/syscall_emul.cc:74: warn: ignoring syscall set_robust_list(...)
build/ARM/sim/syscall_emul.cc:85: warn: ignoring syscall rseq(...)
      (further warnings will be suppressed)
build/ARM/sim/mem_state.cc:443: info: Increasing stack size by one page.
build/ARM/arch/arm/insts/pseudo.cc:172: warn:   instruction 'bti' unimplemented
build/ARM/sim/syscall_emul.cc:74: warn: ignoring syscall mprotect(...)
===== Main start =====
scalar : stop - start = 2764472338, 2764472339, 1
vector : stop - start = 2764472342, 2764472343, 1
scalar : stop - start = 2764472343, 2764472343, 0
vector : stop - start = 2764472344, 2764472344, 0
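The differences printed above are only 0 or 1. Assuming the three numbers are the two clock() readings followed by their difference (my reading of the format string, since main.c is not shown), and given that CLOCKS_PER_SEC is 1000000 on Linux with glibc, each measurement is at most one microsecond tick, which is at the resolution limit of clock(). A rough sketch that parses these lines and flags such measurements could look like this; the helper names are hypothetical.

```python
import re

# Matches lines such as "scalar : stop - start = 2764472338, 2764472339, 1"
LINE_RE = re.compile(r"^(\w+)\s*:\s*stop - start = (\d+), (\d+), (-?\d+)$")


def parse_timings(log_text):
    """Return a list of (label, first, second, diff) tuples found in the log."""
    rows = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rows.append((m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4))))
    return rows


def at_resolution_limit(rows, min_ticks=2):
    """Return the rows whose difference is below min_ticks clock() ticks."""
    return [row for row in rows if abs(row[3]) < min_ticks]


# The four result lines from the run above.
SAMPLE_LOG = """\
===== Main start =====
scalar : stop - start = 2764472338, 2764472339, 1
vector : stop - start = 2764472342, 2764472343, 1
scalar : stop - start = 2764472343, 2764472343, 0
vector : stop - start = 2764472344, 2764472344, 0
"""

rows = parse_timings(SAMPLE_LOG)
for label, first, second, diff in at_resolution_limit(rows):
    print(f"{label}: diff={diff} tick(s), too coarse to compare")
```

With measurements this small, the scalar and vector runs cannot be distinguished from the clock() values alone.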

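It runs, and I was apparently able to capture something like a cycle trace. I still need to check things such as the instruction count and cycle count before I can tell whether it is really working properly.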
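One way to check the instruction and cycle counts is to read the statistics file that Gem5 writes at the end of a run (m5out/stats.txt by default). The sketch below assumes the stat names `simInsts` and `system.cpu.numCycles`; these names differ between Gem5 versions and configurations, so they should be confirmed against the actual file. The numbers in the sample are made up for the self-check and are not results from this run.

```python
def parse_stats(text):
    """Parse 'name value # description' lines into {name: float}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        try:
            stats[fields[0]] = float(fields[1])
        except ValueError:
            continue  # banner lines and other non-numeric entries
    return stats


def ipc(stats, insts_key="simInsts", cycles_key="system.cpu.numCycles"):
    """Instructions per cycle; the default key names are assumptions."""
    return stats[insts_key] / stats[cycles_key]


# Made-up sample in the general layout of stats.txt (not real results).
SAMPLE_STATS = """\
---------- Begin Simulation Statistics ----------
simInsts                 1000     # Number of instructions simulated (Count)
system.cpu.numCycles     2500     # Number of cpu cycles simulated (Cycle)
---------- End Simulation Statistics   ----------
"""

stats = parse_stats(SAMPLE_STATS)
print("insts:", stats["simInsts"], "cycles:", stats["system.cpu.numCycles"])
print("IPC:", ipc(stats))
```

On a real run this would be `parse_stats(open("m5out/stats.txt").read())`. If the file contains several statistics dumps, this simple version keeps only the last value seen for each name.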