On gem5: I want to try out SVE, so I am setting up an AArch64 environment first. GCC can apparently be downloaded from the following location.
curl -L https://developer.arm.com/-/media/Files/downloads/gnu/11.3.rel1/binrel/arm-gnu-toolchain-11.3.rel1-x86_64-aarch64-none-linux-gnu.tar.xz | tar xJ
gem5 for ARM is built as follows.
scons build/ARM/gem5.debug -j20
scons build/ARM/gem5.opt -j20
As a first experiment I would like to use memcpy, but how do I get an SVE version of memcpy generated?
memcpy (dest_data, source_data, data_num);
For now, building the binary with the following options appears to have produced an ordinary memcpy call.
$ aarch64-none-linux-gnu-gcc -DAARCH64 -march=armv8-a+sve main.c -o memcpy.aarch64 -static
$ aarch64-none-linux-gnu-objdump -d memcpy.aarch64 > memcpy.aarch64.dmp
0000000000400724 <copy_data_scalar>:
  400724: a9bc7bfd stp x29, x30, [sp, #-64]!
  400728: 910003fd mov x29, sp
  40072c: f90017e0 str x0, [sp, #40]
  400730: f90013e1 str x1, [sp, #32]
  400734: b9001fe2 str w2, [sp, #28]
  400738: 9400745a bl 41d8a0 <clock>
  40073c: b9003fe0 str w0, [sp, #60]
  400740: b9801fe0 ldrsw x0, [sp, #28]
  400744: aa0003e2 mov x2, x0
  400748: f94013e1 ldr x1, [sp, #32]
  40074c: f94017e0 ldr x0, [sp, #40]
  400750: 97fffecc bl 400280 <.plt+0x10>
  400754: 94007453 bl 41d8a0 <clock>
  400758: b9003be0 str w0, [sp, #56]
  40075c: b9403be1 ldr w1, [sp, #56]
  ...
  4007b4: b9403fe1 ldr w1, [sp, #60]
  4007b8: 900002a0 adrp x0, 454000 <__getauxval+0x50>
  4007bc: 9132c000 add x0, x0, #0xcb0
  4007c0: 94001738 bl 4064a0 <_IO_printf>
  4007c4: d503201f nop
  4007c8: a8c47bfd ldp x29, x30, [sp], #64
  4007cc: d65f03c0 ret
On the other hand, an a64fx variant of memcpy was also generated. Is this the one that uses SVE?
000000000041bf40 <__memcpy_a64fx>:
  41bf40: 0420e3e7 cntb x7
  41bf44: eb07045f cmp x2, x7, lsl #1
  41bf48: 54000148 b.hi 41bf70 <__memcpy_a64fx+0x30> // b.pmore
  41bf4c: 25221ce1 whilelo p1.b, x7, x2
  41bf50: 25221fe0 whilelo p0.b, xzr, x2
  41bf54: a400a020 ld1b {z0.b}, p0/z, [x1]
  41bf58: a401a421 ld1b {z1.b}, p1/z, [x1, #1, mul vl]
  41bf5c: e400e000 st1b {z0.b}, p0, [x0]
  41bf60: e401e401 st1b {z1.b}, p1, [x0, #1, mul vl]
  41bf64: d65f03c0 ret
  41bf68: d503201f nop
  41bf6c: d503201f nop
  41bf70: eb070c5f cmp x2, x7, lsl #3
  41bf74: 54000468 b.hi 41c000 <__memcpy_a64fx+0xc0> // b.pmore
  41bf78: 8b020004 add x4, x0, x2
  41bf7c: 8b020025 add x5, x1, x2
  41bf80: eb07085f cmp x2, x7, lsl #2
  41bf84: 54000168 b.hi 41bfb0 <__memcpy_a64fx+0x70> // b.pmore
  41bf88: 2518e3e0 ptrue p0.b
  41bf8c: a400a020 ld1b {z0.b}, p0/z, [x1]
  41bf90: a401a021 ld1b {z1.b}, p0/z, [x1, #1, mul vl]
  41bf94: a40ea0a2 ld1b {z2.b}, p0/z, [x5, #-2, mul vl]
  41bf98: a40fa0a3 ld1b {z3.b}, p0/z, [x5, #-1, mul vl]
  41bf9c: e400e000 st1b {z0.b}, p0, [x0]
  41bfa0: e401e001 st1b {z1.b}, p0, [x0, #1, mul vl]
  41bfa4: e40ee082 st1b {z2.b}, p0, [x4, #-2, mul vl]
  41bfa8: e40fe083 st1b {z3.b}, p0, [x4, #-1, mul vl]
  41bfac: d65f03c0 ret
  41bfb0: 2518e3e0 ptrue p0.b
  41bfb4: a400a020 ld1b {z0.b}, p0/z, [x1]
  41bfb8: a401a021 ld1b {z1.b}, p0/z, [x1, #1, mul vl]
  41bfbc: a402a022 ld1b {z2.b}, p0/z, [x1, #2, mul vl]
  41bfc0: a403a023 ld1b {z3.b}, p0/z, [x1, #3, mul vl]
  41bfc4: a40ca0a4 ld1b {z4.b}, p0/z, [x5, #-4, mul vl]
  41bfc8: a40da0a5 ld1b {z5.b}, p0/z, [x5, #-3, mul vl]
  41bfcc: a40ea0a6 ld1b {z6.b}, p0/z, [x5, #-2, mul vl]
  41bfd0: a40fa0a7 ld1b {z7.b}, p0/z, [x5, #-1, mul vl]
  41bfd4: e400e000 st1b {z0.b}, p0, [x0]
  41bfd8: e401e001 st1b {z1.b}, p0, [x0, #1, mul vl]
  41bfdc: e402e002 st1b {z2.b}, p0, [x0, #2, mul vl]
  41bfe0: e403e003 st1b {z3.b}, p0, [x0, #3, mul vl]
  41bfe4: e40ce084 st1b {z4.b}, p0, [x4, #-4, mul vl]
  41bfe8: e40de085 st1b {z5.b}, p0, [x4, #-3, mul vl]
  41bfec: e40ee086 st1b {z6.b}, p0, [x4, #-2, mul vl]
  41bff0: e40fe087 st1b {z7.b}, p0, [x4, #-1, mul vl]
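The cntb, whilelo, ld1b and st1b instructions on z and p registers in this listing are SVE instructions. If the goal is to have SVE code in my own function rather than relying on whichever memcpy variant glibc selects, one option might be the ACLE intrinsics from arm_sve.h. The following is only a minimal sketch under assumptions: copy_data_sve_sketch is a hypothetical name (it is not the function in main.c and not the glibc implementation), and on a compiler without SVE enabled it falls back to a plain byte loop so that it still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>   /* ACLE SVE intrinsics; enabled by -march=armv8-a+sve */
#endif

/* Hypothetical helper (not from main.c): copy n bytes from src to dst.
 * With SVE enabled, each iteration copies up to one vector length of
 * bytes under a predicate, so no separate tail loop is needed. */
void copy_data_sve_sketch(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
#if defined(__ARM_FEATURE_SVE)
    for (size_t i = 0; i < n; i += svcntb()) {
        /* Predicate lanes are active while (i + lane) < n. */
        svbool_t pg = svwhilelt_b8_u64((uint64_t)i, (uint64_t)n);
        svuint8_t v = svld1_u8(pg, src + i);   /* predicated load  */
        svst1_u8(pg, dst + i, v);              /* predicated store */
    }
#else
    /* Fallback when the compiler is not targeting SVE. */
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
#endif
}
```

Built with the same -march=armv8-a+sve option as above, I would expect the loop to show up in the objdump output as whilelo / ld1b / st1b, similar to the short-copy path of __memcpy_a64fx, but I have not confirmed the exact code GCC emits.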
Running it on gem5, the simulation did go through.
$ ../../gem5/build/ARM/gem5.debug \
    --debug-flags=O3PipeView \
    --debug-file=memcpy.aarch64.out \
    ../../gem5/configs/example/se.py \
    --cpu-type=DerivO3CPU \
    --caches -c \
    memcpy.aarch64
**** REAL SIMULATION ****
build/ARM/sim/simulate.cc:194: info: Entering event queue @ 0. Starting simulation...
build/ARM/sim/syscall_emul.cc:74: warn: ignoring syscall set_robust_list(...)
build/ARM/sim/syscall_emul.cc:85: warn: ignoring syscall rseq(...)
(further warnings will be suppressed)
build/ARM/sim/mem_state.cc:443: info: Increasing stack size by one page.
build/ARM/arch/arm/insts/pseudo.cc:172: warn: instruction 'bti' unimplemented
build/ARM/sim/syscall_emul.cc:74: warn: ignoring syscall mprotect(...)
===== Main start =====
scalar : stop - start = 2764472338, 2764472339, 1
vector : stop - start = 2764472342, 2764472343, 1
scalar : stop - start = 2764472343, 2764472343, 0
vector : stop - start = 2764472344, 2764472344, 0
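The differences of 0 and 1 in this output are in clock() ticks. With glibc, CLOCKS_PER_SEC is 1000000, so one tick is a microsecond, which suggests the copies are simply too short for clock() to resolve. A finer-grained timer might help; below is a minimal sketch using clock_gettime. The helper names elapsed_ns and time_memcpy_ns are hypothetical, and whether CLOCK_MONOTONIC behaves sensibly under gem5 syscall emulation is an assumption I have not checked.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: difference stop - start in nanoseconds. */
int64_t elapsed_ns(struct timespec start, struct timespec stop)
{
    return (int64_t)(stop.tv_sec - start.tv_sec) * 1000000000LL
         + (int64_t)(stop.tv_nsec - start.tv_nsec);
}

/* Hypothetical helper: time one memcpy call with nanosecond granularity.
 * Assumes clock_gettime(CLOCK_MONOTONIC) is usable in the target
 * environment (unverified for gem5 SE mode). */
int64_t time_memcpy_ns(void *dst, const void *src, size_t n)
{
    struct timespec t0, t1;
    int rc0 = clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, n);
    int rc1 = clock_gettime(CLOCK_MONOTONIC, &t1);
    assert(rc0 == 0 && rc1 == 0);
    (void)rc0;
    (void)rc1;
    return elapsed_ns(t0, t1);
}
```

Even with a finer timer, a single short copy is probably dominated by the timer calls themselves, so repeating the copy many times or using a larger data_num would likely give more meaningful numbers.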
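So it runs, and I was able to obtain something like a cycle trace. Still, until I check the instruction counts and cycle counts a bit more closely, I cannot really tell whether it is working correctly.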
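For the instruction and cycle counts, gem5 writes a stats.txt file into its output directory (m5out by default), with one "name value # description" entry per line. A small parser along the following lines could pull out individual counters. This is a sketch under assumptions: parse_stat_line and read_stat are hypothetical helpers, and stat names such as simInsts or system.cpu.numCycles depend on the gem5 version and configuration, so they should be checked against the actual file.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: if `line` starts with the stat `name` followed by
 * whitespace and a number, store the number in *value and return 1.
 * Otherwise return 0 and leave *value untouched. */
int parse_stat_line(const char *line, const char *name, double *value)
{
    assert(line != NULL && name != NULL && value != NULL);
    size_t len = strlen(name);
    if (strncmp(line, name, len) != 0) {
        return 0;                       /* different stat */
    }
    if (!isspace((unsigned char)line[len])) {
        return 0;                       /* name is only a prefix of a longer stat */
    }
    char *end = NULL;
    double v = strtod(line + len, &end);
    if (end == line + len) {
        return 0;                       /* no numeric value on this line */
    }
    *value = v;
    return 1;
}

/* Hypothetical helper: scan a stats file for the first line matching `name`.
 * Returns 1 on success, 0 if the file cannot be opened or the stat is absent. */
int read_stat(const char *path, const char *name, double *value)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        return 0;
    }
    char line[1024];
    int found = 0;
    while (!found && fgets(line, sizeof line, fp) != NULL) {
        found = parse_stat_line(line, name, value);
    }
    fclose(fp);
    return found;
}
```

For example, read_stat("m5out/stats.txt", "simInsts", &v) would return the simulated instruction count if that stat name exists in the file; comparing such counters between the scalar and vector runs seems like a more reliable check than the clock() output.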