Trying out SiFive's RISC-V board, the HiFive Unleashed (4. Running benchmark programs)

Figure: HiFive Unleashed evaluation board

Running CoreMark on Debian on the HiFive Unleashed

Next, I run a benchmark program to see how the RISC-V cores on the HiFive Unleashed perform. This time I use the CoreMark benchmark. I edited linux64/core_portme.mak to change the optimization option in CFLAGS to -O3.

diff --git a/linux64/core_portme.mak b/linux64/core_portme.mak
index 5cfabee..68fdb3d 100755
--- a/linux64/core_portme.mak
+++ b/linux64/core_portme.mak
@@ -24,7 +24,7 @@ OUTFLAG= -o
 CC = gcc
 # Flag: CFLAGS
 #      Use this flag to define compiler options. Note, you can add compiler options from the command line using XCFLAGS="other flags"
-PORT_CFLAGS = -O2
+PORT_CFLAGS = -O3
 FLAGS_STR = "$(PORT_CFLAGS) $(XCFLAGS) $(XLFLAGS) $(LFLAGS_END)"
 CFLAGS = $(PORT_CFLAGS) -I$(PORT_DIR) -I. -DFLAGS_STR=\"$(FLAGS_STR)\"
 #Flag: LFLAGS_END

git clone https://github.com/eembc/coremark.git
cd coremark/
make TARGET=linux64

echo Loading done ./coremark.exe
Loading done ./coremark.exe
make port_postload
...
make port_prerun
...
./coremark.exe  0x3415 0x3415 0x66 0 7 1 2000  > ./run2.log
make port_postrun
...

Both the compilation and the benchmark run completed without problems.

root@buildroot:~/work/riscv/coremark# less ./run2.log
2K validation run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 13323
Total time (secs): 13.323000
Iterations/Sec   : 2251.745102
Iterations       : 30000
Compiler version : GCC8.3.0
Compiler flags   : -O3 -DPERFORMANCE_RUN=1  -lrt
Memory location  : Please put data memory location here
                        (e.g. code in flash, data on heap etc)
seedcrc          : 0x18f2
[0]crclist       : 0xe3c1
[0]crcmatrix     : 0x0747
[0]crcstate      : 0x8d84
[0]crcfinal      : 0xff48
Correct operation validated. See README.md for run and reporting rules.

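The HiFive Unleashed runs at 1000 MHz, so the CoreMark/MHz score comes out to $2251.745102 / 1000 = 2.25$. The U54 core on the HiFive Unleashed is rated at 2.75 CoreMark/MHz, so the measured score is below the nominal figure. Overhead from running under an OS may be part of the reason.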
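The arithmetic behind that score can be written out as a small sketch. The struct and function names below are hypothetical, chosen only for illustration; the input values are the ones from run2.log and the 1000 MHz clock stated above.

```cpp
#include <cassert>

// Values as they appear in a CoreMark log (names here are illustrative).
struct CoremarkResult {
  double iterations;  // "Iterations" in the log
  double total_secs;  // "Total time (secs)" in the log
};

// Iterations per second, the raw CoreMark score.
double iterations_per_sec(const CoremarkResult& r) {
  assert(r.total_secs > 0.0);
  return r.iterations / r.total_secs;
}

// Normalize the raw score by the clock frequency to get CoreMark/MHz.
double coremark_per_mhz(const CoremarkResult& r, double clock_mhz) {
  assert(clock_mhz > 0.0);
  return iterations_per_sec(r) / clock_mhz;
}
```

Calling `coremark_per_mhz({30000.0, 13.323}, 1000.0)` gives about 2.25, which is roughly 82% of the nominal 2.75 CoreMark/MHz.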

Trying multi-core programming on the HiFive Unleashed

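Now that Debian is installed on the HiFive Unleashed, the range of programs we can try has widened. The multiple RISC-V cores are easy to put to work from C++ with a threading library such as Pthreads. Here I write a multithreaded C++ program (using std::thread, linked against Pthreads) and look at multi-core performance on the HiFive Unleashed.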

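Specifically, I run a numerical integration program on multiple cores of the HiFive Unleashed and measure its performance. Numerical integration computes the integral of a function over an interval $[a, b]$. Splitting the interval, computing each piece on a separate core, and adding the pieces at the end should not change the result. (A numerical-analysis specialist might point out details that need stricter care, but I will not worry about them here.)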
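Written out, the idea is as follows. The symbols $N$ (number of pieces), $a_k$ (piece boundaries), $h$ (step width) and $m$ (steps per piece) are introduced here for illustration and do not appear in the program.

$$\int_a^b f(x)\,dx = \sum_{k=0}^{N-1} \int_{a_k}^{a_{k+1}} f(x)\,dx, \qquad a_k = a + k\,\frac{b-a}{N}$$

Each piece can then be approximated independently with the composite trapezoidal rule:

$$\int_{a_k}^{a_{k+1}} f(x)\,dx \approx h\left(\frac{f(a_k) + f(a_{k+1})}{2} + \sum_{j=1}^{m-1} f(a_k + jh)\right), \qquad mh = a_{k+1} - a_k$$

For $f(x) = x^2$ on $[0, 1]$, the exact value is $1/3$, which gives a simple check on the result.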

#include <assert.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>   // strtoul, exit, EXIT_FAILURE
#include <iostream>
#include <limits.h>
#include <mutex>
#include <thread>
#include <vector>

// Shared result of the integration; every update goes through mtx_.
std::mutex mtx_;
double int_ans = 0.0;

double f(double x)
{
  return x * x;
}

void add_count(double ans)
{
  std::lock_guard<std::mutex> lock(mtx_);
  int_ans += ans;
}

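// Integrate f over [a, b] with the trapezoidal rule and add the result
// to the shared total.
void worker(double a, double b)
{
  constexpr double step = 0.00000001;  // width of one trapezoid

  double x = a;
  double s = 0.0;

  // Sum f at each sample point after a. The last sample lands at or just
  // past b, so the right endpoint is slightly over-weighted; the resulting
  // error is on the order of step.
  while(x < b) {
    x = x + step;
    s = s + f(x);
  }

  s = step * ((f(a) + f(b)) / 2.0 + s);
  add_count(s);  // one locked update per thread
}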


int main (int argc, char **argv)
{
  size_t num_threads = 1;
  if (argc != 2) {
    std::cerr << "Error: multi_core [num_threads]\n";
    exit (EXIT_FAILURE);
  }

  // A zero or non-numeric argument falls back to a single thread.
  size_t val = strtoul (argv[1], NULL, 10);
  if (val != 0) {
    num_threads = val;
    std::cout << "Number of threads " << num_threads << " : ";
  }

  double length = 1.0 / num_threads;
  // Start timing.
  auto start = std::chrono::high_resolution_clock::now();

  std::vector<std::thread> threads;

  for(size_t i = 0; i < num_threads; ++i){
    // Lower bound of the i-th sub-interval (named so it does not shadow the timer's start).
    double lower = static_cast<double>(i) / num_threads;
    threads.emplace_back(worker, lower, lower + length);
  }

  for(auto& thread : threads){
    thread.join();
  }

  auto end = std::chrono::high_resolution_clock::now();
  auto dur = end - start;
  auto msec = std::chrono::duration_cast<std::chrono::milliseconds>(dur).count();
  std::cout << msec << "msec\n";
  std::cout << "Answer = " << int_ans << '\n';

  return 0;
}
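As a variation on the listing above, the partial sums can be returned from each task instead of being added to a global under a mutex. The following is only a sketch under stated assumptions: `trapezoid` and `integrate_parallel` are hypothetical names, the number of steps is passed as a parameter rather than fixed, and it is not the program that produced the timings in this article.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Integrand, same as in the listing above.
double f(double x) { return x * x; }

// Composite trapezoidal rule on [a, b] with n equal steps.
// Sample points are computed from an integer index, so rounding error
// does not accumulate in x.
double trapezoid(double a, double b, std::size_t n) {
  assert(n > 0);
  const double h = (b - a) / static_cast<double>(n);
  double s = (f(a) + f(b)) / 2.0;
  for (std::size_t j = 1; j < n; ++j) {
    s += f(a + static_cast<double>(j) * h);
  }
  return h * s;
}

// Hypothetical helper: split [0, 1] into num_threads pieces, integrate each
// piece in its own task, and sum the partial results in the calling thread.
double integrate_parallel(std::size_t num_threads, std::size_t steps_per_thread) {
  assert(num_threads > 0);
  const double length = 1.0 / static_cast<double>(num_threads);
  std::vector<std::future<double>> parts;
  for (std::size_t i = 0; i < num_threads; ++i) {
    const double lower = static_cast<double>(i) * length;
    // std::launch::async requests a separate thread for each piece.
    parts.push_back(std::async(std::launch::async, trapezoid,
                               lower, lower + length, steps_per_thread));
  }
  double total = 0.0;
  for (auto& p : parts) {
    total += p.get();  // get() blocks until that piece has finished
  }
  return total;
}
```

For example, `integrate_parallel(4, 10000)` should return a value very close to 1/3. Because each task returns its own result, no mutex or global variable is needed in this version.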

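The main listing (multi_core.cpp) sets the integration range to $[0, 1.0)$ and splits it into as many sub-intervals as there are threads. Let's compile it and measure the run time with 1, 2, and 4 threads, plus 8 threads for comparison.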

# g++ multi_core.cpp -lpthread
# ./a.out 1
Number of threads 1 : 6905msec
Answer = 0.333333
# ./a.out 2
Number of threads 2 : 3453msec
Answer = 0.333333
# ./a.out 4
Number of threads 4 : 1727msec
Answer = 0.333333
# ./a.out 8
Number of threads 8 : 1729msec
Answer = 0.333333
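To read these timings as speedup figures, a small helper is enough. This is a minimal sketch; the function names are hypothetical, and the inputs are the millisecond values printed above.

```cpp
#include <cassert>

// Speedup of a parallel run relative to the single-thread run.
double speedup(double t1_msec, double tn_msec) {
  assert(tn_msec > 0.0);
  return t1_msec / tn_msec;
}

// Parallel efficiency: speedup divided by the number of threads used.
double efficiency(double t1_msec, double tn_msec, unsigned threads) {
  assert(threads > 0);
  return speedup(t1_msec, tn_msec) / threads;
}
```

With the measured times, `speedup(6905, 3453)` is about 2.0 and `speedup(6905, 1727)` is about 4.0, while 8 threads (1729 msec) still give about 4.0, an efficiency of about 0.5. That plateau is what one would expect if four cores are available to Linux.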