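Reading the BOOM Explorer paper (1. Introduction)

I am reading a paper called BOOM Explorer. It appears to be a toolset that explores the parameters of BOOM, a RISC-V out-of-order core, to find the optimal microarchitecture design. I am very interested in how it finds that design, so I decided to read through it.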

ieeexplore.ieee.org

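The goal is to search for the optimal parameters of a microarchitecture.

Here "optimal" refers to power and performance.

BOOM-Explorer searches for the best balance between power and performance.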

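The MicroAL algorithm, which draws on advanced microarchitecture knowledge, is used to create the initial design set.
A Gaussian process model with deep kernel learning functions (DKL-GP) is used to characterize the design space.
Correlated multi-objective Bayesian optimization is used to search for Pareto-optimal designs.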
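
To make "Pareto-optimal" concrete, here is a minimal Python sketch. It assumes each design is summarized by a hypothetical (power, performance) pair, where lower power and higher performance are better. The numbers are made up for illustration and do not come from the paper.

```python
def dominates(a, b):
    """Return True if design a dominates design b.

    Each design is a (power, performance) tuple: lower power is better,
    higher performance is better. a dominates b when it is no worse in
    both objectives and strictly better in at least one.
    """
    power_a, perf_a = a
    power_b, perf_b = b
    no_worse = power_a <= power_b and perf_a >= perf_b
    strictly_better = power_a < power_b or perf_a > perf_b
    return no_worse and strictly_better


def pareto_front(designs):
    """Keep only the designs that no other design dominates."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]


# Made-up (power, performance) values, for illustration only.
designs = [(1.0, 1.0), (2.0, 2.0), (2.5, 1.5), (3.0, 2.0), (1.5, 1.8)]
print(pareto_front(designs))
```

A design on the Pareto front cannot be improved in one objective without giving up something in the other, which is why the search targets a set of designs rather than a single winner.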

In the experiments, BOOM-Explorer reached a power-performance balance comparable to refined designs created by senior engineers, and did so in less time.

Introduction

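The microarchitecture determines how an ISA is implemented. Through a processor's internal components and their parameters, it affects power, performance, and circuit area in a given process technology [4], [5].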

RISC-V BOOM is an open-source out-of-order CPU written in Chisel. Using Chisel makes it possible to build flexible, parameterized designs.

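There are several ways to find the optimal point of a microarchitecture, that is, the best balance between power and performance. However, the design space is too large to evaluate exhaustively, and once various features are combined (special CPU queues, buffers, branch predictors, vector execution units, external coprocessors, and so on), exploring the microarchitecture becomes even harder.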
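
To get a feel for why exhaustive evaluation is out of reach, here is a minimal Python sketch that counts the configurations of a toy parameter space. The parameter names, the candidate values, and the 10 hours per design are all assumptions made up for this illustration; they are not BOOM's actual design space or the paper's measured flow time.

```python
import math

# Hypothetical parameters and candidate values (NOT BOOM's real design space).
design_space = {
    "fetch_width": [4, 8],
    "decode_width": [1, 2, 3, 4, 5],
    "rob_entries": [32, 64, 96, 128, 160],
    "int_phys_regs": [48, 64, 80, 96, 112],
    "ldq_entries": [8, 16, 24, 32],
    "dcache_ways": [2, 4, 8],
}


def count_designs(space):
    """Number of configurations = product of the candidate counts."""
    return math.prod(len(values) for values in space.values())


def evaluation_days(num_designs, hours_per_design):
    """Wall-clock days needed if designs are evaluated one after another."""
    return num_designs * hours_per_design / 24


total = count_designs(design_space)
print(total)
# hours_per_design=10 is an assumed figure, purely for illustration.
print(evaluation_days(total, hours_per_design=10))
```

Even six parameters with a handful of values each multiply into thousands of designs, and every additional parameter multiplies the count again.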

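[6] explores and learns the design space by combining a refined sampling algorithm with AdaBoost.

Another approach is to have engineers with deep microarchitecture expertise craft the optimal design. However, this does not scale.

A further approach is to use a coarse-grained microarchitecture simulator, but this also has its limits.

[7-10] derive parameters using coarse-grained microarchitecture models.

Alternatively, simulation speed can be improved by cutting redundant overhead [11-14].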

These approaches have several limitations.

With static analysis ...
Simulation with coarse-grained models is limited in accuracy.
Power consumption is hard to model [15].

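For these reasons, the paper proposes BOOM-Explorer. Without sacrificing model accuracy, BOOM-Explorer uses a microarchitecture-aware active learning (MicroAL) algorithm to embed prior knowledge of BOOM and to sample the BOOM design space with as few samples as possible.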
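
The details of MicroAL are not covered in this section, so the following is only a stand-in to illustrate the general idea of picking a small but diverse initial sample from a large candidate set. It uses greedy farthest-point sampling, which is a different and much simpler technique than MicroAL. The design encoding is hypothetical.

```python
import math


def farthest_point_sampling(points, k):
    """Greedily pick k indices so that each new point is as far as
    possible from the points already picked (a simple diversity heuristic)."""
    chosen = [0]  # start from the first point (an arbitrary choice)
    while len(chosen) < min(k, len(points)):
        best_idx, best_dist = None, -1.0
        for i, p in enumerate(points):
            if i in chosen:
                continue
            # Distance from p to its nearest already-chosen point.
            d = min(math.dist(p, points[j]) for j in chosen)
            if d > best_dist:
                best_idx, best_dist = i, d
        chosen.append(best_idx)
    return chosen


# Hypothetical designs encoded as (decode_width, rob_entries / 32).
candidates = [(1, 1), (1, 2), (2, 2), (4, 4), (4, 1), (3, 3)]
print(farthest_point_sampling(candidates, 3))
```

The point of such a selection step is that each sampled design is expensive to evaluate, so the initial set should cover the space rather than cluster in one corner.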

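Next, a Gaussian process model with deep kernel learning functions (DKL-GP), initialized with the designs chosen by MicroAL, is used to characterize the features of different microarchitectures.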
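
As a rough picture of what "a Gaussian process with a deep kernel" means, here is a minimal pure-Python sketch: inputs are passed through a small neural network, and an RBF kernel is applied to the network's outputs. This is a simplification under stated assumptions. In actual deep kernel learning the network weights are trained together with the GP hyperparameters, whereas here the weights, the training data, and the noise level are hand-picked toy values.

```python
import math

# Hand-picked toy weights for a one-layer network g(x): R^2 -> R^2.
# In real deep kernel learning these would be learned from data.
W = [[0.5, -0.2], [0.1, 0.4]]
B = [0.0, 0.1]


def feature_map(x):
    """g(x) = tanh(W x + b)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W, B)]


def deep_kernel(x1, x2, length_scale=1.0):
    """RBF kernel on network features: exp(-|g(x1) - g(x2)|^2 / (2 l^2))."""
    z1, z2 = feature_map(x1), feature_map(x2)
    sq = sum((a - b) ** 2 for a, b in zip(z1, z2))
    return math.exp(-sq / (2.0 * length_scale ** 2))


def solve(A, y):
    """Solve A v = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [yi] for row, yi in zip(A, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    v = [0.0] * n
    for r in range(n - 1, -1, -1):
        v[r] = (M[r][n] - sum(M[r][c] * v[c] for c in range(r + 1, n))) / M[r][r]
    return v


def gp_posterior_mean(train_x, train_y, test_x, noise=1e-6):
    """GP regression mean: k_*^T (K + noise * I)^{-1} y."""
    K = [[deep_kernel(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(train_x)] for i, a in enumerate(train_x)]
    alpha = solve(K, train_y)
    return sum(deep_kernel(test_x, a) * al for a, al in zip(train_x, alpha))


# Made-up training data: encoded design -> measured metric.
train_x = [[1.0, 2.0], [2.0, 1.0], [4.0, 3.0]]
train_y = [0.8, 1.1, 1.9]
print(gp_posterior_mean(train_x, train_y, [2.0, 1.0]))
print(gp_posterior_mean(train_x, train_y, [3.0, 2.0]))
```

With a very small noise term the model nearly reproduces the training values at the training points, and between them it interpolates according to how similar the network features are.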

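The design space is then explored with correlated multi-objective Bayesian optimization [17] built on DKL-GP. With this framework, microarchitectures with a better balance of performance and power can be found while evaluating as few microarchitecture designs as possible.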
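
One common way to score a candidate in multi-objective optimization is by how much it would enlarge the hypervolume dominated by the current Pareto front. This section does not say which acquisition function the paper uses, so the sketch below is only an illustration of that general idea. It assumes two objectives that are both minimized (for example power and cycle count), and the points and reference point are made up.

```python
def pareto_min(points):
    """Non-dominated points when both objectives are minimized."""
    return [p for p in points
            if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)]


def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by the reference point `ref`.

    `ref` should be worse than every point in both objectives.
    """
    # Sorted ascending in objective 0, the front is descending in objective 1.
    front = sorted(pareto_min(points))
    area, prev_y = 0.0, ref[1]
    for x, y in front:
        if x < ref[0] and y < prev_y:
            area += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return area


def hypervolume_improvement(points, candidate, ref):
    """How much the dominated area grows if `candidate` is added."""
    return hypervolume_2d(points + [candidate], ref) - hypervolume_2d(points, ref)


# Made-up (objective0, objective1) values; both are minimized.
evaluated = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]
reference = (5.0, 5.0)

print(hypervolume_2d(evaluated, reference))
print(hypervolume_improvement(evaluated, (3.0, 1.5), reference))
print(hypervolume_improvement(evaluated, (3.0, 3.0), reference))
```

A candidate that is dominated by an already evaluated design adds no area, so a search guided by this kind of score naturally spends its evaluation budget on designs that could extend the front.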

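To obtain the most representative designs from RISC-V BOOM's enormous design space, the paper introduces, for the first time, a microarchitecture-aware active learning method based on transductive experimental design.
A new Gaussian process model with deep kernel learning, combined with correlated multi-objective Bayesian optimization, characterizes the microarchitecture design space. With the help of DKL-GP, designs that balance power consumption and performance are explored.
The framework was validated on BOOM in an advanced 7 nm technology. The experimental results demonstrate that BOOM-Explorer performs well across a variety of BOOM microarchitectures.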

References

[1] K. Asanovic, D. A. Patterson, and C. Celio, “The berkeley out-of-order machine (BOOM): An industry-competitive, synthesizable, parameterized RISC-V processor,” University of California at Berkeley, Tech. Rep., 2015.
[2] C. P. Celio, A Highly Productive Implementation of an Out-of-Order Processor Generator. eScholarship, University of California, 2017.
[3] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis, J. Wawrzynek, and K. Asanović, “Chisel: constructing hardware in a scala embedded language,” in ACM/IEEE Design Automation Conference (DAC), 2012, pp. 1212–1221.
[4] S. Salamin, M. Rapp, A. Pathania, A. Maity, J. Henkel, T. Mitra, and H. Amrouch, “Power-efficient heterogeneous many-core design with ncfet technology,” IEEE Transactions on Computers, vol. 70, no. 9, pp. 1484–1497, 2021.
[5] B. Grayson, J. Rupley, G. Z. Zuraski, E. Quinnell, D. A. Jiménez, T. Nakra, P. Kitchin, R. Hensley, E. Brekelbaum, V. Sinha et al., “Evolution of the samsung exynos CPU microarchitecture,” in IEEE/ACM International Symposium on Computer Architecture (ISCA), 2020, pp. 40–51.
[6] D. Li, S. Yao, Y.-H. Liu, S. Wang, and X.-H. Sun, “Efficient design space exploration via statistical sampling and adaboost learning,” in ACM/IEEE Design Automation Conference (DAC), 2016, pp. 1–6.
[7] M. Moudgill, P. Bose, and J. H. Moreno, “Validation of turandot, a fast processor model for microarchitecture exploration,” in International Performance Computing and Communications Conference (IPCCC), 1999, pp. 451–457.
[8] T. Austin, E. Larson, and D. Ernst, “SimpleScalar: An infrastructure for computer system modeling,” Computer, vol. 35, no. 2, pp. 59–67, 2002.
[9] D. Brooks, P. Bose, V. Srinivasan, M. K. Gschwind, P. G. Emma, and M. G. Rosenfield, “New methodology for early-stage, microarchitecture-level power-performance analysis of microprocessors,” IBM Journal of Research and Development, vol. 47, no. 5.6, pp. 653–670, 2003.
[10] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The gem5 simulator,” SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, Aug. 2011. [Online]. Available: https://doi.org/10.1145/2024716.2024718
[11] E. Perelman, G. Hamerly, M. Van Biesbrouck, T. Sherwood, and B. Calder, “Using simpoint for accurate and efficient simulation,” ACM SIGMETRICS Performance Evaluation Review, vol. 31, no. 1, pp. 318–319, 2003.
[12] Y.-I. Kim and C.-M. Kyung, “Automatic translation of behavioral testbench for fully accelerated simulation,” in IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2004, pp. 218–221.
[13] S. Beamer and D. Donofrio, “Efficiently exploiting low activity factors to accelerate RTL simulation,” in ACM/IEEE Design Automation Conference (DAC), 2020, pp. 1–6.
[14] J. Feldmann, K. Kraft, L. Steiner, N. Wehn, and M. Jung, “Fast and accurate DRAM simulation: Can we further accelerate it?” in IEEE/ACM Proceedings Design, Automation and Test in Europe (DATE), 2020, pp. 364–369.
[15] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures,” in IEEE/ACM International Symposium on Microarchitecture (MICRO), 2009, pp. 469–480.

