FPGA Development Diary

Article index by category: https://msyksphinz.github.io/github_pages , English version: https://fpgadevdiary.hatenadiary.com/

Checking How Interrupts Are Injected on RISC-V

I have been checking how interrupts are injected on RISC-V by running test patterns. The test pattern below injects an interrupt by controlling MIP (Machine Interrupt Pending) and MIE (Machine Interrupt Enable).

For this check I am using the RocketChip environment from Chipyard.

$ cd chipyard/sims/verilator
$ make CONFIG=RocketConfig debug
$ ./simulator-chipyard-RocketConfig-debug +verbose -v rv64mi-p-illegal.vcd ${RISCV}/riscv64-unknown-elf/share/riscv-tests/isa/rv64mi-p-illegal 2>&1 | spike-dasm | tee rv64mi-p-illegal.rocket.log
C0:        564 [1] pc=[00000000800001b0] W[r 7=0000000a00000880][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[300023f3] csrr    t2, mstatus
C0:        567 [1] pc=[00000000800001b4] W[r 7=0000000000000800][1] R[r 7=0000000a00000880] R[r 5=0000000000001800] inst=[0053f3b3] and     t2, t2, t0
C0:        568 [1] pc=[00000000800001b8] W[r 0=0000000000000000][0] R[r 6=0000000000000800] R[r 7=0000000000000800] inst=[0c731e63] bne     t1, t2, pc + 220
C0:        569 [1] pc=[00000000800001bc] W[r 0=0000000000000080][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[34415073] csrwi   mip, 2
C0:        590 [1] pc=[00000000800001c0] W[r 0=0000000000000000][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[30415073] csrwi   mie, 2
C0:        595 [1] pc=[00000000800001c4] W[r 5=00000000800001c4][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[00000297] auipc   t0, 0x0
C0:        596 [1] pc=[00000000800001c8] W[r 5=0000000080000301][1] R[r 5=00000000800001c4] R[r 0=0000000000000000] inst=[13d28293] addi    t0, t0, 317
C0:        597 [1] pc=[00000000800001cc] W[r 8=0000000080000004][1] R[r 5=0000000080000301] R[r 0=0000000000000000] inst=[30529473] csrrw   s0, mtvec, t0
C0:        602 [1] pc=[00000000800001d0] W[r 5=0000000080000301][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[305022f3] csrr    t0, mtvec
C0:        605 [1] pc=[00000000800001d4] W[r 5=0000000000000001][1] R[r 5=0000000080000301] R[r 0=0000000000000000] inst=[0012f293] andi    t0, t0, 1
C0:        606 [1] pc=[00000000800001d8] W[r 0=0000000000000000][0] R[r 5=0000000000000001] R[r 0=0000000000000000] inst=[00028663] beqz    t0, pc + 12
C0:        607 [1] pc=[00000000800001dc] W[r 0=0000000a00000880][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[30046073] csrsi   mstatus, 8
C0:        612 [0] pc=[00000000800001e0] W[r 0=0000000000000000][0] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[0000006f] j       pc + 0x0
C0:        617 [1] pc=[0000000080000304] W[r 0=0000000080000308][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[ee1ff06f] j       pc - 0x120
C0:        619 [1] pc=[00000000800001e4] W[r 0=0000000080000301][1] R[r 8=0000000080000004] R[r 0=0000000000000000] inst=[30541073] csrw    mtvec, s0
C0:        624 [1] pc=[00000000800001e8] W[r 0=0000000000000000][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[30315073] csrwi   mideleg, 2
C0:        629 [1] pc=[00000000800001ec] W[r 5=00000000800001ec][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[00000297] auipc   t0, 0x0
C0:        630 [1] pc=[00000000800001f0] W[r 5=0000000080000214][1] R[r 5=00000000800001ec] R[r 0=0000000000000000] inst=[02828293] addi    t0, t0, 40
C0:        631 [1] pc=[00000000800001f4] W[r 0=00000000800001e0][1] R[r 5=0000000080000214] R[r 0=0000000000000000] inst=[34129073] csrw    mepc, t0
C0:        632 [1] pc=[00000000800001f8] W[r 5=0000000000002000][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[000022b7] lui     t0, 0x2
C0:        633 [1] pc=[00000000800001fc] W[r 5=0000000000001800][1] R[r 5=0000000000002000] R[r 0=0000000000000000] inst=[8002829b] addiw   t0, t0, -2048

Since MIP=2 and MIE=2 are written, SSIP (Supervisor Software Interrupt Pending) and SSIE (Supervisor Software Interrupt Enable) are set, and then setting bit 3 of mstatus to 1, i.e. mstatus.MIE=1, causes the interrupt to be taken.
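
For reference, here are the bit positions involved, taken from the RISC-V privileged specification (a quick sketch only; the constant names are mine, not Rocket's):

# Bit positions per the RISC-V privileged spec; constant names are illustrative.
MIP_SSIP    = 1 << 1   # csrwi mip, 2     -> sets SSIP (supervisor software interrupt pending)
MIE_SSIE    = 1 << 1   # csrwi mie, 2     -> sets SSIE (supervisor software interrupt enable)
MSTATUS_MIE = 1 << 3   # csrsi mstatus, 8 -> sets mstatus.MIE (global machine interrupt enable)

assert MIP_SSIP == 2 and MIE_SSIE == 2 and MSTATUS_MIE == 8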

Since mtvec is set to 0x80000301, I expected the interrupt to jump somewhere around that address, but in the Rocket log it jumps to 0x80000304. Why?
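
One likely explanation, going only by the privileged spec rather than the Rocket source: bit 0 of mtvec selects vectored mode, so the base address is 0x80000300 and an interrupt with exception code N vectors to base + 4*N; a supervisor software interrupt has code 1, which would give exactly 0x80000304. A minimal sketch of that arithmetic:

# Sketch of the vectored trap-target calculation, assuming spec-standard behaviour.
mtvec = 0x80000301

mode = mtvec & 0x3                    # mtvec[1:0]: 0 = direct, 1 = vectored
base = mtvec & ~0x3                   # base address (4-byte aligned)

SUPERVISOR_SOFTWARE_INTERRUPT = 1     # interrupt exception code

target = base + 4 * SUPERVISOR_SOFTWARE_INTERRUPT if mode == 1 else base
print(hex(target))                    # 0x80000304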

One more thing: I had assumed MIP would be cleared automatically, but that does not seem to be the case. It looks like I need to check the Rocket-Chip implementation to see how the pending interrupt is suppressed after it has been taken once.


Survey of International Conferences on Computer Architecture (2. ICCAD)

I want to get a rough sense of recent research trends in computer architecture, so I decided to survey the conferences.

Next up is ICCAD. It leans toward EDA, semiconductor design tools, and design environments. Let me pull out the 2021 sessions that look interesting.

  • ICCAD International Conference on Computer-Aided Design
    • Keynote
      • Efficient Computing for AI and Robotics: From Hardware Accelerators to Algorithm Design
      • Challenges and opportunities in GaN power electronics
      • RISC-V Is Inevitable
      • Designing Reliable Distributed Systems
    • 1A. Efficient DNN training and secure/robust DNN inference
    • 1B. Advances in Boolean methods for synthesis
      • Engineering an Efficient Boolean Functional Synthesis Engine
      • Enhanced Fast Boolean Matching based on Sensitivity Signatures Pruning
      • An Efficient Two-Phase Method for Prime Compilation of Non-Clausal Boolean Formulae
      • Heuristics for Million-scale Two-level Logic Minimization
    • 1C. Power model calibration and computing with approximation and uncertainty
      • McPAT-Calib: A Microarchitecture Power Modeling Framework for Modern CPUs
      • Positive/Negative Approximate Multipliers for DNN Accelerators
      • MinSC: An Exact Synthesis-Based Method for Minimal Area Stochastic Circuits under Relaxed Error Bound
      • CORLD: In-Stream Correlation Manipulation for Low-Discrepancy Stochastic Computing
    • 1D. Cross Layer design solutions for energy-efficient and secure edge AI
      • TinyML: Massive Opportunity for Edge AI when Machine Intelligence meets the Real World of Billions of Sensors
      • Challenges and Opportunities in Security and Reliability of Edge AI
      • A General Hardware and Software Co-Design Framework for Energy-Efficient Edge AI
      • Towards Energy-Efficient and Secure Edge AI: A Cross-Layer Framework
    • 2A. Efficient DNN inference and tools
    • 2B. LOLOL: Lots of logic locking and unlocking
      • UNTANGLE: Unlocking Routing and Logic Obfuscation Using Graph Neural Networks-based Link Prediction
      • Circuit Deobfuscation from Power Side-Channels using Pseudo-Boolean SAT
      • Exploring eFPGA-based Redaction for IP Protection
    • 2C. Quantum CAD matters
    • 2D. Multi-Core System design and optimization for the big data era
      • BOOM-Explorer: RISC-V BOOM Microarchitecture Design Space Exploration Framework
      • DARe: DropLayer-Aware Manycore ReRAM architecture for Training Graph Neural Networks
      • IPA: Floorplan-Aware SystemC Interconnect Performance Modeling and Generation for HLS-based SoCs
      • Theoretical Analysis and Evaluation of NoCs with Weighted Round Robin Arbitration
    • 3A. Algorithm-Hardware co-design for machine learning hardware accelerators
    • 3B. Routing with wires and light
    • 3C. CAD for novel electronic applications
    • 3D. Quantum machine learning: from algorithm to applications
    • 4A. Acceleration of emerging deep learning techniques
    • 4B. Resilient and efficient embedded applications
    • 4C. Simulation and Test generation tools
      • Accelerate Logic Re-simulation on GPU via Gate/Event Parallelism and State Compression
      • Generating Architecture-Level Abstractions from RTL Designs for Processors and Accelerators, Part I: Determining Architectural State Variables
      • Machine Learning-Based Test Pattern Generation for Neuromorphic Chips
      • Banshee: A Fast LLVM-Based RISC-V Binary Translator
    • 4D. VLSI for 5G and beyond wireless in the AI era: Algorithm, hardware and system
    • 5A. Hardware software co-design for advanced deep neural networks
    • 5B. Security of ML systems
    • 5C. Managing complexity with cell design and partitioning
      • Simultaneous Transistor Folding and Placement in Standard Cell Layout Synthesis
      • Exploring Physical Synthesis for Circuits based on Emerging Reconfigurable Nanotechnologies
      • HyperSF: Spectral Hypergraph Coarsening via Flow-based Local Clustering
      • TopoPart: a Multi-level Topology-Driven Partitioning Framework for Multi-FPGA Systems
    • 5D. Hardware aware learning for medicine
    • 6A. Brain-inspired computing and microfluidic bio-chips
    • 6B. New techniques in timing and power analysis
    • 6C. Machine learning methods for DFM
    • 6D. Tutorial: Ferroelectric FET technologies and its applications: from device to system

Survey of International Conferences on Computer Architecture (1. ISCA, continued)

I want to get a rough sense of recent research trends in computer architecture, so I decided to survey the conferences.

Starting with ISCA (International Symposium on Computer Architecture). I pulled the abstracts from the 2021 program to get a rough feel for the trends, and I will add brief summary comments for each paper later. On the architecture side there is a lot of work on speed-ups such as prefetching and vector execution; on the memory side, security and coherence topics seem to dominate.

  • ISCA International Symposium on Computer Architecture
    • Industry Track (6)
    • Microarchitecture (6)
      • Zero Inclusion Victim: Isolating Core Caches from Inclusive Last-Level Cache Evictions
        • On coherence control for the LLC
      • Exploiting Page Table Locality for Agile TLB Prefetching
        • Speeding up TLB prefetching
      • A Cost-Effective Entangling Prefetcher for Instructions
        • A cost-effective instruction prefetcher
      • Vector Runahead
        • On speculative memory loads
      • Unlimited Vector Extension with Data Streaming Support
        • Proposes a new type of vector instruction
      • Speculative Vectorisation with Selective Replay
        • Speculative vectorization with selective replay
    • Memory (10)
      • Don't Forget the I/O When Allocating Your LLC
        • Performance analysis of the LLC
      • PF-DRAM: A Precharge-Free DRAM Structure
        • Circuit-level work; a precharge-free DRAM
      • Efficient Multi-GPU Shared Memory via Automatic Optimization of Fine-Grained Transfers
      • CODIC: A Low-Cost Substrate for Enabling Custom In-DRAM Functionalities and Optimizations
        • Designs CODIC, a new low-cost DRAM substrate that gives fine-grained control over four internal DRAM timings that used to be fixed
      • NVOverlay: Enabling Efficient and Scalable High-Frequency Snapshotting to NVM
        • A scalable, efficient technique for taking frequent persistent snapshots to NVM that can later be accessed at random
      • Rebooting Virtual Memory with Midgard
      • Dvé: Improving DRAM Reliability and Performance On-Demand via Coherent Replication
        • A hardware-driven replication mechanism that replicates data blocks to two different sockets in a cache-coherent NUMA system
      • Ripple: Profile-Guided Instruction Cache Replacement for Data Center Applications
        • Profiles the program and uses program context to inform the replacement policy so it can make efficient replacement decisions
      • Quantifying Server Memory Frequency Margin and Using It to Improve Performance in HPC Systems
        • The first public study characterizing the frequency margin of commodity server memory modules
      • Revamping Storage Class Memory With Hardware Automated Memory-Over-Storage Solution
        • Proposes HAMS, a hardware-automated Memory-over-Storage (MoS) solution
    • Machine Learning (7)
      • Many of these feel like "we built one" papers
      • RaPiD: AI Accelerator for Ultra-Low Precision Training and Inference
      • REDUCT: Keep It Close, Keep It Cool! - Scaling DNN Inference on Multi-Core CPUs with Near-Cache Compute
        • Builds a solution that bypasses the conventional CPU resources that constrain and limit DNN inference performance
      • Communication Algorithm-Architecture Co-Design for Distributed Deep Learning
        • Proposes MultiTree, a topology- and resource-utilization-aware all-reduce algorithm for efficient, scalable all-reduce operations
      • SPACE: Locality-Aware Processing in Heterogeneous Memory for Personalized Recommendations
        • SPACE exploits compute-capable 3D-stacked DRAM alongside DIMMs for personalized recommendation
      • ELSA: Hardware-Software Co-Design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks
        • A hardware/software co-design solution that greatly reduces the execution time and energy spent on the self-attention mechanism
      • Cambricon-Q: A Hybrid Architecture for Efficient Training
        • Cambricon-Q is a hybrid architecture consisting of an ASIC acceleration core and a near-data-processing (NDP) engine
      • TENET: A Framework for Modeling Tensor Dataflow Based on Relation-Centric Notation
    • Processing in/near Memory (4)
      • ABC-DIMM: Alleviating the Bottleneck of Communication in DIMM-Based Near-Memory Processing with Inter-DIMM Broadcast
      • Sieve: Scalable In-Situ DRAM-Based Accelerator Designs for Massively Parallel k-mer Matching
      • FORMS: Fine-Grained Polarized ReRAM-Based In-Situ Computation for Mixed-Signal DNN Accelerator
      • BOSS: Bandwidth-Optimized Search Accelerator for Storage-Class Memory
    • Data Center (4)
      • SATORI: Efficient and Fair Resource Partitioning by Sacrificing Short-Term Benefits for Long-Term Gains
      • Confidential Serverless Made Efficient with Plug-In Enclaves
      • Flex: High-Availability Datacenters with Zero Reserved Power
      • BlockMaestro: Enabling Programmer-Transparent Task-Based Execution in GPU Systems
    • Security (3)
      • Opening Pandora's Box: A Systematic Study of New Ways Microarchitecture Can Leak Private Data
      • I See Dead μops: Leaking Secrets via Intel/AMD Micro-Op Caches
      • TimeCache: Using Time to Eliminate Cache Side Channels when Sharing Software
    • Accelerator (10)
      • Accelerated Seeding for Genome Sequence Alignment with Enumerated Radix Trees
      • Aurochs: An Architecture for Dataflow Threads
      • PipeZK: Accelerating Zero-Knowledge Proof with a Pipelined Architecture
      • Enabling Compute-Communication Overlap in Distributed Deep Learning Training Platforms
      • CoSA: Scheduling by Constrained Optimization for Spatial Accelerators
      • η-LSTM: Co-Designing Highly-Efficient Large LSTM Training via Exploiting Memory-Saving and Architectural Design Opportunities
      • NN-Baton: DNN Workload Orchestration and Chiplet Granularity Exploration for Multichip Accelerators
      • SNAFU: An Ultra-Low-Power, Energy-Minimal CGRA-Generation Framework and Architecture
      • SARA: Scaling a Reconfigurable Dataflow Accelerator
      • HASCO: Towards Agile HArdware and Software CO-design for Tensor Computation
    • Compiler (4)
      • Taming the Zoo: The Unified GraphIt Compiler Framework for Novel Architectures
      • Supporting Legacy Libraries on Non-Volatile Memory: A User-Transparent Approach
      • Execution Dependence Extension (EDE): ISA Support for Eliminating Fences
      • Hetero-ViTAL: A Virtualization Stack for Heterogeneous FPGA Clusters
    • Graph Processing (4)
      • FlexMiner: A Pattern-Aware Accelerator for Graph Pattern Mining
      • PolyGraph: Exposing the Value of Flexibility for Graph Processing Accelerators
      • Large-Scale Graph Processing on FPGAs with Caches for Thousands of Simultaneous Misses
    • Low Temperature (4)
      • Cost-Efficient Overclocking in Immersion-Cooled Datacenters
      • CryoGuard: A Near Refresh-Free Robust DRAM Design for Cryogenic Computing
      • Superconducting Computing with Alternating Logic Elements
      • Failure Sentinels: Ubiquitous Just-in-Time Intermittent Computation via Low-Cost Hardware Support for Voltage Monitoring
    • Network Storage Acceleration (3)
      • NASGuard: A Novel Accelerator Architecture for Robust Neural Architecture Search (NAS) Networks
      • NASA: Accelerating Neural Network Design with a NAS Processor
      • PMNet: In-Network Data Persistence
    • Quantum / Photonics (4)
      • Exploiting Long Distance Interactions and Tolerating Atom Loss in Neutral Atom Quantum Architectures
      • Software-Hardware Co-Optimization for Computational Chemistry on Superconducting Quantum Processors
      • Designing Calibration and Expressivity-Efficient Instruction Sets for Quantum Computing
      • Albireo: Energy-Efficient Acceleration of Convolutional Neural Networks via Silicon Photonics
    • Reliability & Security (4)
      • IntroSpectre: A Pre-Silicon Framework for Discovery and Analysis of Transient Execution Vulnerabilities
      • Maya: Using Formal Control to Obfuscate Power Side Channels
      • Demystifying the System Vulnerability Stack: Transient Fault Effects Across the Layers
      • No-FAT: Architectural Support for Low Overhead Memory Safety Checks
    • DRAM / IO / Network (3)
      • Ghost Routing to Enable Oblivious Computation on Memory-Centric Networks
      • QUAC-TRNG: High-Throughput True Random Number Generation Using Quadruple Row Activation in Commodity DRAM Chips
      • A RISC-V In-Network Accelerator for Flexible High-Performance Low-Power Packet Processing
    • Sparse Processing (7)
      • Leaky Buddies: Cross-Component Covert Channels on Integrated CPU-GPU Systems
      • IChannels: Exploiting Current Management Mechanisms to Create Covert Channels in Modern Processors
      • ZeRØ: Zero-Overhead Resilient Operation Under Pointer Integrity Attacks
      • SpZip: Architectural Support for Effective Data Compression In Irregular Applications
      • Dual-Side Sparse Tensor Core
      • RingCNN: Exploiting Algebraically-Sparse Ring Tensors for Energy-Efficient CNN-Based Computational Imaging
      • GoSPA: An Energy-Efficient High-Performance Globally Optimized SParse Convolutional Neural Network Accelerator

Implementing My Own RISC-V CPU Core (Test Pattern Debugging)

I am slowly chipping away at debugging my own CPU. Setting the TLB issue aside for now, I am running test patterns to flush out problems. I have sorted out environment-related issues and fixed things to the point where the basic cases work. Tests with the huge configurations have also started to run to some extent, but the test pattern PASS rate is still low.

0b3bcc9 (2022/01/16)
rv32_big.log 89 / 109 (81.65%)
rv32_giant.log 40 / 109 (36.70%)
rv32_small.log 70 / 109 (64.22%)
rv32_standard.log 87 / 109 (79.82%)
rv32_tiny.log 92 / 109 (84.40%)
rv64_big.log 105 / 169 (62.13%)
rv64_giant.log 80 / 169 (47.34%)
rv64_small.log 87 / 169 (51.48%)
rv64_standard.log 88 / 169 (52.07%)
rv64_tiny.log 152 / 169 (89.94%)

I have been thinking about how to manage the test pattern PASS rate. How should daily regression results be tracked? One idea is to output the test results in some format (JSON, YAML, etc.), append them every day, and use a script to convert the history into graphs. Is there a good way to manage this?
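
A minimal sketch of that flow (the file name and record layout are made up for illustration and are not from my actual environment):

# Hedged sketch: append one JSON record per configuration per run,
# then turn the accumulated history into a graph.
import json
import datetime
from pathlib import Path

import matplotlib.pyplot as plt

HISTORY = Path("regression_history.jsonl")   # hypothetical file name

def append_result(config, passed, total):
    """Append today's result for one configuration as a JSON line."""
    record = {
        "date": datetime.date.today().isoformat(),
        "config": config,
        "passed": passed,
        "total": total,
    }
    with HISTORY.open("a") as f:
        f.write(json.dumps(record) + "\n")

def plot_history(png="pass_rate.png"):
    """Plot the PASS rate over time for each configuration."""
    series = {}
    for line in HISTORY.read_text().splitlines():
        r = json.loads(line)
        rate = 100.0 * r["passed"] / r["total"]
        series.setdefault(r["config"], []).append((r["date"], rate))
    for config, points in series.items():
        dates, rates = zip(*sorted(points))
        plt.plot(dates, rates, marker="o", label=config)
    plt.ylabel("PASS rate [%]")
    plt.legend()
    plt.savefig(png)

# Example: record one of the results from the table above.
append_result("rv64_tiny", 152, 169)
plot_history()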

Also, I still don't have a good grasp of how to work with GitHub Actions.

Notes on Executing SFENCE.VMA

A note on a point where I got stuck implementing SFENCE.VMA. Until now I had treated SFENCE.VMA as just another CSR-style instruction executed in order, but in fact it has to wait until the preceding store operations have completed.

3509 : 1325 : PC=[00000000ffc0245c] (M,46,01) 0065a023 sw      t1, 0(a1)
3509 : 1326 : PC=[00000000ffc02460] (M,46,02) 12050073 sfence.vma a0, zero

Because these two instructions execute in the same group, the SW instruction ends up writing its data to the L1D after SFENCE.VMA has committed. If the write to the L1D is delayed, a later instruction that triggers a PTW access will read a stale page table.

To prevent this, it seems I need to change the constraint so that SFENCE.VMA always executes alone, and add control so that it waits until the SW instruction has left the ST-Buffer and the buffer is completely empty.
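
A rough model of that rule (only a sketch; the names and data structures are hypothetical, not taken from my RTL):

# Hedged model of the constraint: SFENCE.VMA must be grouped alone and must
# wait until the ST-Buffer has completely drained before it is allowed to go.
from dataclasses import dataclass

@dataclass
class Inst:
    mnemonic: str

def can_issue_sfence_vma(group, store_buffer_entries):
    """True only when SFENCE.VMA is the sole instruction in its group and
    all older stores have left the ST-Buffer."""
    alone = len(group) == 1 and group[0].mnemonic == "sfence.vma"
    return alone and store_buffer_entries == 0

# The problematic case from the log: sw and sfence.vma share one group.
assert not can_issue_sfence_vma([Inst("sw"), Inst("sfence.vma")], store_buffer_entries=1)
# After the fix: sfence.vma alone, with the ST-Buffer empty.
assert can_issue_sfence_vma([Inst("sfence.vma")], store_buffer_entries=0)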

Survey of International Conferences on Computer Architecture (1. ISCA)

I want to get a rough sense of recent research trends in computer architecture, so I decided to survey the conferences.

Starting with ISCA (International Symposium on Computer Architecture). I pulled the abstracts from the 2021 program to get a rough feel for the trends, and I will add brief summary comments for each paper later. On the architecture side there is a lot of work on speed-ups such as prefetching and vector execution; on the memory side, security and coherence topics seem to dominate.

  • ISCA International Symposium on Computer Architecture

    • Industry Track (6)
    • Microarchitecture (6)
      • Zero Inclusion Victim: Isolating Core Caches from Inclusive Last-Level Cache Evictions
        • On coherence control for the LLC
      • Exploiting Page Table Locality for Agile TLB Prefetching
        • Speeding up TLB prefetching
      • A Cost-Effective Entangling Prefetcher for Instructions
        • A cost-effective instruction prefetcher
      • Vector Runahead
        • On speculative memory loads
      • Unlimited Vector Extension with Data Streaming Support
        • Proposes a new type of vector instruction
      • Speculative Vectorisation with Selective Replay
        • Speculative vectorization with selective replay
    • Memory (10)
      • Don't Forget the I/O When Allocating Your LLC
        • Performance analysis of the LLC
      • PF-DRAM: A Precharge-Free DRAM Structure
        • Circuit-level work; a precharge-free DRAM
      • Efficient Multi-GPU Shared Memory via Automatic Optimization of Fine-Grained Transfers
      • CODIC: A Low-Cost Substrate for Enabling Custom In-DRAM Functionalities and Optimizations
        • Designs CODIC, a new low-cost DRAM substrate that gives fine-grained control over four internal DRAM timings that used to be fixed
      • NVOverlay: Enabling Efficient and Scalable High-Frequency Snapshotting to NVM
        • A scalable, efficient technique for taking frequent persistent snapshots to NVM that can later be accessed at random
      • Rebooting Virtual Memory with Midgard
      • Dvé: Improving DRAM Reliability and Performance On-Demand via Coherent Replication
        • A hardware-driven replication mechanism that replicates data blocks to two different sockets in a cache-coherent NUMA system
      • Ripple: Profile-Guided Instruction Cache Replacement for Data Center Applications
        • Profiles the program and uses program context to inform the replacement policy so it can make efficient replacement decisions
      • Quantifying Server Memory Frequency Margin and Using It to Improve Performance in HPC Systems
        • The first public study characterizing the frequency margin of commodity server memory modules
      • Revamping Storage Class Memory With Hardware Automated Memory-Over-Storage Solution
        • Proposes HAMS, a hardware-automated Memory-over-Storage (MoS) solution
    • Machine Learning (7)
      • RaPiD: AI Accelerator for Ultra-Low Precision Training and Inference
      • REDUCT: Keep It Close, Keep It Cool! - Scaling DNN Inference on Multi-Core CPUs with Near-Cache Compute
      • Communication Algorithm-Architecture Co-Design for Distributed Deep Learning
      • SPACE: Locality-Aware Processing in Heterogeneous Memory for Personalized Recommendations
      • ELSA: Hardware-Software Co-Design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks
      • Cambricon-Q: A Hybrid Architecture for Efficient Training
      • TENET: A Framework for Modeling Tensor Dataflow Based on Relation-Centric Notation
    • Processing in/near Memory (4)
      • ABC-DIMM: Alleviating the Bottleneck of Communication in DIMM-Based Near-Memory Processing with Inter-DIMM Broadcast
      • Sieve: Scalable In-Situ DRAM-Based Accelerator Designs for Massively Parallel k-mer Matching
      • FORMS: Fine-Grained Polarized ReRAM-Based In-Situ Computation for Mixed-Signal DNN Accelerator
      • BOSS: Bandwidth-Optimized Search Accelerator for Storage-Class Memory
    • Data Center (4)
      • SATORI: Efficient and Fair Resource Partitioning by Sacrificing Short-Term Benefits for Long-Term Gains
      • Confidential Serverless Made Efficient with Plug-In Enclaves
      • Flex: High-Availability Datacenters with Zero Reserved Power
      • BlockMaestro: Enabling Programmer-Transparent Task-Based Execution in GPU Systems
    • Security (3)
      • Opening Pandora's Box: A Systematic Study of New Ways Microarchitecture Can Leak Private Data
      • I See Dead μops: Leaking Secrets via Intel/AMD Micro-Op Caches
      • TimeCache: Using Time to Eliminate Cache Side Channels when Sharing Software
    • Accelerator (10)
      • Accelerated Seeding for Genome Sequence Alignment with Enumerated Radix Trees
      • Aurochs: An Architecture for Dataflow Threads
      • PipeZK: Accelerating Zero-Knowledge Proof with a Pipelined Architecture
      • Enabling Compute-Communication Overlap in Distributed Deep Learning Training Platforms
      • CoSA: Scheduling by Constrained Optimization for Spatial Accelerators
      • η-LSTM: Co-Designing Highly-Efficient Large LSTM Training via Exploiting Memory-Saving and Architectural Design Opportunities
      • NN-Baton: DNN Workload Orchestration and Chiplet Granularity Exploration for Multichip Accelerators
      • SNAFU: An Ultra-Low-Power, Energy-Minimal CGRA-Generation Framework and Architecture
      • SARA: Scaling a Reconfigurable Dataflow Accelerator
      • HASCO: Towards Agile HArdware and Software CO-design for Tensor Computation
    • Compiler (4)
      • Taming the Zoo: The Unified GraphIt Compiler Framework for Novel Architectures
      • Supporting Legacy Libraries on Non-Volatile Memory: A User-Transparent Approach
      • Execution Dependence Extension (EDE): ISA Support for Eliminating Fences
      • Hetero-ViTAL: A Virtualization Stack for Heterogeneous FPGA Clusters
    • Graph Processing (4)
      • FlexMiner: A Pattern-Aware Accelerator for Graph Pattern Mining
      • PolyGraph: Exposing the Value of Flexibility for Graph Processing Accelerators
      • Large-Scale Graph Processing on FPGAs with Caches for Thousands of Simultaneous Misses
    • Low Temperature (4)
      • Cost-Efficient Overclocking in Immersion-Cooled Datacenters
      • CryoGuard: A Near Refresh-Free Robust DRAM Design for Cryogenic Computing
      • Superconducting Computing with Alternating Logic Elements
      • Failure Sentinels: Ubiquitous Just-in-Time Intermittent Computation via Low-Cost Hardware Support for Voltage Monitoring
    • Network Storage Acceleration (3)
      • NASGuard: A Novel Accelerator Architecture for Robust Neural Architecture Search (NAS) Networks
      • NASA: Accelerating Neural Network Design with a NAS Processor
      • PMNet: In-Network Data Persistence
    • Quantum / Photonics (4)
      • Exploiting Long Distance Interactions and Tolerating Atom Loss in Neutral Atom Quantum Architectures
      • Software-Hardware Co-Optimization for Computational Chemistry on Superconducting Quantum Processors
      • Designing Calibration and Expressivity-Efficient Instruction Sets for Quantum Computing
      • Albireo: Energy-Efficient Acceleration of Convolutional Neural Networks via Silicon Photonics
    • Reliability & Security (4)
      • IntroSpectre: A Pre-Silicon Framework for Discovery and Analysis of Transient Execution Vulnerabilities
      • Maya: Using Formal Control to Obfuscate Power Side Channels
      • Demystifying the System Vulnerability Stack: Transient Fault Effects Across the Layers
      • No-FAT: Architectural Support for Low Overhead Memory Safety Checks
    • DRAM / IO / Network (3)
      • Ghost Routing to Enable Oblivious Computation on Memory-Centric Networks
      • QUAC-TRNG: High-Throughput True Random Number Generation Using Quadruple Row Activation in Commodity DRAM Chips
      • A RISC-V In-Network Accelerator for Flexible High-Performance Low-Power Packet Processing
    • Sparse Processing (7)
      • Leaky Buddies: Cross-Component Covert Channels on Integrated CPU-GPU Systems
      • IChannels: Exploiting Current Management Mechanisms to Create Covert Channels in Modern Processors
      • ZeRØ: Zero-Overhead Resilient Operation Under Pointer Integrity Attacks
      • SpZip: Architectural Support for Effective Data Compression In Irregular Applications
      • Dual-Side Sparse Tensor Core
      • RingCNN: Exploiting Algebraically-Sparse Ring Tensors for Energy-Efficient CNN-Based Computational Imaging
      • GoSPA: An Energy-Efficient High-Performance Globally Optimized SParse Convolutional Neural Network Accelerator

Observing Atomic Instruction Behavior in Rocket-Chip (1. RTL Simulation with Chipyard)

For now, running the testbench in Chipyard as follows should be enough. The VCD output is written to rv32ua-v-amoadd_d.vcd.

./simulator-chipyard-RocketConfig-debug +verbose -v rv32ua-v-amoadd_d.vcd ${RISCV}/riscv64-unknown-elf/share/riscv-tests/isa/rv64ua-p-amoadd_d 2>&1 | spike-dasm | tee rv64ua-p-amoadd_d.rocket.log

Looking at the log, amoadd.d is executed twice. Wait, only twice?

C0:        410 [1] pc=[0000000080000180] W[r10=ffffffff80000000][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[80000537] lui     a0, 0x80000
C0:        411 [1] pc=[0000000080000184] W[r11=fffffffffffff800][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[80000593] li      a1, -2048
C0:        412 [1] pc=[0000000080000188] W[r13=0000000080002188][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[00002697] auipc   a3, 0x2
C0:        413 [1] pc=[000000008000018c] W[r13=0000000080002000][1] R[r13=0000000080002188] R[r 0=0000000000000000] inst=[e7868693] addi    a3, a3, -392
C0:        430 [1] pc=[0000000080000190] W[r 0=0000000000000000][0] R[r13=0000000080002000] R[r10=ffffffff80000000] inst=[00a6b023] sd      a0, 0(a3)
C0:        436 [1] pc=[0000000080000194] W[r14=ffffffff80000000][1] R[r13=0000000080002000] R[r11=fffffffffffff800] inst=[00b6b72f] amoadd.d a4, a1, (a3)
C0:        437 [1] pc=[0000000080000198] W[r 7=ffffffff80000000][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[800003b7] lui     t2, 0x80000
C0:        438 [1] pc=[000000008000019c] W[r 3=0000000000000002][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[00200193] li      gp, 2
C0:        439 [1] pc=[00000000800001a0] W[r 0=0000000000000000][0] R[r14=ffffffff80000000] R[r 7=ffffffff80000000] inst=[04771863] bne     a4, t2, pc + 80
C0:        440 [1] pc=[00000000800001a4] W[r15=ffffffff7ffff800][1] R[r13=0000000080002000] R[r 0=0000000000000000] inst=[0006b783] ld      a5, 0(a3)
C0:        441 [1] pc=[00000000800001a8] W[r 7=ffffffffffffffff][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[fff0039b] addiw   t2, zero, -1
C0:        442 [1] pc=[00000000800001ac] W[r 7=ffffffff80000000][1] R[r 7=ffffffffffffffff] R[r 0=0000000000000000] inst=[01f39393] slli    t2, t2, 31
C0:        443 [1] pc=[00000000800001b0] W[r 7=ffffffff7ffff800][1] R[r 7=ffffffff80000000] R[r 0=0000000000000000] inst=[80038393] addi    t2, t2, -2048
C0:        444 [1] pc=[00000000800001b4] W[r 3=0000000000000003][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[00300193] li      gp, 3
C0:        445 [1] pc=[00000000800001b8] W[r 0=0000000000000000][0] R[r15=ffffffff7ffff800] R[r 7=ffffffff7ffff800] inst=[02779c63] bne     a5, t2, pc + 56
C0:        446 [1] pc=[00000000800001bc] W[r14=ffffffff7ffff800][1] R[r13=0000000080002000] R[r11=fffffffffffff800] inst=[00b6b72f] amoadd.d a4, a1, (a3)
C0:        468 [1] pc=[00000000800001c0] W[r 7=ffffffffffffffff][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[fff0039b] addiw   t2, zero, -1
C0:        469 [1] pc=[00000000800001c4] W[r 7=ffffffff80000000][1] R[r 7=ffffffffffffffff] R[r 0=0000000000000000] inst=[01f39393] slli    t2, t2, 31
C0:        470 [1] pc=[00000000800001c8] W[r 7=ffffffff7ffff800][1] R[r 7=ffffffff80000000] R[r 0=0000000000000000] inst=[80038393] addi    t2, t2, -2048
C0:        471 [1] pc=[00000000800001cc] W[r 3=0000000000000004][1] R[r 0=0000000000000000] R[r 0=0000000000000000] inst=[00400193] li      gp, 4

This is probably the relevant part. I want to examine the waveform in detail.
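
As a sanity check on the register values in the log, here is a minimal model of the amoadd.d semantics (per the RISC-V A extension; memory is just a Python dict, with the address and values taken from the trace above):

# amoadd.d rd, rs2, (rs1): atomically read the old doubleword, write old + rs2,
# and return the old value in rd. Checked against the trace values above.
MASK64 = (1 << 64) - 1

def amoadd_d(mem, addr, rs2):
    old = mem[addr]
    mem[addr] = (old + rs2) & MASK64
    return old

mem = {0x80002000: 0xffffffff80000000}         # after 'sd a0, 0(a3)'
a1 = 0xfffffffffffff800                        # li a1, -2048 (sign-extended)

a4 = amoadd_d(mem, 0x80002000, a1)             # first amoadd.d
assert a4 == 0xffffffff80000000                # matches W[r14=ffffffff80000000]
assert mem[0x80002000] == 0xffffffff7ffff800   # matches the following ld into a5

a4 = amoadd_d(mem, 0x80002000, a1)             # second amoadd.d
assert a4 == 0xffffffff7ffff800                # matches W[r14=ffffffff7ffff800]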
