Performance and Area

RV32

A few things to keep in mind :

  • You can trade off FMax, IPC and area against one another

  • Other configurations give better IPC, or better FMax, or a smaller area, but not all three at once

For the following configuration :

  • RV32IMASU, dual issue, OoO, linux compatible

  • 64 bits fetch, 2 decode, 3 issue, 2 retire

  • Shared issue queue with 32 entries

  • Renaming with 64 physical registers

  • 3 execution units (2*Int/Shift/branch, 1*load/store/mul/div/csr/env)

  • LSU with 16 load queue, 16 store queue

  • Load hit predictor (3 cycles load to use delay)

  • Store to load bypass

  • I$ 16KB/4W, D$ 16KB/4W 2 refill 2 writeback slots

  • MMU with ITLB 6 way/192 entries, DTLB 6 way/192 entries

  • BTB 1 way/512 entries, GSHARE 1 way/4KB, RAS 32 entries

Performance :

  • Dhrystone : 2.93 DMIPS/MHz, 1.65 IPC (-O3 -fno-common -fno-inline, 318 instructions per iteration)

  • Coremark : 5.02 Coremark/MHz, 1.28 IPC (-O3 and so many more random flags)

  • Embench-iot : 1.67 baseline, 1.42 IPC (-O2 -mcmodel=medany -ffunction-sections)
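
The Dhrystone figures above can be cross-checked against each other. The following is a minimal sketch, assuming the conventional definition of 1 DMIPS as 1757 Dhrystone iterations per second (the VAX 11/780 reference, which this page does not state). The function name is hypothetical; the IPC and instruction count are taken from the Dhrystone bullet.

```python
# Cross-check: derive DMIPS/MHz from the reported IPC and instruction count.
# Assumption (not stated on this page): 1 DMIPS = 1757 Dhrystone iterations
# per second, the conventional VAX 11/780 reference.
VAX_DHRYSTONES_PER_SECOND = 1757


def dmips_per_mhz(ipc: float, instructions_per_iteration: int) -> float:
    """Estimate DMIPS/MHz from average IPC and instructions per iteration."""
    iterations_per_cycle = ipc / instructions_per_iteration
    # At 1 MHz the core executes one million cycles per second.
    iterations_per_second_at_1mhz = iterations_per_cycle * 1_000_000
    return iterations_per_second_at_1mhz / VAX_DHRYSTONES_PER_SECOND


estimate = dmips_per_mhz(ipc=1.65, instructions_per_iteration=318)
print(f"{estimate:.2f} DMIPS/MHz")
```

The estimate lands within about 1% of the reported 2.93 DMIPS/MHz. The small gap may come from rounding of the published figures or from how the measured window is delimited.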

On Artix 7 speed grade 3 :

  • 13.3 KLUT, 10.3 KFF, 12 BRAM, 4 DSP

  • 155 MHz

Reducing the number of integer ALUs to one and moving the branch unit to the shared pipeline gives :

Performance :

  • Dhrystone : 2.71 DMIPS/MHz (-O3 -fno-common -fno-inline)

  • Coremark : 4.44 Coremark/MHz (-O3 and so many more random flags)

  • Embench-iot : 1.46 baseline (-O2 -mcmodel=medany -ffunction-sections)

On Artix 7 speed grade 3 :

  • 12.1 KLUT, 9.9 KFF, 12 BRAM, 4 DSP

  • 148 MHz

To go further, increasing the GSHARE storage or implementing something like TAGE should help.

Here is a pipeline representation of the two configurations above :

../../_images/pipeline_simple.png

Also note that the NaxRiscv simulator supports gem5 / konata logs, which makes it possible to visualise the execution flow.

Note that if you configure the core with 1 decode, 1 ALU and 1 shared EU, you get :

Performance :

  • Dhrystone : 1.70 DMIPS/MHz (-O3 -fno-common -fno-inline)

  • Coremark : 3.35 Coremark/MHz (-O3 and so many more random flags)

  • Embench-iot : 1.06 baseline (-O2 -mcmodel=medany -ffunction-sections)

On Artix 7 speed grade 3 :

  • 10.8 KLUT, 9.7 KFF, 12 BRAM, 4 DSP

  • 155 MHz
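
To make the FMax / IPC / area trade-off concrete, the per-MHz scores of the three RV32 configurations above can be multiplied by their FMax and related to their LUT count. This is only a rough sketch: it assumes the core actually runs at the reported FMax, and "DMIPS per KLUT" is an illustrative metric introduced here, not one used by the project. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Figures copied from the RV32 results above (Artix 7, speed grade 3)."""
    name: str
    dmips_per_mhz: float
    coremark_per_mhz: float
    fmax_mhz: float
    kluts: float


CONFIGS = [
    Config("2 int ALU + 1 shared EU", 2.93, 5.02, 155, 13.3),
    Config("1 int ALU, branch on shared EU", 2.71, 4.44, 148, 12.1),
    Config("1 decode, 1 ALU, 1 shared EU", 1.70, 3.35, 155, 10.8),
]


def absolute_score(score_per_mhz: float, fmax_mhz: float) -> float:
    """Scale a per-MHz score to the score at FMax (assumes the core runs at FMax)."""
    return score_per_mhz * fmax_mhz


# One row per configuration: (name, DMIPS, CoreMark, DMIPS per KLUT).
rows = []
for cfg in CONFIGS:
    dmips = absolute_score(cfg.dmips_per_mhz, cfg.fmax_mhz)
    coremark = absolute_score(cfg.coremark_per_mhz, cfg.fmax_mhz)
    rows.append((cfg.name, dmips, coremark, dmips / cfg.kluts))

for name, dmips, coremark, dmips_per_klut in rows:
    print(f"{name}: {dmips:.1f} DMIPS, {coremark:.1f} CoreMark, "
          f"{dmips_per_klut:.1f} DMIPS/KLUT")
```

With these numbers, the first two configurations end up close to each other in DMIPS per KLUT, while the single-decode configuration gives up much more throughput than it saves in area.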

RV64

In a configuration similar to the RV32 one above (2*Int/Shift/Branch, 1*load/store/mul/div/csr/env) :

Performance :

  • Dhrystone : 2.94 DMIPS/MHz (-O3 -fno-common -fno-inline)

  • Coremark : 4.94 Coremark/MHz (-O3, u32 as s32 and so many more random flags)

  • Embench-iot : 1.84 baseline (-O2 -ffunction-sections)

On Artix 7 speed grade 3 :

  • 17.9 KLUT, 12.5 KFF, 12 BRAM, 16 DSP

  • 137 MHz

Notes

Here are a few notes collected during development :

  • An out-of-order CPU without branch prediction performs really badly ^^

  • Not forcing stores to wait in the IQ for their store data really helps to avoid bad load speculation.

  • Some tests were made with a two-cycle-latency ALU (in anticipation of RV64 timing relaxation). They seem to show “little” impact on overall performance (~15%, to be verified on more benchmarks)

  • Adding more and more execution units seems to run quickly into diminishing returns

How to run the benchmark

First follow the steps in https://github.com/SpinalHDL/NaxRiscv/blob/main/src/test/cpp/naxriscv/README.md#how-to-setup-things to get a functional simulator.

Then the Dhrystone and Coremark benchmarks can be run manually with :

obj_dir/VNaxRiscv --name dhrystone --output-dir output/nax/dhrystone --load-elf ../../../../ext/NaxSoftware/baremetal/dhrystone/build/rv32im/dhrystone.elf --start-symbol _start  --stats-print --stats-toggle-symbol sim_time
obj_dir/VNaxRiscv --name coremark --output-dir output/nax/coremark --load-elf ../../../../ext/NaxSoftware/baremetal/coremark/build/rv32im/coremark.elf --start-symbol _start --pass-symbol pass  --stats-print-all --stats-toggle-symbol sim_time
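
If the two commands are run often, they can be generated from a small script. The sketch below only reassembles the arguments shown above and defaults to a dry run that prints the commands. The helper names and the `arch` parameter are hypothetical, and it assumes the script is launched from the same directory as the manual commands.

```python
import shlex
import subprocess
from pathlib import PurePosixPath

# Paths and flags are copied from the two commands above.
SIMULATOR = "obj_dir/VNaxRiscv"
SOFTWARE_ROOT = PurePosixPath("../../../../ext/NaxSoftware/baremetal")

# Benchmark-specific arguments, exactly as they appear in the commands above.
EXTRA_ARGS = {
    "dhrystone": ["--stats-print"],
    "coremark": ["--pass-symbol", "pass", "--stats-print-all"],
}


def build_command(name: str, arch: str = "rv32im") -> list[str]:
    """Build the simulator command line for one benchmark.

    `arch` selects the build sub-directory; only "rv32im" appears in the
    commands above, other values are an assumption.
    """
    elf = SOFTWARE_ROOT / name / "build" / arch / f"{name}.elf"
    return [
        SIMULATOR,
        "--name", name,
        "--output-dir", f"output/nax/{name}",
        "--load-elf", str(elf),
        "--start-symbol", "_start",
        *EXTRA_ARGS[name],
        "--stats-toggle-symbol", "sim_time",
    ]


def run_all(dry_run: bool = True) -> None:
    """Print each command, and execute it unless dry_run is set."""
    for name in EXTRA_ARGS:
        command = build_command(name)
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Calling `run_all(dry_run=False)` would launch the simulator for each benchmark in turn and stop at the first non-zero exit status.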

To run embench, you have to clone https://github.com/SpinalHDL/embench-iot.git and then follow the steps defined in config/riscv32/boards/naxriscv_sim/README.md