.. role:: raw-html-m2r(raw)
   :format: html
====================
Performance and Area
====================
RV32
=========================

A few things to keep in mind:

- You can trade Fmax, IPC, and area against each other.
- No single configuration is best for IPC, Fmax, and area at once; each configuration favors one at the expense of the others.
For the following configuration:

- RV32IMASU, dual issue, out of order, Linux compatible
- 64-bit fetch, 2 decode, 3 issue, 2 retire
- Shared issue queue with 32 entries
- Renaming with 64 physical registers
- 3 execution units (2\*Int/Shift/Branch, 1\*Load/Store/Mul/Div/CSR/Env)
- LSU with a 16-entry load queue and a 16-entry store queue
- Load hit predictor (3 cycles load-to-use latency)
- Store-to-load bypass
- I$ 16 KB 4-way, D$ 16 KB 4-way, 2 refill and 2 writeback slots
- MMU with 6-way/192-entry ITLB and 6-way/192-entry DTLB
- BTB 1-way/512 entries, GSHARE 1-way/4 KB, RAS 32 entries
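
As a reference for the predictor sizing above, here is a minimal GSHARE sketch in Python. The 16384-entry table assumes the 4 KB of storage holds 2-bit saturating counters; the actual NaxRiscv hashing and counter layout may differ.

```python
# Minimal GSHARE sketch. Sizing is an assumption: 4 KB of 2-bit
# saturating counters = 16384 entries, direct-mapped (1 way).
ENTRIES = 16384
MASK = ENTRIES - 1

def gshare_index(pc: int, ghr: int) -> int:
    """Index the counter table with the branch PC XORed with the global history."""
    return ((pc >> 2) ^ ghr) & MASK

class Gshare:
    def __init__(self):
        self.table = [1] * ENTRIES  # 2-bit counters, initialized weakly not-taken
        self.ghr = 0                # global history register

    def predict(self, pc: int) -> bool:
        return self.table[gshare_index(pc, self.ghr)] >= 2

    def update(self, pc: int, taken: bool):
        i = gshare_index(pc, self.ghr)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & MASK
```
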
Performance:

- Dhrystone: 2.93 DMIPS/MHz, 1.65 IPC (-O3 -fno-common -fno-inline, 318 instructions per iteration)
- Coremark: 5.02 Coremark/MHz, 1.28 IPC (-O3 and so many more random flags)
- Embench-iot: 1.67 baseline, 1.42 IPC (-O2 -mcmodel=medany -ffunction-sections)

On Artix 7 speed grade 3:

- 13.3 KLUT, 10.3 KFF, 12 BRAM, 4 DSP
- 155 MHz
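
As a sanity check, the Dhrystone figures above are mutually consistent. Using the reported 1.65 IPC and 318 instructions per iteration, together with the standard Dhrystone baseline (1757 iterations/s on the VAX 11/780 = 1 DMIPS):

```python
ipc = 1.65                  # reported IPC
instr_per_iter = 318        # reported instructions per Dhrystone iteration

cycles_per_iter = instr_per_iter / ipc            # ~192.7 cycles per iteration
iters_per_second_at_1mhz = 1e6 / cycles_per_iter  # iterations/s at 1 MHz
dmips_per_mhz = iters_per_second_at_1mhz / 1757   # VAX 11/780 baseline

print(f"{dmips_per_mhz:.2f} DMIPS/MHz")           # ~2.95, close to the reported 2.93
```
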
Reducing the number of integer ALUs to a single one and moving the branch unit to the shared pipeline produces:

Performance:

- Dhrystone: 2.71 DMIPS/MHz (-O3 -fno-common -fno-inline)
- Coremark: 4.44 Coremark/MHz (-O3 and so many more random flags)
- Embench-iot: 1.46 baseline (-O2 -mcmodel=medany -ffunction-sections)

On Artix 7 speed grade 3:

- 12.1 KLUT, 9.9 KFF, 12 BRAM, 4 DSP
- 148 MHz
To go further, increasing the GSHARE storage or implementing something like TAGE should help.

Here is a pipeline representation of the two configurations above:

.. image:: /asset/image/pipeline_simple.png

Also note that the NaxRiscv simulator supports gem5 / Konata logs, allowing you to visualise the execution flow.
Note that if you configure the core with 1 decode, 1 ALU, and 1 shared execution unit, you get:

Performance:

- Dhrystone: 1.70 DMIPS/MHz (-O3 -fno-common -fno-inline)
- Coremark: 3.35 Coremark/MHz (-O3 and so many more random flags)
- Embench-iot: 1.06 baseline (-O2 -mcmodel=medany -ffunction-sections)

On Artix 7 speed grade 3:

- 10.8 KLUT, 9.7 KFF, 12 BRAM, 4 DSP
- 155 MHz
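
For a rough feel of the tradeoff, the three RV32 configurations above can be compared on performance per area, using the Coremark/MHz, KLUT, and Fmax numbers from this page (Coremark/MHz per KLUT is just one possible figure of merit):

```python
# (Coremark/MHz, KLUT, Fmax in MHz) for the three configurations listed above
configs = {
    "dual issue, 2 int ALUs": (5.02, 13.3, 155),
    "dual issue, 1 int ALU":  (4.44, 12.1, 148),
    "single issue":           (3.35, 10.8, 155),
}

for name, (coremark_mhz, klut, fmax) in configs.items():
    absolute = coremark_mhz * fmax  # Coremark score when running at Fmax
    per_area = coremark_mhz / klut  # Coremark/MHz per KLUT
    print(f"{name}: {absolute:.0f} Coremark at Fmax, {per_area:.3f} Coremark/MHz per KLUT")
```

The absolute scores narrow the gap between the configurations (the single-issue core keeps the higher Fmax), while the per-KLUT ratios show where each area investment pays off.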
RV64
=========================

In a configuration similar to the RV32 one above (2\*Int/Shift/Branch, 1\*Load/Store/Mul/Div/CSR/Env):

Performance:

- Dhrystone: 2.94 DMIPS/MHz (-O3 -fno-common -fno-inline)
- Coremark: 4.94 Coremark/MHz (-O3, u32 as s32, and so many more random flags)
- Embench-iot: 1.84 baseline (-O2 -ffunction-sections)

On Artix 7 speed grade 3:

- 17.9 KLUT, 12.5 KFF, 12 BRAM, 16 DSP
- 137 MHz
Notes
===============

Here are a few notes collected during development:

- An out-of-order CPU without branch prediction performs really badly ^^
- Avoiding stores having to wait for their store data in the issue queue really helps to reduce bad load speculation.
- Some tests were made with a two-cycle-latency ALU (in anticipation of RV64 timing relaxation), which seems to have a "little" impact on overall performance (~15%, needs verification on more benchmarks)
- Adding more and more execution units quickly runs into diminishing returns
How to run the benchmark
==============================

First, follow the steps in https://github.com/SpinalHDL/NaxRiscv/blob/main/src/test/cpp/naxriscv/README.md#how-to-setup-things to get a functional simulator.

Then the Dhrystone and Coremark benchmarks can be run manually with:

.. code:: shell

   obj_dir/VNaxRiscv --name dhrystone --output-dir output/nax/dhrystone --load-elf ../../../../ext/NaxSoftware/baremetal/dhrystone/build/rv32im/dhrystone.elf --start-symbol _start --stats-print --stats-toggle-symbol sim_time
   obj_dir/VNaxRiscv --name coremark --output-dir output/nax/coremark --load-elf ../../../../ext/NaxSoftware/baremetal/coremark/build/rv32im/coremark.elf --start-symbol _start --pass-symbol pass --stats-print-all --stats-toggle-symbol sim_time

To run Embench, clone https://github.com/SpinalHDL/embench-iot.git and then follow the steps defined in config/riscv32/boards/naxriscv_sim/README.md.