.. role:: raw-html-m2r(raw)
   :format: html


====================
Performance and Area
====================

RV32
=========================

A few things to keep in mind :

- You can trade FMax, IPC and area against each other
- There are configurations with better IPC, or better FMax, or lower area, but not all at once

For the following configuration :

- RV32IMASU, dual issue, OoO, Linux compatible
- 64-bit fetch, 2 decode, 3 issue, 2 retire
- Shared issue queue with 32 entries
- Renaming with 64 physical registers
- 3 execution units (2\*Int/Shift/branch, 1\*load/store/mul/div/csr/env)
- LSU with 16 load queue entries, 16 store queue entries
- Load hit predictor (3 cycles load-to-use delay)
- Store to load bypass
- I$ 16KB/4W, D$ 16KB/4W 2 refill 2 writeback slots
- MMU with ITLB 6 way/192 entries, DTLB 6 way/192 entries
- BTB 1 way/512 entries, GSHARE 1 way/4KB, RAS 32 entries

Performance :

- Dhrystone : 2.93 DMIPS/MHz 1.65 IPC (-O3 -fno-common -fno-inline, 318 instructions per iteration)
- Coremark : 5.02 Coremark/MHz 1.28 IPC (-O3 and many more flags)
- Embench-iot : 1.67 baseline 1.42 IPC (-O2 -mcmodel=medany -ffunction-sections)

On Artix 7 speed grade 3 :

- 13.3 KLUT, 10.3 KFF, 12 BRAM, 4 DSP
- 155 MHz

Reducing the number of integer ALUs to a single one and moving the branch unit to the shared pipeline gives :

Performance :

- Dhrystone : 2.71 DMIPS/MHz (-O3 -fno-common -fno-inline)
- Coremark : 4.44 Coremark/MHz (-O3 and many more flags)
- Embench-iot : 1.46 baseline (-O2 -mcmodel=medany -ffunction-sections)

On Artix 7 speed grade 3 :

- 12.1 KLUT, 9.9 KFF, 12 BRAM, 4 DSP
- 148 MHz

To go further, increasing the GSHARE storage or implementing something like TAGE should help.

Here is a pipeline representation of the two configurations above :

.. image:: /asset/image/pipeline_simple.png

Also note that the NaxRiscv simulator supports gem5 / Konata logs, which allow visualising the execution flow.

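
Since the per-MHz scores and the FMax move independently, it can help to multiply them out before comparing configurations. The sketch below does this for the two configurations above, using only the figures quoted on this page. The ``Config`` class and its method names are hypothetical helpers introduced for illustration, and the result assumes the core actually runs at the reported Artix 7 FMax.

.. code:: python

    from dataclasses import dataclass


    @dataclass(frozen=True)
    class Config:
        """Figures copied from this page (hypothetical container, for illustration only)."""
        name: str
        dmips_per_mhz: float
        coremark_per_mhz: float
        fmax_mhz: float
        kluts: float

        def dmips(self) -> float:
            # Absolute Dhrystone throughput when clocked at FMax.
            return self.dmips_per_mhz * self.fmax_mhz

        def coremark(self) -> float:
            # Absolute Coremark score when clocked at FMax.
            return self.coremark_per_mhz * self.fmax_mhz

        def dmips_per_klut(self) -> float:
            # Rough area efficiency: throughput per thousand LUTs.
            return self.dmips() / self.kluts


    CONFIGS = [
        Config("2 ALU + 1 shared EU", 2.93, 5.02, 155.0, 13.3),
        Config("1 ALU + 1 shared EU", 2.71, 4.44, 148.0, 12.1),
    ]

    for cfg in CONFIGS:
        print(f"{cfg.name}: {cfg.dmips():.0f} DMIPS, "
              f"{cfg.coremark():.0f} Coremark, "
              f"{cfg.dmips_per_klut():.1f} DMIPS/KLUT")

With these numbers the dual-ALU configuration comes out at roughly 454 DMIPS against roughly 401 DMIPS for the single-ALU one, for about 1.2 KLUT of extra area.
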
Note that if you configure the core with 1 decode, 1 ALU and 1 shared EU, you get :

Performance :

- Dhrystone : 1.70 DMIPS/MHz (-O3 -fno-common -fno-inline)
- Coremark : 3.35 Coremark/MHz (-O3 and many more flags)
- Embench-iot : 1.06 baseline (-O2 -mcmodel=medany -ffunction-sections)

On Artix 7 speed grade 3 :

- 10.8 KLUT, 9.7 KFF, 12 BRAM, 4 DSP
- 155 MHz

RV64
=========================

In a configuration similar to the RV32 one above (2\*Int/Shift/Branch, 1\*load/store/mul/div/csr/env) :

Performance :

- Dhrystone : 2.94 DMIPS/MHz (-O3 -fno-common -fno-inline)
- Coremark : 4.94 Coremark/MHz (-O3, u32 as s32 and many more flags)
- Embench-iot : 1.84 baseline (-O2 -ffunction-sections)

On Artix 7 speed grade 3 :

- 17.9 KLUT, 12.5 KFF, 12 BRAM, 16 DSP
- 137 MHz

Notes
===============

Here are a few notes collected during the development :

- An out-of-order CPU without branch prediction performs really badly ^^
- Not making stores wait for their store data in the issue queue (IQ) can really help to avoid bad load speculation.
- Some tests were made with a two-cycle latency ALU (in anticipation of RV64 timing relaxation), which seems to have a "little" impact on overall performance (~15%, to be verified on more benchmarks)
- Adding more and more execution units seems to run quickly into diminishing returns

How to run the benchmark
==============================

First follow the steps in https://github.com/SpinalHDL/NaxRiscv/blob/main/src/test/cpp/naxriscv/README.md#how-to-setup-things to get a functional simulator.

Then the dhrystone and coremark benchmarks can be run manually with :

.. code:: shell

    obj_dir/VNaxRiscv --name dhrystone --output-dir output/nax/dhrystone --load-elf ../../../../ext/NaxSoftware/baremetal/dhrystone/build/rv32im/dhrystone.elf --start-symbol _start --stats-print --stats-toggle-symbol sim_time
    obj_dir/VNaxRiscv --name coremark --output-dir output/nax/coremark --load-elf ../../../../ext/NaxSoftware/baremetal/coremark/build/rv32im/coremark.elf --start-symbol _start --pass-symbol pass --stats-print-all --stats-toggle-symbol sim_time

To run embench, you have to clone https://github.com/SpinalHDL/embench-iot.git and then follow the steps defined in ``config/riscv32/boards/naxriscv_sim/README.md``.
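
The dhrystone and coremark invocations shown earlier in this section share most of their arguments, so they can be generated from a small helper when scripting runs. The following is only a sketch: ``nax_sim_command`` and ``BAREMETAL`` are hypothetical names, and the helper uses only the flags and paths that appear in the two commands on this page, without assuming anything else about the simulator's command line.

.. code:: python

    import shlex

    # Path prefix copied from the commands on this page (relative to the simulator directory).
    BAREMETAL = "../../../../ext/NaxSoftware/baremetal"


    def nax_sim_command(name, elf, pass_symbol=None, all_stats=False):
        """Build the VNaxRiscv argument list for one benchmark (hypothetical helper)."""
        cmd = [
            "obj_dir/VNaxRiscv",
            "--name", name,
            "--output-dir", f"output/nax/{name}",
            "--load-elf", elf,
            "--start-symbol", "_start",
        ]
        if pass_symbol is not None:
            # Only the coremark command on this page passes a pass symbol.
            cmd += ["--pass-symbol", pass_symbol]
        cmd += [
            "--stats-print-all" if all_stats else "--stats-print",
            "--stats-toggle-symbol", "sim_time",
        ]
        return cmd


    dhrystone = nax_sim_command(
        "dhrystone", f"{BAREMETAL}/dhrystone/build/rv32im/dhrystone.elf")
    coremark = nax_sim_command(
        "coremark", f"{BAREMETAL}/coremark/build/rv32im/coremark.elf",
        pass_symbol="pass", all_stats=True)

    # Print the commands; pass the lists to subprocess.run() to actually launch them.
    print(shlex.join(dhrystone))
    print(shlex.join(coremark))

Keeping the arguments as a list rather than a single string avoids shell quoting issues if the list is later handed to ``subprocess.run``.
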