.. role:: raw-html-m2r(raw)
   :format: html

======================================
Hardware
======================================

Litex
=========================

NaxRiscv is ported to Litex.

Digilent Nexys Video
---------------------------

Once Litex is installed, you can generate and load the Digilent Nexys Video bitstream with, for instance:

.. code:: bash

   # RV64IMAFDCSU config, enough to run Linux
   python3 -m litex_boards.targets.digilent_nexys_video --cpu-type=naxriscv --bus-standard axi-lite \
           --with-video-framebuffer --with-spi-sdcard --with-ethernet --xlen=64 \
           --scala-args='rvc=true,rvf=true,rvd=true' --build --load

Putting Debian on the SD card
------------------------------------------------------

.. code:: bash

   export SDCARD=/dev/???

   (
   echo o
   echo n
   echo p
   echo 1
   echo
   echo +500M
   echo y
   echo n
   echo p
   echo 2
   echo
   echo +7G
   echo y
   echo t
   echo 1
   echo b
   echo p
   echo w
   ) | sudo fdisk $SDCARD

   sudo mkdosfs ${SDCARD}1
   sudo mkfs -t ext2 ${SDCARD}2

You now need to download part1 and part2 from https://drive.google.com/drive/folders/1OWY_NtJYWXd3oT8A3Zujef4eJwZFP_Yh?usp=sharing and extract them to ${SDCARD}1 and ${SDCARD}2:

.. code:: bash

   # Download the images from https://drive.google.com/drive/folders/1OWY_NtJYWXd3oT8A3Zujef4eJwZFP_Yh?usp=sharing
   mkdir mnt

   sudo mount ${SDCARD}1 mnt
   sudo tar -xf part1.tar.gz -C mnt
   sudo umount mnt

   sudo mount ${SDCARD}2 mnt
   sudo tar -xf part2.tar.gz -C mnt
   sudo umount mnt

Note that the DTB was generated for the Digilent Nexys Video with:

.. code:: bash

   python3 -m litex_boards.targets.digilent_nexys_video --cpu-type=naxriscv --with-video-framebuffer \
           --with-spi-sdcard --with-ethernet --xlen=64 --scala-args='rvc=true,rvf=true,rvd=true' --build --load

Then everything should be good: you can log in with user "root", password "root". You can also connect as root via SSH.

The bottleneck of the system is by far the SPI SD card access (about 500 KB/s read speed), so things take time the first time you run them; afterwards they are much faster (Linux caches them). So, instead of ``--with-spi-sdcard``, consider using ``--with-coherent-dma --with-sdcard`` together with the driver patch described in https://github.com/SpinalHDL/NaxSoftware/tree/main/debian_litex; this allows the SoC to reach about 4 MB/s on the SD card (an example command is given at the end of this section).

The Debian chroot (part2) was generated by following https://wiki.debian.org/RISC-V#Creating_a_riscv64_chroot and https://github.com/tongchen126/Boot-Debian-On-Litex-Rocket/blob/main/README.md#step3-build-debian-rootfs. It was generated inside QEMU, using the "make sid" flow from https://github.com/esmil/riscv-linux.

You can also find the dts and the Linux .config on the Google Drive link. The .config came mostly from https://github.com/esmil/riscv-linux#kernel with a few additions, especially the Litex drivers. The kernel was https://github.com/litex-hub/linux, commit 53b46d10f9a438a29c061cac05fb250568d1d21b.

By adding packages such as xfce-desktop, chocolate-doom, openttd and visualboyadvance, you can get something like the following:

.. image:: /asset/image/debian_demo1.png

Generating everything from scratch
------------------------------------------------------

You can find some documentation about how to generate:

- the Debian rootfs
- the Linux kernel
- OpenSBI

here: https://github.com/SpinalHDL/NaxSoftware/tree/main/debian_litex

It also contains some tips and tricks for those who are not Debian / Linux experts.
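For reference, here is a sketch of the faster SD card variant mentioned above: it simply swaps ``--with-spi-sdcard`` for ``--with-coherent-dma --with-sdcard`` in the build command used earlier, and still requires the driver patch from the NaxSoftware debian_litex page (the exact flag combination has not been re-verified here):

.. code:: bash

   # Same Nexys Video build as above, but using the MMC controller with a coherent DMA
   # instead of the SPI SD card (roughly 4 MB/s instead of 500 KB/s)
   python3 -m litex_boards.targets.digilent_nexys_video --cpu-type=naxriscv --bus-standard axi-lite \
           --with-video-framebuffer --with-coherent-dma --with-sdcard --with-ethernet --xlen=64 \
           --scala-args='rvc=true,rvf=true,rvd=true' --build --load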
ASIC
=========================

While mainly focused on FPGA, NaxRiscv also integrates some ASIC-friendly implementations:

- Latch-based register file
- Automatic generation of the OpenRAM scripts
- Automatic blackboxing of the memory blocks (via SpinalHDL)
- Parametrable reset strategy (via SpinalHDL)
- An optimized multiplier

Generating verilog
---------------------

You can generate an example of an ASIC-tuned NaxRiscv using:

.. code:: bash

   cd $NAXRISCV
   sbt "runMain naxriscv.platform.asic.NaxAsicGen"
   ls nax.v

If you want to target sky130, with OpenRAM memories, you can do:

.. code:: bash

   cd $NAXRISCV
   sbt "runMain naxriscv.platform.asic.NaxAsicGen --sky130-ram --no-lsu" # --no-lsu is optional
   ls nax.v sram/*

In order to artificially reduce the register file, you can use the ``--regfile-fake-ratio=X`` argument, where X needs to be a power of two; it reduces the register file size by that ratio.

You can also generate a design without load/store unit by adding the ``--no-lsu`` argument.

If you use NaxRiscv as a toplevel, you can generate the netlist with flip-flops on the IOs via the ``--io-ff`` argument, in order to relax timings.

You can ask SpinalHDL to blackbox memories with combinatorial read using the ``--bb-comb-ram`` argument. This will also generate a comb_ram.log file which contains the list of all the blackboxes used. The layout of the blackbox is:

.. code:: verilog

   ram_${number of read ports}ar_${number of write ports}w_${words}x${width} ${name of replaced ram} (
     .clk           (clk ), //i
     .writes_0_en   (... ), //i
     .writes_0_addr (... ), //i
     .writes_0_data (... ), //i
     .writes_._en   (... ), //i
     .writes_._addr (... ), //i
     .writes_._data (... ), //i
     .reads_0_addr  (... ), //i
     .reads_0_data  (... ), //o
     .reads_._addr  (... ), //i
     .reads_._data  (... )  //o
   );

You can customize how the blackboxing is done by modifying https://github.com/SpinalHDL/NaxRiscv/blob/488c3397880b4c215022aa42f533574fe4dd366a/src/main/scala/naxriscv/compatibility/MultiportRam.scala#L488

Also, if you use ``--bb-comb-ram``, you may also consider using ``--no-rf-latch-ram``, which additionally enables the generation of the register file blackbox.

OpenRam
---------

You can use OpenRAM to generate the RAM macros described by the configuration scripts that the ``--sky130-ram`` argument produces. Here are a few dependencies to install first:

- https://github.com/VLSIDA/OpenRAM/blob/stable/docs/source/index.md#openram-dependencies

.. code:: bash

   git clone https://github.com/VLSIDA/OpenRAM.git
   cd OpenRAM
   ./install_conda.sh
   make pdk      # The first time only
   make install  # The first time only
   pip install -r requirements.txt

   mv technology/sky130/tech/tech.py technology/sky130/tech/tech.py_old
   sed '/Leakage power of 3-input nand in nW/a spice["nand4_leakage"] = 1' technology/sky130/tech/tech.py_old > technology/sky130/tech/tech.py

   cd macros
   cp -rf $NAXRISCV/sram/sky* sram_configs
   cp -rf $NAXRISCV/sram/openram.sh . && chmod +x openram.sh

   # Run the macro generation. This will take quite some time.
   ./openram.sh
   ls sky130_sram_1r1w_*
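Each ``sky130_sram_1r1w_*`` directory produced by OpenRAM should contain, among other outputs, the Verilog model (.v) and the liberty timing file (.lib) that the OpenLane flow below copies into the design. A minimal, illustrative check (assuming the default OpenRAM output layout):

.. code:: bash

   # Illustrative sanity check: every generated macro should provide at least a
   # Verilog model and a liberty timing file before moving on to the OpenLane flow.
   for d in sky130_sram_1r1w_*/ ; do
     ls "$d"*.v "$d"*.lib > /dev/null 2>&1 || echo "incomplete macro: $d"
   done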
OpenLane
----------

You can use OpenLane to generate a GDS of NaxRiscv.

Setup / how to reproduce
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can get the OpenLane docker via:

- https://openlane.readthedocs.io/en/latest/getting_started/installation/installation_common_section.html

Then:

.. code:: bash

   # Generate a NaxRiscv verilog (here without using the ram macros)
   (cd $NAXRISCV && sbt "runMain naxriscv.platform.asic.NaxAsicGen")

   git clone https://github.com/The-OpenROAD-Project/OpenLane.git
   cd OpenLane
   make mount
   make pdk # the first time only

   # Setup the design
   cp -rf $NAXRISCV/src/main/openlane/nax designs/nax
   mkdir designs/nax/src
   cp -rf $NAXRISCV/nax.v designs/nax/src/nax.v

   # You will find your design in designs/nax/runs/$TAG
   export TAG=run_1

   # This will run the whole OpenLane flow, and will take hours
   ./flow.tcl -design nax -overwrite -tag $TAG

   # Run the OpenROAD GUI to visualise the design
   python3 gui.py --viewer openroad ./designs/nax/runs/$TAG

If you want to reproduce with the ram macros, then:

- Generate the NaxRiscv verilog file with the ``--sky130-ram`` argument.
- Update designs/nax/src/nax.v.
- Generate the ram macros using OpenRAM.
- Uncomment the ram-macro related settings in the OpenLane/designs/nax/config.tcl file and copy the macro files there.

.. code:: bash

   cd $NAXRISCV
   sbt "runMain naxriscv.platform.asic.NaxAsicGen --sky130-ram"
   cp -rf $NAXRISCV/nax.v $OPENLANE/designs/nax/src/nax.v

   # Do the things described in the OpenRam chapter of this doc to generate the ram macros

   mkdir $OPENLANE/designs/nax/sram
   cp $OPENRAM/macros/sky130_sram_1r1w_*/sky130_sram_1r1w_*.* $OPENLANE/designs/nax/sram
   sed -i '1 i\/// sta-blackbox' $OPENLANE/designs/nax/sram/*.v
   sed -i 's/max_transition : 0.04/max_transition : 0.75/g' $OPENLANE/designs/nax/sram/*.lib

   # Run flow.tcl

Running simulation
^^^^^^^^^^^^^^^^^^^^

You can run a simulation which uses the NaxRiscv ASIC-specific features inside a little SoC by running:

.. code:: bash

   sbt "runMain naxriscv.platform.tilelinkdemo.SocSim --load-elf ext/NaxSoftware/baremetal/dhrystone/build/rv32ima/dhrystone.elf --no-rvls --iverilog --asic"

Using iverilog instead of Verilator ensures that the latch-based register file is functional.

Results
^^^^^^^^

Here is the result of OpenLane with the default sky130 PDK and NaxRiscv as toplevel (``--regfile-fake-ratio=8 --io-ff``), so without any memory blackbox, and with reduced I$ / D$ / branch predictor sizes as follows:

.. code:: scala

   case p: FetchCachePlugin => p.wayCount = 1; p.cacheSize = 256; p.memDataWidth = 64
   case p: DataCachePlugin  => p.wayCount = 1; p.cacheSize = 256; p.memDataWidth = 64
   case p: BtbPlugin        => p.entries = 8
   case p: GSharePlugin     => p.memBytes = 32
   case p: Lsu2Plugin       => p.hitPedictionEntries = 64

.. image:: /asset/image/asic_1.png

The maximal frequency is around ~100 MHz, with most of the critical path time budget being spent in a high-fanout net (see the Issues section). The total area used by the design cells is 1.633 mm². The density was set with FP_CORE_UTIL=40 and PL_TARGET_DENSITY=45. The main obstacles to frequency and density are described below in the Issues section.

Issues
^^^^^^^^

There are mostly two main issues:

- Sky130 + OpenROAD have severe density/congestion issues with the register file (4R2W / latch / tristate). One workaround would be https://github.com/AUCOHL/DFFRAM, but unfortunately it doesn't support configurations beyond 2R1W (see https://github.com/AUCOHL/DFFRAM/issues/192).
- There is also a macro insertion halo issue which makes the usage of OpenRAM macros impossible at the moment: the power lines get too close to the macro, not leaving enough room for the data pins to route.
See https://github.com/The-OpenROAD-Project/OpenLane/issues/2030 for more details on this halo issue.

Otherwise, the main performance issue observed seems to be the unbalanced insertion of buffers on high-fanout logic. One instance happened in the MMU TLB lookup. The TLB was set up as 6 ways of 32 entries each (a lot), meaning the virtual address was used to gather information from 7360 TLB bits (2530 muxes to drive). In this scenario, the ASIC critical path was this TLB lookup, where most of the timing budget was spent distributing the virtual address signal to those muxes. The main issue is that this was done through 13 layers of various buffer gates with a typical fanout of 10: an ideal balanced tree of that depth could reach 10^13 gates, while here only 2530 muxes have to be driven (a quick numeric check is given at the end of this section). See https://github.com/The-OpenROAD-Project/OpenLane/issues/2090 for more info. This issue may play an important role in the congestion / density / frequency results.

Here you can see the buffer chain path in pink:

.. image:: /asset/image/asic_buf_1.png
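As a back-of-the-envelope check of the claim above (taking the fanout of 10 and the 2530 mux loads at face value), a balanced buffer tree would only need

.. math::

   \left\lceil \log_{10} 2530 \right\rceil = 4

levels to drive all of those muxes, whereas 13 levels at a fanout of 10 could in principle drive up to :math:`10^{13}` loads.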