简介

spinal.lib.misc.pipeline提供了一套流水线API。相对于手动流水线它的主要优点是:

  • 您不必预先准备好整个流水系统中所需的所有信号元素。您可以根据设计需要,以更特别的方式创建和使用可分级的信号,而无需重构所有中间阶段来处理信号

  • 流水线的信号可以利用SpinalHDL的强大参数化能力,并且如果设计构建中不需要特定的参数化特征,则可以进行优化/移除,而不需要以显著的方式修改流水系统设计或项目代码库。

  • 手动时序调整要容易得多,因为您不必手动处理寄存器/仲裁器

  • 它会自行管理仲裁器

API由4个主要部分组成:

  • Node:表示管道中的层

  • Link:允许节点相互连接

  • Builder:生成整个管道所需的硬件

  • Payload:用于获取流水线的节点上的硬件信号

重要的是,Payload不是硬件数据/信号实例,而是用于检索流水线在节点中数据/信号的关键,并且流水线构建器随后将在节点之间的每次给定Payload出现时自动互连/流水线。

以下是一个用于阐述的例子:

../../../_images/intro_pip.png

以下是关于此API的视频:

简单示例

下面是一个简单的例子,它只使用了基本的API:

import spinal.core._
import spinal.core.sim._
import spinal.lib._
import spinal.lib.misc.pipeline._

class TopLevel extends Component {
  val io = new Bundle{
    val up = slave Stream (UInt(16 bits))
    val down = master Stream (UInt(16 bits))
  }

  // Let's define 3 Nodes for our pipeline
  val n0, n1, n2 = Node()

  // Let's connect those nodes by using simples registers
  val s01 = StageLink(n0, n1)
  val s12 = StageLink(n1, n2)

  // Let's define a few Payload things that can go through the pipeline
  val VALUE = Payload(UInt(16 bits))
  val RESULT = Payload(UInt(16 bits))

  // Let's bind io.up to n0
  io.up.ready := n0.ready
  n0.valid := io.up.valid
  n0(VALUE) := io.up.payload

  // Let's do some processing on n1
  n1(RESULT) := n1(VALUE) + 0x1200

  // Let's bind n2 to io.down
  n2.ready := io.down.ready
  io.down.valid := n2.valid
  io.down.payload := n2(RESULT)

  // Let's ask the builder to generate all the required hardware
  Builder(s01, s12)
}

这将产生以下硬件:

../../../_images/simple_pip.png

下面是一个仿真波形:

下面是相同的示例,但使用了更多的API:

import spinal.core._
import spinal.core.sim._
import spinal.lib._
import spinal.lib.misc.pipeline._

class TopLevel extends Component {
  val VALUE = Payload(UInt(16 bits))

  val io = new Bundle{
    val up = slave Stream(VALUE)  //VALUE can also be used as a HardType
    val down = master Stream(VALUE)
  }

  // NodesBuilder will be used to register all the nodes created, connect them via stages and generate the hardware
  val builder = new NodesBuilder()

  // Let's define a Node which connect from io.up
  val n0 = new builder.Node{
    arbitrateFrom(io.up)
    VALUE := io.up.payload
  }

  // Let's define a Node which do some processing
  val n1 = new builder.Node{
    val RESULT = insert(VALUE + 0x1200)
  }

  //  Let's define a Node which connect to io.down
  val n2 = new builder.Node {
    arbitrateTo(io.down)
    io.down.payload := n1.RESULT
  }

  // Let's connect those nodes by using registers stages and generate the related hardware
  builder.genStagedPipeline()
}

Payload

Payload对象用于引用可以通过流水线的数据。从技术上讲,Payload是一个HardType,它有一个名字,并被用作在流水线某个级中检索信号的“键”。

val PC = Payload(UInt(32 bits))
val PC_PLUS_4 = Payload(UInt(32 bits))

val n0, n1 = Node()
val s01 = StageLink(n0, n1)

n0(PC) := 0x42
n1(PC_PLUS_4) := n1(PC) + 4

请注意,我习惯于使用大写对Payload实例命名。这是为了让它非常明确,这不是一个硬件信号,更像是一个“键/类型”访问的东西。

Node

Node主要托管有效/就绪仲裁信号,以及所有通过它的硬件信号所需的Payload。

您可以通过以下方式访问其仲裁器:

API

访问

描述

node.valid

RW

指定节点上是否存在事务的信号。它是由上游逻辑驱动的。一旦置为1,则它必须且仅能在valid和ready同时置位或node.cancel为高的周期后解除置位。valid不依赖于ready。

node.ready

RW

指定节点的事务是否可以向下游进行的信号。它是由下游驱动以创建反压。当没有事务(node.valid被置0)时,该信号无意义

node.cancel

RW

指定节点的事务是否正在从流水线中取消的信号。它由下游驱动。当没有事务时(node.valid被置0),该信号没有意义

node.isValid

RO

node.valid的只读访问器

node.isReady

RO

node.ready的只读访问器

node.isCancel

RO

node.cancel的只读访问器

node.isFiring

RO

当节点事务成功继续进行时为True(valid && ready && !cancel)。用于提交状态更改。

node.isMoving

RO

当节点事务将不再存在于节点上时(从下一周期开始)为True,要么是因为下游准备好接收事务,要么是因为事务已从流水线中取消。(valid && (ready || cancel))用于“复位”(reset)状态。

node.isCanceling

RO

当节点事务正在被取消时为True。这意味着在将来的周期中它不会出现在流水线中的任何地方。

请注意,node.valid/node.ready信号遵循与 Stream 中相同的规范。

Node的控制信号(valid/ready/cancel)和状态信号(isValid、isReady、isCancel、isFiring等)是按需创建的。因此,例如,您可以通过永远不引用ready信号来创建没有反压的流水线。这就是在想要读取某物的状态时使用状态信号,仅在想要驱动某物时使用控制信号的重要性所在。

以下是节点上可能出现的仲裁情况列表。valid/ready/cancel定义了我们所处的状态,而isFiring/isMoving是这些状态的结果:

valid

ready

cancel

描述

isFiring

isMoving

0

X

X

无事务

0

0

1

1

0

正在进行

1

1

1

0

0

阻塞

0

0

1

X

1

取消

0

1

请注意,如果您想要建模诸如CPU级可能的阻塞和刷新的情况,可以查看 CtrlLink,因为它提供了执行此类操作的 API。

您可以通过以下方式访问由Payload引用的信号:

API

描述

node(Payload)

返回对应的硬件信号

node(Payload, Any)

与上述相同,但包括一个用作“次要键”的第二个参数。这有助于构建多通道硬件。例如,当您有一个多发射CPU流水线时,您可以使用通道Int id作为次要键

node.insert(Data)

返回一个新的Payload实例,该实例连接到给定的Data硬件信号

val n0, n1 = Node()

val PC = Payload(UInt(32 bits))
n0(PC) := 0x42
n0(PC, "true") := 0x42
n0(PC, 0x666) := 0xEE
val SOMETHING = n0.insert(myHardwareSignal) //This create a new Payload
when(n1(SOMETHING) === 0xFFAA){ ... }

虽然您可以手动驱动/读取流水线的第一个/最后一级的仲裁/数据,但有一些实用工具可以连接其边界。

API

描述

node.arbitrateFrom(Stream[T]])

由反压流驱动节点仲裁。

node.arbitrateFrom(Flow[T]])

由数据流驱动节点仲裁。

node.arbitrateTo(Stream[T]])

由节点驱动反压流仲裁。

node.arbitrateTo(Flow[T]])

由节点驱动数据流仲裁。

node.driveFrom(Stream[T]])((Node, T) => Unit)

由反压流驱动节点。提供的lambda函数可以用于连接数据

node.driveFrom(Flow[T]])((Node, T) => Unit)

与上述类似,但适用于Flow

node.driveTo(Stream[T]])((T, Node) => Unit)

由节点驱动反压流。提供的lambda函数可以用于连接数据

node.driveTo(Flow[T]])((T, Node) => Unit)

与上述类似,但适用于Flow

val n0, n1, n2 = Node()

val IN = Payload(UInt(16 bits))
val OUT = Payload(UInt(16 bits))

n1(OUT) := n1(IN) + 0x42

// Define the input / output stream that will be later connected to the pipeline
val up = slave Stream(UInt(16 bits))
val down = master Stream(UInt(16 bits)) //Note master Stream(OUT) is good aswell

n0.driveFrom(up)((self, payload) => self(IN) := payload)
n2.driveTo(down)((payload, self) => payload := self(OUT))

为了减少冗长,在Payload与其数据表示之间有一组隐式转换,可在Node下使用:

val VALUE = Payload(UInt(16 bits))
val n1 = new Node{
    val PLUS_ONE = insert(VALUE + 1) // VALUE is implicitly converted into its n1(VALUE) representation
}

您还可以通过导入它们来使用这些隐式转换:

val VALUE = Payload(UInt(16 bits))
val n1 = Node()

val n1Stuff = new Area {
    import n1._
    val PLUS_ONE = insert(VALUE) + 1 // Equivalent to n1.insert(n1(VALUE)) + 1
}

还有一个API,它允许你创建新的Area,这个Area提供了给定节点实例的全部API(包括隐式转换),而无需导入:

val n1 = Node()
val VALUE = Payload(UInt(16 bits))

val n1Stuff = new n1.Area{
    val PLUS_ONE = insert(VALUE) + 1 // Equivalent to n1.insert(n1(VALUE)) + 1
}

当硬件具有可参数化的流水线位置时,这样的功能非常有用(请参阅重定时示例)。

Builder

要生成流水线硬件,您需要提供流水线中使用的所有链接列表。

// Let's define 3 Nodes for our pipeline
val n0, n1, n2 = Node()

// Let's connect those nodes by using simples registers
val s01 = StageLink(n0, n1)
val s12 = StageLink(n1, n2)

// Let's ask the builder to generate all the required hardware
Builder(s01, s12)

此外,还有一套 “一体化 “的构建工具,您可以利用它来帮助你自己。

例如,有一个 NodesBuilder 类,可用于创建按顺序分级的流水线:

val builder = new NodesBuilder()

// Let's define a few nodes
val n0, n1, n2 = new builder.Node

// Let's connect those nodes by using registers and generate the related hardware
builder.genStagedPipeline()

组合能力(Composability)

该API的一个优点是,它可以轻松地将多个并行事物组成一个流水线。这里的 “组成 “是指有时你设计的流水线需要进行并行处理。

试想一下,如果您需要对 4 对数字进行浮点乘法运算(稍后求和)。并且这 4 对数字是由一个数据流同时提供的,那么就不需要 4 条不同的流水线来进行乘法运算,而需要在同一条流水线上并行处理。

下面的示例展示了一种模式,它将多个通道组成一个流水线,来并行处理它们。

// This area allows to take a input value and do +1 +1 +1 over 3 stages.
// I know that's useless, but let's pretend that instead it does a multiplication between two numbers over 3 stages (for FMax reasons)
class Plus3(INPUT: Payload[UInt], stage1: Node, stage2: Node, stage3: Node) extends Area {
  val ONE = stage1.insert(stage1(INPUT) + 1)
  val TWO = stage2.insert(stage2(ONE) + 1)
  val THREE = stage3.insert(stage3(TWO) + 1)
}

// Let's define a component which takes a stream as input,
// which carries 'lanesCount' values that we want to process in parallel
// and put the result on an output stream
class TopLevel(lanesCount : Int) extends Component {
  val io = new Bundle{
    val up = slave Stream(Vec.fill(lanesCount)(UInt(16 bits)))
    val down = master Stream(Vec.fill(lanesCount)(UInt(16 bits)))
  }

  // Let's define 3 Nodes for our pipeline
  val n0, n1, n2 = Node()

  // Let's connect those nodes by using simples registers
  val s01 = StageLink(n0, n1)
  val s12 = StageLink(n1, n2)

  // Let's bind io.up to n0
  n0.arbitrateFrom(io.up)
  val LANES_INPUT = io.up.payload.map(n0.insert(_))

  // Let's use our "reusable" Plus3 area to generate each processing lane
  val lanes = for(i <- 0 until lanesCount) yield new Plus3(LANES_INPUT(i), n0, n1, n2)

  // Let's bind n2 to io.down
  n2.arbitrateTo(io.down)
  for(i <- 0 until lanesCount) io.down.payload(i) := n2(lanes(i).THREE)

  // Let's ask the builder to generate all the required hardware
  Builder(s01, s12)
}

这将产生以下数据路径(假设 lanesCount = 2),仲裁器没有显示:

../../../_images/composable_lanes.png

重定时/可变长度

有时,你想设计一个流水线,但你并不真正知道关键路径在哪里,也不知道各阶段之间如何平衡。而且通常情况下,你无法依赖综合工具做好自动重定时工作。

因此,你需要一种简单的方法来构建流水线逻辑。

下面介绍如何使用此流水线 API:

// Define a component which will take a input stream of RGB value
// Process (~(R + G + B)) * 0xEE
// And provide that result into an output stream
class RgbToSomething(addAt : Int,
                     invAt : Int,
                     mulAt : Int,
                     resultAt : Int) extends Component {

  val io = new Bundle {
    val up = slave Stream(spinal.lib.graphic.Rgb(8, 8, 8))
    val down = master Stream (UInt(16 bits))
  }

  // Let's define the Nodes for our pipeline
  val nodes = Array.fill(resultAt+1)(Node())

  // Let's specify which node will be used for what part of the pipeline
  val insertNode = nodes(0)
  val addNode = nodes(addAt)
  val invNode = nodes(invAt)
  val mulNode = nodes(mulAt)
  val resultNode = nodes(resultAt)

  // Define the hardware which will feed the io.up stream into the pipeline
  val inserter = new insertNode.Area {
    arbitrateFrom(io.up)
    val RGB = insert(io.up.payload)
  }

  // sum the r g b values of the color
  val adder = new addNode.Area {
    val SUM = insert(inserter.RGB.r + inserter.RGB.g + inserter.RGB.b)
  }

  // flip all the bit of the RGB sum
  val inverter = new invNode.Area {
    val INV = insert(~adder.SUM)
  }

  // multiply the inverted bits with 0xEE
  val multiplier = new mulNode.Area {
    val MUL = insert(inverter.INV*0xEE)
  }

  // Connect the end of the pipeline to the io.down stream
  val resulter = new resultNode.Area {
    arbitrateTo(io.down)
    io.down.payload := multiplier.MUL
  }

  // Let's connect those nodes sequencialy by using simples registers
  val links = for (i <- 0 to resultAt - 1) yield StageLink(nodes(i), nodes(i + 1))

  // Let's ask the builder to generate all the required hardware
  Builder(links)
}

如果像这样生成该组件:

SpinalVerilog(
  new RgbToSomething(
    addAt    = 0,
    invAt    = 1,
    mulAt    = 2,
    resultAt = 3
  )
)

您将获得由 3 层寄存器(flip flop)分隔的 4 个处理阶段:

../../../_images/rgbToSomething.png

请注意,生成的硬件 verilog 还算干净(至少按我的标准来说是这样 :P):

// Generator : SpinalHDL dev    git head : 1259510dd72697a4f2c388ad22b269d4d2600df7
// Component : RgbToSomething
// Git hash  : 63da021a1cd082d22124888dd6c1e5017d4a37b2

`timescale 1ns/1ps

module RgbToSomething (
  input  wire          io_up_valid,
  output wire          io_up_ready,
  input  wire [7:0]    io_up_payload_r,
  input  wire [7:0]    io_up_payload_g,
  input  wire [7:0]    io_up_payload_b,
  output wire          io_down_valid,
  input  wire          io_down_ready,
  output wire [15:0]   io_down_payload,
  input  wire          clk,
  input  wire          reset
);

  wire       [7:0]    _zz_nodes_0_adder_SUM;
  reg        [15:0]   nodes_3_multiplier_MUL;
  wire       [15:0]   nodes_2_multiplier_MUL;
  reg        [7:0]    nodes_2_inverter_INV;
  wire       [7:0]    nodes_1_inverter_INV;
  reg        [7:0]    nodes_1_adder_SUM;
  wire       [7:0]    nodes_0_adder_SUM;
  wire       [7:0]    nodes_0_inserter_RGB_r;
  wire       [7:0]    nodes_0_inserter_RGB_g;
  wire       [7:0]    nodes_0_inserter_RGB_b;
  wire                nodes_0_valid;
  reg                 nodes_0_ready;
  reg                 nodes_1_valid;
  reg                 nodes_1_ready;
  reg                 nodes_2_valid;
  reg                 nodes_2_ready;
  reg                 nodes_3_valid;
  wire                nodes_3_ready;
  wire                when_StageLink_l56;
  wire                when_StageLink_l56_1;
  wire                when_StageLink_l56_2;

  assign _zz_nodes_0_adder_SUM = (nodes_0_inserter_RGB_r + nodes_0_inserter_RGB_g);
  assign nodes_0_valid = io_up_valid;
  assign io_up_ready = nodes_0_ready;
  assign nodes_0_inserter_RGB_r = io_up_payload_r;
  assign nodes_0_inserter_RGB_g = io_up_payload_g;
  assign nodes_0_inserter_RGB_b = io_up_payload_b;
  assign nodes_0_adder_SUM = (_zz_nodes_0_adder_SUM + nodes_0_inserter_RGB_b);
  assign nodes_1_inverter_INV = (~ nodes_1_adder_SUM);
  assign nodes_2_multiplier_MUL = (nodes_2_inverter_INV * 8'hee);
  assign io_down_valid = nodes_3_valid;
  assign nodes_3_ready = io_down_ready;
  assign io_down_payload = nodes_3_multiplier_MUL;
  always @(*) begin
    nodes_0_ready = nodes_1_ready;
    if(when_StageLink_l56) begin
      nodes_0_ready = 1'b1;
    end
  end

  assign when_StageLink_l56 = (! nodes_1_valid);
  always @(*) begin
    nodes_1_ready = nodes_2_ready;
    if(when_StageLink_l56_1) begin
      nodes_1_ready = 1'b1;
    end
  end

  assign when_StageLink_l56_1 = (! nodes_2_valid);
  always @(*) begin
    nodes_2_ready = nodes_3_ready;
    if(when_StageLink_l56_2) begin
      nodes_2_ready = 1'b1;
    end
  end

  assign when_StageLink_l56_2 = (! nodes_3_valid);
  always @(posedge clk or posedge reset) begin
    if(reset) begin
      nodes_1_valid <= 1'b0;
      nodes_2_valid <= 1'b0;
      nodes_3_valid <= 1'b0;
    end else begin
      if(nodes_0_ready) begin
        nodes_1_valid <= nodes_0_valid;
      end
      if(nodes_1_ready) begin
        nodes_2_valid <= nodes_1_valid;
      end
      if(nodes_2_ready) begin
        nodes_3_valid <= nodes_2_valid;
      end
    end
  end

  always @(posedge clk) begin
    if(nodes_0_ready) begin
      nodes_1_adder_SUM <= nodes_0_adder_SUM;
    end
    if(nodes_1_ready) begin
      nodes_2_inverter_INV <= nodes_1_inverter_INV;
    end
    if(nodes_2_ready) begin
      nodes_3_multiplier_MUL <= nodes_2_multiplier_MUL;
    end
  end


endmodule

此外,您还可以轻松调整处理的级数和位置,例如,您可能希望将翻转的硬件逻辑移到与加法器相同级上。具体方法如下:

SpinalVerilog(
  new RgbToSomething(
    addAt    = 0,
    invAt    = 0,
    mulAt    = 1,
    resultAt = 2
  )
)

那么您可能需要移除输出寄存器级:

SpinalVerilog(
  new RgbToSomething(
    addAt    = 0,
    invAt    = 0,
    mulAt    = 1,
    resultAt = 1
  )
)

这个示例的一个特点是,中间值必须是 addNode。例如:

val addNode = nodes(addAt)
// sum the r g b values of the color
val adder = new addNode.Area {
  ...
}

遗憾的是,scala 不允许用 new nodes(addAt).Area 替换 new addNode.Area。一种变通方法是将其定义为一个类,比如:

class NodeArea(at : Int) extends NodeMirror(nodes(at))
val adder = new NodeArea(addAt) {
    ...
}

根据您的管道规模,它可以带来一些好处。

简单的CPU示例

下面是一个简单的 8 位 CPU 示例:

  • 三级流水线(fetch, decode, execute)

  • 嵌入的获取存储器

  • add / jump / led /delay 指令

class Cpu extends Component {
  val fetch, decode, execute = CtrlLink()
  val f2d = StageLink(fetch.down, decode.up)
  val d2e = StageLink(decode.down, execute.up)

  val PC = Payload(UInt(8 bits))
  val INSTRUCTION = Payload(Bits(16 bits))

  val led = out(Reg(Bits(8 bits))) init(0)

  val fetcher = new fetch.Area{
    val pcReg = Reg(PC) init (0)
    up(PC) := pcReg
    up.valid := True
    when(up.isFiring) {
      pcReg := PC + 1
    }

    val mem = Mem.fill(256)(INSTRUCTION).simPublic
    INSTRUCTION := mem.readAsync(PC)
  }

  val decoder = new decode.Area{
    val opcode = INSTRUCTION(7 downto 0)
    val IS_ADD   = insert(opcode === 0x1)
    val IS_JUMP  = insert(opcode === 0x2)
    val IS_LED   = insert(opcode === 0x3)
    val IS_DELAY = insert(opcode === 0x4)
  }


  val alu = new execute.Area{
    val regfile = Reg(UInt(8 bits)) init(0)

    val flush = False
    for (stage <- List(fetch, decode)) {
      stage.throwWhen(flush, usingReady = true)
    }

    val delayCounter = Reg(UInt(8 bits)) init (0)

    when(isValid) {
      when(decoder.IS_ADD) {
        regfile := regfile + U(INSTRUCTION(15 downto 8))
      }
      when(decoder.IS_JUMP) {
        flush := True
        fetcher.pcReg := U(INSTRUCTION(15 downto 8))
      }
      when(decoder.IS_LED) {
        led := B(regfile)
      }
      when(decoder.IS_DELAY) {
        delayCounter := delayCounter + 1
        when(delayCounter === U(INSTRUCTION(15 downto 8))) {
          delayCounter := 0
        } otherwise {
          execute.haltIt()
        }
      }
    }
  }

  Builder(fetch, decode, execute, f2d, d2e)
}

下面是一个简单的测试平台,它实现了一个循环,使 led 计数值上升。

SimConfig.withFstWave.compile(new Cpu).doSim(seed = 2){ dut =>
  def nop() = BigInt(0)
  def add(value: Int) = BigInt(1 | (value << 8))
  def jump(target: Int) = BigInt(2 | (target << 8))
  def led() = BigInt(3)
  def delay(cycles: Int) = BigInt(4 | (cycles << 8))
  val mem = dut.fetcher.mem
  mem.setBigInt(0, nop())
  mem.setBigInt(1, nop())
  mem.setBigInt(2, add(0x1))
  mem.setBigInt(3, led())
  mem.setBigInt(4, delay(16))
  mem.setBigInt(5, jump(0x2))

  dut.clockDomain.forkStimulus(10)
  dut.clockDomain.waitSampling(100)
}