My FPGA design page

From PEL Wiki

Example Design

Interfaces to HyperTransport

The XD1000 example design simplifies its interface by separating traffic into

  1. Control traffic initiated by the CPU
  2. Data traffic initiated by the FPGA

They wrap the three Atlantic interfaces (see Altera HyperTransport MegaCore) that I need to use into a control bus that:

  • Only accepts 32-bit reads and writes
  • Only allows one transaction at a time (simplifies tag handling)
  • Passes on only a limited portion of the address space.

I'm going to insert a new wrapper that forwards any 32-bit control read/write to their interface and doesn't allow anything else until that one completes. That will let me keep using their existing control interface for programming without crippling my own interface, where I need to be able to read/write in larger chunks and have multiple outstanding requests.
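
The gating rule for the wrapper can be modelled in a few lines. This is only a behavioral sketch in Python, not HDL. The names `Request`, `ControlGate`, `route`, and `complete` are made up for illustration, and the way a request is recognized as control traffic (`is_control`) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Request:
    is_control: bool  # assumed: address decodes into the control range
    size_bytes: int   # payload size of the read or write


class ControlGate:
    """One 32-bit control access at a time; everything else waits for it."""

    def __init__(self):
        self.control_busy = False

    def route(self, req):
        # While a control access is outstanding, nothing else is accepted.
        if self.control_busy:
            return "stall"
        # 32-bit control accesses go to the existing example-design control bus.
        if req.is_control and req.size_bytes == 4:
            self.control_busy = True
            return "control_bus"
        # Anything else takes the new, wider path.
        return "data_path"

    def complete(self):
        # Called when the control bus reports that the access has finished.
        self.control_busy = False
```

The point of the single `control_busy` flag is that the example design's one-transaction-at-a-time restriction stays confined to control traffic.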

Transaction Types (bold ones need to be implemented)

  • NOP
  • Sync/Error
  • Sized Write (Posted and Non-Posted) (Dword and Byte)
  • Broadcast Messages
  • Sized Read (Dword and Byte)
  • Flush
  • Fence
  • Atomic RMW (Fetch and Add OR Compare and Swap)
  • Read Response
  • Target Done

Because my memory region will be prefetchable, I don't think I'll have to worry about Byte Reads and Writes. I'll implement Byte Reads anyway, because an unanswered read would hang the machine and they are easy (translate to a single DWORD read). I'll just drop Byte Writes and keep a counter of how many I drop.

I don't know if I'll see fences, flushes, or read-modify-writes. I think read-modify-writes may be important, and each one is very similar to two commands: a read and a write. Fences will take care of themselves because I'm not a bridge and they travel in the posted channel with the writes they should follow. Flushes require a response, so I may have to implement them.
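
A minimal behavioral sketch of that byte-access policy, in Python rather than HDL. The class name `ByteAccessFilter` and the `read_dword` callback are hypothetical; the alignment step is a defensive assumption.

```python
class ByteAccessFilter:
    """Byte reads become one DWORD read; byte writes are dropped and counted."""

    def __init__(self):
        self.dropped_byte_writes = 0

    def byte_read(self, addr, read_dword):
        # Always answer a read: serve it with a single aligned DWORD read.
        aligned = addr & ~0x3
        return read_dword(aligned)

    def byte_write(self, addr, data, mask):
        # Posted byte writes need no response, so dropping them cannot hang
        # the machine. The counter makes the drops visible for debugging.
        self.dropped_byte_writes += 1
```

For example, `ByteAccessFilter().byte_read(0x102, lambda a: mem[a])` would fetch the DWORD at `0x100` from a hypothetical `mem` dictionary.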

Hierarchy

The XD1000 example design has several blocks:

  • Data generators and checkers for the DDR and HT links
  • SRAM controller
  • CPLD/Flash controller
  • DDR_Write & DDR_Read that interface to Altera's DDR controller
  • HT_MSI Message Signaled Interrupt (MSI) interface
  • HT_Write & HT_Read interface to Altera's HT controller
  • HT_CTRL to decode the signals on the control bus
  • LED controller
  • A small RAM
  • Clock logic

My design can get rid of the data generators and checkers.

I need to figure out if I should add a FIFO for the data as it passes from one place to another (e.g., disk to RAM), or if I should just use the DMA engines as they are. Right now I'm leaning toward using the FIFO even though it may cost me some latency, just to keep the design simple. Otherwise I'll have to synchronize the DMA engines or combine them.

DiskRAM block

I am making a new DiskRAM block that incorporates the Atlantic interfaces to HT and the DRAM interface. I need to be able to handle variable-sized reads and writes from the HT (up to 16 dwords, i.e. 64 bytes) and convert them to DRAM accesses (32 bytes + ECC per cycle). I'll use their command generators as starting points.

I need to solve Data Hazards in DiskRAM.

DiskRAM Hierarchy

diskram

  • diskram_ht_arbiter - decide which transactions DiskRAM should respond to
  • diskram_write_buffer - Combine writes into 128-byte blocks for DRAM (2-way associative, 4K total size)
    • diskram_biased_fifo_arbiter
    • diskram_ht_split - separate line fill traffic from responses
    • ht_biased_fifo_arbiter
    • cache_ram - the actual data and tag storage
  • diskram_ram_ctrl - The brains of DiskRAM. Controls the SRAM, DRAM, and disk
    • diskram_sram_if - An interface to the 4MB SRAM (36 to 144 bit width accesses) includes the old interface for testing
    • diskram_ddr_dma - Accepts commands from ram_ctrl and directs the data up or downstream
      • diskram_ddr_if - controls enables to the Altera DRAM controller, does all reads but only 128B writes
      • bus_decoupler
      • bus_decoupler
  • diskram_disk_ctrl - wraps sata_ctrl and sata_read_ctrl
    • diskram_sata_ctrl - handles all writes to buffer memory and control memory for the SATA controller
      • xd1000_ht_write_channel - a DMA engine for writes
      • slot_finder - finds the next available slot in the SATA controller
    • diskram_sata_read_ctrl - handles reads from buffer memory and control memory for the SATA controller
      • xd1000_ht_read_channel - a DMA engine for reads

Diagrams

  • Flow chart: a simple flow chart without evictions
  • Block diagram: an Atlantic block diagram
  • Block diagram: another with ordered busses
  • Flow chart: everything causes hazards
  • Flow chart: a more complex chart (read hits are hazards)
  • Flow chart: the tracing module
  • Flow chart: a generic storage module

Assuming that I go with the scheme where read hits are hazards, here's the implementation outline:

Two hashes

  • Read in Progress (RIP) hash
  • Conflict hash (a write waiting for a RIP)

Reads can proceed with the RIP bit set, but not with the conflict bit set. Writes can't proceed with the RIP bit set; instead, they set the conflict bit.

If I use the lower 5 bits of the address (above the line width), that gives me 32nds of a page for conflicts.
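
The two hashes can be sketched as a small behavioral model. Several things here are assumptions rather than part of the design: a 128 B line (so the 5 index bits are address bits 11:7, which is what makes each bucket 1/32 of a 4 KB page), a per-bucket count of reads in progress instead of a single bit (so that two overlapping reads do not clear each other), and all of the names.

```python
LINE_BITS = 7    # assumed 128 B line, so the index starts at address bit 7
INDEX_BITS = 5   # 32 buckets


def bucket(addr):
    """Index into both hashes: the 5 address bits just above the line offset."""
    return (addr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)


class HazardTracker:
    def __init__(self):
        self.rip = [0] * (1 << INDEX_BITS)           # reads in progress
        self.conflict = [False] * (1 << INDEX_BITS)  # a write is waiting

    def try_read(self, addr):
        b = bucket(addr)
        if self.conflict[b]:
            return False          # do not pass a waiting write
        self.rip[b] += 1          # reads may overlap other reads
        return True

    def read_done(self, addr):
        self.rip[bucket(addr)] -= 1

    def try_write(self, addr):
        b = bucket(addr)
        if self.rip[b] > 0:
            self.conflict[b] = True   # blocked: mark the bucket and retry later
            return False
        self.conflict[b] = False      # the waiting write goes ahead
        return True
```

In this model a blocked write simply retries `try_write` until the outstanding reads in its bucket have drained. Setting the conflict bit keeps newer reads from starving it.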

Interesting Points

Debug

I'll add a debug module that eats writes and returns the last 16 commands. It should be easy to do, and helpful. I should be able to use it to debug each level if I keep the interface the same.

Data widths

  • HT interface is 64-bits * 8 cycles = max 64 B
  • DRAM interface is (256 data bits + 32 ECC bits) * 4-cycle burst = max 128 B + 16 B ECC
  • Disk is in units of pages for my purposes: 4096 bytes, or 32 * DRAM (128 B each), or 64 * HT (64 B each)

Tag storage

  • Conveniently, I have 4 GB DDR, 4 MB SRAM, and 4K in each page
    • 1M pages in DRAM
    • 4B SRAM per page
    • If I store tags in SRAM
      • 40-12 =28 bits tag
      • valid
      • dirty
      • 2 usage bits
    • If I store tags in ECC bits of DRAM
      • 30 usage bits
      • valid
      • dirty
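
The arithmetic behind the SRAM option can be checked with a short script. The sizes come from the list above. The field order inside the 32-bit entry (tag in the upper bits, then valid, dirty, and usage) is an assumption, as are the function names.

```python
PAGE_BYTES = 4096
DDR_BYTES = 4 * 2**30    # 4 GB DDR
SRAM_BYTES = 4 * 2**20   # 4 MB SRAM
ADDRESS_BITS = 40
PAGE_OFFSET_BITS = 12
TAG_BITS = ADDRESS_BITS - PAGE_OFFSET_BITS   # 28

pages_in_dram = DDR_BYTES // PAGE_BYTES           # 1M pages
sram_bytes_per_page = SRAM_BYTES // pages_in_dram  # 4 B per page


def pack_entry(tag, valid, dirty, usage):
    """Pack one tag entry into a 32-bit word: [31:4] tag, [3] valid, [2] dirty, [1:0] usage."""
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("tag does not fit in 28 bits")
    if not 0 <= usage < 4:
        raise ValueError("usage does not fit in 2 bits")
    return (tag << 4) | (int(valid) << 3) | (int(dirty) << 2) | usage


def unpack_entry(word):
    """Inverse of pack_entry: returns (tag, valid, dirty, usage)."""
    tag = (word >> 4) & ((1 << TAG_BITS) - 1)
    return tag, bool(word & 0x8), bool(word & 0x4), word & 0x3
```

The 28 + 1 + 1 + 2 bits fill the 4 B per page exactly, which is why the SRAM option leaves only 2 usage bits while the ECC option has room for 30.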

DMA

  • DMA means that the disk controller uses the same interface to DiskRAM as the CPU
    • If that becomes a problem, I could:
      • DMA into RAM by the CPU (unused)
      • Have DiskRAM read it in.

That might not be too bad, since the datapath is almost the same, but the control is more complicated.

Tracing block

I'm implementing a tracing module that will keep track of which areas of the FPGA have been written to or read from. It records the counts for each type and area, and the last few accesses of each type. I'm hoping that it will help me solve the XD1000 Freezing Problem I'm having, and be useful in the future.

It fits into the design at the end of the existing blocks.

The way to access it is to read from the data storage.


Better Tracing Block

This tracing block only traces unclaimed addresses. This means that each request that reaches it is recorded, and the last 8 are returned on a read. I haven't modified it to return different sizes yet. I'm hoping that I will only have to deal with different locations once the memory is prefetchable.

So right now it just puts requests into a FIFO and lets them fall out the end when it's full.
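
The fall-out-the-end behavior is the same as a bounded queue. A minimal Python model, with a hypothetical class name and a request represented by whatever object the caller passes in:

```python
from collections import deque


class TraceFifo:
    """Keeps the most recent requests; the oldest falls out when full."""

    def __init__(self, depth=8):
        self.entries = deque(maxlen=depth)

    def record(self, request):
        # Appending to a full deque silently discards the oldest entry.
        self.entries.append(request)

    def read_back(self):
        # Oldest first, newest last.
        return list(self.entries)
```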


Parameterizable Write Buffer

This was optional; now it is required.

The x4 DIMMs installed in the XD1000 have the byte enables disabled on the DIMM. This means that partial writes are not supported, so all writes need to be 128 B at a time. The other option is to generate my own DDR controller, replace the DIMMs with x8 or x16 parts, and play with the timings. I don't want to.

Unaligned reads are not supported

Since I'm storing the data in 64-bit chunks, I can't read on 4 B boundaries. I don't know if I need to fix that, but I need to remember it! I'm hoping that with caching enabled, it never reads partial cache lines.

Design

Parameters:

  • Line Size (128 B, 256 B, 512 B)
  • Associativity (2,4,8)
  • Policy
    • Write through full lines
      • Writes will come in (max) 64 B chunks, but DRAM uses 128 B chunks more efficiently
      • When 256 B (contiguous) have been written, write them back
    • Wait to be evicted

The idea is to combine writes into larger chunks. I need to keep usage information and pass it down with the command when it gets evicted.

Each line has usage stats that can be passed down in extra bits in the HyperTransport packet. I use saturating counters, figuring that there will be a small number of accesses, and I have a small amount of space in the HyperTransport packets. I can use the unitID field, which is 5 bits long (3 for writes, 2 for reads). The real unitID will always be 0 since I only receive requests from upstream. Peer-to-peer requests need to go through the root first.
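
A sketch of how the usage counters could be squeezed into the 5-bit field. It reads "3 for writes, 2 for reads" as a 3-bit write counter and a 2-bit read counter, and it assumes the write counter occupies the upper bits. Both are guesses about the layout, and the function names are made up.

```python
WRITE_MAX = 7   # 3-bit saturating counter
READ_MAX = 3    # 2-bit saturating counter


def sat_inc(value, maximum):
    """Increment, but stick at the maximum instead of wrapping to zero."""
    return value if value >= maximum else value + 1


def pack_usage(writes, reads):
    """Build the 5-bit field: bits [4:2] write count, bits [1:0] read count."""
    return (min(writes, WRITE_MAX) << 2) | min(reads, READ_MAX)


def unpack_usage(field):
    """Split the 5-bit field back into (writes, reads)."""
    return (field >> 2) & 0x7, field & 0x3
```

Saturating rather than wrapping means that a heavily used line always looks heavily used, even though the exact count is lost.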

In order to efficiently use memory, I need to figure out a way to store the tags (write requests) with the data in the write buffer.

  • I could store it in the parity bits, but that would mean up to 8 reads for each check, which is too much overhead.
  • Another possibility is to put all the tags in one memory and the data in another
  • Still another possibility is to partition the memory, so that the tags end up at the end of memory and there are null words to stop you from using the tags as data.

I think that the partitioning scheme is the most flexible and parameterizable. It also allows the accesses to happen independently.


Flow chart (image) and state explanations:
  • Idle
    • Present Tag RAM with an address
    • Wait until valid
  • Valid Command
    • Compare Tags
    • Decide if Hit, Evict, or Miss
  • Read Miss
    • Wait for bus and send command
  • Write Hit Start
    • Write updated tag
  • Write Hit Data
    • Write Hit Data to Data RAM
  • Write Hit Last
    • Write last datum to Data RAM (needed for case when bus is slow)
  • Evict Get Bus
    • Wait for control of bus
    • Load counter
  • Evict Command
    • Send write command to lower level (must be a jumbo write -- 128B)
  • Evict Data
    • Send data
  • Evict Again (used for slow bus)
    • Load counter
    • Go to Write Hit, Evict, or Read Miss
  • Read Hit Get Bus
    • Wait for bus
  • Read Hit Command
    • Send response command
  • Read Hit Data
    • Send data
  • Read Hit Last
    • Last datum (needed for slow bus)

Changes due to DDR controller

I used to write back a partial line then send the read in case of an eviction. Now that I can't write partial lines, I need to read the full line (fill the line) then write back the modified version. I'm going to use this same procedure to support smaller writes and read-modify-writes.
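
The fill, merge, and write-back sequence can be modelled like this. It is a behavioral sketch only: DRAM is a dictionary of 128 B lines, missing lines read back as zeros (an assumption made so the example is self-contained), and the function name is hypothetical.

```python
LINE_BYTES = 128   # the only write size the DDR path supports


def write_partial(dram, addr, data):
    """Apply a small write by reading, patching, and rewriting the whole line."""
    base = addr - (addr % LINE_BYTES)
    offset = addr - base
    if offset + len(data) > LINE_BYTES:
        raise ValueError("write crosses a line boundary")
    # Fill: read the full line, because byte enables are not available.
    line = bytearray(dram.get(base, bytes(LINE_BYTES)))
    # Merge: overwrite only the bytes covered by this write.
    line[offset:offset + len(data)] = data
    # Write back: always a full 128 B line.
    dram[base] = bytes(line)
    return base
```

A read-modify-write follows the same three steps, with the modify step computing the new bytes from the old ones instead of taking them from the request.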

Flow chart (image). The state explanations are the same as in the list under Design above.

Design for Testability

In order to test it, I added a bit in a control register to flush the Write Buffer. Actually I made it so it throws away whatever is there. It would be nicer to have it write back anything that was dirty. I can do this manually by writing pages of zeros to an unused area, but it would be nicer...

RAM Control

This block is the brains of DiskRAM. It controls the SRAM, and based on the tags, instructs the disk and DRAM interfaces what data to read/write and where.

RAM Control State Machine

Flow chart (image) and state explanations:
  • Idle
    • Wait until valid
  • Check SRAM
    • SRAM holds tag RAM for DiskRAM
  • SRAM Data
    • Wait for the SRAM to respond
  • SRAM Update
    • Update the SRAM usage information, tags, and replacement bits
  • Eviction
    • Send read to DDR and write to disk
  • Miss
    • Send a read to Disk and write to DDR
  • Hit
    • Send command to DDR
  • Write
    • Send data to DDR

DDR DMA

There are two state machines in this block. One handles commands and writing. The other handles the disk interface and reading.

Main State Machine

Flow chart (image) and state explanations:
  • Idle
    • Wait until valid
  • Command
    • Register Command
  • Read Queue Command
    • Send a read command to the Read State Machine
  • Read DDR Command
    • Send a read command to the DDR
  • Write Command
    • Send a write command to the DDR
  • Write Disk Check Command
    • Make sure the data from disk is correct
  • Write Data
    • Write the data to the disk
  • Write Data Retry
    • If the DDR is not ready - retry
  • Write Data Last
    • Write the last datum

Read State Machine

Flow chart (image) and state explanations:
  • Idle
    • Wait for work from Main State Machine
  • Read Command
    • Register Command
  • Read Send Disk Command
    • Send command to SATA controller
  • Read Wait For Data
    • Wait until DDR data arrives
  • Read HT Command
    • Send read response command
  • Read Data State
    • Send Data to Disk or CPU

SATA IF

There's a race condition here between SATA IF and SATA Read IF!

When the ACT and CMD bits get set, SATA Read IF starts polling. If there are lots of things in the queue, the reads might get there before the writes!

The way I've "fixed" it is to make it check in two places (one of which the drive doesn't clear).

The problem is that I need to make sure that the value changes every time, so I've made it alternate between all ones and all zeros. This depends on the values being zeroed before it starts; the driver will have to do that. I chose a spot inside the scatter-gather list which is never used: the address of the scatter-gather list plus 64.
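
A behavioral sketch of the alternating marker. The offset of 64 and the ones/zeros alternation come from the notes above. The 32-bit marker width, the names, and the exact completion test (marker matches and ACT is clear) are assumptions about how the two checks combine.

```python
ALL_ONES = 0xFFFFFFFF    # assumed 32-bit marker
ALL_ZEROS = 0x00000000
MARKER_OFFSET = 64       # unused spot inside the scatter-gather list


def marker_address(sg_list_addr):
    return sg_list_addr + MARKER_OFFSET


class MarkerToggler:
    """Produces a marker value that differs from the previous command's."""

    def __init__(self):
        # Relies on the driver having zeroed the marker location beforehand.
        self.last = ALL_ZEROS

    def next_marker(self):
        self.last = ALL_ONES if self.last == ALL_ZEROS else ALL_ZEROS
        return self.last


def disk_finished(act_bit, marker_in_memory, expected_marker):
    # The drive never clears the marker, so a match shows the writes landed.
    writes_landed = marker_in_memory == expected_marker
    # Only then does a cleared ACT bit mean that the command has completed.
    return writes_landed and not act_bit
```

Without the marker check, a poll that overtakes the writes would see ACT still clear from the previous command and report completion too early.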

Flow chart (image) and state explanations:
  • Idle
    • Wait until valid
  • Get Slot
    • Wait for an open command slot
  • Write Data Cmd
    • Write the HT command
  • Write Data
    • Write Data to buffer
  • Write Data Last
    • Write last datum to buffer (needed when the bus is not ready)
  • Write Disk Commands
    • Send series of commands to Disk controller
      • Fill FIS (command structure)
      • Fill SG (Scatter Gather list)
      • Fill CMD Slot (set pointer to FIS)
      • Set ACT bit
      • Issue CMD
  • Send Read Setup
    • Send information so that DMA can get data from buffer

SATA RD IF

Flow chart (image) and state explanations:
  • Idle
    • Wait until valid
  • Read ACT Register
    • Set buffer address
    • Set size
  • Wait for ACT
    • Don't try to get data until disk finished
  • Set up HT DMA
    • Set buffer address
    • Set size
  • Write Data Cmd
    • Send HT cmd
  • Write Data
    • Send data from the DMA

Weirdness

When I access certain locations it gives me a CRC error in simulation, but no error in real life!

The location is 0x32c offset from the beginning of the first BAR.

It turns out that reading uninitialized block RAMs in simulation returns UUUUU (undefined values), which can't be turned into a valid CRC. In real life I get garbage data (zeroes), so it doesn't give a CRC error.