My FPGA design page
From PEL Wiki
- General understanding of the Altera HyperTransport MegaCore
- FPGA Hints
Example Design
Interfaces to HyperTransport
The XD1000 example design simplifies its interface by separating traffic into
- Control traffic initiated by the CPU
- Data traffic initiated by the FPGA
They wrap the three Atlantic interfaces (see Altera HyperTransport MegaCore) that I need to use into a control bus that:
- Only accepts 32-bit reads and writes
- Only allows one transaction at a time (simplifies tag handling)
- Passes on only a limited portion of the address space.
I'm going to insert a new wrapper that forwards any 32-bit control read/write to their interface and doesn't allow anything else until that one completes. That will let me use their existing control interface for programming without crippling the rest of the interface: I need to be able to read/write in larger chunks and have multiple outstanding requests.
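As a rough behavioral sketch of that wrapper (a Python software model, not the HDL): the class name `ControlGate`, the `is_control` flag, and the three return strings are all hypothetical, and how a request is actually classified as control traffic is left open here.

```python
class ControlGate:
    """Toy model: one 32-bit control access at a time, everything else waits."""

    def __init__(self):
        # True while a control access is outstanding on the example design's bus.
        self.control_busy = False

    def route(self, size_bytes, is_control):
        """Decide where a new request goes: 'control', 'data', or 'stall'."""
        if self.control_busy:
            # Nothing else is allowed until the control access completes.
            return "stall"
        if is_control and size_bytes == 4:
            self.control_busy = True
            return "control"
        # Larger or non-control requests go to the new data path.
        return "data"

    def control_done(self):
        """Called when the outstanding control access has completed."""
        self.control_busy = False
```

The only state is a single busy flag, which matches the one-transaction-at-a-time restriction of the existing control bus.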
Transaction Types (bold ones need to be implemented)
- NOP
- Sync/Error
- Sized Write (Posted and Non-Posted) (Dword and Byte)
- Broadcast Messages
- Sized Read (Dword and Byte)
- Flush
- Fence
- Atomic RMW (Fetch and Add OR Compare and Swap)
- Read Response
- Target Done
Because my memory region will be prefetchable, I don't think I'll have to worry about Byte Reads and Writes. I'll implement Byte Reads anyway, because an unanswered one would hang the machine and they are easy (translate to a single DWORD read), but I'll just drop Byte Writes and keep a counter of how many I drop.
I don't know if I'll see fences, flushes, or read-modify-writes. I think read-modify-writes may be important, and each one is very similar to two commands, a read and a write. Fences will take care of themselves because I'm not a bridge and they travel in the posted channel with the writes they should follow. Flushes require a response, so I may have to implement them.
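The byte-access policy described above can be sketched as a small Python model. The class name, the string command kinds, and the choice to align the read address down to a 4-byte boundary are my assumptions for illustration, not taken from the design.

```python
class ByteAccessPolicy:
    """Toy model: byte reads become dword reads, byte writes are dropped."""

    def __init__(self):
        # Counter of dropped byte writes, kept for debugging.
        self.dropped_byte_writes = 0

    def handle(self, kind, addr):
        """Return the command to issue downstream, or None if it is dropped."""
        if kind == "byte_read":
            # Assumption: serve it as one dword read at the aligned address.
            return ("dword_read", addr & ~0x3)
        if kind == "byte_write":
            self.dropped_byte_writes += 1
            return None
        # Everything else passes through unchanged.
        return (kind, addr)
```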
Hierarchy
The XD1000 example design has several blocks:
- Data generators and checkers for the DDR and HT links
- SRAM controller
- CPLD/Flash controller
- DDR_Write & DDR_Read that interface to Altera's DDR controller
- HT_MSI Message Signaled Interrupt (MSI) interface
- HT_Write & HT_Read interface to Altera's HT controller
- HT_CTRL to decode the signals on the control bus
- LED controller
- A small RAM
- Clock logic
My design can get rid of the data generators and checkers.
I need to figure out if I should add a FIFO for the data as it passes from one place to another (e.g., disk to RAM), or if I should just use the DMA engines as they are. Right now I'm leaning toward using the FIFO even though it may cost me some latency, just to keep the design simple. Otherwise I'll have to synchronize the DMA engines or combine them.
DiskRAM block
I am making a new diskRAM block that incorporates the Atlantic interfaces to HT and the DRAM interface. I need to be able to handle variable-sized reads and writes (up to 16 dwords, i.e. 64 bytes) from the HT and convert them to DRAM accesses (32 bytes + ECC per cycle). I'll use their command generators as starting points.
I need to solve Data Hazards in DiskRAM.
DiskRAM Hierarchy
diskram
- diskram_ht_arbiter - decide which transactions DiskRAM should respond to
- diskram_write_buffer - Combine writes into 128-byte blocks for DRAM (2-way associative, 4K total size)
- diskram_biased_fifo_arbiter
- diskram_ht_split - separate line fill traffic from responses
- ht_biased_fifo_arbiter
- cache_ram - the actual data and tag storage
- diskram_ram_ctrl - The brains of DiskRAM. Controls the SRAM, DRAM, and disk
- diskram_sram_if - An interface to the 4MB SRAM (36 to 144 bit width accesses) includes the old interface for testing
- diskram_ddr_dma - Accepts commands from ram_ctrl and directs the data up or downstream
- diskram_ddr_if - controls enables to the Altera DRAM controller, does all reads but only 128B writes
- bus_decoupler
- bus_decoupler
- diskram_disk_ctrl - wraps sata_ctrl and sata_read_ctrl
- diskram_sata_ctrl - handles all writes to buffer memory and control memory for the SATA controller
- xd1000_ht_write_channel - a DMA engine for writes
- slot_finder - finds the next available slot in the SATA controller
- diskram_sata_read_ctrl - handles reads from buffer memory and control memory for the SATA controller
- xd1000_ht_read_channel - a DMA engine for reads
Diagrams
- A simple flow chart without evictions
- An Atlantic Block Diagram
- Another with ordered busses
- A flow chart (everything causes hazards)
- A more complex chart (read hits are hazards)
- The tracing module
- A generic storage module
Assuming that I go with the "read hits are hazards" scheme, here's the implementation outline:
Two hashes
- Read in Progress (RIP) hash
- Conflict hash (write waiting for a RIP)
Reads can proceed when the RIP bit is set, but not when the conflict bit is set. Writes can't proceed while the RIP bit is set; instead they set the conflict bit.
If I use the lower 5 bits of the address (above the line width) as the index, that gives me 32nds of a page for conflicts.
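Here is a minimal Python model of the two-hash scheme, to check that the rules hang together. Several things are my assumptions rather than part of the outline: a 128 B line (7 offset bits), a per-bucket count of reads in progress instead of a single RIP bit (so two overlapping reads don't clear each other), and clearing the conflict bit when the waiting write finally proceeds.

```python
LINE_BITS = 7    # assumption: 128 B lines
INDEX_BITS = 5   # lower 5 address bits above the line offset -> 32 buckets


def bucket(addr):
    """Hash an address to one of the 32 hazard buckets."""
    return (addr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)


class HazardTracker:
    def __init__(self):
        self.rip = [0] * (1 << INDEX_BITS)           # reads in progress
        self.conflict = [False] * (1 << INDEX_BITS)  # a write is waiting

    def try_read(self, addr):
        """Reads proceed unless a write is already waiting on this bucket."""
        b = bucket(addr)
        if self.conflict[b]:
            return False
        self.rip[b] += 1
        return True

    def read_done(self, addr):
        self.rip[bucket(addr)] -= 1

    def try_write(self, addr):
        """Writes wait for reads in progress and mark the bucket as conflicted."""
        b = bucket(addr)
        if self.rip[b] > 0:
            self.conflict[b] = True
            return False
        self.conflict[b] = False  # assumption: the waiting write clears it
        return True
```

Because later reads to a conflicted bucket are held back, the waiting write cannot be starved by a stream of read hits.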
Interesting Points
Debug
I'll add a debug module that eats writes and returns the last 16 commands. It should be easy to do, and helpful. I should be able to use it to debug each level if I keep the interface the same.
Data widths
- HT interface is 64-bits * 8 cycles = max 64 B
- DRAM interface is (256 bits + 32 ECC bits) * 4 cycle burst = max 128 B + 16 B ECC
- Disk is in units of pages for my purposes: 4096 bytes, or 32*DRAM or 64*HT
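Working the figures above through in Python, as a quick sanity check. The constant names are mine; the numbers are the ones listed. A 4096-byte page comes out at 32 maximum-size DRAM bursts and 64 maximum-size HT packets.

```python
BITS_PER_BYTE = 8

HT_MAX_BYTES = 64 * 8 // BITS_PER_BYTE      # 64-bit bus, 8 cycles
DRAM_MAX_BYTES = 256 * 4 // BITS_PER_BYTE   # 256 data bits, 4-cycle burst
DRAM_ECC_BYTES = 32 * 4 // BITS_PER_BYTE    # 32 ECC bits, 4-cycle burst
PAGE_BYTES = 4096

DRAM_BURSTS_PER_PAGE = PAGE_BYTES // DRAM_MAX_BYTES
HT_PACKETS_PER_PAGE = PAGE_BYTES // HT_MAX_BYTES

print(HT_MAX_BYTES, DRAM_MAX_BYTES, DRAM_ECC_BYTES)
print(DRAM_BURSTS_PER_PAGE, HT_PACKETS_PER_PAGE)
```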
Tag storage
- Conveniently, I have 4 GB DDR, 4 MB SRAM, and 4K in each page
- 1M pages in DRAM
- 4B SRAM per page
- If I store tags in SRAM
- 40 - 12 = 28 bits tag
- valid
- dirty
- 2 usage bits
- If I store tags in ECC bits of DRAM
- 30 usage bits
- valid
- dirty
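The SRAM option fits exactly: 28 tag bits + valid + dirty + 2 usage bits = 32 bits, which is the 4 B per page available. A small Python sketch of packing one entry follows; the field order within the word is my assumption, and only the field widths come from the list above.

```python
TAG_BITS = 28
USAGE_BITS = 2

PAGES_IN_DRAM = (4 * 2**30) // 4096               # 4 GB of 4 KB pages
SRAM_BYTES_PER_PAGE = (4 * 2**20) // PAGES_IN_DRAM  # 4 MB SRAM shared out


def pack_entry(tag, valid, dirty, usage):
    """Pack one tag entry into a 32-bit word (hypothetical layout)."""
    assert 0 <= tag < (1 << TAG_BITS)
    assert 0 <= usage < (1 << USAGE_BITS)
    return (tag << 4) | (int(valid) << 3) | (int(dirty) << 2) | usage


def unpack_entry(word):
    """Inverse of pack_entry: returns (tag, valid, dirty, usage)."""
    return (word >> 4, bool((word >> 3) & 1), bool((word >> 2) & 1), word & 0x3)
```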
DMA
- DMA means that the disk controller uses the same interface to DiskRAM as the CPU
- If that becomes a problem, I could:
- DMA into RAM by the CPU (unused)
- Have DiskRAM read it in.
That might not be too bad, since the datapath is almost the same, but the control is more complicated.
Tracing block
I'm implementing a tracing module that will keep track of which areas of the FPGA have been written to or read from: the counts for each type and area, and the last few accesses of each type. I'm hoping that it will help me solve the XD1000 Freezing Problem I'm having, and be useful in the future.
It fits into the design at the end of the existing blocks.
The way to access it is to read from the data storage.
Better Tracing Block
This tracing block only traces unclaimed addresses. This means that each request that reaches it is recorded, and the last 8 are returned on a read. I haven't modified it to return different sizes yet. I'm hoping that I will only have to deal with different locations once the memory is prefetchable.
So right now it just puts requests into a FIFO and lets them fall out the end when it's full.
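In software terms the behavior is a fixed-depth FIFO that forgets its oldest entry. A minimal Python model, where the class name and the `(kind, addr)` record format are hypothetical:

```python
from collections import deque


class TraceBuffer:
    """Toy model: record every unclaimed request, keep only the last 8."""

    def __init__(self, depth=8):
        # deque with maxlen drops the oldest entry once it is full.
        self.entries = deque(maxlen=depth)

    def record(self, kind, addr):
        self.entries.append((kind, addr))

    def read(self):
        """Return the recorded requests, oldest first."""
        return list(self.entries)
```

After ten recorded requests, `read()` returns only the last eight, oldest first.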
Parameterizable Write Buffer
This was optional; now it is required.
The x4 DIMMs installed in the XD1000 have the byte enables disabled on the DIMM. This means that partial writes are not supported, so all writes need to be 128 B at a time. The other option is to generate my own DDR controller, replace the DIMMs with x8 or x16 parts, and play with the timings. I don't want to.
Unaligned reads are not supported
Since I'm storing the data in 64 bit chunks, I can't read on 4B boundaries. I don't know if I need to fix that, but I need to remember it! I'm hoping that with caching enabled, it never reads partial cache lines.
Design
Parameters:
- Line Size (128 B, 256 B, 512 B)
- Associativity (2,4,8)
- Policy
- Write through full lines
- Writes will come in (max) 64 B chunks, but DRAM uses 128 B chunks more efficiently
- When 256 B (contiguous) have been written, write them back
- Wait to be evicted
The idea is to combine writes into larger chunks. I need to keep usage information and pass it down with the command when it gets evicted.
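A rough Python model of the combining step for a single line, assuming a 128 B line and a per-byte valid mask (both assumptions for illustration; the real buffer is parameterizable and also tracks tags and usage):

```python
LINE_BYTES = 128  # assumption: smallest configured line size


class WriteLine:
    """Toy model of one write-buffer line that collects smaller writes."""

    def __init__(self, base):
        self.base = base
        self.data = bytearray(LINE_BYTES)
        self.valid = [False] * LINE_BYTES

    def write(self, addr, payload):
        """Merge a write into the line; return True once the line is full."""
        off = addr - self.base
        assert 0 <= off and off + len(payload) <= LINE_BYTES
        self.data[off:off + len(payload)] = payload
        for i in range(off, off + len(payload)):
            self.valid[i] = True
        # A full line can go to DRAM as one 128 B write.
        return all(self.valid)
```

Two back-to-back 64 B writes to the same line fill it, at which point it could be written through as a full line.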
Each line has usage stats that can be passed down in extra bits in the HyperTransport packet. I use saturating counters, figuring that there will be a small number of accesses, and I have a small amount of space in the HyperTransport packets. I can use the unitID field, which is 5 bits long (3 for writes, 2 for reads). The real unitID will always be 0 since I only receive requests from upstream. Peer-to-peer requests need to go through the root first.
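A small Python sketch of the saturating counters and of packing them into a 5-bit field. The bit positions (writes in the upper 3 bits, reads in the lower 2) are my assumption; only the widths come from the text.

```python
WRITE_BITS = 3
READ_BITS = 2


def sat_inc(value, bits):
    """Increment a counter that sticks at its maximum instead of wrapping."""
    return min(value + 1, (1 << bits) - 1)


def pack_usage(writes, reads):
    """Pack both counters into a 5-bit value (hypothetical layout)."""
    assert 0 <= writes < (1 << WRITE_BITS) and 0 <= reads < (1 << READ_BITS)
    return (writes << READ_BITS) | reads


def unpack_usage(field):
    """Inverse of pack_usage: returns (writes, reads)."""
    return (field >> READ_BITS) & ((1 << WRITE_BITS) - 1), field & ((1 << READ_BITS) - 1)
```

Saturation means a heavily used line just reports "7 or more writes" or "3 or more reads", which should be enough if the access counts are small.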
In order to efficiently use memory, I need to figure out a way to store the tags (write requests) with the data in the write buffer.
- I could store it in the parity bits, but that would mean up to 8 reads for each check, which is too much overhead.
- Another possibility is to put all the tags in one memory and the data in another
- Still another possibility is to partition the memory, so that the tags end up at the end of memory and there are null words to stop you from using the tags as data.
I think that the partitioning scheme is the most flexible and parameterizable. It also allows the accesses to happen independently.
Changes due to DDR controller
I used to write back a partial line then send the read in case of an eviction. Now that I can't write partial lines, I need to read the full line (fill the line) then write back the modified version. I'm going to use this same procedure to support smaller writes and read-modify-writes.
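The fill-then-write-back step amounts to a byte-wise merge of the line read from DRAM with the bytes held in the write buffer. A minimal Python sketch, assuming a 128 B line and a per-byte valid mask (the function and parameter names are hypothetical):

```python
LINE_BYTES = 128  # assumption: one full DRAM write


def merge_line(dram_line, buffered, byte_valid):
    """Overlay the buffered (modified) bytes on the line read from DRAM."""
    assert len(dram_line) == len(buffered) == len(byte_valid) == LINE_BYTES
    return bytes(new if valid else old
                 for old, new, valid in zip(dram_line, buffered, byte_valid))
```

The result is always a full line, so it can be written back without byte enables.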
Design for Testability
In order to test it, I added a bit in a control register to flush the Write Buffer. Actually I made it so it throws away whatever is there. It would be nicer to have it write back anything that was dirty. I can do this manually by writing pages of zeros to an unused area, but it would be nicer...
RAM Control
This block is the brains of DiskRAM. It controls the SRAM, and based on the tags, instructs the disk and DRAM interfaces what data to read/write and where.
RAM Control State Machine
DDR DMA
There are two state machines in this block. One handles commands and writing. The other handles the disk interface and reading.
Main State Machine
Read State Machine
SATA IF
There's a race condition here between SATA IF and SATA Read IF!
When the ACT and CMD bits get set, then SATA Read IF starts polling. If there are lots of things in the queue, then the reads might get there before the writes!
The way I've "fixed" it is to make it check in two places (one of which the drive doesn't clear).
The problem is that I need to make sure that the value in that second place changes every time, so I've made it alternate between all ones and all zeros. That depends on the location being zeroed before the first command, which the driver will have to do. I chose a spot inside the scatter-gather list which is never used: the address of the scatter-gather list plus 64.
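To convince myself the alternating marker works, here is a toy Python model of a single marker location. The 32-bit marker width, the class names, and the dictionary standing in for memory are all assumptions; only the offset of 64 and the ones/zeros alternation come from the notes above.

```python
MARKER_OFFSET = 64        # marker lives at scatter-gather list address + 64
ALL_ONES = 0xFFFFFFFF     # assumption: the marker is one 32-bit word


class MarkerWriter:
    """Write side: flips the marker each time a command is posted."""

    def __init__(self):
        self.next_value = ALL_ONES  # memory starts zeroed, so write ones first

    def write_marker(self, mem, sg_addr):
        mem[sg_addr + MARKER_OFFSET] = self.next_value
        self.next_value ^= ALL_ONES


class MarkerPoller:
    """Read side: only proceeds once the marker has flipped."""

    def __init__(self):
        self.expected = ALL_ONES

    def command_visible(self, mem, sg_addr):
        if mem.get(sg_addr + MARKER_OFFSET, 0) != self.expected:
            return False
        self.expected ^= ALL_ONES
        return True
```

If the driver forgets to zero the location, the first poll can match stale data, which is exactly the dependency noted above.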
SATA RD IF
Weirdness
When I access certain locations it gives me a CRC error in simulation, but no error in real life!
The location is 0x32c offset from the beginning of the first BAR.
Turns out that reading uninitialized block RAMs gives me UUUUU, which doesn't compute a very nice CRC. In real life I get garbage data (zeroes) so it doesn't give a CRC error.
