Testing DiskRAM

From PEL Wiki


I'm testing a piece at a time. Before I just try to assign the region as memory and boot into it, I'd like to have some idea that it will work.

In order to do this, I have left an interface to the SRAM that stores the tags. This allows me to set the tag, valid, and dirty bits before reading or writing to DiskRAM. I can then control which portions of DiskRAM are exercised.

There are several pitfalls in this testing:

  • Virtual Memory
    • Virtual pointers do not necessarily have the same alignment as the physical addresses they represent
      • This was ugly when I set up the buffers for the disk: they wrote over other memory because of address truncation (the design assumes that the buffers are aligned on a 32*page_size boundary)
  • Pointer Arithmetic
    • Adding offsets to pointers gives different results depending on the pointer type (I know this should be obvious, but it caught me)
      • casting before addition vs. addition before casting
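
Both pitfalls can be illustrated with a small C sketch. The helper names are hypothetical and are not taken from the DiskRAM driver; a 64-bit element type is assumed only to make the scaling visible. The last helper shows the kind of check that has to be applied to the physical address, since a virtual pointer shares only the in-page offset with the physical address it maps to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/* Addition BEFORE casting: the offset is scaled by sizeof(uint64_t). */
static inline uint8_t *add_then_cast(uint64_t *base, size_t offset)
{
    return (uint8_t *)(base + offset);   /* advances offset * 8 bytes */
}

/* Casting BEFORE addition: the offset is a plain byte count. */
static inline uint8_t *cast_then_add(uint64_t *base, size_t offset)
{
    return (uint8_t *)base + offset;     /* advances offset bytes */
}

/* Offset of an address inside an alignment block (align is a power of two).
 * A result of 0 means the address sits exactly on the boundary. */
static inline uintptr_t offset_in_block(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & (align - 1);
}
```

With a 4096-byte page (an assumption), a buffer boundary of 32*page_size is 0x20000 bytes, so offset_in_block(phys, 32 * 4096) must be 0 for the physical address of every buffer, regardless of what the virtual pointer looks like.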


Testing Order

After I found one too many bugs in my simple reused components, I decided to test them better. I created a testbench with two randomized packet generators and two checkers. I hook them up with various combinations of the bus decoupler (a glorified FIFO that speaks Atlantic), arbiters (which combine two buses into one), and a splitter. I then vary the clocks of the different components to make sure there are no problems when FIFOs fill up or run empty.

  1. Minor blocks
    1. Randomized testing of
      1. Bus decoupler
      2. bus_split
      3. arbiters
  2. Major blocks
    1. Reads and writes to DRAM
      1. Set SRAM bits so that a selected area causes hits (no disk activity)
      2. Reads and writes to make sure that bits change to the correct values.
    2. Reads and writes to disk
      1. Set SRAM bits so that every access causes a miss
      2. Reads and writes to see that the data gets back correctly
      3. Initialize buffer memory to 0x5A7AB00##0000### where ## is the buffer number and ### is the word number
      4. Initialize hard drive contents to "Page 0x000034 0x0040" for page 52 word 64
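
The two initialization patterns can be sketched in C as follows. The bit positions are my reading of the digit layout 0x5A7AB00##0000### (16 hex digits, so a 64-bit word), and storing the disk marker as ASCII text is an assumption; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Buffer word pattern 0x5A7AB00##0000###:
 * ## = buffer number (2 hex digits), ### = word number (3 hex digits). */
static uint64_t buffer_pattern(unsigned buf, unsigned word)
{
    assert(buf <= 0xFF && word <= 0xFFF);
    return (UINT64_C(0x5A7AB00) << 36) | ((uint64_t)buf << 28) | (uint64_t)word;
}

/* Disk marker text, e.g. "Page 0x000034 0x0040" for page 52, word 64.
 * Returns the number of characters written, excluding the terminator. */
static int disk_marker(char *out, size_t size, unsigned page, unsigned word)
{
    return snprintf(out, size, "Page 0x%06X 0x%04X", page, word);
}

/* Check that a string read back from disk matches the expected marker. */
static int marker_matches(const char *got, unsigned page, unsigned word)
{
    char expect[32];
    disk_marker(expect, sizeof expect, page, word);
    return strcmp(got, expect) == 0;
}
```

Because each word encodes where it came from, a mismatch on read-back shows immediately which buffer or page the stray data belongs to.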

Problems and Solutions

Freeze after seven writes *SENT TO XDI*

After seven movnti instructions, the system freezes, waiting for the disk controller to return the SCR_ACT value. The number seven may not have anything real to do with it, but since it stops on that number consistently, I'm naming the bug after it. It turns out that seven is the number of writes that reach DiskRAM, but the system only hangs when the writes come quickly from the processor. If I split a page of writes (64x64B) into two groups of 32x64B and put a sleep between them, it doesn't hang. I can recreate the problem in the simulator. A response from the disk controller never gets back to DiskRAM, so DiskRAM stops handling writes and the system hangs. It looks like a bug in the HT controller, because the packet arrives at the FPGA but never arrives inside.

I've sent XtremeData a zipped file so that they can reproduce the error. I hope that they'll get back to me soon.

DDR Read data mismatch *SOLVED*

Under certain conditions the DDR controller would get out of sync. It would de-assert ddr_rd_data_wait every time that the counter rolled, even on the last datum. When the DDR controller was ready with the next datum, this would result in lost data.

I added logic so that ddr_rd_data_wait only gets de-asserted when it's not the last datum.

DDR Read data mismatch #2 *SOLVED*

Somewhere, someone is eating an end_of_packet marker. Then the DDR controller gets more data than it asked for.

In the FILTER architecture of diskram_write_buffer, I was just checking eop (end of packet) to move on. I needed to check eop and ena to make sure it was a valid cycle. I had committed this error in other places as well, so I went through and checked everywhere I was deciding the next state based on sop or eop, and made sure that I was also using ena or val (depending on the bus type).
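
The difference between the two checks can be shown with a small software model. This is a C sketch, not the actual diskram_write_buffer logic; the struct and function names are hypothetical, and only the signal names ena, sop, and eop come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One bus cycle as seen by a state machine. */
struct bus_cycle {
    bool ena;  /* the cycle carries valid data */
    bool sop;  /* start of packet */
    bool eop;  /* end of packet */
};

/* Unqualified check: counts a packet end whenever eop is high,
 * even on cycles where ena is low and the bus value is meaningless. */
static unsigned packets_seen_unqualified(const struct bus_cycle *c, size_t n)
{
    unsigned count = 0;
    for (size_t i = 0; i < n; i++)
        if (c[i].eop)
            count++;
    return count;
}

/* Qualified check: eop only counts on a valid cycle (ena high). */
static unsigned packets_seen_qualified(const struct bus_cycle *c, size_t n)
{
    unsigned count = 0;
    assert(c != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (c[i].ena && c[i].eop)
            count++;
    return count;
}
```

On a trace where eop happens to be high during an idle cycle, the unqualified version sees one packet end too many, which is enough to put a state machine out of step with the real packet boundaries.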

Hang in simulator (would also happen in hardware) *FIXED*

If the ht_biased_arbiter(RTL) has in1_req asserted while in2 is in the middle of the packet, in2 never gets dav asserted again. I 'fixed' it by removing that logic, but a better fix would be to make a bus_decoupler that could handle it so that there is not starvation of in1.

I fixed it so that it respects packets, and never switches in the middle.
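
As a rough illustration of "respects packets", here is a behavioral model in C. It is a sketch under assumptions, not the ht_biased_arbiter RTL: I assume the bias is toward in1, and the struct, the function, and the single eop input (end of packet on a valid cycle of the current owner) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* grant: 0 = nobody, 1 = in1, 2 = in2 */
struct arbiter {
    int  grant;
    bool in_packet;  /* the current owner is in the middle of a packet */
};

/* Advance one cycle and return the input that owns the bus this cycle.
 * eop is true when the owner transfers the last word of its packet. */
static int arbiter_step(struct arbiter *a, bool in1_req, bool in2_req, bool eop)
{
    assert(a != 0);
    if (!a->in_packet) {
        /* Between packets: pick a new owner, biased toward in1. */
        a->grant = in1_req ? 1 : (in2_req ? 2 : 0);
        a->in_packet = (a->grant != 0);
    }
    int owner = a->grant;
    if (owner != 0 && eop)
        a->in_packet = false;  /* release the bus after the last word */
    return owner;
}
```

In this model a request on in1 that arrives while in2 is mid-packet simply waits until the eop cycle, so in2 is never cut off and in1 is served at the next packet boundary.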

Page Allocation on a write miss *FALSE BUG*

On a write miss, there is no read from disk.

It appears that way because in ddr_ram_ctrl a MISS becomes a HIT on the next cycle. It has already passed the disk command on, then passes on the write command.

Opteron doesn't cache writes *WORK AROUND*

In writeback mode, the Opteron is still writing 8 bytes at a time.

Workaround: I can use the movnti (Move Non-temporal) assembly instruction to get it to write larger chunks, but it uses write buffers instead of caches, so it may have problems. It lets me test the system now, though, while I figure out how to get it to cache the space.

FPGA doesn't respect ACT bits *SOLVED*

Sometimes the FPGA reads the buffer before the drive says that it is done.

The sata_read_ctrl module was getting out of sync with the ht_rxif module. The read_ctrl module was sending the same data twice, because the write_last_data_state was not asserting the ht_read signal.

Race condition with Disk Ready check

The sata_ctrl sends the read command to the disk, then to sata_read_ctrl. If sata_read_ctrl reads the ACT register before sata_ctrl sets it, it will believe the data has already been transferred.

I haven't seen the effects of this, but I'm not sure how to solve it. I could wait, but how long is long enough? I could read it active before reading it inactive, but that might lead to deadlock. I could always read it twice, figuring that the second time would be the correct status...

FPGA uses wrong command FIS *SOLVED*

The FPGA functioned correctly until the tag reached 16, then it started using the command address of tag-16.

It was a truncation problem in sata_ctrl. I was using only 4 bits for the tag when it should have been 5. Because the tag gets added to an offset of 5, I actually need 6 bits to cover 32 tags + 5 offset. Each command is 0x100 B, so I use 0x2500 B of space, which means the address needs to be 0x4000 B aligned.
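
The arithmetic can be checked with a short C sketch. The constant and function names are hypothetical; the assumption that the slot index is combined with an aligned base address (which is why the alignment has to cover all index bits) is mine and is not confirmed by the sata_ctrl source.

```c
#include <assert.h>
#include <stdint.h>

enum {
    NUM_TAGS    = 32,     /* tags 0..31 */
    SLOT_OFFSET = 5,      /* the tag is added to 5 before indexing */
    SLOT_BYTES  = 0x100   /* each command is 0x100 B */
};

/* Byte offset of a tag's command, using the full tag. */
static uint32_t slot_offset(unsigned tag)
{
    assert(tag < NUM_TAGS);
    return (tag + SLOT_OFFSET) * SLOT_BYTES;
}

/* The truncation bug: only the low 4 bits of the tag survive. */
static uint32_t slot_offset_4bit(unsigned tag)
{
    return ((tag & 0xFu) + SLOT_OFFSET) * SLOT_BYTES;
}

/* Number of bits needed to represent max_value. */
static unsigned bits_needed(unsigned max_value)
{
    unsigned bits = 0;
    while (max_value != 0) {
        bits++;
        max_value >>= 1;
    }
    return bits;
}

/* Space used: slots 0 .. NUM_TAGS + SLOT_OFFSET - 1. */
static uint32_t region_bytes(void)
{
    return (NUM_TAGS + SLOT_OFFSET) * SLOT_BYTES;
}

/* Alignment that leaves every slot index bit free in the base address. */
static uint32_t required_alignment(void)
{
    return (UINT32_C(1) << bits_needed(NUM_TAGS - 1 + SLOT_OFFSET)) * SLOT_BYTES;
}
```

The largest slot index is 31 + 5 = 36, which needs 6 bits; 2^6 slots of 0x100 B give the 0x4000 B alignment, while the 37 slots actually used occupy 0x2500 B. With the 4-bit truncation, tag 16 lands on the same slot as tag 0.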

System hangs during evictions

The system hangs during writes that cause evictions.

  1. The wrong type of bus_decoupler was losing a datum in the ddr_dma. When dav was asserted for two cycles and there was only a single-cycle packet, the first word of a following packet was lost. Changed to USE_DAV and added assert statements to catch it next time.

System hangs during read responses

The system hangs during read responses

  1. Bus_decoupler (RTL) was sampling for sop (start of packet) on fifo_rd instead of fifo_rd_d1, thus missing the sop when there was only one enable cycle.

Unlock Memory fails *WORK AROUND*

Sometimes unlocking the memory allocated for the port cmd area for the SATA disk fails. I suspect a corruption error, but I don't know exactly how to look for it.

More symptoms: looking at dmesg, it seems that the locking code gives out handles with decreasing values starting from 0x3ff. After I lock the memory, I unlock the regions that weren't physically contiguous. On the next allocation, it starts again at 0x3ff and doesn't seem to care whether it has already used that handle. The problem manifests when both allocations use the same handle.

Workaround: use fixed memory from 0xf8000000-0xf8800000 instead of trying to allocate my own contiguous buffer zone. The upside is less complexity. The downside is that there is no visibility into that area from user space. I guess I could map it in from the device driver, but that adds more complexity.