Letter From Chris Wanner
Kelly,
You make an interesting observation about whether there is a better way to architect a computer system, and whether there are global optimizations that haven't been or couldn't be made because an existing, older architecture has only been changed incrementally over time. A major "starting over" occurred earlier in the decade when HP/Intel came out with Itanium and started over in a number of areas related to the inside of the processor (CPU microarchitecture, compiler, firmware, etc.) but did little to change the system architecture. The AMD Opteron processor (like its architectural predecessor the Alpha EV7) changed several aspects of the system architecture but did not change the microarchitecture. Research within a number of companies is resulting in many-core solutions like Sun's Niagara chip, which represent a change in their own right, but not a radical change.
One of the challenges with making architectural changes - whether at the CPU level or at the system level - is whether the change is not only a measurable improvement in the short term but also a sustainable improvement over the long term. Additionally, customers have certain operational paradigms that can't radically change without a radical justification. It's rare that we have the opportunity to make a significant change when so many people are looking to make incremental changes and are successfully improving performance a little bit at a time.
But let's focus on your question. As you are probably aware, the main reason that disk drives are treated so separately from main memory is that their access latency is several orders of magnitude greater than that of main memory. Putting drives and memory into the same operational domain has to be balanced against the big difference between these two data storage elements in the time domain. CPUs are designed to be somewhat tolerant of latency, but major differences in latency (100ns vs 1ms) cannot be handled by the processor itself. Intelligence is required to keep these elements separated so that the processor is not inadvertently executing code that it thinks is a mere 100s of CPU clocks away and then gets stalled waiting for data that is, literally, millions of CPU clocks away. What happens today is that when this does occur, some intelligence (the OS) minimizes the impact by swapping that execution thread out while the slow data is being retrieved (paged in) and letting some other thread (that hopefully will be using local/fast memory) execute in its stead. If the OS cannot differentiate far memory from near memory then performance is impacted very quickly and very significantly.
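As a rough illustration of the gap described above, the two latencies can be converted into CPU clocks. This is only a back-of-the-envelope sketch: the 3 GHz clock rate and the helper name `clocks` are assumptions made for the example, not figures from the letter.

```python
# Convert access latencies into CPU clocks.
# ASSUMPTION: a 3 GHz core clock, chosen only for illustration.
CLOCK_HZ = 3_000_000_000
NS_PER_SECOND = 1_000_000_000


def clocks(latency_ns: int, clock_hz: int = CLOCK_HZ) -> int:
    """Return the number of CPU clocks that elapse during latency_ns."""
    return latency_ns * clock_hz // NS_PER_SECOND


dram_clocks = clocks(100)        # ~100 ns main-memory access
disk_clocks = clocks(1_000_000)  # ~1 ms disk access

print(dram_clocks)                 # 300
print(disk_clocks)                 # 3000000
print(disk_clocks // dram_clocks)  # 10000
```

Under that assumed clock rate, a memory access costs hundreds of clocks and a disk access costs millions, a ratio of four orders of magnitude.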
Your proposal requires a memory subsystem that is more tolerant of latency differences than a DRAM interface is. The module interfaces on the systems previously mentioned (DL760G2/DL740 and DL580G3) are really no more latency tolerant than the DRAM interface itself. I am most familiar with the DL760G2 and DL740 systems since my team of design engineers architected the system and designed the chipset/memory controller used in those products. The implementation was focused on providing the simplest solution to accessing memory within the established parameters of the target memory subsystem (PC100). Read and write ordering and buffering were optimized for latencies that did not exceed the page miss latencies of the DRAM, and the chipset assumed a relatively low number of clocks between a memory request being sent to the memory module and valid data being returned. To go beyond that number of clocks would simply break the protocol of the chipset and the system would lock up. The interface used in the DL580G3 is similar in that respect. Neither of these interfaces had any requirement to be any more tolerant of latency than the memory subsystem that it was connected to. Thus, for purposes of what you are trying to do, these systems do not provide the interface flexibility that you are looking for.
There are going to be few mechanisms for accomplishing the instrumentation that you wish without affecting performance. Possibly the key for you would be which mechanism is the least intrusive and where impacts to performance are minimized. Mapping memory to the PCI bus as you describe is one of the very few options you have, but as you know memory latencies even for hits will be much longer than if local memory were being utilized. If this gives you the visibility that you require then this may be your best bet. Alternatively, there are monitoring tools that can extract memory access information from the CPU, but as you know this would only be a passive collection of information vs. a dynamic use of the information as PCI memory could be. A third alternative, extremely impractical but very much more effective, would be to design a memory controller right on the CPU interface itself. This cannot really be done for Intel-based systems, but AMD Opteron systems allow themselves to be used in this manner.
The Opteron HyperTransport interface is a reasonably forgiving, latency tolerant interface. The DL585 has 4 CPU/memory cards that each hold a single Opteron processor along with its local memory. Each card's Opteron processor connects to the other cards' Opteron processors through HyperTransport. Theoretically, one could design a caching agent card that replaces one of these CPU/memory cards and that responds as if it were an Opteron memory controller. All or most of the system memory could be mapped to this caching agent so that all/most memory requests are serviced by this caching agent. This gives you what you would otherwise accomplish on the PCI bus without the latency involved with mapping memory to an I/O slot. You'd have a memory controller right up there with the other processors!
Before you get your hopes up, though, there are very large challenges associated with this. The largest is getting hold of the coherent HyperTransport specification from AMD that you need to design an Opteron-like caching agent. The HyperTransport interface that AMD uses between processors is a superset of the industry standard HyperTransport. Only AMD and a select few partners have the rights (and the spec) to design a coherent HT based agent, and AMD does not make that spec easily available.
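The idea of mapping all or most of system memory to the caching agent can be pictured with a toy address map. This is purely a sketch: the node names, the 1 GiB / 16 GiB split, and the `owner` function are hypothetical, and it does not describe how an Opteron system actually programs its address routing.

```python
from bisect import bisect_right

GIB = 1 << 30

# HYPOTHETICAL physical address map, sorted by start address.
# Each entry is (start address, node that services the range).
ADDRESS_MAP = [
    (0 * GIB, "cpu0-local-dram"),  # small local region kept on a CPU card
    (1 * GIB, "caching-agent"),    # everything else goes to the agent card
]
TOP_OF_MEMORY = 16 * GIB           # assumed total physical address space
RANGE_STARTS = [start for start, _ in ADDRESS_MAP]


def owner(addr: int) -> str:
    """Return the node that would service a request for addr."""
    if not 0 <= addr < TOP_OF_MEMORY:
        raise ValueError(f"address {addr:#x} is outside the map")
    index = bisect_right(RANGE_STARTS, addr) - 1
    return ADDRESS_MAP[index][1]


print(owner(0x1000))   # cpu0-local-dram
print(owner(2 * GIB))  # caching-agent
```

With a map like this one, 15 of the 16 GiB in the assumed address space would be serviced by the caching agent.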
Regards,
Chris Wanner
Hewlett-Packard Company
From: Kelly Flanagan Sent: Tuesday, February 14, 2006 10:46 AM To: Wanner, Chris Cc: Gaither, Blaine D Subject: BYU project description
Chris:
I hope emailing you directly is appropriate. Having read the emails below, I thought it would be a good idea to share with you what we are trying to investigate and do; perhaps this will generate a potential solution or path we have not thought about.
Over the past 15 years or so I have been engaged in research dealing with the collection and analysis of address, instruction, and disk I/O traces from various platforms. While doing this work it has always intrigued me how systems are created from an initial general paradigm and then typically incremental changes are made to enhance performance over a lengthy period of time. While these incremental changes indeed increase performance, I believe it is a worthwhile endeavor to occasionally rethink the entire system to see if some global optimizations can be found.
I have recently been looking with particular interest at the memory and I/O hierarchies. It struck me and my graduate students as odd how these two systems interact in Linux and likely other environments. For example, when main memory fills we remove old stuff and place it on disk (swap). On the other hand, to increase the performance of disk we cache recently accessed items in memory. In a recent investigation we also found a swap cache, which caches recently accessed things from the swap file or partition in memory. We started to ask why the disk is in a separate space from memory and why it is treated so differently. We came up with the following thought and would like to experiment with it and see what improvements can be made.
Most CPUs I have dealt with have very latency tolerant buses. They can make a memory request and then do other things while waiting for the memory to respond. With this flexibility we would like to create a main memory module that consists of a reasonably large chunk of DRAM, a memory controller that acts much like a cache controller, and a hard disk drive. The controller would receive a request from the CPU for what the CPU thinks is a physical address. The controller determines whether the associated data is in the DRAM or on the HDD, that is, whether the request is a hit or a miss. If it is a hit, the data is immediately returned. If it is a miss, the data is streamed from the HDD into the DRAM and then returned. Clearly the controller can remap the request address by having a lookup table that converts CPU address requests into the local location indicators. This flexibility can be useful in enhancing performance.
You might ask immediately how this is different from swapping. Well, in a system that swaps we have nearly no information about the memory usage patterns or frequency. We know about miss patterns because the OS gets involved in moving data from disk back into memory and statistics gathering does not impact performance significantly, but we do not have this luxury in dealing with memory hits. Studies have shown that adding instrumentation to an OS to collect some hit information degrades performance. However, using this information to improve memory layout, determine the best pages to swap, etc. can increase performance enough to overcome the penalty of collecting the necessary data. Our technique would allow us to collect nearly all hit information with little performance degradation and therefore should have better overall performance. In addition, this technique provides for very large and inexpensive main memories and, if inclusion is maintained, persistent storage. These are both useful characteristics.
This structure is simple to visualize, but difficult to implement in current systems. While CPUs are latency tolerant, the chipsets where memory controllers are implemented are not. They are tightly coupled to DDR or DDR2 timings and cannot be easily modified to meet our needs. We are currently creating a device using a PCI card with FPGAs, DRAM memory slots, and a SATA controller onboard. The intent is to make the memory over the PCI bus the only system memory, and to use the FPGA as our memory controller and the SATA attached disk as our larger main memory. Of course the PCI interface has higher latency and lower bandwidth than the original memory interface, but it gives us a baseline to work with that is much faster and more realistic than simulation.
We were intrigued with and hopeful about the systems mentioned in the previous emails because with the necessary specs we would simply create a new memory module that contained the components previously described.
I am sorry for the length of this email, but I wanted to try to be reasonably thorough. Please feel free to respond with questions, comments, criticisms, whatever. We are interested in any thoughts or suggestions you might have.
Thanks
Kelly
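The memory module described in the message above (DRAM in front of a disk, with a controller that remaps addresses and observes every hit) can be modelled in a few lines. This is a toy software sketch under stated assumptions, not the BYU design: the class name `HybridMemoryModule`, the 4 KiB page size, the LRU replacement policy, and write-back on eviction are all choices made here for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # ASSUMPTION: 4 KiB pages


class HybridMemoryModule:
    """Toy model of a DRAM cache in front of a disk-backed address space."""

    def __init__(self, dram_pages: int):
        self.dram_pages = dram_pages
        # Lookup table: CPU page number -> page held in DRAM, in LRU order.
        self.dram = OrderedDict()
        self.disk = {}        # CPU page number -> page contents on the HDD
        self.hits = 0
        self.misses = 0
        self.hit_counts = {}  # per-page hit statistics seen by the controller

    def _load(self, page: int) -> bytearray:
        if page in self.dram:  # hit: record it and return immediately
            self.hits += 1
            self.hit_counts[page] = self.hit_counts.get(page, 0) + 1
            self.dram.move_to_end(page)
            return self.dram[page]
        self.misses += 1       # miss: make room, then fetch from disk
        if len(self.dram) >= self.dram_pages:
            victim, data = self.dram.popitem(last=False)  # evict LRU page
            self.disk[victim] = bytes(data)               # write it back
        data = bytearray(self.disk.get(page, bytes(PAGE_SIZE)))
        self.dram[page] = data
        return data

    def read(self, addr: int) -> int:
        page, offset = divmod(addr, PAGE_SIZE)
        return self._load(page)[offset]

    def write(self, addr: int, value: int) -> None:
        page, offset = divmod(addr, PAGE_SIZE)
        self._load(page)[offset] = value


mem = HybridMemoryModule(dram_pages=2)
mem.write(0, 7)           # miss on page 0
mem.write(4096, 8)        # miss on page 1
mem.read(0)               # hit on page 0
mem.write(8192, 9)        # miss on page 2, evicts page 1 to disk
value = mem.read(4096)    # miss on page 1, fetched back from disk
print(value, mem.hits, mem.misses)  # 8 1 4
```

Because every access passes through `_load`, the model gathers hit statistics (`hit_counts`) without any operating system involvement, which is the property the message is after.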
On 2/9/06 5:07 PM, Blaine D. Gaither
Kelly,
I have asked about your request. Chris Wanner is a technical leader in our PC server division. As you can see from the attached discussion, it might be useful to engage him in a more detailed discussion on what you are trying to do.
Please let me know if I can help.
Sincerely,
Blaine
At 2/9/2006 12:20 PM Thursday, Wanner, Chris wrote:
Blaine,
The only ISS platforms that have hot swap memory modules on them are two products that are being EOL'ed (DL760G2, DL740) and our current DL580G3 platform. The hot plug memory interface on the first two products is HP-designed, but since those platforms are at the end of their life I would not suggest that anyone spend time designing to them. For the DL580G3 platform, the interface from the hot swap module to the main system is a proprietary Intel interconnect technology for which Intel has not provided the specs to HP. Thus, we could provide a platform but we would not be able to provide the information that BYU would need to design a new module for it.
If I understood more of what BYU is attempting to do I might be able to suggest alternatives.
Chris
From: Gaither, Blaine D Sent: Thursday, February 09, 2006 11:32 AM To: McClellan, Scott (BCS Hardware CTO); Wanner, Chris; Tom Christian, HP Subject: Request from BYU
Chris,
A researcher that we have worked with at BYU would like some help and I am wondering if ISS could help. His research involves backing memory directly with disk storage. He would like to use one of the ISS systems that has "hot swap" memory modules, and he and his students will build a cache and disk drive to replace the memory module. The researcher is Kelly Flanagan who, in addition to being a Professor of computer science, is VP of IT for BYU. He has the money to buy the system; what he needs are the interface specs for the hot swap memory module. BYU has done some pretty good cache research before. Both Amdahl and HP have given them grants in the past. Dr. Flanagan can be reached at 801-378-6474. Could you please advise me on who we might contact to explore this request?
Blaine Gaither
Blaine Gaither, Performance Architect HP +1.970.898.3858 Cell +1.970.222.3612
