Nvidia and IBM: GPUs will connect to SSDs directly to boost performance
Bypassing the CPU: Nvidia and IBM are pushing GPUs to connect directly to SSDs to boost performance.
Working with several universities, Nvidia and IBM have created a new architecture aimed at accelerating applications on GPUs, providing fast “fine-grained access” to massive data stores.
The so-called “Big Accelerator Memory” (BaM) is designed to expand GPU memory capacity, effectively increase storage access bandwidth, and give GPU threads a high-level abstraction for easy, on-demand, fine-grained access to massive data sets across the expanded memory hierarchy.
An example of a traditional CPU-centric model
This technology stands to benefit areas such as artificial intelligence, analytics, and machine learning training in particular. As the heavyweight of the BaM team, Nvidia is contributing extensive resources to the project, for example by enabling its GPUs to fetch data directly, without relying on the CPU to perform virtual address translation, page-based on-demand data loading, and other bulk data-management tasks between memory and storage.
For end users, BaM offers two major advantages. The first is a software-managed GPU cache: data transfer between storage and the graphics card is managed by threads running on the GPU's cores. Using RDMA, the PCI Express interface, and custom Linux kernel drivers, BaM lets the GPU read and write SSD data directly.
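The software-managed cache idea can be illustrated with a short sketch. This is plain host-side Python, not real GPU code, and every name in it is invented for illustration: each "GPU thread" consults a cache that software (not hardware or the CPU) manages, and on a miss fetches the block from storage itself.

```python
# Hypothetical sketch of BaM's software-managed cache concept (not real GPU
# code): a thread checks the cache, and on a miss reads the block directly
# from storage rather than asking the CPU to page it in.

class SoftwareManagedCache:
    def __init__(self, ssd_read, capacity=4):
        self.ssd_read = ssd_read        # direct-to-storage read path (no CPU in the loop)
        self.capacity = capacity        # number of blocks the cache can hold
        self.blocks = {}                # block_id -> cached data
        self.order = []                 # simple FIFO eviction order

    def access(self, block_id):
        if block_id in self.blocks:     # cache hit: no storage traffic at all
            return self.blocks[block_id]
        data = self.ssd_read(block_id)  # cache miss: thread fetches the block itself
        if len(self.blocks) >= self.capacity:
            evicted = self.order.pop(0) # evict the oldest cached block
            del self.blocks[evicted]
        self.blocks[block_id] = data
        self.order.append(block_id)
        return data

# Simulated SSD: block i holds the bytes b"block-i".
cache = SoftwareManagedCache(lambda i: f"block-{i}".encode())
print(cache.access(7))   # miss: fetched from the simulated SSD
print(cache.access(7))   # hit: served from the software-managed cache
```

The real design keeps this cache and its metadata in GPU memory so that hit handling never leaves the GPU; the sketch only mirrors the control flow.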
BaM model example
Second, BaM opens up NVMe SSD command submission to the GPU: a GPU thread prepares and issues a driver command only when the data it needs is absent from the software-managed cache. On this basis, algorithms running heavy workloads on the GPU can tailor their access routines to specific data, accessing important information efficiently.
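One way such an algorithm might tailor its access routine, offered here purely as a hypothetical example (the function name and layout are invented, not taken from the BaM paper): when the data layout is known, scattered block requests can be coalesced into fewer, larger storage commands, cutting per-command overhead.

```python
# Hypothetical access-routine optimization: merge adjacent block requests
# into contiguous (start, count) ranges so each range becomes one command.

def coalesce(block_ids):
    """Merge block ids into sorted (start_block, block_count) ranges."""
    ranges = []
    for b in sorted(set(block_ids)):
        if ranges and b == ranges[-1][0] + ranges[-1][1]:
            start, count = ranges[-1]
            ranges[-1] = (start, count + 1)   # extend the current contiguous run
        else:
            ranges.append((b, 1))             # start a new run
    return ranges

# Nine scattered block requests collapse into three storage commands.
print(coalesce([4, 5, 6, 10, 11, 20, 7, 21, 22]))
# -> [(4, 4), (10, 2), (20, 3)]
```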
By contrast, a CPU-centric strategy incurs excessive CPU-GPU synchronization overhead and I/O traffic amplification, which penalizes emerging applications with fine-grained, data-dependent access patterns, such as graph and data analytics, recommender systems, and graph neural networks, where storage and network bandwidth efficiency are critical.
To this end, the researchers provide a user-level library built on highly concurrent NVMe submission/completion queues placed in the GPU memory of the BaM model, so that GPU threads that miss in the software cache can access storage efficiently and at high throughput.
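The queue mechanism can be sketched as follows. This is a Python simulation of the control flow only, with invented names throughout: on a cache miss, a "GPU thread" writes an NVMe-style read command into a submission queue, and a simulated "SSD" consumes it and posts an entry to the paired completion queue, which the thread then polls.

```python
from collections import deque

# Hypothetical sketch of per-queue-pair NVMe-style submission/completion,
# with the SSD simulated in Python. All names are invented for illustration.

class NvmeQueuePair:
    def __init__(self):
        self.sq = deque()   # submission queue: commands written by GPU threads
        self.cq = deque()   # completion queue: results posted by the SSD

    def submit_read(self, lba, tag):
        # A thread that missed the software cache issues a read command itself.
        self.sq.append({"op": "read", "lba": lba, "tag": tag})

    def device_step(self):
        # Simulated SSD: consume one command and post its completion entry.
        cmd = self.sq.popleft()
        self.cq.append({"tag": cmd["tag"], "data": f"lba-{cmd['lba']}".encode()})

    def poll(self, tag):
        # The issuing thread polls the completion queue for its tag.
        for entry in self.cq:
            if entry["tag"] == tag:
                self.cq.remove(entry)
                return entry["data"]
        return None

qp = NvmeQueuePair()
qp.submit_read(lba=128, tag=1)   # cache miss: thread submits the command
qp.device_step()                 # SSD services it
print(qp.poll(tag=1))            # thread's poll retrieves the data
```

In the actual design, many such queue pairs live in GPU memory and are worked concurrently by thousands of threads, which is where the high throughput comes from; the sketch shows only a single pair.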
Logical view of BaM design
Even better, this scheme has extremely low software overhead per storage access and supports highly concurrent threads.
Experiments on a Linux prototype platform combining the BaM design with a standard GPU and NVMe SSDs delivered encouraging results.
As a viable alternative to today's CPU-driven solutions, the studies show that BaM lets storage accesses proceed concurrently, removes synchronization constraints, and markedly improves I/O bandwidth efficiency, boosting application performance.
In addition, NVIDIA Chief Scientist Bill Dally, who formerly chaired Stanford University's computer science department, pointed out that thanks to software caching, BaM does not rely on virtual memory address translation and is therefore inherently immune to serialization events such as TLB misses.
Finally, the partners plan to open source the details of the BaM design, in the hope that more companies will invest in the relevant software and hardware optimizations and create similar designs of their own.
Interestingly, AMD's Radeon solid-state graphics (SSG) cards, which place flash memory alongside the GPU, follow a similar design philosophy.