Google’s TurboQuant Sends Memory Stocks Into a Global Selloff
A single research paper — compressing AI memory by a factor of six — wiped billions from chip giants on two continents. But Wall Street is urging investors to buy the dip.
On Tuesday, March 24, Google Research quietly published a blog post about a new compression algorithm. By Wednesday morning, it had knocked billions of dollars off the market capitalisations of memory chip makers across two continents. The algorithm is called TurboQuant, and it addresses one of the most expensive and persistent bottlenecks in artificial intelligence infrastructure: the Key-Value (KV) cache.
The KV cache is the AI system’s working memory — a high-speed data store that holds context from prior tokens so a model does not have to recompute everything from scratch with every new word it generates. As models handle longer documents, conversations, and multi-modal inputs, this cache grows rapidly, consuming GPU memory that could otherwise be used to serve more users or run more powerful models. TurboQuant, according to Google, compresses that cache to just 3 bits per value — down from the standard 16 — reducing its memory footprint by at least six times without any measurable loss in accuracy.
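To give a sense of scale, the short calculation below estimates the cache size for a single long sequence at 16 and at 3 bits per value. It is a back-of-the-envelope sketch, not Google's accounting: the model shape (32 layers, 8 key-value heads, head dimension 128) and the 128K-token context are assumptions chosen to resemble published 8B-class models, and the function name is hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits_per_value):
    """Bytes needed to cache keys and values for one sequence."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = keys + values
    return seq_len * values_per_token * bits_per_value / 8


# Assumed (illustrative) model shape and context length, not from the article.
shape = dict(seq_len=128 * 1024, n_layers=32, n_kv_heads=8, head_dim=128)

GIB = 2 ** 30
fp16 = kv_cache_bytes(bits_per_value=16, **shape)
three_bit = kv_cache_bytes(bits_per_value=3, **shape)

print(f"16-bit cache: {fp16 / GIB:.1f} GiB")
print(f" 3-bit cache: {three_bit / GIB:.1f} GiB")
print(f"bit-width ratio: {fp16 / three_bit:.2f}x")
```

Under these assumptions a single 128K-token sequence needs 16 GiB of cache at 16 bits and 3 GiB at 3 bits. The plain bit-width ratio is 16/3, or about 5.3; the "at least six times" figure is the one Google reports and is not derived by this sketch.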
The Market Reaction: A Two-Day Global Selloff
The immediate market response was swift and, in the view of several analysts, disproportionate. During Wednesday’s U.S. trading session, memory and storage stocks fell sharply — even as the broader Nasdaq 100 advanced. The declines extended into Thursday, rippling across Asian markets.
The declines dragged South Korea’s benchmark KOSPI index down by as much as 3% on Thursday, with SK Hynix and Samsung together among its largest weights. Japanese flash memory maker Kioxia Holdings fell by a similar margin in Tokyo. It was a rare moment of synchronised pressure across the global memory supply chain — caused not by an earnings miss or a supply disruption, but by a mathematics paper.
“This is Google’s DeepSeek moment.”
— Matthew Prince, CEO, Cloudflare
What TurboQuant Actually Does
The algorithm is the culmination of a multi-year research effort at Google. It builds on two earlier papers from the same team: QJL (Quantized Johnson-Lindenstrauss Transform), published at AAAI 2025, and PolarQuant, which will appear at AISTATS 2026 in Tangier, Morocco. TurboQuant itself is scheduled for presentation at ICLR 2026 in Rio de Janeiro, Brazil, in April. The paper was authored by Amir Zandieh, a research scientist at Google, and Vahab Mirrokni, a vice president and Google Fellow.
How TurboQuant Works: A Two-Stage Pipeline
- PolarQuant (Stage 1): Instead of storing data vectors in standard Cartesian coordinates (X, Y, Z), it converts them to polar coordinates — separating each vector into a magnitude and a set of angles. Google’s team found that these angular distributions are highly concentrated and predictable, eliminating the need to store the normalisation constants that traditional quantisation methods require. Most of the bit budget is spent capturing the primary signal.
- QJL (Stage 2): The Johnson-Lindenstrauss Transform then compresses the small residual error from Stage 1 down to a single sign bit (+1 or −1) per dimension. Because the sign bits need no accompanying quantisation constants, this step adds no memory overhead, completing the compression at a total of 3 bits per value.
- The result: No training or fine-tuning required. On NVIDIA H100 GPUs, 4-bit TurboQuant computes attention scores up to 8× faster than the unquantised 32-bit baseline. On long-context benchmarks including Needle in a Haystack, LongBench, ZeroSCROLLS, RULER, and L-Eval, the algorithm achieved perfect or near-perfect scores across open-source models including Llama-3.1-8B and Mistral-7B.
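The two stages can be illustrated with a toy example. This is a simplified sketch under stated assumptions, not Google's implementation: Stage 1 here quantises only the angle of each consecutive coordinate pair and keeps the radius as a full-precision float, and Stage 2 keeps the residual's norm as one extra float alongside one sign bit per dimension. All function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def polar_round_trip(x, y, angle_bits):
    """Stage 1 toy: keep the radius, snap the angle to 2**angle_bits levels."""
    radius = math.hypot(x, y)
    levels = 2 ** angle_bits
    step = 2 * math.pi / levels
    code = round(math.atan2(y, x) / step) % levels  # the integer that is stored
    theta = code * step
    return radius * math.cos(theta), radius * math.sin(theta)


def stage1_reconstruct(vec, angle_bits):
    """Quantise consecutive (x, y) pairs in polar form; len(vec) must be even."""
    out = []
    for i in range(0, len(vec), 2):
        out.extend(polar_round_trip(vec[i], vec[i + 1], angle_bits))
    return out


def sign_sketch(proj, vec):
    """Stage 2 toy: one sign bit per row of a Gaussian projection."""
    return [1 if dot(row, vec) >= 0 else -1 for row in proj]


def estimate_dot(proj, query, signs, norm):
    """Estimate <query, v> using only sign(proj @ v) and ||v||."""
    total = sum(dot(row, query) * s for row, s in zip(proj, signs))
    return math.sqrt(math.pi / 2) * norm * total / len(proj)


def two_stage_score(query, key, proj, angle_bits):
    """Coarse Stage 1 score plus a sign-bit correction for the residual."""
    coarse = stage1_reconstruct(key, angle_bits)
    residual = [k - c for k, c in zip(key, coarse)]
    signs = sign_sketch(proj, residual)
    res_norm = math.sqrt(dot(residual, residual))
    return dot(query, coarse) + estimate_dot(proj, query, signs, res_norm)


def random_unit(rng, dim):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


rng = random.Random(0)
dim = 128  # assumed head dimension; one projection row (sign bit) per dimension
proj = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]
key, query = random_unit(rng, dim), random_unit(rng, dim)

exact = dot(query, key)
approx = two_stage_score(query, key, proj, angle_bits=3)
print(f"exact={exact:.4f}  two-stage estimate={approx:.4f}")
```

The point of the design is that the query is never quantised: the attention score is the coarse dot product plus an unbiased correction computed from sign bits alone, so the error shrinks as the Stage 1 residual gets smaller.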
The key innovation is the elimination of “quantisation overhead.” Traditional compression methods reduce the size of data but must store additional constants — normalisation values needed to decompress accurately. These constants typically add one to two extra bits per number, partially undermining the headline compression ratio. TurboQuant avoids this entirely through its two-stage architecture, achieving its 3-bit target with no such overhead.
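The overhead described above is easy to quantify for a conventional block-wise scheme. The sketch below assumes each block of values stores one 16-bit scale and one 16-bit zero point; the block sizes and constant widths are illustrative assumptions, not figures from Google's paper.

```python
def effective_bits(payload_bits, block_size, constant_bits_per_block):
    """Average bits per value once per-block constants are counted."""
    return payload_bits + constant_bits_per_block / block_size


# Assumed layout: a 16-bit scale plus a 16-bit zero point per block.
CONSTANT_BITS = 16 + 16

for block_size in (32, 16):
    bits = effective_bits(3, block_size, CONSTANT_BITS)
    print(f"block of {block_size}: {bits:.1f} bits per value")
```

With these assumed layouts a nominal 3-bit format costs 4 bits per value at a block size of 32 and 5 bits at a block size of 16, which is the one to two extra bits the paragraph refers to.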
Beyond Language Models: Vector Search
Google emphasises that TurboQuant has a direct commercial application beyond language model inference. The algorithm improves vector search — the technology that powers semantic similarity lookups across billions of items. Modern search engines, recommendation systems, and advertising targeting increasingly rely on comparing the meanings of billions of high-dimensional vectors rather than just matching keywords.
Tested against existing state-of-the-art methods such as RaBitQ and Product Quantization on the GloVe benchmark dataset, TurboQuant achieved superior recall ratios without requiring the large codebooks or dataset-specific tuning that competing approaches demand. This is directly relevant to Google Search, YouTube recommendations, and Google’s advertising infrastructure, the systems behind Google’s primary revenue streams.
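Recall in this setting measures how many of the true nearest neighbours survive when the search runs on compressed vectors instead of the originals. A minimal sketch of one common definition, recall@k, follows; the scores are made-up illustrative numbers, and Google's benchmark may use a different variant of the metric.

```python
def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def recall_at_k(exact_scores, approx_scores, k):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    exact = set(top_k(exact_scores, k))
    approx = set(top_k(approx_scores, k))
    return len(exact & approx) / k


# Hypothetical similarity scores for six items against one query.
exact_scores = [0.91, 0.15, 0.78, 0.40, 0.83, 0.05]   # full-precision vectors
approx_scores = [0.88, 0.20, 0.70, 0.74, 0.85, 0.01]  # compressed vectors

print(f"recall@3 = {recall_at_k(exact_scores, approx_scores, k=3):.2f}")
```

Here the compressed search returns two of the three true neighbours, so recall@3 is about 0.67. A better quantiser distorts the scores less and pushes that figure toward 1.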
Wall Street’s Verdict: An Overreaction?
Several prominent analysts quickly pushed back on the severity of the market reaction, arguing that investors are misreading the technology’s scope.
“As context windows get bigger and bigger, the data storage in KV cache explodes higher, causing the need for more memory. TurboQuant is directly attacking the cost curve here. Bullish for the cost curve, again IF this gets adopted broadly.”
“Current inference models have long adopted 4-bit quantised data. Google’s claimed 8× performance boost is relative to older 32-bit models. These compression technologies are workarounds for compute bottlenecks and will not undermine resilient memory and flash demand over the next three to five years.”
“It’s like saying Aramco should crash because Toyota came out with a next-generation hybrid engine.”
Morgan Stanley analyst Shawn Kim invoked the Jevons Paradox — the economic principle that efficiency improvements often increase overall resource consumption, not decrease it. He argued that lower cost per AI inference token could unlock vast new tiers of AI deployment that were previously too expensive, ultimately expanding total memory demand rather than contracting it.
“A technology that reduces memory requirements by six times does not reduce spending by six times, because memory is only one component of a data centre.”
— The Next Web analysis
Important Caveats: What TurboQuant Does Not Do
Several critical limitations temper the more alarming market interpretations:
The Broader Context: A Pattern Investors Recognise
The episode echoes January 2025’s DeepSeek shock, when a Chinese AI lab released a highly efficient open-source model, briefly sending Nvidia and other AI hardware names sharply lower before the market recalibrated. In that case, as in this one, the initial sell-off reflected a genuine insight — that AI efficiency improvements are accelerating — but overstated the near-term demand destruction.
AI infrastructure spending remains at extraordinary levels. Meta alone committed up to $27 billion in a recent deal with Nebius for dedicated compute capacity. Google, Microsoft, and Amazon collectively plan hundreds of billions in data centre capital expenditure through 2026. A compression algorithm that reduces KV cache memory does not reduce the physical footprint of training clusters, networking, or storage for model weights — the components that dominate capex.
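The arithmetic behind that argument resembles Amdahl's law: shrinking one component only helps in proportion to its share of the total. The sketch below uses a purely hypothetical 20% share of spending attributable to KV cache memory; that number is an assumption for illustration, not a figure from the article or from any analyst.

```python
def overall_reduction(component_share, component_factor):
    """Factor by which total cost falls when one component shrinks."""
    remaining = (1 - component_share) + component_share / component_factor
    return 1 / remaining


# Hypothetical share: 20% of spending tied to KV cache memory.
share = 0.20
factor = overall_reduction(share, component_factor=6)
print(f"total cost falls by {factor:.2f}x, not 6x")
```

Under that assumed share, a sixfold cut in KV cache memory lowers total cost by a factor of 1.2. Only if the share approached 100% would the overall saving approach sixfold.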
What TurboQuant undeniably represents is a signal: the next chapter of AI efficiency will be won as much through mathematical elegance as through brute-force hardware scaling. For memory chipmakers, the implication is not necessarily less demand, but a shift in the composition of that demand — away from raw capacity toward high-bandwidth, high-performance products that can support faster, more intelligent inference at scale.
Google Research blog post:
https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
