Etched is Making the Biggest Bet in AI
In 2022, we made a bet that transformers would take over the world.
We’ve spent the past two years building Sohu, the world’s first specialized chip (ASIC) for transformers (the “T” in ChatGPT).
By burning the transformer architecture into our chip, we can’t run most traditional AI models: the DLRMs powering Instagram ads, protein-folding models like AlphaFold 2, or older image models like Stable Diffusion 2. We can’t run CNNs, RNNs, or LSTMs either.
But for transformers, Sohu is the fastest chip of all time. It’s not even close.
With over 500,000 tokens per second in Llama 70B throughput, Sohu lets you build products impossible on GPUs. Sohu is an order of magnitude faster and cheaper than even NVIDIA’s next-generation Blackwell (B200) GPUs.
Today, every state-of-the-art AI model is a transformer: ChatGPT, Sora, Gemini, Stable Diffusion 3, and more. If transformers are replaced by SSMs, RWKV, or any new architecture, our chips will be useless.
But if we’re right, Sohu will change the world. Here’s why we took this bet.
Scale is all you need for superintelligence
In five years, AI models became smarter than humans on most standardized tests.
How? Because Meta used 50,000x more compute to train Llama 400B (2024 SoTA, smarter than most humans) than OpenAI used on GPT-2 (2019 SoTA).
AI models get smarter when fed more compute and better data. Scale is the only trick that’s continued to work for decades, and every large AI company (Google, OpenAI / Microsoft, Anthropic / Amazon, etc.) is spending more than $100 billion over the next few years to keep scaling. We are living in the largest infrastructure buildout of all time.
Scaling the next 1,000x will be very expensive. The next-generation data centers will cost more than the GDP of a small nation. At the current pace, our hardware, our power grids, and our pocketbooks can’t keep up.
GPUs are hitting a wall
Santa Clara’s dirty little secret is that GPUs haven’t gotten better, they’ve gotten bigger. The compute (TFLOPS) per area of the chip has been nearly flat for four years.
With Moore’s law slowing, the only way to improve performance is to specialize.
Specialized chips are inevitable
Before transformers took over the world, many companies built flexible AI chips and GPUs to handle hundreds of different architectures. To name a few:
- NVIDIA’s GPUs
- Google’s TPUs
- Amazon’s Trainium
- AMD’s accelerators
- Graphcore’s IPUs
- SambaNova SN Series
No one has ever built an algorithm-specific AI chip (ASIC). Chip projects cost $50-100M and take years to bring to production. When we started, there was no market.
Suddenly, that’s changed:
- Unprecedented Demand: Before ChatGPT, the market for transformer inference was ~$50M, and now it’s billions. All big tech companies use transformer models (OpenAI, Google, Amazon, Microsoft, Facebook, etc.).
- Convergence on Architecture: AI models used to change a lot. But since GPT-2, state-of-the-art model architectures have remained nearly identical! OpenAI’s GPT-family, Google’s PaLM, Facebook’s LLaMa, and even Tesla FSD are all transformers.
When models cost $1B+ to train and $10B+ for inference, specialized chips are inevitable. At this scale, a 1% improvement would justify a $50-100M custom chip project.
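The break-even claim above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the lower-bound figures quoted in the text ($1B training, $10B inference, a $50-100M chip project); the variable names are illustrative, and real budgets would of course differ.

```python
# Lower-bound figures quoted in the text.
training_cost = 1e9      # "$1B+ to train"
inference_cost = 10e9    # "$10B+ for inference"
improvement = 0.01       # a 1% efficiency improvement

# Dollars saved if the whole spend becomes 1% cheaper.
savings = improvement * (training_cost + inference_cost)

# Quoted range for a custom chip project.
chip_project_low, chip_project_high = 50e6, 100e6

print(f"Savings from a 1% improvement: ${savings / 1e6:.0f}M")
print(f"Covers the top of the project cost range: {savings >= chip_project_high}")
```

Even at these lower bounds, the savings exceed the upper end of the quoted project cost.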
In reality, ASICs are orders of magnitude faster than GPUs. When bitcoin miners hit the market in 2014, it became cheaper to throw out GPUs than to use them to mine bitcoin.
With billions of dollars on the line, the same will happen for AI.
Transformers are shockingly similar: tweaks like SwiGLU activations and RoPE encodings are used everywhere: LLMs, embedding models, image inpainting, and video generation.
GPT-2 and Llama-3 are state-of-the-art (SoTA) models released five years apart, yet they have nearly identical architectures. The only major difference is scale.
Transformers have a huge moat
We believe in the hardware lottery: the models that win are the ones that can run the fastest and cheapest on hardware. Transformers are powerful, useful, and profitable enough to dominate every major AI compute market before alternatives are ready:
- Transformers power every large AI product: from agents to search to chat. AI labs have spent hundreds of millions of dollars in R&D to optimize GPUs for transformers. The current and next-generation state-of-the-art models are transformers.
- As models scale from $1B to $10B to $100B training runs in the next few years, the risk of testing new architectures skyrockets. Instead of re-testing scaling laws and performance, time is better spent building features on top of transformers, such as multi-token prediction.
- Today’s software stack is optimized for transformers. Every popular library (TensorRT-LLM, vLLM, Huggingface TGI, etc.) has special kernels for running transformer models on GPUs. Many features built on top of transformers aren’t easily supported in alternatives (e.g., speculative decoding, tree search).
- Tomorrow’s hardware stack will be optimized for transformers. NVIDIA’s GB200s have special support for transformers (TransformerEngine). ASICs like Sohu entering the market mark the point of no return. Transformer killers will need to run on GPUs faster than transformers run on Sohu. If that happens, we’ll build an ASIC for that too!
Meet Sohu
Sohu is the world’s first transformer ASIC. One 8xSohu server replaces 160 H100 GPUs.
By specializing, Sohu gets unprecedented performance. One 8xSohu server can serve over 500,000 Llama 70B tokens per second.
Sohu only supports transformer inference, whether Llama or Stable Diffusion 3. Sohu supports all of today’s models (Google, Meta, Microsoft, OpenAI, Anthropic, etc.) and can handle tweaks to future models.
Because Sohu can only run one algorithm, the vast majority of control flow logic can be removed, allowing it to have many more math blocks. As a result, Sohu boasts over 90% FLOPS utilization (compared to ~30% on a GPU [7] with TRT-LLM).
How can we fit so much more FLOPS on our chip than GPUs?
The NVIDIA H200 has 989 TFLOPS of FP16/BF16 compute without sparsity [9]. This is state-of-the-art (more than even Google’s new Trillium chip), and the GB200 launching in 2025 has only 25% more compute (1,250 TFLOPS per die [10]).
Since the vast majority of a GPU’s area is devoted to programmability, specializing on transformers lets you fit far more compute. You can prove this to yourself from first principles:
It takes 10,000 transistors to build a single FP16/BF16/FP8 multiply-add circuit, the building block for all matrix math. The H100 SXM has 528 tensor cores, and each has 4 × 8 × 16 = 512 FMA circuits [11]. Multiplying these together (528 × 512 × 10,000) tells us the H100 has about 2.7 billion transistors dedicated to tensor cores.
But an H100 has 80 billion transistors [12]! This means only about 3.4% of the transistors on an H100 GPU are used for matrix multiplication!
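The first-principles estimate above is easy to reproduce. The sketch below simply multiplies the figures quoted in the text; the constant names are illustrative, and the 10,000-transistors-per-FMA figure is the text's own approximation rather than a measured value.

```python
# Figures quoted in the text.
TRANSISTORS_PER_FMA = 10_000          # one FP16/BF16/FP8 multiply-add circuit
TENSOR_CORES = 528                    # H100 SXM
FMAS_PER_TENSOR_CORE = 4 * 8 * 16     # 512 FMA circuits per tensor core
H100_TOTAL_TRANSISTORS = 80e9

# Transistors spent on matrix-multiply hardware.
tensor_core_transistors = TENSOR_CORES * FMAS_PER_TENSOR_CORE * TRANSISTORS_PER_FMA

# Share of the whole chip.
fraction = tensor_core_transistors / H100_TOTAL_TRANSISTORS

print(f"Tensor-core transistors: {tensor_core_transistors / 1e9:.2f} billion")
print(f"Share of all transistors: {fraction:.1%}")
```

Under these assumptions, only a few percent of the chip is doing matrix math.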
This is a deliberate design decision by NVIDIA and other makers of flexible AI chips. If you want to support all kinds of models (CNNs, LSTMs, SSMs, and others), you can’t do much better than this.
By only running transformers, we can fit way more FLOPS on our chip without resorting to lower precisions or sparsity.
Isn’t inference bottlenecked on memory bandwidth, not compute?
Actually, for modern models like Llama-3, no!
Let’s use NVIDIA and AMD’s standard benchmark [13]: 2048 input tokens and 128 output tokens. Most AI products have much longer prompts than completions (even a new Claude chat has 1,000+ tokens in the system prompt).
On GPUs and on Sohu, inference is run in batches. Each batch loads all of the model weights once, and re-uses them across every token in the batch. Generally, LLM inputs are compute-bound, and LLM outputs are memory-bound. When we combine input and output tokens with continuous batching, the workload becomes heavily compute-bound [14].
Below is an example of continuous batching for an LLM. Here we are running sequences with four input tokens and four output tokens; each color is a different sequence.
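The batching pattern described here can be sketched as a toy scheduler. This is a simplified illustration, not how any particular serving stack implements it: the function name is hypothetical, it assumes exactly one new sequence is admitted per step, and it assumes the step that processes a prompt also produces that sequence's first output token.

```python
from collections import deque

def continuous_batching(num_sequences, input_len, output_len):
    """Yield one batch per step as a list of (sequence_id, kind, num_tokens)."""
    waiting = deque(range(num_sequences))
    decoding = {}  # sequence_id -> output tokens still to generate
    while waiting or decoding:
        batch = []
        # Every in-flight sequence contributes one output (decode) token.
        for seq_id in list(decoding):
            batch.append((seq_id, "decode", 1))
            decoding[seq_id] -= 1
            if decoding[seq_id] == 0:
                del decoding[seq_id]
        # Admit one new sequence; its whole prompt is processed in this batch
        # (assumed to also produce its first output token).
        if waiting:
            seq_id = waiting.popleft()
            batch.append((seq_id, "prefill", input_len))
            if output_len > 1:
                decoding[seq_id] = output_len - 1
        yield batch

# Four input tokens and four output tokens per sequence, as in the example.
for step, batch in enumerate(continuous_batching(6, 4, 4)):
    total = sum(n for _, _, n in batch)
    print(f"step {step}: {total} tokens -> {batch}")
```

Once the pipeline fills, every batch mixes one sequence's input tokens with one output token from each sequence still generating, so the model weights loaded for that batch are shared across all of them.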
We can scale up the same trick to run Llama-3-70B with 2048 input tokens and 128 output tokens. Have each batch consist of 2048 input tokens for one sequence, and 127 output tokens for 127 different sequences.
If we do this, each batch will require about (2048 + 127) × 70B params × 2 FLOPs per param = 304 TFLOPs, while only needing to load 70B params × 2 bytes per param = 140 GB of model weights and about 127 × 64 × 8 × 128 × (2048 + 127) × 2 × 2 = 72 GB of KV cache. That’s far more compute than memory bandwidth: an H200 would need 6.8 PFLOPS of compute in order to max out its memory bandwidth. And that’s at 100% utilization - if utilization were 30%, you’d need 3x more.
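The arithmetic above can be reproduced in a few lines. This is a rough sketch: the KV cache factors are taken exactly as written in the text, two FLOPs (one multiply, one add) are counted per parameter per token, and the H200 memory bandwidth of 4.8 TB/s is an assumption that does not appear in the text.

```python
PARAMS = 70e9                 # Llama-3-70B
INPUT_TOKENS = 2048           # one sequence's prompt
DECODE_SEQUENCES = 127        # one output token each
tokens_per_batch = INPUT_TOKENS + DECODE_SEQUENCES

# Compute: 2 FLOPs (multiply + add) per parameter per token.
flops_per_batch = tokens_per_batch * PARAMS * 2

# Memory traffic: weights at 2 bytes per parameter, plus the KV cache
# (factors as written in the text).
weight_bytes = PARAMS * 2
kv_cache_bytes = 127 * 64 * 8 * 128 * tokens_per_batch * 2 * 2

# Assumption, not from the text: H200 memory bandwidth of 4.8 TB/s.
H200_BANDWIDTH_BYTES_PER_S = 4.8e12
load_time_s = (weight_bytes + kv_cache_bytes) / H200_BANDWIDTH_BYTES_PER_S

# Compute throughput needed to finish the math in the time it takes to load.
required_flops_per_s = flops_per_batch / load_time_s

print(f"Compute per batch: {flops_per_batch / 1e12:.1f} TFLOPs")
print(f"Weights: {weight_bytes / 1e9:.0f} GB, KV cache: {kv_cache_bytes / 1e9:.1f} GB")
print(f"Compute needed to saturate bandwidth: {required_flops_per_s / 1e15:.1f} PFLOPS")
```

With that bandwidth assumption, the result lands close to the 6.8 PFLOPS figure quoted above, several times more than the 989 TFLOPS the chip actually provides.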
Since Sohu has so much compute with very high utilization, we can run enormous throughputs without bottlenecking on memory bandwidth.
How does software work?
On GPUs and TPUs, software is a nightmare. Handling arbitrary CUDA and PyTorch code requires an incredibly complicated compiler. Third-party AI chips (AMD, Intel, AWS, etc.) have together spent billions on software to little avail.
But since Sohu only runs transformers, we only need to write software for transformers!
Most companies running open-source or internal models use a transformer-specific inference library like TensorRT-LLM, vLLM, or HuggingFace’s TGI. These frameworks are very rigid - while you can tweak model hyperparameters, changing the underlying model code is not really supported. But this is fine - since all transformer models are so similar (even text/image/video ones), tweaking the hyperparameters is all you really need.
While this supports 95% of AI companies, the largest AI labs go custom. They have teams of engineers hand-tuning GPU kernels to squeeze out slightly more utilization, reverse engineering things like which registers have the lowest latency to each tensor core.
With Etched, you don’t need to reverse engineer anymore - our software, from drivers to kernels to the serving stack, will be open source. If you want to implement a custom transformer layer, your kernel wizards are free to do so.
Etched will be first
If this bet seems crazy now, imagine making it in 2022. When we started, ChatGPT didn’t exist! Image and video generation models were U-Nets, self-driving cars were powered by CNNs, and the transformer architecture was far from ubiquitous.
Luckily, the landscape has shifted in our favor. The top models in every domain—from language to vision—are now transformers. This convergence has not just validated the bet, but is also making Sohu the most important hardware project of the decade.
We’re on track for one of the fastest chip launches in history:
- Top AI researchers and hardware engineers left every major AI chip project to join us
- We’ve partnered directly with TSMC on their 4nm process. We’ve secured enough HBM and server supply from top vendors to quickly ramp our first year of production
- Our early customers have reserved tens of millions of dollars of our hardware
If we’re right, Sohu will change the world
What would happen if AI models get 20x faster and cheaper overnight?
Today, Gemini takes over 60 seconds to answer a question about a video [16]. Coding agents cost more than software engineers and take hours to finish tasks. Video models generate one frame per second [17], and even OpenAI ran out of GPU capacity when ChatGPT signups hit 10M users, just 0.15% of the world.
We aren’t on track to fix it - even if we do keep making GPUs bigger, at a rate of 2.5x every two years, it will take a decade to make video generation real-time.
But with Sohu, it’ll be instant. What happens when real-time video, calls, agents, and search finally just work?
Soon, you can find out. Please apply for early access to the Sohu HDK here.
And if you're excited about solving the compute crunch, we'd love to meet you. This is the most important problem of our time. Please apply for one of our open roles here.
- [1] https://www.youtube.com/watch?v=SAu2jgAF1i8&t=1125s
- [2] https://www.youtube.com/watch?v=pGrJJnpjAFg&t=204s
- [3] https://arxiv.org/abs/2303.08774
- [4] https://www.youtube.com/watch?v=9TU0XjJqpOg
- [5] https://www.youtube.com/watch?v=Y2N-uhZFPIM
- [6] https://www.youtube.com/watch?v=qTogNUV3CAI
- [7] https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-overview.md
- [8] https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-overview.md
- [9] https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet
- [10] https://developer.nvidia.com/blog/nvidia-gb200-nvl72-delivers-trillion-parameter-llm-training-and-real-time-inference/
- [11] https://resources.nvidia.com/en-us-tensor-core
- [12] https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
- [13] https://developer.nvidia.com/blog/achieving-top-inference-performance-with-the-nvidia-h100-tensor-core-gpu-and-nvidia-tensorrt-llm/
- [14] https://www.anyscale.com/blog/continuous-batching-llm-inference
- [15] https://en.wikipedia.org/wiki/Poisson_distribution
- [16] https://www.youtube.com/watch?v=wa0MT8OwHuk
- [17] https://lumalabs.ai/dream-machine