MicroGPT Benchmarks: Proven Ways to Boost Inference Speed
If you think your high-level machine learning framework is always the fastest way to run a model, you’re likely leaving massive performance on the table. For a tiny, 4,192-parameter transformer like Karpathy’s microGPT, the results are counterintuitive: a single P-core on an M4 Max MacBook Pro running hand-tuned C doesn’t just beat an FPGA, it crushes it by roughly 71x.
Most developers assume that using a framework like MLX or NumPy is the "optimized" path. But for models this small, the bottleneck isn't the arithmetic; it’s the dispatch overhead. When you call a function in NumPy or launch a kernel in MLX, the system spends more time on shape checks, dtype dispatch, and kernel launches than it does on the actual multiply-accumulate operations.
Here is the reality of the performance landscape for this specific workload, from slowest to fastest:
- MLX (GPU): ~3,337 tokens/sec.
- Pure Python: ~7,430 tokens/sec.
- NumPy (fp32): ~40,244 tokens/sec.
- TALOS-V2 (FPGA): 53,000 tokens/sec.
- C (NEON intrinsics): ~3,756,165 tokens/sec.
Why does the FPGA beat NumPy but lose to C? The FPGA is purpose-built for the specific data flow of the model, avoiding the "framework tax." However, the M4 Max is so fast that once you strip away the abstraction layers and write raw C with NEON intrinsics, you’re essentially running at the speed of the silicon itself. The model fits entirely within the L1 cache, meaning the CPU never has to wait for memory. It’s just pure instruction throughput.
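To make that concrete, here is a minimal sketch of the kind of NEON kernel a forward pass this small boils down to: an fp32 matrix-vector product. The dimensions, the row-major layout, and the name `matvec_neon` are illustrative assumptions for this sketch, not the benchmark's actual code.

```c
// Sketch only: an fp32 matrix-vector product written with NEON intrinsics.
// D_IN, D_OUT, and the row-major layout are illustrative assumptions,
// not the benchmark's actual dimensions.
#include <arm_neon.h>
#include <stdio.h>

#define D_IN  16
#define D_OUT 16

// y = W * x, with W stored row-major. At this size, W, x, and y all live in
// L1 cache, so the inner loop is limited by FMA throughput, not memory.
static void matvec_neon(const float *W, const float *x, float *y) {
    for (int row = 0; row < D_OUT; row++) {
        const float *w = W + row * D_IN;
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (int col = 0; col < D_IN; col += 4) {
            // One fused multiply-accumulate handles 4 floats.
            acc = vfmaq_f32(acc, vld1q_f32(w + col), vld1q_f32(x + col));
        }
        y[row] = vaddvq_f32(acc);  // horizontal sum of the 4 lanes
    }
}

int main(void) {
    float W[D_OUT * D_IN], x[D_IN], y[D_OUT];
    for (int i = 0; i < D_OUT * D_IN; i++) W[i] = 0.01f * i;
    for (int i = 0; i < D_IN; i++)         x[i] = 1.0f;
    matvec_neon(W, x, y);
    printf("y[0] = %f\n", y[0]);  // sanity check on the first output row
    return 0;
}
```

Compiled with `-O2` on Apple Silicon, a loop like this issues back-to-back FMAs against L1-resident data, which is where the multi-million tokens/sec figure comes from.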
Here’s where most people get tripped up: they assume that because a GPU is "faster" for large models, it must be faster for everything. That’s a dangerous assumption. MLX-on-GPU is actually the slowest implementation here because the Metal kernel launch overhead alone is measured in tens of microseconds. When your entire forward pass requires only about 4,000 multiply-accumulates, a 10-microsecond launch is an eternity: at ~3.76 million tokens/sec, the C path spends roughly 0.27 microseconds per token, so a single kernel launch costs more than 35 entire forward passes. You are essentially paying a massive tax to move data to a processor that doesn't have enough work to justify the trip.
If you want to see this in action, you can clone the benchmark repository and run it on your own Apple Silicon machine. You’ll see that the C implementation uses roughly 1.4% of a single core to match the FPGA’s throughput.
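If you want a rough sense of how throughput like this gets measured, a timing harness doesn't need to be fancier than the sketch below. Here `forward_token` is a hypothetical placeholder for the model's per-token forward pass, not a function from the benchmark repository; swap in the real thing and the tokens/sec math stays the same.

```c
// Sketch only: a minimal throughput harness for a per-token forward pass.
#include <stdio.h>
#include <time.h>

static int forward_token(int token) {
    // Hypothetical placeholder work; replace with the real forward pass.
    return (int)(((unsigned)token * 1103515245u + 12345u) & 0x7fffffffu);
}

int main(void) {
    const long iters = 10 * 1000 * 1000;
    struct timespec t0, t1;
    int token = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        token = forward_token(token);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("final token %d, %.0f tokens/sec\n", token, iters / secs);
    return 0;
}
```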
Does this mean FPGAs are obsolete? Not at all. The FPGA still wins on deterministic latency and power draw in embedded form factors. But for general-purpose computing, the "overhead" of modern software stacks is the real enemy. If you are building low-latency inference engines, stop relying on heavy frameworks for tiny models. Write the kernel yourself.
Are you over-engineering your inference pipeline by using frameworks where simple C would suffice? Try this today and share what you find in the comments. If you're interested in how these optimizations scale, read our breakdown of efficient transformer implementation strategies next.