What BEST Routing Actually Saves
Empirical speedups from hybrid EML/EDL/EXL dispatch. The numbers, honestly reported.
The EML operator family has three members:
- EML: exp(x) − ln(y)
- EDL: exp(x) / ln(y)
- EXL: exp(x) · ln(y)
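Written out as plain Python, these are a direct transcription of the definitions above (not the monogate implementation):

```python
import math

def EML(x, y):
    """exp(x) - ln(y)"""
    return math.exp(x) - math.log(y)

def EDL(x, y):
    """exp(x) / ln(y)"""
    return math.exp(x) / math.log(y)

def EXL(x, y):
    """exp(x) * ln(y)"""
    return math.exp(x) * math.log(y)
```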
Each generates all elementary functions, but they have different node counts for different primitives. BEST (Binary Expression Select & Transform) routing is the idea of dispatching each operation to whichever operator computes it cheapest.
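A minimal sketch of the dispatch idea. The names (`CHEAPEST`, `dispatch`) are illustrative, not the monogate API; only the routes named in this post are filled in (ln via EXL, division via EDL), with pure EML as the fallback:

```python
# Hypothetical routing table: primitive -> cheapest operator family member.
# Only the routes stated in the text are listed; everything else
# falls back to pure EML.
CHEAPEST = {
    "ln": "EXL",   # 1 node instead of 4
    "div": "EDL",  # 1 node instead of 15
}

def dispatch(primitive):
    """Pick the operator that computes `primitive` with the fewest nodes."""
    return CHEAPEST.get(primitive, "EML")
```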
Node count table
These are verified minimum node counts — the exact number of EML-family calls required:
| Function | Pure EML | BEST | Savings |
|---|---|---|---|
| exp(x) | 1 | 1 | — |
| ln(x) | 4 | 1 | −75% |
| x + y | 11 | 11 | — |
| x × y | 13 | 7 | −46% |
| x ÷ y | 15 | 1 | −93% |
| x² | 9 | 5 | −44% |
| sin(x) over ℂ | 108 | 1 | −99% |
The big wins are ln(x) (75% savings via EXL), division (93% savings via EDL), and complex sin (99% via CBEST). Addition has no savings — it genuinely costs 11 nodes no matter which operator you use.
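The savings column is just arithmetic on the two node counts; a quick sanity check of the table:

```python
def savings_pct(pure_eml, best):
    """Percent node-count reduction of BEST vs pure EML, rounded."""
    return round(100 * (pure_eml - best) / pure_eml)

# Rows from the node count table: (pure EML nodes, BEST nodes)
table = {"ln": (4, 1), "mul": (13, 7), "div": (15, 1),
         "square": (9, 5), "complex_sin": (108, 1)}
for name, (eml, best) in table.items():
    print(name, f"-{savings_pct(eml, best)}%")
```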
Wall-clock speedups
Node count savings translate to wall-clock speedups, but not uniformly. The crossover threshold is about 20% node savings — below that, the dispatch overhead exceeds the compute savings.
| Workload | Node savings | Speedup |
|---|---|---|
| sin/cos Taylor (TinyMLP) | 74% | 2.8× |
| Polynomial x⁴+x³+x² | 54% | 2.1× |
| GELU (Transformer FFN) | 18% | 0.93× |
GELU is a negative result, honestly reported: 18% node savings falls below the crossover threshold, and BEST routing is slower on GELU-heavy workloads. If you're building a transformer FFN replacement, BEST routing doesn't help.
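The practical consequence is to gate BEST routing behind the crossover. A hedged sketch, with illustrative names and the ~20% threshold from above hard-coded:

```python
CROSSOVER = 0.20  # below ~20% node savings, dispatch overhead wins

def worth_routing(node_savings):
    """True if the estimated node savings clears the dispatch-overhead crossover."""
    return node_savings >= CROSSOVER

# From the workload table: Taylor sin/cos (0.74) clears it, GELU (0.18) does not.
```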
Performance kernels
We also have a Rust-accelerated kernel and a fused Python kernel. These compound with BEST:
| Backend | ms/step | Speedup |
|---|---|---|
| Standard Python | 8.3 | 1× |
| FusedEMLActivation | 2.3 | 3.6× |
| Fused + torch.compile | 1.9 | 4.4× |
| Rust (monogate-core) | 1.4 | 5.9× |
Benchmark: EMLLayer depth=2, 256→256, batch=1024, CPU. The Rust kernel uses PyO3 bindings.
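For reproducing ms/step numbers like these, a median-of-N loop is less noisy than a single timed pass. A minimal harness sketch (not the benchmark script used above; `ms_per_step` is a hypothetical helper):

```python
import time

def ms_per_step(step_fn, steps=100):
    """Median wall-clock milliseconds per call of step_fn()."""
    samples = []
    for _ in range(steps):
        t0 = time.perf_counter()
        step_fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return samples[len(samples) // 2]

# Usage: ms_per_step(lambda: layer(batch)) for each backend in the table.
```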
What BEST routing is not
BEST routing is a node-count optimization within the EML framework. It doesn't make EML competitive with purpose-built implementations of individual functions. A native sin(x) in hardware is faster than any EML tree, even a 1-node complex CBEST.
What BEST routing does: reduces the cost of expressing arbitrary combinations of elementary functions as EML trees, when you're already committed to the EML framework for other reasons (symbolic manipulation, gradient-free search, interpretability).
Install: pip install monogate · Rust source: monogate-core/