What BEST Routing Actually Saves
Empirical speedups from hybrid EML/EDL/EXL dispatch. The numbers, honestly reported.
The EML operator family has three members:
- EML: exp(x) − ln(y)
- EDL: exp(x) / ln(y)
- EXL: exp(x) · ln(y)
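Written out as plain Python, these are a direct transcription of the definitions above (not the monogate implementation):

```python
import math

def EML(x, y):
    """exp(x) - ln(y)"""
    return math.exp(x) - math.log(y)

def EDL(x, y):
    """exp(x) / ln(y)"""
    return math.exp(x) / math.log(y)

def EXL(x, y):
    """exp(x) * ln(y)"""
    return math.exp(x) * math.log(y)
```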
Each generates all elementary functions, but they have different node counts for different primitives. BEST (Binary Expression Select & Transform) routing is the idea of dispatching each operation to whichever operator computes it cheapest.
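A minimal sketch of the dispatch idea. The names (`CHEAPEST`, `dispatch`) are illustrative, not the monogate API; only the routes named in this post are filled in (ln via EXL, division via EDL), with pure EML as the fallback:

```python
# Hypothetical routing table: primitive -> cheapest operator family member.
# Only the routes stated in the text are listed; everything else
# falls back to pure EML.
CHEAPEST = {
    "ln": "EXL",   # 1 node instead of 4
    "div": "EDL",  # 1 node instead of 15
}

def dispatch(primitive):
    """Pick the operator that computes `primitive` with the fewest nodes."""
    return CHEAPEST.get(primitive, "EML")
```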
Node count table
These are verified minimum node counts — the exact number of EML-family calls required:
| Function | Pure EML | BEST | Savings |
|---|---|---|---|
| exp(x) | 1 | 1 | — |
| ln(x) | 4 | 1 | −75% |
| x + y | 11 | 11 | — |
| x × y | 13 | 7 | −46% |
| x ÷ y | 15 | 1 | −93% |
| x² | 9 | 5 | −44% |
| sin(x) over ℂ | 108 | 1 | −99% |
The big wins are ln(x) (75% savings via EXL), division (93% savings via EDL), and complex sin (99% via CBEST). Addition has no savings — it genuinely costs 11 nodes no matter which operator you use.
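The savings column is just arithmetic on the two node counts; a quick sanity check of the table:

```python
def savings_pct(pure_eml, best):
    """Percent node-count reduction of BEST vs pure EML, rounded."""
    return round(100 * (pure_eml - best) / pure_eml)

# Rows from the node count table: (pure EML nodes, BEST nodes)
table = {"ln": (4, 1), "mul": (13, 7), "div": (15, 1),
         "square": (9, 5), "complex_sin": (108, 1)}
for name, (eml, best) in table.items():
    print(name, f"-{savings_pct(eml, best)}%")
```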
Wall-clock speedups
Node count savings translate to wall-clock speedups, but not uniformly. The crossover threshold is about 20% node savings — below that, the dispatch overhead exceeds the compute savings.
| Workload | Node savings | Speedup |
|---|---|---|
| sin/cos Taylor (TinyMLP) | 74% | 2.8× |
| Polynomial x⁴+x³+x² | 54% | 2.1× |
| GELU (Transformer FFN) | 18% | 0.93× |
GELU is a negative result, honestly reported: 18% node savings falls below the crossover threshold, and BEST routing is slower on GELU-heavy workloads. If you're building a transformer FFN replacement, BEST routing doesn't help.
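The practical consequence is to gate BEST routing behind the crossover. A hedged sketch, with illustrative names and the ~20% threshold from above hard-coded:

```python
CROSSOVER = 0.20  # below ~20% node savings, dispatch overhead wins

def worth_routing(node_savings):
    """True if the estimated node savings clears the dispatch-overhead crossover."""
    return node_savings >= CROSSOVER

# From the workload table: Taylor sin/cos (0.74) clears it, GELU (0.18) does not.
```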
Performance kernels
We also have a Rust-accelerated kernel and a fused Python kernel. These compound with BEST:
| Backend | ms/step | Speedup |
|---|---|---|
| Standard Python | 8.3 | 1× |
| FusedEMLActivation | 2.3 | 3.6× |
| Fused + torch.compile | 1.9 | 4.4× |
| Rust (monogate-core) | 1.4 | 5.9× |
Benchmark: EMLLayer depth=2, 256→256, batch=1024, CPU. The Rust kernel uses PyO3 bindings.
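For reproducing ms/step numbers like these, a median-of-N loop is less noisy than a single timed pass. A minimal harness sketch (not the benchmark script used above; `ms_per_step` is a hypothetical helper):

```python
import time

def ms_per_step(step_fn, steps=100):
    """Median wall-clock milliseconds per call of step_fn()."""
    samples = []
    for _ in range(steps):
        t0 = time.perf_counter()
        step_fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return samples[len(samples) // 2]

# Usage: ms_per_step(lambda: layer(batch)) for each backend in the table.
```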
What BEST routing is not
BEST routing is a node-count optimization within the EML framework. It doesn't make EML competitive with purpose-built implementations of individual functions. A native sin(x) in hardware is faster than any EML tree, even a 1-node complex CBEST.
What BEST routing does: reduces the cost of expressing arbitrary combinations of elementary functions as EML trees, when you're already committed to the EML framework for other reasons (symbolic manipulation, gradient-free search, interpretability).
Install: pip install monogate · Rust source: monogate-core/