erxi or how I learned to love the fast testing suite
erxi is available on GitHub.
Table of Contents
- The Mission and the Goalpost
- Testing, testing, testing
- The Flames of Never-Ending Optimization
- So, what gives?
- Messaging in bandwidth-limited environments
- Key message
- EXI4JSON
- What now?
- Project Timeline
The Mission and the Goalpost
One or two months ago I read this wonderful blog post about the lost art of XML, and it took hold of me (at least for a bit). I work professionally around XML, and I was always convinced that XML was the better option because of the potential for formal verification of messages, so I felt elated that someone had made that case far better than I ever could. I was also drawn to the EXI Specification, a way of transforming ordinary XML documents into byte-efficient representations without losing any of the advantages that XML has. It was an idea I had never been confronted with, and it made immediate sense as soon as I grasped it. One of those “of course, how obvious” moments that are harder to come by the older you get.
Some time later, I wanted to try to apply the EXI spec, or XML more generally, to the collaborative creativity apps I am currently developing. They did (and still do) use the Protobuf format for the exchanged messages because of its byte-efficiency. However, it turns out the best available reference implementation is EXIficient, a Java implementation by the “EXI man” Daniel Peintner (PBUH), and I was not gonna force a user to load the JVM on their device just so I could flex with my choice of communication protocol. So the idea arose to write my own Rust implementation of the spec. Of course, Claude made fun of the idea: Start a whole software project just for the messages? When Protobuf was working perfectly?
But the idea had a certain appeal. First and foremost, I hoped that this was a project where close supervision of the LLM was not as necessary as in my other projects (and boy, was I wrong), for three reasons: 1) there exists a specification document laying out the formal steps to generate EXI files from XML files and vice versa, 2) there exists a reference implementation whose code is freely available, and 3) there exists a full suite of interoperability tests to see if you fucked up. Besides all this, there were also obvious advantages to a Rust implementation: the ability to use it in WASM for web development without any hassle, the ability to use it in memory-capped environments like embedded devices without loading the full JVM, and the inherent coolness that befalls everybody who achieves a thing like that in the programming language the cool kids code in nowadays. Besides, there was a certain thrill in seeing whether some script kiddie armed with an LLM could outdo the multi-year project of Siemens. So, yes Claude, I WILL embark on a whole software project just for some messages!
So, I did what I always do when I start a project with Claude: set up my repository and let Claude draw up an implementation plan broken into issue-sized chunks, so that I can just feed it the consecutive issues with my magic words (hocus pocus for a first implementation with planning that is verified with Codex via hex hex, fidibus for internal/Codex review, and simsalabim for simplification). Claude happily chugged along and gave me a list of 45 issues to be worked through consecutively to arrive at a correct and blazingly fast EXI implementation. Where I should have raised my first eyebrow was when the first formal verification of interoperability with EXI happened in Issue 28, already more than halfway through the project. Turns out, that’s way too late, because your LLM WILL fuck up before that point, and a bug gets much harder to fix the later you find it. But I was young and innocent then, so I followed the infinite wisdom of our robot overlords.
It took some three days of mainly typing my magic words and making some decisions here and there (mainly telling Claude to be less lazy) until we arrived at Issue 28, and all hell broke loose. Those were the first real interoperability tests against the reference spec, and they were brutal. Turns out, even if you have a well-defined spec lying right there, an electrons-on-memory-cells copy of the literal truth ready to be inspected at any convenient moment, LLMs will just interpret it however they like. Bugs abounded, and it took me a full day of constant LLM chugging to fix them all. But okay, in the end it worked! The first steps were successful! Let’s fucking do this!
Testing, testing, testing
At this point, I want to emphasize the need for constant, true tests from the very beginning. And I don’t mean telling the LLM to achieve 100% test coverage; I mean real-world, non-cheatable tests that cover what you really want to achieve. Coding with LLMs can ease the burden of the HOW, but you constantly need to remind them of the WHAT, and automatic tests, ideally with traces attached, are your easiest tool for that. And the thing is, if you start a project, you know what you want to achieve. So pour that into a suite of tests, and make them run fast. When I finally implemented the W3C test suite for EXI, I made the mistake of loading the whole damn JVM for every goddamn fixture, making the suite run for more than 10 minutes. If it takes you 10 minutes to check for regressions, you push the check to the end of the whole workflow, and by then you have made so many changes that it is very hard to pinpoint where the regressions come from. Time poured into a useful test suite WITH as much debug information as possible attached to every run is NEVER wasted if you’re coding with an LLM. Bonus points if there is already a readily available reference implementation you can also shove tracing code into…
Well, let me tell you, afterwards it was smooth sailing again! We rushed through the rest of the issues until we hit the full interop tests with schema-informed mode. By this point, I was sure that this phase would take a long time, maybe even two days, given my experience with the first interop phase. In the end, it took 5 days, almost longer than the whole project before that. It didn’t help that this phase coincided with the training period for Claude 4.6, during which the performance of the other models degraded. I also noticed the LLM being very lazy and starting to fine-tune the code to the tests, but with the help of Codex and a lot of oversight I managed to land at functional parity with EXIficient.

A few commits earlier, I had started to wonder how I could test the performance of my new baby against the corporate overlord, so I devised a benchmark: downloading Wiki dumps, generating some sensor data files, and letting some well-known compression algorithms and EXIficient loose on them while measuring wall time, peak memory usage, and throughput. The first runs (I have included the data here) showed that EXIficient did very, very well on the large Wiki dumps. So naturally, my curiosity about the first runs of erxi was insatiable, as I was pretty sure that Rust code could easily beat Java on performance. And in that I was... disappointed. The first runs went pretty, pretty terribly. Important life lesson learned: choice of language may influence performance, but performance is much more influenced by the sheer stupidity of your code. And never, ever depend on the LLM “just working”. The only thing that counts is to get a real-world testing suite in place, and let the LLM run against it and come out smarter.
The “Before” (Commit 8c23728)
The initial implementation was a memory-hungry monster. For a simple 10 MiB dataset, it behaved like this:
| Metric | Value |
|---|---|
| Wall Time | 14.8s |
| Peak Heap | 3.60 GB |
| RSS | 3.64 GB |
| Allocations | 129 Million |
In short: it was slow, it was bloated, and it was embarrassing.
The Flames of Never-Ending Optimization
So, now the new goalpost was clear: Show that Rust actually could beat Java. For this, a lot of optimization was needed.
It took me a lot of time to land on my perfect setup for optimization with Claude, but the main takeaways are: generate a small dataset for optimization (ideally 2-3 different sets with different characteristics) that you can run in under 2 minutes, and always generate LLM-readable output with the runs (SVGs of flamegraphs, for example, are ideal). This way your coding slave can generate its own suggestions for improvements and always verify them. I’ll include a history of the changes as well as a summary of the main improvements here. And by “it took me a lot of time” I of course mean that my stupidity knew no bounds! My first idea was to let erxi run against EXIficient on a 30 GB (!) Wiki dump, to show its superiority. You can see the results. Small realistic testing suite, fast runs, as much machine-readable information as you can get per test run. Get it into your head, and be smarter than me.
The “After” (Phase 7b, Commit 5af9827)
After 11 days of never-ending optimizations, the numbers changed drastically:
| Metric | Before (8c23728) | After (5af9827) | Improvement |
|---|---|---|---|
| Wall Time (Encode) | 14.8s | 1.10s | 13.5x |
| Peak Heap | 3.60 GB | 20 KB | 184,000x |
| RSS | 3641 MB | 30 MB | 121x |
| Allocations | 129M | 1029 | 125,000x |
Key Optimizations
To get there, I had to rethink my whole life, and also the way in which XML is turned into EXI. Here are the changes that brought the most significant gainz (>20%):
1. Bitstream u64-Accumulator
Instead of writing bits directly to a Vec<u8> (which involved constant bounds checking and byte-level manipulation), I implemented a u64 accumulator: it buffers bits in a local register and only flushes whole bytes to the vector.
- Impact: ~73% speed improvement in the encoder.
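To make the idea concrete, here is a minimal sketch of such an accumulator, assuming an MSB-first bitstream. The names (`BitWriter`, `write_bits`) are illustrative, not erxi’s actual API:

```rust
/// Minimal sketch of a u64 bit accumulator (illustrative, not erxi's real type).
struct BitWriter {
    out: Vec<u8>,
    acc: u64,  // the low `used` bits are pending output, oldest bits highest
    used: u32, // number of valid bits currently buffered in `acc`
}

impl BitWriter {
    fn new() -> Self {
        Self { out: Vec::new(), acc: 0, used: 0 }
    }

    /// Append the low `n` bits of `value`, MSB-first (1 <= n <= 32).
    fn write_bits(&mut self, value: u32, n: u32) {
        debug_assert!((1..=32).contains(&n));
        let mask = if n == 32 { u32::MAX } else { (1 << n) - 1 };
        self.acc = (self.acc << n) | u64::from(value & mask);
        self.used += n;
        // Flush whole bytes only; at most 39 bits are ever buffered,
        // so the register never overflows.
        while self.used >= 8 {
            self.used -= 8;
            self.out.push((self.acc >> self.used) as u8);
        }
    }

    /// Zero-pad the final partial byte and hand back the stream.
    fn finish(mut self) -> Vec<u8> {
        if self.used > 0 {
            self.out.push((self.acc << (8 - self.used)) as u8);
        }
        self.out
    }
}
```

The win comes from touching the `Vec` once per byte instead of once per bit, and from keeping the hot state in a single register.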
2. Zero-Alloc Tier 2 Grammars
The EXI spec uses “Tier 2” grammars for schema-informed encoding. Initially, these were materialized as a fresh Vec<Terminal> for every element. I tried caching them with Arc, but the overhead of atomic reference counting actually made things slower. The breakthrough was moving to a zero-allocation approach: calculating offsets directly into the Tier 1 grammar without ever creating a new collection.
- Impact: 1.25x faster decoding, 42% fewer allocations.
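The shape of that change looks roughly like this; `Production` and the offset rule are simplified stand-ins for the real grammar machinery:

```rust
// Simplified illustration of the before/after; not erxi's actual types.
#[derive(Clone, Copy)]
struct Production {
    event: u16,
    next_state: u16,
}

// Before: every element start materialized the Tier 2 view as a fresh Vec.
fn tier2_materialized(tier1: &[Production], fallbacks: &[Production]) -> Vec<Production> {
    let mut v = Vec::with_capacity(tier1.len() + fallbacks.len()); // per-element allocation
    v.extend_from_slice(tier1);
    v.extend_from_slice(fallbacks);
    v
}

// After: resolve the n-th Tier 2 production by index arithmetic over the
// borrowed Tier 1 slice plus a static fallback table; nothing is allocated.
fn tier2_production(tier1: &[Production], fallbacks: &[Production], n: usize) -> Production {
    if n < tier1.len() {
        tier1[n]
    } else {
        fallbacks[n - tier1.len()]
    }
}
```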
3. QName Interning & Fast-Paths
I introduced a fast path for ASCII strings and an internal cache for frequently used QNames (Qualified Names). This avoids repeated string allocations and expensive UTF-8 validation in the hot path of event processing.
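A minimal sketch of the interning side, with the ASCII fast path made explicit; the real cache is keyed on namespace plus local name, and the names here are made up:

```rust
use std::collections::HashMap;

/// Simplified QName interner; illustrative only, not erxi's real string table.
#[derive(Default)]
struct QNameInterner {
    ids: HashMap<Box<str>, u32>,
    names: Vec<Box<str>>,
}

impl QNameInterner {
    /// Return a stable id for `raw`, allocating only the first time a name is seen.
    fn intern(&mut self, raw: &[u8]) -> u32 {
        // Fast path: pure-ASCII input skips the general UTF-8 validation.
        let s: &str = if raw.is_ascii() {
            // SAFETY: ASCII bytes are always valid UTF-8.
            unsafe { std::str::from_utf8_unchecked(raw) }
        } else {
            std::str::from_utf8(raw).expect("QName must be valid UTF-8")
        };
        if let Some(&id) = self.ids.get(s) {
            return id; // hot path: repeated names cost one hash lookup, zero allocations
        }
        let id = self.names.len() as u32;
        let owned: Box<str> = s.into();
        self.names.push(owned.clone());
        self.ids.insert(owned, id);
        id
    }
}
```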
4. Input Streaming
By moving from a “load-everything-into-a-string” approach to a buffered reader (quick-xml over a BufRead source, with my own periodic buffer flushes), I enabled erxi to handle files that are much larger than the available RAM.
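The skeleton of that loop looks roughly like this, sketched against a recent quick-xml (0.30-ish); the actual hookup into the EXI encoder is elided:

```rust
use std::io::BufRead;

use quick_xml::events::Event;
use quick_xml::Reader;

/// Stream XML events from any buffered source, reusing one scratch buffer,
/// so memory use stays flat no matter how large the input is.
fn stream_events<R: BufRead>(input: R) -> quick_xml::Result<()> {
    let mut reader = Reader::from_reader(input);
    let mut buf = Vec::new();
    loop {
        match reader.read_event_into(&mut buf)? {
            Event::Eof => break,
            event => {
                // feed `event` into the EXI encoder here; nothing outlives
                // this loop iteration, so the input never has to fit in RAM
                let _ = event;
            }
        }
        buf.clear(); // reuse the same allocation for the next event
    }
    Ok(())
}
```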
After 6 days of continuous improvements I arrived at the goal: an EXI implementation in Rust with 2-3x better runtimes and far lower RAM consumption than EXIficient. Two weeks to beat the corporate giant. Now, what can this be used for?
erxi vs EXIficient
Here is the raw data comparing erxi (Rust) and EXIficient (Java) across various scales; the 10 MB runs report wall time, the larger runs report throughput. Note the performance dominance of Rust in the schemaless modes and the memory stability at large scales.
| Input Size | Mode | Implementation | Encode | Decode | Peak RAM (RSS) |
|---|---|---|---|---|---|
| 10 MB | Compression | erxi | 0.51s | 0.13s | 57 MB |
| 10 MB | Compression | EXIficient | 2.08s | 0.99s | 222 MB |
| 10 MB | Bitpacked | erxi | 0.10s | 0.09s | 16 MB |
| 10 MB | Bitpacked | EXIficient | 0.41s | 0.30s | 126 MB |
| 10 MB | Schema-Bitpacked | erxi | 1.47s | 1.28s | 74 MB |
| 10 MB | Schema-Bitpacked | EXIficient | 0.46s | 0.37s | 197 MB |
| 1 GB | Compression | erxi | 48.0 MB/s | 125.9 MB/s | 3.1 GB |
| 1 GB | Compression | EXIficient | 5.9 MB/s | 11.9 MB/s | 3.4 GB |
| 1 GB | Bitpacked | erxi | 93.0 MB/s | 186.6 MB/s | 1.1 GB |
| 1 GB | Bitpacked | EXIficient | 52.9 MB/s | 69.0 MB/s | 2.8 GB |
| 10 GB | Compression | erxi | 16.2 MB/s | 68.3 MB/s | 15.0 GB |
| 10 GB | Compression | EXIficient | OOM | OOM | OOM |
* All benchmarks performed on AMD Ryzen 7 3700X. “OOM” signifies a crash or stall due to memory exhaustion.
So, what gives?
Now, to be honest, if you want to compress stuff without thinking too much about it, I think xz or zstd are probably the better choices. In most settings they achieve compression ratios similar to EXI, with similar or better runtimes and less RAM used. If you want to compress stuff fast, LZ4 or pigz are much better choices. So was it all for naught? I mean, I had a lot of fun and learned a lot about development and optimization, so of course not. But in terms of real-world applications, I don’t think there is too much out there. For me, there are two main fields where EXI can really shine:
Long-Term Archiving of Structured, Repetitive Data
The first is long-term (I mean disk, not RAM) storage of huge amounts of structured, repetitive data (think banks, robotics, government offices, that kind of stuff). In this regard, no other algorithm can beat the compression ratios of EXI, because it offers schema-aware compression, which the other algorithms don’t.
Benchmark: Realistic Sensor Data (1 GB)
In this run, I compared standard multi-threaded (MT) compressors against erxi pre-compression combined with xz on simulated, realistic sensor data.
| Tool | Format | File Size | Ratio | Encode Time |
|---|---|---|---|---|
| xz-6-mt | Generic | 78 MB | 13.24:1 | 37.5s |
| zstd-19-mt | Generic | 79 MB | 13.12:1 | 83.2s |
| pbzip2-9 | Generic | 69 MB | 14.91:1 | 8.78s |
| erxi-precomp + xz-6 | Hybrid | 46 MB | 22.38:1 | 45.5s |
| erxi-precomp-schema + xz-6 | Hybrid | 45 MB | 23.02:1 | 51.9s |
Findings
As you can see, pbzip2 blows xz and zstd out of the water on this one, taking far less time (almost a tenth of zstd’s) while achieving better compression. This was not the case on the Wiki dumps, so I think pbzip2 particularly shines on structured, repetitive data. What is also apparent, however, is that nothing can beat EXI pre-compression plus xz on these kinds of datasets, which reaches compression ratios of up to 23:1.
Another interesting takeaway: while schema-informed mode provides the absolute best compression (23.02:1), its advantage over schemaless pre-compression (22.38:1) is minimal (less than 3%). The performance penalty, however, is real: encoding takes about 15% longer, and decoding is significantly slower. I think this happens because EXI learns the schema of these datasets pretty fast even if you don’t supply it explicitly, since the structure is not overly complicated, and the extra bytes spent on learning it are insignificant at these sizes. So, for massive datasets, you should carefully consider whether that last 2-3% of disk space is worth the extra CPU cycles. Often, sticking to schemaless pre-compression is the sweet spot.
Messaging in bandwidth-limited environments
The other is messaging between bandwidth-limited IoT devices, where you get all the advantages of XML (controlled input, easier parsing) without the overhead.
Benchmark: Startup & Small Messages
For IoT devices, the “warm-up” time of a JVM is a dealbreaker. erxi provides instant-on performance with a tiny memory footprint.
| Feature | erxi (Rust) | EXIficient (Java) |
|---|---|---|
| Startup Time | ~1-2ms | ~200-300ms (JVM) |
| RSS (Small Msg) | ~16 MB | ~120 MB |
| Binary Size | ~4 MB | ~100 MB+ (JRE) |
Ironically, both areas are exactly where EXIficient is at its worst: it doesn’t work on large datasets because it is not optimized for them, and for small messages you always need to load the whole JVM into memory with no JIT advantages to show for it.
Key message
I feel I repeat myself too often, but: TESTING. No choice of language or LLM guarantees good performance or feature completeness. The only source of truth is a set of concrete, real-world tests whose output your LLM can read and use as a basis for optimization and bug hunting. This is the only way you will make progress. The LLM (no matter which) can, and abso-fucking-lutely WILL, do the most mind-bogglingly stupid shit you have ever seen and take the laziest shortcuts known to machine. The only way to get usable output from it (besides coding yourself, and let’s face it, it’s 2026, no one is coding anymore) is to let it actually SEE what it did wrong, so it can get it right on the tenth attempt. So having a test suite is unavoidable, but the upside is: your LLM buddy can help you with it! Just tell it what you want to test and how, and it will do most of the actual work.
EXI4JSON
While I was wrapping this up, I added an EXI4JSON implementation so JSON workflows can be bridged into EXI without first authoring XML. That makes it easier to adopt EXI in projects that are already JSON-first, while still keeping the compression benefits of the EXI stack (schema-based validation still only applies in XML schema-informed mode, not in EXI4JSON).
What now?
I plan to use it in my web apps for messaging, as it also compiles to WASM. As for the development of erxi itself: with EXI4JSON in place so JSON messages can be compressed via EXI, I next want to build a visualizer that shows the translation of XML (and JSON) messages into EXI streams and back. In the more distant future, I also want to turn my eyes towards a whole XML stack in Rust, as I had so much fun with this one.
If you want to give me a lot of money for optimizing your slow-as-hell software or for using erxi in any of the aforementioned fields, hit me up. erxi is available on GitHub. If you want to use it commercially, you have to pay.
Project Timeline
| Date | Milestone |
|---|---|
| Feb 19 | EXI4JSON implementation added |
| Feb 18 | Benchmarks finalized |
| Feb 17 | Optimization phase ended |
| Feb 10 | Interoperability achieved |
| Feb 05 | W3C Interop tests start |
| Feb 04 | Early interop failed |
| Feb 04 | Early interop started |
| Feb 02 | Project start |