When NOT to rewrite a legacy file format
Every quarter, someone emails us mid-rewrite of a binary parser they should never have started. Here is the decision tree we use ourselves before agreeing to take on a format.
About once a quarter we get an email that starts something like:
Hi - we’re three months into writing our own parser for [proprietary format]. It works for our test files, but in production it falls over on about a third of customer uploads. Can you help?
We can usually help. But the help we deliver is not what the sender was hoping for. Half the time, the right answer is “throw out what you have and wrap a parser that already exists.” The other half, it is “yes, you have to keep going, but here are the three classes of bugs you haven’t hit yet.”
The mistake almost always traces back to the original decision: should we write a parser at all? The answer is no more often than people think. Here is the framework we use ourselves.
When NOT to rewrite
A good open-source parser exists. Even if the license is awkward. Copyleft licenses scare enterprises into rewriting things they shouldn’t: GEOS, for example, is LGPL (GDAL itself is MIT-licensed, which is easier still). The right move is almost always to wrap the existing parser (run it as a separate process and call it over IPC, or use it through a permissively licensed binding) rather than rebuild it. Writing a parser from scratch costs several times what the awkward license workaround does.
We see this most often with CAD, GIS, and structural-engineering formats. The OSGeo stack has been parsing some of these for thirty years. You will not catch up.
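The process-isolation wrapper mentioned above can be quite small. What follows is a minimal sketch, not our production code: `run_isolated_parser` is a hypothetical helper name, and the child process here is a stand-in Python one-liner so the example runs anywhere. In real use the argument list would name the wrapped tool instead (for example GDAL's `ogrinfo` with a hypothetical file such as `survey.dgn`).

```python
import subprocess
import sys


def run_isolated_parser(argv, timeout_s=60.0):
    """Run an external parser in its own process and return its stdout as text."""
    result = subprocess.run(
        argv,
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode bytes to str
        timeout=timeout_s,    # raises subprocess.TimeoutExpired on a hung child
        check=False,          # a bad exit code becomes our own error below
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"parser exited with {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout


# Stand-in child process so the sketch runs anywhere. In real use argv would
# name the wrapped tool, e.g. ["ogrinfo", "-al", "-so", "survey.dgn"]
# (the file name is hypothetical).
demo_output = run_isolated_parser([sys.executable, "-c", "print('3 elements')"])
print(demo_output.strip())
```

The timeout and the explicit exit-code check matter more than they look: a wrapped parser that hangs or crashes on a malformed upload should fail that one request, not the whole service.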
The format is dying. No new files are being authored. Your only need is to read an archive of existing ones. In this case, you are looking at a fixed input set. Use whatever exists - even if it’s a 2003-era C library that nobody maintains - and freeze your tooling. The bugs you would find in a rewrite are bugs that have already been found and fixed in the legacy tool.
You only need to write, not read. Writing a binary format, where you control the output, is much cheaper than reading one, where you don’t control the input. You author files with a known generator; if your written file doesn’t load in the vendor’s tool, you have a fast feedback loop. Reading other people’s files is open-ended; every customer has files generated by some MicroStation version from 2007 you’ve never heard of.
The spec is fully public and stable. PDF, ZIP, BMP. Yes, you can write parsers for these. But there is almost never a business reason to - high-quality libraries exist for every language, and the spec is too big for you to outdo them by hand. Use the library.
The use case is read-once, throw-away. Migrating a dataset out of an old system into a new one. Run it once, get the data into a sane format, never read the source format again. Do not write a parser. Pay for ODA File Converter, run it once, move on.
When you actually have to
The opposite cases - when a rewrite is the right call:
Closed format, vendor charges per-seat or per-year, and you need it in production. This is the case where rewrites pay back fastest. A $50k/year license for read-only access to your own data is a financial wound that compounds. A parser is a fixed cost that earns out in months. We’ve done this. The economics are usually clear within a single email exchange.
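The arithmetic behind "earns out in months" is simple enough to write down. This is a rough sketch with assumed inputs: the $50k/year license figure comes from the paragraph above, while the build cost and the annual maintenance cost are invented for illustration, and `payback_months` is a hypothetical helper.

```python
def payback_months(build_cost, annual_license, annual_maintenance=0.0):
    """Months until a one-off build cost is recovered by avoided license fees.

    Returns None when the parser costs as much (or more) per year to maintain
    as the license it replaces, i.e. it never pays back.
    """
    monthly_saving = (annual_license - annual_maintenance) / 12
    if monthly_saving <= 0:
        return None
    return build_cost / monthly_saving


# Assumed figures: a 30k build and 5k/year of upkeep against a 50k/year license.
months = payback_months(build_cost=30_000, annual_license=50_000,
                        annual_maintenance=5_000)
print(months)  # 8.0
```

Including a maintenance term keeps the comparison honest: a parser is mostly a fixed cost, but not entirely.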
You need to run it where the existing parser can’t run. This is what happened with CADpeek. Every existing DGN parser runs server-side: GDAL, dgnlib, Bentley’s own SDK. None of them run in a browser. We needed an in-browser DGN viewer because the data being viewed - cadastral surveys, architectural drawings, easement maps - is privacy-sensitive. Uploading it to a server was a non-starter. The only way to get the format into a browser was to port a parser to WebAssembly, which in practice meant writing one in Rust we could compile to WASM.
You need both read and write, with edits in between. Most existing parsers are read-only or write-only. If your product is edit and save, you usually end up needing your own implementation, because round-tripping through a third-party parser tends to lose metadata.
Performance-critical. Browser, mobile, embedded device, hot path in a high-throughput pipeline. The existing parser was written for a desktop tool ten years ago and assumes it can allocate megabytes per file. Yours can’t. Sometimes that’s enough reason on its own.
License conflict. Your product is closed-source commercial. The best parser is GPL. You can’t ship it. You have to write one - or wrap it via process-isolation, which is sometimes acceptable and sometimes not depending on deployment shape.
Format has secret features the spec doesn’t cover. This is the trap. The format you’re parsing has a public spec, you wrote a parser to the spec, and 30% of real-world files use features that aren’t in the spec. This is the “your test files work but production files don’t” failure mode from the email at the top. The honest decision tree branches here: did you know about the secret features before starting? If yes, full rewrite was justified. If no, you missed step zero of due diligence (looking at real customer files before writing code) and you are in a hole.
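Step zero can be partly automated: scan real files for record types the spec never mentions before writing any parsing logic. The sketch below assumes a deliberately simplified, hypothetical record layout (a 1-byte type ID, a 2-byte little-endian payload length, then the payload). Real formats differ, and `record_types`, `files_outside_spec`, and `make_record` are illustrative names, not part of any library.

```python
import struct
from collections import Counter

# Hypothetical header: type ID (1 byte), payload length (2 bytes, little-endian).
HEADER = struct.Struct("<BH")


def record_types(data):
    """Count how often each type ID appears, skipping over payloads."""
    counts = Counter()
    offset = 0
    while offset + HEADER.size <= len(data):
        type_id, length = HEADER.unpack_from(data, offset)
        counts[type_id] += 1
        offset += HEADER.size + length
    return counts


def files_outside_spec(corpus, spec_types):
    """Map file name -> sorted type IDs that the spec does not cover."""
    flagged = {}
    for name, data in corpus.items():
        unknown = set(record_types(data)) - spec_types
        if unknown:
            flagged[name] = sorted(unknown)
    return flagged


def make_record(type_id, payload):
    """Build one record in the hypothetical layout (for the demo only)."""
    return HEADER.pack(type_id, len(payload)) + payload


corpus = {
    "a.bin": make_record(1, b"xy") + make_record(2, b""),
    "b.bin": make_record(1, b"") + make_record(66, b"abc"),
}
flagged = files_outside_spec(corpus, spec_types={1, 2})
print(flagged)  # {'b.bin': [66]}
```

If a scan like this flags a large share of a customer corpus, that is the signal to budget for undocumented features before committing to a spec-only parser.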
The middle case
By far the most common situation is partial rewrite. The parser exists, it does 70% of what you need, you write the remaining 30% yourself - usually for the part of the format that’s specific to your industry or your customers’ files.
We do this constantly. For CADpeek, we use the excellent dxf-parser library for DXF (a well-spec’d format with a great open parser) and a custom Rust WASM parser for DGN (no JS implementation existed, and the format has Czech-survey-specific features that GDAL doesn’t expose to JS).
The test for a sound partition is: “can you cleanly split read and write paths, or features, between the borrowed and the built parts?” If yes, do the partial rewrite. If no - if you’re constantly patching the borrowed parser to bend it to your needs - you are doing a full rewrite in disguise, and you should make that decision explicitly.
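One way to picture a clean partition is a single dispatch point where each format goes to exactly one parser, borrowed or built, and nothing reaches into the other's internals. The sketch below is illustrative only: both parser functions are hypothetical stand-ins that return a small summary dict, not the real dxf-parser or our DGN code.

```python
from pathlib import Path


def parse_dxf_borrowed(data):
    """Stand-in for a call into a third-party DXF library."""
    return {"parser": "borrowed", "size": len(data)}


def parse_dgn_custom(data):
    """Stand-in for a call into an in-house DGN parser."""
    return {"parser": "custom", "size": len(data)}


# The partition lives in one table: one format, one owner.
PARSERS = {
    ".dxf": parse_dxf_borrowed,
    ".dgn": parse_dgn_custom,
}


def parse_file(name, data):
    """Route a file to its parser by extension (case-insensitive)."""
    suffix = Path(name).suffix.lower()
    parser = PARSERS.get(suffix)
    if parser is None:
        raise ValueError(f"unsupported format: {suffix or name}")
    return parser(data)


result = parse_file("Plan.DGN", b"abc")
print(result)  # {'parser': 'custom', 'size': 3}
```

If the table stops being enough, and the custom code starts needing hooks inside the borrowed parser, that is the hidden-rewrite warning sign described above.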
How we approach a rewrite when we do one
If, after all of the above, the decision is still “yes, write the parser”, the approach itself is boring engineering work. Three things make the difference between a rewrite that ships and one that limps:
Real reference files. Before any code: collect 50-100 real customer files that the parser will eventually see. Not synthetic fixtures, not the spec’s example file. The wild ones - files written by every version of the producing tool that customers actually use. If you can’t get these on day one, you will get them on day fifty as bug reports.
Cross-reference with an existing parser as ground truth. GDAL’s ogr2ogr can dump any vector format it supports to CSV. ODA File Converter round-trips DGN. dgnlib documents type IDs. Use those tools to generate “this file contains these elements with these coordinates” reference data, and write your parser to match. When your output disagrees with the reference, it is almost always you who is wrong.
Ship the 80% as soon as it loads anything. Then iterate against the failure modes of real files. We had CADpeek loading and rendering most files days before the parser was complete. Then we shipped, watched what broke, and fixed in priority order - by what fraction of files were affected.
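The cross-referencing step (treating an existing parser's dump as ground truth) lends itself to an automated check. Here is a minimal sketch, assuming the reference tool's output has been exported to a CSV with hypothetical columns `id`, `type`, `x`, `y`, and that our own parser yields the same fields per element. The column names, the tolerance, and the function names are all assumptions.

```python
import csv
import io
import math


def load_reference(csv_text):
    """Read reference rows into {id: (type, x, y)}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["id"]: (row["type"], float(row["x"]), float(row["y"]))
        for row in reader
    }


def diff_against_reference(parsed, reference, tol=1e-6):
    """List every disagreement between our output and the reference."""
    problems = []
    for elem_id, (ref_type, ref_x, ref_y) in reference.items():
        if elem_id not in parsed:
            problems.append(f"{elem_id}: missing from our output")
            continue
        our_type, our_x, our_y = parsed[elem_id]
        if our_type != ref_type:
            problems.append(f"{elem_id}: type {our_type} != {ref_type}")
        elif not (math.isclose(our_x, ref_x, abs_tol=tol)
                  and math.isclose(our_y, ref_y, abs_tol=tol)):
            problems.append(f"{elem_id}: coordinates differ")
    for elem_id in sorted(parsed.keys() - reference.keys()):
        problems.append(f"{elem_id}: not in reference")
    return problems


reference_csv = "id,type,x,y\n1,line,0.0,0.0\n2,arc,10.5,3.25\n3,text,1.0,2.0\n"
parsed = {
    "1": ("line", 0.0, 0.0),
    "2": ("arc", 10.5, 3.75),   # deliberately wrong y
    "4": ("line", 5.0, 5.0),    # element the reference does not have
}
problems = diff_against_reference(parsed, load_reference(reference_csv))
for line in problems:
    print(line)
```

Run over the whole reference corpus, a report like this doubles as a regression suite: every fixed disagreement stays fixed.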
The parser bugs that haunt you are the ones you didn’t know existed until you saw them in the wild. The fastest way to find them is to put the parser in front of real users and watch.
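Fixing "in priority order, by what fraction of files were affected" can be made mechanical once each file's failures are recorded. A small sketch under assumed inputs: `results` maps a file name to the set of failure labels seen for it (an empty set means it loaded fine). The labels and file names are invented, and `rank_failures` is a hypothetical helper.

```python
from collections import Counter


def rank_failures(results):
    """Return (label, fraction of files affected), most widespread first."""
    total = len(results)
    if total == 0:
        return []
    affected = Counter()
    for failures in results.values():
        affected.update(set(failures))  # count each label once per file
    # Sort by descending count, then by label so ties are deterministic.
    ordered = sorted(affected.items(), key=lambda item: (-item[1], item[0]))
    return [(label, count / total) for label, count in ordered]


results = {
    "a.dgn": {"unknown element type"},
    "b.dgn": set(),
    "c.dgn": {"unknown element type", "bad text encoding"},
    "d.dgn": {"unknown element type"},
}
ranking = rank_failures(results)
print(ranking)  # [('unknown element type', 0.75), ('bad text encoding', 0.25)]
```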
The actual point
When a client comes to us with a parser project, our first question is not “what does the format look like?” It’s “have you confirmed there isn’t already a parser that solves this with a wrapper?”
If they have - great, we do the rewrite. If they haven’t, we go look for one. We’ve sent paying customers to OSGeo libraries more than once. We don’t mind: a project that didn’t need to exist is one we’d rather not bill for, and the customers who get sent away come back with a real problem later.
If you’re staring at a binary format right now wondering whether to commit to a rewrite, the test is straightforward. Ask yourself: “if I had infinite engineering budget but knew this would cost six months, would I still do it?” If yes, go. If no, you’re trying to dodge a procurement conversation that you should have instead.
If you’re stuck in the middle of a parser rewrite and want a second opinion before you go further - that’s the kind of email we answer in a day. The advice is free; we charge if we end up doing the work.