Reading blueprints with Gemini, six months on

Cimenta does one thing that sounds simple: you upload an architectural plan, and it hands you back a bill of materials with quantities, specs, and an Excel file you can give to a supplier. For the Costa Rican construction market, where a takeoff is still mostly done by hand, that's hours of work collapsed into minutes.

In the demo it looked solved. Gemini reads a clean drawing, lists the materials, you nod, you ship. Six months in production taught us the demo was hiding the entire problem. This note is about what actually broke, and the two decisions that fixed it — one of which was admitting we'd added a stage that was quietly destroying data.

A plan is not a page

The first wrong assumption was treating a blueprint like a document you read top to bottom. It isn't. A construction project is spread across many sheets that speak different dialects: the architectural floor plan, the structural sheet, the electrical layout, the plumbing runs. The concrete grade and rebar specs live on a specification sheet. The walls and columns live on a floor plan. The materials you need to count are scattered across all of them, and no single sheet has the whole answer.

Ask one model call to "read the plans and list the materials" and you get something that looks right and is subtly, expensively wrong — because the model never had the context from the spec sheet when it was looking at the floor plan. In construction, a confident wrong quantity doesn't get caught downstream. It gets ordered, and it gets built.

The pipeline: context before counting

So instead of one big read, Cimenta runs a staged pipeline, and the core idea is that earlier stages build the context that later stages need.

Stage 0 — classify the sheet. On upload, every file gets tagged: sheet number, role (spec_sheet or floor_plan), and plan type (architectural, structural, electrical, plumbing). Nothing is extracted yet. We're just figuring out what each page is.
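
A minimal sketch of what a Stage 0 tag could look like, to make the "nothing is extracted yet" point concrete. The role and plan-type values come from the description above; the class names, the field names, and the `parse_sheet_tag` helper are assumptions for illustration, not Cimenta's actual code.

```python
from dataclasses import dataclass
from enum import Enum


class SheetRole(str, Enum):
    SPEC_SHEET = "spec_sheet"
    FLOOR_PLAN = "floor_plan"


class PlanType(str, Enum):
    ARCHITECTURAL = "architectural"
    STRUCTURAL = "structural"
    ELECTRICAL = "electrical"
    PLUMBING = "plumbing"


@dataclass(frozen=True)
class SheetTag:
    """Stage 0 output: what a page is, with nothing extracted from it."""
    sheet_number: str
    role: SheetRole
    plan_type: PlanType


def parse_sheet_tag(raw: dict) -> SheetTag:
    """Validate a classifier response; an unknown label raises ValueError."""
    return SheetTag(
        sheet_number=str(raw["sheet_number"]),
        role=SheetRole(raw["role"]),
        plan_type=PlanType(raw["plan_type"]),
    )


# Hypothetical classifier response for one uploaded page.
tag = parse_sheet_tag(
    {"sheet_number": "E-01", "role": "floor_plan", "plan_type": "electrical"}
)
```

Closed enums mean a label the classifier invents fails loudly at the boundary instead of flowing into later stages.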

Stage 1 — extract the structural DNA. This runs only on the spec sheets. It pulls the things that govern the whole job — concrete grades (f'c), rebar diameters and grade, block types — and produces an obra_config: a compact context object that describes the rules of this specific build.
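
As a rough picture of obra_config, here is a sketch under stated assumptions: only the categories (concrete grades, rebar diameters and grade, block types) come from the text above. The field names, the merge rule, and every value in the example are made up for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ObraConfig:
    """Project-wide rules pulled from the spec sheets (field names assumed)."""
    concrete_grades: dict = field(default_factory=dict)  # element -> f'c
    rebar_diameters: list = field(default_factory=list)
    rebar_grade: Optional[str] = None
    block_types: list = field(default_factory=list)

    def merge(self, other: "ObraConfig") -> "ObraConfig":
        """Combine two spec sheets; on conflict the earlier sheet wins.

        "Earlier wins" is an assumption; a real system might flag conflicts.
        """
        return ObraConfig(
            concrete_grades={**other.concrete_grades, **self.concrete_grades},
            rebar_diameters=list(
                dict.fromkeys(self.rebar_diameters + other.rebar_diameters)
            ),
            rebar_grade=self.rebar_grade or other.rebar_grade,
            block_types=list(dict.fromkeys(self.block_types + other.block_types)),
        )

    def to_prompt_context(self) -> str:
        """Serialize compactly so Stage 2 can carry it in the prompt."""
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)


# Illustrative values only, not taken from a real project.
sheet_a = ObraConfig(
    concrete_grades={"columns": "210 kg/cm2"},
    rebar_diameters=["#3", "#4"],
    rebar_grade="Grade 40",
)
sheet_b = ObraConfig(
    concrete_grades={"columns": "280 kg/cm2", "slab": "210 kg/cm2"},
    rebar_diameters=["#4", "#5"],
    block_types=["15 cm concrete block"],
)
obra_config = sheet_a.merge(sheet_b)
```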

Stage 2 — read the floor plans, with the DNA as context. Now the floor plans get analyzed, but the model isn't reading them blind. It carries obra_config with it, so when it sees a wall, it already knows what block this project uses; when it sees a column, it knows the rebar spec. Each material comes back with a quantity, a unit, the sheet it came from, and a confidence score.
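
One way to picture "carrying obra_config" is a prompt builder that puts the specifications ahead of the counting instruction. This is a sketch, not the production prompt: the prompt wording, the `Material` field names, and the `call_model` parameter are assumptions, and the model call is injected as a plain function so that nothing here pretends to be the Gemini API. A real call would also attach the sheet image.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Material:
    name: str
    quantity: float
    unit: str
    source_sheet: str
    confidence: float  # assumed to be on a 0.0 to 1.0 scale


def build_floor_plan_prompt(obra_config: dict) -> str:
    """Put the project's own specifications in front of the counting task."""
    return (
        "Project specifications (from the spec sheets):\n"
        f"{json.dumps(obra_config, ensure_ascii=False, indent=2)}\n\n"
        "Using these specifications, list every material on the attached "
        "floor plan as a JSON array of objects with the keys "
        "name, quantity, unit and confidence."
    )


def read_floor_plan(
    sheet_number: str,
    obra_config: dict,
    call_model: Callable[[str], str],
) -> list:
    """Run one floor plan through the model and tag each row with its sheet."""
    raw = call_model(build_floor_plan_prompt(obra_config))
    return [
        Material(
            name=item["name"],
            quantity=float(item["quantity"]),
            unit=item["unit"],
            source_sheet=sheet_number,
            confidence=float(item["confidence"]),
        )
        for item in json.loads(raw)
    ]


def fake_model(prompt: str) -> str:
    # Stand-in for the real vision call; returns a canned response.
    return json.dumps(
        [{"name": "15 cm concrete block", "quantity": 1240,
          "unit": "unit", "confidence": 0.86}]
    )


materials = read_floor_plan(
    "A-02", {"block_types": ["15 cm concrete block"]}, fake_model
)
```

Stamping source_sheet in code rather than asking the model for it keeps provenance out of the model's hands.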

Stage 3 — aggregate into a BOM. Everything gets grouped into the six chapters a Costa Rican estimator expects — earthworks, foundation/slab, masonry, vertical structure, roof and steel beams, miscellaneous and installation — and exported.
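
The grouping step itself is plain bookkeeping. A minimal sketch, assuming each material already carries a chapter label (the text above doesn't say how chapters get assigned) and that anything unlabeled falls into the last chapter. The Excel export is left out.

```python
from collections import defaultdict

# The six chapters named above, in the order an estimator expects them.
CHAPTERS = [
    "earthworks",
    "foundation/slab",
    "masonry",
    "vertical structure",
    "roof and steel beams",
    "miscellaneous and installation",
]


def aggregate_bom(materials: list) -> dict:
    """Sum quantities per (chapter, name, unit) and return rows by chapter."""
    totals = defaultdict(float)
    for m in materials:
        chapter = m.get("chapter")
        if chapter not in CHAPTERS:
            # Assumption: unknown or missing chapters go to the catch-all.
            chapter = "miscellaneous and installation"
        totals[(chapter, m["name"], m["unit"])] += m["quantity"]

    bom = {chapter: [] for chapter in CHAPTERS}
    for (chapter, name, unit), quantity in totals.items():
        bom[chapter].append({"name": name, "unit": unit, "quantity": quantity})
    return bom


# Illustrative rows only.
bom = aggregate_bom([
    {"name": "15 cm concrete block", "unit": "unit", "quantity": 1240, "chapter": "masonry"},
    {"name": "15 cm concrete block", "unit": "unit", "quantity": 310, "chapter": "masonry"},
    {"name": "PVC pipe 12 mm", "unit": "m", "quantity": 42},
])
```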

The unlock was Stage 1 feeding Stage 2. Anchoring the read with the project's own specifications before counting anything was the single biggest accuracy gain in the whole system.

The stage we had to delete

Here's the failure mode I didn't see coming, and the most useful thing in this note.

At one point we added a verification stage — call it Stage 2.5. The idea felt obviously good: after extracting materials, ask Gemini to verify each material against the floor plans, and drop the ones it couldn't find. A confidence check. A hallucination filter.

It worked beautifully on architectural materials. And it silently deleted the entire MEP scope — the electrical, plumbing, and water materials — every time.

The reason is the exact thing the pipeline was supposed to respect: those materials don't come from the floor plan. They live on their own sheets. So when Stage 2.5 went looking for an electrical conduit on an architectural floor plan and didn't find it, it concluded the material was a hallucination and removed it. We had built a stage whose whole job was to destroy correct data with total confidence. Logged as BUG-002, and it took a couple of rounds to fully kill.
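
A toy reproduction of the failure, with the model's "can I find this on the floor plan?" judgment replaced by a set lookup. The material names and the `scopes_wiped_out` check are hypothetical. The check is not something we had at the time; it is the kind of cheap invariant that would have flagged the problem on the first run.

```python
from collections import Counter


def verify_against_floor_plan(materials, found_on_floor_plan):
    """The flawed rule: keep a material only if the floor plan shows it."""
    return [m for m in materials if m["name"] in found_on_floor_plan]


def scopes_wiped_out(before, after):
    """Plan types that had materials before a stage and none after it."""
    counts_before = Counter(m["plan_type"] for m in before)
    counts_after = Counter(m["plan_type"] for m in after)
    return sorted(t for t in counts_before if counts_after[t] == 0)


# Illustrative extraction results from several sheets.
extracted = [
    {"name": "concrete block", "plan_type": "architectural"},
    {"name": "column rebar", "plan_type": "structural"},
    {"name": "electrical conduit", "plan_type": "electrical"},
    {"name": "PVC drain pipe", "plan_type": "plumbing"},
]

# What a verifier looking only at the architectural floor plan can "see".
found_on_floor_plan = {"concrete block", "column rebar"}

kept = verify_against_floor_plan(extracted, found_on_floor_plan)
wiped = scopes_wiped_out(extracted, kept)  # ['electrical', 'plumbing']
```

Every dropped row here is correct data. The filter can't tell "not on this sheet" from "not real."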

The fix wasn't a better verification prompt. It was deleting the stage entirely and replacing it with something dumber and far more trustworthy: deterministic fuzzy deduplication in Python. Materials get merged when their normalized names cross a Jaccard similarity of 0.70 — Jaccard just measures how much two sets of words overlap, scored 0 (nothing in common) to 1 (identical). Strip the parentheticals, drop generic qualifiers, singularize Spanish plurals, then compare. Walls with near-identical names collapse into one line; MEP materials are left exactly where they are, because nothing is hunting them down to "verify" them anymore.
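
A minimal sketch of that deduplication, not the production code. The qualifier list, the accent stripping, the plural rule (drop a trailing "s", which misses consonant-stem plurals like "paredes"), the same-unit requirement, and summing quantities on merge are all assumptions made for the example. Only the steps and the 0.70 threshold come from the description above.

```python
import re
import unicodedata

GENERIC_QUALIFIERS = {"de", "del", "para", "tipo", "con"}  # hypothetical list
THRESHOLD = 0.70


def normalize(name: str) -> frozenset:
    """Turn a material name into a set of comparable tokens."""
    text = re.sub(r"\([^)]*\)", " ", name.lower())         # strip parentheticals
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")  # drop accents
    tokens = set()
    for tok in re.findall(r"[a-z0-9]+", text):
        if tok in GENERIC_QUALIFIERS:
            continue
        if len(tok) > 3 and tok.endswith("s"):
            tok = tok[:-1]                                  # naive singular
        tokens.add(tok)
    return frozenset(tokens)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Shared tokens divided by all tokens; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def dedupe(materials: list, threshold: float = THRESHOLD) -> list:
    """Greedy merge: each row joins the first earlier row it resembles."""
    merged = []  # pairs of (token set of the first-seen name, output row)
    for m in materials:
        tokens = normalize(m["name"])
        for seen_tokens, line in merged:
            if line["unit"] == m["unit"] and jaccard(tokens, seen_tokens) >= threshold:
                line["quantity"] += m["quantity"]  # assumption: quantities add up
                break
        else:
            merged.append((tokens, dict(m)))  # copy so the input is not mutated
    return [line for _, line in merged]


# Illustrative rows only.
materials = [
    {"name": "Bloques de concreto 15 cm (pared exterior)", "quantity": 1240, "unit": "unit"},
    {"name": "Bloque de concreto 15 cm", "quantity": 310, "unit": "unit"},
    {"name": "Bloque de concreto 20 cm", "quantity": 200, "unit": "unit"},
    {"name": "Tubo PVC 12 mm (eléctrico)", "quantity": 42, "unit": "m"},
]
lines = dedupe(materials)
```

The two 15 cm block rows normalize to the same token set and merge into one line of 1550. The 20 cm block shares three of five tokens with them (0.60), so it stays separate at 0.70, and the conduit shares no tokens with anything, so it is never touched. Drop the threshold to 0.60 and the 20 cm block gets folded into the 15 cm line, which is exactly the wrong merge.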

What we'd do differently

The lesson isn't "don't use the model." It's to know which job belongs to the model and which belongs to a rule.

Perception — reading a messy drawing and pulling structured meaning out of it — is genuinely what the model is good at, and nothing else comes close. But reconciliation — deciding whether two material names are the same thing, deduplicating, validating a number against a format — is deterministic work. When we handed reconciliation to Gemini (Stage 2.5), we got a smart component making confident, unpredictable, irreversible mistakes. When we handed it to twenty lines of Python, we got something boring that we could reason about and trust.

A few more, honestly:

- We over-trusted "add a model call to check the model." A verification step built from the same kind of component that made the original error doesn't add safety. It adds a second way to be confidently wrong.
- The Jaccard threshold (0.70) is a knob, not a law. Set it lower and distinct materials start merging; set it higher and near-duplicates survive as separate lines.
- Pricing taught us a parallel lesson: we pull market prices via Gemini's search grounding rather than scraping, because the model handles the messiness of a volatile Costa Rican market better than a brittle scraper would. Same principle, opposite direction: use the model where the world is messy, use rules where the logic is fixed.

The takeaway

Vision problems that look solved in a demo are almost never one model call from production. The model is the easy part now. The hard part is the architecture around it: staging the work so context flows in the right order, and being ruthless about which decisions a language model should not be allowed to make. The most dangerous component in our system was never the one that read the drawing wrong. It was the clever one we added to double-check, that deleted real materials with a straight face.

Treat perception as a pipeline, give the model the jobs only it can do, and give the boring deterministic jobs to boring deterministic code. The result is a system whose numbers you can hand to someone who's about to build with them.

Cimenta is one of the systems in the Controlled Mayhem lab. More field notes as things break.

Andrea Phillips
