Can AI Really Read Your Building Plans? Introducing AECV-bench
Artificial Intelligence
10 min read
October 15, 2025
Imagine asking ChatGPT to count the doors in your architectural floorplan. It's a task any first-year architecture student could do in seconds. Should be simple for the world's most advanced AI, right?
Wrong. When we tested GPT-5 on this seemingly basic task, it correctly identified doors only 12% of the time. That's worse than random guessing would score on a four-option multiple-choice exam.
While AI continues to disrupt industries from medicine to law, the Architecture, Engineering, and Construction (AEC) sector watches from the sidelines, caught between vendor promises and professional skepticism. We're told AI will revolutionize how we design and build, yet we lack the most basic tool for evaluation: reliable benchmarks that tell us what these models can actually do with our drawings.
At AecFoundry, we decided it was time to move beyond anecdotes and marketing demos. We created AECV-bench, the first systematic benchmark for evaluating how well multimodal AI understands architectural floorplans. The results? They're a crucial reality check for an industry considering billion-dollar bets on AI automation.
TL;DR
8 multimodal LLMs evaluated on 150 residential plans (sampled from CubiCasa5k) for element identification + counting.
No model is reliably production-ready for general drawing understanding. Gemini 2.5 Pro leads with 41% mean accuracy across five elements; the best MAPE (mean absolute percentage error) is 18.8%.
Where AI helps today: Larger, labeled rooms (e.g., bedrooms 78%, toilets 74% for the best model).
Where it fails: Small, symbolic elements—doors 26%, windows 14%, spaces 15%.
General-purpose AI isn’t enough. We need domain-specific models, stronger visual pretraining on plans/CAD, and hybrid pipelines.
The Problem: AEC's Benchmarking Blind Spot
In medicine, AI models are rigorously tested against MedQA and other benchmarks before anyone considers clinical deployment. In software development, benchmarks like SWE-bench ensure code-generating AI meets minimum standards. But in AEC? We rely on cherry-picked examples and vendor demonstrations.
This isn't just an academic concern. Construction is a $12 trillion global industry where errors compound exponentially. A miscount of structural elements can cascade into ordering errors, scheduling delays, and safety violations. When a medical AI makes an error, one patient might be affected. When construction AI fails, it could impact hundreds of residents, millions in budget, and months of timeline.
Architectural drawings present unique challenges that general-purpose AI benchmarks don't capture. Unlike photographs, technical drawings use symbolic languages that evolved over centuries. A door isn't just a rectangular gap—it's an arc showing swing direction, annotations indicating fire ratings, and relationships to wall assemblies. These conventions vary by region, firm, and even individual drafters.
The cost of this benchmarking gap? We simply don't know if the AI tools being marketed to our industry are reliable enough for professional use. And in construction, where a 5% error rate can mean project failure, "probably good enough" isn't good enough.
Introducing AECV-bench: Our Methodology
We started with a deceptively simple question: can AI count basic building elements in residential floorplans? This isn't about complex spatial reasoning or code compliance—just pure recognition and counting.
Our dataset comes from CubiCasa5k, a collection of 5,000 professionally drawn residential floorplans originally created for real estate documentation.
We selected 150 representative plans after careful consideration of statistical validity versus practical constraints. Here's why this number works:
We tested sample sizes iteratively—50, 150, 200, and 300 drawings—and found remarkably consistent results across all sizes. The difference between 150 and 300 drawings was less than 2% in overall accuracy. Given that running 150 drawings through GPT-5 alone took several hours (at medium reasoning effort), analyzing all 5,000 drawings across eight models would have taken weeks and hundreds of dollars without meaningfully improving our insights.

This is a pragmatic benchmark, not an academic exercise. We needed enough data for statistical confidence while staying focused on actionable insights for practitioners. When results stabilize at 150 samples, running 5,000 becomes an expensive way to confirm what we already know: current AI struggles with architectural drawings.
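For readers who want to run a similar stability check, here is a minimal sketch of the idea: resample drawings at each candidate sample size and see whether mean accuracy moves. The file name and column names (per_drawing_results.csv, drawing_id, correct) are hypothetical illustrations, not our internal tooling, and the CSV is assumed to cover at least 300 scored drawings.

```python
# Minimal sketch of a sample-size stability check (hypothetical file and column names).
import numpy as np
import pandas as pd

# One row per (drawing, element): 1 if the model's count matched ground truth, else 0.
results = pd.read_csv("per_drawing_results.csv")  # columns: drawing_id, element, correct

drawing_ids = results["drawing_id"].unique()
rng = np.random.default_rng(42)

for n in (50, 150, 200, 300):
    accuracies = []
    for _ in range(200):  # repeat the subsample to estimate variability at each size
        sample = rng.choice(drawing_ids, size=n, replace=False)
        subset = results[results["drawing_id"].isin(sample)]
        accuracies.append(subset["correct"].mean())
    print(f"n={n}: mean accuracy {np.mean(accuracies):.3f} ± {np.std(accuracies):.3f}")
```

If the mean barely shifts between 150 and 300 while the spread stays small, the larger sample buys little additional insight.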
Here's something interesting: we discovered inconsistencies in the original CubiCasa5k labels—doors and windows were sometimes missed or mislabeled relative to the actual drawings. So we performed our own manual verification and relabeling of all 150 floorplans.
The surprising result? After correcting these labeling errors, the AI performance scores barely changed—less than 1% difference overall. This tells us something crucial: the models are failing so fundamentally at understanding architectural symbols that even imperfect ground truth doesn't mask their limitations. When your accuracy is 41%, whether that door was actually a window in the original label doesn't matter—the AI missed both anyway.
We tested eight leading multimodal Large Language Models, but here's what we don't show you: the failures. For each vendor, we tested multiple model variants and reported only their best performer. For instance, Claude 3.7 Sonnet outperformed the supposedly superior Opus 4.0 on our benchmark.
The models that made our final list represent each vendor's peak performance on architectural understanding:
Gemini 2.5 Pro (Google)
GPT-5 (OpenAI)
Claude 3.7 Sonnet (Anthropic)
Grok-2 Vision (X.AI)
Mistral Pixtral (Mistral AI)
Qwen2.5-VL (Alibaba)
Command A Vision (Cohere)
Llama 4 Maverick (Meta)
For each floorplan, we asked the models to count five fundamental elements:
Doors - Both interior and exterior
Windows - All types and sizes
Spaces - Distinct rooms or areas
Bedrooms - Specifically labeled sleeping areas
Toilets - Bathrooms and WCs
Why counting? Because it's foundational to everything else in AEC. Before you can analyze spatial relationships, ensure code compliance, or generate quantity takeoffs, you need to accurately identify what's in the drawing. If AI can't count doors, it certainly can't verify egress requirements.
We ran all tests through OpenRouter's API (for all models except Command A Vision), ensuring consistent prompting and fair comparison. Each model received the same simple, clear instruction:
“You are an expert floor-plan reader. Analyze carefully the attached floor-plan image and produce the JSON exactly as specified here: {json_schema}.”
Along with the prompt, each model received a JSON schema listing every element to extract, with a short instruction for each (e.g., "TOTAL number of doors (all types)" for doors).
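To make the setup concrete, here is a minimal sketch of the kind of request this implies, using OpenRouter's OpenAI-compatible chat endpoint. The model slug, the simplified schema, and the file name are illustrative assumptions, not our exact benchmark code.

```python
# Illustrative request via OpenRouter's OpenAI-compatible chat completions endpoint.
import base64
import json
import os
import requests

with open("floorplan_001.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

json_schema = {  # simplified version of the extraction schema
    "type": "object",
    "properties": {
        "doors": {"type": "integer", "description": "TOTAL number of doors (all types)"},
        "windows": {"type": "integer", "description": "TOTAL number of windows (all types)"},
        "spaces": {"type": "integer", "description": "TOTAL number of distinct rooms or areas"},
        "bedrooms": {"type": "integer", "description": "TOTAL number of bedrooms"},
        "toilets": {"type": "integer", "description": "TOTAL number of bathrooms and WCs"},
    },
    "required": ["doors", "windows", "spaces", "bedrooms", "toilets"],
}

prompt = (
    "You are an expert floor-plan reader. Analyze carefully the attached floor-plan image "
    f"and produce the JSON exactly as specified here: {json.dumps(json_schema)}."
)

response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "google/gemini-2.5-pro",  # swapped out for each model under test
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    },
    timeout=300,
)
print(response.json()["choices"][0]["message"]["content"])
```

The same prompt and schema were reused for every model, so differences in output reflect the models rather than the instructions.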
The Results: A Reality Check
Our overall results paint a sobering picture. Gemini 2.5 Pro leads the pack with just 41% average accuracy across all five element types. GPT-5 follows at 37%, and Claude 3.7 Sonnet manages 35%. At the bottom, Llama 4 Maverick achieves only 11% accuracy—statistically indistinguishable from random noise.

Think about that for a moment: the best-performing AI model, from Google's cutting-edge research, successfully identifies basic architectural elements less than half the time. In academic terms, that's a failing grade. In professional terms, it's a liability.
The 30-percentage-point gap between best and worst performers reveals another crucial insight: model selection matters enormously. This isn't a mature technology where all options perform similarly. It's an emerging capability where choosing the wrong model means getting essentially random outputs.
Element-by-Element Breakdown: A Tale of Recognition
The real story emerges when we examine performance by element type. Here's where AI shows both promise and fundamental limitations:
The Success Stories:
Bedrooms: 78% accuracy (best model)
Toilets: 74% accuracy (best model)
The Failure Points:
Doors: 26% accuracy (best model)
Windows: 14% accuracy (best model)
Spaces: 15% accuracy (best model)

Why do models excel at identifying bedrooms but fail catastrophically at doors? The answer lies in how these elements are represented. Bedrooms and toilets are typically labeled with text ("BEDROOM", "WC") that AI can read directly. Doors and windows, however, are shown through symbolic conventions—arcs, double lines, and breaks in walls that require understanding architectural drawing standards.
This disparity reveals a fundamental limitation: current AI models are better at reading text than interpreting technical symbols. They're approaching architectural drawings like photographs with labels, not technical documents with standardized symbolic languages.
The Error Magnitude Story: Close, But Not Close Enough
While accuracy percentages look dire, the Mean Absolute Percentage Error (MAPE) tells a more nuanced story. Gemini 2.5 Pro's MAPE is 18.8%, meaning its counts are, on average, about 19% off from ground truth.
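For clarity, here is a minimal sketch of how a per-element MAPE can be computed from predicted and ground-truth counts; the function and variable names are ours, not the benchmark's internal code.

```python
# Minimal sketch: mean absolute percentage error for predicted vs. ground-truth counts.
def mape(predicted, actual):
    """Average of |predicted - actual| / actual over all drawings with a nonzero ground truth."""
    pairs = [(p, a) for p, a in zip(predicted, actual) if a > 0]
    return 100.0 * sum(abs(p - a) / a for p, a in pairs) / len(pairs)

# Example: a 10-door plan counted as 8 on one drawing and 12 on another -> 20% MAPE.
print(mape(predicted=[8, 12], actual=[10, 10]))  # 20.0
```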


In some contexts, being within 20% might be acceptable. But consider what this means in practice: if a floorplan has 10 doors, the AI might count 8 or 12. For a fire safety review, missing 2 emergency exits could be catastrophic. For a security system quote, adding 2 phantom doors means ordering unnecessary hardware.
The error distribution varies dramatically by element. Window counts show the highest error rates, sometimes exceeding 100%—the AI might see 10 windows where only 5 exist, or vice versa. This isn't just inaccuracy; it's hallucination at a scale that makes the output unusable without complete human verification.
Case Studies: When AI Meets Reality
Case Study 1: Two-Bedroom Flat
We presented all models with a typical flat: 2 bedrooms, 1 bathroom, 6 doors, 12 windows, and 13 distinct spaces.

Case Study 2: Four-Bedroom Residential Plan
We presented all models with a typical suburban home floorplan: 4 bedrooms, 2 bathrooms, 20 doors, 12 windows, and 22 distinct spaces.

The pattern is clear: even the best models succeed only with explicitly labeled rooms and fail on architectural elements that require symbol interpretation, while the weaker models cannot reliably extract even the labeled objects.
Why Are Floorplans So Hard for AI?
Understanding AI's struggles with architectural drawings requires examining how these models "see" versus how architects draw.
Symbolic vs. Pictorial Representation: Architectural drawings aren't pictures—they're diagrams. A door symbol (an arc) doesn't look like a door; it represents the concept of a door and its operational space. AI trained primarily on photographs expects visual similarity, not symbolic abstraction.
The Annotation Problem: When models succeed (bedrooms, toilets), it's usually because they're reading text labels, not interpreting drawings. This suggests current AI approaches floorplans more like OCR tasks than true visual understanding.
Scale and Context Challenges: Architectural elements exist at multiple scales. A window might be a simple break in a wall line at 1:100 scale but include detailed mullion patterns at 1:50. Models lack the contextual understanding to interpret these scale-dependent representations.
Line Weights and Conventions: Architects communicate through line weights—thick lines for walls, thin for dimensions, dashed for hidden elements. These subtle distinctions, clear to trained humans, blur together for AI models expecting the clear edges and consistent textures of photographs.
Consider how humans learn to read drawings: architecture students spend months learning conventions, practicing symbol recognition, and understanding representational systems. Current AI attempts to bypass this learning, treating technical drawings like natural images. The results speak for themselves.
Implications for the Industry
What does 41% accuracy mean for AEC's AI ambitions? It's a reality check, not a death sentence.
Where Current AI Can Help: High-level understanding remains valuable. AI can successfully identify room types, suggest space programming, and provide rough layout analysis. For early-stage design exploration or client presentations, current accuracy might suffice with human oversight.
Where It Can't: Detailed extraction for professional use remains out of reach. Quantity takeoffs, code compliance checking, and construction documentation require near-perfect accuracy. A 26% door detection rate makes automated egress analysis impossible. A 14% window identification rate renders energy calculations meaningless.
The Trust Problem: Perhaps most critically, inconsistent performance erodes trust. If architects must verify every AI output, where's the efficiency gain? The time saved by automation evaporates when spent on quality control. Worse, inconsistent errors—missing doors here, hallucinating windows there—make verification harder than starting from scratch.
The Path Forward: From General to Specialized
The solution isn't abandoning AI but specializing it. General-purpose models trained on internet images will never understand architectural conventions without targeted intervention.
Domain-Specific Training: The path forward requires models trained specifically on architectural drawings. This means:
Fine-tuning visual language models on massive datasets of annotated floorplans
Training dedicated object detection models for architectural symbols
Developing hybrid systems combining rule-based symbol recognition with machine learning
The Data Challenge: Progress requires data—millions of annotated architectural drawings across building types, scales, and drawing conventions. Unlike ImageNet's photographs, technical drawings aren't freely available online. The industry must collaborate to build these datasets.
Promising Approaches: Several strategies show potential:
Transfer learning from engineering drawings in other domains
Synthetic data generation using BIM models to create training sets
Hybrid systems combining traditional computer vision with modern AI (see the sketch below)
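As one illustration of the hybrid idea, a classical computer-vision pass could look for door-swing arcs while an LLM or OCR pass reads the text labels, with the two counts reconciled afterwards. The sketch below uses OpenCV's Hough circle transform purely as an example; the thresholds are guesses, and this is not the approach used in AECV-bench itself.

```python
# Minimal sketch of the CV half of a hybrid pipeline: detect door-swing arc candidates.
# Thresholds are illustrative guesses and would need tuning per drawing scale and style.
import cv2

def count_door_arc_candidates(plan_path: str) -> int:
    """Roughly estimate door symbols by detecting circular swing arcs in a raster plan."""
    gray = cv2.imread(plan_path, cv2.IMREAD_GRAYSCALE)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    circles = cv2.HoughCircles(
        blurred,
        cv2.HOUGH_GRADIENT,
        dp=1.2,
        minDist=30,    # door arcs rarely sit closer together than this
        param1=120,    # Canny edge threshold
        param2=40,     # accumulator threshold: lower values yield more (noisier) detections
        minRadius=15,  # plausible swing-radius range at typical raster resolutions
        maxRadius=80,
    )
    return 0 if circles is None else circles.shape[1]

# Usage: combine this count with an LLM/OCR pass that reads "BEDROOM"/"WC" labels,
# then reconcile the two outputs with simple rules before reporting.
print(count_door_arc_candidates("floorplan_001.png"))
```

The point of such a pipeline is to let each component handle what it is good at: deterministic symbol detection for arcs and wall breaks, language models for labels and higher-level reasoning.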
The encouraging news? Once solved for architecture, these techniques could generalize across technical drawing domains—mechanical, electrical, civil engineering all face similar challenges.
Moving Beyond the Hype
AECV-bench is just the beginning. We started with simple counting because measurement requires a foundation. But floorplans contain rich spatial relationships, dimensional constraints, and semantic meaning that our benchmark doesn't touch.
The construction industry stands at a crossroads. We can continue accepting vendor promises without verification, or we can build the rigorous evaluation frameworks this profession demands. Medical AI didn't advance through marketing—it improved through systematic benchmarking against clinical standards. AEC deserves the same rigor.
Our results might seem discouraging, but they're actually liberating. Now we know where we stand. We can stop pretending general-purpose AI is ready for professional architectural use and start building the specialized tools our industry actually needs.
The challenge to our community is simple: Let's separate hype from reality. Contribute to AECV-bench. Create your own benchmarks for the specific problems you face. Share your results—both successes and failures. Because honest assessment isn't pessimism—it's the first step toward genuine progress.
The question isn't whether AI will transform architecture and construction. It's whether we'll guide that transformation through rigorous evaluation or stumble forward through trial and error. With billions in construction costs and countless lives affected by what we build, we can't afford to guess.