The Model Matters More Than You Think
Not all AI is created equal. In chip design, the difference can cost you a respin.
A note on how this was written: this time around I used a combination of Claude Opus and Gemini as writing partners, directing the argument myself and reviewing every claim. See Post 1 for why I think this transparency matters.
In the first post in this series, I argued that AI is coming for chip designers — as a tool, not a replacement. That the engineer’s job is moving up the abstraction stack, from writing syntax to owning intent.
But there’s a critical assumption buried in that argument: that the AI you’re using actually understands hardware.
Most don’t.
The synchronizer test
Here’s something every chip designer learns early in their career: when a signal crosses from one clock domain to another, you need a synchronizer. Two flip-flops clocked by the destination clock, in series. The first flop may go metastable — that’s expected. The second flop samples the resolved output and provides a clean signal to the destination domain. This is foundational. It’s in every textbook. Get it wrong and you get intermittent, unreproducible failures that escape simulation and show up in silicon.
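For readers who want to see the structure rather than just read about it, here is a minimal sketch of the textbook two-flop synchronizer. Module and signal names are illustrative, and it assumes the incoming signal is a single bit that is registered (or otherwise glitch-free) in the source domain:

```systemverilog
// Two-flop synchronizer: BOTH flops are clocked by the destination clock.
// Names are illustrative; async_in is assumed to be registered in the
// source domain before it crosses.
module sync_2ff (
  input  logic clk_dst,    // destination-domain clock
  input  logic rst_n_dst,  // destination-domain reset, active low
  input  logic async_in,   // signal arriving from the source domain
  output logic sync_out    // safe to use in the destination domain
);
  logic meta;  // first stage: may go metastable, never used directly

  always_ff @(posedge clk_dst or negedge rst_n_dst) begin
    if (!rst_n_dst) begin
      meta     <= 1'b0;
      sync_out <= 1'b0;
    end else begin
      meta     <= async_in;  // may resolve late -- that's expected
      sync_out <= meta;      // samples the (almost certainly) settled value
    end
  end
endmodule
```

The entire point of the structure is in the sensitivity list: both flops on `clk_dst`, giving the first stage a full destination-clock period to resolve before the second stage samples it.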
I asked two popular open-source models — Qwen 3.5 (9B parameters) and Qwen Coder (14B parameters) — to write a basic two-flop synchronizer. Both produced code. Both produced code that was syntactically correct. Neither produced a synchronizer.
Both models clocked the two synchronizer flops with the source clock. Then they added a third flop on the destination clock to capture the output.
Think about what that means. The two “synchronizer” flops are just a pipeline in the source domain — they accomplish nothing. The actual clock domain crossing happens at the third flop, which is a single bare register with no metastability protection. It’s worse than no synchronizer at all, because it looks correct. It has the right structure, the right signal names, the right number of flops. A quick visual review might miss it. The synthesis tool won’t flag it. And if it goes metastable in silicon, you’ll spend weeks chasing a bug that only appears under certain temperature and voltage conditions.
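To make the failure concrete, here is a reconstruction of the broken pattern — paraphrased from what the models produced, not their verbatim output:

```systemverilog
// Reconstruction of the broken pattern (paraphrased, not verbatim
// model output): the "synchronizer" flops are on the SOURCE clock.
module broken_sync (
  input  logic clk_src,  // source-domain clock
  input  logic clk_dst,  // destination-domain clock
  input  logic d_in,
  output logic q_out
);
  logic ff1, ff2;

  // These two flops are just a pipeline in the source domain.
  // They provide zero metastability protection.
  always_ff @(posedge clk_src) begin
    ff1 <= d_in;
    ff2 <= ff1;
  end

  // The real clock domain crossing happens HERE, at a single bare
  // register -- exactly the hazard a synchronizer exists to prevent.
  always_ff @(posedge clk_dst)
    q_out <= ff2;
endmodule
```

Side by side with the correct version, the bug is obvious. Buried in a larger block, with plausible signal names, it is exactly the kind of thing a quick review skims past.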
I showed both models the error. Explained exactly what was wrong and why. Neither could fix it. They didn’t understand what the correction should be — because they don’t understand what a synchronizer does. They had learned the shape of the code, not the purpose of the circuit.
This is not a minor failure. This is the difference between a tool you can use and one that will bury a silicon-killing bug under syntactically perfect code.
And here’s the real damage: imagine a designer — someone with ten or twenty years of experience — sees this output. They asked the AI to do one of the simplest things in digital design, and it got the clocking wrong on a synchronizer. What do they do? They close the tool, they tell their colleagues it doesn’t work, and they go back to writing RTL by hand. You’ve lost them — not because AI can’t help them, but because the wrong model just proved to them that it can’t be trusted with the basics.
That first impression is almost impossible to undo. Every engineer I know who has dismissed AI tools has a story like this: they tried it once, it produced something obviously wrong, and they concluded the technology isn’t ready. In many cases they were right — but about the model, not about AI in general.
Expert models—like Claude, GPT-4, Gemini, and even advanced open-weight models like Gemma 4 31B—get this right every time. I’ve generated dozens of simple CDC blocks with them, and the destination clock is exactly where it belongs. When asked to explain why the flops need to be on the destination clock, they give a correct, coherent explanation about metastability resolution.
At first glance, this looks like a huge win. The larger models know how to write the code. But there is a dangerous trap here: just because a model has memorized the correct structure of a common circuit doesn’t mean it actually understands the physics underlying it. Getting the basic coding right is just table stakes. The real test is methodology.
Pushing the frontier models
The two-flop synchronizer test is the baseline. But I wanted to see what happens when you push the frontier models on a much more delicate CDC architecture question. As I said explicitly in the first post, I am not pushing a specific tool, so I ran this test across four of the best models available today: Claude Opus 4.6 (extended), Gemini 3.1 Pro, ChatGPT 5.3, and Gemma 4 31B.
Here is the setup: I gave them a complex single-bit synchronizer scenario with multiple OR’ed requests, testing four different configurations involving combinational logic before the synchronizer and clock domain crossings.
The results were incredibly revealing:
ChatGPT 5.3 hit immediate methodology violations, confidently calling an unsafe single-domain OR “Generally OK.”
Claude Opus 4.6 caught the risk of a narrow glitch but concluded it “works, with a caveat” because the glitch probability is low—steering a designer down a dangerous path.
Gemini 3.1 Pro provided the only methodologically correct answer: none of these implementations are safe.
When pushed on the physics of why, Gemini correctly landed on the deeper truth of synchronizer design: that runt glitches force flops into deeper metastable states, which completely invalidates the MTBF math that proves the silicon safe.
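For context, the standard first-order model for synchronizer reliability is the exponential MTBF formula (symbols per the usual metastability literature; the resolution time constant and the metastability window are process-dependent constants characterized per flop):

$$\mathrm{MTBF} = \frac{e^{\,t_r/\tau}}{T_0 \cdot f_{clk} \cdot f_{data}}$$

Here $t_r$ is the time available for the first flop to resolve, $\tau$ is the regenerative time constant of the flop, $T_0$ is its metastability window, and $f_{clk}$, $f_{data}$ are the sampling clock and data-transition rates. The safety of the whole scheme lives in that exponent: anything that effectively shrinks $t_r$ — or, as with runt glitches, drives the flop into a deeper metastable state than the characterization assumed — collapses the MTBF by orders of magnitude.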
Keep in mind, I only caught this because it was a microscopic, directed test. But imagine a designer asking an AI to generate a large, complex block. If the AI hallucinates an unsafe structure like the single-domain OR above and calls it “generally OK,” that error will get buried in thousands of lines of RTL. The synthesis tool won’t flag it. It might not get caught until CDC signoff—or worse, a tired engineer might decide to waive the CDC warning because the logic “looks right,” sending a fundamental metastability flaw straight to silicon. The fact that Gemini definitively refused to budge on the methodology is exactly the kind of safety net hardware engineering requires.
But models change constantly. This introduces a new operational requirement for hardware teams: regression testing. You cannot blindly deploy a model update just because its generic coding benchmarks improved. Companies must build their own suite of CDC and methodology tests and run rigorous regressions on every model version to ensure these fundamental capabilities remain intact.
(This specific test gets into the deep physics of runt glitches and metastability calculations. Analyzing what the models got wrong here tells you everything you need to know about how AI handles complex hardware constraints. I’ll do a full teardown of this problem—and exactly why a runt glitch breaks a synchronizer’s safety envelope—in the next post.)
Why this happens
The explanation is straightforward: training data.
Large language models learn from the data they’re trained on. The public internet contains billions of lines of Python, JavaScript, Java, and C++. It contains a vanishingly small amount of SystemVerilog, and even less of it is production-quality RTL with correct CDC handling. Verilog and SystemVerilog combined account for fewer than 7,000 publicly tagged repositories on GitHub — against over 500 million total.
Smaller models, trained on smaller datasets, have seen even less hardware content. A 14B parameter model has likely encountered a handful of synchronizer examples in its entire training set — if any. It has learned that synchronizers involve two flops and a clock, but it hasn’t seen enough correct examples to learn which clock matters and why.
Frontier models — Claude, GPT-4 — have larger training sets and more capacity to retain domain-specific knowledge. They’ve seen more RTL, more CDC documentation, more EDA tool references. But even they are working from a fundamentally impoverished corpus compared to what they have for software.
This is the training data gap that Moshe Zalcberg identified in his hype cycle analysis. It’s not just an abstract structural problem. It shows up concretely, in wrong clocks on synchronizer flops, in incorrect reset sequencing, in SDC constraints that are syntactically legal but semantically wrong.
The NVIDIA lesson
This is also exactly why NVIDIA built ChipNeMo. As I discussed in Post 1, NVIDIA invested in domain-adaptive pretraining on 23 billion tokens of their own internal chip design data — 30 years of design documents, bug reports, and verification scripts. They did this because they understood that general-purpose models, no matter how large, don’t have enough hardware knowledge to be reliable.
But here’s the uncomfortable follow-up: NVIDIA can do this because they’re NVIDIA. They have the data, the compute, and the institutional history to build a domain-specific model. Only a handful of top companies have the resources to do something similar.
Everyone else — and that’s the vast majority of the semiconductor industry — does not have 30 years of proprietary RTL, bug databases, and design reviews to train their own model. They cannot build a ChipNeMo. They are entirely dependent on what the commercially available LLMs offer out of the box. For these companies, the quality of the frontier models isn’t a nice-to-have — it’s the whole story. If the best available model can’t write a correct synchronizer, there is no fallback. There is no internal corpus to fine-tune against, no domain-adapted alternative to switch to.
This is why model selection matters so much more in chip design than in software. A software team that picks a weaker model gets slower code reviews and buggier autocomplete — annoyances they can iterate past. A chip design team that picks a weaker model gets wrong clocks on synchronizers, incorrect CDC assumptions, and SDC constraints that look right but aren’t. And they may not find out until silicon.
What I’ve seen work (and what doesn’t)
I want to be specific here, because vague claims about AI quality are part of the noise this series is trying to cut through.
Frontier models (Claude Opus/Sonnet, GPT-4) — reliable for structured hardware tasks. Synchronizers, FSMs, protocol implementations, SVA assertions, SDC constraints. They understand the domain well enough that the output is correct more often than not, and when they make mistakes, they can usually diagnose and fix them when shown the error. Claude in particular has been consistently strong on CDC-related work — I’ve used it extensively for assertion writing and formal verification setup, and it understands the semantics, not just the syntax.
Mid-tier models — hit or miss. They can produce useful boilerplate and handle simple RTL tasks, but they start failing on anything that requires domain-specific reasoning. CDC, timing constraints, reset sequencing — the areas where getting the semantics wrong is dangerous — are unreliable.
Small open-source models (sub-20B parameters) — not ready for hardware. The synchronizer test is the simple version. These models also struggle with basic concepts like the difference between blocking and non-blocking assignments in the context of synthesis, or why you can’t have combinational loops in synchronous logic. They have learned Verilog syntax from whatever fragments exist in their training data, but they haven’t learned hardware design.
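The blocking-versus-non-blocking failure is worth illustrating, because it is another case where wrong code looks almost identical to right code. This is the classic textbook pitfall, sketched with illustrative names:

```systemverilog
// Classic blocking-vs-non-blocking pitfall (illustrative names).
module shift_demo (
  input  logic clk,
  input  logic d,
  output logic q2
);
  logic q1;

  // WRONG for sequential logic: blocking assignments. q1 updates
  // immediately, so q2 sees the NEW value of q1 in the same step.
  // Synthesis collapses this into a single flop, not two.
  //   always @(posedge clk) begin
  //     q1 = d;
  //     q2 = q1;
  //   end

  // Correct: non-blocking assignments. All right-hand sides are
  // sampled first, then updated together -- a true two-stage pipeline.
  always_ff @(posedge clk) begin
    q1 <= d;
    q2 <= q1;
  end
endmodule
```

A model that has learned syntax but not synthesis semantics will happily produce either version, because both simulate without errors in many testbenches.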
This doesn’t mean small models are useless everywhere. For scripting, documentation, code formatting, and other tasks where hardware domain knowledge isn’t critical, they can be fine. But for anything that touches the actual design — RTL, constraints, assertions, verification — model selection is a professional decision with real consequences.
Model selection is now an engineering skill
This brings me to the point I want to leave you with.
When the industry talks about “using AI for chip design,” it often treats the model as interchangeable — as if the magic is in the workflow, the RAG system, the agent framework, the integration with EDA tools. Those things matter. But they’re all downstream of a more fundamental question: does the model understand hardware?
If it doesn’t, no amount of prompt engineering or toolchain integration will save you. You’ll get confident, syntactically correct output that embeds domain errors you may not catch until silicon. And the smaller your team, the fewer experienced engineers you have to review that output, the more dangerous this becomes.
Choosing the right model for hardware work is not a procurement decision. It’s an engineering decision, and it deserves the same rigor you’d apply to selecting an EDA tool or a verification methodology.
So what does that rigor look like in practice?
It means you don’t evaluate AI for your team by reading marketing claims, and you certainly don’t rely on generic software coding benchmarks. You need to build your own “synchronizer test.” You take half a dozen fundamental, domain-specific tasks that are critical to your workflow—a specific CDC scenario, a tricky SDC constraint problem, a finite state machine with specific reset conditions—and you see how the models handle them. You test them on edge cases. You see if they can correct themselves when you explain an error in their logic.
You establish this baseline of hardware competence before you roll the tool out to your engineers. Because as we saw earlier, you only get one chance to build their trust. If you hand them a model that doesn’t understand the domain, they won’t blame the model. They’ll blame the technology, and they’ll go back to writing RTL by hand.
In the next post, we will do a full teardown of the advanced CDC synchronizer test, breaking down exactly how these models handled the deep physics of hardware methodology and what it means for your verification flow. After that, we’ll move to the spec: what it means to treat it as the central artifact in an AI-assisted flow, and why Design by Contract isn’t just a software concept.
Marco Brambilla is a semiconductor industry veteran with 25 years in chip design, most recently as Senior Technical Director at Meta Reality Labs. He writes about AI, chip design, and the future of hardware engineering at Above the RTL.
