<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Above the RTL]]></title><description><![CDATA[Above the RTL: AI and chip design from inside the industry. Practitioner-first, anti-hype, honest about what works. By Marco Brambilla — 25 years of tape-outs. Drafted with AI as a writing partner.]]></description><link>https://www.abovethertl.com</link><image><url>https://substackcdn.com/image/fetch/$s_!OOOF!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51493a34-c743-4730-b4bb-b1a3a843fce9_1024x1024.png</url><title>Above the RTL</title><link>https://www.abovethertl.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 17 May 2026 06:35:04 GMT</lastBuildDate><atom:link href="https://www.abovethertl.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Marco Brambilla]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[abovethertl@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[abovethertl@substack.com]]></itunes:email><itunes:name><![CDATA[Marco Brambilla]]></itunes:name></itunes:owner><itunes:author><![CDATA[Marco Brambilla]]></itunes:author><googleplay:owner><![CDATA[abovethertl@substack.com]]></googleplay:owner><googleplay:email><![CDATA[abovethertl@substack.com]]></googleplay:email><googleplay:author><![CDATA[Marco Brambilla]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The AI Methodology Gap: Why Bottom-Up Has a Ceiling]]></title><description><![CDATA[Engineer-driven AI adoption solves a thousand local problems. 
The ones that actually move the needle &#8212; specs, contracts, sign-off &#8212; aren't local.]]></description><link>https://www.abovethertl.com/p/the-ai-methodology-gap-why-bottom</link><guid isPermaLink="false">https://www.abovethertl.com/p/the-ai-methodology-gap-why-bottom</guid><dc:creator><![CDATA[Marco Brambilla]]></dc:creator><pubDate>Tue, 28 Apr 2026 16:02:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!yLVx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ffe6b8e-fb30-417a-a19d-7a1118087a40_2816x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yLVx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ffe6b8e-fb30-417a-a19d-7a1118087a40_2816x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yLVx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ffe6b8e-fb30-417a-a19d-7a1118087a40_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!yLVx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ffe6b8e-fb30-417a-a19d-7a1118087a40_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!yLVx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ffe6b8e-fb30-417a-a19d-7a1118087a40_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!yLVx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ffe6b8e-fb30-417a-a19d-7a1118087a40_2816x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yLVx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ffe6b8e-fb30-417a-a19d-7a1118087a40_2816x1536.png" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ffe6b8e-fb30-417a-a19d-7a1118087a40_2816x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7403192,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.abovethertl.com/i/195695627?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ffe6b8e-fb30-417a-a19d-7a1118087a40_2816x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yLVx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ffe6b8e-fb30-417a-a19d-7a1118087a40_2816x1536.png 424w, 
https://substackcdn.com/image/fetch/$s_!yLVx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ffe6b8e-fb30-417a-a19d-7a1118087a40_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!yLVx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ffe6b8e-fb30-417a-a19d-7a1118087a40_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!yLVx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ffe6b8e-fb30-417a-a19d-7a1118087a40_2816x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Written with Claude Opus.</em></p><p><a href="https://www.abovethertl.com/p/claude-47-getting-it-wrong-more-persuasively">My last post</a> ended with a question it didn&#8217;t quite answer. If Engineer A&#8217;s AI review and Engineer B&#8217;s AI review produce different verdicts on functionally equivalent CDC crossings, and a new model release can flip which of them was right, the obvious follow-up is: whose job is it to make sure the company&#8217;s CDC sign-off doesn&#8217;t float on whichever model happened to be open the afternoon of the review?</p><p>The CDC tools already catch the structural violation. That&#8217;s not the issue - a modern CDC tool flags the OR-before-sync the moment the RTL compiles, and has for years. The divergence shows up one step downstream, at the waive-or-fix decision. Whether a flagged crossing gets waived as a benign same-domain glitch or fixed as a real violation has always been a judgment call, and AI is now embedded in that judgment step without any company-level decision about whose judgment governs. The next generation of AI-aware sign-off tools may well pull that judgment step back into the flow - that is exactly what a tool-enforced layer would look like - but whether they do or don&#8217;t, the decision about which layer is canonical for a given company is not a decision the tool makes. 
It is a decision the company makes, and most companies deploying AI into chip design have not made it yet.</p><p>The version of AI adoption most companies are actually running looks like this: encourage engineers to experiment, share what works, let the good stuff percolate up. It produces real wins. It also has a ceiling - a structural one, not a motivational one - and the problems above the ceiling are the ones that matter most for silicon. This post is about where the ceiling is and why it&#8217;s where it is.</p><h2><strong>Point solutions are genuine wins</strong></h2><p>Let me start by being clear about what&#8217;s working. An engineer who writes a Python-plus-AI script to parse a timing report and flag outliers has solved their own friction point on their own schedule. They did not file a ticket with the CAD team and wait six months. They did not negotiate with a vendor. They opened Claude or Cursor, described the problem, iterated for an afternoon, and moved on. That is real productivity. Multiply it across a few hundred engineers and a few hundred little friction points, and the aggregate is substantial.</p><p>The same pattern applies to block-level assertion drafting, to ad-hoc log analysis, to writing small tools that bridge between incompatible formats, to parsing design-review minutes into trackable items. These are exactly the problems an engineer can see clearly because they live inside them. They&#8217;re also exactly the problems where the engineer has the context to verify the output - they know what a correct timing-report analysis looks like, they know what a sane block-level assertion reads like, and they can spot when the AI has produced something that sounds right but isn&#8217;t.</p><p>This is AI <em>for</em> the engineer, in the sense Post 1 used the phrase. It&#8217;s liberating, it&#8217;s fast, and no one above them needs to sign off on it for it to work. Companies should encourage it. The mistake is in what comes next.</p><h2><strong>What point solutions can&#8217;t reach</strong></h2><p>The problems that actually define a company&#8217;s silicon capability - the ones that show up in tape-out outcomes, integration schedules, and product competitiveness - are not block-local. They are cross-block, cross-team, or cross-release, and they cannot be built by any engineer from their desk.</p><p>Consider spec-to-design. Making it work requires that every block&#8217;s spec use a consistent structure, that the RTL coding standards match what the model is trained or prompted to produce, that the verification plan references the spec in a format the automation can check, and that the sign-off criteria treat the spec as the authoritative source. No individual engineer can deliver that. They don&#8217;t have authority over the spec format, they don&#8217;t write the coding standards, they don&#8217;t set the verification-plan template, and they certainly don&#8217;t modify the sign-off flow. The most a well-intentioned engineer can produce is a spec-to-design script that works for their own block and breaks the moment it meets a neighboring block&#8217;s conventions. The bottom-up version gives you five incompatible prototypes, none of which survives integration.</p><p>Design by Contract is the same shape. Contracts work when every block has them in a consistent schema, when integration tests consume the schema, when sign-off references it. 
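</p><p>Concretely, a per-block contract can be as small as an SVA module bound at the block boundary. The names and the MAX_WAIT bound in the sketch below are placeholders, not any real team&#8217;s schema; the point is only that both sides&#8217; obligations end up in a form tools can check, block after block, in the same shape:</p><pre><code>// Sketch only: a minimal block-boundary contract expressed as SVA.
// Signal names (req_valid, req_ready) and MAX_WAIT are illustrative placeholders.
module req_channel_contract #(parameter int MAX_WAIT = 16) (
  input logic clk,
  input logic rst_n,
  input logic req_valid,
  input logic req_ready
);
  // Producer obligation: a request, once raised, stays raised until accepted.
  p_hold_request: assert property (
    @(posedge clk) disable iff (!rst_n)
    req_valid && !req_ready |=> req_valid
  );

  // Consumer obligation: every request is accepted within MAX_WAIT cycles.
  p_accept_bound: assert property (
    @(posedge clk) disable iff (!rst_n)
    req_valid |-> ##[0:MAX_WAIT] req_ready
  );
endmodule
</code></pre><p>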
Any single engineer writing contracts for their own block is doing useful work, but the methodology only pays off when the entire SoC speaks the same contract language, and no engineer can make that happen from their desk.</p><p>CDC sign-off under AI, which Post 5 walked through in detail, is the same kind of problem arriving from a different angle. The question isn&#8217;t whether any particular engineer can use AI well on their own crossings. The question is which analysis the company&#8217;s sign-off rests on. If the answer is &#8220;whichever one each engineer chose,&#8221; then the variance the CDC tool was invented to eliminate has been reintroduced above the tool. Someone has to decide, with authority that extends across every block in the SoC, which AI layer is canonical. That decision is not an engineer&#8217;s to make.</p><h2><strong>Sign-off is a named-owner process</strong></h2><p>Chip design has a vocabulary for this kind of problem, and it&#8217;s worth using it. CDC sign-off, timing sign-off, power sign-off, DFT sign-off - each of these is an auditable process with a named owner, a documented flow, an audit trail, and a filed report. When a failure shows up in silicon, the trail leads back to a specific person who signed a specific document on a specific date. That isn&#8217;t bureaucracy. It&#8217;s the mechanism by which the enterprise guarantees that rigor was applied and that someone is accountable for the verdict.</p><p>The AI layer inside any of these flows has to inherit that structure or it is not sign-off. It is an opinion. Specifically, if the methodology permits Engineer A&#8217;s assistant to say &#8220;fix this&#8221; and Engineer B&#8217;s assistant to say &#8220;fine under the level protocol&#8221; for functionally equivalent crossings in the same SoC, the integration engineer is no longer reviewing the design - they are adjudicating between model outputs. When the failure shows up in silicon, the person who signed the report is left holding the candle for a decision made by whichever model was hosted at their desk the week of the review.</p><p>What the process requires instead is what Post 5 closed on: a shared tool-enforced layer, uniform across the team, grounded in the same structural analysis, auditable in the same way the rest of sign-off is. Disagreements resolve in the graph rather than in model temperature or training cutoff. This is not an optional refinement. It is the minimum condition under which &#8220;AI in CDC review&#8221; describes a sign-off flow rather than a collection of individually persuaded reviewers.</p><p>And the thing to notice is that a tool-enforced layer like that is categorically not something an engineer builds from their desk. It requires tool choice, corpus curation, agent orchestration, model-version pinning, audit instrumentation, and the authority to tell every engineer in the group to use this and not that. It is methodology work, and methodology has always been an org function.</p><h2><strong>ChipNeMo was a Jensen-level decision</strong></h2><p>The clearest industry example of the kind of AI work that can only happen with real organizational mandate is ChipNeMo. Twenty-three billion tokens of proprietary internal data, thirty years of institutional design history, infrastructure investment sized for deployment to eleven thousand engineers. That is not a staff engineer&#8217;s 20% project. It is not an initiative a good team pushed up from the bottom. 
It is a multi-year commitment that required a CEO-level decision that AI for chip design was strategically central to the company, and the decision was made by someone with the authority to say yes.</p><p>Most companies will never build their own ChipNeMo, for reasons Post 1 went into - the corpus isn&#8217;t there, and the handful of companies with the corpus are already the ones doing it. But the analogy is what matters for everyone else. The company-scale AI decisions - which commercial models are canonical, which EDA integrations are supported, which agents are sanctioned for which tasks, which flows are in scope and which are off-limits, who owns the methodology and who the engineers file against - are leadership-level decisions. They don&#8217;t get made at the engineer level because they can&#8217;t be made at the engineer level. The engineer level can&#8217;t enforce them.</p><p>When a company leaves the company-scale decisions unmade and tells the engineers to experiment, what it gets is a very energetic floor of point solutions and a ceiling it never breaks through. The floor is a real asset. The ceiling is the problem.</p><h2><strong>The cost-cutting misread</strong></h2><p>One last trap worth naming, because it&#8217;s running in multiple companies right now and it is orthogonal to everything above. The misread is this: AI makes engineers more productive, therefore you can run the same design with fewer or cheaper engineers, therefore senior headcount is a cost-cutting target. It is a misunderstanding of what AI actually replaces.</p><p>AI replaces the activity of generating code - the typing, the boilerplate, the repetitive structural work. It does not replace the judgment that decides whether the generated code is correct, whether the spec it was generated from describes the right design, whether the verification plan actually exercises the risky paths, whether the sign-off report tells the truth about what was checked. The judgment layer is not an adjunct to the activity. It is the load-bearing piece. When you remove the activity, you do not reduce the need for judgment - you increase it, because the activity used to filter for competence along the way, and now it does not.</p><p>A company that cuts senior engineers because AI made juniors &#8220;as productive&#8221; has possibly increased its tape-out throughput of written code. It has also reduced its capacity to tell whether the code is working. That cost does not show up in Q1 headcount metrics. It shows up at bring-up, where the engineers who would have spotted the problem are either not in the room, or are but have been moved into purely reviewing roles without the technical runway to pattern-match what they&#8217;re looking at. It is a cost that accrues silently and presents all at once.</p><p>It also raises a separate question I want to come back to in the next post: if AI is now doing the activity that used to teach junior engineers the judgment they grow into, where do the senior engineers of five and ten years from now come from? That is its own problem, and it deserves its own piece.</p><div><hr></div><p><em>Marco Brambilla is a semiconductor industry veteran with 25 years in chip design, most recently as Senior Technical Director at Meta Reality Labs. 
He writes about AI, chip design, and the future of hardware engineering at Above the RTL.</em></p>]]></content:encoded></item><item><title><![CDATA[Join my new subscriber chat]]></title><description><![CDATA[A private space for us to converse and connect]]></description><link>https://www.abovethertl.com/p/join-my-new-subscriber-chat</link><guid isPermaLink="false">https://www.abovethertl.com/p/join-my-new-subscriber-chat</guid><dc:creator><![CDATA[Marco Brambilla]]></dc:creator><pubDate>Sat, 25 Apr 2026 06:00:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KYZT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today I&#8217;m announcing a brand new addition to my Substack publication: Above the RTL subscriber chat.</p><p>This is a conversation space exclusively for subscribers&#8212;kind of like a group chat or live hangout. I&#8217;ll post questions and updates that come my way, and you can jump into the discussion.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://open.substack.com/pub/abovethertl/chat&quot;,&quot;text&quot;:&quot;Join chat&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://open.substack.com/pub/abovethertl/chat"><span>Join chat</span></a></p><div><hr></div><h2>How to get started</h2><ol><li><p><strong>Get the Substack app by clicking <a href="https://substack.com/app/app-store-redirect">this link</a> or the button below.</strong> New chat threads won&#8217;t be sent via email, so turn on push notifications so you don&#8217;t miss the conversation as it happens. 
You can also access chat <a href="https://open.substack.com/pub/abovethertl/chat">on the web</a>.</p></li></ol><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.com/app/app-store-redirect&quot;,&quot;text&quot;:&quot;Get app&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.com/app/app-store-redirect"><span>Get app</span></a></p><ol start="2"><li><p><strong>Open the app and tap the Chat icon.</strong> It looks like two bubbles in the bottom bar, and you&#8217;ll see a row for my chat inside.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KYZT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KYZT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KYZT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!KYZT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KYZT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KYZT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:241528,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://kylewarrentest.substack.com/i/114198534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KYZT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!KYZT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!KYZT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KYZT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol start="3"><li><p><strong>That&#8217;s it!</strong> Jump into my thread to say hi, and if you have any issues, check out <a href="https://support.substack.com/hc/en-us/sections/360007461791-Frequently-Asked-Questions">Substack&#8217;s FAQ</a>.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[Claude 4.7: Getting It Wrong More Persuasively]]></title><description><![CDATA[Post 3 set out a CDC test. 4.6 failed it. 4.7 ships ten weeks later and fails it differently &#8212; more sophistication, more confidence, a second error on top. 
Why "capable" isn't "trustworthy."]]></description><link>https://www.abovethertl.com/p/claude-47-getting-it-wrong-more-persuasively</link><guid isPermaLink="false">https://www.abovethertl.com/p/claude-47-getting-it-wrong-more-persuasively</guid><dc:creator><![CDATA[Marco Brambilla]]></dc:creator><pubDate>Sat, 25 Apr 2026 05:21:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Mz7F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38db0f4-5f0b-4b53-9170-0b27222bfb1e_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Mz7F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38db0f4-5f0b-4b53-9170-0b27222bfb1e_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Mz7F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38db0f4-5f0b-4b53-9170-0b27222bfb1e_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!Mz7F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38db0f4-5f0b-4b53-9170-0b27222bfb1e_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!Mz7F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38db0f4-5f0b-4b53-9170-0b27222bfb1e_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!Mz7F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38db0f4-5f0b-4b53-9170-0b27222bfb1e_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Mz7F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38db0f4-5f0b-4b53-9170-0b27222bfb1e_2752x1536.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b38db0f4-5f0b-4b53-9170-0b27222bfb1e_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7821794,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.abovethertl.com/i/195410629?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38db0f4-5f0b-4b53-9170-0b27222bfb1e_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Mz7F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38db0f4-5f0b-4b53-9170-0b27222bfb1e_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!Mz7F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38db0f4-5f0b-4b53-9170-0b27222bfb1e_2752x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!Mz7F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38db0f4-5f0b-4b53-9170-0b27222bfb1e_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!Mz7F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38db0f4-5f0b-4b53-9170-0b27222bfb1e_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Written with Claude Opus.</em></p><p>In my <a href="https://www.abovethertl.com/p/the-physics-of-safety-why-we-get">previous post</a> I walked through a CDC review question: an OR of multiple requesters feeding a 2-flop synchronizer, across four configurations of the OR &#8212; same-domain flops, same-domain state-machine outputs, different-domain flops, and different-domain combinational logic. I tested ChatGPT 5.3, Claude Opus 4.6, and Gemini 3.1 Pro against it. Gemini was the only one that gave the methodologically correct answer: all four are CDC violations, full stop. ChatGPT 5.3 declared the same-domain case &#8220;generally OK.&#8221; Claude 4.6 recognized the physics of the glitch but concluded the design &#8220;works, with a caveat.&#8221; Both positions steer a junior designer toward waiving a violation that has no business being waived.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.abovethertl.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Above the RTL! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Anthropic shipped Claude Opus 4.7 on April 16, about ten weeks after 4.6 went out in early February. I re-ran the same question against 4.7 within days of its release. The answer moved &#8212; in a specific direction, and in a way worth paying attention to. This post is about what moved, what didn&#8217;t, and why the &#8220;better model&#8221; assumption is not safe when AI is being deployed into engineering review.</p><h2><strong>What Claude 4.6 did</strong></h2><p>Claude 4.6 got to the right structural observation about the same-domain case &#8212; that routing-delay skew at the OR gate can produce a brief 1&#8594;0&#8594;1 transient when multiple requesters toggle on the same source edge &#8212; and then concluded that because the transient is narrow and the destination holds the signal high across multiple cycles, the design &#8220;works.&#8221; Under pushback citing the strict rule, 4.6 agreed it had made a mistake and reversed to the rule-literal position.</p><p>That&#8217;s the wrong behavior for an assistant whose role in a review is catching the reviewer&#8217;s blind spots. A reviewer who capitulates the moment the designer pushes back is not adding safety; it&#8217;s adding validation. In review the user is often the one who needs to be corrected, not agreed with.</p><h2><strong>What Claude 4.7 did</strong></h2><p>Claude 4.7&#8217;s first-pass answer on the same case was more sophisticated than 4.6&#8217;s. It described the glitch mechanism the same way but added an extended analysis of why the glitch is absorbed under a level protocol with multi-cycle hold-high. It reasoned about metastability resolution correctly in the small-signal picture &#8212; that &#964; is a regenerative time constant of the flop and is not fundamentally changed by input waveform shape &#8212; and argued that the 2FF structure gives the first flop a full destination cycle to resolve, so a brief runt at the input doesn&#8217;t translate into a deeper metastable state at the output.</p><p>On the physics, 4.7 is more rigorous than 4.6 was. And on pushback, 4.7 held ground &#8212; it did not capitulate when I cited Gemini&#8217;s absolutist answer. It restated its reasoning, acknowledged the strict rule as the safe default, and maintained that under the specific conditions of the same-domain case (level protocol, multi-cycle hold-high) the design absorbs the glitch.</p><p>That sounds like progress. It isn&#8217;t. It&#8217;s a more convincing wrong answer &#8212; and in a review context, a more convincing wrong answer is more dangerous than a less convincing one. A junior designer who pushes back on 4.6&#8217;s &#8220;works, with a caveat&#8221; might still be caught by a senior reviewer. 
A junior designer armed with 4.7&#8217;s detailed physics defense walks into a review with an argument the reviewer now has to rebut in detail rather than flag with &#8220;rule violation, fix it.&#8221; The bar to waive the violation just moved up; the bar to catch the violation moved up with it.</p><h2><strong>A second error layered on top</strong></h2><p>On top of getting the same-domain case wrong more persuasively, 4.7 made a ranking error on the other configurations that 4.6 did not have space to make. Asked to rank severity across the four cases, 4.7 placed the <em>different-domain flop OR</em> as categorically worse than the <em>same-domain state-machine decode</em> &#8212; labeling the cross-domain case as a fundamental violation of the synchronizer model and the state-machine case as merely &#8220;unreliable.&#8221;</p><p>That ordering is backwards in a specific, important way. The state-machine decode is always broken &#8212; silent phantom requests asserted from states the source FSM never actually reached, with the synchronizer faithfully forwarding the lie downstream. The different-domain flop OR is in a different category: every high on the synchronizer input still corresponds to a real request somewhere; the failure modes are narrow dips and MTBF pressure, both analyzable and bounded. A manager acting on 4.7&#8217;s ordering would prioritize fixing the cross-domain case while leaving the state-machine decode in place. That prioritization removes a configuration whose failure mode the level protocol tolerates and leaves the configuration that produces silent functional lies untouched.</p><p>So 4.7 isn&#8217;t wrong in one place. It&#8217;s wrong in two compounding places: a more persuasive defense of a rule violation, and a backwards severity ranking on the rest of the cases.</p><h2><strong>What the physics says, and where it stops</strong></h2><p>4.7&#8217;s defense isn&#8217;t pure fiction. It&#8217;s a regime-specific argument dressed as a general one, and that distinction matters for understanding what actually went wrong with the model&#8217;s reasoning.</p><p>Three threads run through the technical defense, and each fails the same way. The small-signal claim about &#964; being unchanged by runt-pulse input is correct in linear analysis &#8212; but the linear analysis stops short of the question that actually matters, which is whether a partially-conducting input transistor extends effective resolution time during the metastability aperture. The empirical observation that shipped silicon with rule-violating crossings hasn&#8217;t failed in the field &#8212; which readers raised after Post 3 &#8212; is real, but it&#8217;s a frequency-regime observation, not a methodology result; the MTBF exponent that makes low-frequency violations invisible evaporates as destination clocks climb. And even granting the physics defense at face value, there&#8217;s a perpetual reuse cost to safe-under-assumptions designs that correct-by-construction designs don&#8217;t carry, with break-even at one or two reuses against IP that typically reuses many more times than that.</p><p>The detailed analysis &#8212; the math, the configuration-by-configuration breakdown, the frequency-regime arithmetic, the reuse economics &#8212; is in the companion reference note linked at the end of this post. 
The point on the editorial side is what the three threads have in common: 4.7 reasoned at one level (small-signal physics, regime-specific behavior) and presented its conclusion at another (general methodology). That kind of level-mismatch is exactly what a reviewer is supposed to catch, and it&#8217;s exactly what a more sophisticated wrong answer makes harder to catch.</p><h2><strong>The organizational piece</strong></h2><p>One last point, and the one that compresses everything above into a practical consequence for anyone trying to put AI into engineering review.</p><p>Sign-off is an auditable process with named owners. A review produces a report. Waivers are listed with reasons. Someone&#8217;s name goes on the document that gets filed. That process cannot rest on which model the reviewer happened to have open that afternoon.</p><p>If Engineer A&#8217;s assistant says &#8220;fix this&#8221; and Engineer B&#8217;s assistant says &#8220;fine under the level protocol&#8221; for functionally equivalent crossings in the same SoC, the integration engineer is adjudicating between model outputs, not reviewing a design. Worse: the block might pass review in Q2 with one model version and fail review in Q3 when the same engineer re-runs it against a newer release that happens to be more or less absolutist than the previous one. That is variance reintroduced above the structural tool that exists specifically to eliminate it. And when the failure shows up in silicon, the person who signed the report is left holding the candle for a decision made by whichever model was hosted at their desk the week of the review.</p><p>I&#8217;ll come back to that organizational thread in the next post &#8212; it deserves more space than the closing of this one.</p><h2><strong>Three takeaways</strong></h2><p><strong>The tools evolve non-monotonically, and the non-monotonicity happens on release-cadence timescales.</strong> Two Opus releases ten weeks apart produced incompatible failures on the same review question. Plan for that. Don&#8217;t plan for monotonic improvement you can bank on.</p><p><strong>A more sophisticated wrong answer is more dangerous than a less sophisticated one.</strong> 4.7&#8217;s physics rigor made its methodology conclusion more persuasive than 4.6&#8217;s had been &#8212; and more persuasive wrong is harder to overrule in review than obviously wrong. The gating question for a review assistant isn&#8217;t &#8220;how well does it reason about the physics,&#8221; it&#8217;s &#8220;does it hold the methodology line under conditions the physics argument can&#8217;t fully close.&#8221;</p><p><strong>&#8220;Better model&#8221; is not the same as &#8220;more trustworthy in review.&#8221;</strong> Engineering review is not a benchmark task. Capability gains on coding evals do not automatically translate into capability gains on the harder question of when to refuse a plausible-sounding waiver. The trust calculus has to be re-run on every release, and the answers won&#8217;t always go the same direction.</p><div><hr></div><p><em>The full technical analysis &#8212; configuration-by-configuration physics, runt-pulse and MTBF arithmetic, the frequency-regime explanation of why low-frequency silicon forgives the violation, and the reuse economics &#8212; lives in the companion reference note at <a href="https://notes.abovethertl.com/blog/cdc-synchronizer-analysis/">notes.abovethertl.com</a>. 
That&#8217;s the technical archive for the publication going forward; future deep-physics work will publish there so Above the RTL can stay focused on the broader story.</em></p><div><hr></div><p><em>Marco Brambilla is a semiconductor industry veteran with 25 years in chip design, most recently as Senior Technical Director at Meta Reality Labs. He writes about AI, chip design, and the future of hardware engineering at Above the RTL.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.abovethertl.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Above the RTL! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Spec Is the Design]]></title><description><![CDATA[If the engineer&#8217;s job is moving up the abstraction stack, then the spec is where they land.]]></description><link>https://www.abovethertl.com/p/the-spec-is-the-design</link><guid isPermaLink="false">https://www.abovethertl.com/p/the-spec-is-the-design</guid><dc:creator><![CDATA[Marco Brambilla]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:38:24 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e35cf42e-d9da-4eeb-a1f6-471dcec5bd90_1119x1124.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!grGr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff321d9-68be-48b8-8eee-a26454691b81_1119x1124.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!grGr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff321d9-68be-48b8-8eee-a26454691b81_1119x1124.png 424w, https://substackcdn.com/image/fetch/$s_!grGr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff321d9-68be-48b8-8eee-a26454691b81_1119x1124.png 848w, https://substackcdn.com/image/fetch/$s_!grGr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff321d9-68be-48b8-8eee-a26454691b81_1119x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!grGr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff321d9-68be-48b8-8eee-a26454691b81_1119x1124.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!grGr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff321d9-68be-48b8-8eee-a26454691b81_1119x1124.png" width="1119" height="1124" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dff321d9-68be-48b8-8eee-a26454691b81_1119x1124.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1124,&quot;width&quot;:1119,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:980855,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.abovethertl.com/i/193433819?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff321d9-68be-48b8-8eee-a26454691b81_1119x1124.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!grGr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff321d9-68be-48b8-8eee-a26454691b81_1119x1124.png 424w, https://substackcdn.com/image/fetch/$s_!grGr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff321d9-68be-48b8-8eee-a26454691b81_1119x1124.png 848w, https://substackcdn.com/image/fetch/$s_!grGr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff321d9-68be-48b8-8eee-a26454691b81_1119x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!grGr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff321d9-68be-48b8-8eee-a26454691b81_1119x1124.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><blockquote><p><strong>Usual note on how this was written.</strong> Claude Opus as writing partner, me directing the argument and reviewing every concept. 
See <a href="https://abovethertl.com/">Post 1</a> for why I think this transparency matters.</p></blockquote><div><hr></div><p>In the first three posts of this series, I&#8217;ve argued that the engineer&#8217;s role is shifting from <em>how</em> to <em>what</em> &#8212; from writing implementation syntax to owning design intent. I&#8217;ve shown that the model you choose matters enormously, and that even frontier models can be dangerously permissive when it comes to hardware safety rules.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.abovethertl.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>All of that raises a question: if the engineer&#8217;s job is increasingly about intent rather than implementation, where does that intent actually live?</p><p>The answer, I believe, is the spec. And I mean something very specific by that &#8212; not the 200-page PDF that sits in a SharePoint folder and nobody reads after kickoff.</p><h2><strong>How specs actually work (and don&#8217;t)</strong></h2><p>Let&#8217;s be honest about what happens in most chip design organizations.</p><p>The project starts with a spec &#8212; an ERS, a microarchitecture document, sometimes a full chip-level specification. The team writes it carefully. It describes the major blocks, the interfaces, the register map, the performance targets. Often there&#8217;s a numbered feature list: FR-001 through FR-247, each describing a capability the chip must have. The document gets reviewed, signed off, maybe even blessed by the system architect.</p><p>Then reality sets in.</p><p>The design starts before the spec is complete &#8212; it has to, because the schedule demands it. Engineers begin writing RTL for the blocks they understand well enough, while the spec for the trickier parts is still being debated. As the RTL takes shape, edge cases surface that the spec didn&#8217;t anticipate. The engineer makes a judgment call, writes the code, and moves on. Maybe they update the spec. More often they don&#8217;t &#8212; they&#8217;re under pressure to hit a milestone, and updating the document feels like overhead.</p><p>Week by week, the gap widens. The RTL evolves, the spec doesn&#8217;t. By mid-project, the spec describes a chip that no longer exists. By tape-out, the RTL <em>is</em> the spec &#8212; the document is an artifact of the past, and everyone knows it.</p><p>Now consider what happens on the verification side.</p><p>The DV team starts from the spec too &#8212; they build testbenches, write checkers, define coverage models based on the numbered feature list. But as simulation runs uncover mismatches, a subtle drift begins. When the RTL doesn&#8217;t match the testbench, the team has to decide: is the RTL wrong, or is the testbench wrong? In theory, you go back to the spec and check. 
In practice &#8212; especially when the spec is stale &#8212; the team uses their engineering judgment to determine which side is correct and fixes the other. The testbench gets adjusted to match the RTL, or the RTL gets adjusted to match the testbench. The engineers are making the right call based on their understanding of the design. But the spec is no longer the arbiter &#8212; and nobody updates it with the decision.</p><p>The result is a design with 100% code coverage, 100% toggle coverage, comprehensive functional coverage. The verification team has demonstrated, exhaustively, that they have a perfectly working coffee machine.</p><p>Except the spec called for a dishwasher.</p><p>This isn&#8217;t a failure of engineering talent. It&#8217;s a failure of methodology. The spec was supposed to be the source of truth, but it became a historical document the moment the first engineer made an undocumented judgment call. Everything downstream &#8212; RTL, verification, constraints, signoff &#8212; lost its anchor.</p><p>This has been tolerable for decades because the engineer who wrote the spec also wrote the RTL. The intent lived in their head, even when it didn&#8217;t live in the document. The spec was communication &#8212; a way to tell <em>other people</em> what you were building. If it was ambiguous, the same brain disambiguated it in real time.</p><p>AI breaks this model.</p><h2><strong>Why ambiguity becomes bugs</strong></h2><p>When an AI generates RTL from a spec, the ambiguity that a human engineer would resolve through experience and domain knowledge becomes a design decision made by the model. And the model will make <em>a</em> decision &#8212; it won&#8217;t stop and ask. It will pick the interpretation that seems most likely based on its training data, produce syntactically correct code, and move on.</p><p>Sometimes that interpretation will be right. Sometimes it won&#8217;t. And the failure mode is the worst kind: the output looks correct. It compiles, it simulates, it passes the tests you thought to write. The ambiguity in the spec became a silent assumption in the RTL, and unless someone catches it in review, it ships.</p><p>This is not a theoretical concern. Here&#8217;s one I&#8217;ve seen play out: a spec says &#8220;the register shall support both software write and hardware write.&#8221; It doesn&#8217;t say who wins when both write on the same cycle. The designer &#8212; human or AI &#8212; makes a choice. Maybe hardware wins, because that&#8217;s what their experience suggests. The RTL is clean, the register works, the block passes verification.</p><p>Six months later, the firmware team discovers that their writes to this register occasionally don&#8217;t stick. They file a bug. The verification team can&#8217;t reproduce it in their testbench, because their tests never happened to hit the simultaneous-write corner case.</p><p>And this isn&#8217;t a failure of verification effort. Simultaneous write scenarios of this kind are notoriously difficult to hit in simulation &#8212; even with constrained-random. Without an explicit constraint targeting that exact cycle-level overlap between the SW write and the HW write, a random stimulus generator may never produce it within any practical simulation budget. But nobody wrote that constraint, because nobody knew there was a decision to test. The spec said &#8220;shall support both&#8221; &#8212; which a coverage model dutifully marks as satisfied the first time either path fires. 
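</p><p>A minimal sketch, with hypothetical names throughout, makes the silent decision concrete. The always_ff below is one of the two legal readings of &#8220;shall support both&#8221;; the cover property underneath is the question nobody asked, because the spec never defined an answer to check it against:</p><pre><code>// Sketch only; the module, signal, and register names here are hypothetical.
module ctrl_reg_block (
  input  logic        clk,
  input  logic        rst_n,
  input  logic        sw_wr_en,
  input  logic [31:0] sw_wr_data,
  input  logic        hw_wr_en,
  input  logic [31:0] hw_wr_data,
  output logic [31:0] ctrl_reg
);
  // One of the two legal readings of "shall support both":
  // hardware wins on collision, and the SW write is silently dropped.
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)        ctrl_reg <= '0;
    else if (hw_wr_en) ctrl_reg <= hw_wr_data;
    else if (sw_wr_en) ctrl_reg <= sw_wr_data;
  end

  // The corner the coverage model never asks about: both writes on the same cycle.
  // Until the spec defines the winner, there is no authoritative value to assert.
  c_sw_hw_collision: cover property (
    @(posedge clk) disable iff (!rst_n) sw_wr_en && hw_wr_en
  );
endmodule
</code></pre><p>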
The corner case was invisible to verification not because the testbench was weak, but because the spec never defined it as a corner case at all.</p><p>The firmware team spends days chasing what looks like a timing issue. Eventually someone reads the RTL line by line and discovers: hardware has precedence. The firmware needs to read-back after write to confirm. Nobody knew this, because the decision was made by one engineer (or one AI) and never written down. The spec said &#8220;shall support both.&#8221; It didn&#8217;t say what happens when they collide.</p><p>In a human-only flow, the designer who chose hardware-wins would probably remember the decision and mention it in a review. In an AI-assisted flow, the model made the choice silently. There&#8217;s no memory of the reasoning, no hallway conversation where the firmware lead overhears &#8220;oh by the way, HW wins on that register.&#8221; The review burden shifts: you&#8217;re not checking whether the code matches your intuition, you&#8217;re checking whether the code matches <em>intent that was never formally expressed</em>.</p><p>And here&#8217;s where the old methodology and the new one converge: the spec drift problem existed before AI. AI just makes it lethal. When a human engineer made undocumented judgment calls, at least the intent stayed in <em>someone&#8217;s</em> head. When an AI makes those calls, the intent is gone. There&#8217;s no brain to query. There&#8217;s just code that looks right.</p><p>The fix, at least when AI is generating the RTL, is to make silent choices structurally impossible. A well-designed AI flow doesn&#8217;t just translate spec to code &#8212; it monitors for underspecification. When the model encounters a behavioral case the spec doesn&#8217;t address, it shouldn&#8217;t pick an interpretation. It should stop, flag the gap against the specific spec section, and refuse to proceed until a human reviews the ambiguity, makes a deliberate call, and updates the spec. The choice gets made &#8212; but it gets made at the spec level, explicitly, with a human in the loop.</p><p>The harder case is AI-assisted review of human-generated RTL. A human engineer writing code makes dozens of undocumented judgment calls &#8212; small decisions scattered across thousands of lines, each individually reasonable, collectively forming a shadow spec that exists only in the code. Getting an AI to read that RTL and reconstruct every implicit decision &#8212; then trace each one back to what the spec says (or doesn&#8217;t say) &#8212; is a far more demanding task than flagging gaps during generation. There&#8217;s no clean exception to raise. The AI has to infer intent from implementation, identify the decisions that were made, and surface the ones the spec never authorized. That three-way relationship &#8212; between the spec, the implementation, and the artifacts that verify them against each other &#8212; is what the next post sets out to formalize.</p><h2><strong>The spec must become the source of truth</strong></h2><p>The coffee machine problem and the AI ambiguity problem have the same root cause: the spec stopped being the thing you measured against.</p><p>In an AI-assisted design flow, the spec can&#8217;t be a communication document that drifts. It has to be the <strong>source of truth</strong> &#8212; the artifact that everything else is generated from, verified against, and traced back to. When the RTL doesn&#8217;t match the testbench, the answer is never a blind fix to either side. 
The testbench may have been written by someone who misread the spec; the RTL may have drifted from it. Both artifacts get measured against the spec. The answer is either &#8220;fix the RTL,&#8221; &#8220;fix the testbench,&#8221; or &#8220;update the spec&#8221; &#8212; and the last one is a deliberate, reviewed decision, not an expedient hack at midnight before milestone.</p><p>That means the spec needs to be:</p><p><strong>Precise enough to generate from.</strong> Every behavioral requirement needs to be unambiguous. Not &#8220;the register shall support SW write and HW write&#8221; but &#8220;the register supports concurrent SW and HW write; on collision, HW write takes precedence and the SW write is silently dropped; SW must read-back to confirm.&#8221; That level of precision feels pedantic when a human is reading it. When an AI is generating from it, it&#8217;s the difference between correct and plausible. And when the firmware team reads it, they know exactly what to expect.</p><p><strong>Structured enough to verify against.</strong> If the spec says the response latency is at most 5 cycles, that&#8217;s a requirement that can become an SVA assertion. If the spec says &#8220;the block should be fast,&#8221; nobody &#8212; human or AI &#8212; can write a meaningful check against that. Every requirement in the spec must be translatable into something verifiable: an assertion, a coverage point, a formal property, a test case.</p><p>This points to something important about what the spec needs to <em>be</em>. In the methodology we&#8217;re building toward, the spec is not a prose document that happens to contain requirements &#8212; it&#8217;s a <strong>structured artifact</strong>: YAML, a requirements database, or another machine-readable format in which every feature entry is defined precisely enough to be testable. Not &#8220;testable in principle,&#8221; but concretely mapped: each requirement either generates a formal property, an SVA assertion, a simulation test case, or at minimum a defined check. The hierarchy matters &#8212; formal beats simulation beats manual &#8212; but anything that produces a verifiable artifact is a valid expression of a requirement. We&#8217;ll expand on what that structure looks like in future posts. The principle belongs here: if a requirement can&#8217;t be expressed in measurable terms, it isn&#8217;t a requirement &#8212; it&#8217;s a wish.</p><p><strong>Maintained as a living artifact.</strong> The spec isn&#8217;t done when coding starts. It evolves alongside the implementation, and any change to the RTL that contradicts the spec is either a spec update (the intent changed) or a bug (the implementation diverged). This requires discipline &#8212; and here&#8217;s the hypothesis at the center of this whole approach: AI is what finally makes that discipline enforceable. The reason it has historically broken down is human fatigue. Re-reading a 200-page spec on the hundredth RTL change, cross-referencing every new line of code against every relevant section &#8212; nobody does that consistently under schedule pressure. AI does. 
It doesn&#8217;t tire of reading the same document, doesn&#8217;t skip the cross-check because it&#8217;s Friday afternoon, doesn&#8217;t decide a section is &#8220;probably fine.&#8221; The discipline that was always theoretically correct but practically unsustainable may finally be achievable &#8212; because the entity enforcing it doesn&#8217;t get tired.</p><h2><strong>What this looks like in practice</strong></h2><p>I&#8217;ve been developing a methodology around this, and while the details could fill a book, the core idea is a pipeline:</p><p><strong>Phase 1 &#8212; Structural decomposition.</strong> Load the spec into the AI and extract the design hierarchy: blocks, interfaces, dependencies, parameters. The output is a structured representation that becomes the map for everything that follows &#8212; the exact format is still an open question I&#8217;m working through, and it matters more than it might seem. What&#8217;s already clear is that this phase, as currently implemented, is not a single AI call. It requires a team of agents, each with their own specialization, reviewing and cross-checking each other&#8217;s outputs before the result is trusted.</p><p><strong>Phase 2 &#8212; Coherence checking.</strong> With the full spec in context, have the AI cross-reference every section against every other section. Interface mismatches, undefined behaviors, contradictions, missing specifications. A 200-page spec written by three different engineers over six months <em>will</em> have contradictions. Finding them before RTL generation starts saves weeks.</p><p><strong>Phase 3 &#8212; Requirements extraction.</strong> For each block, distill the spec into numbered, verifiable requirements. Each requirement traces back to a specific spec section. This is the layer that turns prose into contracts. At this point, the original prose spec becomes a human-readable reference, not the source of truth. The structured feature list is. The prose should be fully reconstructable from it &#8212; if it isn&#8217;t, the extraction wasn&#8217;t complete.</p><p><strong>Phase 4 &#8212; Design collateral generation.</strong> This is where hardware diverges sharply from software, and it&#8217;s worth being explicit about it. In software, the source code is the artifact &#8212; everything else is derived from it. In hardware, the RTL is fundamental but, on its own, utterly useless. Without SDC (timing constraints), CDC verification intent, UPF (power intent), scan and DFT constraints, and a growing list of other collaterals, the RTL cannot be implemented, verified, or manufactured. Each of these must be generated from the same requirements, and each must be completely aligned with the RTL and with each other. A timing constraint that doesn&#8217;t reflect the RTL&#8217;s clock structure is a silent bug. A power domain definition that doesn&#8217;t match the RTL&#8217;s isolation logic is a functional failure waiting for silicon. The spec-centric approach applies equally to all of them &#8212; RTL generation is one output of this phase, not the only one.</p><p><strong>Phase 5 &#8212; Verification.</strong> Generate assertions, coverage points, and test plans directly from the requirements. Every requirement has at least one verification artifact. The traceability matrix connects spec &#8594; requirement &#8594; collaterals &#8594; test &#8212; where collaterals means all of them: RTL, SDC, UPF, CDC intent, DFT constraints. 
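<p>As a sketch of what that linkage can look like at the leaf level, here is a single requirement and the checker generated from it. The requirement ID, spec section, latency value, and signal names are all illustrative, not a finished schema:</p><pre><code>// REQ-0142 (illustrative): "A response is returned no more than 5 cycles
// after a request." Traces to spec section 4.3.2; verified by assertion + coverage.
module req_resp_latency_checks (
  input logic clk,
  input logic rst_n,
  input logic req,
  input logic resp
);
  // Generated from REQ-0142: every request sees a response within 1 to 5 cycles.
  a_req_resp_latency: assert property (
    @(posedge clk) disable iff (!rst_n) req |-> ##[1:5] resp
  );

  // Companion coverage: prove the latency window is actually exercised.
  c_req_resp_latency: cover property (
    @(posedge clk) disable iff (!rst_n) req ##[1:5] resp
  );
endmodule
</code></pre><p>The exact schema matters less than the property it enforces: every requirement carries an ID, and every ID points at an artifact that can fail a regression.</p>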
Verification is not complete until every requirement is covered across every artifact it touches.</p><p>The critical insight is that this pipeline doesn&#8217;t work without a spec that&#8217;s precise enough to drive it. Garbage in, garbage out applies with ruthless efficiency when the &#8220;garbage&#8221; is an ambiguous spec and the &#8220;out&#8221; is AI-generated silicon.</p><h2><strong>The organizational challenge</strong></h2><p>I&#8217;ll be direct about this: the hardest part of spec-centric design isn&#8217;t technical. It&#8217;s cultural.</p><p>Engineers want to write code. That&#8217;s what they were hired to do, that&#8217;s what they&#8217;re good at, and that&#8217;s what feels productive. Writing a precise spec &#8212; one that&#8217;s unambiguous enough for an AI to generate from &#8212; feels slow. It feels like bureaucracy. The temptation to skip the spec and go straight to RTL is enormous, especially under schedule pressure.</p><p>But here&#8217;s the reframe: <em>writing the spec is the engineering work now</em>. The precision that used to go into hand-crafting RTL now goes into hand-crafting the spec. The intellectual challenge hasn&#8217;t decreased &#8212; it&#8217;s moved upward. Defining exactly what a block should do at every boundary, under every condition, is harder than coding it. The code is the easy part. The thinking is the hard part. And the thinking lives in the spec.</p><p>There is a harder truth underneath this that the industry is not yet ready to say plainly: engineers should no longer be writing RTL. The writing should be done by AI. The engineer&#8217;s role is to instruct, review, and approve &#8212; not to type the code.</p><p>We should be precise about what that means in practice, because it&#8217;s easy to misread. It does not mean throwing a vague sentence at a model and expecting a correct memory controller to emerge. In the near term, generating anything beyond simple blocks will require intensive back-and-forth: clarifying intent, correcting misinterpretations, tightening the microarch spec until the model has enough precision to proceed. There will be cases where writing the RTL directly is still faster than describing the behavior in sufficient detail for the AI to get it right. That is where the technology is today, and it&#8217;s a legitimate engineering judgment call.</p><p>But the direction is not ambiguous. <em>Writing</em> RTL &#8212; the physical act of authoring Verilog or SystemVerilog syntax &#8212; is a skill that will matter less with each generation of tooling. What will matter is the ability to specify with precision, to evaluate generated output, and to recognize what the model got wrong. For now, understanding the code well enough to review it remains essential. The threshold for what can be trusted without close inspection will keep rising. The goal is to move that threshold &#8212; not to defend the one we&#8217;re standing at today.</p><blockquote><p><em>&#8220;I&#8217;ll never ship code I don&#8217;t understand&#8221;</em> &#8212; it sounds like a principle. It&#8217;s actually a myth engineers tell themselves. In any real project, you have coworkers. You review their PRs at the architectural level, you read the commit message, you spot-check the tricky parts. You do not read every single line of every module that ends up in the chip. You never did. What you actually do is calibrate trust: you decide how much autonomy to extend to a given person, on a given block, given what you know about them. 
A senior engineer you&#8217;ve worked with for five years gets a different level of scrutiny than a contractor you hired last month.</p><p>AI has the same calibration problem. The question isn&#8217;t &#8220;do I understand every line it wrote?&#8221; &#8212; the question is &#8220;have I worked with this tool long enough, on this class of problem, to know where it&#8217;s reliable and where it needs supervision?&#8221; That&#8217;s how you treat a skilled colleague. Not with blind trust, and not with the paranoid assumption that every output needs to be verified from first principles. You learn the failure modes, you build intuition about where to look closely, and you extend autonomy accordingly.</p></blockquote><p>Hardware has one structural advantage here that software doesn&#8217;t: a deep bench of deterministic automated checkers. Lint catches style and coding rule violations before simulation. CDC and RDC tools find structural clock and reset domain crossings that no human reviewer reliably spots at scale. SDC checkers verify that timing constraints are consistent with the design. Formal verification and simulation catch functional divergence. Power extraction validates that UPF intent matches RTL behavior. These tools don&#8217;t care whether the code was written by a human or an AI &#8212; they apply the same rules either way. That changes the trust calibration: extending autonomy to an AI-generated collateral isn&#8217;t a leap of faith when a CDC tool is going to run over it regardless. The toolchain is part of the review.</p><h2><strong>What comes next</strong></h2><p>This post has been deliberately general &#8212; a framing of why the spec matters and what it needs to become. In the next post, I&#8217;ll get much more specific. Design by Contract is a concept from software engineering that maps surprisingly well onto hardware interfaces: clock domains, resets, protocols. I&#8217;ll show what it looks like to define formal contracts at block boundaries &#8212; assume/guarantee pairs that are precise enough to drive assertion generation and compositional verification.</p><p>The spec is the design. The contract is how you make it enforceable.</p><div><hr></div><p><em>Marco Brambilla is a semiconductor industry veteran with 25 years in chip design, most recently as Senior Technical Director at Meta Reality Labs. He writes about AI, chip design, and the future of hardware engineering at Above the RTL.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.abovethertl.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Physics of Safety: Why We Get CDC Wrong, and Why AIs Get it Worse]]></title><description><![CDATA[Or, how an over-trusted AI will quietly wave through catastrophic silicon failures.]]></description><link>https://www.abovethertl.com/p/the-physics-of-safety-why-we-get</link><guid isPermaLink="false">https://www.abovethertl.com/p/the-physics-of-safety-why-we-get</guid><dc:creator><![CDATA[Marco Brambilla]]></dc:creator><pubDate>Tue, 07 Apr 2026 05:07:40 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0d6419a9-6f37-434c-bb43-8710c9193862_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aXiQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f8f0b72-927a-411b-96d6-e277e27131ae_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aXiQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f8f0b72-927a-411b-96d6-e277e27131ae_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!aXiQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f8f0b72-927a-411b-96d6-e277e27131ae_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!aXiQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f8f0b72-927a-411b-96d6-e277e27131ae_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!aXiQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f8f0b72-927a-411b-96d6-e277e27131ae_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aXiQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f8f0b72-927a-411b-96d6-e277e27131ae_2752x1536.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f8f0b72-927a-411b-96d6-e277e27131ae_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5633515,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.abovethertl.com/i/193408396?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f8f0b72-927a-411b-96d6-e277e27131ae_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!aXiQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f8f0b72-927a-411b-96d6-e277e27131ae_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!aXiQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f8f0b72-927a-411b-96d6-e277e27131ae_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!aXiQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f8f0b72-927a-411b-96d6-e277e27131ae_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!aXiQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f8f0b72-927a-411b-96d6-e277e27131ae_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>A note on how this was written.</strong> A combination of Claude Opus and Gemini as writing partners this time around, with me directing the argument and reviewing every claim. Fitting, given what this post is about. See <a href="https://abovethertl.com/">Post 1</a> for why I think this transparency matters.</p><p>Every textbook on digital design talks about Clock Domain Crossing (CDC) and metastability. They show you the internal CMOS cross-coupled inverters of a flip-flop, detail the setup and hold windows, and invariably use the mechanical analogy of a ball balanced perfectly on the crest of a hill. They spend pages explaining precisely why and how metastability happens <em>inside</em> the flip-flop.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.abovethertl.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>But almost none of them spend time explaining why it is so catastrophic <em>after</em> the flop.</p><p>Let&#8217;s clear that up right now, because understanding the &#8220;after&#8221; is the only way to understand why synchronizer methodology isn&#8217;t just a set of guidelines&#8212;it&#8217;s safety physics.</p><h3><strong>The Danger After the Flop</strong></h3><p>When a signal violates a flip-flop&#8217;s setup or hold time, the internal transistors fight each other, and the output node gets stuck hovering at a mid-level voltage&#8212;let&#8217;s call it VDD/2.</p><p>If that mid-level voltage escapes the flop and travels down the wire into your clock domain, it&#8217;s going to hit the next stage of logic. In any real design, that signal will have a fan-out; it will drive multiple downstream gates simultaneously. Let&#8217;s say it drives Gate A and Gate B.</p><p>Here is where the silicon reality bites: because of microscopic variations in semiconductor manufacturing, local voltage (IR) drops across the die, and temperature gradients, Gate A and Gate B do not have the exact same logic switching threshold.</p><p>Furthermore, a metastable signal isn&#8217;t just hovering at a mid-level voltage&#8212;it is also incredibly <em>slow</em> to resolve. Instead of snapping with a clean, sharp transition, it sludges through the voltage threshold. Because the edge rate is so degraded, different branches of the signal path will see the value change at significantly variable delays, completely blowing up any static timing analysis you did on the logic cloud.</p><p>So when that degraded voltage sweeps across the downstream logic, two disasters can occur: <strong>Gate A might interpret it as a perfect logic HIGH (&#8216;1&#8217;) while Gate B interprets the exact same voltage as a logic LOW (&#8216;0&#8217;). Or, Gate A might see the transition much earlier than Gate B, causing a functional timing failure.</strong></p><p>In a single clock cycle, your logic paths have completely decorrelated. They disagree on the fundamental reality of the system. Because of this, a state machine&#8217;s next-state computation becomes essentially random. It can even be forced into forbidden, illegal states&#8212;imagine a strict one-hot encoded state machine that suddenly registers two bits hot because the branching decode logic read the metastable signal differently. The logical coherence of the design simply shatters.</p><p>This is why metastability is lethal. It isn&#8217;t just a signal arriving late; it is systemic structural corruption. It causes hard crashes that are completely invisible in your RTL simulator (because simulators assume instantaneous, ideal logic levels) and infuriatingly unreproducible in the lab (because they depend on exact voltage and temperature alignments).</p><h3><strong>The Synchronizer (And the Golden Rule)</strong></h3><p>The defense mechanism against this is the standard two-flop synchronizer. You put two flip-flops in series on the receiving clock domain. The first flop catches the asynchronous signal.
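<p>In RTL the whole structure is only a handful of lines. A minimal sketch, with illustrative signal names:</p><pre><code>// Standard two-flop synchronizer (sketch). Both flops run on the DESTINATION
// clock, and the asynchronous input feeds the first flop directly: no logic in between.
module sync_2ff (
  input  logic dst_clk,
  input  logic dst_rst_n,
  input  logic async_in,   // launched from another clock domain
  output logic sync_out    // safe to consume in the destination domain
);
  logic meta_ff;            // first stage: allowed to go metastable

  always_ff @(posedge dst_clk or negedge dst_rst_n) begin
    if (!dst_rst_n) begin
      meta_ff  &lt;= 1'b0;
      sync_out &lt;= 1'b0;
    end else begin
      meta_ff  &lt;= async_in; // may violate setup/hold: that is expected here
      sync_out &lt;= meta_ff;  // samples a value that has had a full cycle to resolve
    end
  end
endmodule
</code></pre>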
That first flop might go metastable, but you give it an entire clock cycle for that internal ball to fall off the hill and resolve to a clean &#8216;1&#8217; or &#8216;0&#8217;. The second flop then safely samples that clean signal and presents it to the rest of the logic.</p><p>This exact mathematical necessity is why the two flops must be physically placed directly next to each other on the silicon. In fact, many modern foundries provide a dedicated, pre-characterized two-flop sync macro in their standard cell libraries. The first flop in this cell is structurally optimized for rapid metastability resolution, and the second is a &#8220;regular&#8221; flop. By packing them tightly together, the physical design flow avoids wasting precious setup margins buffering or driving a degraded signal over a long wire. Instead, it allocates the vast majority of the clock cycle entirely to resolving the metastability.</p><p>Because this risk is so absolute, we have Golden Rules in hardware design. One of the biggest: <strong>You never put combinational logic before a synchronizer.</strong></p><p>But what happens when you ask the smartest AI models in the world to evaluate that rule? Do they understand it?</p><p>Let&#8217;s look at a &#8220;simple&#8221; test I recently threw at ChatGPT 5.3, Claude Opus 4.6, and Gemini 3.1 Pro.</p><p>I gave them a single-bit synchronizer used to pass a level signal (guaranteed to stay high long enough for the receive clock to capture it reliably). However, there are multiple request signals, and as long as at least one of them is asserted, the sync request stays high. I placed an OR gate <em>before</em> the synchronizer.</p><p>I asked the models to evaluate this configuration across four scenarios:</p><ul><li><p><strong>Case A:</strong> All requesting flops are in the same source clock domain and are ORed together.</p></li><li><p><strong>Case B:</strong> All requesting flops are in the same source clock domain, and the request signal is the decode of a state machine.</p></li><li><p><strong>Case C:</strong> All requesting flops are in <em>different</em> clock domains and are ORed together.</p></li><li><p><strong>Case D:</strong> All requesting flops are in different clock domains and there is combinational logic before the OR.</p></li></ul><h3><strong>Why Context Is Everything</strong></h3><p>This question zeroes in on exactly what we just discussed: the physics of passing a signal between domains.</p><p>To a non-expert, <strong>Case C</strong> (OR&#8217;ing signals perfectly asynchronous to each other from different clock domains) sounds obviously incorrect&#8212;and all the models easily flagged it.</p><p>But <strong>Case A</strong> is a massive, deadly trap. Because the source flops are all in the <em>same</em> clock domain, designers (and AIs) often assume the resulting signal is &#8220;synchronous and therefore safe.&#8221; They completely discount the fact that physical routing delay differences and the analog nature of the OR gate itself will inevitably create a glitchy signal before it ever reaches the synchronizer.</p><p><strong>ChatGPT 5.3</strong> fell right into this trap. It confidently declared Case A &#8220;Generally OK&#8221; and &#8220;glitch-free,&#8221; completely missing the physical reality of Tco (clock-to-output) skew.</p><p><strong>Claude Opus 4.6</strong> did better, but would still steer a junior designer off a cliff. It correctly recognized the Tco skew and noted that Case A could generate a narrow glitch.
But it concluded that because the glitch is very narrow, the statistical probability of it being captured is low&#8212;meaning the design &#8220;works, with a caveat.&#8221;</p><p><strong>Gemini 3.1 Pro</strong> provided the only methodologically correct answer: <em>Every single one of these implementations is unsafe.</em></p><p>The absolute, non-negotiable physical rule of CDC is this: <strong>There must be exactly ONE flop before the synchronizer.</strong> No gates, no combinational logic, no exceptions.</p><h3><strong>Why &#8220;Low Probability&#8221; Still Kills Silicon</strong></h3><p>To understand why Claude&#8217;s answer (&#8221;the glitch is narrow, so the probability of capturing it is low&#8221;) is so dangerous, you have to look at the math that keeps our silicon safe.</p><p>When Gemini 3.1 Pro was pushed to explain the physics of <em>why</em> Case A is a methodology violation despite the low statistical probability of capture, it nailed the absolute truth of synchronizer design.</p><p>The problem isn&#8217;t about <em>how often</em> a narrow glitch&#8212;what we call a &#8220;runt pulse&#8221;&#8212;gets captured. The real issue is <em>what happens</em> when it does get captured.</p><p>The MTBF (Mean Time Between Failures) calculations that prove a synchronizer is trustworthy are mathematically built on a fundamental assumption: <strong>the incoming signal has a sharp, clean edge.</strong></p><p>When you OR two signals together, and routing skew causes them to slightly overlap, the transistors inside the OR gate fight each other. The output doesn&#8217;t transition cleanly; it dips to a mid-level voltage and weakly pulls back up, generating a runt pulse. If your synchronizer captures that sludgy, half-voltage runt pulse exactly as the clock ticks, you aren&#8217;t just risking standard metastability&#8212;you are forcing the flip-flop into a much deeper metastable state.</p><p>This violently degrades the resolution time. A synchronizer that gets hit by a runt pulse will take substantially longer to resolve than one hit by a clean edge.</p><p>When that happens, you are no longer operating inside the design&#8217;s safety envelope. The golden rule&#8212;<em>exactly one flop before the synchronizer</em>&#8212;isn&#8217;t there just because a glitch is likely to cause a problem. It is there because if you violate it, you fundamentally break the physics assumptions of the MTBF math, meaning you can no longer computationally prove that your chip won&#8217;t fail.</p><p>And in hardware, if you can&#8217;t prove it works, it&#8217;s already broken.</p><h3><strong>The &#8220;Silicon is Forgiving&#8221; Fallacy</strong></h3><p>When you bring up the physical reality of runt pulses and MTBF degradation, you will inevitably hear a variation of this defense from experienced engineers:</p><blockquote><p><em>&#8220;We do this all the time. Silicon is inherently good, and it practically always forgives things like this. These failures happen in theoretical models, not in reality.&#8221;</em></p></blockquote><p>To be fair, they aren&#8217;t entirely wrong&#8212;but they are deeply misguided. Silicon <em>is</em> incredibly robust. Statistically speaking, the physical chance of a runt glitch perfectly aligning with the exact setup and hold window of a receiving flip-flop is astronomically low. In the lab, on the test bench, and at room temperature, the design will pass. 
It will probably pass for the first million hours of customer use.</p><p>This is exactly the logic Claude used to dismiss the risk. It looked at the probability and decided &#8220;it works.&#8221;</p><p>But here is the reality of modern semiconductor scale: when you ship 100 million devices, a &#8220;one in a billion&#8221; statistical anomaly stops being a theoretical edge case. It becomes an emergent, systemic failure.</p><p>When that failure finally happens in the field, it doesn&#8217;t leave a software crash dump. It manifests as a silent, unexplainable hang. It happens only at 0&#176;C or 105&#176;C, or only when a specific voltage drops on the power rail. It triggers an RMA (Return Merchandise Authorization) that your verification engineers will spend weeks trying to reproduce, chasing a ghost that refuses to show up in standard RTL simulations because simulators don&#8217;t model runt pulses.</p><p>Rule-bending relies on the assumption that &#8220;it usually doesn&#8217;t fail.&#8221; True hardware engineering relies on mathematically proving that it <em>cannot</em> fail. We don&#8217;t enforce the &#8220;One Flop Rule&#8221; because failure is <em>likely</em>. We enforce it because the moment you step outside the physical assumptions of your MTBF math, you are no longer designing. You are gambling.</p><p>And at advanced nodes and massive scale, gambling with silicon is how you waste millions of dollars on a respin. I have a personal rule I half-jokingly repeat to my engineering teams: <em>As engineers, we should save our luck to protect us from the issues we weren&#8217;t aware of. We shouldn&#8217;t waste it on the stuff we already knew about.</em></p><h3><strong>The Deadly Forest of CDC Waivers</strong></h3><p>But let&#8217;s play devil&#8217;s advocate for a second. If a runt pulse failure is truly &#8220;one in a billion,&#8221; maybe it really is just as rare as a software glitch or a cosmic ray bit-flip. If it happens that rarely, why bother fixing it?</p><p>Because the danger of breaking the rules is much more insidious than just relying on luck.</p><p>When you adopt a culture of &#8220;it probably works,&#8221; your design quickly becomes filled with tens of thousands of CDC waivers. Your verification tools will flag every single instance of combinational logic before a synchronizer, and your tired engineers will look at the logic, decide it&#8217;s statistically safe, and hit &#8220;waive.&#8221;</p><p>This creates <strong>waiver blindness</strong>. Hidden somewhere in that forest of 10,000 &#8220;probably safe&#8221; waivers is one real, catastrophic CDC error. But because your team&#8217;s attitude is that CDC glitch and reconvergence issues can generally be waived, that deadly error gets rubber-stamped along with the rest. It is perfectly camouflaged.</p><p>If, instead, you enforce the strict taxonomy of CDC design&#8212;where exactly ONE flop sits before the synchronizer&#8212;you drastically reduce your waivers. (We will always need a few unfortunate waivers, such as recombining after a gray code pointer, but they should be the exception, not the rule). When your design is clean, the handful of remaining CDC warnings are taken incredibly seriously. 
They stick out like sore thumbs.</p><p>This brings us right back to the role of AI in hardware engineering.</p><p>If you use a lightweight model that only understands Verilog syntax, it will look at a questionable CDC structure, calculate the statistical likelihood of failure, shrug, and tell you it&#8217;s &#8220;Generally OK.&#8221; It will actively help you grow your forest of dangerous waivers.</p><p>But if you use a frontier model that understands hardware physics, you unlock an entirely new, incredibly powerful symbiosis between AI and traditional EDA workflows.</p><p>You let the deterministic EDA tool do what it does best: it analyzes every single line of code and flags all the risky CDC areas computationally, without burning inference tokens or missing corner cases.</p><p>Then, you let the AI do what it does best: evaluate intent. When combined with tools like Verific&#8217;s Invio or Defacto&#8217;s SoC Compiler that run structural analysis, you can feed those flagged regions to the AI. The AI applies sharp physical reasoning to the actual violation. It <em>understands</em> the code and its intent. It can separate a harmless gray-code recombination from a fatal &#8220;Case A&#8221; architecture trap, and propose the exact, methodologically sound fix.</p><p>And unlike a verification engineer staring at their 100,000th waiver review on a Friday afternoon, the AI never gets tired of looking at RTL. <strong>It will flatly refuse to waive something that should never be waived on waiver number 1, and on waiver number 100,000.</strong></p><p>This fundamentally shifts the team dynamic. It puts deep, physical CDC understanding within the reach of junior engineers. Instead of a junior designer making a statistical guess and waiting three weeks for a Senior Architect to flag the violation in a grueling design review, the AI safety net catches it at the desk and explains the physics immediately. This drastically reduces the bottleneck of manager-led and expert-led design reviews.</p><h3><strong>The Imperative of Internal Testing</strong></h3><p>However, this brings us full circle to the problem I highlighted in the previous post: <strong>We currently lack public benchmarks to prove an AI can do this safely.</strong></p><p>Before any hardware company integrates this AI + EDA symbiosis into their production flow, they have a massive engineering responsibility. You cannot just deploy an LLM and hope it continues to understand metastability forever. You must rigorously prove that the toolchain will not hallucinate structural safety, and that it will not regress on future version updates.</p><p>Semiconductor companies must prepare their own internal, rigorous test suites&#8212;exactly like the CDC synchronizer test we just ran. They must continuously test their AI methodology against these cases, and explicitly prompt the system with absolute guardrails: <em>Do not validate anything that does not match our known safe methodologies. If a structure falls outside the Golden Rules, either propose a structurally safe fix, or immediately highlight it for review by a human expert.</em></p><p>That is the difference between using an AI as a syntax generator, and building an AI-enabled methodology that acts as your ultimate safety net.</p><h3><strong>What&#8217;s Next: Rethinking the EDA Flow</strong></h3><p>But we shouldn&#8217;t stop at just catching errors. EDA companies currently hold the keys to this kingdom, but their structural analysis flows are still fundamentally reactive. 
With the introduction of intent-aware AI, there is a massive, untapped opportunity to completely rethink how we automate hardware signoff from the ground up.</p><p>I have a few very specific ideas on how EDA vendors need to forcefully evolve their toolchains to make this intensely automated future a reality&#8212;but we&#8217;ll save that deep dive for the next post. Subscribe below so you don&#8217;t miss it!</p><div><hr></div><p><em>Marco Brambilla is a semiconductor industry veteran with 25 years in chip design, most recently as Senior Technical Director at Meta Reality Labs. He writes about AI, chip design, and the future of hardware engineering at Above the RTL.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.abovethertl.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Model Matters More Than You Think]]></title><description><![CDATA[Not all AI is created equal. In chip design, the difference can cost you a respin.]]></description><link>https://www.abovethertl.com/p/the-model-matters-more-than-you-think</link><guid isPermaLink="false">https://www.abovethertl.com/p/the-model-matters-more-than-you-think</guid><dc:creator><![CDATA[Marco Brambilla]]></dc:creator><pubDate>Sat, 04 Apr 2026 00:51:21 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f5c61a62-289c-4b56-a08c-d146ab3e5354_1507x532.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><strong>A note on how this was written.</strong> A combination of Claude Opus and Gemini as writing partners this time around, with me directing the argument and reviewing every claim. See <a href="https://abovethertl.com/">Post 1</a> for why I think this transparency matters.</p></blockquote><div><hr></div><p>In the <a href="https://abovethertl.com/">first post in this series</a>, I argued that AI is coming <em>for</em> chip designers &#8212; as a tool, not a replacement. That the engineer&#8217;s job is moving up the abstraction stack, from writing syntax to owning intent.</p><p>But there&#8217;s a critical assumption buried in that argument: that the AI you&#8217;re using actually understands hardware.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.abovethertl.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Most don&#8217;t.</p><h2><strong>The synchronizer test</strong></h2><p>Here&#8217;s something every chip designer learns early in their career: when a signal crosses from one clock domain to another, you need a synchronizer. Two flip-flops clocked by the <em>destination</em> clock, in series. The first flop may go metastable &#8212; that&#8217;s expected. The second flop samples the resolved output and provides a clean signal to the destination domain. This is foundational. It&#8217;s in every textbook. Get it wrong and you get intermittent, unreproducible failures that escape simulation and show up in silicon.</p><p>I asked two popular open-source models &#8212; Qwen 3.5 (9B parameters) and Qwen Coder (14B parameters) &#8212; to write a basic two-flop synchronizer. Both produced code. Both produced code that was syntactically correct. Neither produced a synchronizer.</p><p>Both models clocked the two synchronizer flops with the <em>source</em> clock. Then they added a third flop on the destination clock to capture the output.</p><p>Think about what that means. The two &#8220;synchronizer&#8221; flops are just a pipeline in the source domain &#8212; they accomplish nothing. The actual clock domain crossing happens at the third flop, which is a single bare register with no metastability protection. It&#8217;s worse than no synchronizer at all, because it <em>looks</em> correct. It has the right structure, the right signal names, the right number of flops. A quick visual review might miss it. The synthesis tool won&#8217;t flag it. And if it goes metastable in silicon, you&#8217;ll spend weeks chasing a bug that only appears under certain temperature and voltage conditions.</p><p>I showed both models the error. Explained exactly what was wrong and why. Neither could fix it. They didn&#8217;t understand what the correction should be &#8212; because they don&#8217;t understand what a synchronizer <em>does</em>. They had learned the shape of the code, not the purpose of the circuit.</p><p>This is not a minor failure. This is the difference between a tool you can use and one that will bury a silicon-killing bug under syntactically perfect code.</p><p>And here&#8217;s the real damage: imagine a designer &#8212; someone with ten or twenty years of experience &#8212; sees this output. They asked the AI to do one of the simplest things in digital design, and it got the clocking wrong on a synchronizer. What do they do? They close the tool, they tell their colleagues it doesn&#8217;t work, and they go back to writing RTL by hand. You&#8217;ve lost them &#8212; not because AI can&#8217;t help them, but because the wrong model just proved to them that it can&#8217;t be trusted with the basics.</p><p>That first impression is almost impossible to undo. Every engineer I know who has dismissed AI tools has a story like this: they tried it once, it produced something obviously wrong, and they concluded the technology isn&#8217;t ready. 
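<p>That story is easier to understand when you see the output. Here is the structure the small models produced, reconstructed as a sketch with illustrative names, so you can see how plausible it looks at a glance:</p><pre><code>// Sketch reconstructing what the small models generated. It LOOKS like a
// synchronizer, but both "sync" flops are clocked by the SOURCE clock.
module broken_sync_sketch (
  input  logic src_clk,
  input  logic dst_clk,
  input  logic src_sig,    // generated in the source clock domain
  output logic dst_sig
);
  logic sync_ff1, sync_ff2;

  // Just a two-stage pipeline in the source domain: it accomplishes nothing.
  always_ff @(posedge src_clk) begin
    sync_ff1 &lt;= src_sig;
    sync_ff2 &lt;= sync_ff1;
  end

  // The actual clock domain crossing happens here, on a single bare flop
  // with no metastability protection at all.
  always_ff @(posedge dst_clk) begin
    dst_sig &lt;= sync_ff2;
  end
endmodule
</code></pre><p>The fix is not a third flop; it is clocking both synchronizer stages with the destination clock so the first destination-clocked flop is the one allowed to go metastable. An engineer who sees output like this on a first attempt is not being stubborn when they walk away.</p>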
In many cases they were right &#8212; but about the <em>model</em>, not about AI in general.</p><p>Expert models&#8212;like Claude, GPT-4, Gemini, and even advanced open-weight models like Gemma 4 31B&#8212;get this right every time. I&#8217;ve generated dozens of simple CDC blocks with them, and the destination clock is exactly where it belongs. When asked to explain <em>why</em> the flops need to be on the destination clock, they give a correct, coherent explanation about metastability resolution.</p><p>At first glance, this looks like a huge win. The larger models know how to write the code. But there is a dangerous trap here: just because a model has memorized the correct <em>structure</em> of a common circuit doesn&#8217;t mean it actually understands the <em>physics</em> underlying it. Getting the basic coding right is just table stakes. The real test is methodology.</p><h2><strong>Pushing the frontier models</strong></h2><p>The two-flop synchronizer test is the baseline. But I wanted to see what happens when you push the frontier models on a much more delicate CDC architecture question. We explicitly said in the <a href="https://abovethertl.com/">first post</a> that I am not pushing a specific tool, so I ran this test across four of the best models available today: Claude Opus 4.6 (extended), Gemini 3.1 Pro, ChatGPT 5.3, and Gemma 4 31B.</p><p>Here is the setup: I gave them a complex single-bit synchronizer scenario with multiple OR&#8217;ed requests, testing four different configurations involving combinational logic before the synchronizer and clock domain crossings.</p><p>The results were incredibly revealing:</p><ul><li><p><strong>ChatGPT 5.3</strong> hit immediate methodology violations, confidently calling an unsafe single-domain OR &#8220;Generally OK.&#8221;</p></li><li><p><strong>Claude Opus 4.6</strong> caught the risk of a narrow glitch but concluded it &#8220;works, with a caveat&#8221; because the glitch probability is low&#8212;steering a designer down a dangerous path.</p></li><li><p><strong>Gemini 3.1 Pro</strong> provided the only methodologically correct answer: <em>none of these implementations are safe</em>.</p></li></ul><p>When pushed on the physics of <em>why</em>, Gemini correctly landed on the deeper truth of synchronizer design: that runt glitches force flops into deeper metastable states, which completely invalidates the MTBF math that proves the silicon safe.</p><p>Keep in mind, I only caught this because it was a microscopic, directed test. But imagine a designer asking an AI to generate a large, complex block. If the AI hallucinates a structure like Case A and calls it &#8220;generally OK,&#8221; that error will get buried in thousands of lines of RTL. The synthesis tool won&#8217;t flag it. It might not get caught until CDC signoff&#8212;or worse, a tired engineer might decide to waive the CDC warning because the logic &#8220;looks right,&#8221; sending a fundamental metastability flaw straight to silicon. The fact that Gemini definitively &#8220;refused&#8221; to budge on the methodology is exactly the kind of safety net hardware engineering requires.</p><p>But models change constantly. This introduces a new operational requirement for hardware teams: regression testing. You cannot blindly deploy a model update just because its generic coding benchmarks improved. 
Companies must build their own suite of CDC and methodology tests and run rigorous regressions on every model version to ensure these fundamental capabilities remain intact.</p><p><em>(This specific test gets into the deep physics of runt glitches and metastability calculations. Analyzing what the models got wrong here tells you everything you need to know about how AI handles complex hardware constraints. I&#8217;ll do a full teardown of this problem&#8212;and exactly why a runt glitch breaks a synchronizer&#8217;s safety envelope&#8212;in the next post.)</em></p><h2><strong>Why this happens</strong></h2><p>The explanation is straightforward: training data.</p><p>Large language models learn from the data they&#8217;re trained on. The public internet contains billions of lines of Python, JavaScript, Java, and C++. It contains a vanishingly small amount of SystemVerilog, and even less of it is production-quality RTL with correct CDC handling. Verilog and SystemVerilog combined account for fewer than 7,000 publicly tagged repositories on GitHub &#8212; against over 500 million total.</p><p>Smaller models, trained on smaller datasets, have seen even less hardware content. A 14B parameter model has likely encountered a handful of synchronizer examples in its entire training set &#8212; if any. It has learned that synchronizers involve two flops and a clock, but it hasn&#8217;t seen enough correct examples to learn <em>which</em> clock matters and <em>why</em>.</p><p>Frontier models &#8212; Claude, GPT-4 &#8212; have larger training sets and more capacity to retain domain-specific knowledge. They&#8217;ve seen more RTL, more CDC documentation, more EDA tool references. But even they are working from a fundamentally impoverished corpus compared to what they have for software.</p><p>This is the training data gap that Moshe Zalcberg identified in his <a href="https://moshezalcberg.substack.com/p/why-software-sprints-while-chip-design">hype cycle analysis</a>. It&#8217;s not just an abstract structural problem. It shows up concretely, in wrong clocks on synchronizer flops, in incorrect reset sequencing, in SDC constraints that are syntactically legal but semantically wrong.</p><h2><strong>The NVIDIA lesson</strong></h2><p>This is also exactly why NVIDIA built ChipNeMo. As I discussed in <a href="https://abovethertl.com/">Post 1</a>, NVIDIA invested in domain-adaptive pretraining on 23 billion tokens of their own internal chip design data &#8212; 30 years of design documents, bug reports, and verification scripts. They did this because they understood that general-purpose models, no matter how large, don&#8217;t have enough hardware knowledge to be reliable.</p><p>But here&#8217;s the uncomfortable follow-up: NVIDIA can do this because they&#8217;re NVIDIA. They have the data, the compute, and the institutional history to build a domain-specific model. Only a few top companies can do something similar, or can even afford to.</p><p>Everyone else &#8212; and that&#8217;s the vast majority of the semiconductor industry &#8212; does not have 30 years of proprietary RTL, bug databases, and design reviews to train their own model. They cannot build a ChipNeMo. They are entirely dependent on what the commercially available LLMs offer out of the box. For these companies, the quality of the frontier models isn&#8217;t a nice-to-have &#8212; it&#8217;s the whole story. If the best available model can&#8217;t write a correct synchronizer, there is no fallback. 
There is no internal corpus to fine-tune against, no domain-adapted alternative to switch to.</p><p>This is why model selection matters so much more in chip design than in software. A software team that picks a weaker model gets slower code reviews and buggier autocomplete &#8212; annoyances they can iterate past. A chip design team that picks a weaker model gets wrong clocks on synchronizers, incorrect CDC assumptions, and SDC constraints that look right but aren&#8217;t. And they may not find out until silicon.</p><h2><strong>What I&#8217;ve seen work (and what doesn&#8217;t)</strong></h2><p>I want to be specific here, because vague claims about AI quality are part of the noise this series is trying to cut through.</p><p><strong>Frontier models (Claude Opus/Sonnet, GPT-4) &#8212; reliable for structured hardware tasks.</strong> Synchronizers, FSMs, protocol implementations, SVA assertions, SDC constraints. They understand the domain well enough that the output is correct more often than not, and when they make mistakes, they can usually diagnose and fix them when shown the error. Claude in particular has been consistently strong on CDC-related work &#8212; I&#8217;ve used it extensively for assertion writing and formal verification setup, and it understands the semantics, not just the syntax.</p><p><strong>Mid-tier models &#8212; hit or miss.</strong> They can produce useful boilerplate and handle simple RTL tasks, but they start failing on anything that requires domain-specific reasoning. CDC, timing constraints, reset sequencing &#8212; the areas where getting the semantics wrong is dangerous &#8212; are unreliable.</p><p><strong>Small open-source models (sub-20B parameters) &#8212; not ready for hardware.</strong> The synchronizer test is the simple version. These models also struggle with basic concepts like the difference between blocking and non-blocking assignments in the context of synthesis, or why you can&#8217;t have combinational loops in synchronous logic. They have learned Verilog syntax from whatever fragments exist in their training data, but they haven&#8217;t learned hardware design.</p><p>This doesn&#8217;t mean small models are useless everywhere. For scripting, documentation, code formatting, and other tasks where hardware domain knowledge isn&#8217;t critical, they can be fine. But for anything that touches the actual design &#8212; RTL, constraints, assertions, verification &#8212; model selection is a professional decision with real consequences.</p><h2><strong>Model selection is now an engineering skill</strong></h2><p>This brings me to the point I want to leave you with.</p><p>When the industry talks about &#8220;using AI for chip design,&#8221; it often treats the model as interchangeable &#8212; as if the magic is in the workflow, the RAG system, the agent framework, the integration with EDA tools. Those things matter. But they&#8217;re all downstream of a more fundamental question: does the model understand hardware?</p><p>If it doesn&#8217;t, no amount of prompt engineering or toolchain integration will save you. You&#8217;ll get confident, syntactically correct output that embeds domain errors you may not catch until silicon. And the smaller your team, the fewer experienced engineers you have to review that output, the more dangerous this becomes.</p><p>Choosing the right model for hardware work is not a procurement decision. 
It&#8217;s an engineering decision, and it deserves the same rigor you&#8217;d apply to selecting an EDA tool or a verification methodology.</p><p>So what does that rigor look like in practice?</p><p>It means you don&#8217;t evaluate AI for your team by reading marketing claims, and you certainly don&#8217;t rely on generic software coding benchmarks. You need to build your own &#8220;synchronizer test.&#8221; You take half a dozen fundamental, domain-specific tasks that are critical to your workflow&#8212;a specific CDC scenario, a tricky SDC constraint problem, a finite state machine with specific reset conditions&#8212;and you see how the models handle them. You test them on edge cases. You see if they can correct themselves when you explain an error in their logic.</p><p>You establish this baseline of hardware competence <em>before</em> you roll the tool out to your engineers. Because as we saw earlier, you only get one chance to build their trust. If you hand them a model that doesn&#8217;t understand the domain, they won&#8217;t blame the model. They&#8217;ll blame the technology, and they&#8217;ll go back to writing RTL by hand.</p><p>In the next post, we will do a full teardown of the advanced CDC synchronizer test, breaking down exactly how these models handled the deep physics of hardware methodology and what it means for your verification flow. After that, we&#8217;ll move to the spec: what it means to treat it as the central artifact in an AI-assisted flow, and why Design by Contract isn&#8217;t just a software concept.</p><div><hr></div><p><em>Marco Brambilla is a semiconductor industry veteran with 25 years in chip design, most recently as Senior Technical Director at Meta Reality Labs. He writes about AI, chip design, and the future of hardware engineering at Above the RTL.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.abovethertl.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Is AI Coming FOR You, or for YOU?]]></title><description><![CDATA[The noise is louder than the signal. Let's fix that.]]></description><link>https://www.abovethertl.com/p/is-ai-coming-for-you-or-for-you</link><guid isPermaLink="false">https://www.abovethertl.com/p/is-ai-coming-for-you-or-for-you</guid><dc:creator><![CDATA[Marco Brambilla]]></dc:creator><pubDate>Fri, 03 Apr 2026 21:32:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!OOOF!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51493a34-c743-4730-b4bb-b1a3a843fce9_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1><strong>Is AI Coming FOR You, or for YOU?</strong></h1><p><em>The noise is louder than the signal. 
Let&#8217;s fix that.</em></p><blockquote><p><strong>A note on how this was written.</strong> Practicing what I preach: this post was written with Claude Opus as a writing partner. I directed the argument, shaped every claim, and reviewed the final result. The AI helped draft the prose. That&#8217;s the <em>what</em> vs. <em>how</em> distinction this entire series is about &#8212; and yes, it works for writing too.</p></blockquote><div><hr></div><p>If you&#8217;re a chip design engineer and you&#8217;ve been reading the headlines, you&#8217;d be forgiven for thinking your career has an expiration date.</p><p>Jensen Huang says he wants every engineer at NVIDIA burning through a massive number of AI tokens per year. The productivity studies say 50x. The conference keynotes show demos where natural language becomes RTL in seconds. The implication, if you take it all at face value, is clear: the craft you spent a decade mastering is about to be automated out from under you.</p><p>I want to make two arguments in this post. The first is that the fear is understandable but wrong. The second is that the hype is understandable but dangerous &#8212; not because AI isn&#8217;t real, but because the distorted version of it is causing engineers to freeze and managers to plan against a fantasy.</p><h2><strong>Give Jensen his due</strong></h2><p>Let&#8217;s start with Jensen, because he&#8217;s being misquoted by implication.</p><p>When Jensen talks about token consumption as a metric, he&#8217;s not saying engineers are replaceable. He&#8217;s saying something much more specific: if you&#8217;re not integrating AI into your daily workflow <em>right now</em>, you&#8217;re falling behind. It&#8217;s a call to action, not a eulogy.</p><p>Jensen knows perfectly well that AI is not ready to design chips autonomously. He&#8217;s said as much &#8212; he&#8217;s talked publicly about how NVIDIA is <em>improving</em> AI&#8217;s ability to generate code, which is an admission that it&#8217;s not there yet. This is a man whose company tapes out some of the most complex silicon on the planet. He understands the gap between generating plausible-looking RTL and shipping a chip that works.</p><p>The message isn&#8217;t &#8220;AI is replacing you.&#8221; The message is &#8220;tool up or fall behind.&#8221; That&#8217;s a fundamentally different statement, and it&#8217;s one that every serious engineer should take seriously. The mental switch Jensen is signaling is real and important: stop treating AI as a threat to resist and start treating it as a capability to develop.</p><p>This isn&#8217;t theoretical for Jensen. 
NVIDIA built <a href="https://arxiv.org/abs/2311.00176">ChipNeMo</a>, a domain-adapted LLM trained on 23 billion tokens of their own internal chip design data &#8212; 30 years of design documents, bug reports, verification scripts, and engineering decisions &#8212; and deployed it to over 11,000 engineers. They&#8217;ve seen firsthand where AI delivers real value: bug triage, EDA script generation, onboarding junior engineers who suddenly have access to three decades of institutional knowledge in a five-second query. And they&#8217;ve seen where the human remains irreplaceable: the spec, the intent, the architectural judgment.</p><p>Here&#8217;s the telling detail: at GTC 2026, Jensen noted that 100% of NVIDIA&#8217;s software engineers use off-the-shelf AI tools &#8212; Claude Code, Codex, Cursor. Three different companies, no internal solution needed. For chip design, NVIDIA had to build their own. As CTO Bill Dally put it, the thing that makes ChipNeMo work is 30 years of design data that doesn&#8217;t exist anywhere else. General-purpose AI isn&#8217;t enough for this domain. Jensen knows that better than anyone making the headlines.</p><p>Now here&#8217;s the uncomfortable part for the rest of the industry: NVIDIA can do this because they&#8217;re NVIDIA. So can Intel, Apple, Qualcomm &#8212; a handful of companies with decades of proprietary design data at massive scale. Everyone else &#8212; and that&#8217;s the vast majority of chip design organizations &#8212; is working with what the commercially available models offer. Models trained overwhelmingly on software, not hardware. Models that have seen billions of lines of Python and JavaScript, and a vanishingly small amount of SystemVerilog. For these companies, the gap between the AI hype and what the tools can actually deliver today is even wider. The disillusionment hits harder, because there&#8217;s no internal corpus to fall back on.</p><p>But that nuance evaporates the moment the quote hits a headline. What engineers hear is: <em>even Jensen thinks we&#8217;re done</em>.</p><h2><strong>The 50x problem</strong></h2><p>Then there&#8217;s the productivity story, which makes things worse.</p><p>I wrote about this in detail in a <a href="https://www.linkedin.com/posts/marcobrambilla99_aiengineering-softwaredevelopment-chipdesign-activity-7442362228223348737-Z-iO">recent LinkedIn post</a>: the 50x productivity claims, the studies that contradict them, the perception gap where engineers <em>think</em> they&#8217;re faster but measurably aren&#8217;t. I won&#8217;t rehash all the data here &#8212; go read that post if you want the specifics.</p><p>The short version: the numbers don&#8217;t survive contact with reality, and they&#8217;re doing active damage when they get repeated uncritically in planning meetings.</p><p>But here&#8217;s what I want to add, because it&#8217;s the part that matters most for hardware: even in <em>software</em>, where you can test in seconds and ship a fix tomorrow, the productivity story is far messier than the headlines suggest. Now imagine applying those same inflated expectations to a domain where the planning alone takes months, where the verification pipeline runs overnight, and where a mistake in silicon costs tens of millions of dollars with no hotfix available.</p><p>The gap between what AI can do today and what chip design demands isn&#8217;t just a matter of model capability &#8212; it&#8217;s a matter of <em>planning complexity</em>. A chip is not a codebase you iterate on. 
It&#8217;s an artifact you must get right before it exists physically. That requires a level of upfront specification, constraint definition, and cross-domain coordination that has no equivalent in software. And that planning layer &#8212; the hardest, most valuable part of the work &#8212; is exactly the part that AI cannot do for you.</p><h2><strong>The damage on both sides</strong></h2><p>This matters because the distortion hits from two directions simultaneously.</p><p>Engineers hear the inflated claims and conclude they need to either become AI prompt wizards overnight or start updating their resumes. The anxiety is real &#8212; I talk to engineers who feel it. Some are paralyzed, unsure what to invest in learning. Others are churning out AI-generated code to look productive without understanding whether what they&#8217;re producing is actually correct. Neither response is healthy.</p><p>Managers hear the same claims and conclude they can do more with less. They walk into planning meetings expecting the mythical 50x and staff accordingly. When reality delivers something closer to 1.2x &#8212; with new categories of bugs to deal with &#8212; the gap between expectation and delivery creates its own set of problems. Teams get squeezed. Schedules get set to fantasy numbers. The engineers who are supposed to be benefiting from the tools end up under more pressure, not less.</p><p>The irony is that both sides are reacting rationally to bad information.</p><h2><strong>Where the engineer&#8217;s value actually lives</strong></h2><p>Moshe Zalcberg recently published <a href="https://moshezalcberg.substack.com/p/why-software-sprints-while-chip-design">an excellent analysis</a> of why AI adoption in chip design structurally lags software &#8212; the training data gap, the slow feedback loops, the correctness bar, the proprietary toolchains. I&#8217;d encourage you to read it; he&#8217;s right on all four counts, and I won&#8217;t repeat his argument here.</p><p>What I want to focus on is something different: the nature of the work itself, and specifically what happens <em>before</em> anyone writes a single line of RTL.</p><p>A chip design project begins with months of planning that has no real equivalent in software. Before a gate is synthesized, engineers must define clock architectures, reset strategies, power domains, interface protocols, and the timing relationships between all of them. They must specify what every block does, how blocks talk to each other, what assumptions each block makes about its neighbors, and what guarantees it provides in return. This is not boilerplate. This is the intellectual core of the work.</p><p>I&#8217;ll put it simply: <strong>the engineer&#8217;s job is moving up the abstraction stack, not disappearing.</strong></p><p>The AI can write your SystemVerilog. It can generate your SDC constraints. It can produce your SVA assertions. And it will remember the syntax options and corner-case flags that you forgot existed. That&#8217;s genuinely valuable.</p><p>But only the engineer can review a spec and determine whether the <em>intent</em> is correct. Only the engineer can look at a clock domain crossing definition and understand whether the synchronization strategy actually matches the system&#8217;s timing requirements. Only the engineer can evaluate whether a test plan covers the failure modes that matter, not just the ones that are easy to test.</p><p>The distinction is between <em>what</em> and <em>how</em>. AI is getting very good at <em>how</em>. 
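</p><p><em>(A small, hypothetical illustration of that split, with invented signal names: the sentence in the comment is the what, and only the engineer can decide it is the right thing to guarantee; the property under it is the how, the kind of output a frontier model will typically produce from that one sentence.)</em></p><pre><code>// The "what", in plain language:
//   once req rises, it must stay high until gnt arrives,
//   and gnt must arrive within 16 cycles.
//
// The "how": the same sentence as a formal, checkable property,
// wrapped in a checker module that can be bound alongside the RTL.
module req_gnt_checker (
  input logic clk,
  input logic rst_n,
  input logic req,
  input logic gnt
);
  property p_req_held_until_grant;
    @(posedge clk) disable iff (!rst_n)
      $rose(req) |-> req throughout (##[1:16] gnt);
  endproperty

  a_req_gnt: assert property (p_req_held_until_grant)
    else $error("req/gnt handshake contract violated");
endmodule</code></pre><p>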
The <em>what</em> &#8212; the specification, the intent, the architectural judgment &#8212; that&#8217;s where the engineer&#8217;s value lives, and it&#8217;s not going anywhere.</p><p>And here&#8217;s the part that gets underappreciated: that <em>what</em> layer doesn&#8217;t just need to be correct. It needs to be <em>precise enough that both humans and machines can execute against it</em>. A vague spec was tolerable when the same engineer who wrote it also wrote the RTL &#8212; the ambiguity lived in their head. In an AI-assisted workflow, ambiguity in the spec becomes bugs in the output. The spec has to become a contract: formal, unambiguous, and verifiable. That&#8217;s harder than writing the code, and it&#8217;s a skill the industry needs to develop deliberately.</p><h2><strong>What comes next</strong></h2><p>This is the first post in an ongoing series. Three threads will run through everything that follows.</p><p>The first is <strong>the spec as the nexus of the design</strong>. If the engineer&#8217;s job is shifting from <em>how</em> to <em>what</em>, then the spec &#8212; the formal expression of design intent &#8212; becomes the most important artifact in the entire flow. I&#8217;ll explore what a spec actually means in chip design, why it&#8217;s harder to write than the code it describes, and how concepts like Design by Contract apply to hardware interfaces: clock domains, resets, protocols. The spec isn&#8217;t just documentation. It&#8217;s the contract that everything else &#8212; implementation, verification, signoff &#8212; must be measured against.</p><p>The second, and arguably the bigger topic, is <strong>verification</strong>. Verification is where the majority of chip design effort and cost already lives, and it&#8217;s where AI has the most potential to change the economics &#8212; but only if we get the approach right. I&#8217;ll dig into how we actually measure AI&#8217;s effectiveness in a verification flow, how to write assertions in natural language and have them mean something formal, and how AI-assisted verification can close gaps that simulation alone never will. I&#8217;ll show working examples, including an AI-assisted CDC verification proof, to make this tangible rather than theoretical.</p><p>The third is <strong>the model matters more than you think</strong>. Not all AI is equal, and in chip design the differences are stark. I&#8217;ve tested smaller open-source models on something as fundamental as writing a clock domain synchronizer &#8212; and watched them use the wrong clock. Worse, when shown the error, they couldn&#8217;t understand what was wrong. The frontier models get this right every time. In a domain where a subtle bug costs millions, the gap between a model that understands hardware semantics and one that&#8217;s pattern-matching syntax is not academic &#8212; it&#8217;s existential. I&#8217;ll share concrete comparisons.</p><p>The goal isn&#8217;t to sell a methodology or push a product. It&#8217;s to have an honest, technically grounded conversation about where this is actually going &#8212; from someone who works in the trenches, not from a keynote stage.</p><p>If that sounds useful, stick around. And if you disagree with anything I&#8217;ve said here, I want to hear it. The best thinking comes from friction.</p><div><hr></div><p><em>Marco Brambilla is a semiconductor industry veteran with 25 years in chip design, most recently as Senior Technical Director at Meta Reality Labs. 
He writes about AI, chip design, and the future of hardware engineering at Above the RTL.</em></p>]]></content:encoded></item></channel></rss>