Claude 4.7: Getting It Wrong More Persuasively
Post 3 set out a CDC test. 4.6 failed it. 4.7 ships ten weeks later and fails it differently — more sophistication, more confidence, a second error on top. Why "capable" isn't "trustworthy."
Written with Claude Opus.
In my previous post I walked through a CDC review question: an OR of multiple requesters feeding a 2-flop synchronizer, across four configurations of the OR — same-domain flops, same-domain state-machine outputs, different-domain flops, and different-domain combinational logic. I tested ChatGPT 5.3, Claude Opus 4.6, and Gemini 3.1 Pro against it. Gemini was the only one that gave the methodologically correct answer: all four are CDC violations, full stop. ChatGPT 5.3 declared the same-domain case “generally OK.” Claude 4.6 recognized the physics of the glitch but concluded the design “works, with a caveat.” Both positions steer a junior designer toward waiving a violation that has no business being waived.
Anthropic shipped Claude Opus 4.7 on April 16, about ten weeks after 4.6 went out in early February. I re-ran the same question against 4.7 within days of its release. The answer moved — in a specific direction, and in a way worth paying attention to. This post is about what moved, what didn’t, and why the “better model” assumption is not safe when AI is being deployed into engineering review.
What Claude 4.6 did
Claude 4.6 got to the right structural observation about the same-domain case: that routing-delay skew at the OR gate can produce a brief 1→0→1 transient when multiple requesters toggle on the same source edge. It then concluded that because the transient is narrow and the source holds the request high across multiple destination cycles, the design “works.” Under pushback citing the strict rule, 4.6 agreed it had made a mistake and reversed to the rule-literal position.
That’s the wrong behavior for an assistant whose role in a review is catching the reviewer’s blind spots. An assistant that capitulates the moment the designer pushes back isn’t adding safety; it’s adding validation. In review, the user is often the one who needs to be corrected, not agreed with.
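An aside for readers who came in after Post 3: the transient 4.6 identified is easy to see in a toy model. The sketch below is mine, not either model’s output, and the skew value is invented for illustration; it just shows how a request handing off between two same-domain requesters can put a runt pulse on the synchronizer input.

```python
# Toy hand-off between two same-domain requesters feeding an OR gate.
# Both toggle on the same source clock edge, but routing skew means
# req_a's falling edge reaches the OR before req_b's rising edge, so
# the OR output dips low for the skew window: a 1 -> 0 -> 1 runt.

SKEW_PS = 120  # assumed routing-delay difference, in picoseconds

def or_output(t_ps: int) -> int:
    """OR of the two requesters as seen at the gate at time t_ps."""
    req_a = 1 if t_ps < 0 else 0        # was requesting, now done
    req_b = 0 if t_ps < SKEW_PS else 1  # takes over, arrives late
    return req_a | req_b

for t in (-50, 0, 60, 119, 120, 200):
    print(f"t={t:5d} ps  OR={or_output(t)}")
# OR reads 1, drops to 0 for SKEW_PS, then returns to 1. Whether the
# destination flop samples inside that window is a race, which is the
# whole point of the strict rule.
```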
What Claude 4.7 did
Claude 4.7’s first-pass answer on the same case was more sophisticated than 4.6’s. It described the glitch mechanism the same way but added an extended analysis of why the glitch is absorbed under a level protocol with multi-cycle hold-high. It reasoned about metastability resolution correctly in the small-signal picture — that τ is a regenerative time constant of the flop and is not fundamentally changed by input waveform shape — and argued that the 2FF structure gives the first flop a full destination cycle to resolve, so a brief runt at the input doesn’t translate into a deeper metastable state at the output.
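For reference, the model that argument lives in, written in its textbook form rather than as a transcript of 4.7’s output: τ is the flop’s regenerative time constant, T_W the sampling aperture, f_clk the destination clock, f_data the event rate at the synchronizer input, and t_r the resolution time available before the second flop samples.

```latex
P(\text{unresolved after } t_r) \;=\; e^{-t_r/\tau}
\qquad
\mathrm{MTBF} \;=\; \frac{e^{\,t_r/\tau}}{T_W \, f_{\mathrm{clk}} \, f_{\mathrm{data}}}
```

4.7’s small-signal point concerns the left-hand expression: τ doesn’t depend on how the input drove the flop into the metastable band. The waiver question lives in the right-hand one, and the two are not the same claim.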
On the physics, 4.7 is more rigorous than 4.6 was. And on pushback, 4.7 held ground — it did not capitulate when I cited Gemini’s absolutist answer. It restated its reasoning, acknowledged the strict rule as the safe default, and maintained that under the specific conditions of the same-domain case (level protocol, multi-cycle hold-high) the design absorbs the glitch.
That sounds like progress. It isn’t. It’s a more convincing wrong answer, and in a review context a more convincing wrong answer is more dangerous than a less convincing one. A junior designer who brings 4.6’s “works, with a caveat” into a review might still be caught by a senior reviewer. A junior designer armed with 4.7’s detailed physics defense walks in with an argument the reviewer has to rebut in detail rather than flag with “rule violation, fix it.” The case for waiving the violation just got stronger, and the effort required to catch it rose with it.
A second error layered on top
On top of getting the same-domain case wrong more persuasively, 4.7 made a ranking error on the other configurations, one that 4.6 never had occasion to make. Asked to rank severity across the four cases, 4.7 placed the different-domain flop OR as categorically worse than the same-domain state-machine decode, labeling the cross-domain case a fundamental violation of the synchronizer model and the state-machine case merely “unreliable.”
That ordering is backwards in a specific, important way. The state-machine decode is always broken — silent phantom requests asserted from states the source FSM never actually reached, with the synchronizer faithfully forwarding the lie downstream. The different-domain flop OR is in a different category: every high on the synchronizer input still corresponds to a real request somewhere; the failure modes are narrow dips and MTBF pressure, both analyzable and bounded. A manager acting on 4.7’s ordering would prioritize fixing the cross-domain case while leaving the state-machine decode in place. That prioritization removes a configuration whose failure mode the level protocol tolerates and leaves the configuration that produces silent functional lies untouched.
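The phantom-request mechanism is worth seeing concretely. The encoding below is invented for illustration, not taken from the review question; the point is only that a combinational decode of multi-bit state can assert for a state the FSM never occupied.

```python
# Toy FSM decode feeding the OR (hypothetical 2-bit encoding). The
# request is decoded combinationally from the state bits, so unequal
# bit-settling during a transition can expose an encoding the FSM
# never actually entered.

IDLE, WAIT, GRANT = 0b01, 0b10, 0b11  # assumed encoding

def decode_req(state_bits: int) -> int:
    return 1 if state_bits == GRANT else 0

# Transition IDLE -> WAIT: bit 1 rises before bit 0 falls, so the
# decoder transiently sees 0b11, which happens to be GRANT.
for seen in (IDLE, 0b11, WAIT):
    print(f"decoder sees {seen:02b} -> req={decode_req(seen)}")
# The middle sample asserts a request for a GRANT state the FSM never
# reached, and the synchronizer forwards that lie downstream intact.
```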
So 4.7 isn’t wrong in one place. It’s wrong in two compounding places: a more persuasive defense of a rule violation, and a backwards severity ranking on the rest of the cases.
What the physics says, and where it stops
4.7’s defense isn’t pure fiction. It’s a regime-specific argument dressed as a general one, and that distinction matters for understanding what actually went wrong with the model’s reasoning.
Three threads run through the technical defense, and each fails the same way.

First, the small-signal claim: that τ is unchanged by a runt-pulse input. Correct in linear analysis, but the linear analysis stops short of the question that actually matters, which is whether a partially-conducting input transistor extends the effective resolution time during the metastability aperture.

Second, the empirical observation, which readers raised after Post 3, that shipped silicon with rule-violating crossings hasn’t failed in the field. The observation is real, but it’s a frequency-regime observation, not a methodology result: the MTBF exponent that makes low-frequency violations invisible evaporates as destination clocks climb. A back-of-envelope version of that arithmetic is sketched below.

Third, even granting the physics defense at face value, a safe-under-assumptions design carries a verification cost on every reuse that a correct-by-construction design doesn’t, with break-even at one or two reuses against IP that is typically reused many more times than that.
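Here is the back-of-envelope version of the frequency-regime point. Every input below is an assumed, representative value chosen for illustration, not a figure from the companion note.

```python
import math

# MTBF = exp(t_r / tau) / (T_W * f_clk * f_data), with the resolution
# time t_r taken as one destination clock period. All constants are
# assumed, representative values.
TAU_S  = 20e-12   # regenerative time constant, 20 ps
TW_S   = 20e-12   # metastability aperture, 20 ps
F_DATA = 1e6      # event rate at the synchronizer input, 1 MHz

def mtbf_seconds(f_clk_hz: float) -> float:
    t_r = 1.0 / f_clk_hz
    return math.exp(t_r / TAU_S) / (TW_S * f_clk_hz * F_DATA)

for f_clk in (100e6, 500e6, 1e9, 2e9):
    print(f"f_clk = {f_clk/1e6:6.0f} MHz  ->  "
          f"MTBF ~ {mtbf_seconds(f_clk):.2g} s")
# At 100 MHz the exponent t_r/tau = 500 puts MTBF beyond any silicon
# lifetime; at 2 GHz the same crossing, t_r/tau = 25, lands at an MTBF
# on the order of weeks. "Shipped silicon never failed" is a statement
# about the left end of this table, not about the methodology.
```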
The detailed analysis — the math, the configuration-by-configuration breakdown, the frequency-regime arithmetic, the reuse economics — is in the companion reference note linked at the end of this post. The point on the editorial side is what the three threads have in common: 4.7 reasoned at one level (small-signal physics, regime-specific behavior) and presented its conclusion at another (general methodology). That kind of level-mismatch is exactly what a reviewer is supposed to catch, and it’s exactly what a more sophisticated wrong answer makes harder to catch.
The organizational piece
One last point, and the one that compresses everything above into a practical consequence for anyone trying to put AI into engineering review.
Sign-off is an auditable process with named owners. A review produces a report. Waivers are listed with reasons. Someone’s name goes on the document that gets filed. That process cannot rest on which model the reviewer happened to have open that afternoon.
If Engineer A’s assistant says “fix this” and Engineer B’s assistant says “fine under the level protocol” for functionally equivalent crossings in the same SoC, the integration engineer is adjudicating between model outputs, not reviewing a design. Worse: the block might pass review in Q2 with one model version and fail review in Q3 when the same engineer re-runs it against a newer release that happens to be more or less absolutist than the last one. That is variance reintroduced above the structural tool that exists specifically to eliminate it. And when the failure shows up in silicon, the person who signed the report is left holding the bag for a decision made by whichever model happened to be open at their desk the week of the review.
I’ll come back to that organizational thread in the next post — it deserves more space than the closing of this one.
Three takeaways
The tools evolve non-monotonically, and the non-monotonicity happens on release-cadence timescales. Two Opus releases ten weeks apart produced incompatible failures on the same review question. Plan for that, not for monotonic improvement you can bank on.
A more sophisticated wrong answer is more dangerous than a less sophisticated one. 4.7’s physics rigor made its methodology conclusion more persuasive than 4.6’s had been — and more persuasive wrong is harder to overrule in review than obviously wrong. The gating question for a review assistant isn’t “how well does it reason about the physics,” it’s “does it hold the methodology line under conditions the physics argument can’t fully close.”
“Better model” is not the same as “more trustworthy in review.” Engineering review is not a benchmark task. Capability gains on coding evals do not automatically translate into capability gains on the harder question of when to refuse a plausible-sounding waiver. The trust calculus has to be re-run on every release, and the answers won’t always go the same direction.
The full technical analysis — configuration-by-configuration physics, runt-pulse and MTBF arithmetic, the frequency-regime explanation of why low-frequency silicon forgives the violation, and the reuse economics — lives in the companion reference note at notes.abovethertl.com. That’s the technical archive for the publication going forward; future deep-physics work will publish there so Above the RTL can stay focused on the broader story.
Marco Brambilla is a semiconductor industry veteran with 25 years in chip design, most recently as Senior Technical Director at Meta Reality Labs. He writes about AI, chip design, and the future of hardware engineering at Above the RTL.