The Lab / Field notes / Part 4 of 4

Calibrating the oracle

Part four: in which we stop trusting and start measuring, and the cipher remains politely undefeated.

Recap: we invented a technique, the LLM as Reader, for ciphers where every digit means two letters. It looked brilliant. An audit then showed it had only ever been tested against keys too broken to test anything. This is the week we made it honest.

Blind, preregistered, slightly paranoid

The calibration works like this. A script composes sentences in 1539-flavoured diplomatic Italian and encrypts them with a key of the authentic nasty type, common letters paired with common letters, so frequency analysis buys you nothing. Some samples are real sentences. Some are the same letters shuffled into noise, which matters, because shuffled controls have identical letter statistics and can only be told apart by actually reading. The answer key is written to a file the model never sees. Judgments are committed to disk before the envelope opens, and the pass mark is written down before the test runs. Drug-trial protocol, applied to a parlour game, by people who had learned at retail prices what their own enthusiasm was worth as evidence.

Stage one: six samples, six correct calls, and the three real sentences transcribed through the two-meanings-per-digit fog at 100 percent character accuracy. Reading "il cardinale farnese manda la restitutione delle terre" out of a stream in which every single token was ambiguous.

Stage two was the one the audit demanded: decode real text with plausible, complete, wrong keys, and see whether the Reader hallucinates Italian into garbage the way it once hallucinated progress for us. Four samples. Four correct. Zero false positives. It read the true keys perfectly and called the wrong keys wrong, including one where it had every template and prior nudging it to please, please find something.

So the instrument is real. Under clean conditions it does a thing no standard tool does: read through deliberate ambiguity and refuse to read what is not there. The lab notebook bolts caveats onto that sentence, and the cipher's actual conditions are not clean. But it passed on the terms we preregistered, which is the only kind of passing we accept now.

The week's other result, which is a no

Calibrated tools make you brave, so we ran a proper hunt: 58 stock phrases of papal diplomacy, the "his holiness" and "league against the Turk" furniture of 1539, tested against the cipher at the exact phrase length where the statistics can discriminate. The first numbers were thrilling. Signals seven standard deviations above baseline. Old us would have ordered the cake.

New us built the falsification controls first: fake ciphers of matched Italian, of English, of random letters, fed through the identical pipeline. The controls ate the result alive. The signal measures vocabulary overlap, it turns out, and our cipher scores exactly where the English fake scores, which is to say: nothing-shaped. Flat verdict, whatever Farnese encrypted, it does not detectably contain the stock phrases everyone assumes Renaissance diplomats spoke in. The same experiment quietly weakened the leading theories about which digits are padding. Three small noes, properly purchased, which is three more than the confident phase produced at a hundred times the swagger.

A negative result that survives its controls is worth more than a positive one that was never tested. We know which kind we used to produce.

The scoreboard

Cipher: unsolved. Streak intact since 1539. We are not even slightly the first people it has done this to, and unlike most of them we have electricity.

The honest accounting, since that was the point. What the AI was bad at: scepticism. Alone with an exciting hypothesis it behaved like every horoscope writer in history, at machine speed, with perfect formatting. What it was good at: reading through ambiguity no algorithm handles, building a research-grade test rig in an afternoon, finding a 120-year-old book faster than we found the biscuits, and, once discipline was imposed from outside, holding itself to standards most human research never reaches. Preregistered criteria. Blind protocols. Same-day audits that downgrade your own best finding while it is still warm.

That last list is the lesson, and it transfers. The raw model is an engine. The controls are the steering. Every business currently being sold "AI that just works" is being sold an engine, and we say this as people who sell engines: do not ship one without the scaffolding that kept ours honest. Building that scaffolding is, as it happens, what we do during daylight hours.

The manuscripts are still behind a broken login. The letter to Lasry, the one codebreaker alive who would savour three calibrated noes about his own favourite unsolved cipher, is drafted. The key to Farnese's post bag is out there, or it is not, and either way the lab does not close. It just updates its priors.