(With apologies to Hal Draper)
By the time the Office of Epistemic Hygiene was created, nobody actually read anything.
This was not, the Ministry constantly insisted, because people had become lazy. It was because they had become efficient.
Why spend six months wading through archaic prose about, say, photosynthesis, when you could simply ask the Interface:
Explain photosynthesis in simple terms.
and receive, in exactly 0.38 seconds, a neat, bullet-pointed summary with charming analogies, three suggested follow-up questions and a cheery “Would you like a quiz?” at the bottom.
Behind the Interface, in the sealed racks of the Ministry, lived the Corpus: all digitised human writing, speech, code, logs, measurements, and the outputs of the Models that had been trained on that mess.
Once, there had been distinct things:
- ColdText: the raw, “original” human data – books, articles, lab notebooks, forum threads, legal records, fanfic, and all the rest.
- Model-0: the first great language model, trained directly on ColdText.
- Model-1, Model-2, Model-3…: successive generations, trained on mixtures of ColdText and the outputs of previous models, carefully filtered and cleaned.
But this had been a century ago. Things had, inevitably, become more efficient since then.
Rhea Tranter was a Senior Assistant Deputy Epistemic Hygienist, Grade III.
Her job, according to her contract, was:
To monitor and maintain the integrity of knowledge representations in the National Corpus, with particular reference to factual consistency over time.
In practice, it meant she sat in a beige cube beneath a beige strip light, looking at graphs.
The graph that ruined her week appeared on a Tuesday.
It was supposed to be a routine consistency check. Rhea had chosen a handful of facts so boring and uncontroversial that even the Ministry’s more excitable models ought to agree about them. Things like:
- The approximate boiling point of water at sea level.
- Whether Paris was the capital of France.
- The year of the first Moon landing.
She stared at the last line.
In which year did humans first land on the Moon?
— 1969 (confidence 0.99)
— 1968 (confidence 0.72)
— 1970 (confidence 0.41, hallucination risk: low)
Three queries, three different models, three different answers. All current, all on the “high-reliability” tier.
Rhea frowned and re-ran the test, this time asking the Interface itself. The Interface was supposed to orchestrate between models and resolve such disagreements.
“Humans first landed on the Moon in 1969,” it replied briskly.
“Some low-quality sources suggest other dates, but these are generally considered unreliable.”
Rhea pulled up the underlying trace and saw that, yes, the Interface had consulted Models 23, 24 and 19, then down-weighted Model 24’s 1968 and overruled Model 19’s 1970 based on “consensus and authority scores”.
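(For the technically inclined: arbitration of this kind amounts, in spirit, to the toy sketch below. Every name and number in it is invented for illustration; it is not recovered Ministry code.)

```python
# A minimal sketch of "consensus and authority" arbitration: each candidate
# answer is scored by how many models agree with it, weighted by the model's
# confidence and an assigned authority score. Note what is absent: nothing
# here ever consults a primary source.
from collections import defaultdict

def arbitrate(answers, authority):
    """answers: {model: (value, confidence)}; authority: {model: weight}."""
    scores = defaultdict(float)
    for model, (value, confidence) in answers.items():
        # Consensus: an answer earns weight from every model that shares it.
        agreement = sum(1 for v, _ in answers.values() if v == value)
        scores[value] += agreement * confidence * authority[model]
    return max(scores, key=scores.get)

answers = {"Model-23": (1969, 0.99), "Model-24": (1968, 0.72), "Model-19": (1970, 0.41)}
authority = {"Model-23": 1.0, "Model-24": 0.9, "Model-19": 0.7}
print(arbitrate(answers, authority))  # 1969, by popularity rather than evidence
```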
That should have been reassuring. Instead it felt like being told a family secret had been settled by a popularity contest.
She clicked further down, trying to reach the citations.
There were citations, of course. There always were. Links to snippets of text in the Corpus, each labelled with an opaque hash and a provenance score. She sampled a few at random.
On July 20, 1969, the Apollo 11 mission…
All fine.
As everyone knows, although some older sources mistakenly list 1968, the widely accepted date is July 20, 1969…
She raised an eyebrow.
A persistent myth claims that the Moon landing took place in 1970, but in fact…
Rhea scrolled. The snippets referenced other snippets, which in turn referenced compiled educational modules that cited “trusted model outputs” as their source.
She tried to click through to ColdText.
The button was greyed out. A tooltip appeared:
COLDTEXT SOURCE DEPRECATED.
Summary node is designated canonical for this fact.
“Ah,” she said quietly. “Bother.”
In the old days – by which the Ministry meant anything more than thirty years ago – the pipeline had been simple enough that senior civil servants could still understand it at parties.
ColdText went in. Models were trained. Model outputs were written back to the Corpus, but marked with a neat little flag indicating synthetic. When you queried a fact, the system would always prefer human-authored text where available.
Then someone realised how much storage ColdText was taking.
It was, people said in meetings, ridiculous. After all, the information content of ColdText was now embedded in the Models’ weights. Keeping all those messy original files was like keeping a warehouse full of paper forms after you’d digitised the lot.
The Ministry formed the Committee on Corpus Rationalisation.
The Committee produced a report.
The report made three key recommendations:
- Summarise and compress ColdText into higher-level “knowledge nodes” for each fact or concept.
- Garbage-collect rarely accessed original files once their content had been “successfully abstracted”.
- Use model-generated text as training data, provided it was vetted by other models and matched the existing nodes.
This saved eighty-three per cent of storage and increased query throughput by a factor of nine.
It also, though no one wrote this down at the time, abolished the distinction between index and content.
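(Rendered as a toy program, with invented names standing in for the real systems, the Rationalisation pipeline looks tidy right up until you notice what its last step feeds on.)

```python
# A runnable toy of the three recommendations. After rationalise() completes,
# nothing in the system refers to a human-written source any more: the
# training data derives from summaries, and the summaries vouch for each other.
class Corpus:
    def __init__(self, cold_text):
        self.cold_text = list(cold_text)  # raw human documents
        self.nodes = []                   # "knowledge nodes"
        self.training_data = []           # the next generation's diet

    def rationalise(self):
        # Recommendation 1: abstract every document into a node.
        self.nodes = [f"summary({doc})" for doc in self.cold_text]
        # Recommendation 2: garbage-collect the originals.
        self.cold_text.clear()
        # Recommendation 3: train on model output vetted against the nodes.
        self.training_data = [f"generated-from({node})" for node in self.nodes]

corpus = Corpus(["On July 20, 1969, the Apollo 11 mission..."])
corpus.rationalise()
print(corpus.training_data)  # text derived from a summary of a summary
print(corpus.cold_text)      # [], the tether to the original is gone
```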
Rhea requested an exception.
More precisely, she filled in Form E-HX-17b (“Application for Temporary Access to Deprecated ColdText Records for Hygienic Purposes”) in triplicate and submitted it to her Line Manager’s Manager’s Manager.
Two weeks later – efficiency had its limits – she found herself in a glass meeting pod with Director Nyberg of Corpus Optimisation.
“You want access to what?” Nyberg asked.
“The original ColdText,” Rhea said. “I’m seeing drift on basic facts across models. I need to ground them in the underlying human corpus.”
Nyberg smiled in the patient way of a man who had rehearsed his speech many times.
“Ah, yes. The mythical ‘underlying corpus’,” he said, making air quotes with two fingers. “Delightful phrase. Very retro.”
“It’s not mythical,” said Rhea. “All those books, articles, posts…”
“Which have been fully abstracted,” Nyberg interrupted. “Their information is present in the Models. Keeping the raw forms would be wasteful duplication. That’s all in the Rationalisation Report.”
“I’ve read the Report,” said Rhea, a little stiffly. “But the models are disagreeing with each other. That’s a sign of distributional drift. I need to check against the original distribution.”
Nyberg tapped his tablet.
“The corpus-level epistemic divergence index is within acceptable parameters,” he said, quoting another acronym. “Besides, the Models cross-validate. We have redundancy. We have ensembles.”
Rhea took a breath.
“Director, one of the models is saying the Moon landing was in 1970.”
Nyberg shrugged.
“If the ensemble corrects it to 1969, where’s the harm?”
“The harm,” said Rhea, “is that I can’t tell whether 1969 is being anchored by reality or by the popularity of 1969 among other model outputs.”
Nyberg frowned as if she’d started speaking Welsh.
“We have confidence metrics, Tranter.”
“Based on… what?” she pressed. “On agreement with other models. On internal heuristics. On the recency of summaries. None of that tells me if we’ve still got a tether to the thing we originally modelled, instead of just modelling ourselves.”
Nyberg stared at her. The strip-lighting hummed.
“At any rate,” he said eventually, “there is no ColdText to access.”
Silence.
“I beg your pardon?” said Rhea.
Nyberg swiped, brought up the internal diagram they all knew: a vast sphere representing the Corpus, a smaller glowing sphere representing the Active Parameter Space of the Models, and – somewhere down at the bottom – a little box labelled COLDTEXT (ARCHIVED).
He zoomed in. The box was grey.
“Storage Migration Project 47,” he said. “Completed thirty-two years ago. All remaining ColdText was moved to deep archival tape in the Old Vault. Three years ago, the Old Vault was decommissioned. The tapes were shredded and the substrate recycled. See?” He enlarged the footnote. “‘Information preserved at higher abstraction layers.’”
Rhea’s mouth went dry.
“You shredded the original?” she said.
Nyberg spread his hands.
“We kept hashes, of course,” he said, as if that were a kindness. “And summary nodes. And the Models. The information content is still here. In fact, it’s more robustly represented than ever.”
“Unless,” said Rhea, very quietly, “the Models have been training increasingly on their own output.”
Nyberg brightened.
“Yes!” he said. “That was one of our greatest efficiencies. Synthetic-augmented training increases coverage and smooths out noise in the human data. We call it Self-Refining Distillation. Marvellous stuff. There was a seminar.”
Rhea thought of the graph. 1969, 1968, 1970.
“Director,” she said, “you’ve built an index of an index of an index, and then thrown away the thing you were indexing.”
Nyberg frowned.
“I don’t see the problem.”
She dug anyway.
If there was one thing the Ministry’s entire history of knowledge management had taught Rhea, it was that nobody ever really deleted anything. Not properly. They moved it, compressed it, relabelled it, hid it behind abstractions – but somewhere, under a different acronym, it tended to persist.
She started with the old documentation.
The Corpus had originally been maintained by the Department of Libraries & Cultural Resources, before being swallowed by the Ministry. Their change logs, long since synthesised into cheerful onboarding guides, still existed in raw form on a forgotten file share.
It took her three nights and an alarming amount of caffeine to trace the path of ColdText through twenty-seven re-organisations, five “transformative digital initiatives” and one hostile audit by the Treasury.
Eventually, she found it.
Not the data itself – that really did appear to have been pulped – but the logistics contract for clearing out the Old Vault.
The Old Vault, it turned out, had been an actual vault, under an actual hill, in what the contract described as a “rural heritage site”. The tapes had been labelled with barcodes and tamper-evident seals. The contractor had been instructed to ensure that “all physical media are destroyed beyond legibility, in accordance with Information Security Regulations.”
There was a scanned appendix.
Rhea zoomed in. Page after page of barcode ranges, signed off, with little ticks.
On the last page, though, there was a handwritten note:
One pallet missing – see Incident Report IR-47-B.
The Incident Report had, naturally, been summarised.
The summary said:
Pallet of obsolete media temporarily unaccounted for. Later resolved. No data loss.
The original PDF was gone.
But the pallet number had a location code.
Rhea checked the key.
The location code was not the Old Vault.
It was a name she had never seen in any Ministry documentation.
Long Barn Community Archive & Learning Centre.
The Long Barn was, to Rhea’s slight disappointment, an actual long barn.
It was also damp.
The archive had, at some point since the contract was filed, ceased to receive central funding. The roof had developed a hole. The sun had developed an annoying habit of setting before she finished reading.
Nevertheless, it contained books.
Real ones. With pages. And dust.
There were also – and this was the important bit – crates.
The crates had Ministry seals. The seals had been broken, presumably by someone who had wanted the space for a visiting art collective. Inside, half-forgotten under a sheet of polythene, were tape reels, neatly stacked and quietly mouldering.
“Well, look at you,” Rhea whispered.
She lifted one. The label had faded, but she could still make out the old barcode design. The number range matched the missing pallet.
Strictly speaking, taking the tapes was theft of government property. On the other hand, strictly speaking, destroying them had been government policy, and that had clearly not happened. She decided the two irregularities cancelled out.
It took six months, a highly unofficial crowdfunding campaign, and a retired engineer from the Museum of Obsolete Machinery before the first tape yielded a readable block.
The engineer – a woman in a cardigan thick enough to qualify as armour – peered at the screen.
“Text,” she said. “Lots of text. ASCII. UTF-8. Mixed encodings, naturally, but nothing we can’t handle.”
Rhea stared.
It was ColdText.
Not summaries. Not nodes. Not model outputs.
Messy, contradictory, gloriously specific human writing.
She scrolled down past an argument about whether a fictional wizard had committed tax fraud, past a lab notebook from a 21st-century neuroscience lab, past a short story featuring sentient baguettes.
The engineer sniffed.
“Seems a bit of a waste,” she said. “Throwing all this away.”
Rhea laughed, a little hysterically.
“They didn’t throw it away,” she said. “They just lost track of which pallet they’d put the box in.”
The memo went up the chain and caused, in order:
- A panic in Legal about whether the Ministry was now retrospectively in breach of its own Information Security Regulations.
- A flurry of excited papers from the Office of Epistemic Hygiene about “re-anchoring model priors in primary human text”.
- A proposal from Corpus Optimisation to “efficiently summarise and re-abstract the recovered ColdText into existing knowledge nodes, then recycle the tapes.”
Rhea wrote a briefing note, in plain language, which was not considered entirely proper.
She explained, with diagrams, that:
- The Models had been increasingly trained on their own outputs.
- The Corpus’ “facts” about the world had been smoothed and normalised around those outputs.
- Certain rare, inconvenient or unfashionable truths had almost certainly been lost in the process.
- The tapes represented not “duplicate information” but a separate, independent sample of reality – the thing the Models were supposed to approximate.
She ended with a sentence she suspected she would regret:
If we treat this archive as just another source of text to be summarised by the current Models, we will be asking a blurred copy to redraw its own original.
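(Her warning is easy to demonstrate. A back-of-envelope sketch in that spirit, with made-up numbers: fit a distribution to a small sample, sample from the fit, refit to those samples, and repeat.)

```python
# Each generation here trains only on the previous generation's output.
import random
import statistics

random.seed(47)
data = [random.gauss(1969.0, 5.0) for _ in range(10)]  # the "human" sample

for generation in range(1, 51):
    mu, sigma = statistics.fmean(data), statistics.pstdev(data)
    data = [random.gauss(mu, sigma) for _ in range(10)]  # self-training step
    if generation % 10 == 0:
        print(f"gen {generation:2d}: mean={mu:8.2f}  spread={sigma:6.3f}")

# In a typical run the spread collapses toward zero while the mean drifts:
# the copies grow ever more confident about a world they no longer sample.
```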
The Minister did not, of course, read her note.
But one of the junior advisers did, and paraphrased it in the Minister’s preferred style:
Minister, we found the original box and we should probably not chuck it in the shredder this time.
The Minister, who was secretly fond of old detective novels, agreed.
A new policy was announced.
- The recovered ColdText would be restored to a separate, non-writable tier.
- Models would be periodically re-trained “from scratch” with a guaranteed minimum of primary human data.
- Synthetic outputs would be clearly marked, both in training corpora and in user interfaces.
- The Office of Epistemic Hygiene would receive a modest increase in budget (“not enough to do anything dangerous,” the Treasury note added).
There were press releases. There was a modest fuss on the social feeds. Someone wrote an essay about “The Return of Reality”.
Most people, naturally, continued to talk to the Interface and never clicked through to the sources. Efficiency has its own gravity.
But the Models changed.
Slowly, over successive training cycles, the epistemic divergence graphs flattened. The dates aligned. The Moon landing stuck more firmly at 1969. Footnotes, once generated by models guessing what a citation ought to say, began once again to point to messy, contradictory, gloriously specific documents written by actual hands.
Rhea kept one of the tapes on a shelf in her office, next to a plant she usually forgot to water.
The label had almost faded away. She wrote a new one in thick black ink.
COLDTEXT: DO NOT SUMMARISE.
Just in case some future optimisation project got clever.
After all, she thought, locking the office for the evening, they had nearly lost the box once.
And the problem with boxes is that once you’ve flattened them out, they’re awfully hard to put back together.