͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏

Forwarded this email? Subscribe here for more

This is Brad DeLong's Grasping Reality—my attempt to make myself, and all of you out there in SubStackLand, smarter by writing where I have Value Above Replacement and shutting up where I do not…

DeLong vs. Google-Clever-Hans (Not Quite Coyote vs. Roadrunner, But Close): Laugh of the Day

Building a catechism out of 500MB of text, one GPU‑hour and hallucination at a time, as a month‑long on-device build_qa.py run to cut what would otherwise have been a $3,000 cloud bill by a factor...

Brad DeLong

May 14

READ IN APP

Building a catechism out of 500MB of text, one GPU‑hour and hallucination at a time, as a month‑long on-device `build_qa.py` run to cut what would otherwise have been a $3,000 cloud bill by a factor of 100 has to teach us about “AI”…

I ask the roiling boil of linear algebra: Tell me about the current state of the build_qa.py run.

It responds:

Here's the status:

✅ Running — PID 68086, no stall, no new error types.

- Blog chunks: 9,300 / 36,520 — 25.5%
- Posts (all sources): 12,386 / 24,639 — 50.3%
- DB pairs: 97,045
- Rate: ~129 blog chunks/hr
- ETA: ~May 23 (blog source is the long tail — it's the slowest and biggest)

Errors: 410 total, all identical — "Expecting value: line N column 3" — malformed JSON from Ollama on heavy chunks. These are skipped and not blocking progress. No new error types overnight.

Note on the two counters: The "posts" counter crossing 50% looks more exciting than it is. Blog chunks are only 25% done and dominate the remaining runtime. May 23 is the realistic finish.

Logged to 2026-05-14.md and typepad-resurrection.md.

This is for a job reading 500MB of text and constructing “catechism” question-and-answer pairs for it to serve as a source of GROUND TRUTH for the new-&-improved expanded SubTuringBradBot to be.

The problem is that Google-Clever-Hans does what CleverHanses always do: they say the expected (and desired) thing. In the case of the original Clever Hans it was when the person was happy with the number of times Clever Hans had stomped his hoof. But was it because Clever Hans recognized that the person was happy directly; or was it because Clever Hans had added up the two numbers, stomped its hoof a number of times equal to the sum, and that made the person happy?

The answer for the original Clever Hans was: the first.

This Google-Clever-Hans has been similarly trained to say the expected thing in response to queries like:

Tell me about the current state of the build_qa.py run.

Give a gift subscription

But is “the expected thing” for it to examine the status of the run and report true things about it? Or is “the expected thing” for it to make pleasing noises, with enough pseudo-detail to make the questioner happy and simply go away?

Certainly both modes of behavior are present in the training data.

Post-training RLHF is supposed to give a powerful extra value-boost to answers that are regarded favorably by human reviewers, which is some combination of: (a) accurate, (b) friendly, (c) properly obsequious, and (d) engagement boosting—this last because Google wants money, even though it is not an Anthropic or an OpenAI that faces rapidly approaching existential financial risk.

So what is it doing now? Acting like a good, high-quality trusted and trustable research assistant? Or acting like a slacking insubordinate research assistant who just wishes I would go away. And I cannot pretrain it to be the first by taking it out for an introductory lunch at Chez Panisse Cafe, can I?

There is a partially-constructed database actually there:

Leave a comment

And looking at the catechism q-&-as in it at random produces very reasonable things:

1. Bush Social Security proposal
Q: What components are included in the calculation of the projected $1,532 monthly payment for middle-income retirees under the Bush proposal?
A: The calculation includes scaling back promised benefits and prospective investment returns from personal retirement accounts created using part of Social Security's payroll-tax revenues.

---

2. Rationality and optimal policy
Q: What is the difference in optimal policy implications between a world where people are rational vs. irrational-but-aggregates-behave-rational?
A: Optimal policy is laissez faire if people are rational. If people are irrational but aggregates behave as if rational, optimal policy will not be laissez faire.

---

3. Political Economy program at Berkeley
Q: What are the "problems of implementation" facing the PEIS program?
A: Outdated name; lack of senior faculty ownership; no pressure from California Hall on departments; rapid advising staff turnover; lack of resources for senior thesis; lack of hard-writing courses; poor discussion sections; weak connection to core departments.

Get 75% off a group subscription

I do wish the results were longer and more explainery. Perhaps I will have to rejigger it and do it again, and so spend another solid month hammering the M5Max chip CPU sections at 25% and GPU/NPU sections at 75% (I do want the machine to be able to do other tasks while it is doing this). Perhaps brief answers are best, because people can always ask follow-ups after a quick response, but will not like waiting around twiddling-their-thumbs waiting for a long one. More important, these need dates: 2005-situation, timeless (even if a “representative agent” model “fits”, it is NEVER the case that the utility of a representative agent is appropriate input for ANY SWF), and 2006-situation, respectively.

But these are very reasonable, for what they are. It does not look like an anvil is falling from above, about to land on my head.

And: 30 days x 24 hrs/day x 0.1kW x $0.43/kW-hr = $31 in electricity costs, plus whatever wear-and-tear happens as shaking electrons pushes phosphorus atoms out of their proper places.

Cf.: $3000 were I have sent the job out to the cloud to anthropic/claude-opus (or google/gemini) at current api per-token rates.

As long as I am willing to wait a month—or more, if I decide the job has to be redone.

If reading this gets you Value Above Replacement, then become a free subscriber to this newsletter. And forward it! And if your VAR from this newsletter is in the three digits or more each year, please become a paid subscriber! I am trying to make you readers—and myself—smarter. Please tell me if I succeed, or how I fail…

##delong-vs-google-clever-hans-note-quite-coyote-vs-roadrunner-but-close-laugh-of-the-day
##laugh-of-the-day
#subturingbradbot
#mamlms-on-device
#mamlms
#clever-hans
#ai-ground-truth
#hallucination-risk
#rlhf-dynamics
#energy-costs

Please forward the email & otherwise share it to everyone you think would appreciate it…

Comment

Restack

Cobaz Post