Labs · open research
Labs.
Experiments currently running on the practice. The hypothesis, the method, what gets published, and when. Built so a reader can argue with me before the results land.
Claude vs GPT on brand-voice consistency — 30-day evaluation
Hypothesis. Given the same system prompt, the same brand-voice guidelines, and the same source material, Claude holds brand voice more consistently across long documents, while GPT wins on short headlines. The measure: how many sentences need rewriting before the team would publish.
Method.
- Pull 25 archived Margin Notes paragraphs (deck + body lines) and strip the original phrasing.
- Feed each to Claude and GPT with the same 400-word brand-voice system prompt.
- Score each output 1–3 on: voice match, claim accuracy, would-publish-as-is.
- Repeat across three model versions of each over the month.
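A minimal sketch of how the scores above could be rolled up into a scorecard, assuming one record per output. The names here (`Score`, `scorecard`, the `model` and `version` labels) are hypothetical and not part of the actual tooling, and the sketch does not assume which end of the 1–3 scale is better, since the method does not say.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The three criteria named in the method, each scored 1-3.
CRITERIA = ("voice_match", "claim_accuracy", "publish_as_is")


@dataclass(frozen=True)
class Score:
    """One scored output (hypothetical record shape)."""
    model: str        # e.g. "claude" or "gpt"
    version: str      # placeholder label for a model version
    sample_id: int    # which of the 25 paragraphs this output came from
    voice_match: int
    claim_accuracy: int
    publish_as_is: int

    def __post_init__(self):
        # Reject anything outside the 1-3 scale before it reaches the scorecard.
        for name in CRITERIA:
            value = getattr(self, name)
            if value not in (1, 2, 3):
                raise ValueError(f"{name} must be 1, 2 or 3, got {value!r}")


def scorecard(scores):
    """Average each criterion per (model, version) pair."""
    grouped = defaultdict(list)
    for s in scores:
        grouped[(s.model, s.version)].append(s)
    return {
        key: {name: round(mean(getattr(s, name) for s in rows), 2)
              for name in CRITERIA}
        for key, rows in grouped.items()
    }
```

Grouping by model and version keeps the three monthly repeats separate, so a shift between versions stays visible instead of being averaged away.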
What I'm publishing. The cleaned scorecard (anonymised samples) at the end of the 30 days. If a model materially surprises me before then, I'll publish an interim Field Notes.