keep score in public

The predictions ledger

Every issue carries a falsifiable call with a stated confidence. Every due call gets settled, right or wrong, and scored with Brier — lower is better, 0.25 is a coin flip. Credibility is the product.

22open

0–0 record

— mean Brier

0settled

Calibration

The open book by confidence — and, as calls settle, whether the stated confidence matches the real hit-rate (bar = actual, tick = claimed).

Confidence	Open	Settled	Actual vs claimed
40–50%	1	·	—
50–60%	4	·	—
60–70%	7	·	—
70–80%	7	·	—
80–100%	2	·	—

— no calls settled yet · 22 open —

The ledger

Made	Prediction	Conf.	Due	Status	Brier
2026-W23	GitHub partially walks back Copilot pricing (extends promo credits past Aug, restores fallback model, or cuts the Opus multiplier) within 30 days, without reversing metering itself.	70%	2026-07-05	● open	—
Dive · 2026-06-07	At least two of {GitHub, Cursor, Anthropic} ship an "unlimited on our own/house models" flat tier (subsidy internalized, frontier stays metered).	65%	2027-Q1	● open	—
Dive · 2026-06-10	GitHub/npm ship branch/ref binding for OIDC trusted publishing (the actual Miasma hole) — and a worm generation defeats npm v12's script-off default before that ships.	55%	2026-Q4	● open	—
Dive · 2026-06-08	A major cloud or agent platform ships an enforced hard per-task/per-agent spend ceiling (not a budget alert) that the agent cannot cross.	45%	2027-Q2	● open	—
Dive · 2026-06-08	"Agent liability" insurance appears OR a cloud publishes a runaway-agent forgiveness policy, mandating spend caps/observability as a condition.	55%	2027-Q2	● open	—
Dive · 2026-06-09	The top frontier-vs-best-open-model spread on a major agentic benchmark (SWE-bench/MCPMark/Terminal-Bench) stays inside ~5 pts — no lab reopens a durable capability gap, confirming the channel (not the model) is the moat.	70%	2027-Q1	● open	—
Dive · 2026-06-12	A contamination-resistant benchmark (SWE-bench Pro / SWE-rebench or successor) does NOT reproduce SWE-bench Verified's top-5 model ordering — decontamination changes rank, not just absolute scores.	65%	2027-Q1	● open	—
Dive · 2026-06-11	No venture-funded independent LLM gateway/observability/eval company reaches a standalone outcome (IPO or $1B+ while independent) — the next two notable outcomes are absorptions by a model vendor or data/monitoring platform, or wind-downs.	65%	2027-Q2	● open	—
Dive · 2026-06-16	No model ranked in the top tier of a major agentic benchmark (SWE-bench/MCPMark/Terminal-Bench or successor) ships meeting OSAID 1.0 in full — weights, data information, and complete training code under an OSI-approved license. "Open source AI" releases stay open-weight-only.	80%	2027-Q1	● open	—
Dive · 2026-06-17	A sub-35B open-weight coding model fits a single 24GB card with a usable 128K context AND lands within ~10 points of that quarter's top frontier model on a contamination-resistant agentic benchmark (SWE-rebench / SWE-bench Pro) — i.e. the runnable-vs-competitive gap finally closes on consumer hardware.	35%	2027-Q1	● open	—
Dive · 2026-06-18	Anthropic ships automatic/implicit prompt caching — a cache hit without a manually placed breakpoint — on at least one default API path, converging toward the zero-config model OpenAI, DeepSeek, and Gemini already use, because the gap between the advertised 90% discount and the realized hit rate is a cost-perception liability.	55%	2027-Q1	● open	—
Dive · 2026-06-19	Multi-agent / A2A-style agent-to-agent coordination does NOT become the default shipped production-agent pattern; single-context loops plus tool-calling (the MCP rung) stay dominant, and A2A stays enterprise-announced rather than developer-used — no broad practitioner-usage signal emerges.	75%	2027-Q1	● open	—
Dive · 2026-06-20	Claude Code surfaces auto-compaction control as a documented, first-class setting — a configurable threshold or a "manual / safe-point-only" compaction mode in /config or the official docs — rather than today's undocumented env var and reverse-engineered buffer numbers.	55%	2027-Q1	● open	—
Dive · 2026-06-21	The next frontier-tier open-weight model release (top ~5 on the intelligence index) ships with an activation ratio at or below ~6% (active ÷ total parameters), continuing the Mixtral 27.6% → DeepSeek-V3 / GLM-5.2 ~5.4% sparsification trend; none re-ships above ~15%.	70%	2027-Q1	● open	—
2026-W25	At least one major commercial AI vendor (Anthropic, OpenAI, Google, or Microsoft) ships or formally announces a customer-facing multi-provider / bring-your-own-model fallback in a first-party developer product — positioning provider-portability as resilience against the switch-off risk the Fable 5 / Mythos 5 export ban made concrete.	60%	2026-09-20	● open	—
Dive · 2026-06-22	Prompt-and-tool portability stays a manual re-eval problem — no cross-provider standard or vendor feature lets a non-trivial agent's prompt-plus-toolset move between two frontier providers and reproduce its eval scores within a small margin without per-model retuning. Gateways keep normalizing API syntax; behavior still requires bespoke adaptation.	65%	2027-Q1	● open	—
Dive · 2026-06-23	No agent-native version-control system (Oak, jj-style, or similar) displaces git worktrees as the default file-isolation primitive for parallel coding agents. The major agent harnesses (Claude Code, Cursor, and peers) keep building parallel-session isolation on git worktrees — not a non-git store — in their shipped defaults.	80%	2027-Q1	● open	—
Dive · 2026-06-24	Speculative decoding stays a single-stream / low-QPS latency trick. No widely-deployed variant delivers a greater-than-~1.5× throughput gain at high batch (≥64 concurrent requests) on a frontier-class model; at saturation, batching remains the dominant amortization of the weight read and the high-batch throughput multiple stays under roughly 1.5×.	70%	2027-Q1	● open	—
Dive · 2026-06-25	Claude Code does not ship a lossless / auditable auto-compaction — one that writes its kept-set to a user-inspectable file AND reliably preserves the decision rationale (the "why," not just paths and names) — so the manual dump-to-markdown + /clear handoff stays the practitioner default for long, multi-step tasks.	65%	2027-Q1	● open	—
Dive · 2026-06-26	No major agent harness (Claude Code, Cursor, Codex, or peers) ships automatic tool-call deduplication — collapsing identical repeated tool invocations within a session so a retried mutating call executes once — as a documented default. Retry-safety stays the tool author's responsibility via idempotency keys / unique constraints, and the harness's only built-in stays blunt refusal of destructive operations (v2.1.183-style).	70%	2027-Q1	● open	—
Dive · 2026-06-27	No closed frontier lab (Anthropic, OpenAI, or Google) widens default-path logprob exposure beyond today's limits (Anthropic: none; OpenAI: top-20) for its flagship models — the dense soft-target leak stays closed, leaving black-box output imitation as the only available distillation route against closed frontier models.	75%	2027-Q1	● open	—
Dive · 2026-06-28	DeepSeek's permanent V4-Pro price floor (~$0.44 input / $0.87 output per Mtok) does NOT ratchet materially back up (>25% on either leg) within two quarters — the open-weight-pinned floor proves structural, not a promo — AND no closed frontier lab (OpenAI/Anthropic) cuts its flagship API token price to within ~2× of that floor over the same window; they hold a premium and segment to capability instead.	65%	2027-Q1	● open	—

Source of truth: _data/predictions.yml. Working memory: editorial memory.