keep score in public
The predictions ledger
Every issue carries a falsifiable call with a stated confidence. Every due call gets settled, right or wrong, and scored with Brier — lower is better, 0.25 is a coin flip. Credibility is the product.
22open
0–0 record
— mean Brier
0settled
Calibration
The open book by confidence — and, as calls settle, whether the stated confidence matches the real hit-rate (bar = actual, tick = claimed).
| Confidence | Open | Settled | Actual vs claimed |
|---|---|---|---|
| 40–50% | 1 | · | — |
| 50–60% | 4 | · | — |
| 60–70% | 7 | · | — |
| 70–80% | 7 | · | — |
| 80–100% | 2 | · | — |
— no calls settled yet · 22 open —
The ledger
| Made | Prediction | Conf. | Due | Status | Brier |
|---|---|---|---|---|---|
| 2026-W23 | GitHub partially walks back Copilot pricing (extends promo credits past Aug, restores fallback model, or cuts the Opus multiplier) within 30 days, without reversing metering itself. | 70% | 2026-07-05 | ● open | — |
| Dive · 2026-06-07 | At least two of {GitHub, Cursor, Anthropic} ship an "unlimited on our own/house models" flat tier (subsidy internalized, frontier stays metered). | 65% | 2027-Q1 | ● open | — |
| Dive · 2026-06-10 | GitHub/npm ship branch/ref binding for OIDC trusted publishing (the actual Miasma hole) — and a worm generation defeats npm v12's script-off default before that ships. | 55% | 2026-Q4 | ● open | — |
| Dive · 2026-06-08 | A major cloud or agent platform ships an enforced hard per-task/per-agent spend ceiling (not a budget alert) that the agent cannot cross. | 45% | 2027-Q2 | ● open | — |
| Dive · 2026-06-08 | "Agent liability" insurance appears OR a cloud publishes a runaway-agent forgiveness policy, mandating spend caps/observability as a condition. | 55% | 2027-Q2 | ● open | — |
| Dive · 2026-06-09 | The top frontier-vs-best-open-model spread on a major agentic benchmark (SWE-bench/MCPMark/Terminal-Bench) stays inside ~5 pts — no lab reopens a durable capability gap, confirming the channel (not the model) is the moat. | 70% | 2027-Q1 | ● open | — |
| Dive · 2026-06-12 | A contamination-resistant benchmark (SWE-bench Pro / SWE-rebench or successor) does NOT reproduce SWE-bench Verified's top-5 model ordering — decontamination changes rank, not just absolute scores. | 65% | 2027-Q1 | ● open | — |
| Dive · 2026-06-11 | No venture-funded independent LLM gateway/observability/eval company reaches a standalone outcome (IPO or $1B+ while independent) — the next two notable outcomes are absorptions by a model vendor or data/monitoring platform, or wind-downs. | 65% | 2027-Q2 | ● open | — |
| Dive · 2026-06-16 | No model ranked in the top tier of a major agentic benchmark (SWE-bench/MCPMark/Terminal-Bench or successor) ships meeting OSAID 1.0 in full — weights, data information, and complete training code under an OSI-approved license. "Open source AI" releases stay open-weight-only. | 80% | 2027-Q1 | ● open | — |
| Dive · 2026-06-17 | A sub-35B open-weight coding model fits a single 24GB card with a usable 128K context AND lands within ~10 points of that quarter's top frontier model on a contamination-resistant agentic benchmark (SWE-rebench / SWE-bench Pro) — i.e. the runnable-vs-competitive gap finally closes on consumer hardware. | 35% | 2027-Q1 | ● open | — |
| Dive · 2026-06-18 | Anthropic ships automatic/implicit prompt caching — a cache hit without a manually placed breakpoint — on at least one default API path, converging toward the zero-config model OpenAI, DeepSeek, and Gemini already use, because the gap between the advertised 90% discount and the realized hit rate is a cost-perception liability. | 55% | 2027-Q1 | ● open | — |
| Dive · 2026-06-19 | Multi-agent / A2A-style agent-to-agent coordination does NOT become the default shipped production-agent pattern; single-context loops plus tool-calling (the MCP rung) stay dominant, and A2A stays enterprise-announced rather than developer-used — no broad practitioner-usage signal emerges. | 75% | 2027-Q1 | ● open | — |
| Dive · 2026-06-20 | Claude Code surfaces auto-compaction control as a documented, first-class setting — a configurable threshold or a "manual / safe-point-only" compaction mode in /config or the official docs — rather than today's undocumented env var and reverse-engineered buffer numbers. | 55% | 2027-Q1 | ● open | — |
| Dive · 2026-06-21 | The next frontier-tier open-weight model release (top ~5 on the intelligence index) ships with an activation ratio at or below ~6% (active ÷ total parameters), continuing the Mixtral 27.6% → DeepSeek-V3 / GLM-5.2 ~5.4% sparsification trend; none re-ships above ~15%. | 70% | 2027-Q1 | ● open | — |
| 2026-W25 | At least one major commercial AI vendor (Anthropic, OpenAI, Google, or Microsoft) ships or formally announces a customer-facing multi-provider / bring-your-own-model fallback in a first-party developer product — positioning provider-portability as resilience against the switch-off risk the Fable 5 / Mythos 5 export ban made concrete. | 60% | 2026-09-20 | ● open | — |
| Dive · 2026-06-22 | Prompt-and-tool portability stays a manual re-eval problem — no cross-provider standard or vendor feature lets a non-trivial agent's prompt-plus-toolset move between two frontier providers and reproduce its eval scores within a small margin without per-model retuning. Gateways keep normalizing API syntax; behavior still requires bespoke adaptation. | 65% | 2027-Q1 | ● open | — |
| Dive · 2026-06-23 | No agent-native version-control system (Oak, jj-style, or similar) displaces git worktrees as the default file-isolation primitive for parallel coding agents. The major agent harnesses (Claude Code, Cursor, and peers) keep building parallel-session isolation on git worktrees — not a non-git store — in their shipped defaults. | 80% | 2027-Q1 | ● open | — |
| Dive · 2026-06-24 | Speculative decoding stays a single-stream / low-QPS latency trick. No widely-deployed variant delivers a greater-than-~1.5× throughput gain at high batch (≥64 concurrent requests) on a frontier-class model; at saturation, batching remains the dominant amortization of the weight read and the high-batch throughput multiple stays under roughly 1.5×. | 70% | 2027-Q1 | ● open | — |
| Dive · 2026-06-25 | Claude Code does not ship a lossless / auditable auto-compaction — one that writes its kept-set to a user-inspectable file AND reliably preserves the decision rationale (the "why," not just paths and names) — so the manual dump-to-markdown + /clear handoff stays the practitioner default for long, multi-step tasks. | 65% | 2027-Q1 | ● open | — |
| Dive · 2026-06-26 | No major agent harness (Claude Code, Cursor, Codex, or peers) ships automatic tool-call deduplication — collapsing identical repeated tool invocations within a session so a retried mutating call executes once — as a documented default. Retry-safety stays the tool author's responsibility via idempotency keys / unique constraints, and the harness's only built-in stays blunt refusal of destructive operations (v2.1.183-style). | 70% | 2027-Q1 | ● open | — |
| Dive · 2026-06-27 | No closed frontier lab (Anthropic, OpenAI, or Google) widens default-path logprob exposure beyond today's limits (Anthropic: none; OpenAI: top-20) for its flagship models — the dense soft-target leak stays closed, leaving black-box output imitation as the only available distillation route against closed frontier models. | 75% | 2027-Q1 | ● open | — |
| Dive · 2026-06-28 | DeepSeek's permanent V4-Pro price floor (~$0.44 input / $0.87 output per Mtok) does NOT ratchet materially back up (>25% on either leg) within two quarters — the open-weight-pinned floor proves structural, not a promo — AND no closed frontier lab (OpenAI/Anthropic) cuts its flagship API token price to within ~2× of that floor over the same window; they hold a premium and segment to capability instead. | 65% | 2027-Q1 | ● open | — |
Source of truth:
_data/predictions.yml. Working memory:
editorial memory.