Case study: recovery of a corrupted 12 TB multi-device pool

Incident summary
A hard power cycle on a three-device Btrfs pool (data=single, metadata=dup, DM-SMR disks) reportedly left the extent tree and free space tree in a state that no native repair could resolve. Worse, a subsequent btrfs check --repair run reportedly spun into an apparently infinite loop of more than 46,000 commits with no net progress, rotating the backup_root slots past any useful pre-crash rollback point. The pool had about 12 TB of raw capacity and held roughly 4.59 TB of user data, so any further recovery attempt was risky.
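To see why tens of thousands of commits leave no pre-crash rollback point, it helps to model the backup roots as a small ring buffer. The sketch below is a simplified illustration, not btrfs-progs code: it assumes the superblock keeps four backup root slots and that each commit overwrites the oldest one. The crash generation of 1000 and the function names are hypothetical.

```python
from collections import deque

# Assumption: the superblock holds 4 backup root slots, reused in rotation.
NUM_BACKUP_ROOTS = 4


def backup_generations(crash_generation, commits_after_crash, slots=NUM_BACKUP_ROOTS):
    """Return the generations held in the backup slots after the given commits."""
    ring = deque(maxlen=slots)  # appending to a full deque drops the oldest entry
    # Slots as they stood at the crash: the last `slots` pre-crash generations.
    for generation in range(crash_generation - slots + 1, crash_generation + 1):
        ring.append(generation)
    # Every later commit (for example from a looping repair run) takes a slot.
    for generation in range(crash_generation + 1,
                            crash_generation + commits_after_crash + 1):
        ring.append(generation)
    return list(ring)


def pre_crash_survivors(crash_generation, commits_after_crash, slots=NUM_BACKUP_ROOTS):
    """Return the backup generations that still predate or equal the crash."""
    held = backup_generations(crash_generation, commits_after_crash, slots)
    return [g for g in held if g <= crash_generation]


if __name__ == "__main__":
    crash = 1000  # hypothetical generation number at the time of the power cycle
    for commits in (0, 1, 3, 4, 46_000):
        print(commits, pre_crash_survivors(crash, commits))
```

Under these assumptions, every pre-crash generation is gone after only four post-crash commits, so a loop of 46,000 commits overshoots the rollback window by several orders of magnitude.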
Recovery and results
Recovery reportedly succeeded after the author built 14 custom C tools against the internal btrfs-progs API and applied a single-line patch to alloc_reserved_tree_block. The tools are published under GPL-2.0 and run in read-only scan mode by default; writing is opt-in. The final reported data loss was about 7.2 MB out of 4.59 TB (roughly 0.00016 percent), and the author states that the pool is fully operational again. The full incident analysis and toolset are posted on GitHub for scrutiny and reuse (incident writeup: https://github.com/msedek/btrfs_fixes/blob/main/INCIDENT-ANALYSIS.md; tools and discussion: https://github.com/kdave/btrfs-progs/issues/1107).
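The loss figure can be checked with a few lines of arithmetic. This is only a sanity check on the reported numbers, assuming decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes); the function name is illustrative.

```python
# Assumption: decimal units, as drive vendors use them.
MB = 10**6
TB = 10**12


def loss_percent(lost_bytes, total_bytes):
    """Return the lost share of the data as a percentage."""
    return 100.0 * lost_bytes / total_bytes


if __name__ == "__main__":
    lost = 7.2 * MB    # reported data loss
    total = 4.59 * TB  # reported user data on the pool
    print(f"{loss_percent(lost, total):.5f} percent")  # prints "0.00016 percent"
```

With binary units (MiB and TiB) the ratio comes out slightly lower, at about 0.00015 percent, so the rounded figure depends on which convention the report used.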
Recommendations and next steps
The author distilled nine targeted improvement areas, ranked by expected impact, that might have made most of the custom work unnecessary. Rather than shotgunning nine PRs, they published a reference implementation and invited collaboration, offering logs and tests on request. That leaves a question for maintainers and operators: should these fixes be upstreamed through design discussions, or left as a cautionary toolbox for power-cycle nightmares? Either way, the case is a useful roadmap: a narrow win after a long technical slog, one that underscores both Btrfs's resilience and the deep expertise required when things go sideways.
Sources: github.com/kdave, Hacker News