Show HN: Pseudonymizing sensitive data for LLMs without losing context

April 15, 2026

The problem

A small team building a "Ghost Analyst" on Anthropic’s Claude to triage Microsoft Sentinel and Defender alerts ran into a familiar corporate headache: useful triage data contains client IPs, usernames and internal hostnames you don’t want floating across the cloud. The team reportedly considered local models but found that open-source alternatives couldn’t match Claude Opus for the kind of reasoning they needed. The compromise? A Data Loss Prevention proxy that sits between the agent and Anthropic, pseudonymizing data on the way out and restoring it on the way back, so the LLM never sees real data and the analyst never sees fake data.
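The round trip described above can be sketched in a few lines. This is a minimal illustration, not the team's code: the PseudonymVault class and its methods are hypothetical names, it handles only IPv4 addresses, and it draws stand-ins from the 198.51.100.0/24 documentation range as an arbitrary choice.

```python
import re

# Simple IPv4 matcher; it does not validate octet ranges.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


class PseudonymVault:
    """Hypothetical helper: keeps a two-way map between real IPs and stand-ins."""

    def __init__(self):
        self.real_to_fake = {}
        self.fake_to_real = {}

    def _fake_ip(self, real):
        # Reuse the same stand-in every time the same real IP appears.
        if real not in self.real_to_fake:
            # Assumption: 198.51.100.0/24 (reserved for documentation) is
            # used for stand-ins, so this sketch runs out after 254 addresses.
            fake = f"198.51.100.{len(self.real_to_fake) + 1}"
            self.real_to_fake[real] = fake
            self.fake_to_real[fake] = real
        return self.real_to_fake[real]

    def pseudonymize(self, text):
        """Outbound: swap every real IP for its stand-in."""
        return IP_RE.sub(lambda m: self._fake_ip(m.group(0)), text)

    def restore(self, text):
        """Inbound: swap known stand-ins back; leave anything else alone."""
        return IP_RE.sub(
            lambda m: self.fake_to_real.get(m.group(0), m.group(0)), text
        )


vault = PseudonymVault()
outbound = vault.pseudonymize(
    "Failed logins from 10.1.2.3, then 10.1.2.3 and 172.16.0.9"
)
print(outbound)
# A made-up model reply that mentions a stand-in address.
print(vault.restore("Investigate 198.51.100.2 first"))
```

Because the mapping is stable within a session, the model sees the same stand-in for the same address each time, and the analyst gets the real address back in the reply.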

The iterations

The first pass was painfully naive. A regex replaced emails with bracketed tags, and Claude allegedly invented a user named "Sarah Kowalski" to smooth over the unnatural-looking input, producing fan fiction instead of a triage query. Fragmentation followed: partial matches tripped up the proxy and produced empty queries. Prompt engineering might have tamed the model, but the goal was transparent middleware, not a library of hacks.
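A rough reconstruction of that first pass might look like the sketch below. The function name, the email pattern and the sample alert are all assumptions, and the fragmentation it shows (an email tagged while the matching bare username passes through untouched) is only one plausible reading of the partial-match problem.

```python
import re

# Assumed email pattern; the original regex is not published in the article.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def naive_redact(text):
    """Replace each distinct email with a bracketed tag like [User_1]."""
    tags = {}

    def tag(match):
        # setdefault keeps the first tag assigned to a given email.
        return tags.setdefault(match.group(0), f"[User_{len(tags) + 1}]")

    return EMAIL_RE.sub(tag, text), tags


alert = "Sign-in by jdoe@example.com; process started by jdoe on WS-17"
redacted, tags = naive_redact(alert)
print(redacted)
# The email is tagged, but the bare username "jdoe" is still in the clear,
# and nothing tells the model that it is the same identity as [User_1].
```

The bracketed tag is also not a valid email address, so any query syntax the model builds around it starts from input that looks nothing like real telemetry.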

So they evolved the proxy. Lightweight NER (spaCy) joined the regexes, and replacements became syntactically valid pseudonyms: user_account_001@domain-internal-001.com instead of [User_1]. The proxy also registered both full emails and bare usernames to avoid fragmentation. Progress, yes, but new problems popped up. Masking erased signals: "impossible travel" patterns and typosquatting indicators disappeared when IPs and domains became generic placeholders. How do you hide the patient but not the symptoms?
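The "register both forms" idea can be sketched as follows. This is an illustration under assumptions, not the project's implementation: IdentityMap and _swap are hypothetical names, only the regex half is shown (no spaCy), and matching is case-sensitive for brevity. The pseudonym format follows the example quoted above.

```python
import re

# Assumed email pattern with separate groups for user and domain.
EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")


def _swap(text, mapping):
    """Replace every key of `mapping` in a single pass, longest key first."""
    if not mapping:
        return text
    keys = sorted(mapping, key=len, reverse=True)
    # Lookarounds stop a key from matching inside a longer identifier.
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])(?:"
        + "|".join(re.escape(k) for k in keys)
        + r")(?![A-Za-z0-9_])"
    )
    return pattern.sub(lambda m: mapping[m.group(0)], text)


class IdentityMap:
    """Hypothetical registry mapping emails and usernames to pseudonyms."""

    def __init__(self):
        self.real_to_fake = {}
        self._users = {}
        self._domains = {}

    def _register(self, user, domain):
        fake_user = self._users.setdefault(
            user, f"user_account_{len(self._users) + 1:03d}"
        )
        fake_domain = self._domains.setdefault(
            domain, f"domain-internal-{len(self._domains) + 1:03d}.com"
        )
        # Register the full email AND the bare username together.
        self.real_to_fake[f"{user}@{domain}"] = f"{fake_user}@{fake_domain}"
        self.real_to_fake[user] = fake_user

    def pseudonymize(self, text):
        for user, domain in EMAIL_RE.findall(text):
            self._register(user, domain)
        return _swap(text, self.real_to_fake)

    def restore(self, text):
        fake_to_real = {fake: real for real, fake in self.real_to_fake.items()}
        return _swap(text, fake_to_real)


ids = IdentityMap()
alert = "Sign-in by jdoe@example.com; process started by jdoe on WS-17"
print(ids.pseudonymize(alert))
print(ids.restore("UserPrincipalName == 'user_account_001@domain-internal-001.com'"))
```

Sorting keys longest first means the full email is replaced before the bare username can match inside it. One caveat of this sketch: a username that is also a common word (say, "admin") would be replaced everywhere it appears, which is the kind of case where an NER pass could help decide.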

Why it matters

The team reportedly went through three iterations to turn a regex eraser into a context-aware translator, and is open-sourcing the result. This is part of a larger trend: companies want the reasoning power of frontier models without handing over sensitive telemetry. The underlying tension is plain: fear of leakage versus hunger for automation. Will proxy layers like this become the standard duct tape between enterprise telemetry and cloud LLMs? Maybe. If nothing else, this project shows the problem is trickier than simple redaction.

Sources: atticsecurity.com, Hacker News