Every team I talk to says the same thing when an agent does something expensive: “we have logs.” They do. The logs are detailed, structured, searchable, and completely useless the moment the dispute becomes adversarial. Not because the data is wrong, but because nobody on the other side has any reason to believe it.

A log is a record you keep for yourself. Evidence is a record someone who distrusts you can still rely on. Those are different artifacts with different requirements, and most observability stacks only build the first one.

What makes a record evidence

Strip away the tooling and there are three properties that separate a log from evidence:

Ordinary logs fail all three. A row in a database can be updated. A timestamp can drift or be set. A single compromised service account — or one well-meaning engineer running a cleanup script — can rewrite history, and the rewritten version looks exactly as authoritative as the original. There is no seam. That’s the whole problem: a mutable record has no way to prove it wasn’t mutated.

This isn’t a knock on observability tools. Datadog, Langfuse, Honeycomb, and the rest are excellent at the job they’re built for: helping you understand your system. That job assumes you trust the data because it’s yours. Evidence is for the case where trust is exactly what’s missing.

Why this gets worse with agents

When software only suggested things, a bad record was a debugging annoyance. When software acts — refunds a customer, cancels a subscription, files a claim, moves money — a bad record is a liability you can’t discharge. The action already happened in the real world. The only thing left to argue about is what was decided, by whom, and when. That argument is won or lost entirely on the quality of your record.

And the counterparty in that argument — a customer, an auditor, a regulator, opposing counsel — has no reason to accept “here’s a screenshot of our dashboard.” They shouldn’t. A screenshot of a mutable system is worth exactly as much as the trust you’ve already established, which in a dispute is approximately zero.

What evidence actually looks like

The fix isn’t a better log. It’s a different kind of artifact sitting alongside your logs at the few points that matter for disputes. At Marturia we make every recorded decision into a receipt that is:

| Property | Log | Receipt |
| --- | --- | --- |
| Can be edited after the fact | Yes, silently | Not without detection |
| Verifiable by an outside party | No | Yes, no account needed |
| Bound to identity + time + payload | By convention | Cryptographically |
| Survives a compromised account | No | Yes |
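
To make "not without detection" concrete, here is a minimal Python sketch of a hash-chained record. It is an illustration, not Marturia's receipt format: the field names (`actor`, `timestamp`, `payload`, `prev_hash`), the `GENESIS` placeholder, and the helpers `make_receipt` and `verify_chain` are all hypothetical. A real receipt would also carry a signature, which this standard-library sketch leaves out.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first record


def receipt_hash(body: dict) -> str:
    """Hash a canonical JSON encoding so identical fields always hash identically."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_receipt(actor: str, timestamp: str, payload: dict, prev_hash: str) -> dict:
    """Bind identity, time, and payload to the previous record's hash."""
    body = {
        "actor": actor,
        "timestamp": timestamp,
        "payload": payload,
        "prev_hash": prev_hash,
    }
    return {**body, "hash": receipt_hash(body)}


def verify_chain(receipts: list) -> bool:
    """Recompute every hash and check that each record points at its predecessor."""
    prev = GENESIS
    for r in receipts:
        body = {k: v for k, v in r.items() if k != "hash"}
        if r["prev_hash"] != prev or receipt_hash(body) != r["hash"]:
            return False
        prev = r["hash"]
    return True


# Hypothetical decisions, for illustration only.
decisions = [
    ("agent:refunds", "2024-01-01T12:00:00Z", {"action": "refund", "amount": 40}),
    ("agent:refunds", "2024-01-01T12:05:00Z", {"action": "cancel_subscription"}),
]

chain = []
prev = GENESIS
for actor, ts, payload in decisions:
    receipt = make_receipt(actor, ts, payload, prev)
    chain.append(receipt)
    prev = receipt["hash"]

print(verify_chain(chain))  # True

chain[0]["payload"]["amount"] = 4000  # a silent in-place edit
print(verify_chain(chain))  # False
```

A hash chain on its own only proves internal consistency. Someone who controls the entire store could recompute every hash after an edit, which is why the chain also needs signatures and an anchor held outside the system.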

The verification is the part that matters most, so it runs without us:

pip install marturia-verify
marturia-verify --receipt receipt.json --pubkey-hex <tenant-public-key>

That command walks the hash chain, checks the signatures, and confirms the Merkle anchor. If a single byte changed anywhere along the way, it fails loudly. We could disappear tomorrow and the receipts a customer already holds would still verify.
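
The Merkle anchor step can be illustrated with an inclusion proof. The sketch below is a generic construction, not the logic inside `marturia-verify`. The tree layout (SHA-256, last node duplicated on odd levels) and the names `merkle_root`, `merkle_proof`, and `verify_inclusion` are assumptions chosen for illustration. The leaf and node prefixes follow the domain-separation idea used in RFC 6962, though the tree shape here is simpler than that specification's.

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    # A distinct prefix keeps a leaf from being mistaken for an interior node.
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def _pad(level: list) -> list:
    # Simplification: duplicate the last node when a level has odd length.
    return level + [level[-1]] if len(level) % 2 == 1 else level


def merkle_root(leaves: list) -> bytes:
    """Compute the root over a non-empty list of leaf byte strings."""
    level = [leaf_hash(leaf) for leaf in leaves]
    while len(level) > 1:
        level = _pad(level)
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def merkle_proof(leaves: list, index: int) -> list:
    """Collect (sibling_hash, sibling_is_left) pairs from the leaf up to the root."""
    level = [leaf_hash(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        level = _pad(level)
        sibling = index ^ 1  # the neighbour within the same pair
        proof.append((level[sibling], sibling < index))
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof


def verify_inclusion(leaf: bytes, proof: list, root: bytes) -> bool:
    """Rebuild the path to the root using only the leaf and its sibling hashes."""
    node = leaf_hash(leaf)
    for sibling, sibling_is_left in proof:
        node = node_hash(sibling, node) if sibling_is_left else node_hash(node, sibling)
    return node == root


receipts = [b"receipt-0", b"receipt-1", b"receipt-2"]
root = merkle_root(receipts)
proof = merkle_proof(receipts, 2)

print(verify_inclusion(b"receipt-2", proof, root))   # True
print(verify_inclusion(b"receipt-2x", proof, root))  # False
```

The point of the design is that the verifier needs only the receipt, a handful of sibling hashes, and a root published somewhere the issuer cannot quietly rewrite. It never needs access to the issuer's database.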

Keep your logs

None of this replaces observability. Keep your traces, keep your dashboards, keep debugging the way you always have. Logs answer “why was this slow?” and “what did the model see?” — questions you ask yourself. Receipts answer “can you prove this decision happened and hasn’t been altered?” — the question someone else asks you, usually on the worst possible day.

Emit a receipt at the points where a dispute would actually hurt. Leave everything else exactly as it is. The day the question arrives, you’ll be glad the answer is math instead of a screenshot.

Closed beta is open. You can start emitting receipts in about fifteen minutes.

Related Marturia resources

- /guides/marturia-vs-langsmith-vs-sigstore.html
- /docs/quickstart.html
- /docs/