Every team I talk to says the same thing when an agent does something expensive: “we have logs.” They do. The logs are detailed, structured, searchable, and completely useless the moment the dispute becomes adversarial. Not because the data is wrong, but because nobody on the other side has any reason to believe it.
A log is a record you keep for yourself. Evidence is a record someone who distrusts you can still rely on. Those are different artifacts with different requirements, and most observability stacks only build the first one.
What makes a record evidence
Strip away the tooling and there are three properties that separate a log from evidence:
- Tamper-evidence. If a record is altered after the fact, that alteration is detectable. Not “unlikely” — detectable, by math, by anyone.
- Independent verifiability. Someone with no account on your system, no access to your database, and no trust in your company can confirm the record is intact. They don’t have to take your dashboard’s word for it.
- Binding. The record is cryptographically tied to a specific identity, a specific moment, and a specific payload. You can’t quietly swap which model, which tenant, or which timestamp it belongs to.
Ordinary logs fail all three. A row in a database can be updated. A timestamp can drift or be set. A single compromised service account — or one well-meaning engineer running a cleanup script — can rewrite history, and the rewritten version looks exactly as authoritative as the original. There is no seam. That’s the whole problem: a mutable record has no way to prove it wasn’t mutated.
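To make “detectable, by math” concrete, here is a minimal sketch using only Python’s standard library. It assumes the record’s digest was handed to the other party when the decision was made; the field names and values are hypothetical, and this is not Marturia’s receipt format. A bare digest is only the first step, since it proves nothing unless someone other than you is holding it.

```python
import hashlib
import json


def record_digest(record: dict) -> str:
    """Return the SHA-256 hex digest of a canonical JSON encoding of the record."""
    # Sorted keys and fixed separators make the encoding deterministic,
    # so the same logical record always produces the same digest.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical decision record; every field name here is illustrative.
original = {
    "tenant": "acme",
    "model": "model-a",
    "timestamp": "2025-01-01T00:00:00Z",
    "action": "refund",
    "amount_cents": 4200,
}

# Assumption: this digest left your system at decision time.
committed = record_digest(original)

# Later, a cleanup script quietly changes one field in the database row.
edited = dict(original, amount_cents=420)

original_still_matches = record_digest(original) == committed
edit_is_detected = record_digest(edited) != committed
```

The database row looks equally authoritative before and after the edit. The digest is what gives the record a seam: anyone holding `committed` can recompute it and see that the edited row no longer matches.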
This isn’t a knock on observability tools. Datadog, Langfuse, Honeycomb, and the rest are excellent at the job they’re built for: helping you understand your system. That job assumes you trust the data because it’s yours. Evidence is for the case where trust is exactly what’s missing.
Why this gets worse with agents
When software only suggested things, a bad record was a debugging annoyance. When software acts — refunds a customer, cancels a subscription, files a claim, moves money — a bad record is a liability you can’t discharge. The action already happened in the real world. The only thing left to argue about is what was decided, by whom, and when. That argument is won or lost entirely on the quality of your record.
And the counterparty in that argument — a customer, an auditor, a regulator, opposing counsel — has no reason to accept “here’s a screenshot of our dashboard.” They shouldn’t. A screenshot of a mutable system is worth exactly as much as the trust you’ve already established, which in a dispute is approximately zero.
What evidence actually looks like
The fix isn’t a better log. It’s a different kind of artifact sitting alongside your logs at the few points that matter for disputes. At Marturia we make every recorded decision into a receipt that is:
- signed with an Ed25519 key the tenant controls,
- chained to the previous receipt, so nothing can be inserted, deleted, or reordered without breaking the chain,
- independently verifiable by anyone running a public command, with no Marturia account, and
- anchored — the chain head is periodically folded into a Merkle root and cosigned by witnesses, so even we can’t quietly rewrite the past.
| | Log | Receipt |
|---|---|---|
| Can be edited after the fact | Yes, silently | Not without detection |
| Verifiable by an outside party | No | Yes, no account needed |
| Bound to identity + time + payload | By convention | Cryptographically |
| Survives a compromised account | No | Yes |
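The “chained” property is the easiest one to see in code. What follows is a rough sketch of a generic hash chain, not Marturia’s actual receipt format: the `GENESIS` value, the entry layout, and the function names are all assumptions made for illustration, and signatures are left out to keep it standard-library only.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder meaning "no previous entry"


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash a payload together with the hash of the entry before it."""
    body = json.dumps({"prev": prev_hash, "payload": payload},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def build_chain(payloads):
    """Link each payload to its predecessor by hash."""
    chain, prev = [], GENESIS
    for payload in payloads:
        digest = entry_hash(prev, payload)
        chain.append({"prev": prev, "payload": payload, "hash": digest})
        prev = digest
    return chain


def verify_chain(chain, expected_head=None) -> bool:
    """Recompute every link; optionally check the final hash against a known head."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev:
            return False  # something was inserted, deleted, or reordered
        if entry_hash(entry["prev"], entry["payload"]) != entry["hash"]:
            return False  # the payload was edited after the fact
        prev = entry["hash"]
    return expected_head is None or prev == expected_head


# Hypothetical decisions, for illustration only.
chain = build_chain([
    {"id": 1, "action": "refund"},
    {"id": 2, "action": "cancel"},
    {"id": 3, "action": "claim"},
])
head = chain[-1]["hash"]

reordered = [chain[1], chain[0], chain[2]]
deleted = [chain[0], chain[2]]
truncated = chain[:-1]
edited = [dict(entry) for entry in chain]
edited[1]["payload"] = {"id": 2, "action": "refund"}

results = {
    "intact": verify_chain(chain, head),
    "reordered": verify_chain(reordered, head),
    "deleted": verify_chain(deleted, head),
    "truncated": verify_chain(truncated, head),
    "edited": verify_chain(edited, head),
}
```

Only the intact chain verifies. Note that dropping entries from the tail is caught only because the verifier already knows the expected head, which is why the head needs to be anchored somewhere outside the chain itself.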
The verification is the part that matters most, so it runs without us:
pip install marturia-verify
marturia-verify --receipt receipt.json --pubkey-hex <tenant-public-key>
That command walks the hash chain, checks the signatures, and confirms the Merkle anchor. If a single byte changed anywhere along the way, it fails loudly. We could disappear tomorrow and the receipts a customer already holds would still verify.
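For readers who want a feel for what “confirms the Merkle anchor” involves, here is a sketch of a generic Merkle root and inclusion proof. It is not the `marturia-verify` implementation: the tree layout, the odd-node handling, and every function name below are assumptions, the 0x00 and 0x01 prefixes are borrowed from the style used in RFC 6962, and witness cosigning is not shown.

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    # The 0x00 prefix keeps leaf hashes distinct from interior node hashes.
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def _next_level(level):
    """Pair up neighbours; an unpaired last node is promoted unchanged."""
    nxt = []
    for i in range(0, len(level), 2):
        if i + 1 < len(level):
            nxt.append(node_hash(level[i], level[i + 1]))
        else:
            nxt.append(level[i])
    return nxt


def merkle_root(leaves) -> bytes:
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [leaf_hash(x) for x in leaves]
    while len(level) > 1:
        level = _next_level(level)
    return level[0]


def inclusion_proof(leaves, index):
    """Collect the sibling hashes on the path from one leaf up to the root."""
    proof = []
    level = [leaf_hash(x) for x in leaves]
    while len(level) > 1:
        sibling = index ^ 1
        if sibling < len(level):
            # Record the sibling and whether it sits to the left of our node.
            proof.append((level[sibling], sibling < index))
        level = _next_level(level)
        index //= 2
    return proof


def verify_inclusion(leaf: bytes, proof, root: bytes) -> bool:
    """Rebuild the root from one leaf and its proof, then compare."""
    current = leaf_hash(leaf)
    for sibling, sibling_is_left in proof:
        if sibling_is_left:
            current = node_hash(sibling, current)
        else:
            current = node_hash(current, sibling)
    return current == root


# Hypothetical chain heads collected over one anchoring period.
heads = [b"head-1", b"head-2", b"head-3", b"head-4", b"head-5"]
root = merkle_root(heads)
proof = inclusion_proof(heads, 2)

included = verify_inclusion(heads[2], proof, root)
tampered = verify_inclusion(b"head-3-edited", proof, root)
```

The verifier needs only one leaf, a handful of sibling hashes, and the published root. It never needs the database the leaves came from, which is what lets the check run without the party that issued the receipt.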
Keep your logs
None of this replaces observability. Keep your traces, keep your dashboards, keep debugging the way you always have. Logs answer “why was this slow?” and “what did the model see?” — questions you ask yourself. Receipts answer “can you prove this decision happened and hasn’t been altered?” — the question someone else asks you, usually on the worst possible day.
Emit a receipt at the points where a dispute would actually hurt. Leave everything else exactly as it is. The day the question arrives, you’ll be glad the answer is math instead of a screenshot.
Closed beta is open. You can start emitting receipts in about fifteen minutes:
- pip install marturia-verify for the public verifier
- /docs/quickstart.html for the agent-side integration
- /guides/marturia-vs-langsmith-vs-sigstore.html if you’re comparing tools