Case Studies · May 6, 2026 · 8 min read

Ninety Days With an Autonomous Agent in Production

Marcus Rivera

Principal Engineer

@marcusbuilds
#mcp #kavrynos #case-study #autonomous-agents #engineering-management

We turned on the [MCP autonomous agent](/products/kavrynos#mcp) in our own engineering org on February 4th. Today is May 6th. The agent has been running continuously against our JIRA backlog for ninety days. It has drafted 1,247 ticket replies. We have approved 71% of them as written, edited 22%, and rejected 7%.

This is the longest production run of an autonomous agent against a real engineering workflow that I have personally been close to. I want to share what changed, what surprised me, what I would not do again, and the specific places where the agent has earned trust versus where it has not.

The Setup

The agent runs against three JIRA projects: payments, billing, and the internal tools project. The configuration is intentionally narrow:

  • A JQL filter capturing tickets with no assigned engineer.
  • A response team list of nine engineers — anyone on this list automatically opts the ticket out of agent drafting.
  • A 5-minute polling interval.
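
Concretely, the configuration amounts to something like this (field names and project keys below are illustrative, not the literal KavrynOS schema):

```python
# Illustrative only: field names and project keys are invented,
# not KavrynOS's actual configuration schema.
AGENT_CONFIG = {
    # JQL: unassigned tickets across the three in-scope projects
    "jql_filter": "project in (PAY, BILL, TOOLS) AND assignee is EMPTY",
    # Response team: any ticket one of these engineers is already
    # handling is automatically opted out of agent drafting
    "response_team": ["a.chen", "d.okafor"],  # nine usernames in practice
    # Polling interval
    "poll_interval_seconds": 300,
}
```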

The agent reads each new ticket, runs it through the product knowledge base — generated by Claude Code across our 31 repos and 4 databases — and drafts a contextual reply. Drafts land in an approval queue. A human (usually the on-duty engineering manager that day) reviews and approves, edits, or rejects.

This is the entire loop. There is no auto-posting. There is no creative behavior. The agent has exactly one thing it does: write a draft reply that a human will decide whether to send.
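
In code terms, the loop is small enough to sketch in full. The stub functions below stand in for the JQL poll and the KB-plus-model call; none of the names are the product's real API:

```python
import queue
import time
from dataclasses import dataclass, field

POLL_SECONDS = 300  # the 5-minute polling interval


@dataclass
class Ticket:
    key: str
    comment_authors: list[str] = field(default_factory=list)


RESPONSE_TEAM = {"a.chen", "d.okafor"}       # invented usernames; nine in practice
approval_queue: queue.Queue = queue.Queue()  # a human drains this queue


def fetch_unassigned_tickets() -> list[Ticket]:
    """Stand-in for the JQL poll across the three projects."""
    return []


def draft_reply(ticket: Ticket) -> str:
    """Stand-in for the KB lookup plus the model call."""
    return f"(draft for {ticket.key})"


def run_once() -> None:
    for ticket in fetch_unassigned_tickets():
        # Response team opt-out: skip tickets a human is already handling
        if any(a in RESPONSE_TEAM for a in ticket.comment_authors):
            continue
        approval_queue.put((ticket.key, draft_reply(ticket)))


if __name__ == "__main__":
    while True:  # poll, draft, enqueue; no auto-posting, ever
        run_once()
        time.sleep(POLL_SECONDS)
```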

Week 1–2: Calibration

The first two weeks were rough. About half the drafts were unusable.

The failure modes were predictable. The agent was generating replies that sounded slightly off — too formal, with phrases like "we appreciate you bringing this to our attention." We do not talk like that. The agent had picked up the tone from its training data and from our JIRA history, and it skewed toward customer-support flavor rather than engineering-team flavor.

We adjusted the drafting prompt twice in the first week. The change that mattered most was three lines:

  • "Match the team's existing tone. Read the last 5 comments on the ticket to calibrate."
  • "Use 'we' for the team. Avoid 'reach out,' 'circle back,' 'appreciate.'"
  • "If the ticket lacks reproduction steps, ask for them. Do not speculate."

By the end of week 2, the unusable rate dropped from 50% to about 15%. The same agent. The same model. Three lines of prompt adjustment.

This is the work that does not get demoed. You can ship an agent in an afternoon. Calibrating an agent for your team's voice takes two weeks and the willingness to write down the tone guide that your team has, until that point, held only implicitly.

Week 3–6: Trust Build-up

Once the drafts read like our team, the question shifted from "is this an okay reply" to "is this the right thing to say."

This is where the knowledge base earned its keep. About 60% of incoming tickets touch a small set of patterns: "this feature does not work," "this customer's data looks wrong," "we need an export of X," "is feature Y on the roadmap." For all of these, the agent reads the relevant repo's docs, the product KB, and the recent ticket history, and drafts a reply that is — measurably — more contextual than what a human triage manager would produce in the first sixty seconds.

The specific places the agent earned trust during weeks 3–6:

Duplicate detection. Of the 1,247 drafts, 86 correctly identified the ticket as a duplicate of something filed in the last 90 days. Manual duplicate detection is the most thankless part of triage; humans miss it constantly. The agent does not.
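
Part of why the agent wins here is that duplicate detection is, at bottom, a similarity search over a rolling window: cheap for software, tedious for people. A deliberately naive sketch of that shape, with lexical similarity standing in where a real system would use embeddings:

```python
from difflib import SequenceMatcher


def likely_duplicates(
    new_summary: str,
    recent: list[tuple[str, str]],  # (ticket key, summary) from the last 90 days
    threshold: float = 0.75,
) -> list[str]:
    """Naive lexical similarity; embeddings would do better, but the
    shape (compare each new ticket against a rolling window) is the same."""
    return [
        key
        for key, summary in recent
        if SequenceMatcher(None, new_summary.lower(), summary.lower()).ratio()
        >= threshold
    ]
```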

Repo routing. The agent's drafts started naming the right repo for the issue. Not just "this is a backend problem" — "this lives in payments-api, in the refund handler module." That kind of specificity is what allows the manager to assign the ticket in one click rather than guessing.

Linking docs. Tickets that asked questions whose answers were in our auto-generated CLAUDE.md got drafts that linked to the specific section of the doc. This sounds trivial; it is not. Documentation is read at a much higher rate when something inside a workflow puts a link to it in front of the reader.

Asking for missing info. A third of inbound tickets had insufficient detail. The agent's drafts politely asked for the missing information — org ID, browser, exact URL — without sounding like a template. Customers replied at higher rates than they did to our previous canned responses.
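
One deterministic way to drive this is to pre-compute what the ticket is missing before drafting, so the prompt can name exactly what to ask for. A sketch, with the patterns invented for illustration:

```python
import re

# Patterns invented for illustration; a real checklist would be tuned
# to the product's actual required fields.
REQUIRED_FIELDS = {
    "org ID": re.compile(r"\borg[\s_-]?id\b", re.IGNORECASE),
    "browser": re.compile(r"\b(chrome|firefox|safari|edge)\b", re.IGNORECASE),
    "exact URL": re.compile(r"https?://\S+"),
}


def missing_fields(ticket_text: str) -> list[str]:
    """Field names the draft should ask for, because the ticket lacks them."""
    return [name for name, pattern in REQUIRED_FIELDS.items()
            if not pattern.search(ticket_text)]
```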

By week six, the on-duty manager was approving drafts faster than they could read them, because the drafts were predictably good for these patterns.

Week 7–12: The Edge Cases

The interesting weeks were the ones where the agent was not earning trust — and where we learned to recognize the patterns where the human had to intervene.

Political tickets. A customer success manager filed a ticket on behalf of a customer who was visibly panicking about a contract renewal. The agent's draft was technically correct and emotionally tone-deaf. We learned to identify these tickets — usually flagged by the CS manager's username — and route them to a human-only queue.

Cross-team blame. A ticket that said "this is broken because the data team did X" required a careful response that did not throw the data team under the bus. The agent's draft did not throw anyone under the bus, but it also did not navigate the politics. Humans took over.

Anything with the word "urgent." Tickets with "urgent" in the title get handled by humans, full stop. Even when the agent's draft is fine, a panicked human reading the response wants to feel that another panicked human read the ticket. The agent does not communicate panic. We do not want it to.

Prior commitments we had forgotten. The agent does not know what we promised in a meeting last quarter. When a ticket referenced a prior commitment, the agent's draft was confidently wrong. Humans intervene; we update the KB when this happens.
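
The first three of these became hard routing rules: tickets that match never reach the agent's queue at all. A sketch with every name invented (the fourth pattern, forgotten commitments, cannot be caught by a rule; it is handled by the KB updates described above):

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    summary: str
    description: str
    reporter: str


CS_MANAGERS = {"cs.jordan", "cs.priya"}       # invented usernames
OTHER_TEAMS = ("data team", "platform team")  # invented team names


def human_only(t: Ticket) -> bool:
    """True if the ticket must never reach the agent's drafting queue."""
    return (
        "urgent" in t.summary.lower()          # urgency: humans, full stop
        or t.reporter in CS_MANAGERS           # political / renewal tickets
        or any(team in t.description.lower()   # cross-team blame
               for team in OTHER_TEAMS)
    )
```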

The pattern across all of these is that the agent is bad at things that require context outside the codebase. Customer relationship history, internal politics, prior verbal commitments, urgency signaling — these all live in human heads and conversations. The agent does not see them. It should not be asked to.

The 7% Rejection Rate

Of the 1,247 drafts, 87 were rejected entirely. Reading those 87 in aggregate is the most useful exercise I did in the 90 days. The patterns:

About 30 of them were the agent confidently saying "this is fixed in the next release" when it was not. The agent had read the release notes for a planned release and assumed the fix would ship — but the planned release had been pushed back twice. The KB had not been updated. The fix is to update the KB; the agent only knows what it reads.

About 20 were drafts that asked for information the customer had already provided in an earlier comment. The agent read the ticket but did not weight the comments correctly. We adjusted the prompt to read the most recent five comments first; the rate of these dropped.

About 15 were drafts where the agent guessed at the cause when it should have said "we will investigate." Engineering tickets often need the answer "let us look into it" not "I have already diagnosed it from your bug report." We added an explicit rule: do not propose a cause unless the KB or the comment history supports it.
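
Both of those fixes were, again, prompt lines rather than code. Roughly (wording reconstructed from the descriptions above, not a verbatim dump):

```python
# Reconstructed wording of the two rules added after the rejection review.
GROUNDING_RULES = (
    "Read the most recent 5 comments first. Never ask for information "
    "already present in them.\n"
    "Do not propose a cause unless the knowledge base or the comment "
    "history supports it. Otherwise say the team will investigate.\n"
)
```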

The remaining 22 were genuine misses — drafts that were just wrong, in ways that did not generalize. We rejected them and moved on.

What Surprised Me

The biggest surprise after 90 days is how little I think about the agent.

Most automation that promises to "save time" requires babysitting. You spend the morning watching it run. You jump in when it fails. The maintenance overhead eats half of what the automation was supposed to save.

The MCP agent is closer to a junior teammate who has been on the team for six months — competent on the patterns, occasionally wrong, asks good clarifying questions. After ninety days I check the approval queue twice a day and approve in batches. The cognitive overhead is roughly five minutes a day.

That five-minutes-a-day cost replaces what used to be roughly thirty minutes of morning triage. Across our team that is about five engineering hours a week reclaimed. We did not have a triage budget; the work just happened in the cracks of the day. The cracks are now bigger and the work that fills them is more substantive.

What I Would Not Do Again

Two things, honestly.

I would not have turned on the agent for all three projects at once. The right path is to start with one project, calibrate for two weeks, then expand. We did all three at once and the calibration was three times harder than it needed to be — different tone conventions per project, different KB completeness per project, different escalation patterns.

I would not have skipped the response team list configuration. We initially left it empty, figuring the agent would draft on every ticket and we would just route. Wrong. The agent drafting on top of a teammate who was already replying created confusion. The response team list — keep the agent off tickets where a human is already engaging — was the single most important configuration we made.

The Recommendation

If you are considering an autonomous agent loop for your team, here is what I would tell you after 90 days:

Pick one project. Calibrate the prompt for two weeks. Do not auto-post; keep the human in the approval loop indefinitely. Update the KB whenever the agent confidently says something wrong. Watch for the specific patterns where the agent is structurally bad — politics, urgency, prior commitments — and route those to humans by rule.

If you do this, the agent earns its keep. If you skip the calibration and ship an autonomous loop on day one, the agent embarrasses you in front of customers and the project gets quietly killed.

Ninety days in, this has been the most leveraged piece of automation we have shipped. Not because the agent is brilliant. Because the loop — autonomous draft, human approval, audit log, KB feedback — is the right shape.

[See KavrynOS →](/products/kavrynos)
