AI Agent Customer Support

Sajawal Khan Sadozai

Last year, one of our FinTech clients came to us with a painful problem: their support team was drowning. 1,200+ tickets a day, 4-hour average response times, and a churn rate that was climbing because customers couldn't get answers fast enough. Six months later, their AI agent handles 90% of all incoming queries autonomously — with a customer satisfaction score higher than their human team ever achieved. Here's exactly how we built it.

The Problem We Were Actually Solving

Before writing a single line of code, we spent two weeks doing something most teams skip: deeply understanding what "customer support" actually meant for this client. We pulled 3 months of support tickets, tagged them by type, and built a frequency map.

What we found was striking. Out of 14 distinct query categories, just 4 accounted for 78% of all tickets:

Transaction status inquiries — "Where is my payment?" (31%)
Account verification questions — "Why is my account still pending?" (22%)
Fee explanations — "Why was I charged this amount?" (14%)
Password and access issues — "I can't log in" (11%)

This told us something important: most of the pain wasn't coming from complex, unique problems — it was coming from simple, repetitive ones that needed fast, accurate answers. That's exactly the kind of problem AI agents are built for.
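The frequency map itself is simple to reproduce. The sketch below is illustrative only: the category names, the `top_n_share` helper, and the 100-ticket sample are assumptions shaped to match the percentages above, not the client's actual data or tooling.

```python
from collections import Counter

def top_n_share(tags, n):
    """Share of all tickets covered by the n most frequent categories."""
    counts = Counter(tags)
    total = sum(counts.values())
    return sum(count for _, count in counts.most_common(n)) / total

# Hypothetical sample of 100 tagged tickets mirroring the distribution above.
sample_tags = (
    ["transaction_status"] * 31
    + ["account_verification"] * 22
    + ["fee_explanation"] * 14
    + ["password_access"] * 11
)
# The remaining 22 tickets are spread across 10 smaller (made-up) categories.
other_counts = [3, 3, 2, 2, 2, 2, 2, 2, 2, 2]
sample_tags += [f"other_{i}" for i, c in enumerate(other_counts) for _ in range(c)]

print(f"Distinct categories: {len(set(sample_tags))}")
print(f"Top 4 share: {top_n_share(sample_tags, 4):.0%}")
```

Sorting by frequency and reading off the cumulative share is what makes the "4 of 14 categories" pattern visible at a glance.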

Choosing the Right Architecture

The first major decision was architecture. There are roughly three approaches to building a support AI agent, and they have very different tradeoffs:

Pure LLM approach — You feed a large language model your documentation and let it answer. Fast to build, but hallucination risk is too high for a financial product where wrong information can cause real harm.
Rule-based chatbot — Decision trees and scripted responses. Zero hallucination risk, but brittle. Falls apart the moment a user asks something slightly off-script.
RAG (Retrieval-Augmented Generation) — The agent retrieves relevant context from a verified knowledge base before generating a response. Combines the fluency of LLMs with the accuracy of structured data. This is what we chose.

On top of RAG, we added a tool-use layer. For queries that required live data — like transaction status — the agent could call our client's internal APIs directly, look up real-time information, and return accurate answers instead of generic guidance. This was the difference between an agent that says "please contact your bank" and one that says "your transfer of $250 to James is currently being processed and will arrive by 3pm today."
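A minimal sketch of what such a tool-use layer might look like is shown below. Everything named here is hypothetical: `lookup_latest_transfer` stands in for the client's internal API, and the `TOOLS` registry and `answer` function are illustrative, not the production design.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Transfer:
    amount: float
    recipient: str
    status: str
    eta: str

def lookup_latest_transfer(user_id: str) -> Transfer:
    """Hypothetical stand-in for a call to the client's internal API."""
    fake_records = {"user-1": Transfer(250.0, "James", "being processed", "3pm today")}
    return fake_records[user_id]

# Registry mapping a detected intent to the tool that can supply live data.
TOOLS: Dict[str, Callable[[str], Transfer]] = {
    "transaction_status": lookup_latest_transfer,
}

def answer(intent: str, user_id: str) -> str:
    tool = TOOLS.get(intent)
    if tool is None:
        # No live data needed: fall back to knowledge-base retrieval.
        return "NEEDS_RAG"
    t = tool(user_id)
    return (f"Your transfer of ${t.amount:,.0f} to {t.recipient} is currently "
            f"{t.status} and will arrive by {t.eta}.")

print(answer("transaction_status", "user-1"))
```

Keeping the tools in an explicit registry means the agent can only call functions that have been deliberately exposed to it, which matters for a financial product.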

Building the Knowledge Base

A RAG system is only as good as the knowledge base it retrieves from. We spent three weeks building and structuring ours, and this was arguably the most important phase of the entire project.

We gathered content from five sources:

The client's existing help center articles (180 documents)
Internal SOPs used by the human support team
Annotated past tickets — particularly the highest-quality human responses
Product documentation and API references
Regulatory and compliance FAQs specific to FinTech

Each document was chunked, cleaned, and embedded using OpenAI's text-embedding-3-large model. We stored vectors in Pinecone with careful metadata tagging — by category, product area, and last-verified date — so retrieval could be filtered and ranked intelligently.

One thing we got wrong initially: chunk size. Our first pass used 512-token chunks, which often split concepts awkwardly mid-explanation. After testing, we settled on 256-token chunks with a 64-token overlap, so text that falls on a chunk boundary appears in both neighboring chunks. Retrieval precision improved significantly after this adjustment.
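To make the chunking settings concrete, here is a minimal sliding-window sketch using the 256-token size and 64-token overlap mentioned above. The `chunk_tokens` function is an illustration, and the token list is a placeholder: a real pipeline would take tokens from the embedding model's tokenizer.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=64):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

# Placeholder tokens; in practice these come from a tokenizer.
tokens = [f"tok{i}" for i in range(600)]
chunks = chunk_tokens(tokens)

print([len(c) for c in chunks])
# The tail of each chunk is repeated at the head of the next one.
print(chunks[0][-64:] == chunks[1][:64])
```

With these settings the window advances 192 tokens at a time, so every boundary region is embedded twice and a concept that straddles a boundary is still retrievable in one piece.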

Designing the Conversation Flow

Most AI agents fail not because of bad AI but because of bad flow design. The agent needs to understand user intent, manage context across a multi-turn conversation, know when it can answer confidently, and know when to escalate gracefully.

We designed four distinct response modes:

Direct answer — High-confidence retrieval, clear question. Agent responds directly with cited information from the knowledge base.
API-grounded answer — Query requires live data. Agent calls the relevant internal tool, retrieves real-time data, and constructs a personalized response.
Clarification request — Intent is ambiguous. Agent asks a focused clarifying question before attempting to answer, rather than guessing.
Graceful escalation — Query falls outside the agent's knowledge scope, or the user is clearly frustrated. Agent acknowledges, summarizes the conversation context, and hands off to a human agent with full context pre-loaded.

That last mode — graceful escalation — was something we spent disproportionate time on. The worst thing an AI agent can do is make a frustrated customer feel like they're hitting a wall. Our escalation flow was designed to feel like a warm handoff, not a rejection.
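The four modes can be pictured as a small routing function. This is a sketch under assumptions: the flag names and the order of the checks (escalation first, then clarification, then live data) are illustrative choices, not a description of the production logic.

```python
from enum import Enum

class Mode(Enum):
    DIRECT_ANSWER = "direct_answer"
    API_GROUNDED = "api_grounded_answer"
    CLARIFY = "clarification_request"
    ESCALATE = "graceful_escalation"

def choose_mode(*, in_scope, user_frustrated, intent_ambiguous, needs_live_data):
    """Pick a response mode from four boolean signals (hypothetical inputs)."""
    # Escalation is checked first so a frustrated user never gets stuck in a loop.
    if not in_scope or user_frustrated:
        return Mode.ESCALATE
    if intent_ambiguous:
        return Mode.CLARIFY
    if needs_live_data:
        return Mode.API_GROUNDED
    return Mode.DIRECT_ANSWER

mode = choose_mode(in_scope=True, user_frustrated=False,
                   intent_ambiguous=False, needs_live_data=True)
print(mode.value)
```

Putting the escalation check ahead of everything else reflects the priority described above: the handoff path should win whenever it applies.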

The Confidence Scoring System

One of our most important engineering decisions was building a confidence scoring layer on top of the LLM responses. We couldn't just let the model answer everything — we needed a way to programmatically decide when the agent was "sure enough" to respond and when it should escalate.

Our scoring system looked at three signals:

Retrieval similarity score — How closely did the retrieved chunks match the query? Below 0.72 cosine similarity, we flagged the answer as low-confidence.
Response consistency check — We ran the same query twice with slight temperature variation. If responses diverged significantly, we treated the answer as uncertain.
Semantic entailment check — A lightweight classifier verified that the generated response was actually entailed by the retrieved context, catching hallucinations before they reached the user.

Responses that failed any of these checks were automatically routed to human review rather than sent to the user. Initially this caught about 18% of responses. After two weeks of improving the knowledge base based on those failures, it dropped to under 4%.
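A simplified sketch of this gate follows. The 0.72 threshold comes from the description above; the `Signals` structure and function names are assumptions, and the consistency and entailment results are treated as precomputed booleans rather than implemented here.

```python
import math
from dataclasses import dataclass

SIMILARITY_THRESHOLD = 0.72  # cosine similarity cutoff stated in the text

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

@dataclass
class Signals:
    retrieval_similarity: float  # best query-to-chunk cosine similarity
    responses_consistent: bool   # did the two sampled responses agree?
    entailed_by_context: bool    # did the entailment classifier pass?

def failed_checks(s: Signals):
    failures = []
    if s.retrieval_similarity < SIMILARITY_THRESHOLD:
        failures.append("retrieval_similarity")
    if not s.responses_consistent:
        failures.append("response_consistency")
    if not s.entailed_by_context:
        failures.append("semantic_entailment")
    return failures

def route(s: Signals) -> str:
    # Failing any single check is enough to hold the response back.
    return "human_review" if failed_checks(s) else "send_to_user"

print(route(Signals(0.81, True, True)))
print(route(Signals(0.65, True, True)))
```

Returning the list of failed checks, not just a yes/no, is what makes it practical to trace low-confidence responses back to gaps in the knowledge base.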

Integration With the Client's Systems

The agent needed to plug into the client's existing infrastructure without a full platform overhaul. Their stack included a custom CRM, a core banking system, and a Zendesk-based ticketing setup.

We built a middleware layer that handled:

Authentication and identity resolution — Verifying the user's identity before any account-specific query was answered, pulling their profile from the CRM.
Real-time transaction lookups — Querying the core banking API for live payment status, balance information, and transaction history.
Ticket creation and routing — When escalation was triggered, automatically creating a Zendesk ticket pre-populated with the full conversation transcript, user context, and a suggested category.
Analytics logging — Every interaction was logged to a data warehouse for continuous improvement analysis.
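As an illustration of the ticket creation step, the sketch below assembles the context a human agent would receive on escalation. The field names and the `build_escalation_ticket` helper are hypothetical and do not reflect Zendesk's actual API schema; a real integration would map this payload onto the ticketing system's own format.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Turn:
    speaker: str  # "user" or "agent"
    text: str

def build_escalation_ticket(user_id: str, turns: List[Turn],
                            suggested_category: str) -> Dict[str, object]:
    """Bundle transcript, user context, and a suggested category for handoff."""
    transcript = "\n".join(f"{t.speaker}: {t.text}" for t in turns)
    return {
        "requester_id": user_id,
        "suggested_category": suggested_category,
        "transcript": transcript,
        "turn_count": len(turns),
    }

ticket = build_escalation_ticket(
    "user-1",
    [Turn("user", "Where is my refund?"),
     Turn("agent", "Let me connect you with a colleague.")],
    "transaction_status",
)
print(ticket["transcript"])
```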

We used a webhook-based event system rather than polling, which kept latency under 400ms for 95% of API-grounded responses — fast enough that users couldn't tell they were waiting for a live data call.
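"Under 400ms for 95% of responses" is a statement about the 95th-percentile latency. A minimal way to check it from logged timings, using the nearest-rank method, is sketched below; the sample latencies are made up for illustration.

```python
def percentile_nearest_rank(values, percent):
    """Nearest-rank percentile: smallest value with at least percent% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100, avoiding floating-point rounding.
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds for 20 API-grounded responses.
latencies_ms = [120] * 18 + [380, 900]

p95 = percentile_nearest_rank(latencies_ms, 95)
print(f"p95 = {p95}ms, target met: {p95 < 400}")
```

Tracking a percentile rather than the mean keeps a handful of slow calls visible without letting them dominate the headline number.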

Results After 90 Days

Three months post-launch, here's where things stood:

90.3% of queries resolved autonomously — up from 0% with the old rule-based bot that customers had learned to immediately bypass.
Average response time: 8 seconds — down from 4.2 hours with the human-only team.
CSAT score: 4.6 / 5.0 — the human team's previous best was 4.1, achieved during low-volume periods.
Support team headcount: -40% — the client redeployed 6 agents to higher-value tasks like onboarding and relationship management rather than cutting staff.
Escalation quality improved — The 9.7% of queries that did reach human agents were genuinely complex. Human agents reported spending less time gathering context and more time actually solving problems.

What We Got Wrong (And Fixed)

No project goes perfectly. Here's what we underestimated:

Tone consistency — Early versions of the agent sounded inconsistent. Sometimes formal, sometimes casual. We added a system prompt layer with strict persona guidelines and ran a tone audit across 500 sample responses before go-live.
Edge cases in financial compliance — There were certain questions the agent legally couldn't answer without proper disclaimers (investment-related queries, for example). We built a regulatory filter that detected these topics and returned compliant, pre-approved responses instead.
Knowledge base staleness — Product policies change. We underestimated how quickly the knowledge base would go stale. We solved this by building a weekly review process and a document versioning system so outdated chunks were automatically flagged for review.
Prompt injection attempts — Some users tried to manipulate the agent into revealing system prompts or behaving outside its scope. We added an input sanitization layer and adversarial testing as part of our QA process.
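The staleness check can be sketched as a filter over the last-verified date attached to each chunk. The record layout and the 7-day window below are assumptions chosen to match a weekly review cadence, not the actual versioning system.

```python
from datetime import date, timedelta

def stale_chunk_ids(chunks, today, max_age_days=7):
    """Return ids of chunks whose last verification is older than the allowed window."""
    cutoff = today - timedelta(days=max_age_days)
    return [c["id"] for c in chunks if c["last_verified"] < cutoff]

# Hypothetical chunk metadata records.
chunks = [
    {"id": "fees-001", "last_verified": date(2024, 5, 1)},
    {"id": "kyc-014", "last_verified": date(2024, 5, 18)},
    {"id": "login-007", "last_verified": date(2024, 4, 2)},
]

flagged = stale_chunk_ids(chunks, today=date(2024, 5, 20))
print(flagged)
```

Passing `today` in explicitly keeps the check deterministic and easy to test.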

Key Lessons for Teams Building Support AI

If you're considering building something similar, here's what we'd tell ourselves on day one:

Spend more time on data than on models. The quality of your knowledge base matters more than which LLM you choose. In our experience, a well-structured knowledge base with a weaker model will outperform a messy one with a state-of-the-art model.
Design for failure first. Build your escalation paths before you build your happy paths. The agent's behavior when it doesn't know something is more important than its behavior when it does.
Measure precision, not just deflection rate. An agent that deflects 95% of tickets but gives wrong answers is worse than useless — it destroys trust. Track accuracy relentlessly.
Involve the human support team early. The people who handle tickets every day know nuances that no documentation captures. Their input shaped our knowledge base and our escalation criteria in ways we couldn't have anticipated.
Ship small and iterate. We launched with coverage for just the top 4 query categories. Full coverage came 6 weeks later. This let us fix real problems with real users before scaling.
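To make the third lesson concrete, the sketch below computes deflection rate and answer accuracy side by side from the same ticket log. The record fields and sample data are hypothetical; accuracy here is measured only over agent-resolved tickets that a human has audited.

```python
def deflection_rate(tickets):
    """Share of tickets resolved by the agent without a human."""
    resolved = sum(1 for t in tickets if t["resolved_by"] == "agent")
    return resolved / len(tickets)

def audited_accuracy(tickets):
    """Share of audited agent answers judged correct (None means not audited)."""
    audited = [t for t in tickets
               if t["resolved_by"] == "agent" and t["audited_correct"] is not None]
    if not audited:
        return None  # nothing audited yet, so accuracy is unknown
    return sum(1 for t in audited if t["audited_correct"]) / len(audited)

# Hypothetical log of 10 tickets.
tickets = (
    [{"resolved_by": "agent", "audited_correct": True}] * 6
    + [{"resolved_by": "agent", "audited_correct": False}] * 2
    + [{"resolved_by": "agent", "audited_correct": None}] * 1
    + [{"resolved_by": "human", "audited_correct": None}] * 1
)

print(f"Deflection rate: {deflection_rate(tickets):.0%}")
print(f"Audited accuracy: {audited_accuracy(tickets):.0%}")
```

In this made-up sample the agent deflects most tickets, yet a quarter of its audited answers are wrong, which is exactly the failure a deflection-only dashboard would hide.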

What's Next

The client is now working with us on phase two: a proactive agent that doesn't wait for users to ask, but reaches out when it detects anomalies such as unusual transaction patterns, approaching account limits, or failed payment retries. Instead of waiting for a frustrated ticket, the system sends a personalized message before the user even realizes there's an issue.

The shift from reactive to proactive support is where AI really starts to change the game. We'll write about that build when it launches.

If you're building something similar or thinking about deploying AI agents for your product, we'd love to talk. Get in touch with the DartSyn team.

How We Built an AI Agent That Handles 90% of Customer Support