Case study · Accounting
Near-zero-touch bank-feed categorization for a QuickBooks-driven accounting firm.
A 4-stage vendor matching pipeline collapsed hours of manual transaction work into seconds, with auditable shadow columns, transparent confidence scores, and zero loss of human oversight.
of transactions auto-categorized end-to-end
vendor matching pipeline
median categorization latency
audit-ready from day one
The problem
"Every month we lose two days to QuickBooks transaction classification."
A mid-market accounting practice serving small and mid-sized business clients was drowning in QuickBooks Online bank-feed data. Every month, staff accountants pulled bank feeds for each client, looked at each transaction's raw description ("AMZNMKTPL*5HG2J1XN3"), guessed the vendor, picked a category from a chart of accounts, and clicked accept. Repeat for thousands of transactions per client, across hundreds of clients.
The firm had tried QuickBooks' built-in rules engine. It worked for the obvious cases, a string-match rule for "STARBUCKS" categorizes to "Meals", but broke down on the long tail: unfamiliar vendors, payment processors that obscure the merchant, transaction descriptions that change month to month, and clients in industries with idiosyncratic vendor sets. The result: hours of manual work every month, inconsistency between staff accountants, and a steady backlog that pushed monthly closes later and later.
They wanted three things: drastically reduce the manual work, keep the staff accountant in the loop on uncertain cases, and have an audit trail that explained every categorization decision in case a client (or the IRS) ever asked.
The approach
A 4-stage matching pipeline, with the LLM as the last resort, not the first call.
Most teams reach for an LLM the moment AI is on the table. We built the opposite. Every transaction runs through a four-stage pipeline; only the transactions that need generative AI reach it. That keeps inference cost low, latency predictable, and the audit story clean.
Stage 1 · Sanitization
Strip the noise from raw bank descriptions.
Bank feeds are full of merchant-code prefixes, transaction-ID suffixes, store numbers, and processor codes that mean nothing semantically. A regex layer normalizes "TST* MERCHANT NAME #1234 SAN FRANCISCO CA" down to "merchant name", preserving the signal, dropping the noise, before any matching runs.
Stage 2 · Exact match
Did we see this exact vendor last month?
The fastest, most certain match. Sanitized vendor strings hit a per-client lookup table built from prior categorizations. If we matched it before with high confidence and the client accepted, we apply the same mapping. ~60% of transactions stop here.
Stage 3 · Fuzzy match (80% threshold)
Almost the same vendor, slightly different string.
A Fuse.js fuzzy match catches near-duplicates ("Amzn.com" vs "AMZN MKTP" vs "AMAZON.COM*5HG2"). Threshold tuned to 80% similarity, high enough to avoid false positives, low enough to catch the realistic variation in bank descriptions. Picks up another ~25%.
Stage 4 · Gemini 2.5 Flash inference
When everything else fails, ask the model.
For the remaining ~15%, genuinely new vendors, ambiguous descriptions, anything where the deterministic stages couldn't reach high confidence, we call Google Gemini 2.5 Flash with the sanitized description, the client's vendor history, and the chart of accounts. Structured JSON output, with a confidence score and reasoning string. Low-confidence outputs route to a staff accountant.
Why the LLM only reaches the last 15%: Inference is the most expensive layer and the most variable. Deterministic stages are fast, cheap, and explainable, and they get more of them right than people assume. Gemini handles the genuinely novel cases, with full audit trail. The pipeline order is the architecture.
Architecture
What's actually running in production.
- Frontend, React (Vite) + TailwindCSS. Admin console for the accounting team: per-client dashboards, per-transaction review queue, LLM-suggestion approval, drift alerts.
- Backend, Node.js (Express) + SQLite via
better-sqlite3. Per-client data isolation enforced at the query layer. Stateless API tier behind PM2 for auto-restart. - QuickBooks integration, OAuth 2.0 with the official Intuit API, plus Change Data Capture (CDC) polling for incremental sync. Every Intuit API response captures the
intuit_tidtrace header for support escalation. - LLM gateway, Google Gemini 2.5 Flash. Picked for the combination of low latency, low cost per call, and structured JSON output. Calls isolated in a single
server/services/ai.jsmodule for easy provider swap. - Shadow columns, Every LLM categorization writes to
llm_vendor_idandllm_vendor_namealongside the main fields. Allows non-destructive validation: compare agent picks to human picks for thousands of historical transactions before promoting AI to write the canonical field. - Hosting, Single EC2 instance (Ubuntu) managed by PM2. CI/CD via GitHub Actions. Cache-Control no-store on every response containing financial detail. Daily backups; point-in-time-recoverable.
- Error tracking, All QBO API failures land in an
error_logstable with theintuit_tid. Makes troubleshooting with Intuit support a matter of pulling the trace ID, not reverse-engineering what happened.
The result
Two days a month became 90 minutes.
After six weeks of staged rollout (one client first, then ten, then the full book), the firm now categorizes new transactions with ~95% auto-application. The remaining ~5%, low-confidence Gemini suggestions plus genuinely novel vendors, route to a daily review queue that one staff accountant clears in about 90 minutes per week, across every client.
Beyond raw time saved, the practice gained three things they hadn't fully scoped for at the start:
- Consistency between accountants. Two staff members no longer categorize the same vendor differently. The lookup table is the source of truth; every categorization decision compounds into firm-wide knowledge.
- Plain-English transaction explanations. A second Gemini call generates a one-sentence explanation per categorization for client-facing reports. Clients ask fewer "what's this charge?" questions, and the firm's reports got noticeably more readable.
- An audit story that holds up. Every categorization records: source description, sanitized form, match stage that fired, confidence score, model reasoning (when LLM was used), and approving staff member. The whole chain reconstructs in one query.
Tech stack
Application
React (Vite), TailwindCSS, Node.js/Express, better-sqlite3, Fuse.js, PM2
AI & integrations
Google Gemini 2.5 Flash via @google/genai SDK, QuickBooks Online OAuth 2.0 + CDC, Intuit OpenID Connect Discovery
Infrastructure & security
AWS EC2 (Ubuntu), GitHub Actions CI/CD, bcryptjs auth, express-rate-limit, helmet, Cache-Control no-store on sensitive endpoints
Patterns worth stealing
Three lessons from this engagement.
- Let the LLM be the last resort, not the first call. Deterministic stages are faster, cheaper, more explainable, and often more accurate for the bulk of cases. Use generative AI for the genuine long tail.
- Shadow columns are how you ship AI safely. Run the model's predictions next to the human's for weeks before flipping authority. The proof is in the diff, not the demo.
- Capture the trace ID from every external API. When something goes wrong, you want to walk into Intuit support (or any partner) with the exact request they need, not "I think it happened around 3pm yesterday."