Near-Zero-Touch Bank-Feed Categorization for a QuickBooks-Driven Accounting Firm | Case Study

~95%

of transactions auto-categorized end-to-end

4-stage

vendor matching pipeline

< 200ms

median categorization latency

SOC 2

audit-ready from day one

The problem

"Every month we lose two days to QuickBooks transaction classification."

A mid-market accounting practice serving small and mid-sized business clients was drowning in QuickBooks Online bank-feed data. Every month, staff accountants pulled bank feeds for each client, looked at each transaction's raw description ("AMZNMKTPL*5HG2J1XN3"), guessed the vendor, picked a category from a chart of accounts, and clicked accept. Repeat for thousands of transactions per client, across hundreds of clients.

The firm had tried QuickBooks' built-in rules engine. It worked for the obvious cases, a string-match rule for "STARBUCKS" categorizes to "Meals", but broke down on the long tail: unfamiliar vendors, payment processors that obscure the merchant, transaction descriptions that change month to month, and clients in industries with idiosyncratic vendor sets. The result: hours of manual work every month, inconsistency between staff accountants, and a steady backlog that pushed monthly closes later and later.

They wanted three things: drastically reduce the manual work, keep the staff accountant in the loop on uncertain cases, and have an audit trail that explained every categorization decision in case a client (or the IRS) ever asked.

The approach

A 4-stage matching pipeline, with the LLM as the last resort, not the first call.

Most teams reach for an LLM the moment AI is on the table. We built the opposite. Every transaction runs through a four-stage pipeline; only the transactions that need generative AI reach it. That keeps inference cost low, latency predictable, and the audit story clean.

Stage 1 · Sanitization

Strip the noise from raw bank descriptions.

Bank feeds are full of merchant-code prefixes, transaction-ID suffixes, store numbers, and processor codes that mean nothing semantically. A regex layer normalizes "TST* MERCHANT NAME #1234 SAN FRANCISCO CA" down to "merchant name", preserving the signal, dropping the noise, before any matching runs.

Stage 2 · Exact match

Did we see this exact vendor last month?

The fastest, most certain match. Sanitized vendor strings hit a per-client lookup table built from prior categorizations. If we matched it before with high confidence and the client accepted, we apply the same mapping. ~60% of transactions stop here.

Stage 3 · Fuzzy match (80% threshold)

Almost the same vendor, slightly different string.

A Fuse.js fuzzy match catches near-duplicates ("Amzn.com" vs "AMZN MKTP" vs "AMAZON.COM*5HG2"). Threshold tuned to 80% similarity, high enough to avoid false positives, low enough to catch the realistic variation in bank descriptions. Picks up another ~25%.

Stage 4 · Gemini 2.5 Flash inference

When everything else fails, ask the model.

For the remaining ~15%, genuinely new vendors, ambiguous descriptions, anything where the deterministic stages couldn't reach high confidence, we call Google Gemini 2.5 Flash with the sanitized description, the client's vendor history, and the chart of accounts. Structured JSON output, with a confidence score and reasoning string. Low-confidence outputs route to a staff accountant.

Why the LLM only reaches the last 15%: Inference is the most expensive layer and the most variable. Deterministic stages are fast, cheap, and explainable, and they get more of them right than people assume. Gemini handles the genuinely novel cases, with full audit trail. The pipeline order is the architecture.

Architecture

What's actually running in production.

Frontend, React (Vite) + TailwindCSS. Admin console for the accounting team: per-client dashboards, per-transaction review queue, LLM-suggestion approval, drift alerts.
Backend, Node.js (Express) + SQLite via better-sqlite3. Per-client data isolation enforced at the query layer. Stateless API tier behind PM2 for auto-restart.
QuickBooks integration, OAuth 2.0 with the official Intuit API, plus Change Data Capture (CDC) polling for incremental sync. Every Intuit API response captures the intuit_tid trace header for support escalation.
LLM gateway, Google Gemini 2.5 Flash. Picked for the combination of low latency, low cost per call, and structured JSON output. Calls isolated in a single server/services/ai.js module for easy provider swap.
Shadow columns, Every LLM categorization writes to llm_vendor_id and llm_vendor_name alongside the main fields. Allows non-destructive validation: compare agent picks to human picks for thousands of historical transactions before promoting AI to write the canonical field.
Hosting, Single EC2 instance (Ubuntu) managed by PM2. CI/CD via GitHub Actions. Cache-Control no-store on every response containing financial detail. Daily backups; point-in-time-recoverable.
Error tracking, All QBO API failures land in an error_logs table with the intuit_tid. Makes troubleshooting with Intuit support a matter of pulling the trace ID, not reverse-engineering what happened.

The result

Two days a month became 90 minutes.

After six weeks of staged rollout (one client first, then ten, then the full book), the firm now categorizes new transactions with ~95% auto-application. The remaining ~5%, low-confidence Gemini suggestions plus genuinely novel vendors, route to a daily review queue that one staff accountant clears in about 90 minutes per week, across every client.

Beyond raw time saved, the practice gained three things they hadn't fully scoped for at the start:

Consistency between accountants. Two staff members no longer categorize the same vendor differently. The lookup table is the source of truth; every categorization decision compounds into firm-wide knowledge.
Plain-English transaction explanations. A second Gemini call generates a one-sentence explanation per categorization for client-facing reports. Clients ask fewer "what's this charge?" questions, and the firm's reports got noticeably more readable.
An audit story that holds up. Every categorization records: source description, sanitized form, match stage that fired, confidence score, model reasoning (when LLM was used), and approving staff member. The whole chain reconstructs in one query.

Tech stack

Application

React (Vite), TailwindCSS, Node.js/Express, better-sqlite3, Fuse.js, PM2

AI & integrations

Google Gemini 2.5 Flash via @google/genai SDK, QuickBooks Online OAuth 2.0 + CDC, Intuit OpenID Connect Discovery

Infrastructure & security

AWS EC2 (Ubuntu), GitHub Actions CI/CD, bcryptjs auth, express-rate-limit, helmet, Cache-Control no-store on sensitive endpoints

Patterns worth stealing

Three lessons from this engagement.

Let the LLM be the last resort, not the first call. Deterministic stages are faster, cheaper, more explainable, and often more accurate for the bulk of cases. Use generative AI for the genuine long tail.
Shadow columns are how you ship AI safely. Run the model's predictions next to the human's for weeks before flipping authority. The proof is in the diff, not the demo.
Capture the trace ID from every external API. When something goes wrong, you want to walk into Intuit support (or any partner) with the exact request they need, not "I think it happened around 3pm yesterday."

Near-zero-touch bank-feed categorization for a QuickBooks-driven accounting firm.

"Every month we lose two days to QuickBooks transaction classification."

A 4-stage matching pipeline, with the LLM as the last resort, not the first call.

Strip the noise from raw bank descriptions.

Did we see this exact vendor last month?

Almost the same vendor, slightly different string.

When everything else fails, ask the model.

What's actually running in production.

Two days a month became 90 minutes.

Application

AI & integrations

Infrastructure & security

Three lessons from this engagement.

Bank feeds, expense reports, or a different workflow with the same shape?