From Direct Classification to Agentic Routing: Local vs Cloud AI

The Problem: Every AI Request Isn’t Equal

You’re building AI workflows. Maybe it’s a support ticket classifier, a document parser, or an MCP server that routes queries to the right data source. The question hits immediately: do I run this locally or send it to Azure/OpenAI?

Most people pick one and stick with it. Local for privacy, cloud for power. But that’s leaving performance and cost on the table. The real answer is both, routed intelligently.

I’ve run into this building MCP servers for SeaynicNet. Some queries need GPT-4’s reasoning. Others are trivial pattern matching that a local 7B model handles in milliseconds. The trick is knowing which is which before you make the expensive call.

Direct Classification: The Naive Approach

The simplest pattern: every request hits your cloud LLM. Azure OpenAI gets a ticket description, returns a category. Done.

from openai import AzureOpenAI

# Reads AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY from the environment
azure_client = AzureOpenAI(api_version="2024-02-01")

response = azure_client.chat.completions.create(
    model="gpt-4",  # your Azure deployment name
    messages=[{"role": "user", "content": f"Classify this ticket: {text}"}]
)
category = response.choices[0].message.content

This works. It’s also wasteful. You’re paying $0.03/1K tokens for queries like “reset my password” that a regex could handle. At scale, that’s real money. More importantly, it’s slow — every request waits for a network round-trip.
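That regex fast path is cheap to sketch. The pattern table below is illustrative, not a real taxonomy — the point is that trivial tickets never need a model at all:

```python
import re

# Hypothetical fast path: catch trivially-classifiable tickets before any LLM call
FAST_PATTERNS = {
    "password_reset": re.compile(r"\b(reset|forgot).{0,20}\bpassword\b", re.I),
    "account_locked": re.compile(r"\baccount\b.{0,20}\block(ed)?\b", re.I),
}

def fast_classify(text: str):
    for category, pattern in FAST_PATTERNS.items():
        if pattern.search(text):
            return category
    return None  # no match: fall through to the LLM
```

Anything this layer catches costs zero tokens and zero network hops.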

Local Models: Fast and Cheap, But Limited

Flip it: run everything through a local model. Llama 3.1 8B on a Mac Mini M4 Pro handles basic classification at ~50 tokens/second with zero API cost.

from llama_cpp import Llama

model = Llama(model_path="./models/llama-3.1-8b.gguf", n_ctx=4096)
response = model.create_chat_completion(
    messages=[{"role": "user", "content": f"Classify: {text}"}]
)
# llama_cpp returns an OpenAI-shaped dict, not an object
category = response["choices"][0]["message"]["content"]

Latency drops to milliseconds. Cost drops to electricity. But now you’re stuck: the local model chokes on nuanced queries. “My login works on mobile but not desktop, only after VPN connects” needs reasoning power that 8B parameters don’t reliably deliver.

Agentic Routing: The Hybrid Pattern

Here’s the move: use a cheap local model as a router. It doesn’t classify the ticket — it decides which system should classify it.

router_prompt = f"""
Analyze this query and return ONLY 'local' or 'cloud':
- local: simple, common patterns (password reset, account locked, basic how-to)
- cloud: complex reasoning, edge cases, ambiguous intent

Query: {text}
"""

# llama_cpp's __call__ does raw completion and returns a dict
route = local_model(router_prompt)["choices"][0]["text"].strip().lower()

# Anything unexpected from the router falls through to the cloud path
if route == "local":
    result = local_classifier(text)
else:
    result = azure_classifier(text)

The router runs locally, completes in ~100ms, and costs nothing. It catches 70-80% of requests — the easy wins. The remaining 20-30% that genuinely need GPT-4’s horsepower hit Azure.

Result: lower cost, better latency for most requests, full power when you need it.

When This Actually Matters

This isn’t academic. I’ve seen enterprise support systems process 10K+ tickets daily. Direct classification at $0.03/1K tokens with average 500-token exchanges = $150/day, $4500/month. Agentic routing cuts that to ~$1000/month while improving p50 latency.
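The arithmetic behind those numbers, using the figures above:

```python
# Back-of-envelope for the costs above: 10K tickets/day, ~500 tokens each,
# $0.03 per 1K tokens, and roughly 75% of traffic staying local.
TICKETS_PER_DAY = 10_000
TOKENS_PER_TICKET = 500
PRICE_PER_1K_TOKENS = 0.03

daily_cost = TICKETS_PER_DAY * TOKENS_PER_TICKET / 1_000 * PRICE_PER_1K_TOKENS
monthly_cost = daily_cost * 30
hybrid_monthly = monthly_cost * 0.25  # only ~25% of requests still hit Azure

print(daily_cost, monthly_cost, hybrid_monthly)  # 150.0 4500.0 1125.0
```

The 75% local split is the assumption doing the work here — your actual savings track whatever split your router achieves.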

For MCP servers, it’s the difference between a chatbot that feels snappy vs one that makes you wait. If you’re routing to different knowledge bases or APIs based on query intent, the router pattern lets you keep complex orchestration local and only call out when truly necessary.

The Enterprise Reality Check

Running local models in production isn’t trivial. You need:

  • Inference infrastructure: Ollama, llama.cpp, or vLLM for serving
  • Model management: versioning, updates, A/B testing
  • Fallback logic: what happens when the router is wrong?

Azure OpenAI gives you SLAs, compliance, audit logs. Local gives you control and cost efficiency. The hybrid approach acknowledges both are valuable.
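On the fallback point: one common shape is a confidence floor on the local classifier, escalating to the cloud when it isn't met. Everything here is an assumption — the threshold, and the (label, confidence) return contract for the local classifier:

```python
# Hypothetical fallback for "what happens when the router is wrong?":
# the local classifier returns (label, confidence) -- an assumed contract --
# and anything under the floor escalates to the cloud model.
CONFIDENCE_FLOOR = 0.7  # illustrative; tune against logged decisions

def classify_with_fallback(text, local_classifier, azure_classifier):
    label, confidence = local_classifier(text)
    if confidence < CONFIDENCE_FLOOR:
        return azure_classifier(text)  # don't ship a low-confidence guess
    return label
```

This costs you an occasional double inference, but a wrong cheap answer is usually worse than a slow right one.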

In a healthcare environment like mine, it also maps to data sensitivity. PHI-adjacent queries route locally by policy. General knowledge queries can hit Azure. The router becomes an enforcement point.
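A sketch of that enforcement point. The marker list is deliberately naive — a real deployment would use a proper PHI/PII detector, not keywords — but the shape is the thing: policy runs after the router and can only tighten its decision.

```python
# Hypothetical policy layer: PHI-adjacent queries stay local no matter what
# the LLM router says. Marker list is illustrative only.
PHI_MARKERS = ("patient", "mrn", "diagnosis", "dob")

def enforce_policy(text: str, llm_route: str) -> str:
    if any(marker in text.lower() for marker in PHI_MARKERS):
        return "local"  # policy overrides the model's decision
    return llm_route
```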

Implementation Notes

A few things I’ve learned:

  1. Router confidence matters: If the local model returns 'cloud' 40% of the time, it’s not saving you much. Tune your prompt or fine-tune the router.

  2. Cache cloud responses: If 5% of queries are identical, don’t re-call Azure. Local Redis cache in front of the cloud route pays off fast.

  3. Monitor route decisions: Log every local vs cloud choice. You’ll find patterns — maybe all queries with specific keywords should skip routing entirely.

  4. Local doesn’t mean offline: I run my router model via Ollama on the same Docker network as the MCP server. No external calls, but also not single-process complexity.
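Notes 2 and 3 can live in one thin wrapper around the route decision. This is a sketch — the in-memory dict stands in for Redis, and the classifier callables are placeholders:

```python
import hashlib
import time

# In-memory cache in front of the cloud route (swap for Redis in production),
# plus a decision log you can mine to tune the router.
_cache = {}
decision_log = []

def classify_with_cache(text, route, local_fn, cloud_fn):
    decision_log.append({"ts": time.time(), "route": route, "len": len(text)})
    if route == "local":
        return local_fn(text)
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = cloud_fn(text)  # only novel queries pay the Azure toll
    return _cache[key]
```

Even a few percent of exact-duplicate queries makes the cache lookup pay for itself, and the log is what tells you whether you're hitting that 70%+ local split.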

When to Skip This

If you’re processing <1000 requests/day, direct cloud classification is fine. The cost is noise. The complexity isn’t worth it.

If your queries are genuinely all complex (legal document analysis, medical coding), there’s no easy 80% to route locally. Just use the big model.

But if you’re in that middle ground — enough volume to care about cost, enough variety to benefit from routing — the agentic pattern is worth building.

Takeaway

Start with direct classification to prove the workflow. When you understand your query distribution, add a local router. Monitor the split. Tune until 70%+ stay local. That’s when hybrid architecture stops being clever and starts being cost-effective.

The future of enterprise AI isn’t local or cloud. It’s local and cloud, with smart routing deciding which handles each request.