The Challenge
b6ai needed a cost-effective and scalable intent identification system that could handle high volumes of user queries without sending every request to an expensive large language model API.
The Solution
We designed a three-tier fallback architecture: a lightweight custom NLP layer using the Compromise package handles the majority of intent identification locally, with progressively more powerful (and costly) fallbacks for edge cases.
Introduction
When b6ai approached BytesNBinary, they were facing a common challenge in the conversational AI space: how to accurately identify user intent at scale without incurring prohibitive infrastructure costs from constant LLM API calls. We designed and built a multi-tier intent identification system that starts with a custom, lightweight NLP layer and progressively falls back to more powerful systems only when needed. The result was a dramatic reduction in infrastructure costs while maintaining high accuracy.
The Challenge
b6ai's conversational AI platform was processing thousands of user queries daily. Every query was being routed to a large language model for intent identification, leading to high API costs and latency. They needed a solution that could handle the majority of common intents locally while still maintaining the accuracy of LLM-based classification for ambiguous or complex queries.
Key Requirements
The solution needed to meet several critical requirements to be viable for production deployment:
The Three-Tier Fallback Architecture
We designed a cascading three-tier system where each tier is progressively more capable (and more expensive). The system tries the cheapest, fastest tier first, and only escalates to higher tiers when confidence is low.
Tier 1: Custom NLP with Compromise.js
The first tier uses the Compromise natural language processing library to perform fast, rule-based intent identification. We built a custom intent matching engine on top of Compromise that uses pattern matching, entity extraction, and lightweight classification to identify user intents. This tier handles approximately 75-80% of all queries with high confidence, at near-zero marginal cost.
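To make the Tier 1 approach concrete, here is a minimal sketch of a rule-based intent matcher with a confidence score. The production engine is built on Compromise's match syntax; the patterns, intent names, and confidence values below are illustrative assumptions, not b6ai's actual rules.

```typescript
// Sketch of a Tier-1 rule-based intent matcher. Patterns, intent names,
// and confidence values are illustrative assumptions only.
interface Tier1Result {
  intent: string | null;
  confidence: number;
}

// Each intent is defined by one or more patterns and a base confidence.
const INTENT_PATTERNS: { intent: string; patterns: RegExp[]; confidence: number }[] = [
  { intent: "greeting", patterns: [/^(hi|hello|hey)\b/i], confidence: 0.95 },
  { intent: "check_balance", patterns: [/\bbalance\b/i], confidence: 0.9 },
  { intent: "cancel_order", patterns: [/\bcancel\b.*\border\b/i], confidence: 0.85 },
];

function classifyTier1(query: string): Tier1Result {
  for (const { intent, patterns, confidence } of INTENT_PATTERNS) {
    if (patterns.some((p) => p.test(query))) {
      return { intent, confidence };
    }
  }
  // No rule matched: return zero confidence so the caller escalates to Tier 2.
  return { intent: null, confidence: 0 };
}
```

Because this tier is pure in-process pattern matching, its marginal cost per query is effectively zero, which is what makes it viable as the first line of defense.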
Tier 2: Semantic Search with pgvector
When the Compromise-based NLP layer cannot confidently identify an intent (confidence below threshold), the query is passed to Tier 2. This tier uses PostgreSQL with the pgvector extension to perform semantic similarity search against a pre-computed vector database of intent embeddings. By comparing the query embedding against known intent embeddings, we can identify the closest matching intent even for paraphrased or unusual phrasings.
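The shape of a Tier 2 lookup can be sketched as follows. pgvector exposes cosine distance through the `<=>` operator, so similarity is `1 - distance`; the table and column names in the SQL are assumptions for illustration, and the local function mirrors the same similarity measure the operator computes.

```typescript
// Tier-2 sketch: nearest-intent lookup with pgvector. The table and
// column names (intent_embeddings, embedding, intent) are illustrative
// assumptions. With pgvector, `<=>` is cosine distance, so the query
// would look like:
//
//   SELECT intent, 1 - (embedding <=> $1) AS similarity
//   FROM intent_embeddings
//   ORDER BY embedding <=> $1
//   LIMIT 1;
//
// The same similarity measure, computed locally:
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Identical query and intent embeddings yield a similarity of 1.0, and unrelated embeddings trend toward 0, which is why a single similarity threshold is enough to decide whether to accept the match or escalate to Tier 3.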
Tier 3: Qwen 7B LLM Fallback
For the remaining 2-5% of queries that neither the custom NLP nor the embedding search can handle with sufficient confidence, we fall back to Qwen 7B — a small but capable language model that can run on CPU infrastructure. This avoids the need for expensive GPU-based LLM API calls while still providing intelligent classification for the most ambiguous queries.
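One common pattern for this kind of fallback is to constrain the model to a fixed label set and validate its reply before trusting it. The intent list and prompt wording below are assumptions, not b6ai's actual prompt; the point is the constrain-and-parse shape, which guards against a small model returning free-form text.

```typescript
// Tier-3 sketch: constrain the LLM to known labels and validate output.
// The label set and prompt wording are illustrative assumptions.
const KNOWN_INTENTS = ["greeting", "check_balance", "cancel_order", "unknown"];

function buildClassificationPrompt(query: string): string {
  return (
    `Classify the user query into exactly one of: ${KNOWN_INTENTS.join(", ")}.\n` +
    `Reply with the label only.\n` +
    `Query: ${query}\n` +
    `Label:`
  );
}

// Small models sometimes add whitespace, casing, or trailing words;
// normalize the reply and fall back to "unknown" if it is not a known label.
function parseIntentLabel(raw: string): string {
  const label = raw.trim().toLowerCase().split(/\s+/)[0];
  return KNOWN_INTENTS.includes(label) ? label : "unknown";
}
```

Validating the label before returning it means a malformed model reply degrades gracefully to "unknown" rather than injecting an arbitrary string into downstream routing.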
Implementation Details
The system was implemented as a Node.js service with TypeScript, integrating seamlessly into b6ai's existing conversational AI pipeline. The Compromise NLP layer runs in-process, pgvector queries go to the existing PostgreSQL database, and Qwen 7B runs as a separate inference service.
async function identifyIntent(query: string): Promise<IntentResult> {
  // Tier 1: Custom NLP with Compromise
  const nlpResult = await compromiseNLP.classify(query);
  if (nlpResult.confidence >= CONFIDENCE_THRESHOLD) {
    return { intent: nlpResult.intent, tier: 1, confidence: nlpResult.confidence };
  }

  // Tier 2: Semantic search with pgvector
  const embedding = await generateEmbedding(query);
  const vectorResult = await pgvectorSearch(embedding);
  if (vectorResult.similarity >= SIMILARITY_THRESHOLD) {
    return { intent: vectorResult.intent, tier: 2, confidence: vectorResult.similarity };
  }

  // Tier 3: Qwen 7B fallback
  const llmResult = await qwenClassify(query);
  return { intent: llmResult.intent, tier: 3, confidence: llmResult.confidence };
}

Simplified version of the three-tier intent identification pipeline.
Results and Impact
After deploying the three-tier system, b6ai saw immediate and significant improvements in both cost efficiency and system performance. The custom NLP layer handled the vast majority of queries at near-zero marginal cost, while the fallback tiers ensured that accuracy remained high even for edge cases. The system proved that intelligent architecture design can dramatically reduce dependency on expensive LLM APIs without sacrificing quality.
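Verifying claims like "Tier 1 absorbs 75-80% of traffic" requires measuring which tier resolves each query in production. A minimal sketch of such tracking is shown below; the class and method names are illustrative, not part of b6ai's codebase.

```typescript
// Sketch: track which tier resolves each query, so the tier split
// (e.g. ~75-80% at Tier 1) can be verified in production. Names are
// illustrative assumptions only.
class TierStats {
  private counts: number[] = [0, 0, 0];

  record(tier: 1 | 2 | 3): void {
    this.counts[tier - 1]++;
  }

  // Fraction of queries resolved by each tier, in tier order.
  distribution(): number[] {
    const total = this.counts.reduce((a, b) => a + b, 0);
    return total === 0 ? [0, 0, 0] : this.counts.map((c) => c / total);
  }
}
```

Feeding the `tier` field already returned by the pipeline into a counter like this makes cost regressions visible immediately: if the Tier 1 share drops, expensive fallback traffic is growing.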
Conclusion
The custom NLP project for b6ai demonstrates that not every conversational AI problem requires a large language model. By designing a tiered architecture with a lightweight NLP layer as the first line of defense, we reduced infrastructure costs by approximately 70% while maintaining 95%+ intent identification accuracy. This approach is applicable to any organization running high-volume conversational AI systems where the majority of queries fall into well-defined intent categories.
Interested in similar results?
Let's discuss how we can help your business.
