b6ai · AI & Conversational Tech

Custom NLP Layer for Intelligent Intent Identification at b6ai

Built a custom NLP layer using the Compromise package for initial intent identification, reducing infrastructure costs by ~70% with a multi-tier fallback architecture using pgvector and Qwen 7B.

~70% infrastructure cost reduction: reduced LLM API costs by handling most intents with lightweight NLP

95%+ intent accuracy: the combined three-tier system achieves high accuracy across all query types

<50ms response latency: the Tier 1 NLP layer responds in under 50ms for matched intents
Technologies: Compromise.js, PostgreSQL, pgvector, Qwen 7B, Node.js, TypeScript

The Challenge

b6ai needed a cost-effective and scalable intent identification system that could handle high volumes of user queries without relying entirely on expensive large language model API calls for every request.

The Solution

We designed a three-tier fallback architecture: a lightweight custom NLP layer using the Compromise package handles the majority of intent identification locally, with progressively more powerful (and costly) fallbacks for edge cases.

Introduction

When b6ai approached BytesNBinary, they were facing a common challenge in the conversational AI space: how to accurately identify user intent at scale without incurring prohibitive infrastructure costs from constant LLM API calls. We designed and built a multi-tier intent identification system that starts with a custom, lightweight NLP layer and progressively falls back to more powerful systems only when needed. The result was a dramatic reduction in infrastructure costs while maintaining high accuracy.

The Challenge

b6ai's conversational AI platform was processing thousands of user queries daily. Every query was being routed to a large language model for intent identification, leading to high API costs and latency. They needed a solution that could handle the majority of common intents locally while still maintaining the accuracy of LLM-based classification for ambiguous or complex queries.

Key Requirements

The solution needed to meet several critical requirements to be viable for production deployment:

1. Handle 80%+ of intent identification without any LLM API call
2. Maintain 95%+ overall accuracy across all query types
3. Sub-100ms response time for the primary NLP layer
4. Graceful fallback to more powerful models for edge cases
5. Easy to update intent patterns without redeployment

The Three-Tier Fallback Architecture

We designed a cascading three-tier system where each tier is progressively more capable (and more expensive). The system tries the cheapest, fastest tier first, and only escalates to higher tiers when confidence is low.

Tier 1: Custom NLP with Compromise.js

The first tier uses the Compromise natural language processing library to perform fast, rule-based intent identification. We built a custom intent matching engine on top of Compromise that uses pattern matching, entity extraction, and lightweight classification to identify user intents. This tier handles approximately 75-80% of all queries with high confidence, at near-zero marginal cost.

1. Pattern-based intent matching using Compromise's lexicon and tagging system
2. Custom entity extraction for domain-specific terms
3. Confidence scoring to determine when to escalate to Tier 2
4. Configurable intent patterns loaded from JSON configuration files
5. Sub-50ms response time for matched intents
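To make the Tier 1 mechanics concrete, here is a minimal sketch of pattern-based matching with confidence scoring. The pattern format, intent names, and weights are illustrative assumptions, not b6ai's actual configuration; plain regular expressions stand in for the Compromise match expressions used in production, so the escalation logic can be shown in a self-contained form.

```typescript
// Simplified sketch of Tier 1 intent matching with confidence scoring.
// Intent names, patterns, and weights are illustrative assumptions;
// production uses Compromise match expressions rather than raw regexes.

interface IntentPattern {
  intent: string;
  patterns: RegExp[]; // stand-ins for Compromise match expressions
  weight: number;     // base confidence assigned when a pattern fires
}

interface Tier1Result {
  intent: string | null;
  confidence: number;
}

// In production these would be loaded from a JSON configuration file,
// so patterns can be updated without redeployment (requirement 5).
const INTENT_PATTERNS: IntentPattern[] = [
  { intent: "cancel_order", patterns: [/\bcancel\b.*\border\b/i], weight: 0.9 },
  { intent: "track_order", patterns: [/\b(track|where is)\b.*\border\b/i], weight: 0.85 },
];

function classifyTier1(query: string): Tier1Result {
  let best: Tier1Result = { intent: null, confidence: 0 };
  for (const { intent, patterns, weight } of INTENT_PATTERNS) {
    for (const pattern of patterns) {
      if (pattern.test(query) && weight > best.confidence) {
        best = { intent, confidence: weight };
      }
    }
  }
  // A confidence of 0 (no pattern matched) triggers escalation to Tier 2.
  return best;
}
```

A query like "I want to cancel my order" resolves entirely in this tier, while anything that matches no pattern falls through with zero confidence.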

Tier 2: Semantic Search with pgvector

When the Compromise-based NLP layer cannot confidently identify an intent (confidence below threshold), the query is passed to Tier 2. This tier uses PostgreSQL with the pgvector extension to perform semantic similarity search against a pre-computed vector database of intent embeddings. By comparing the query embedding against known intent embeddings, we can identify the closest matching intent even for paraphrased or unusual phrasings.

1. Pre-computed embeddings for all known intent examples stored in PostgreSQL
2. pgvector extension for efficient cosine similarity search
3. Handles paraphrased queries and unusual phrasings that rule-based NLP misses
4. Response time of 100-200ms including embedding generation
5. Covers an additional 15-18% of queries that Tier 1 cannot handle
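The lookup itself can be sketched as follows. The table and column names are assumptions for illustration; the SQL uses pgvector's actual cosine-distance operator (`<=>`), and a client-side cosine similarity function is included to show what the operator computes.

```typescript
// Sketch of the Tier 2 semantic lookup. Table and column names
// (intent_embeddings, embedding, intent) are illustrative assumptions.

// Example query, run via any Postgres client (e.g. the `pg` package),
// with the query embedding bound as parameter $1. pgvector's `<=>`
// operator returns cosine distance, so similarity is 1 - distance:
const NEAREST_INTENT_SQL = `
  SELECT intent, 1 - (embedding <=> $1) AS similarity
  FROM intent_embeddings
  ORDER BY embedding <=> $1
  LIMIT 1;
`;

// Cosine similarity computed client-side, for intuition about what
// the database operator is doing: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, so the similarity threshold that gates escalation to Tier 3 sits somewhere between the two.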

Tier 3: Qwen 7B LLM Fallback

For the remaining 2-5% of queries that neither the custom NLP nor the embedding search can handle with sufficient confidence, we fall back to Qwen 7B — a small but capable language model that can run on CPU infrastructure. This avoids the need for expensive GPU-based LLM API calls while still providing intelligent classification for the most ambiguous queries.

1. Qwen 7B model deployed on CPU-only infrastructure for cost efficiency
2. Handles complex, ambiguous, or novel query patterns
3. Structured prompt engineering for consistent intent classification output
4. Response time of 1-3 seconds (acceptable for the small percentage of queries reaching this tier)
5. Continuous learning: misclassified queries are reviewed and added to Tier 1 and Tier 2 databases
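The "structured prompt engineering" step can be illustrated with a short sketch. The prompt wording, intent list, and JSON response contract are assumptions for the example, not b6ai's actual prompts; the key idea is constraining the model to a closed intent set and parsing its reply defensively.

```typescript
// Illustrative sketch of the Tier 3 prompt/parse step. The prompt text,
// intent list, and response format are assumptions, not the actual setup.

const KNOWN_INTENTS = ["cancel_order", "track_order", "billing_question", "other"];

// Constrain the model to a closed intent set and a JSON-only reply,
// which makes the output machine-parseable and consistent.
function buildClassificationPrompt(query: string): string {
  return [
    "Classify the user query into exactly one intent.",
    `Allowed intents: ${KNOWN_INTENTS.join(", ")}.`,
    'Respond with JSON only: {"intent": "...", "confidence": 0.0}',
    `Query: ${query}`,
  ].join("\n");
}

// Parse the model's reply defensively; any malformed or out-of-set
// answer degrades to "other" with zero confidence for human review.
function parseClassification(raw: string): { intent: string; confidence: number } {
  try {
    const parsed = JSON.parse(raw);
    if (KNOWN_INTENTS.includes(parsed.intent) && typeof parsed.confidence === "number") {
      return { intent: parsed.intent, confidence: parsed.confidence };
    }
  } catch {
    // Malformed JSON from the model falls through to the default below.
  }
  return { intent: "other", confidence: 0 };
}
```

The defensive parse also feeds the continuous-learning loop: queries that land in "other" are natural candidates for review and promotion into the Tier 1 and Tier 2 databases.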

Implementation Details

The system was implemented as a Node.js service with TypeScript, integrating seamlessly into b6ai's existing conversational AI pipeline. The Compromise NLP layer runs in-process, pgvector queries go to the existing PostgreSQL database, and Qwen 7B runs as a separate inference service.

```typescript
// Result shape returned by every tier of the pipeline.
interface IntentResult {
  intent: string;
  tier: 1 | 2 | 3;
  confidence: number;
}

async function identifyIntent(query: string): Promise<IntentResult> {
  // Tier 1: Custom NLP with Compromise
  const nlpResult = await compromiseNLP.classify(query);
  if (nlpResult.confidence >= CONFIDENCE_THRESHOLD) {
    return { intent: nlpResult.intent, tier: 1, confidence: nlpResult.confidence };
  }

  // Tier 2: Semantic search with pgvector
  const embedding = await generateEmbedding(query);
  const vectorResult = await pgvectorSearch(embedding);
  if (vectorResult.similarity >= SIMILARITY_THRESHOLD) {
    return { intent: vectorResult.intent, tier: 2, confidence: vectorResult.similarity };
  }

  // Tier 3: Qwen 7B fallback
  const llmResult = await qwenClassify(query);
  return { intent: llmResult.intent, tier: 3, confidence: llmResult.confidence };
}
```

Simplified version of the three-tier intent identification pipeline

Results and Impact

After deploying the three-tier system, b6ai saw immediate and significant improvements in both cost efficiency and system performance. The custom NLP layer handled the vast majority of queries at near-zero marginal cost, while the fallback tiers ensured that accuracy remained high even for edge cases. The system proved that intelligent architecture design can dramatically reduce dependency on expensive LLM APIs without sacrificing quality.

Conclusion

The custom NLP project for b6ai demonstrates that not every conversational AI problem requires a large language model. By designing a tiered architecture with a lightweight NLP layer as the first line of defense, we reduced infrastructure costs by approximately 70% while maintaining 95%+ intent identification accuracy. This approach is applicable to any organization running high-volume conversational AI systems where the majority of queries fall into well-defined intent categories.

Interested in similar results?

Let's discuss how we can help your business.

Get in Touch