ignitionstack.pro supports local LLM inference through Ollama, enabling zero-cost AI operations with complete data privacy. This is ideal for development, privacy-sensitive applications, or air-gapped environments.
| Benefit | Description |
|---|---|
| Zero Cost | No API fees - run unlimited inference |
| Privacy | Data never leaves your infrastructure |
| Offline | Works without internet connection |
| Low Latency | No network round-trip for local GPU |
| Custom Models | Run fine-tuned or proprietary models |
| Development | Test without burning API credits |
```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or via Homebrew (macOS)
brew install ollama

# Windows
# Download from https://ollama.com/download
```

```bash
# Start Ollama daemon
ollama serve
# Verify it's running
curl http://localhost:11434/api/tags
```

```bash
# General purpose
ollama pull llama3.1:8b # Fast, 8B parameters
ollama pull llama3.1:70b # High quality, 70B parameters
# Code generation
ollama pull codellama:7b # Code-specialized
ollama pull codellama:34b # Larger code model
# Embeddings (for RAG)
ollama pull nomic-embed-text # Local embeddings
# Lightweight
ollama pull mistral # 7B, good balance
ollama pull phi3 # 3.8B, very fast
# List installed models
ollama list
```

```bash
# .env.local
OLLAMA_BASE_URL=http://localhost:11434
# Optional: Set as default provider
DEFAULT_AI_PROVIDER=ollama
DEFAULT_AI_MODEL=llama3.1:8b
```
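How the application consumes these defaults isn't shown in this section. If you wire them up yourself, it might look like the sketch below; reading `DEFAULT_AI_PROVIDER` / `DEFAULT_AI_MODEL` manually and the fallback values are assumptions, not documented behavior of `ProviderFactory` (introduced further down).

```typescript
import { ProviderFactory } from '@/lib/ai/factory/provider-factory'

// Sketch: honor the .env.local defaults, falling back to local Ollama.
// Verify against how the factory actually reads configuration.
const provider = await ProviderFactory.create(
  process.env.DEFAULT_AI_PROVIDER ?? 'ollama',
  { baseUrl: process.env.OLLAMA_BASE_URL }
)

const reply = await provider.chat(
  [{ role: 'user', content: 'ping' }],
  { model: process.env.DEFAULT_AI_MODEL ?? 'llama3.1:8b', stream: false }
)
```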
```yaml
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
```

```bash
# If Ollama runs on a different machine
OLLAMA_BASE_URL=http://192.168.1.100:11434
# With authentication (if configured)
OLLAMA_BASE_URL=http://user:pass@ollama.internal:11434
```

```typescript
// The router can automatically select Ollama for certain tasks
const router = new StrategyRouter()

const decision = await router.route({
  preferredProvider: 'ollama', // Prefer local
  task: 'chat',
  planTier: 'free', // Cost-conscious routing
})
```
```typescript
import { ProviderFactory } from '@/lib/ai/factory/provider-factory'

// Create Ollama provider
const ollama = await ProviderFactory.create('ollama', {
  baseUrl: process.env.OLLAMA_BASE_URL,
})

// Chat completion
const response = await ollama.chat(
  [{ role: 'user', content: 'Hello!' }],
  { model: 'llama3.1:8b', stream: false }
)

// Streaming
for await (const chunk of ollama.chatStream(messages, options)) {
  process.stdout.write(chunk.content)
}
```
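The two pieces above can be combined so that whatever the router decides is what the factory instantiates. This is a sketch: the import path for `StrategyRouter` and the assumption that the routing decision exposes `provider` and `model` fields are not confirmed by this section.

```typescript
import { ProviderFactory } from '@/lib/ai/factory/provider-factory'
// Assumed import path -- adjust to where StrategyRouter actually lives.
import { StrategyRouter } from '@/lib/ai/routing/strategy-router'

// Sketch: route first, then instantiate whichever provider the router picked.
async function routedChat(content: string) {
  const router = new StrategyRouter()
  const decision = await router.route({
    preferredProvider: 'ollama', // Prefer local
    task: 'chat',
    planTier: 'free',
  })

  // Assumes `decision.provider` and `decision.model` exist on the result.
  const provider = await ProviderFactory.create(decision.provider, {
    baseUrl: process.env.OLLAMA_BASE_URL,
  })

  return provider.chat(
    [{ role: 'user', content }],
    { model: decision.model, stream: false }
  )
}
```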
```typescript
import { EmbeddingService } from '@/lib/ai/rag/embedding-service'

// Use local embeddings for RAG
const embeddingService = new EmbeddingService({
  provider: 'ollama',
  model: 'nomic-embed-text',
})

const vector = await embeddingService.embed('Your text here')
// Returns 768-dimensional vector (vs 1536 for OpenAI)
```
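Because `embed()` returns a plain numeric vector, similarity scoring for local RAG can be done in-process; the `cosineSimilarity` helper below is illustrative and not part of the embedding service. Note that the dimension difference also matters wherever vectors are persisted (a store sized for 1536-dimensional vectors won't accept 768-dimensional ones).

```typescript
import { EmbeddingService } from '@/lib/ai/rag/embedding-service'

// Illustrative helper -- not provided by the library.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

const embeddings = new EmbeddingService({
  provider: 'ollama',
  model: 'nomic-embed-text',
})

// Rank candidate chunks against a query entirely on local hardware.
const query = await embeddings.embed('How do I reset my password?')
const chunks = [
  'Password resets live under Account Settings.',
  'Invoices are emailed on the first of each month.',
]
const ranked = await Promise.all(
  chunks.map(async (text) => ({
    text,
    score: cosineSimilarity(query, await embeddings.embed(text)),
  }))
)
ranked.sort((a, b) => b.score - a.score)
```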
| Use Case | Recommended Model | VRAM | Speed |
|---|---|---|---|
| Quick chat | mistral | 4GB | Fast |
| General purpose | llama3.1:8b | 8GB | Medium |
| Complex reasoning | llama3.1:70b | 48GB | Slow |
| Code generation | codellama:34b | 24GB | Medium |
| Embeddings | nomic-embed-text | 2GB | Fast |
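If you prefer the table above as application config rather than prose, a plain lookup map works; the names below are hypothetical and not part of the codebase.

```typescript
// Hypothetical mapping that mirrors the table above.
const OLLAMA_MODEL_BY_USE_CASE = {
  quickChat: 'mistral',
  general: 'llama3.1:8b',
  complexReasoning: 'llama3.1:70b',
  codeGeneration: 'codellama:34b',
  embeddings: 'nomic-embed-text',
} as const

export type OllamaUseCase = keyof typeof OLLAMA_MODEL_BY_USE_CASE
```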
Hardware requirements by GPU memory:

| VRAM | Models |
|---|---|
| 4GB | mistral, phi3 |
| 8GB | llama3.1:8b, codellama:7b |
| 16GB | llama3.1:8b (faster), codellama:13b |
| 24GB | codellama:34b, mixtral |
| 48GB+ | llama3.1:70b |
| CPU-only | Any model (slower, uses RAM instead) |

```bash
# Check GPU is detected
ollama ps
# Nvidia (automatic with CUDA)
nvidia-smi # Verify GPU
# AMD ROCm
# Ollama auto-detects compatible AMD GPUs
# Apple Silicon
# M1/M2/M3 chips use Metal automatically
```

```bash
# Pre-load model for faster first response
ollama run llama3.1:8b ""
# Keep model in memory
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.1:8b", "keep_alive": "1h"}'# Set parallel request limit
```bash
# Set parallel request limit
OLLAMA_NUM_PARALLEL=4 ollama serve
# Or in environment
export OLLAMA_NUM_PARALLEL=4
```

Combine local and remote providers for optimal cost/quality balance:

```typescript
// Strategy: Use local for development, remote for production
const getProvider = async () => {
  if (process.env.NODE_ENV === 'development') {
    return ProviderFactory.create('ollama', {
      baseUrl: process.env.OLLAMA_BASE_URL,
    })
  }
  return ProviderFactory.create('openai', {
    apiKey: process.env.OPENAI_API_KEY,
  })
}

// Or: Local for simple tasks, remote for complex
const router = new StrategyRouter({
  rules: [
    { task: 'chat', provider: 'ollama', model: 'llama3.1:8b' },
    { task: 'code', provider: 'openai', model: 'gpt-4o' },
    { task: 'analysis', provider: 'gemini', model: 'gemini-1.5-pro' },
  ],
})
```

```typescript
// The adapter handles common errors gracefully
try {
  const response = await ollama.chat(messages, options)
} catch (error) {
  if (error.code === 'ECONNREFUSED') {
    // Ollama not running - fall back to remote
    console.log('Ollama unavailable, using OpenAI')
    return openai.chat(messages, options)
  }
  throw error
}
```

```typescript
// Check if model is available
const models = await fetch(`${OLLAMA_BASE_URL}/api/tags`)
  .then(r => r.json())
const available = models.models.map(m => m.name)

if (!available.includes('llama3.1:8b')) {
  console.log('Model not found, pulling...')
  await fetch(`${OLLAMA_BASE_URL}/api/pull`, {
    method: 'POST',
    body: JSON.stringify({ name: 'llama3.1:8b' }),
  })
}
```

```bash
# Start Ollama
ollama serve &
# Pull test model
ollama pull mistral
# Set environment
export OLLAMA_BASE_URL=http://localhost:11434
export DEFAULT_AI_PROVIDER=ollama
# Run tests
npm run test:ai
```

```typescript
// src/app/test/integration/ollama.test.ts
describe('Ollama Integration', () => {
  it('should complete chat', async () => {
    const provider = await ProviderFactory.create('ollama')
    const response = await provider.chat([
      { role: 'user', content: 'Say hello' }
    ], { model: 'mistral' })
    expect(response.content).toBeTruthy()
  })

  it('should generate embeddings', async () => {
    const embedding = new EmbeddingService({ provider: 'ollama' })
    const vector = await embedding.embed('test')
    expect(vector).toHaveLength(768)
  })
})
```

| Feature | Ollama | Remote Providers |
|---|---|---|
| Vision/Images | Limited | Full support |
| Function Calling | Basic | Full support |
| Context Length | Model-dependent | 128K+ |
| Speed | Hardware-dependent | Consistent |
| Availability | Local uptime | 99.9% SLA |
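One pragmatic response to these gaps is to keep plain chat on Ollama and send requests that need vision or rich function calling to a remote provider. A sketch, reusing the factory from earlier; the `needsVision` / `needsTools` flags are assumptions about your call sites, not an existing API.

```typescript
import { ProviderFactory } from '@/lib/ai/factory/provider-factory'

// Sketch: pick a provider based on required capabilities.
async function pickProvider(opts: { needsVision?: boolean; needsTools?: boolean }) {
  if (opts.needsVision || opts.needsTools) {
    // Ollama's vision and function-calling support is limited -- go remote.
    return ProviderFactory.create('openai', { apiKey: process.env.OPENAI_API_KEY })
  }
  return ProviderFactory.create('ollama', { baseUrl: process.env.OLLAMA_BASE_URL })
}
```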
If inference is slow:

- Check GPU utilization with `nvidia-smi` (or Activity Monitor on macOS)
- Use a smaller model, e.g. `llama3.1:8b` instead of `70b`

```bash
# Reduce context size
ollama run llama3.1:8b --ctx-size 2048
# Use quantized model
ollama pull llama3.1:8b-q4_0 # 4-bit quantization
```

```bash
# Check Ollama is running
ps aux | grep ollama
# Restart service
ollama serve
# Check port is available
lsof -i :11434
```