
Local Integration

ignitionstack.pro supports local LLM inference through Ollama, enabling zero-cost AI operations with complete data privacy. This is ideal for development, privacy-sensitive applications, or air-gapped environments.

Why Local LLMs?

| Benefit | Description |
| --- | --- |
| Zero Cost | No API fees - run unlimited inference |
| Privacy | Data never leaves your infrastructure |
| Offline | Works without an internet connection |
| Low Latency | No network round-trip for a local GPU |
| Custom Models | Run fine-tuned or proprietary models |
| Development | Test without burning API credits |

Ollama Setup

Installation

```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or via Homebrew (macOS)
brew install ollama

# Windows
# Download from https://ollama.com/download
```

Start the Server

```bash
# Start the Ollama daemon
ollama serve

# Verify it's running
curl http://localhost:11434/api/tags
```

Pull Models

```bash
# General purpose
ollama pull llama3.1:8b        # Fast, 8B parameters
ollama pull llama3.1:70b       # High quality, 70B parameters

# Code generation
ollama pull codellama:7b       # Code-specialized
ollama pull codellama:34b      # Larger code model

# Embeddings (for RAG)
ollama pull nomic-embed-text   # Local embeddings

# Lightweight
ollama pull mistral            # 7B, good balance
ollama pull phi3               # 3.8B, very fast

# List installed models
ollama list
```

Configuration

Environment Variables

```bash
# .env.local
OLLAMA_BASE_URL=http://localhost:11434

# Optional: Set as default provider
DEFAULT_AI_PROVIDER=ollama
DEFAULT_AI_MODEL=llama3.1:8b
```
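As a rough sketch of how these variables might be consumed at startup, the default provider and model can be resolved once and handed to the factory shown later on this page. The helper below is illustrative only, not part of the shipped API:

```typescript
// Illustrative sketch: resolve the default provider/model from .env.local.
// ProviderFactory is the factory shown in "Direct Provider Usage" below.
import { ProviderFactory } from '@/lib/ai/factory/provider-factory'

export async function createDefaultProvider() {
  const provider = process.env.DEFAULT_AI_PROVIDER ?? 'ollama'
  const model = process.env.DEFAULT_AI_MODEL ?? 'llama3.1:8b'

  const instance = await ProviderFactory.create(provider, {
    baseUrl: process.env.OLLAMA_BASE_URL, // used by the Ollama adapter
  })

  return { instance, model }
}
```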

Docker Deployment

```yaml
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
```

Remote Ollama Server

```bash
# If Ollama runs on a different machine
OLLAMA_BASE_URL=http://192.168.1.100:11434

# With authentication (if configured)
OLLAMA_BASE_URL=http://user:pass@ollama.internal:11434
```
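Before pointing the app at a remote host, it can help to verify connectivity from the machine that will run ignitionstack.pro. A minimal check against Ollama's public /api/tags endpoint (the same endpoint used in the setup steps above):

```typescript
// Quick connectivity check against a remote Ollama host.
// Uses only Ollama's /api/tags endpoint; adjust the URL as needed.
const baseUrl = process.env.OLLAMA_BASE_URL ?? 'http://192.168.1.100:11434'

const res = await fetch(`${baseUrl}/api/tags`)
if (!res.ok) {
  throw new Error(`Ollama not reachable at ${baseUrl}: ${res.status}`)
}

const { models } = await res.json()
console.log('Remote models:', models.map((m: { name: string }) => m.name))
```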

Using Local Models

Via Strategy Router

```typescript
// The router can automatically select Ollama for certain tasks
const router = new StrategyRouter()

const decision = await router.route({
  preferredProvider: 'ollama', // Prefer local
  task: 'chat',
  planTier: 'free',            // Cost-conscious routing
})
```
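How the decision is consumed depends on your routing setup. Assuming the decision object exposes the chosen provider and model (the field names below are illustrative and may differ from the actual StrategyRouter return type), it can be handed straight to the factory:

```typescript
// Assumption: `decision` carries the selected provider and model.
const provider = await ProviderFactory.create(decision.provider, {
  baseUrl: process.env.OLLAMA_BASE_URL,
})

const reply = await provider.chat(
  [{ role: 'user', content: 'Summarize this ticket' }],
  { model: decision.model, stream: false }
)
```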

Direct Provider Usage

```typescript
import { ProviderFactory } from '@/lib/ai/factory/provider-factory'

// Create Ollama provider
const ollama = await ProviderFactory.create('ollama', {
  baseUrl: process.env.OLLAMA_BASE_URL,
})

// Chat completion
const response = await ollama.chat(
  [{ role: 'user', content: 'Hello!' }],
  { model: 'llama3.1:8b', stream: false }
)

// Streaming
for await (const chunk of ollama.chatStream(messages, options)) {
  process.stdout.write(chunk.content)
}
```

Local Embeddings

```typescript
import { EmbeddingService } from '@/lib/ai/rag/embedding-service'

// Use local embeddings for RAG
const embeddingService = new EmbeddingService({
  provider: 'ollama',
  model: 'nomic-embed-text',
})

const vector = await embeddingService.embed('Your text here')
// Returns a 768-dimensional vector (vs. 1536 for OpenAI)
```
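Local embedding vectors can be compared the same way as remote ones. A small, self-contained cosine-similarity helper for ranking candidate texts against a query, using only the EmbeddingService shown above and plain TypeScript:

```typescript
// Cosine similarity between two embedding vectors (e.g. 768-dim nomic-embed-text)
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Rank candidate texts against a query
const texts = ['Reset your password from the login page', 'Billing and invoices']
const query = await embeddingService.embed('How do I reset my password?')
const docVectors = await Promise.all(texts.map(t => embeddingService.embed(t)))

const ranked = docVectors
  .map((vec, i) => ({ text: texts[i], score: cosineSimilarity(query, vec) }))
  .sort((a, b) => b.score - a.score)
```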

Model Selection Guide

By Use Case

| Use Case | Recommended Model | VRAM | Speed |
| --- | --- | --- | --- |
| Quick chat | mistral | 4 GB | Fast |
| General purpose | llama3.1:8b | 8 GB | Medium |
| Complex reasoning | llama3.1:70b | 48 GB | Slow |
| Code generation | codellama:34b | 24 GB | Medium |
| Embeddings | nomic-embed-text | 2 GB | Fast |

By Hardware

| VRAM | Suitable Models |
| --- | --- |
| 4 GB | mistral, phi3 |
| 8 GB | llama3.1:8b, codellama:7b |
| 16 GB | llama3.1:8b (faster), codellama:13b |
| 24 GB | codellama:34b, mixtral |
| 48 GB+ | llama3.1:70b |
| CPU-only | Any model (slower; uses system RAM instead) |
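If you want the application to pick a sensible default automatically, the tiers above can be encoded as a small lookup. The helper below is a sketch that simply mirrors the table, not part of the library:

```typescript
// Illustrative helper: map available VRAM (in GB) to a reasonable default model,
// following the hardware tiers in the table above.
function pickDefaultModel(vramGb: number): string {
  if (vramGb >= 48) return 'llama3.1:70b'
  if (vramGb >= 24) return 'codellama:34b'
  if (vramGb >= 8) return 'llama3.1:8b'
  return 'mistral' // 4 GB class hardware, or CPU-only fallback
}

console.log(pickDefaultModel(8)) // "llama3.1:8b"
console.log(pickDefaultModel(4)) // "mistral"
```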

Performance Optimization

GPU Acceleration

```bash
# Check the GPU is detected
ollama ps

# Nvidia (automatic with CUDA)
nvidia-smi   # Verify the GPU is visible

# AMD ROCm
# Ollama auto-detects compatible AMD GPUs

# Apple Silicon
# M1/M2/M3 chips use Metal automatically
```

Model Loading

```bash
# Pre-load the model for a faster first response
ollama run llama3.1:8b ""

# Keep the model in memory
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "keep_alive": "1h"}'
```
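The same keep-alive request can be issued from application code at startup so the first user request does not pay the model-load cost. This is a sketch against Ollama's /api/generate endpoint, mirroring the curl call above:

```typescript
// Warm the model and keep it resident for an hour.
// keep_alive is an option on Ollama's /api/generate endpoint.
await fetch(`${process.env.OLLAMA_BASE_URL}/api/generate`, {
  method: 'POST',
  body: JSON.stringify({
    model: 'llama3.1:8b',
    keep_alive: '1h',
  }),
})
```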

Concurrent Requests

```bash
# Set the parallel request limit
OLLAMA_NUM_PARALLEL=4 ollama serve

# Or in the environment
export OLLAMA_NUM_PARALLEL=4
```
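OLLAMA_NUM_PARALLEL caps concurrency on the server side; requests beyond the limit queue. If you prefer to apply the same cap client-side rather than let calls pile up, batching is enough. A sketch using the `ollama` provider from "Direct Provider Usage" above:

```typescript
// Process prompts in batches of OLLAMA_NUM_PARALLEL so the client never
// exceeds the server's parallel limit. Sketch only.
const prompts = ['Summarize A', 'Summarize B', 'Summarize C', 'Summarize D', 'Summarize E']
const batchSize = 4 // match OLLAMA_NUM_PARALLEL
const results: string[] = []

for (let i = 0; i < prompts.length; i += batchSize) {
  const batch = prompts.slice(i, i + batchSize)
  const replies = await Promise.all(
    batch.map(p =>
      ollama.chat([{ role: 'user', content: p }], { model: 'llama3.1:8b', stream: false })
    )
  )
  results.push(...replies.map(r => r.content))
}
```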

Hybrid Architecture

Combine local and remote providers to balance cost and quality:

```typescript
// Strategy: Use local for development, remote for production
const getProvider = async () => {
  if (process.env.NODE_ENV === 'development') {
    return ProviderFactory.create('ollama', {
      baseUrl: process.env.OLLAMA_BASE_URL,
    })
  }
  return ProviderFactory.create('openai', {
    apiKey: process.env.OPENAI_API_KEY,
  })
}

// Or: Local for simple tasks, remote for complex
const router = new StrategyRouter({
  rules: [
    { task: 'chat', provider: 'ollama', model: 'llama3.1:8b' },
    { task: 'code', provider: 'openai', model: 'gpt-4o' },
    { task: 'analysis', provider: 'gemini', model: 'gemini-1.5-pro' },
  ],
})
```

Error Handling

Connection Issues

```typescript
// The adapter handles common errors gracefully
try {
  const response = await ollama.chat(messages, options)
} catch (error) {
  if (error.code === 'ECONNREFUSED') {
    // Ollama not running - fall back to remote
    console.log('Ollama unavailable, using OpenAI')
    return openai.chat(messages, options)
  }
  throw error
}
```

Model Not Found

```typescript
// Check if the model is available
const models = await fetch(`${OLLAMA_BASE_URL}/api/tags`)
  .then(r => r.json())

const available = models.models.map(m => m.name)

if (!available.includes('llama3.1:8b')) {
  console.log('Model not found, pulling...')
  await fetch(`${OLLAMA_BASE_URL}/api/pull`, {
    method: 'POST',
    body: JSON.stringify({ name: 'llama3.1:8b' }),
  })
}
```
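Note that /api/pull streams progress as newline-delimited JSON, so the fetch above resolves before the download actually finishes. A sketch that waits for the pull to complete by draining the stream (it assumes each chunk contains whole JSON lines and that the final status is "success", which is how Ollama's pull API typically reports completion):

```typescript
// Pull a model and wait until the streamed progress reports completion.
async function pullModel(name: string): Promise<void> {
  const res = await fetch(`${process.env.OLLAMA_BASE_URL}/api/pull`, {
    method: 'POST',
    body: JSON.stringify({ name }),
  })

  const reader = res.body!.getReader()
  const decoder = new TextDecoder()

  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    // Simplification: assumes each chunk holds complete lines like {"status":"..."}
    for (const line of decoder.decode(value).split('\n').filter(Boolean)) {
      const progress = JSON.parse(line)
      if (progress.status === 'success') return
    }
  }
}

await pullModel('llama3.1:8b')
```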

Testing with Ollama

Development Setup

```bash
# Start Ollama
ollama serve &

# Pull a test model
ollama pull mistral

# Set the environment
export OLLAMA_BASE_URL=http://localhost:11434
export DEFAULT_AI_PROVIDER=ollama

# Run the tests
npm run test:ai
```

Integration Tests

```typescript
// src/app/test/integration/ollama.test.ts
describe('Ollama Integration', () => {
  it('should complete chat', async () => {
    const provider = await ProviderFactory.create('ollama')
    const response = await provider.chat(
      [{ role: 'user', content: 'Say hello' }],
      { model: 'mistral' }
    )
    expect(response.content).toBeTruthy()
  })

  it('should generate embeddings', async () => {
    const embedding = new EmbeddingService({ provider: 'ollama' })
    const vector = await embedding.embed('test')
    expect(vector).toHaveLength(768)
  })
})
```

Limitations

| Feature | Ollama | Remote Providers |
| --- | --- | --- |
| Vision/Images | Limited | Full support |
| Function Calling | Basic | Full support |
| Context Length | Model-dependent | 128K+ |
| Speed | Hardware-dependent | Consistent |
| Availability | Local uptime | 99.9% SLA |

Troubleshooting

Slow Responses

  1. Check that the GPU is being used: nvidia-smi or Activity Monitor (a timing sketch follows below)
  2. Use a smaller model: llama3.1:8b instead of 70b
  3. Increase the GPU memory allocation
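To put numbers on "slow", you can time the first streamed chunk and the total duration with the streaming API shown in "Direct Provider Usage" (chunk shape as documented there; this is a measurement sketch, not a library utility):

```typescript
// Rough latency measurement: time to first streamed chunk and total duration.
const start = performance.now()
let firstToken: number | null = null
let output = ''

for await (const chunk of ollama.chatStream(
  [{ role: 'user', content: 'Explain RAG in one paragraph' }],
  { model: 'llama3.1:8b' }
)) {
  if (firstToken === null) firstToken = performance.now() - start
  output += chunk.content
}

console.log(`First token: ${firstToken?.toFixed(0)} ms`)
console.log(`Total: ${(performance.now() - start).toFixed(0)} ms, ${output.length} chars`)
```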

Out of Memory

```bash
# Reduce the context size
ollama run llama3.1:8b --ctx-size 2048

# Use a quantized model
ollama pull llama3.1:8b-q4_0   # 4-bit quantization
```

Connection Refused

```bash
# Check Ollama is running
ps aux | grep ollama

# Restart the service
ollama serve

# Check the port is available
lsof -i :11434
```