ignitionstack.pro supports local LLM inference through Ollama, enabling zero-cost AI operations with complete data privacy. This is ideal for development, privacy-sensitive applications, or air-gapped environments.
| Benefit | Description |
|---|---|
| Zero Cost | No API fees - run unlimited inference |
| Privacy | Data never leaves your infrastructure |
| Offline | Works without internet connection |
| Low Latency | No network round-trip for local GPU |
| Custom Models | Run fine-tuned or proprietary models |
| Development | Test without burning API credits |
```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or via Homebrew (macOS)
brew install ollama

# Windows
# Download from https://ollama.com/download
```

```bash
# Start Ollama daemon
ollama serve
# Verify it's running
curl http://localhost:11434/api/tags
```

```bash
# General purpose
ollama pull llama3.1:8b # Fast, 8B parameters
ollama pull llama3.1:70b # High quality, 70B parameters
# Code generation
ollama pull codellama:7b # Code-specialized
ollama pull codellama:34b # Larger code model
# Embeddings (for RAG)
ollama pull nomic-embed-text # Local embeddings
# Lightweight
ollama pull mistral # 7B, good balance
ollama pull phi3 # 3.8B, very fast
# List installed models
ollama list
```

```bash
# .env.local
OLLAMA_BASE_URL=http://localhost:11434
# Optional: Set as default provider
DEFAULT_AI_PROVIDER=ollama
DEFAULT_AI_MODEL=llama3.1:8b
```
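How the application consumes these defaults isn't shown in this section. If you wire them up yourself, it might look like the sketch below; reading `DEFAULT_AI_PROVIDER` / `DEFAULT_AI_MODEL` manually and the fallback values are assumptions, not documented behavior of `ProviderFactory` (introduced further down).

```typescript
import { ProviderFactory } from '@/lib/ai/factory/provider-factory'

// Sketch: honor the .env.local defaults, falling back to local Ollama.
// Verify against how the factory actually reads configuration.
const provider = await ProviderFactory.create(
  process.env.DEFAULT_AI_PROVIDER ?? 'ollama',
  { baseUrl: process.env.OLLAMA_BASE_URL }
)

const reply = await provider.chat(
  [{ role: 'user', content: 'ping' }],
  { model: process.env.DEFAULT_AI_MODEL ?? 'llama3.1:8b', stream: false }
)
```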
```yaml
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
```

```bash
# If Ollama runs on a different machine
OLLAMA_BASE_URL=http://192.168.1.100:11434
# With authentication (if configured)
OLLAMA_BASE_URL=http://user:pass@ollama.internal:11434
```

```typescript
// The router can automatically select Ollama for certain tasks
const router = new StrategyRouter()

const decision = await router.route({
  preferredProvider: 'ollama', // Prefer local
  task: 'chat',
  planTier: 'free', // Cost-conscious routing
})
```
```typescript
import { ProviderFactory } from '@/lib/ai/factory/provider-factory'

// Create Ollama provider
const ollama = await ProviderFactory.create('ollama', {
  baseUrl: process.env.OLLAMA_BASE_URL,
})

// Chat completion
const response = await ollama.chat(
  [{ role: 'user', content: 'Hello!' }],
  { model: 'llama3.1:8b', stream: false }
)

// Streaming
for await (const chunk of ollama.chatStream(messages, options)) {
  process.stdout.write(chunk.content)
}
```
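The two pieces above can be combined so that whatever the router decides is what the factory instantiates. This is a sketch: the import path for `StrategyRouter` and the assumption that the routing decision exposes `provider` and `model` fields are not confirmed by this section.

```typescript
import { ProviderFactory } from '@/lib/ai/factory/provider-factory'
// Assumed import path -- adjust to where StrategyRouter actually lives.
import { StrategyRouter } from '@/lib/ai/routing/strategy-router'

// Sketch: route first, then instantiate whichever provider the router picked.
async function routedChat(content: string) {
  const router = new StrategyRouter()
  const decision = await router.route({
    preferredProvider: 'ollama', // Prefer local
    task: 'chat',
    planTier: 'free',
  })

  // Assumes `decision.provider` and `decision.model` exist on the result.
  const provider = await ProviderFactory.create(decision.provider, {
    baseUrl: process.env.OLLAMA_BASE_URL,
  })

  return provider.chat(
    [{ role: 'user', content }],
    { model: decision.model, stream: false }
  )
}
```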
```typescript
import { EmbeddingService } from '@/lib/ai/rag/embedding-service'

// Use local embeddings for RAG
const embeddingService = new EmbeddingService({
  provider: 'ollama',
  model: 'nomic-embed-text',
})

const vector = await embeddingService.embed('Your text here')
// Returns 768-dimensional vector (vs 1536 for OpenAI)
```
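Because `embed()` returns a plain numeric vector, similarity scoring for local RAG can be done in-process; the `cosineSimilarity` helper below is illustrative and not part of the embedding service. Note that the dimension difference also matters wherever vectors are persisted (a store sized for 1536-dimensional vectors won't accept 768-dimensional ones).

```typescript
import { EmbeddingService } from '@/lib/ai/rag/embedding-service'

// Illustrative helper -- not provided by the library.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

const embeddings = new EmbeddingService({
  provider: 'ollama',
  model: 'nomic-embed-text',
})

// Rank candidate chunks against a query entirely on local hardware.
const query = await embeddings.embed('How do I reset my password?')
const chunks = [
  'Password resets live under Account Settings.',
  'Invoices are emailed on the first of each month.',
]
const ranked = await Promise.all(
  chunks.map(async (text) => ({
    text,
    score: cosineSimilarity(query, await embeddings.embed(text)),
  }))
)
ranked.sort((a, b) => b.score - a.score)
```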
| Use Case | Recommended Model | VRAM | Speed |
|---|---|---|---|
| Quick chat | mistral | 4GB | Fast |
| General purpose | llama3.1:8b | 8GB | Medium |
| Complex reasoning | llama3.1:70b | 48GB | Slow |
| Code generation | codellama:34b | 24GB | Medium |
| Embeddings | nomic-embed-text | 2GB | Fast |
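If you prefer the table above as application config rather than prose, a plain lookup map works; the names below are hypothetical and not part of the codebase.

```typescript
// Hypothetical mapping that mirrors the table above.
const OLLAMA_MODEL_BY_USE_CASE = {
  quickChat: 'mistral',
  general: 'llama3.1:8b',
  complexReasoning: 'llama3.1:70b',
  codeGeneration: 'codellama:34b',
  embeddings: 'nomic-embed-text',
} as const

export type OllamaUseCase = keyof typeof OLLAMA_MODEL_BY_USE_CASE
```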
Hardware requirements by GPU memory:

| VRAM | Models |
|---|---|
| 4GB | mistral, phi3 |
| 8GB | llama3.1:8b, codellama:7b |
| 16GB | llama3.1:8b (faster), codellama:13b |
| 24GB | codellama:34b, mixtral |
| 48GB+ | llama3.1:70b |
| CPU-only | Any model (slower, uses RAM instead) |

```bash
# Check GPU is detected
ollama ps
# Nvidia (automatic with CUDA)
nvidia-smi # Verify GPU
# AMD ROCm
# Ollama auto-detects compatible AMD GPUs
# Apple Silicon
# M1/M2/M3 chips use Metal automatically
```

```bash
# Pre-load model for faster first response
ollama run llama3.1:8b ""
# Keep model in memory
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.1:8b", "keep_alive": "1h"}'# Set parallel request limit
```bash
# Set parallel request limit
OLLAMA_NUM_PARALLEL=4 ollama serve
# Or in environment
export OLLAMA_NUM_PARALLEL=4
```

Combine local and remote providers for optimal cost/quality balance:

```typescript
// Strategy: Use local for development, remote for production
const getProvider = async () => {
  if (process.env.NODE_ENV === 'development') {
    return ProviderFactory.create('ollama', {
      baseUrl: process.env.OLLAMA_BASE_URL,
    })
  }
  return ProviderFactory.create('openai', {
    apiKey: process.env.OPENAI_API_KEY,
  })
}

// Or: Local for simple tasks, remote for complex
const router = new StrategyRouter({
  rules: [
    { task: 'chat', provider: 'ollama', model: 'llama3.1:8b' },
    { task: 'code', provider: 'openai', model: 'gpt-4o' },
    { task: 'analysis', provider: 'gemini', model: 'gemini-1.5-pro' },
  ],
})
```

```typescript
// The adapter handles common errors gracefully
try {
  const response = await ollama.chat(messages, options)
} catch (error) {
  if (error.code === 'ECONNREFUSED') {
    // Ollama not running - fall back to remote
    console.log('Ollama unavailable, using OpenAI')
    return openai.chat(messages, options)
  }
  throw error
}
```

```typescript
// Check if model is available
const models = await fetch(`${OLLAMA_BASE_URL}/api/tags`)
  .then(r => r.json())
const available = models.models.map(m => m.name)

if (!available.includes('llama3.1:8b')) {
  console.log('Model not found, pulling...')
  await fetch(`${OLLAMA_BASE_URL}/api/pull`, {
    method: 'POST',
    body: JSON.stringify({ name: 'llama3.1:8b' }),
  })
}
```

```bash
# Start Ollama
ollama serve &
# Pull test model
ollama pull mistral
# Set environment
export OLLAMA_BASE_URL=http://localhost:11434
export DEFAULT_AI_PROVIDER=ollama
# Run tests
npm run test:ai
```

```typescript
// src/app/test/integration/ollama.test.ts
describe('Ollama Integration', () => {
  it('should complete chat', async () => {
    const provider = await ProviderFactory.create('ollama')
    const response = await provider.chat([
      { role: 'user', content: 'Say hello' }
    ], { model: 'mistral' })
    expect(response.content).toBeTruthy()
  })

  it('should generate embeddings', async () => {
    const embedding = new EmbeddingService({ provider: 'ollama' })
    const vector = await embedding.embed('test')
    expect(vector).toHaveLength(768)
  })
})
```

| Feature | Ollama | Remote Providers |
|---|---|---|
| Vision/Images | Limited | Full support |
| Function Calling | Basic | Full support |
| Context Length | Model-dependent | 128K+ |
| Speed | Hardware-dependent | Consistent |
| Availability | Local uptime | 99.9% SLA |
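One pragmatic response to these gaps is to keep plain chat on Ollama and send requests that need vision or rich function calling to a remote provider. A sketch, reusing the factory from earlier; the `needsVision` / `needsTools` flags are assumptions about your call sites, not an existing API.

```typescript
import { ProviderFactory } from '@/lib/ai/factory/provider-factory'

// Sketch: pick a provider based on required capabilities.
async function pickProvider(opts: { needsVision?: boolean; needsTools?: boolean }) {
  if (opts.needsVision || opts.needsTools) {
    // Ollama's vision and function-calling support is limited -- go remote.
    return ProviderFactory.create('openai', { apiKey: process.env.OPENAI_API_KEY })
  }
  return ProviderFactory.create('ollama', { baseUrl: process.env.OLLAMA_BASE_URL })
}
```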
If inference is slow:

- Check GPU utilization with `nvidia-smi` (or Activity Monitor on macOS)
- Use a smaller model, e.g. `llama3.1:8b` instead of `70b`

```bash
# Reduce context size
ollama run llama3.1:8b --ctx-size 2048
# Use quantized model
ollama pull llama3.1:8b-q4_0 # 4-bit quantization
```

```bash
# Check Ollama is running
ps aux | grep ollama
# Restart service
ollama serve
# Check port is available
lsof -i :11434
```