
Ollama Remote

ignitionstack.pro supports connecting to remote Ollama servers, enabling cloud-hosted or centralized LLM inference with optional API key authentication. This is ideal for teams sharing a GPU server, for cloud deployments, or for managed Ollama services.

Why Ollama Remote?

| Benefit | Description |
| --- | --- |
| Centralized GPU | Share expensive GPU resources across teams/apps |
| Cloud Deployment | Run Ollama on cloud VMs (AWS, GCP, Azure) |
| API Key Auth | Secure access with Bearer token authentication |
| Zero Local Setup | No need to install Ollama locally |
| Scalability | Scale GPU instances independently |
| Same API | Identical interface to local Ollama |

Configuration

Environment Variables

```bash
# .env.local / .env.production

# Required: Remote Ollama server URL
OLLAMA_REMOTE_BASE_URL=https://ollama.yourcompany.com

# Optional: API key for authentication (Bearer token)
OLLAMA_REMOTE_API_KEY=your-api-key-here
```
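
A minimal sketch of consuming these variables in application code; the variable names match the sample above, while the `ollamaHeaders` helper itself is illustrative rather than part of the ignitionstack.pro API:

```typescript
// Illustrative config helper (not part of the ignitionstack.pro API).
// Builds request headers for the remote server from the env vars above.
const baseUrl = process.env.OLLAMA_REMOTE_BASE_URL
const apiKey = process.env.OLLAMA_REMOTE_API_KEY

if (!baseUrl) {
  throw new Error('OLLAMA_REMOTE_BASE_URL is not set')
}

export function ollamaHeaders(): Record<string, string> {
  return {
    'Content-Type': 'application/json',
    // The API key is optional; only attach the Bearer token when configured.
    ...(apiKey ? { Authorization: `Bearer ${apiKey}` } : {}),
  }
}
```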

Provider Factory Usage

```typescript
import { ProviderFactory } from '@/lib/ai/factory/provider-factory'

// Using global configuration (env vars)
const provider = ProviderFactory.createGlobalProvider('ollama_remote')

// Or with custom configuration
const apiKey = process.env.OLLAMA_REMOTE_API_KEY
const customProvider = await ProviderFactory.createProvider('ollama_remote', apiKey, {
  baseUrl: 'https://ollama.yourcompany.com',
})

// Check availability
if (ProviderFactory.isGlobalProviderAvailable('ollama_remote')) {
  const models = await provider.listModels()
  console.log('Available models:', models)
}
```

API Reference (Ollama Official)

The remote server exposes the standard Ollama HTTP API. Below are the key endpoints with request/response samples.

List Models

Check available models on the remote server.

GET /api/tags

Request:

```bash
curl https://ollama.yourcompany.com/api/tags \
  -H "Authorization: Bearer your-api-key"
```

Response:

{ "models": [ { "name": "llama3.2:latest", "size": 2019393189, "digest": "a80c4f17aae5...", "details": { "format": "gguf", "family": "llama", "parameter_size": "3.2B", "quantization_level": "Q4_0" } }, { "name": "gemma3:12b", "size": 8012345678, "digest": "b91d2e3f4c5a...", "details": { "parameter_size": "12B" } } ] }

Chat Completion

Send a conversation and receive a response.

POST /api/chat

Request:

```bash
curl https://ollama.yourcompany.com/api/chat \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Why is the sky blue?"}
    ],
    "stream": false
  }'
```

Response (non-streaming):

{ "model": "llama3.2", "created_at": "2024-01-15T10:30:00.123456Z", "message": { "role": "assistant", "content": "The sky appears blue due to Rayleigh scattering..." }, "done": true, "total_duration": 5043500667, "load_duration": 3013500, "prompt_eval_count": 26, "prompt_eval_duration": 325953000, "eval_count": 290, "eval_duration": 4709213000 }

Streaming Response:

```bash
curl https://ollama.yourcompany.com/api/chat \
  -H "Authorization: Bearer your-api-key" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'
```

Each chunk:

{"model":"llama3.2","message":{"role":"assistant","content":"The"},"done":false} {"model":"llama3.2","message":{"role":"assistant","content":" sky"},"done":false} {"model":"llama3.2","message":{"role":"assistant","content":" is"},"done":false} ... {"model":"llama3.2","message":{"role":"assistant","content":""},"done":true,"total_duration":5043500667}

Text Generation

Generate text from a prompt (simpler than chat).

POST /api/generate

Request:

```bash
curl https://ollama.yourcompany.com/api/generate \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "llama3.2",
    "prompt": "Write a haiku about programming",
    "stream": false,
    "options": {
      "temperature": 0.7,
      "num_predict": 100
    }
  }'
```

Response:

{ "model": "llama3.2", "response": "Code flows like water\nBugs emerge from the shadows\nDebugging begins", "done": true, "context": [1, 2, 3, ...], "total_duration": 2048576000, "load_duration": 1024000, "prompt_eval_count": 8, "eval_count": 24 }

Generate Embeddings

Create vector embeddings for RAG or semantic search.

POST /api/embed

Request:

```bash
curl https://ollama.yourcompany.com/api/embed \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "nomic-embed-text",
    "input": ["Why is the sky blue?", "What causes rainbows?"]
  }'
```

Response:

{ "model": "nomic-embed-text", "embeddings": [ [0.010071029, -0.0017594862, 0.05007221, ...], [0.012345678, -0.0023456789, 0.04567890, ...] ], "total_duration": 14143917, "load_duration": 1019500 }

Legacy endpoint (single prompt):

POST /api/embeddings

```json
{
  "model": "nomic-embed-text",
  "prompt": "Here is an article about llamas..."
}
```

Model Information

Get details about a specific model.

POST /api/show

Request:

```bash
curl https://ollama.yourcompany.com/api/show \
  -H "Authorization: Bearer your-api-key" \
  -d '{"name": "llama3.2"}'
```

Response:

{ "license": "MIT License...", "modelfile": "FROM llama3.2...", "parameters": "stop \"<|start_header_id|>\"...", "template": "{{ if .System }}...", "details": { "format": "gguf", "family": "llama", "parameter_size": "3.2B", "quantization_level": "Q4_0" } }

Pull Model

Download a model to the remote server.

POST /api/pull

Request:

```bash
curl https://ollama.yourcompany.com/api/pull \
  -H "Authorization: Bearer your-api-key" \
  -d '{"name": "mistral:7b", "stream": false}'
```

Streaming Response:

{"status": "pulling manifest"} {"status": "downloading sha256:a80c4f17...", "digest": "sha256:a80c4f17...", "total": 2019393189, "completed": 524288000} ... {"status": "success"}

Running Models

Check which models are currently loaded in memory.

GET /api/ps

Response:

{ "models": [ { "name": "llama3.2:latest", "size": 2019393189, "digest": "a80c4f17...", "expires_at": "2024-01-15T11:30:00Z" } ] }

Options Reference

Control model behavior with these options in /api/generate or /api/chat:

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| temperature | float | 0.8 | Creativity (0.0-2.0) |
| num_predict | int | 128 | Max tokens to generate |
| top_p | float | 0.9 | Nucleus sampling |
| top_k | int | 40 | Top-k sampling |
| repeat_penalty | float | 1.1 | Repetition penalty |
| seed | int | - | Random seed for reproducibility |
| num_ctx | int | 2048 | Context window size |
| stop | string[] | - | Stop sequences |

Example with options:

{ "model": "llama3.2", "prompt": "Explain quantum computing", "stream": false, "options": { "temperature": 0.3, "num_predict": 500, "top_p": 0.95, "seed": 42 } }

Authentication Setup

Reverse Proxy with Auth (Nginx)

```nginx
# nginx.conf
server {
    listen 443 ssl;
    server_name ollama.yourcompany.com;

    ssl_certificate /etc/ssl/certs/ollama.crt;
    ssl_certificate_key /etc/ssl/private/ollama.key;

    location / {
        # Validate Bearer token
        if ($http_authorization != "Bearer your-secret-api-key") {
            return 401;
        }

        proxy_pass http://localhost:11434;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # For streaming
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
    }
}
```

Caddy with Basic Auth

```caddyfile
ollama.yourcompany.com {
    basicauth {
        admin $2a$14$hashedpassword
    }
    reverse_proxy localhost:11434 {
        flush_interval -1
    }
}
```

Docker Compose with Traefik

```yaml
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.ollama.rule=Host(`ollama.yourcompany.com`)"
      - "traefik.http.routers.ollama.middlewares=auth"
      - "traefik.http.middlewares.auth.basicauth.users=admin:$$apr1$$..."

  traefik:
    image: traefik:v2.10
    ports:
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```

Cloud Deployment Examples

AWS EC2 with GPU

```bash
# Launch g4dn.xlarge (T4 GPU) or g5.xlarge (A10G)
# Ubuntu 22.04 AMI

# Install NVIDIA drivers
sudo apt update && sudo apt install -y nvidia-driver-535

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Configure for remote access
sudo systemctl edit ollama
# Add under [Service]: Environment="OLLAMA_HOST=0.0.0.0:11434"

# Restart and pull models
sudo systemctl restart ollama
ollama pull llama3.2
ollama pull nomic-embed-text

# Configure security group to allow 11434 from your app
```

Google Cloud with GPU

```bash
# Create VM with NVIDIA T4/A100
gcloud compute instances create ollama-server \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=200GB

# SSH in, install NVIDIA drivers, then install Ollama
gcloud compute ssh ollama-server
curl -fsSL https://ollama.com/install.sh | sh
```

RunPod / Vast.ai

```bash
# Use their GPU marketplace templates
# Many offer pre-configured Ollama images

# Example RunPod environment
OLLAMA_HOST=0.0.0.0:11434
```

Usage in ignitionstack.pro

Direct Chat

```typescript
import { ProviderFactory } from '@/lib/ai/factory/provider-factory'

const ollama = ProviderFactory.createGlobalProvider('ollama_remote')

const response = await ollama.chat({
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain TypeScript generics.' },
  ],
  model: 'llama3.2',
  temperature: 0.7,
})

console.log(response.content)
```

Streaming Chat

```typescript
const stream = await ollama.streamChat({
  messages: [{ role: 'user', content: 'Write a poem' }],
  model: 'llama3.2',
})

for await (const chunk of stream) {
  if (chunk.type === 'token') {
    process.stdout.write(chunk.delta)
  }
}
```

Dynamic Model Selection

```typescript
// List available models from remote server
const models = await ollama.listModels()
// Returns: ['llama3.2:latest', 'gemma3:12b', 'mistral:7b']

// Use in model selector UI
const modelOptions = models.map(name => ({
  id: `ollama_remote/${name}`,
  label: name,
  provider: 'ollama_remote',
}))
```

Strategy Router Integration

```typescript
const router = new StrategyRouter({
  rules: [
    // Route complex tasks to remote Ollama with larger model
    { task: 'analysis', provider: 'ollama_remote', model: 'llama3.1:70b' },
    // Local for quick tasks
    { task: 'chat', provider: 'ollama', model: 'llama3.2:8b' },
    // OpenAI for vision
    { task: 'vision', provider: 'openai', model: 'gpt-4o' },
  ],
})
```

Performance Metrics

All responses include performance metrics; durations are reported in nanoseconds:

| Metric | Description |
| --- | --- |
| total_duration | Total request time |
| load_duration | Time to load model into memory |
| prompt_eval_count | Tokens in the prompt |
| prompt_eval_duration | Time to process prompt |
| eval_count | Tokens generated |
| eval_duration | Time to generate response |
```typescript
const response = await ollama.chat(request)

console.log({
  totalMs: response.metadata.totalDuration / 1_000_000,
  tokensPerSecond:
    response.usage.completionTokens /
    (response.metadata.evalDuration / 1_000_000_000),
})
```

Error Handling

```typescript
try {
  const response = await ollama.chat(request)
} catch (error) {
  if (error.message.includes('ECONNREFUSED')) {
    console.error('Remote Ollama server is unreachable')
    // Fallback to local or cloud provider
  }
  if (error.message.includes('401') || error.message.includes('403')) {
    console.error('Invalid API key for remote Ollama')
    // Prompt user to update credentials
  }
  if (error.message.includes('not found')) {
    console.error('Model not available on remote server')
    // Suggest pulling the model
  }
}
```
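
One way to extend this into a fallback chain, sketched under the assumption that a local `'ollama'` provider (as used in the Strategy Router example above) is also configured:

```typescript
import { ProviderFactory } from '@/lib/ai/factory/provider-factory'

// Illustrative fallback: try remote Ollama first, then fall back to local Ollama.
// The provider ids ('ollama_remote', 'ollama') follow the examples on this page.
async function chatWithFallback(request: any) {
  const remote = ProviderFactory.createGlobalProvider('ollama_remote')
  try {
    return await remote.chat(request)
  } catch (error: any) {
    if (error.message.includes('ECONNREFUSED')) {
      console.warn('Remote Ollama unreachable, falling back to local Ollama')
      const local = ProviderFactory.createGlobalProvider('ollama')
      return await local.chat(request)
    }
    throw error // auth and missing-model errors are not fixed by a fallback
  }
}
```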

Comparison: Local vs Remote

| Aspect | Local Ollama | Remote Ollama |
| --- | --- | --- |
| Setup | Install locally | Configure URL + key |
| Cost | Hardware cost | Server rental |
| Privacy | Data stays local | Data goes to server |
| Latency | ~50-200ms | Network dependent |
| Scaling | Limited by local GPU | Scale server independently |
| Availability | Your uptime | Server uptime |
| Offline | Works offline | Requires network |

Troubleshooting

Connection Refused

```bash
# Check server is running
curl https://ollama.yourcompany.com/api/tags

# Verify firewall rules
# Ensure port 11434 (or your custom port) is open
```

Authentication Failed

```bash
# Test with curl
curl -H "Authorization: Bearer your-key" \
  https://ollama.yourcompany.com/api/tags

# Check proxy configuration for Bearer token validation
```

Model Not Found

```bash
# List available models
curl https://ollama.yourcompany.com/api/tags

# Pull missing model
curl -X POST https://ollama.yourcompany.com/api/pull \
  -H "Authorization: Bearer your-key" \
  -d '{"name": "llama3.2"}'
```

Slow Responses

  1. Check GPU utilization on server: nvidia-smi
  2. Verify model is loaded: GET /api/ps
  3. Reduce num_ctx for faster inference
  4. Use quantized models (llama3.2:q4_0)