
Ollama Remote

ignitionstack.pro supports connecting to remote Ollama servers, enabling cloud-hosted or centralized LLM inference with optional API key authentication. This is ideal for teams sharing a GPU server, for cloud deployments, or for managed Ollama services.

Why Ollama Remote?

| Benefit | Description |
| --- | --- |
| Centralized GPU | Share expensive GPU resources across teams/apps |
| Cloud Deployment | Run Ollama on cloud VMs (AWS, GCP, Azure) |
| API Key Auth | Secure access with Bearer token authentication |
| Zero Local Setup | No need to install Ollama locally |
| Scalability | Scale GPU instances independently |
| Same API | Identical interface to local Ollama |

Configuration

Environment Variables

```bash
# .env.local / .env.production

# Required: Remote Ollama server URL
OLLAMA_REMOTE_BASE_URL=https://ollama.yourcompany.com

# Optional: API key for authentication (Bearer token)
OLLAMA_REMOTE_API_KEY=your-api-key-here
```
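
A minimal sketch of consuming these variables in application code; the variable names match the sample above, while the `ollamaHeaders` helper itself is illustrative rather than part of the ignitionstack.pro API:

```typescript
// Illustrative config helper (not part of the ignitionstack.pro API).
// Builds request headers for the remote server from the env vars above.
const baseUrl = process.env.OLLAMA_REMOTE_BASE_URL
const apiKey = process.env.OLLAMA_REMOTE_API_KEY

if (!baseUrl) {
  throw new Error('OLLAMA_REMOTE_BASE_URL is not set')
}

export function ollamaHeaders(): Record<string, string> {
  return {
    'Content-Type': 'application/json',
    // The API key is optional; only attach the Bearer token when configured.
    ...(apiKey ? { Authorization: `Bearer ${apiKey}` } : {}),
  }
}
```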

Provider Factory Usage

```typescript
import { ProviderFactory } from '@/lib/ai/factory/provider-factory'

// Using global configuration (env vars)
const provider = ProviderFactory.createGlobalProvider('ollama_remote')

// Or with custom configuration
const apiKey = process.env.OLLAMA_REMOTE_API_KEY
const customProvider = await ProviderFactory.createProvider('ollama_remote', apiKey, {
  baseUrl: 'https://ollama.yourcompany.com',
})

// Check availability
if (ProviderFactory.isGlobalProviderAvailable('ollama_remote')) {
  const models = await provider.listModels()
  console.log('Available models:', models)
}
```

API Reference (Ollama Official)

The remote server exposes the standard Ollama HTTP API. Below are the key endpoints with request/response samples.

List Models

Check available models on the remote server.

GET /api/tags

Request:

```bash
curl https://ollama.yourcompany.com/api/tags \
  -H "Authorization: Bearer your-api-key"
```

Response:

{ "models": [ { "name": "llama3.2:latest", "size": 2019393189, "digest": "a80c4f17aae5...", "details": { "format": "gguf", "family": "llama", "parameter_size": "3.2B", "quantization_level": "Q4_0" } }, { "name": "gemma3:12b", "size": 8012345678, "digest": "b91d2e3f4c5a...", "details": { "parameter_size": "12B" } } ] }

Chat Completion

Send a conversation and receive a response.

POST /api/chat

Request:

```bash
curl https://ollama.yourcompany.com/api/chat \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Why is the sky blue?"}
    ],
    "stream": false
  }'
```

Response (non-streaming):

{ "model": "llama3.2", "created_at": "2024-01-15T10:30:00.123456Z", "message": { "role": "assistant", "content": "The sky appears blue due to Rayleigh scattering..." }, "done": true, "total_duration": 5043500667, "load_duration": 3013500, "prompt_eval_count": 26, "prompt_eval_duration": 325953000, "eval_count": 290, "eval_duration": 4709213000 }

Streaming Response:

```bash
curl https://ollama.yourcompany.com/api/chat \
  -H "Authorization: Bearer your-api-key" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'
```

Each chunk:

{"model":"llama3.2","message":{"role":"assistant","content":"The"},"done":false} {"model":"llama3.2","message":{"role":"assistant","content":" sky"},"done":false} {"model":"llama3.2","message":{"role":"assistant","content":" is"},"done":false} ... {"model":"llama3.2","message":{"role":"assistant","content":""},"done":true,"total_duration":5043500667}

Text Generation

Generate text from a prompt (simpler than chat).

POST /api/generate

Request:

```bash
curl https://ollama.yourcompany.com/api/generate \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "llama3.2",
    "prompt": "Write a haiku about programming",
    "stream": false,
    "options": {
      "temperature": 0.7,
      "num_predict": 100
    }
  }'
```

Response:

{ "model": "llama3.2", "response": "Code flows like water\nBugs emerge from the shadows\nDebugging begins", "done": true, "context": [1, 2, 3, ...], "total_duration": 2048576000, "load_duration": 1024000, "prompt_eval_count": 8, "eval_count": 24 }

Generate Embeddings

Create vector embeddings for RAG or semantic search.

POST /api/embed

Request:

```bash
curl https://ollama.yourcompany.com/api/embed \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "nomic-embed-text",
    "input": ["Why is the sky blue?", "What causes rainbows?"]
  }'
```

Response:

{ "model": "nomic-embed-text", "embeddings": [ [0.010071029, -0.0017594862, 0.05007221, ...], [0.012345678, -0.0023456789, 0.04567890, ...] ], "total_duration": 14143917, "load_duration": 1019500 }

Legacy endpoint (single prompt):

POST /api/embeddings

```json
{
  "model": "nomic-embed-text",
  "prompt": "Here is an article about llamas..."
}
```

Model Information

Get details about a specific model.

POST /api/show

Request:

```bash
curl https://ollama.yourcompany.com/api/show \
  -H "Authorization: Bearer your-api-key" \
  -d '{"name": "llama3.2"}'
```

Response:

{ "license": "MIT License...", "modelfile": "FROM llama3.2...", "parameters": "stop \"<|start_header_id|>\"...", "template": "{{ if .System }}...", "details": { "format": "gguf", "family": "llama", "parameter_size": "3.2B", "quantization_level": "Q4_0" } }

Pull Model

Download a model to the remote server.

POST /api/pull

Request:

```bash
curl https://ollama.yourcompany.com/api/pull \
  -H "Authorization: Bearer your-api-key" \
  -d '{"name": "mistral:7b", "stream": false}'
```

Streaming Response:

{"status": "pulling manifest"} {"status": "downloading sha256:a80c4f17...", "digest": "sha256:a80c4f17...", "total": 2019393189, "completed": 524288000} ... {"status": "success"}

Running Models

Check which models are currently loaded in memory.

GET /api/ps

Response:

{ "models": [ { "name": "llama3.2:latest", "size": 2019393189, "digest": "a80c4f17...", "expires_at": "2024-01-15T11:30:00Z" } ] }

Options Reference

Control model behavior with these options in /api/generate or /api/chat:

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| temperature | float | 0.8 | Creativity (0.0-2.0) |
| num_predict | int | 128 | Max tokens to generate |
| top_p | float | 0.9 | Nucleus sampling |
| top_k | int | 40 | Top-k sampling |
| repeat_penalty | float | 1.1 | Repetition penalty |
| seed | int | - | Random seed for reproducibility |
| num_ctx | int | 2048 | Context window size |
| stop | string[] | - | Stop sequences |

Example with options:

{ "model": "llama3.2", "prompt": "Explain quantum computing", "stream": false, "options": { "temperature": 0.3, "num_predict": 500, "top_p": 0.95, "seed": 42 } }

Authentication Setup

Reverse Proxy with Auth (Nginx)

```nginx
# nginx.conf
server {
    listen 443 ssl;
    server_name ollama.yourcompany.com;

    ssl_certificate /etc/ssl/certs/ollama.crt;
    ssl_certificate_key /etc/ssl/private/ollama.key;

    location / {
        # Validate Bearer token
        if ($http_authorization != "Bearer your-secret-api-key") {
            return 401;
        }

        proxy_pass http://localhost:11434;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # For streaming
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
    }
}
```

Caddy with Basic Auth

```caddyfile
ollama.yourcompany.com {
    basicauth {
        admin $2a$14$hashedpassword
    }
    reverse_proxy localhost:11434 {
        flush_interval -1
    }
}
```

Docker Compose with Traefik

```yaml
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.ollama.rule=Host(`ollama.yourcompany.com`)"
      - "traefik.http.routers.ollama.middlewares=auth"
      - "traefik.http.middlewares.auth.basicauth.users=admin:$$apr1$$..."

  traefik:
    image: traefik:v2.10
    ports:
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```

Cloud Deployment Examples

AWS EC2 with GPU

```bash
# Launch g4dn.xlarge (T4 GPU) or g5.xlarge (A10G)
# Ubuntu 22.04 AMI

# Install NVIDIA drivers
sudo apt update && sudo apt install -y nvidia-driver-535

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Configure for remote access
sudo systemctl edit ollama
# Add under [Service]: Environment="OLLAMA_HOST=0.0.0.0:11434"

# Restart and pull models
sudo systemctl restart ollama
ollama pull llama3.2
ollama pull nomic-embed-text

# Configure security group to allow 11434 from your app
```

Google Cloud with GPU

```bash
# Create VM with NVIDIA T4/A100
gcloud compute instances create ollama-server \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=200GB

# SSH in, install NVIDIA drivers, then install Ollama
gcloud compute ssh ollama-server
curl -fsSL https://ollama.com/install.sh | sh
```

RunPod / Vast.ai

```bash
# Use their GPU marketplace templates
# Many offer pre-configured Ollama images

# Example RunPod environment
OLLAMA_HOST=0.0.0.0:11434
```

Usage in ignitionstack.pro

Direct Chat

```typescript
import { ProviderFactory } from '@/lib/ai/factory/provider-factory'

const ollama = ProviderFactory.createGlobalProvider('ollama_remote')

const response = await ollama.chat({
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain TypeScript generics.' },
  ],
  model: 'llama3.2',
  temperature: 0.7,
})

console.log(response.content)
```

Streaming Chat

```typescript
const stream = await ollama.streamChat({
  messages: [{ role: 'user', content: 'Write a poem' }],
  model: 'llama3.2',
})

for await (const chunk of stream) {
  if (chunk.type === 'token') {
    process.stdout.write(chunk.delta)
  }
}
```

Dynamic Model Selection

```typescript
// List available models from remote server
const models = await ollama.listModels()
// Returns: ['llama3.2:latest', 'gemma3:12b', 'mistral:7b']

// Use in model selector UI
const modelOptions = models.map(name => ({
  id: `ollama_remote/${name}`,
  label: name,
  provider: 'ollama_remote',
}))
```

Strategy Router Integration

```typescript
const router = new StrategyRouter({
  rules: [
    // Route complex tasks to remote Ollama with larger model
    { task: 'analysis', provider: 'ollama_remote', model: 'llama3.1:70b' },
    // Local for quick tasks
    { task: 'chat', provider: 'ollama', model: 'llama3.2:8b' },
    // OpenAI for vision
    { task: 'vision', provider: 'openai', model: 'gpt-4o' },
  ],
})
```

Performance Metrics

All responses include performance metrics; durations are reported in nanoseconds:

| Metric | Description |
| --- | --- |
| total_duration | Total request time |
| load_duration | Time to load model into memory |
| prompt_eval_count | Tokens in the prompt |
| prompt_eval_duration | Time to process prompt |
| eval_count | Tokens generated |
| eval_duration | Time to generate response |
```typescript
const response = await ollama.chat(request)

console.log({
  totalMs: response.metadata.totalDuration / 1_000_000,
  tokensPerSecond:
    response.usage.completionTokens /
    (response.metadata.evalDuration / 1_000_000_000),
})
```

Error Handling

```typescript
try {
  const response = await ollama.chat(request)
} catch (error) {
  if (error.message.includes('ECONNREFUSED')) {
    console.error('Remote Ollama server is unreachable')
    // Fallback to local or cloud provider
  }
  if (error.message.includes('401') || error.message.includes('403')) {
    console.error('Invalid API key for remote Ollama')
    // Prompt user to update credentials
  }
  if (error.message.includes('not found')) {
    console.error('Model not available on remote server')
    // Suggest pulling the model
  }
}
```
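
One way to extend this into a fallback chain, sketched under the assumption that a local `'ollama'` provider (as used in the Strategy Router example above) is also configured:

```typescript
import { ProviderFactory } from '@/lib/ai/factory/provider-factory'

// Illustrative fallback: try remote Ollama first, then fall back to local Ollama.
// The provider ids ('ollama_remote', 'ollama') follow the examples on this page.
async function chatWithFallback(request: any) {
  const remote = ProviderFactory.createGlobalProvider('ollama_remote')
  try {
    return await remote.chat(request)
  } catch (error: any) {
    if (error.message.includes('ECONNREFUSED')) {
      console.warn('Remote Ollama unreachable, falling back to local Ollama')
      const local = ProviderFactory.createGlobalProvider('ollama')
      return await local.chat(request)
    }
    throw error // auth and missing-model errors are not fixed by a fallback
  }
}
```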

Comparison: Local vs Remote

| Aspect | Local Ollama | Remote Ollama |
| --- | --- | --- |
| Setup | Install locally | Configure URL + key |
| Cost | Hardware cost | Server rental |
| Privacy | Data stays local | Data goes to server |
| Latency | ~50-200ms | Network dependent |
| Scaling | Limited by local GPU | Scale server independently |
| Availability | Your uptime | Server uptime |
| Offline | Works offline | Requires network |

Troubleshooting

Connection Refused

```bash
# Check server is running
curl https://ollama.yourcompany.com/api/tags

# Verify firewall rules
# Ensure port 11434 (or your custom port) is open
```

Authentication Failed

```bash
# Test with curl
curl -H "Authorization: Bearer your-key" \
  https://ollama.yourcompany.com/api/tags

# Check proxy configuration for Bearer token validation
```

Model Not Found

```bash
# List available models
curl https://ollama.yourcompany.com/api/tags

# Pull missing model
curl -X POST https://ollama.yourcompany.com/api/pull \
  -H "Authorization: Bearer your-key" \
  -d '{"name": "llama3.2"}'
```

Slow Responses

  1. Check GPU utilization on server: nvidia-smi
  2. Verify model is loaded: GET /api/ps
  3. Reduce num_ctx for faster inference
  4. Use quantized models (llama3.2:q4_0)