ignitionstack.pro supports connecting to remote Ollama servers, enabling cloud-hosted or centralized LLM inference with optional API key authentication. This is ideal for teams sharing a GPU server, cloud deployments, or managed Ollama services.
| Benefit | Description |
|---|---|
| Centralized GPU | Share expensive GPU resources across team/apps |
| Cloud Deployment | Run Ollama on cloud VMs (AWS, GCP, Azure) |
| API Key Auth | Secure access with Bearer token authentication |
| Zero Local Setup | No need to install Ollama locally |
| Scalability | Scale GPU instances independently |
| Same API | Identical interface to local Ollama |
# .env.local / .env.production
# Required: Remote Ollama server URL
OLLAMA_REMOTE_BASE_URL=https://ollama.yourcompany.com
# Optional: API key for authentication (Bearer token)
OLLAMA_REMOTE_API_KEY=your-api-key-here

import { ProviderFactory } from '@/lib/ai/factory/provider-factory'
// Using global configuration (env vars)
const provider = ProviderFactory.createGlobalProvider('ollama_remote')
// Or with custom configuration
const provider = await ProviderFactory.createProvider('ollama_remote', apiKey, {
baseUrl: 'https://ollama.yourcompany.com',
})
// Check availability
if (ProviderFactory.isGlobalProviderAvailable('ollama_remote')) {
const models = await provider.listModels()
console.log('Available models:', models)
}

The remote server exposes the standard Ollama HTTP API. Below are the key endpoints with request/response samples.
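All endpoints accept the same Bearer header. If you want to call the HTTP API directly rather than through the provider abstraction, here is a minimal fetch-based sketch (assuming Node 18+ and the environment variables shown above; `ollamaRequest` is an illustrative helper, not part of ignitionstack.pro):

```typescript
// Minimal helper for authenticated requests to a remote Ollama server.
// Assumes Node 18+ (global fetch); BASE_URL and API_KEY come from the env vars above.
const BASE_URL = process.env.OLLAMA_REMOTE_BASE_URL ?? 'https://ollama.yourcompany.com'
const API_KEY = process.env.OLLAMA_REMOTE_API_KEY

async function ollamaRequest<T>(path: string, body?: unknown): Promise<T> {
  const res = await fetch(`${BASE_URL}${path}`, {
    method: body ? 'POST' : 'GET',
    headers: {
      'Content-Type': 'application/json',
      // Only send the Bearer header when an API key is configured
      ...(API_KEY ? { Authorization: `Bearer ${API_KEY}` } : {}),
    },
    body: body ? JSON.stringify(body) : undefined,
  })
  if (!res.ok) throw new Error(`Ollama request failed: ${res.status} ${res.statusText}`)
  return res.json() as Promise<T>
}

// Example: list models on the remote server
const tags = await ollamaRequest<{ models: { name: string }[] }>('/api/tags')
console.log(tags.models.map(m => m.name))
```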
Check available models on the remote server.
GET /api/tags
Request:
curl https://ollama.yourcompany.com/api/tags \
-H "Authorization: Bearer your-api-key"Response:
{
"models": [
{
"name": "llama3.2:latest",
"size": 2019393189,
"digest": "a80c4f17aae5...",
"details": {
"format": "gguf",
"family": "llama",
"parameter_size": "3.2B",
"quantization_level": "Q4_0"
}
},
{
"name": "gemma3:12b",
"size": 8012345678,
"digest": "b91d2e3f4c5a...",
"details": {
"parameter_size": "12B"
}
}
]
}

Send a conversation and receive a response.
POST /api/chat
Request:
curl https://ollama.yourcompany.com/api/chat \
-H "Authorization: Bearer your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Why is the sky blue?"}
],
"stream": false
}'

Response (non-streaming):
{
"model": "llama3.2",
"created_at": "2024-01-15T10:30:00.123456Z",
"message": {
"role": "assistant",
"content": "The sky appears blue due to Rayleigh scattering..."
},
"done": true,
"total_duration": 5043500667,
"load_duration": 3013500,
"prompt_eval_count": 26,
"prompt_eval_duration": 325953000,
"eval_count": 290,
"eval_duration": 4709213000
}

Streaming Response:
curl https://ollama.yourcompany.com/api/chat \
-H "Authorization: Bearer your-api-key" \
-d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'Each chunk:
{"model":"llama3.2","message":{"role":"assistant","content":"The"},"done":false}
{"model":"llama3.2","message":{"role":"assistant","content":" sky"},"done":false}
{"model":"llama3.2","message":{"role":"assistant","content":" is"},"done":false}
...
{"model":"llama3.2","message":{"role":"assistant","content":""},"done":true,"total_duration":5043500667}Generate text from a prompt (simpler than chat).
Generate text from a prompt (simpler than chat).
POST /api/generate
Request:
curl https://ollama.yourcompany.com/api/generate \
-H "Authorization: Bearer your-api-key" \
-d '{
"model": "llama3.2",
"prompt": "Write a haiku about programming",
"stream": false,
"options": {
"temperature": 0.7,
"num_predict": 100
}
}'

Response:
{
"model": "llama3.2",
"response": "Code flows like water\nBugs emerge from the shadows\nDebugging begins",
"done": true,
"context": [1, 2, 3, ...],
"total_duration": 2048576000,
"load_duration": 1024000,
"prompt_eval_count": 8,
"eval_count": 24
}
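The context array encodes the state of this exchange; /api/generate accepts it back as a context parameter to continue where the previous call left off. A hedged sketch (URL, key, and prompts are placeholders):

```typescript
// Sketch: continue a /api/generate exchange by passing the returned
// "context" array back on the next request. URL and key are placeholders.
type GenerateResponse = { response: string; context?: number[]; done: boolean }

async function generate(prompt: string, context?: number[]): Promise<GenerateResponse> {
  const res = await fetch('https://ollama.yourcompany.com/api/generate', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: 'Bearer your-api-key',
    },
    body: JSON.stringify({ model: 'llama3.2', prompt, context, stream: false }),
  })
  return res.json() as Promise<GenerateResponse>
}

const first = await generate('Write a haiku about programming')
// The follow-up request continues from the state encoded in first.context
const followUp = await generate('Now translate it to German', first.context)
console.log(followUp.response)
```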
Create vector embeddings for RAG or semantic search.
POST /api/embed
Request:
curl https://ollama.yourcompany.com/api/embed \
-H "Authorization: Bearer your-api-key" \
-d '{
"model": "nomic-embed-text",
"input": ["Why is the sky blue?", "What causes rainbows?"]
}'

Response:
{
"model": "nomic-embed-text",
"embeddings": [
[0.010071029, -0.0017594862, 0.05007221, ...],
[0.012345678, -0.0023456789, 0.04567890, ...]
],
"total_duration": 14143917,
"load_duration": 1019500
}
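A typical use of these vectors is ranking documents against a query by cosine similarity. A small self-contained sketch (URL and key are placeholders; the ranking helper is illustrative, not an ignitionstack.pro API):

```typescript
// Sketch: embed a query plus documents in one /api/embed call, then rank
// the documents by cosine similarity. URL and key are placeholders.
const docs = ['Why is the sky blue?', 'What causes rainbows?']
const query = 'sky color explanation'

const res = await fetch('https://ollama.yourcompany.com/api/embed', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', Authorization: 'Bearer your-api-key' },
  body: JSON.stringify({ model: 'nomic-embed-text', input: [query, ...docs] }),
})
const { embeddings } = (await res.json()) as { embeddings: number[][] }

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// embeddings[0] is the query vector; the rest correspond to docs in order
const [queryVec, ...docVecs] = embeddings
const ranked = docVecs
  .map((vec, i) => ({ doc: docs[i], score: cosineSimilarity(queryVec, vec) }))
  .sort((a, b) => b.score - a.score)
console.log(ranked)
```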
Legacy endpoint (single prompt):
POST /api/embeddings
{
"model": "nomic-embed-text",
"prompt": "Here is an article about llamas..."
}

Get details about a specific model.
POST /api/show
Request:
curl https://ollama.yourcompany.com/api/show \
-H "Authorization: Bearer your-api-key" \
-d '{"name": "llama3.2"}'Response:
{
"license": "MIT License...",
"modelfile": "FROM llama3.2...",
"parameters": "stop \"<|start_header_id|>\"...",
"template": "{{ if .System }}...",
"details": {
"format": "gguf",
"family": "llama",
"parameter_size": "3.2B",
"quantization_level": "Q4_0"
}
}

Download a model to the remote server.
POST /api/pull
Request:
curl https://ollama.yourcompany.com/api/pull \
-H "Authorization: Bearer your-api-key" \
-d '{"name": "mistral:7b", "stream": false}'Streaming Response:
{"status": "pulling manifest"}
{"status": "downloading sha256:a80c4f17...", "digest": "sha256:a80c4f17...", "total": 2019393189, "completed": 524288000}
...
{"status": "success"}Check which models are currently loaded in memory.
Check which models are currently loaded in memory.
GET /api/ps
Response:
{
"models": [
{
"name": "llama3.2:latest",
"size": 2019393189,
"digest": "a80c4f17...",
"expires_at": "2024-01-15T11:30:00Z"
}
]
}

Control model behavior with these options in /api/generate or /api/chat:
| Option | Type | Default | Description |
|---|---|---|---|
| temperature | float | 0.8 | Creativity (0.0-2.0) |
| num_predict | int | 128 | Max tokens to generate |
| top_p | float | 0.9 | Nucleus sampling |
| top_k | int | 40 | Top-k sampling |
| repeat_penalty | float | 1.1 | Repetition penalty |
| seed | int | - | Random seed for reproducibility |
| num_ctx | int | 2048 | Context window size |
| stop | string[] | - | Stop sequences |
Example with options:
{
"model": "llama3.2",
"prompt": "Explain quantum computing",
"stream": false,
"options": {
"temperature": 0.3,
"num_predict": 500,
"top_p": 0.95,
"seed": 42
}
}
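When calling the HTTP API directly, note that these settings belong inside the options object rather than at the top level of the request body. A sketch requesting more deterministic output (URL and key are placeholders):

```typescript
// Sketch: pass sampling options in the "options" object of a raw /api/chat call.
// URL and key are placeholders; a fixed seed plus low temperature gives repeatable output.
const res = await fetch('https://ollama.yourcompany.com/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', Authorization: 'Bearer your-api-key' },
  body: JSON.stringify({
    model: 'llama3.2',
    messages: [{ role: 'user', content: 'Explain quantum computing' }],
    stream: false,
    options: {
      temperature: 0.3,
      num_predict: 500,
      top_p: 0.95,
      seed: 42,
    },
  }),
})
const data = await res.json()
console.log(data.message.content)
```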
# nginx.conf
server {
listen 443 ssl;
server_name ollama.yourcompany.com;
ssl_certificate /etc/ssl/certs/ollama.crt;
ssl_certificate_key /etc/ssl/private/ollama.key;
location / {
# Validate Bearer token
if ($http_authorization != "Bearer your-secret-api-key") {
return 401;
}
proxy_pass http://localhost:11434;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
# For streaming
proxy_buffering off;
proxy_cache off;
proxy_read_timeout 300s;
}
}

# Caddyfile
ollama.yourcompany.com {
basicauth {
admin $2a$14$hashedpassword
}
reverse_proxy localhost:11434 {
flush_interval -1
}
}
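Note that the Caddy and Traefik examples use HTTP Basic auth rather than a Bearer token, so a client talking to such a proxy directly would send an Authorization: Basic header instead. A sketch of building that header (Node assumed; credentials are placeholders):

```typescript
// Sketch: call a Basic-auth-protected proxy (Caddy/Traefik basicauth) instead of
// the Bearer-token setup shown earlier. Credentials and URL are placeholders.
const credentials = Buffer.from('admin:your-password').toString('base64')

const res = await fetch('https://ollama.yourcompany.com/api/tags', {
  headers: { Authorization: `Basic ${credentials}` },
})
console.log(await res.json())
```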
# docker-compose.yml
services:
ollama:
image: ollama/ollama:latest
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
labels:
- "traefik.enable=true"
- "traefik.http.routers.ollama.rule=Host(`ollama.yourcompany.com`)"
- "traefik.http.routers.ollama.middlewares=auth"
- "traefik.http.middlewares.auth.basicauth.users=admin:$$apr1$$..."
traefik:
image: traefik:v2.10
ports:
- "443:443"
volumes:
- /var/run/docker.sock:/var/run/docker.sock

# Launch g4dn.xlarge (T4 GPU) or g5.xlarge (A10G)
# Ubuntu 22.04 AMI
# Install NVIDIA drivers
sudo apt update && sudo apt install -y nvidia-driver-535
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Configure for remote access
sudo systemctl edit ollama
# Add: Environment="OLLAMA_HOST=0.0.0.0:11434"
# Restart to pick up the override, then pull models
sudo systemctl restart ollama
ollama pull llama3.2
ollama pull nomic-embed-text
# Configure security group to allow 11434 from your app

# Create VM with NVIDIA T4/A100
gcloud compute instances create ollama-server \
--zone=us-central1-a \
--machine-type=n1-standard-8 \
--accelerator=type=nvidia-tesla-t4,count=1 \
--image-family=ubuntu-2204-lts \
--image-project=ubuntu-os-cloud \
--boot-disk-size=200GB
# SSH and install Ollama
gcloud compute ssh ollama-server
curl -fsSL https://ollama.com/install.sh | sh

# GPU cloud providers (e.g. RunPod) offer GPU marketplace templates
# Many offer pre-configured Ollama images
# Example RunPod environment
OLLAMA_HOST=0.0.0.0:11434

import { ProviderFactory } from '@/lib/ai/factory/provider-factory'
const ollama = ProviderFactory.createGlobalProvider('ollama_remote')
const response = await ollama.chat({
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'Explain TypeScript generics.' }
],
model: 'llama3.2',
temperature: 0.7,
})
console.log(response.content)

const stream = await ollama.streamChat({
messages: [{ role: 'user', content: 'Write a poem' }],
model: 'llama3.2',
})
for await (const chunk of stream) {
if (chunk.type === 'token') {
process.stdout.write(chunk.delta)
}
}

// List available models from remote server
const models = await ollama.listModels()
// Returns: ['llama3.2:latest', 'gemma3:12b', 'mistral:7b']
// Use in model selector UI
const modelOptions = models.map(name => ({
id: `ollama_remote/${name}`,
label: name,
provider: 'ollama_remote',
}))

Routing can mix remote and local Ollama, for example sending heavy analysis to the remote server while keeping quick chat local:

const router = new StrategyRouter({
rules: [
// Route complex tasks to remote Ollama with larger model
{ task: 'analysis', provider: 'ollama_remote', model: 'llama3.1:70b' },
// Local for quick tasks
{ task: 'chat', provider: 'ollama', model: 'llama3.2:3b' },
// OpenAI for vision
{ task: 'vision', provider: 'openai', model: 'gpt-4o' },
],
})

All responses include timing metrics (in nanoseconds):
| Metric | Description |
|---|---|
| total_duration | Total request time |
| load_duration | Time to load model into memory |
| prompt_eval_count | Tokens in the prompt |
| prompt_eval_duration | Time to process prompt |
| eval_count | Tokens generated |
| eval_duration | Time to generate response |
const response = await ollama.chat(request)
console.log({
totalMs: response.metadata.totalDuration / 1_000_000,
tokensPerSecond: response.usage.completionTokens /
(response.metadata.evalDuration / 1_000_000_000),
})

try {
const response = await ollama.chat(request)
} catch (error) {
if (error.message.includes('ECONNREFUSED')) {
console.error('Remote Ollama server is unreachable')
// Fallback to local or cloud provider
}
if (error.message.includes('401') || error.message.includes('403')) {
console.error('Invalid API key for remote Ollama')
// Prompt user to update credentials
}
if (error.message.includes('not found')) {
console.error('Model not available on remote server')
// Suggest pulling the model
}
}
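The fallback hinted at in the comments above can be made concrete with a small wrapper that retries against local Ollama when the remote server is unreachable. A sketch using the factory API shown earlier (the 'ollama' provider id for local Ollama is assumed from the routing example, and the request/error shapes are simplified):

```typescript
import { ProviderFactory } from '@/lib/ai/factory/provider-factory'

// Sketch: prefer remote Ollama, fall back to local Ollama when unreachable.
// 'ollama' as the local provider id is an assumption based on the routing example above.
async function chatWithFallback(request: { model: string; messages: { role: string; content: string }[] }) {
  const remote = ProviderFactory.createGlobalProvider('ollama_remote')
  try {
    return await remote.chat(request)
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error)
    if (message.includes('ECONNREFUSED') || message.includes('fetch failed')) {
      // Remote unreachable: retry the same request against local Ollama
      const local = ProviderFactory.createGlobalProvider('ollama')
      return await local.chat(request)
    }
    throw error
  }
}
```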
| Aspect | Local Ollama | Remote Ollama |
|---|---|---|
| Setup | Install locally | Configure URL + key |
| Cost | Hardware cost | Server rental |
| Privacy | Data stays local | Data goes to server |
| Latency | ~50-200ms | Network dependent |
| Scaling | Limited by local GPU | Scale server independently |
| Availability | Your uptime | Server uptime |
| Offline | Works offline | Requires network |
# Check server is running
curl https://ollama.yourcompany.com/api/tags
# Verify firewall rules
# Ensure port 11434 (or your custom port) is open

# Test with curl
curl -H "Authorization: Bearer your-key" \
https://ollama.yourcompany.com/api/tags
# Check proxy configuration for Bearer token validation

# List available models
curl https://ollama.yourcompany.com/api/tags
# Pull missing model
curl -X POST https://ollama.yourcompany.com/api/pull \
-H "Authorization: Bearer your-key" \
-d '{"name": "llama3.2"}'nvidia-smiGET /api/psnum_ctx for faster inferencellama3.2:q4_0)