โ† Boot Camp Hub REFERENCE ยท FACT-CHECK ยท 2026-04-17
👻

Ollama Environment Variables

Version Drift, Ghost Vars, and Why Your Old Config Still Says It "Works"

Ollama is a fast-moving open-source project. Variables that worked a year or two ago (OLLAMA_NUM_CTX, OLLAMA_NUM_GPU, OLLAMA_MLX) were renamed, replaced, or removed without deprecation warnings. Today's server silently ignores them. This page is a snapshot of what's recognized right now, and the discipline you need to keep up.

Verified: 2026-04-17 · Ollama v0.20.7 · Source: envconfig/config.go
Contents
  1. The Open-Source Convention: read this first
  2. TL;DR
  3. The Ghost Variables (silently ignored) + bonus: GGML_METAL_RESIDENCY_KEEP_ALIVE_S
  4. The Official 25 Variables
  5. Context Length: 3-Tier Priority
  6. num_gpu vs OLLAMA_NUM_GPU
  7. Outdated Tutorials, Updated
  8. Correct macOS Server Configs
  9. Verify Your Config (server log dump)
  10. Sources

§0 The Open-Source Convention: A Universal Reading Order

Before you read any specific fact on this page, internalize the meta-rule it depends on: open-source documentation, including this page, will go stale. That's not a flaw; it's the convention.

The OSS contract, in plain words
  • The code ships first; the docs follow late, if at all. Maintainers prioritize features, fixes, and refactors. README updates, blog posts, and tutorials trail the code by weeks, months, or forever.
  • Renames, removals, and silent breakage are normal. No one is obligated to keep your old config working. Deprecation warnings are a courtesy, not a guarantee.
  • Tutorials are time capsules. A blog post from last quarter wasn't wrong โ€” it was correct on the day it was written. The project moved; the post didn't. The author isn't at fault. Neither is the AI that quoted it.
  • Bare-hands collision is the entry fee. When something doesn't work the way the docs say, you check the source, the issues, the recent commits, the startup logs โ€” in that order. This is the basic philosophy of OSS, not a punishment.
The one habit that survives all version drift

For every OSS tool you depend on, learn how it tells you what it actually loaded: startup config dumps, --version, --help flags, the equivalent of ollama serve's map[...] line. That output is the only source of truth that can't lie, because it's the program reporting its own state. Documents (this one included) are commentary; the program itself is the ground truth.

This page is dated 2026-04-17, Ollama v0.20.7. Treat every claim below as "true on that date, in that build", not as eternal scripture. When something stops working months from now, your first move isn't to assume the page lied; it's to run ollama serve, watch the dump, and check the recent commits to envconfig/config.go. That habit will outlive any specific name on this page.

§1 TL;DR

Several Ollama "tuning variables" you've seen in blog posts, Stack Overflow answers, and AI assistant outputs were real settings in earlier versions, but the current server (v0.20.7) doesn't read them. They were renamed, replaced, or quietly removed as the project evolved, and Ollama emits no deprecation warning when it encounters them.

Why this isn't anyone's fault: it's the cost of velocity

Ollama is one of the fastest-moving open-source LLM projects. Env vars get renamed for clarity, removed when superseded by API options, or dropped when a backend is restructured. None of the old tutorials are lying; they're just frozen in time. This is the chronic tax of riding a fast OSS train: the only ground truth is the version you have running today.

⚠ The Silent-Failure Trap

Ollama doesn't warn on unrecognized OLLAMA_* variables; it just ignores them. If you set OLLAMA_NUM_CTX=131072 and the model "works fine," the server actually used the VRAM-auto default (or your OLLAMA_CONTEXT_LENGTH value, if also set). You credited a setting that was never read. Correlation ≠ causation.

Five names whose meaning has drifted out from under tutorials, and what works in the current build:

  • OLLAMA_NUM_CTX (server-wide context length): use OLLAMA_CONTEXT_LENGTH (env) or options.num_ctx (per request).
  • OLLAMA_NUM_GPU (force layers to GPU at server level): use per-request options.num_gpu or Modelfile PARAMETER num_gpu (discrete GPUs only).
  • OLLAMA_N_GPU_LAYERS (llama.cpp-style layer offload): same as above.
  • OLLAMA_MLX (toggle MLX backend on Apple Silicon, preview era): not present in current envconfig. Apple Silicon ships on the Metal backend; for MLX-native inference, run mlx-lm directly.
  • OLLAMA_MAX_CONTEXT (cap server context): use OLLAMA_CONTEXT_LENGTH.

§2 Ghost Variables: Names That Drifted Out

These names appear in many tutorials, screenshots, and AI-generated configs. They were valid at some point in Ollama's history (or were inherited from llama.cpp by reasonable analogy), but the current server doesn't read them. No deprecation warning, no log line: they're just inert.

Three reasons names go ghost
  • Renamed for clarity โ€” e.g. OLLAMA_NUM_CTX โ†’ OLLAMA_CONTEXT_LENGTH (PR #8938, Feb 2025).
  • Demoted to per-request API option โ€” e.g. server-wide OLLAMA_NUM_GPU never made it into the env config; layer count became an API/Modelfile parameter (options.num_gpu).
  • Backend restructure โ€” e.g. an MLX-backend toggle that may have surfaced briefly during preview work but is not in the current envconfig; Apple Silicon ships on Metal today.
Name, why people set it, and its status in v0.20.7:
  • OLLAMA_NUM_CTX ("sets default context window"): renamed. Maintainer note in issue #14130 confirms it's not a current config var. Use OLLAMA_CONTEXT_LENGTH.
  • OLLAMA_NUM_GPU ("force all layers to GPU at server level"): not in current envconfig. The layer-count knob lives at request time: API options.num_gpu or Modelfile PARAMETER num_gpu.
  • OLLAMA_N_GPU_LAYERS (llama.cpp's n_gpu_layers with an OLLAMA_ prefix): never transferred to Ollama's env layer. Same fix as above.
  • OLLAMA_MLX (toggle MLX backend on Mac, a preview-era assumption): not present in current envconfig. Apple Silicon runs on the Metal backend today; no backend switch is exposed. For MLX-native inference, use mlx-lm directly.
  • OLLAMA_MAX_CONTEXT ("hard cap on context length"): maintainer note says it's not a current config var. Use OLLAMA_CONTEXT_LENGTH.
  • OLLAMA_GPU_LAYERS (variant naming you'll see in mixed tutorials): not in the env config. Same fix as OLLAMA_NUM_GPU.
  • OLLAMA_CACHE_SIZE ("limit KV cache memory"): not in the env config. The current memory knob is OLLAMA_KV_CACHE_TYPE=q8_0 (quantizes the cache).
  • OLLAMA_MEMORY_MODE (generic "memory tuning" suggestion in various tutorials): not in the env config.
How to tell if a name is current

Compare it against the live list (next section), and against the dump that ollama serve prints at startup. The startup dump is the only authoritative source; even this page goes stale eventually.

Bonus: the inverse case, a non-OLLAMA_* env that does reach the runner

Ghost vars are OLLAMA_*-prefixed names the server doesn't read. The mirror case is more interesting: a non-OLLAMA_*-prefixed env that nonetheless takes effect through plain POSIX inheritance, and that no public Ollama doc mentions.

🧪 Field discovery: GGML_METAL_RESIDENCY_KEEP_ALIVE_S

Origin. Not an Ollama env at all: it belongs to the embedded ggml library that powers Ollama's Metal backend on Apple Silicon. The ggml C code reads it directly via getenv("GGML_METAL_RESIDENCY_KEEP_ALIVE_S"). It controls how long Metal residency-set buffers stay resident after the model goes idle.

Why it works without Ollama "supporting" it. No special passthrough is needed. The chain is plain POSIX:

  1. LaunchAgent (or shell) sets the env on the ollama serve process.
  2. serve spawns model-runner children via fork/exec; children inherit the full env.
  3. Inside the runner, ggml's Metal backend reads the var directly. No Ollama code touches it.
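Steps 1 to 3 are ordinary process mechanics, demonstrable with any parent and child pair. A minimal Python sketch (nothing Ollama-specific; the child process stands in for the model runner):

```python
import os
import subprocess
import sys

# Parent: set the var, as a LaunchAgent or shell would for `ollama serve`.
env = dict(os.environ, GGML_METAL_RESIDENCY_KEEP_ALIVE_S="86400")

# Child: stands in for the model runner. It reads the var via plain getenv,
# exactly as ggml's C code does -- no passthrough logic exists anywhere.
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['GGML_METAL_RESIDENCY_KEEP_ALIVE_S'])"],
    env=env, capture_output=True, text=True,
)
print(child.stdout.strip())  # -> 86400
```

That is the whole mechanism: any env set on the parent reaches every child unless something actively scrubs it.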

Why you'd care. The default is 180 s. After any idle period longer than that, the residency set is torn down and the next request pays a brutal first-token latency rebuilding it ("only the first one is painfully slow"). Bumping it to 86400 (24 h) keeps buffers warm across normal idle gaps on an always-on local server.

How to verify it actually landed. The runner log emits a single line per model load:

grep -m1 'residency set collection' ~/Library/Logs/ollama.log
# expected when env set: ... (keep_alive = 86400 s)
# default (env missing):  ... (keep_alive = 180 s)

Why this is in §2 at all. It's the live counter-example to the OSS-convention warning above: not every effective env is in the project's "official" list, and not every name in the official list is the only knob that matters. The only way to know what actually took effect is the program's self-report, meaning log lines and startup dumps. That's the §0 lesson in concrete form: this var is undocumented in any Ollama-side source and was learned by log inspection on a production fleet. Future Ollama or ggml releases may rename or retire it; re-verify when in doubt.

§3 The Official 25 Variables

Sourced from envconfig/config.go in Ollama v0.20.7. These are the only OLLAMA_* env vars the server actually parses.

OLLAMA_AUTH
OLLAMA_CONTEXT_LENGTH
OLLAMA_DEBUG
OLLAMA_DEBUG_LOG_REQUESTS
OLLAMA_EDITOR
OLLAMA_FLASH_ATTENTION
OLLAMA_GPU_OVERHEAD
OLLAMA_HOST
OLLAMA_KEEP_ALIVE
OLLAMA_KV_CACHE_TYPE
OLLAMA_LLM_LIBRARY
OLLAMA_LOAD_TIMEOUT
OLLAMA_MAX_LOADED_MODELS
OLLAMA_MAX_QUEUE
OLLAMA_MODELS
OLLAMA_MULTIUSER_CACHE
OLLAMA_NEW_ENGINE
OLLAMA_NOHISTORY
OLLAMA_NOPRUNE
OLLAMA_NO_CLOUD
OLLAMA_NUM_PARALLEL
OLLAMA_ORIGINS
OLLAMA_REMOTES
OLLAMA_SCHED_SPREAD
OLLAMA_VULKAN

Anything else with the OLLAMA_ prefix that doesn't appear above isn't being read by this version. Good news: the names you'll reach for in production are here: flash attention, KV cache type, keep-alive, host, parallel requests, max loaded models.
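To catch ghost vars in your own shell, diff every OLLAMA_*-prefixed name you exported against the list above. A minimal sketch (the hardcoded set is this page's v0.20.7 snapshot and goes stale with it; find_ghost_vars is an illustrative helper, not anything Ollama ships):

```python
import os

# Snapshot of envconfig/config.go, Ollama v0.20.7 (the 25-name list above).
OFFICIAL_V0_20_7 = {
    "OLLAMA_AUTH", "OLLAMA_CONTEXT_LENGTH", "OLLAMA_DEBUG",
    "OLLAMA_DEBUG_LOG_REQUESTS", "OLLAMA_EDITOR", "OLLAMA_FLASH_ATTENTION",
    "OLLAMA_GPU_OVERHEAD", "OLLAMA_HOST", "OLLAMA_KEEP_ALIVE",
    "OLLAMA_KV_CACHE_TYPE", "OLLAMA_LLM_LIBRARY", "OLLAMA_LOAD_TIMEOUT",
    "OLLAMA_MAX_LOADED_MODELS", "OLLAMA_MAX_QUEUE", "OLLAMA_MODELS",
    "OLLAMA_MULTIUSER_CACHE", "OLLAMA_NEW_ENGINE", "OLLAMA_NOHISTORY",
    "OLLAMA_NOPRUNE", "OLLAMA_NO_CLOUD", "OLLAMA_NUM_PARALLEL",
    "OLLAMA_ORIGINS", "OLLAMA_REMOTES", "OLLAMA_SCHED_SPREAD",
    "OLLAMA_VULKAN",
}

def find_ghost_vars(environ=os.environ):
    """Return OLLAMA_*-prefixed names this server version would silently ignore."""
    return sorted(k for k in environ
                  if k.startswith("OLLAMA_") and k not in OFFICIAL_V0_20_7)

if __name__ == "__main__":
    for name in find_ghost_vars():
        print(f"ghost var (silently ignored): {name}")
```

Run it in the same shell (or LaunchAgent env) that starts ollama serve; any name it prints is hopeful typing.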

This list is a snapshot of v0.20.7. Future releases will add and remove. Re-verify against your local ollama serve startup dump rather than trusting any document (including this one) indefinitely.

§4 Context Length: 3-Tier Priority

This is the most-misunderstood Ollama setting. As of v0.15.5+, the server picks a context length using a three-tier priority chain. Understanding this kills 90% of "why is my context wrong?" confusion.

The Three Tiers (highest priority wins)

  1. API request: options.num_ctx in your request body. Beats everything below.
  2. Server env var: OLLAMA_CONTEXT_LENGTH at server start. Beats VRAM auto.
  3. VRAM auto-detect: server-side default based on available VRAM.

VRAM Auto-Detect Defaults (when nothing else is set)

  • < 24 GiB: 4,096 tokens
  • 24–48 GiB: 32,768 tokens (32K)
  • ≥ 48 GiB: 262,144 tokens (256K)
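The whole chain condenses to a few lines. A minimal sketch of the selection logic as described in this section (resolve_num_ctx and vram_auto_num_ctx are illustrative names, not Ollama internals):

```python
def vram_auto_num_ctx(vram_gib: float) -> int:
    """Tier 3: VRAM auto-detect defaults as documented for v0.20.7."""
    if vram_gib >= 48:
        return 262_144   # 256K
    if vram_gib >= 24:
        return 32_768    # 32K
    return 4_096

def resolve_num_ctx(request_num_ctx=None, env_context_length=None, vram_gib=0.0):
    """Highest tier wins: API request > OLLAMA_CONTEXT_LENGTH > VRAM auto."""
    if request_num_ctx is not None:          # tier 1: options.num_ctx
        return request_num_ctx
    if env_context_length is not None:       # tier 2: server env var
        return int(env_context_length)
    return vram_auto_num_ctx(vram_gib)       # tier 3: auto-detect

print(resolve_num_ctx(vram_gib=464.0))                               # -> 262144
print(resolve_num_ctx(env_context_length="131072", vram_gib=464.0))  # -> 131072
print(resolve_num_ctx(request_num_ctx=8192, env_context_length="131072"))  # -> 8192
```

Note the last line: a per-request options.num_ctx beats your server env var even when it's smaller.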
Apple Silicon big-RAM example

On an M3 Ultra with 512 GB unified memory, the server reports total_vram="464.0 GiB" and default_num_ctx=262144: you get 256K context for free, no env var needed. If your client also sends options.num_ctx=65536, that 64K value wins for the request and the server quietly under-uses the model.

Pre-v0.15.5 hardcoded default

Older docs, blog posts, and AI-generated tutorials still say "Ollama defaults to 2048 tokens." That hasn't been true since early 2025 (PR #8938 introduced OLLAMA_CONTEXT_LENGTH; the VRAM auto-detect followed in Feb 2026). If a tutorial tells you to set OLLAMA_NUM_CTX=131072 "to escape the 2K default," it's wrong on both counts: wrong variable name, wrong default.

§5 num_gpu vs OLLAMA_NUM_GPU

The name collision causes endless confusion. Here's the full picture:

Where the name lives, its scope, and whether it exists in v0.20.7:
  • Server env var OLLAMA_NUM_GPU (would be a server-wide default): ❌ not in the current envconfig/config.go. Use the per-request route below.
  • API request options.num_gpu (per request): ✅ defined in api/types.go.
  • Modelfile PARAMETER num_gpu (per model): ✅ valid.

What it actually does

On a discrete-GPU system (NVIDIA / AMD), num_gpu controls how many model layers are offloaded to the GPU. Set it to 999 and Ollama puts all layers on GPU (or as many as fit). Lower it to spill layers back to system RAM.

On Apple Silicon, you don't need it

Unified memory means there's no CPU/GPU split to manage. The Metal backend processes all layers on the GPU automatically. Setting num_gpu on a Mac is a no-op at best, confusing at worst.

Per-request example (discrete GPU)

Python · ollama API
import httpx

httpx.post("http://localhost:11434/api/chat", json={
    "model": "llama3.1:70b",
    "messages": [{"role": "user", "content": "Hi"}],
    "options": {
        "num_gpu": 999,        # all layers to GPU
        "num_ctx": 8192,       # context for THIS request
    }
})
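The same two knobs can be baked into a model instead of repeated on every request. A minimal Modelfile sketch (the mymodel name is illustrative):

```
# Modelfile: per-model defaults instead of per-request options
FROM llama3.1:70b
PARAMETER num_gpu 999
PARAMETER num_ctx 8192
```

Build it with ollama create mymodel -f Modelfile; requests to mymodel then inherit both parameters unless a request's options override them.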

§6 Outdated Tutorials, Updated

Each of these was good advice at some point in Ollama's history. The project moved; the advice didn't. None of these are anyone's fault; they're snapshots that aged out.

Outdated 1: "Set OLLAMA_NUM_CTX to override the context window"

Today: that name was renamed to OLLAMA_CONTEXT_LENGTH (PR #8938). Set the old name and Ollama silently uses the VRAM-auto default; your model "works" because it fell back, not because the var was read. Set the current name, or pass options.num_ctx per request.

Outdated 2: "Ollama defaults to 2048 tokens, you must override"

Today: pre-v0.15.5 advice. Current Ollama auto-sizes from VRAM (4K / 32K / 256K tiers). On a 48 GiB+ machine the default is already 256K; you may not need to override at all.

Outdated 3: "Set OLLAMA_NUM_GPU=999 to use all GPU layers"

Today: a server-level env var with that name isn't in the current envconfig. The layer count became a per-request setting (options.num_gpu) or a Modelfile PARAMETER. On Apple Silicon you don't need it at all; Metal handles every layer automatically over unified memory.

Outdated 4: "OLLAMA_MLX=1 switches to the MLX backend"

Today: no MLX-backend toggle is exposed. Apple Silicon ships on the Metal backend. There may have been preview builds or community patches that hinted at MLX integration; the current released server has none of that surface area. For MLX-native inference run mlx-lm as its own process.

Still true 1: OLLAMA_FLASH_ATTENTION=1 is one of the biggest wins

In the official list. Enables flash attention. Cheap to set, large measurable gain.

Still true 2: OLLAMA_KEEP_ALIVE controls model unload timing

Accepts -1 (never unload), or a duration like 5m, 1h, 24h. -1 for always-on local AI.

Still true 3: OLLAMA_KV_CACHE_TYPE=q8_0 saves memory

Quantizes the KV cache: q8_0 halves cache memory with negligible quality loss; q4_0 quarters it with mild quality loss. The right knob when you want bigger context per request.
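Back-of-envelope for what q8_0 buys you. A minimal sketch, assuming the textbook KV cache layout (K and V tensors for every layer, f16 at 2 bytes per element) and an illustrative 8B-class shape (32 layers, 8 KV heads, head dim 128); these numbers are assumptions for arithmetic, not measured Ollama figures:

```python
def kv_cache_gib(num_ctx, n_layers=32, n_kv_heads=8, head_dim=128,
                 bytes_per_elem=2.0):
    """Approximate KV cache size: K and V tensors for every layer."""
    elems = 2 * n_layers * num_ctx * n_kv_heads * head_dim  # 2 = K and V
    return elems * bytes_per_elem / 2**30

ctx = 131_072  # 128K context
f16 = kv_cache_gib(ctx)                       # default f16 cache
q8  = kv_cache_gib(ctx, bytes_per_elem=1.0)   # q8_0: ~1 byte/elem (ignoring
q4  = kv_cache_gib(ctx, bytes_per_elem=0.5)   # block overhead), q4_0: ~0.5
print(f"f16: {f16:.1f} GiB, q8_0: {q8:.1f} GiB, q4_0: {q4:.1f} GiB")
# -> f16: 16.0 GiB, q8_0: 8.0 GiB, q4_0: 4.0 GiB
```

At 128K context the cache dwarfs many models' weights, which is why this is the knob to reach for when context, not the model, is what's eating your memory.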

A short rule of thumb

If your config came from a tutorial, AI assistant, or your own old notes more than a few months back: don't trust it on faith. Run ollama serve, watch the startup dump, and only believe what's in the map[...] line. Everything else is hopeful typing.

§7 Correct macOS Server Configs

A. Homebrew ollama serve from your shell

~/.zshrc or one-shot
export OLLAMA_HOST=0.0.0.0:11434
export OLLAMA_CONTEXT_LENGTH=131072      # explicit 128K (omit for VRAM-auto)
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
export OLLAMA_KEEP_ALIVE=-1              # never unload
export OLLAMA_NUM_PARALLEL=2             # concurrent requests
export OLLAMA_MAX_LOADED_MODELS=3

# Bonus (Apple Silicon, see ยง2 bonus box): not an OLLAMA_* var, but
# inherits to the Metal runner via fork/exec. Default 180s causes
# painful first-token latency after idle. 86400 = 24h warm buffers.
export GGML_METAL_RESIDENCY_KEEP_ALIVE_S=86400

ollama serve

B. LaunchAgent plist (auto-start at login)

~/Library/LaunchAgents/com.ollama.server.plist
<key>EnvironmentVariables</key>
<dict>
  <key>OLLAMA_HOST</key>                          <string>0.0.0.0:11434</string>
  <key>OLLAMA_CONTEXT_LENGTH</key>                <string>131072</string>  <!-- NOT OLLAMA_NUM_CTX -->
  <key>OLLAMA_FLASH_ATTENTION</key>               <string>1</string>
  <key>OLLAMA_KV_CACHE_TYPE</key>                 <string>q8_0</string>
  <key>OLLAMA_KEEP_ALIVE</key>                    <string>-1</string>
  <key>GGML_METAL_RESIDENCY_KEEP_ALIVE_S</key>    <string>86400</string>  <!-- ggml-layer (ยง2 bonus) -->
</dict>

C. The Ollama.app GUI gotcha

GUI apps don't inherit your shell env

If you run the macOS Ollama.app instead of ollama serve, none of your shell exports reach it. You have two choices:

Option 1: launchctl setenv (system-wide)
launchctl setenv OLLAMA_CONTEXT_LENGTH 131072
launchctl setenv OLLAMA_FLASH_ATTENTION 1
launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0

# Then quit and relaunch Ollama.app

Option 2: use the Settings UI inside recent Ollama.app builds โ€” there's a context-length slider and a few performance toggles.

§8 Verify Your Config

Ollama dumps every config it recognizes at server startup. This is the ground truth.

Watching ollama serve output
$ ollama serve

time=2026-04-17T10:32:15Z level=INFO source=routes.go:1742 \
  msg="server config" \
  env="map[OLLAMA_CONTEXT_LENGTH:131072 \
       OLLAMA_DEBUG:INFO \
       OLLAMA_FLASH_ATTENTION:true \
       OLLAMA_GPU_OVERHEAD:0 \
       OLLAMA_HOST:http://0.0.0.0:11434 \
       OLLAMA_KEEP_ALIVE:-1 \
       OLLAMA_KV_CACHE_TYPE:q8_0 \
       OLLAMA_MAX_LOADED_MODELS:3 \
       OLLAMA_NUM_PARALLEL:2 \
       ...]"
The one verification rule that survives version drift

If a variable you exported isn't in the map[...] dump, this version of Ollama isn't reading it. Either it was renamed, or it never landed in the env layer. Either way: drop it from your config, or look up the current name. The dump is the only source of truth that can't lie.
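To script the check instead of eyeballing it, the env="map[...]" payload yields to a couple of regexes. A minimal sketch against a shortened version of the sample line above (the dump is informal log output and may change between releases; parse_env_map is an illustrative helper):

```python
import re

def parse_env_map(log_line: str) -> dict:
    """Extract KEY:VALUE pairs from an Ollama 'server config' env="map[...]" line."""
    m = re.search(r'env="map\[(.*)\]"', log_line, re.DOTALL)
    if not m:
        return {}
    return dict(re.findall(r'(OLLAMA_[A-Z_]+):(\S+)', m.group(1)))

line = ('msg="server config" env="map[OLLAMA_CONTEXT_LENGTH:131072 '
        'OLLAMA_FLASH_ATTENTION:true OLLAMA_KEEP_ALIVE:-1]"')
cfg = parse_env_map(line)
print(cfg["OLLAMA_CONTEXT_LENGTH"])  # -> 131072
assert "OLLAMA_NUM_CTX" not in cfg   # ghost vars never appear in the dump
```

Feed it the real line from your log and diff the keys against what you exported; any missing key is a var this build never read.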

Quick one-liner to check defaults

bash
# Restart server, capture first 50 lines
ollama serve 2>&1 | head -n 50 | grep -E "server config|total_vram|default_num_ctx"

You'll see your env vars and the auto-detected VRAM-based context default in one shot.

Bonus: verify the ggml-layer env (ยง2 bonus)

The map[...] dump only covers OLLAMA_* vars. To confirm a non-Ollama env like GGML_METAL_RESIDENCY_KEEP_ALIVE_S reached the runner, check two places:

bash
# 1. Is the env on the live server process?
ps eww $(pgrep -f 'ollama serve' | head -1) | tr ' ' '\n' | grep -E 'OLLAMA_|GGML_'

# 2. Did the runner actually use it on model load?
grep -m1 'residency set collection' ~/Library/Logs/ollama.log
# expected: ... (keep_alive = 86400 s)
# default:  ... (keep_alive = 180 s)

Same principle generalizes: any env from a library Ollama links against (ggml, Metal, CUDA) won't show up in Ollama's own startup dump but can still take effect via standard POSIX fork/exec inheritance. ps eww on the live process and the runner's own log lines are the only way to confirm.

§9 Sources

Last verification

This page was fact-checked on 2026-04-17 against Ollama v0.20.7. If you're reading this much later, re-verify by running ollama serve and checking the config dump; that's the source of truth that can't lie.