โ† Boot Camp Hub REFERENCE ยท FACT-CHECK ยท 2026-04-17
👻

Ollama Environment Variables

Version Drift, Ghost Vars, and Why Your Old Config Still Says It "Works"

Ollama is a fast-moving open-source project. Variables that worked a year or two ago (OLLAMA_NUM_CTX, OLLAMA_NUM_GPU, OLLAMA_MLX) were renamed, replaced, or removed without deprecation warnings. Today's server silently ignores them. This page is a snapshot of what's recognized right now, and the discipline you need to keep up.

Verified: 2026-04-17 · Ollama v0.20.7 · Source: envconfig/config.go
Contents
  1. The Open-Source Convention: read this first
  2. TL;DR
  3. The Ghost Variables (silently ignored) + bonus: GGML_METAL_RESIDENCY_KEEP_ALIVE_S
  4. The Official 25 Variables
  5. Context Length: 3-Tier Priority
  6. num_gpu vs OLLAMA_NUM_GPU
  7. Outdated Tutorials, Updated
  8. Correct macOS Server Configs
  9. Verify Your Config (server log dump)
  10. Sources

§0 The Open-Source Convention: A Universal Reading Order

Before you read any specific fact on this page, internalize the meta-rule it depends on: open-source documentation, including this page, will go stale. That's not a flaw; it's the convention.

The OSS contract, in plain words
  • The code ships first; the docs follow late, if at all. Maintainers prioritize features, fixes, and refactors. README updates, blog posts, and tutorials trail the code by weeks, months, or forever.
  • Renames, removals, and silent breakage are normal. No one is obligated to keep your old config working. Deprecation warnings are a courtesy, not a guarantee.
  • Tutorials are time capsules. A blog post from last quarter wasn't wrong โ€” it was correct on the day it was written. The project moved; the post didn't. The author isn't at fault. Neither is the AI that quoted it.
  • Bare-hands collision is the entry fee. When something doesn't work the way the docs say, you check the source, the issues, the recent commits, the startup logs โ€” in that order. This is the basic philosophy of OSS, not a punishment.
The one habit that survives all version drift

For every OSS tool you depend on, learn how it tells you what it actually loaded: startup config dumps, --version, --help flags, the equivalent of ollama serve's map[...] line. That output is the only source of truth that can't lie, because it's the program reporting its own state. Documents (this one included) are commentary; the program itself is the ground truth.

This page is dated 2026-04-17, Ollama v0.20.7. Treat every claim below as "true on that date, in that build", not as eternal scripture. When something stops working months from now, your first move isn't to assume the page lied; it's to run ollama serve, watch the dump, and check the recent commits to envconfig/config.go. That habit will outlive any specific name on this page.

§1 TL;DR

Several Ollama "tuning variables" you've seen in blog posts, Stack Overflow answers, and AI assistant outputs were real settings in earlier versions, but the current server (v0.20.7) doesn't read them. They were renamed, replaced, or quietly removed as the project evolved, and Ollama emits no deprecation warning when it encounters them.

Why this isn't anyone's fault: it's the cost of velocity

Ollama is one of the fastest-moving open-source LLM projects. Env vars get renamed for clarity, removed when superseded by API options, or dropped when a backend is restructured. None of the old tutorials are lying; they're just frozen in time. This is the chronic tax of riding a fast OSS train: the only ground truth is the version you have running today.

⚠ The Silent-Failure Trap

Ollama doesn't warn on unrecognized OLLAMA_* variables; it just ignores them. If you set OLLAMA_NUM_CTX=131072 and the model "works fine," the server actually used the VRAM-auto default (or your OLLAMA_CONTEXT_LENGTH value, if also set). You credited a setting that was never read. Correlation ≠ causation.

Five names whose meaning has drifted out from under tutorials, and what works in the current build:

  • OLLAMA_NUM_CTX (server-wide context length): use OLLAMA_CONTEXT_LENGTH (env) or options.num_ctx (per request).
  • OLLAMA_NUM_GPU (force layers to GPU at server level): use per-request options.num_gpu or Modelfile PARAMETER num_gpu (discrete GPUs only).
  • OLLAMA_N_GPU_LAYERS (llama.cpp-style layer offload): same as above.
  • OLLAMA_MLX (toggle MLX backend on Apple Silicon, preview era): not present in current envconfig. Apple Silicon ships on the Metal backend; for MLX-native inference, run mlx-lm directly.
  • OLLAMA_MAX_CONTEXT (cap server context): use OLLAMA_CONTEXT_LENGTH.

§2 Ghost Variables: Names That Drifted Out

These names appear in many tutorials, screenshots, and AI-generated configs. They were valid at some point in Ollama's history (or were inherited from llama.cpp by reasonable analogy), but the current server doesn't read them. No deprecation warning, no log line: they're just inert.

Three reasons names go ghost
  • Renamed for clarity โ€” e.g. OLLAMA_NUM_CTX โ†’ OLLAMA_CONTEXT_LENGTH (PR #8938, Feb 2025).
  • Demoted to per-request API option โ€” e.g. server-wide OLLAMA_NUM_GPU never made it into the env config; layer count became an API/Modelfile parameter (options.num_gpu).
  • Backend restructure โ€” e.g. an MLX-backend toggle that may have surfaced briefly during preview work but is not in the current envconfig; Apple Silicon ships on Metal today.
Name, why people set it, and its status in v0.20.7:
  • OLLAMA_NUM_CTX ("sets default context window"): renamed. Maintainer note in issue #14130 confirms it's not a current config var. Use OLLAMA_CONTEXT_LENGTH.
  • OLLAMA_NUM_GPU ("force all layers to GPU at server level"): not in current envconfig. The layer-count knob lives at request time: API options.num_gpu or Modelfile PARAMETER num_gpu.
  • OLLAMA_N_GPU_LAYERS (llama.cpp's n_gpu_layers with an OLLAMA_ prefix): never transferred to Ollama's env layer. Same fix as above.
  • OLLAMA_MLX (toggle MLX backend on Mac, a preview-era assumption): not present in current envconfig. Apple Silicon runs on the Metal backend today; no backend switch is exposed. For MLX-native inference, use mlx-lm directly.
  • OLLAMA_MAX_CONTEXT ("hard cap on context length"): maintainer note says it's not a current config var. Use OLLAMA_CONTEXT_LENGTH.
  • OLLAMA_GPU_LAYERS (variant naming you'll see in mixed tutorials): not in the env config. Same fix as OLLAMA_NUM_GPU.
  • OLLAMA_CACHE_SIZE ("limit KV cache memory"): not in the env config. The current memory knob is OLLAMA_KV_CACHE_TYPE=q8_0 (quantizes the cache).
  • OLLAMA_MEMORY_MODE (generic "memory tuning" suggestion in various tutorials): not in the env config.
How to tell if a name is current

Compare it against the live list (next section), and against the dump that ollama serve prints at startup. The startup dump is the only authoritative source; even this page goes stale eventually.

Bonus: the inverse case, a non-OLLAMA_* env that does reach the runner

Ghost vars are OLLAMA_*-prefixed names the server doesn't read. The mirror case is more interesting: a non-OLLAMA_*-prefixed env that nonetheless takes effect through plain POSIX inheritance, and that no public Ollama doc mentions.

🧪 Field discovery: GGML_METAL_RESIDENCY_KEEP_ALIVE_S

Origin. Not an Ollama env at all: it belongs to the embedded ggml library that powers Ollama's Metal backend on Apple Silicon. The ggml C code reads it directly via getenv("GGML_METAL_RESIDENCY_KEEP_ALIVE_S"). It controls how long Metal residency-set buffers stay resident after the model goes idle.

Why it works without Ollama "supporting" it. No special passthrough is needed. The chain is plain POSIX:

  1. LaunchAgent (or shell) sets the env on the ollama serve process.
  2. serve spawns model-runner children via fork/exec; children inherit the full env.
  3. Inside the runner, ggml's Metal backend reads the var directly. No Ollama code touches it.
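Steps 1 to 3 are ordinary process mechanics, demonstrable with any parent and child pair. A minimal Python sketch (nothing Ollama-specific; the child process stands in for the model runner):

```python
import os
import subprocess
import sys

# Parent: set the var, as a LaunchAgent or shell would for `ollama serve`.
env = dict(os.environ, GGML_METAL_RESIDENCY_KEEP_ALIVE_S="86400")

# Child: stands in for the model runner. It reads the var via plain getenv,
# exactly as ggml's C code does -- no passthrough logic exists anywhere.
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['GGML_METAL_RESIDENCY_KEEP_ALIVE_S'])"],
    env=env, capture_output=True, text=True,
)
print(child.stdout.strip())  # -> 86400
```

That is the whole mechanism: any env set on the parent reaches every child unless something actively scrubs it.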

Why you'd care. The default is 180 s. After any idle period longer than that, the residency set is torn down and the next request pays a brutal first-token latency rebuilding it ("only the first one is painfully slow"). Bumping it to 86400 (24 h) keeps buffers warm across normal idle gaps on an always-on local server.

How to verify it actually landed. The runner log emits a single line per model load:

grep -m1 'residency set collection' ~/Library/Logs/ollama.log
# expected when env set: ... (keep_alive = 86400 s)
# default (env missing):  ... (keep_alive = 180 s)

Why this is in §2 at all. It's the live counter-example to the OSS-convention warning above: not every effective env is in the project's "official" list, and not every name in the official list is the only knob that matters. The only way to know what actually took effect is the program's self-report, meaning log lines and startup dumps. That's the §0 lesson in concrete form: this var is undocumented in any Ollama-side source and was learned by log inspection on a production fleet. Future Ollama or ggml releases may rename or retire it; re-verify when in doubt.

§3 The Official 25 Variables

Sourced from envconfig/config.go in Ollama v0.20.7. These are the only OLLAMA_* env vars the server actually parses.

OLLAMA_AUTH
OLLAMA_CONTEXT_LENGTH
OLLAMA_DEBUG
OLLAMA_DEBUG_LOG_REQUESTS
OLLAMA_EDITOR
OLLAMA_FLASH_ATTENTION
OLLAMA_GPU_OVERHEAD
OLLAMA_HOST
OLLAMA_KEEP_ALIVE
OLLAMA_KV_CACHE_TYPE
OLLAMA_LLM_LIBRARY
OLLAMA_LOAD_TIMEOUT
OLLAMA_MAX_LOADED_MODELS
OLLAMA_MAX_QUEUE
OLLAMA_MODELS
OLLAMA_MULTIUSER_CACHE
OLLAMA_NEW_ENGINE
OLLAMA_NOHISTORY
OLLAMA_NOPRUNE
OLLAMA_NO_CLOUD
OLLAMA_NUM_PARALLEL
OLLAMA_ORIGINS
OLLAMA_REMOTES
OLLAMA_SCHED_SPREAD
OLLAMA_VULKAN

Anything else with the OLLAMA_ prefix that doesn't appear above isn't being read by this version. Good news: the names you'll reach for in production are here: flash attention, KV cache type, keep-alive, host, parallel requests, max loaded models.
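To catch ghost vars in your own shell, diff every OLLAMA_*-prefixed name you exported against the list above. A minimal sketch (the hardcoded set is this page's v0.20.7 snapshot and goes stale with it; find_ghost_vars is an illustrative helper, not anything Ollama ships):

```python
import os

# Snapshot of envconfig/config.go, Ollama v0.20.7 (the 25-name list above).
OFFICIAL_V0_20_7 = {
    "OLLAMA_AUTH", "OLLAMA_CONTEXT_LENGTH", "OLLAMA_DEBUG",
    "OLLAMA_DEBUG_LOG_REQUESTS", "OLLAMA_EDITOR", "OLLAMA_FLASH_ATTENTION",
    "OLLAMA_GPU_OVERHEAD", "OLLAMA_HOST", "OLLAMA_KEEP_ALIVE",
    "OLLAMA_KV_CACHE_TYPE", "OLLAMA_LLM_LIBRARY", "OLLAMA_LOAD_TIMEOUT",
    "OLLAMA_MAX_LOADED_MODELS", "OLLAMA_MAX_QUEUE", "OLLAMA_MODELS",
    "OLLAMA_MULTIUSER_CACHE", "OLLAMA_NEW_ENGINE", "OLLAMA_NOHISTORY",
    "OLLAMA_NOPRUNE", "OLLAMA_NO_CLOUD", "OLLAMA_NUM_PARALLEL",
    "OLLAMA_ORIGINS", "OLLAMA_REMOTES", "OLLAMA_SCHED_SPREAD",
    "OLLAMA_VULKAN",
}

def find_ghost_vars(environ=os.environ):
    """Return OLLAMA_*-prefixed names this server version would silently ignore."""
    return sorted(k for k in environ
                  if k.startswith("OLLAMA_") and k not in OFFICIAL_V0_20_7)

if __name__ == "__main__":
    for name in find_ghost_vars():
        print(f"ghost var (silently ignored): {name}")
```

Run it in the same shell (or LaunchAgent env) that starts ollama serve; any name it prints is hopeful typing.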

This list is a snapshot of v0.20.7. Future releases will add and remove. Re-verify against your local ollama serve startup dump rather than trusting any document (including this one) indefinitely.

§4 Context Length: 3-Tier Priority

This is the most-misunderstood Ollama setting. As of v0.15.5+, the server picks a context length using a three-tier priority chain. Understanding this kills 90% of "why is my context wrong?" confusion.

The Three Tiers (highest priority wins)

  1. API request: options.num_ctx in your request body. Beats everything below.
  2. Server env var: OLLAMA_CONTEXT_LENGTH at server start. Beats VRAM auto.
  3. VRAM auto-detect: server-side default based on available VRAM.

VRAM Auto-Detect Defaults (when nothing else is set)

  • < 24 GiB: 4,096 tokens
  • 24–48 GiB: 32,768 tokens (32K)
  • ≥ 48 GiB: 262,144 tokens (256K)
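The whole chain condenses to a few lines. A minimal sketch of the selection logic as described in this section (resolve_num_ctx and vram_auto_num_ctx are illustrative names, not Ollama internals):

```python
def vram_auto_num_ctx(vram_gib: float) -> int:
    """Tier 3: VRAM auto-detect defaults as documented for v0.20.7."""
    if vram_gib >= 48:
        return 262_144   # 256K
    if vram_gib >= 24:
        return 32_768    # 32K
    return 4_096

def resolve_num_ctx(request_num_ctx=None, env_context_length=None, vram_gib=0.0):
    """Highest tier wins: API request > OLLAMA_CONTEXT_LENGTH > VRAM auto."""
    if request_num_ctx is not None:          # tier 1: options.num_ctx
        return request_num_ctx
    if env_context_length is not None:       # tier 2: server env var
        return int(env_context_length)
    return vram_auto_num_ctx(vram_gib)       # tier 3: auto-detect

print(resolve_num_ctx(vram_gib=464.0))                               # -> 262144
print(resolve_num_ctx(env_context_length="131072", vram_gib=464.0))  # -> 131072
print(resolve_num_ctx(request_num_ctx=8192, env_context_length="131072"))  # -> 8192
```

Note the last line: a per-request options.num_ctx beats your server env var even when it's smaller.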
Apple Silicon big-RAM example

On an M3 Ultra with 512 GB unified memory, the server reports total_vram="464.0 GiB" and default_num_ctx=262144: you get 256K context for free, no env var needed. If your client also sends options.num_ctx=65536, that 64K value wins for the request and the server quietly under-uses the model.

Pre-v0.15.5 hardcoded default

Older docs, blog posts, and AI-generated tutorials still say "Ollama defaults to 2048 tokens." That hasn't been true since early 2025 (PR #8938 introduced OLLAMA_CONTEXT_LENGTH; the VRAM auto-detect followed in Feb 2026). If a tutorial tells you to set OLLAMA_NUM_CTX=131072 "to escape the 2K default," it's wrong on both counts: wrong variable name, wrong default.

§5 num_gpu vs OLLAMA_NUM_GPU

The name collision causes endless confusion. Here's the full picture:

Where the name lives, its scope, and whether it exists in v0.20.7:
  • Server env var OLLAMA_NUM_GPU (would be a server-wide default): ❌ not in the current envconfig/config.go. Use the per-request route below.
  • API request options.num_gpu (per request): ✅ defined in api/types.go.
  • Modelfile PARAMETER num_gpu (per model): ✅ valid.

What it actually does

On a discrete-GPU system (NVIDIA / AMD), num_gpu controls how many model layers are offloaded to the GPU. Set it to 999 and Ollama puts all layers on GPU (or as many as fit). Lower it to spill layers back to system RAM.

On Apple Silicon, you don't need it

Unified memory means there's no CPU/GPU split to manage. The Metal backend processes all layers on the GPU automatically. Setting num_gpu on a Mac is a no-op at best, confusing at worst.

Per-request example (discrete GPU)

Python · ollama API
import httpx

httpx.post("http://localhost:11434/api/chat", json={
    "model": "llama3.1:70b",
    "messages": [{"role": "user", "content": "Hi"}],
    "options": {
        "num_gpu": 999,        # all layers to GPU
        "num_ctx": 8192,       # context for THIS request
    }
})
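The same two knobs can be baked into a model instead of repeated on every request. A minimal Modelfile sketch (the mymodel name is illustrative):

```
# Modelfile: per-model defaults instead of per-request options
FROM llama3.1:70b
PARAMETER num_gpu 999
PARAMETER num_ctx 8192
```

Build it with ollama create mymodel -f Modelfile; requests to mymodel then inherit both parameters unless a request's options override them.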

§6 Outdated Tutorials, Updated

Each of these was good advice at some point in Ollama's history. The project moved; the advice didn't. None of these are anyone's fault; they're snapshots that aged out.

Outdated 1: "Set OLLAMA_NUM_CTX to override the context window"

Today: that name was renamed to OLLAMA_CONTEXT_LENGTH (PR #8938). Set the old name and Ollama silently uses the VRAM-auto default; your model "works" because it fell back, not because the var was read. Set the current name, or pass options.num_ctx per request.

Outdated 2: "Ollama defaults to 2048 tokens, you must override"

Today: pre-v0.15.5 advice. Current Ollama auto-sizes from VRAM (4K / 32K / 256K tiers). On a 48 GiB+ machine the default is already 256K; you may not need to override at all.

Outdated 3: "Set OLLAMA_NUM_GPU=999 to use all GPU layers"

Today: a server-level env var with that name isn't in the current envconfig. The layer count became a per-request setting (options.num_gpu) or a Modelfile PARAMETER. On Apple Silicon you don't need it at all; Metal handles every layer automatically over unified memory.

Outdated 4: "OLLAMA_MLX=1 switches to the MLX backend"

Today: no MLX-backend toggle is exposed. Apple Silicon ships on the Metal backend. There may have been preview builds or community patches that hinted at MLX integration; the current released server has none of that surface area. For MLX-native inference run mlx-lm as its own process.

Still true 1: OLLAMA_FLASH_ATTENTION=1 is one of the biggest wins

In the official list. Enables flash attention. Cheap to set, large measurable gain.

Still true 2: OLLAMA_KEEP_ALIVE controls model unload timing

Accepts -1 (never unload), or a duration like 5m, 1h, 24h. -1 for always-on local AI.

Still true 3: OLLAMA_KV_CACHE_TYPE=q8_0 saves memory

Quantizes the KV cache: q8_0 halves cache memory with negligible quality loss; q4_0 quarters it with mild quality loss. The right knob when you want bigger context per request.
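Back-of-envelope for what q8_0 buys you. A minimal sketch, assuming the textbook KV cache layout (K and V tensors for every layer, f16 at 2 bytes per element) and an illustrative 8B-class shape (32 layers, 8 KV heads, head dim 128); these numbers are assumptions for arithmetic, not measured Ollama figures:

```python
def kv_cache_gib(num_ctx, n_layers=32, n_kv_heads=8, head_dim=128,
                 bytes_per_elem=2.0):
    """Approximate KV cache size: K and V tensors for every layer."""
    elems = 2 * n_layers * num_ctx * n_kv_heads * head_dim  # 2 = K and V
    return elems * bytes_per_elem / 2**30

ctx = 131_072  # 128K context
f16 = kv_cache_gib(ctx)                       # default f16 cache
q8  = kv_cache_gib(ctx, bytes_per_elem=1.0)   # q8_0: ~1 byte/elem (ignoring
q4  = kv_cache_gib(ctx, bytes_per_elem=0.5)   # block overhead), q4_0: ~0.5
print(f"f16: {f16:.1f} GiB, q8_0: {q8:.1f} GiB, q4_0: {q4:.1f} GiB")
# -> f16: 16.0 GiB, q8_0: 8.0 GiB, q4_0: 4.0 GiB
```

At 128K context the cache dwarfs many models' weights, which is why this is the knob to reach for when context, not the model, is what's eating your memory.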

A short rule of thumb

If your config came from a tutorial, AI assistant, or your own old notes more than a few months back: don't trust it on faith. Run ollama serve, watch the startup dump, and only believe what's in the map[...] line. Everything else is hopeful typing.

§7 Correct macOS Server Configs

A. Homebrew ollama serve from your shell

~/.zshrc or one-shot
export OLLAMA_HOST=0.0.0.0:11434
export OLLAMA_CONTEXT_LENGTH=131072      # explicit 128K (omit for VRAM-auto)
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
export OLLAMA_KEEP_ALIVE=-1              # never unload
export OLLAMA_NUM_PARALLEL=2             # concurrent requests
export OLLAMA_MAX_LOADED_MODELS=3

# Bonus (Apple Silicon, see ยง2 bonus box): not an OLLAMA_* var, but
# inherits to the Metal runner via fork/exec. Default 180s causes
# painful first-token latency after idle. 86400 = 24h warm buffers.
export GGML_METAL_RESIDENCY_KEEP_ALIVE_S=86400

ollama serve

B. LaunchAgent plist (auto-start at login)

~/Library/LaunchAgents/com.ollama.server.plist
<key>EnvironmentVariables</key>
<dict>
  <key>OLLAMA_HOST</key>                          <string>0.0.0.0:11434</string>
  <key>OLLAMA_CONTEXT_LENGTH</key>                <string>131072</string>  <!-- NOT OLLAMA_NUM_CTX -->
  <key>OLLAMA_FLASH_ATTENTION</key>               <string>1</string>
  <key>OLLAMA_KV_CACHE_TYPE</key>                 <string>q8_0</string>
  <key>OLLAMA_KEEP_ALIVE</key>                    <string>-1</string>
  <key>GGML_METAL_RESIDENCY_KEEP_ALIVE_S</key>    <string>86400</string>  <!-- ggml-layer (ยง2 bonus) -->
</dict>

C. The Ollama.app GUI gotcha

GUI apps don't inherit your shell env

If you run the macOS Ollama.app instead of ollama serve, none of your shell exports reach it. You have two choices:

Option 1: launchctl setenv (system-wide)
launchctl setenv OLLAMA_CONTEXT_LENGTH 131072
launchctl setenv OLLAMA_FLASH_ATTENTION 1
launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0

# Then quit and relaunch Ollama.app

Option 2: use the Settings UI inside recent Ollama.app builds โ€” there's a context-length slider and a few performance toggles.

§8 Verify Your Config

Ollama dumps every config it recognizes at server startup. This is the ground truth.

Watching ollama serve output
$ ollama serve

time=2026-04-17T10:32:15Z level=INFO source=routes.go:1742 \
  msg="server config" \
  env="map[OLLAMA_CONTEXT_LENGTH:131072 \
       OLLAMA_DEBUG:INFO \
       OLLAMA_FLASH_ATTENTION:true \
       OLLAMA_GPU_OVERHEAD:0 \
       OLLAMA_HOST:http://0.0.0.0:11434 \
       OLLAMA_KEEP_ALIVE:-1 \
       OLLAMA_KV_CACHE_TYPE:q8_0 \
       OLLAMA_MAX_LOADED_MODELS:3 \
       OLLAMA_NUM_PARALLEL:2 \
       ...]"
The one verification rule that survives version drift

If a variable you exported isn't in the map[...] dump, this version of Ollama isn't reading it. Either it was renamed, or it never landed in the env layer. Either way: drop it from your config, or look up the current name. The dump is the only source of truth that can't lie.
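To script the check instead of eyeballing it, the env="map[...]" payload yields to a couple of regexes. A minimal sketch against a shortened version of the sample line above (the dump is informal log output and may change between releases; parse_env_map is an illustrative helper):

```python
import re

def parse_env_map(log_line: str) -> dict:
    """Extract KEY:VALUE pairs from an Ollama 'server config' env="map[...]" line."""
    m = re.search(r'env="map\[(.*)\]"', log_line, re.DOTALL)
    if not m:
        return {}
    return dict(re.findall(r'(OLLAMA_[A-Z_]+):(\S+)', m.group(1)))

line = ('msg="server config" env="map[OLLAMA_CONTEXT_LENGTH:131072 '
        'OLLAMA_FLASH_ATTENTION:true OLLAMA_KEEP_ALIVE:-1]"')
cfg = parse_env_map(line)
print(cfg["OLLAMA_CONTEXT_LENGTH"])  # -> 131072
assert "OLLAMA_NUM_CTX" not in cfg   # ghost vars never appear in the dump
```

Feed it the real line from your log and diff the keys against what you exported; any missing key is a var this build never read.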

Quick one-liner to check defaults

bash
# Restart server, capture first 50 lines
ollama serve 2>&1 | head -n 50 | grep -E "server config|total_vram|default_num_ctx"

You'll see your env vars and the auto-detected VRAM-based context default in one shot.

Bonus: verify the ggml-layer env (ยง2 bonus)

The map[...] dump only covers OLLAMA_* vars. To confirm a non-Ollama env like GGML_METAL_RESIDENCY_KEEP_ALIVE_S reached the runner, check two places:

bash
# 1. Is the env on the live server process?
ps eww $(pgrep -f 'ollama serve' | head -1) | tr ' ' '\n' | grep -E 'OLLAMA_|GGML_'

# 2. Did the runner actually use it on model load?
grep -m1 'residency set collection' ~/Library/Logs/ollama.log
# expected: ... (keep_alive = 86400 s)
# default:  ... (keep_alive = 180 s)

Same principle generalizes: any env from a library Ollama links against (ggml, Metal, CUDA) won't show up in Ollama's own startup dump but can still take effect via standard POSIX fork/exec inheritance. ps eww on the live process and the runner's own log lines are the only way to confirm.

§9 Sources

Last verification

This page was fact-checked on 2026-04-17 against Ollama v0.20.7. If you're reading this much later, re-verify by running ollama serve and checking the config dump; that's the source of truth that can't lie.