ollama launch opencode
As of early 2026, Ollama introduced the launch command. This automates the connection between the OpenCode agent and your models.
Local Power: You run Qwen 3.5 Coder (14B) locally for free. It’s ultra-fast on your M5 and handles 90% of your coding (refactoring, unit tests, explanations).
Cloud Bursting: When you hit a complex architectural problem, you switch OpenCode to Ollama Cloud (using your 20€/month sub) to access GPT-OSS (120B) or DeepSeek V3.
The M5’s Neural Accelerators allow OpenCode to “read” your entire repository into memory almost instantly.
Local Indexing: OpenCode uses your local RAM to index your files. With 24GB, it can hold a massive context of your code without slowing down.
The Switch: Inside OpenCode, type /model gpt-oss:120b-cloud. OpenCode sends only the necessary snippet to the cloud, lets the larger model solve the hard logic, and brings the result back to your local M5.
Within OpenCode, use the command /config to set your M5 to “Local-First” mode. This ensures you only spend your cloud credits (your 20€) when the local model fails a task.
Error
This error is most likely a Metal Backend Conflict or an Allocation Deadlock.
Since the M5 uses a new 2nm/3nm architecture, Ollama occasionally hits a “shader compilation error” where it tries to load a model using bfloat16 weights that conflict with the system’s active memory compression.
Here is how to fix it immediately so you can get back to OpenCode:
- The “Clean Slate” Fix (Most Likely)
Ollama sometimes “hangs” a model in the background that didn’t unload properly, taking up your 24GB.
Force Quit: Open Activity Monitor, search for ollama, and kill all processes (especially ollama_llama_server).
Restart the App: Re-open the Ollama dashboard.
Try a “Small” Test: Run ollama run phi4:mini. If this loads, your hardware is fine and the previous model was just too large or corrupted.
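As a command-line alternative to Activity Monitor, the sketch below asks Ollama which models it still holds in memory and unloads them. It assumes the ollama ps table keeps the model name in its first column after a header row; the function names loaded_models and unload_all_models are made up for this example.

```shell
# Print the model names from an `ollama ps` table read on stdin.
# Assumes one header row and the model name in column 1.
loaded_models() {
  awk 'NR > 1 && NF > 0 { print $1 }'
}

# Ask Ollama to unload every model it currently reports as loaded.
# Nothing runs until you call this function yourself.
unload_all_models() {
  ollama ps | loaded_models | while read -r model; do
    echo "Stopping ${model}"
    ollama stop "${model}"
  done
}
```

Call unload_all_models after sourcing the snippet, then retry the small test model. If ollama ps itself hangs, fall back to the Force Quit step.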
- Check your “Available” vs “Total” RAM
Even with 24GB, if you have Chrome with 50 tabs, Slack, and VS Code open, your “Available” memory might be under 10GB.
The M5 Rule: macOS requires about 20% of your RAM for “system overhead” to prevent the OS from freezing.
The Fix: Close your browser or heavy apps and try loading the model again. If you are trying to run a 30B+ model, it will fail on 24GB RAM. Stick to 14B or smaller for total stability.
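The 20% rule can be turned into a quick budget check. This is a rough sketch using integer arithmetic in MB; the 20% reserve is the rule of thumb from the text above, not a figure published by Apple, and the function names are hypothetical.

```shell
# Usable RAM in MB after reserving a share for macOS.
# $1 = total RAM in MB, $2 = reserve percent (the rule of thumb above is 20).
usable_mb() {
  echo $(( $1 * (100 - $2) / 100 ))
}

# Succeeds (exit 0) when a model of $1 MB fits inside the usable budget.
# $1 = model size in MB, $2 = total RAM in MB, $3 = reserve percent.
model_fits() {
  [ "$1" -le "$(usable_mb "$2" "$3")" ]
}
```

For example, model_fits 9830 24576 20 && echo fits checks a model of roughly 9.6GB against a 24GB machine. The check ignores the context window and whatever other apps are using, so treat a pass as necessary rather than sufficient.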
- Fix the “Metal Shader” Internal Error
If the error persists even with tiny models, it’s a driver handshake issue.
Run this command in Terminal:
OLLAMA_DEBUG=1 ollama serve
Watch the logs. If you see ggml_metal_init: error, your M5 is having trouble compiling the GPU shaders for that specific model.
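Solution: Update Ollama. The Ollama CLI does not list an update subcommand; on macOS, update from the Ollama menu-bar app or install the latest release from ollama.com. If you’re on a beta build of macOS, you may need to force CPU mode temporarily, though this will be slower. OLLAMA_GPU_OVERRIDE does not appear among Ollama’s documented settings; a commonly used route is to set the num_gpu parameter to 0, for example by typing /set parameter num_gpu 0 inside an ollama run session.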
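Rather than watching the debug output scroll past, you can save it to a file and search for the error string mentioned above. A minimal sketch, assuming the log was captured to a file of your choosing (ollama-debug.log here is just an example name) and that the error text matches ggml_metal_init: error exactly:

```shell
# Capture the debug log first, in a separate terminal, for example:
#   OLLAMA_DEBUG=1 ollama serve 2>&1 | tee ollama-debug.log

# Succeeds (exit 0) when the log file $1 contains the Metal init error string.
has_metal_init_error() {
  grep -q 'ggml_metal_init: error' "$1"
}
```

A call such as has_metal_init_error ollama-debug.log && echo "Metal shader init failed" then tells you whether to continue with the update or CPU-mode steps.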
- OpenCode Conflict
If you are using OpenCode, it might be trying to load a “secondary” embedding model (for indexing your files) at the same time as your main chat model.
The Fix: Go into your OpenCode settings and ensure “Vector Indexing” is set to Local with the “Small Model” option. If OpenCode tries to load a 14B model for indexing and a 14B model for chatting, the two together can reach 28GB+ (assuming roughly 1GB per billion parameters, as with 8-bit weights) and crash your 24GB system every time.
If you are using Ollama Cloud and the error appears while connecting:
Check your Internet: The “retrying in 3s” usually means the handshake with the Ollama servers failed.
Verify Subscription: Ensure your 20€ payment went through. If the “Free Tier” concurrency limit (1 model) is hit, the Cloud will reject your load request until the previous session times out.
## Gemma
The error you are seeing is almost certainly a RAM collision. Even though you have an M5 with 24GB, Gemma 4 (released April 2026) is a “trimodal” beast that eats memory differently than previous models.
Here is why your M5 (24GB) is retrying and how to fix it:
- The “Size” Problem
Google released Gemma 4 in four main sizes. With 24GB of RAM, here is the reality of what will fit:
Gemma 4 - E4B (~8B): Safe. This fits perfectly. It uses about 9.6GB.
Gemma 4 - 26B A4B (MoE): The Danger Zone. While only 3.8B parameters are “active” at a time, the entire 26B model must sit in your RAM to work. At a standard 4-bit quantization, this takes up roughly 17–19GB.
Why it’s failing: 19GB (Model) + 5GB (macOS) = 24GB. You have 0MB left for the context window or other apps. One Slack notification or Chrome tab will crash the model load.
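The arithmetic above can be repeated for any model size. The sketch below only subtracts the numbers you give it; the 19GB, 9.6GB, and 5GB figures are the estimates quoted in this section, and headroom_gb is a name invented for the example.

```shell
# Print the RAM left (GB, one decimal) after the model and the OS reserve.
# $1 = total RAM, $2 = model size, $3 = OS reserve, all in GB.
headroom_gb() {
  awk -v total="$1" -v model="$2" -v os="$3" \
    'BEGIN { printf "%.1f\n", total - model - os }'
}
```

Running headroom_gb 24 19 5 reproduces the 26B A4B case, and headroom_gb 24 9.6 5 shows what the E4B variant leaves over for the context window and other apps.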
- Immediate Fixes
If you are determined to run the 26B A4B or 31B versions on a 24GB Mac:
Lower the Context Window: Ollama 0.19+ defaults to a large context. Force it smaller so it takes up less RAM.
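Run: ollama run gemma4:26b, then type /set parameter num_ctx 8192 at the prompt (ollama run does not take a --context flag). Alternatively, start the server with OLLAMA_CONTEXT_LENGTH=8192 ollama serve.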
The “Ollama Cloud” Escape: Since you have 20€, this is exactly what Ollama Cloud is for.
Instead of crashing your Mac, run the 31B Dense model in the cloud. It will be 10x faster and leave your 24GB of RAM free for VS Code and your browser.
Use the “E4B” Variant locally: For 90% of tasks, the gemma4:e4b model is shockingly good and will run at 60+ tokens/sec on your M5 without ever crashing.
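If you settle on a smaller context window, one way to make it stick is a Modelfile that derives a variant of the model with num_ctx preset. This is a sketch rather than the only approach; FROM and PARAMETER are standard Modelfile instructions, while write_ctx_modelfile and the variant name gemma4-26b-8k are hypothetical.

```shell
# Write a Modelfile that derives from base model $1 with context size $2.
# $3 = path of the Modelfile to write.
write_ctx_modelfile() {
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2" > "$3"
}

# Example (not run automatically):
#   write_ctx_modelfile gemma4:26b 8192 Modelfile
#   ollama create gemma4-26b-8k -f Modelfile
#   ollama run gemma4-26b-8k
```

The derived model reuses the base weights, so it should not need a second download; only the default context size changes.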
- Solving the “Internal Error”
If even the small e4b model fails to load:
Flash Attention Conflict: Gemma 4 uses a new “Alternating Attention” architecture. There is a known bug on M5 chips where Flash Attention hangs on models larger than 8B.
The Fix: Run this in your terminal before starting Ollama:
export OLLAMA_FLASH_ATTENTION=0
ollama serve
Summary Recommendation for your 20€
You are hitting the hardware limit of the basic M5.
Local: Use your M5 to run Gemma 4 E4B for fast, free, daily coding.
Cloud: Use your 20€ for an Ollama Cloud or OpenRouter sub to call the Gemma 4 31B or Claude 4.5 when you need a “genius” to look at your code.
Pro Tip: If you want to keep everything local, try ollama run gemma4:26b-a4b-it-q3_K_S. The “q3” (3-bit) version will fit comfortably in your 24GB RAM with room for a 32K context window.