Understanding Ollama: Installing, Managing, and Running Local AI Models
One of the biggest misconceptions beginners have when learning AI engineering is thinking that an AI model is the same thing as Ollama. They are not. Think about Java development.
Java Code
↓
JVM
↓
Execution
The JVM runs Java applications.
Similarly:
AI Model
↓
Ollama
↓
Execution
Ollama is a runtime that allows us to download, manage, and run Large Language Models (LLMs) locally on our machine.
This means we can build AI applications without relying on cloud APIs or paying per request.
Installing Ollama
The first step is installing Ollama.
For macOS:
brew install ollama
Verify the installation:
ollama --version
Example:
Warning: could not connect to a running Ollama instance
Warning: client version is 0.18.3
This simply means Ollama is installed, but the Ollama server is not currently running.
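The version check can also be scripted. The following is a minimal sketch, not part of Ollama itself: the helper names `parse_version` and `cli_version` are hypothetical, and the parsing assumes the CLI prints the phrase "version is" followed by the number, as in the warning shown above.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_version(output: str) -> Optional[str]:
    """Return the first 'version is X' value found in the CLI output, or None."""
    match = re.search(r"version is (\S+)", output)
    return match.group(1) if match else None


def cli_version() -> Optional[str]:
    """Run `ollama --version` if the binary is on PATH; otherwise return None."""
    if shutil.which("ollama") is None:
        return None  # Ollama is not installed, or not on PATH
    result = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True, check=False
    )
    # The text may arrive on either stream, so search both.
    return parse_version(result.stdout + result.stderr)
```

Calling `parse_version` on the warning text above yields the version string even though the server is not running, which matches the point of this section: the client is installed independently of the server.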
Starting the Ollama Server
Before we can use any model, we must start the Ollama server.
ollama serve
Expected output:
Listening on 127.0.0.1:11434
This means Ollama is now listening for requests from applications.
Conceptually:
Python App
↓
localhost:11434
↓
Ollama Server
Keep this terminal open. The server must remain running while we use models.
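From a second terminal, a Python application can confirm that the server is reachable before sending any real work. This is a small sketch using only the standard library; the function name `server_is_up` is hypothetical, and it assumes the server answers a plain GET on its root URL with HTTP 200.

```python
import http.client
import urllib.error
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # address taken from the `ollama serve` output


def server_is_up(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if something answers with HTTP 200 at base_url, else False."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a malformed reply: treat as "not running".
        return False
```

A script might call `server_is_up()` once at startup and print a reminder to run `ollama serve` when it returns False.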
Installing a Model
A fresh Ollama installation contains no models. We need to download one.
For example:
ollama pull qwen3:4b
What happens?
Ollama Registry
↓
Download Model
↓
Store On Disk
After the download completes, the model is available locally. A common mistake is assuming that downloading a model means it is running. It is not.
Think:
Movie Downloaded
≠
Movie Playing
Installing and running are different operations.
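The same download can be requested from code through the server's HTTP API. The sketch below assumes a `POST /api/pull` endpoint that accepts a `model` field and streams one JSON object per line, each with a `status` and, while a layer is downloading, `total` and `completed` byte counts. The function names are hypothetical, and older server versions may expect a different field name.

```python
import json
import urllib.request
from typing import Iterator

OLLAMA_URL = "http://127.0.0.1:11434"


def describe_progress(event: dict) -> str:
    """Turn one streamed pull event into a short human-readable line."""
    status = event.get("status", "unknown")
    total = event.get("total")
    completed = event.get("completed", 0)
    if total:  # byte counts are only present while a layer is downloading
        return f"{status}: {100 * completed / total:.0f}%"
    return status


def pull_model(model: str, base_url: str = OLLAMA_URL) -> Iterator[str]:
    """Ask a running server to download `model`, yielding progress lines."""
    request = urllib.request.Request(
        f"{base_url}/api/pull",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        for raw_line in response:  # assumed: one JSON object per line
            if raw_line.strip():
                yield describe_progress(json.loads(raw_line))
```

With the server running, `for line in pull_model("qwen3:4b"): print(line)` would print progress until the download finishes. Nothing is loaded into memory at any point, which is the downloaded-versus-playing distinction in code form.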
Viewing Installed Models
To see all downloaded models:
ollama list
Example:
NAME
qwen3:4b
phi3:mini
nomic-embed-text
These models exist on your SSD. They are not necessarily consuming RAM.
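An application can read the same list over HTTP. This sketch assumes the server's `GET /api/tags` endpoint returns a JSON object with a `models` array whose entries carry a `name` field; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"


def model_names(payload: dict) -> list[str]:
    """Extract model names from a {"models": [{"name": ...}, ...]} payload."""
    return [entry["name"] for entry in payload.get("models", [])]


def installed_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Fetch the installed-model list from a running server."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as response:
        return model_names(json.load(response))
```

Keeping the parsing in `model_names` separate from the network call makes it easy to check against a hand-written payload without a server.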
Running a Model
To start a model:
ollama run qwen3:4b
What happens internally?
SSD
↓
Load Model Into RAM
↓
Start Inference
↓
Ready For Questions
You can now ask:
What is Kotlin?
The model generates an answer. This process is called:
Inference
Inference is the act of using a trained model. Most AI application engineers perform inference rather than training models.
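Inference can also be triggered from Python instead of the interactive prompt. The sketch below assumes the server's `POST /api/generate` endpoint, called with `stream` set to false so that a single JSON object containing a `response` field comes back. The function names `build_generate_request` and `ask` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"


def build_generate_request(
    model: str, prompt: str, base_url: str = OLLAMA_URL
) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str) -> str:
    """Send the prompt to a running server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as response:
        return json.load(response)["response"]
```

With the server running and the model installed, `ask("qwen3:4b", "What is Kotlin?")` would return the answer as a string. The first call may be slow because the model has to be loaded from disk into RAM first.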
Viewing Running Models
To see which models are currently loaded into memory:
ollama ps
Example:
NAME
qwen3:4b
This command is different from:
ollama list
Remember:
ollama list
=
Installed Models
ollama ps
=
Running Models
This distinction is important: installed models only take up disk space, while running models also occupy RAM.
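The difference between the two commands can be made concrete by comparing both lists. This sketch assumes that `GET /api/tags` (installed) and `GET /api/ps` (loaded) both return a `models` array of entries with a `name` field. The helper names and the sample payloads in the usage note are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"


def names(payload: dict) -> set[str]:
    """Collect the model names from a {"models": [...]} payload."""
    return {entry["name"] for entry in payload.get("models", [])}


def classify(tags_payload: dict, ps_payload: dict) -> dict[str, str]:
    """Map each installed model to 'loaded' or 'on disk only'."""
    loaded = names(ps_payload)
    return {
        name: ("loaded" if name in loaded else "on disk only")
        for name in sorted(names(tags_payload))
    }


def fetch(path: str, base_url: str = OLLAMA_URL) -> dict:
    """GET a JSON document from a running server."""
    with urllib.request.urlopen(f"{base_url}{path}") as response:
        return json.load(response)
```

Against a live server the call would be `classify(fetch("/api/tags"), fetch("/api/ps"))`. With three installed models and only `qwen3:4b` loaded, the other two are reported as on disk only.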
Stopping a Model
When you’re finished using a model:
ollama stop qwen3:4b
What happens?
RAM
↓
Unload Model
↓
Free Memory
The model remains installed and can be started again later.
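An application can ask for the same unload over HTTP. The sketch below relies on an assumption about the API: that a `POST /api/generate` request with no prompt and `keep_alive` set to 0 tells the server to unload the model immediately. The function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"


def build_unload_request(
    model: str, base_url: str = OLLAMA_URL
) -> urllib.request.Request:
    """Build a request that asks the server to drop `model` from memory."""
    payload = {"model": model, "keep_alive": 0}  # no prompt: nothing is generated
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload(model: str) -> None:
    """Send the unload request to a running server."""
    with urllib.request.urlopen(build_unload_request(model)) as response:
        response.read()  # drain the reply; the body is not needed here
```

The model files stay on disk after `unload("qwen3:4b")`, so a later run only has to reload them into RAM.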
Removing a Model
If you no longer need a model:
ollama rm qwen3:4b
What happens?
Delete Model Files
↓
Free Disk Space
The model must be downloaded again before it can be used.
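Removal is also available over HTTP. This sketch assumes a `DELETE /api/delete` endpoint that takes the model name in a JSON body under `model`; the function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"


def build_delete_request(
    model: str, base_url: str = OLLAMA_URL
) -> urllib.request.Request:
    """Build a request that asks the server to delete `model` from disk."""
    return urllib.request.Request(
        f"{base_url}/api/delete",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="DELETE",
    )


def delete_model(model: str) -> int:
    """Send the delete request and return the HTTP status code."""
    with urllib.request.urlopen(build_delete_request(model)) as response:
        return response.status
```

Because deletion is destructive, a script would normally confirm with the user before calling `delete_model`.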
A Typical AI Engineering Workflow
A common workflow looks like this:
Start Ollama:
ollama serve
Download a model:
ollama pull qwen3:4b
Verify installation:
ollama list
Run the model:
ollama run qwen3:4b
Check running models:
ollama ps
Stop the model:
ollama stop qwen3:4b
Remove the model:
ollama rm qwen3:4b
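The workflow above can be scripted with Python's `subprocess` module. This is a sketch under two assumptions: `ollama serve` is already running in another terminal (it blocks, so it is left out), and a prompt is passed to `ollama run` so the command answers once and exits instead of opening an interactive session. The function names are hypothetical.

```python
import shlex
import subprocess


def workflow_commands(model: str, prompt: str) -> list[list[str]]:
    """Return the CLI steps in order, as argument lists."""
    return [
        ["ollama", "pull", model],         # download to disk
        ["ollama", "list"],                # verify installation
        ["ollama", "run", model, prompt],  # load into RAM and answer one prompt
        ["ollama", "ps"],                  # check what is loaded
        ["ollama", "stop", model],         # unload from RAM
        ["ollama", "rm", model],           # delete from disk
    ]


def run_workflow(model: str, prompt: str) -> None:
    """Execute each step, stopping at the first one that fails."""
    for command in workflow_commands(model, prompt):
        print("$", shlex.join(command))
        subprocess.run(command, check=True)
```

Calling `run_workflow("qwen3:4b", "What is Kotlin?")` would walk through the full lifecycle, ending with the model removed from disk again.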
Mental Model
Whenever you work with Ollama, think about three layers:
Storage Layer
│
├── qwen3:4b
├── phi3:mini
└── nomic-embed-text
↓
Memory Layer
│
└── Running Models
↓
Application Layer
│
├── Python
├── CLI Tools
├── Agents
└── AI Applications
Understanding these layers removes much of the mystery around local AI.
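The storage and memory layers can be pictured as two sets, with each command moving a model between them. The class below is a toy illustration only. It is not how Ollama is implemented, and the name `ToyOllama` is hypothetical. Two simplifications are assumed: `run` requires an explicit `pull` first (the real CLI may fetch a missing model itself), and `rm` also unloads the model.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOllama:
    """Toy model of the storage and memory layers; not Ollama's real code."""

    on_disk: set[str] = field(default_factory=set)    # storage layer
    in_memory: set[str] = field(default_factory=set)  # memory layer

    def pull(self, model: str) -> None:
        self.on_disk.add(model)  # installed, but not running

    def run(self, model: str) -> None:
        if model not in self.on_disk:
            raise LookupError(f"{model} is not installed")  # toy simplification
        self.in_memory.add(model)  # loaded and ready for inference

    def stop(self, model: str) -> None:
        self.in_memory.discard(model)  # frees RAM, files stay on disk

    def rm(self, model: str) -> None:
        self.in_memory.discard(model)  # assumption: removal also unloads
        self.on_disk.discard(model)    # frees disk space
```

In this picture, `ollama list` corresponds to reading `on_disk`, `ollama ps` to reading `in_memory`, and the application layer is whatever code calls these methods.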
Once you know how to install, manage, run, and stop models, you’re ready to start building real AI applications on top of them.
