
Learn how to run LLMs locally with Ollama and C#. Free, private, offline AI with runnable code examples. Start building local AI in C# today.
Running large language models used to mean paying per token to OpenAI or Anthropic and sending your data to someone else's servers. Not anymore. With Ollama and C# you can run powerful LLMs locally on your own machine — completely free, fully private, and even offline. In this tutorial you'll learn how to set up Ollama, call it from .NET, stream responses, and build a production-ready local AI service in C#.
Whether you're a beginner searching "how to run LLMs locally" or a senior engineer evaluating self-hosted AI for compliance reasons, this guide gives you runnable code and the why behind every decision.
Why Run LLMs Locally with Ollama and C#?
Before writing code, it's worth understanding why local AI has exploded in popularity among .NET developers in the USA, UK, Canada, and Australia:
- Zero cost. No API keys, no per-token billing, no monthly subscriptions. Run as many requests as your hardware allows.
- Privacy and compliance. Your prompts and data never leave your machine — critical for healthcare (HIPAA), finance, and GDPR-sensitive workloads.
- Offline capability. Build AI features that work on a plane, in an air-gapped network, or in regions with poor connectivity.
- No rate limits. The only ceiling is your CPU/GPU.
- Low latency. For small models, local inference can beat a network round-trip to a cloud API.
Ollama is the key that makes this easy. It packages popular open models (Llama 3.2, Mistral, Phi-4, Gemma 3, Qwen, DeepSeek) behind a simple local REST API, so calling an LLM from C# is no harder than calling any other HTTP service.
Step 1: Install Ollama and Pull a Model
Download Ollama from ollama.com for Windows, macOS, or Linux. Once installed, it runs a local server on http://localhost:11434. Pull a model from your terminal:
// Run these in your terminal (PowerShell, bash, etc.), not in C#:
// ollama pull llama3.2 // 3B model, great default, ~2GB
// ollama pull phi4 // Microsoft's small but strong model
// ollama run llama3.2 "Hello" // quick smoke test
Pick a model that fits your RAM. A 3B model like llama3.2 runs comfortably on 8GB of RAM; 7B–8B models want 16GB; larger models benefit from a dedicated GPU. Start small — you can always scale up.
Step 2: Call Ollama from C# with HttpClient
The most transparent way to learn how Ollama works is to call its REST API directly. No NuGet packages required — just HttpClient and System.Text.Json. This is the foundation every higher-level library is built on.
using System.Net.Http.Json;
using System.Text.Json.Serialization;
// Define request/response shapes for Ollama's /api/generate endpoint
record OllamaRequest(
[property: JsonPropertyName("model")] string Model,
[property: JsonPropertyName("prompt")] string Prompt,
[property: JsonPropertyName("stream")] bool Stream);
record OllamaResponse(
[property: JsonPropertyName("response")] string Response,
[property: JsonPropertyName("done")] bool Done);
using var http = new HttpClient { BaseAddress = new Uri("http://localhost:11434") };
var request = new OllamaRequest(
Model: "llama3.2",
Prompt: "Explain dependency injection in one sentence.",
Stream: false);
var httpResponse = await http.PostAsJsonAsync("/api/generate", request);
httpResponse.EnsureSuccessStatusCode();
var result = await httpResponse.Content.ReadFromJsonAsync<OllamaResponse>();
Console.WriteLine(result?.Response);
That's it — you just ran an LLM locally from C# for free. The /api/generate endpoint handles single-prompt completions, while /api/chat (shown later) handles multi-turn conversations.
Step 3: Stream Responses for a Real-Time UX
Setting stream: false means you wait for the entire response before seeing anything. For a chatbot or assistant, that feels sluggish. Streaming sends tokens as they're generated — exactly the typewriter effect you see in ChatGPT. Ollama streams newline-delimited JSON (NDJSON), and C#'s IAsyncEnumerable handles it beautifully.
using System.Text.Json;
async IAsyncEnumerable<string> StreamCompletionAsync(
HttpClient http, string model, string prompt)
{
var payload = new OllamaRequest(model, prompt, Stream: true);
using var req = new HttpRequestMessage(HttpMethod.Post, "/api/generate")
{
Content = JsonContent.Create(payload)
};
using var resp = await http.SendAsync(
req, HttpCompletionOption.ResponseHeadersRead);
resp.EnsureSuccessStatusCode();
using var stream = await resp.Content.ReadAsStreamAsync();
using var reader = new StreamReader(stream);
while (!reader.EndOfStream)
{
var line = await reader.ReadLineAsync();
if (string.IsNullOrWhiteSpace(line)) continue;
var chunk = JsonSerializer.Deserialize<OllamaResponse>(line);
if (chunk is not null)
yield return chunk.Response;
}
}
// Usage: print tokens as they arrive
await foreach (var token in StreamCompletionAsync(http, "llama3.2", "Write a haiku about C#"))
{
Console.Write(token);
}
The key detail is HttpCompletionOption.ResponseHeadersRead. Without it, HttpClient buffers the entire response before returning, defeating the purpose of streaming. With it, you process each token the instant it arrives.
Step 4: Use OllamaSharp for Cleaner C# Code
Hand-rolling HTTP is great for understanding, but for real projects use OllamaSharp — the most popular Ollama C# client on NuGet. It handles streaming, chat history, model management, and embeddings, and it implements Microsoft's IChatClient abstraction.
// dotnet add package OllamaSharp
using OllamaSharp;
var ollama = new OllamaApiClient("http://localhost:11434")
{
SelectedModel = "llama3.2"
};
// Streaming chat with maintained conversation history
var chat = new Chat(ollama);
await foreach (var token in chat.SendAsync("What is async/await in C#?"))
{
Console.Write(token);
}
// Follow-up question — context is remembered automatically
Console.WriteLine();
await foreach (var token in chat.SendAsync("Now give me a code example."))
{
Console.Write(token);
}
The Chat helper automatically tracks message history, so follow-up questions have full context — no manual list management required.
Step 5: Integrate with Microsoft.Extensions.AI
For modern .NET apps, the cleanest approach is Microsoft.Extensions.AI, the official abstraction layer that lets you swap providers (Ollama, Azure OpenAI, OpenAI) without rewriting your code. OllamaSharp implements its IChatClient interface, so you get dependency injection, middleware, and logging for free.
// dotnet add package Microsoft.Extensions.AI
// dotnet add package OllamaSharp
using Microsoft.Extensions.AI;
using OllamaSharp;
IChatClient chatClient = new OllamaApiClient(
new Uri("http://localhost:11434"), "llama3.2");
var response = await chatClient.GetResponseAsync(
"Summarize the SOLID principles in 3 bullet points.");
Console.WriteLine(response.Text);
Because you're coding against IChatClient rather than a concrete Ollama type, moving to a cloud model later is a one-line change. This is the recommended pattern for new C# AI projects in 2026.
Best Practices for Production Local AI in C#
Running a quick demo is easy; running local AI reliably in a real app takes a little more care. Here are the practices that matter most.
Reuse a single HttpClient (or use IHttpClientFactory)
Creating a new HttpClient per request causes socket exhaustion. Register one client via dependency injection and reuse it. With ASP.NET Core, prefer IHttpClientFactory or register OllamaApiClient as a singleton.
Always use cancellation tokens and timeouts
Local inference can take seconds. Pass a CancellationToken through every call so a user navigating away or a request timeout actually stops the work.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(60));
await foreach (var token in chat.SendAsync("Long prompt...", cts.Token))
{
Console.Write(token);
}
Keep models warm
The first request after Ollama loads a model into memory is slow (cold start). Send a tiny "warm-up" prompt at app startup, and use Ollama's keep_alive option to keep the model resident between requests.
Pick the right model size
Don't reach for a 70B model when a 3B model will do. Smaller models respond faster, use less RAM, and are often perfectly adequate for classification, summarization, and extraction tasks. Benchmark on your actual workload.
Common Pitfalls When Using Ollama with C#
- Forgetting
ResponseHeadersRead— your "streaming" code silently buffers the whole response. Always set it for streaming endpoints. - Connection refused errors — Ollama isn't running. Confirm the service is up by visiting
http://localhost:11434in a browser; it should say "Ollama is running". - Out-of-memory crashes — the model is too large for your RAM. Drop to a smaller quantized variant (e.g.
llama3.2:1b). - Expecting cloud-level quality — small local models hallucinate more than GPT-4-class models. Use clear prompts, add retrieval (RAG), and validate outputs.
- Blocking the UI thread — always
awaitcalls and stream into the UI asynchronously so your app stays responsive.
Going Further: Embeddings and RAG
Ollama isn't just for chat. It can generate embeddings for semantic search and Retrieval-Augmented Generation (RAG), letting your local LLM answer questions about your documents. The same OllamaApiClient exposes an embeddings endpoint:
var ollama = new OllamaApiClient("http://localhost:11434")
{
SelectedModel = "nomic-embed-text"
};
var embedding = await ollama.EmbedAsync("C# is a modern, type-safe language.");
float[] vector = embedding.Embeddings.First();
Console.WriteLine($"Vector length: {vector.Length}");
Store these vectors in a database like Qdrant, SQLite with sqlite-vec, or even an in-memory list, and you have the foundation of a private, offline RAG system — all running on your own machine for free.
Conclusion: Start Building Local AI in C# Today
You now have everything you need to run LLMs locally with Ollama and C#: from a raw HttpClient call, to streaming with IAsyncEnumerable, to clean integrations via OllamaSharp and Microsoft.Extensions.AI, plus embeddings for RAG. Local AI gives you free, private, offline inference with no rate limits — a genuine alternative to paid cloud APIs for many workloads.
Key takeaways:
- Ollama exposes a simple local REST API at
localhost:11434— calling an LLM from C# is just HTTP. - Use streaming with
HttpCompletionOption.ResponseHeadersReadfor a responsive, real-time UX. - Prefer OllamaSharp + Microsoft.Extensions.AI so you can swap providers without rewriting code.
- Reuse
HttpClient, pass cancellation tokens, keep models warm, and choose the smallest model that does the job. - Add embeddings and RAG to give your local model knowledge of your own data — completely offline and free.
The best next step? Run ollama pull llama3.2, paste the first code sample into a new console app, and watch your own machine generate text. Once you've run an LLM locally in C# for free, you'll wonder why you ever paid per token for prototypes.
Your go-to resource for C#, .NET, and modern software development. Follow along for daily tutorials, tips, and real-world examples.
Comments
Post a Comment