
Learn how to run local AI with Ollama and C#. Set up free local LLMs on your own machine with runnable code examples. Start building offline AI apps today.
Want to build AI features without paying per API call or shipping your users' data to a cloud provider? Running local AI with Ollama and C# lets you host large language models (LLMs) directly on your own machine — completely free, fully offline, and privacy-first. In this tutorial you'll learn how to run an LLM locally in C#, call it from your .NET applications, stream responses, and build production-ready patterns that scale. Whether you're a beginner searching for "how to run LLM locally" or a senior engineer evaluating self-hosted AI for C#, this guide has runnable code and the reasoning behind every decision.
Why Run Local AI with Ollama and C#?
For most of the last few years, adding AI to a .NET app meant calling a hosted API. That works, but it comes with three persistent problems: cost (you pay per token, forever), privacy (your prompts and data leave your network), and latency/availability (you depend on someone else's uptime and rate limits).
Running a local LLM in C# flips all three. Once a model is downloaded, inference is free no matter how many tokens you generate. Data never leaves your machine, which matters enormously for healthcare, finance, legal, and internal tooling. And there are no rate limits or outages to engineer around. The trade-off is that you supply the compute — but modern quantized models run surprisingly well on a laptop with 16GB of RAM.
Ollama is the tool that makes this practical. It's a lightweight runtime that downloads, manages, and serves open-source models (Llama 3, Mistral, Phi, Gemma, Qwen, and more) behind a simple local HTTP API. Because it exposes a REST endpoint, calling it from C# is straightforward — and there's even an OpenAI-compatible layer if you already have code written against that SDK.
What You'll Need
- Ollama installed (download from ollama.com — available for Windows, macOS, and Linux)
- .NET 8 or .NET 9 SDK
- At least 8GB RAM (16GB recommended for 7B+ parameter models)
- A few GB of disk space per model
Step 1: Install Ollama and Pull a Model
After installing Ollama, it runs as a background service listening on http://localhost:11434. Pull your first model from a terminal. A great starting point is Llama 3.2 (3B) — small enough to be fast, smart enough to be useful:
// Run these in your terminal (PowerShell, bash, etc.)
// ollama pull llama3.2
// ollama run llama3.2 "Explain dependency injection in one sentence."
// Verify the server is up:
// curl http://localhost:11434/api/tags
Once the model is downloaded, the Ollama server is ready to accept HTTP requests. That's the entire backend — no API keys, no accounts, no cloud.
Step 2: Call Ollama from C# with HttpClient
The most transparent way to understand how local AI with Ollama and C# works is to call the raw REST API yourself. This has zero third-party dependencies and shows exactly what's on the wire. Here's a complete console app that sends a prompt and prints the response:
using System.Net.Http.Json;
using System.Text.Json.Serialization;
var http = new HttpClient { BaseAddress = new Uri("http://localhost:11434") };
var request = new OllamaRequest
{
Model = "llama3.2",
Prompt = "Write a haiku about C# and local AI.",
Stream = false // get the full answer in one response
};
var response = await http.PostAsJsonAsync("/api/generate", request);
response.EnsureSuccessStatusCode();
var result = await response.Content.ReadFromJsonAsync();
Console.WriteLine(result?.Response);
// Strongly-typed request/response models
public class OllamaRequest
{
[JsonPropertyName("model")] public string Model { get; set; } = "";
[JsonPropertyName("prompt")] public string Prompt { get; set; } = "";
[JsonPropertyName("stream")] public bool Stream { get; set; }
}
public class OllamaResponse
{
[JsonPropertyName("response")] public string Response { get; set; } = "";
[JsonPropertyName("done")] public bool Done { get; set; }
}
Notice Stream = false. By default Ollama streams tokens as a sequence of newline-delimited JSON objects. Setting stream to false tells it to buffer the entire generation and return one JSON object — simpler for a first example, but it means the user waits for the whole answer before seeing anything.
Step 3: Stream Responses for a Real-Time Feel
Users expect ChatGPT-style token-by-token output. Streaming is also better engineering: you start showing results immediately instead of holding the entire response in memory. To stream from a local LLM in C#, read the response body as a stream and parse each line as it arrives:
using System.Text.Json;
var request = new OllamaRequest
{
Model = "llama3.2",
Prompt = "Explain async/await in C# to a beginner.",
Stream = true
};
using var content = new StringContent(
JsonSerializer.Serialize(request),
System.Text.Encoding.UTF8,
"application/json");
using var req = new HttpRequestMessage(HttpMethod.Post, "/api/generate") { Content = content };
using var resp = await http.SendAsync(req, HttpCompletionOption.ResponseHeadersRead);
resp.EnsureSuccessStatusCode();
await using var stream = await resp.Content.ReadAsStreamAsync();
using var reader = new StreamReader(stream);
string? line;
while ((line = await reader.ReadLineAsync()) is not null)
{
if (string.IsNullOrWhiteSpace(line)) continue;
var chunk = JsonSerializer.Deserialize(line);
Console.Write(chunk?.Response); // print each token as it arrives
if (chunk?.Done == true) break;
}
The key is HttpCompletionOption.ResponseHeadersRead. Without it, HttpClient buffers the whole response before handing it to you — defeating the purpose of streaming. With it, you process tokens the moment they're generated.
Step 4: Use OllamaSharp for Cleaner Code
Hand-rolling HTTP is great for learning, but for real projects the community library OllamaSharp handles streaming, chat history, and model management for you. Install it with dotnet add package OllamaSharp. Here's an interactive chat loop that maintains conversation context:
using OllamaSharp;
var ollama = new OllamaApiClient("http://localhost:11434")
{
SelectedModel = "llama3.2"
};
var chat = new Chat(ollama);
Console.WriteLine("Chat with your local LLM (type 'exit' to quit):");
while (true)
{
Console.Write("\nYou: ");
var input = Console.ReadLine();
if (string.IsNullOrWhiteSpace(input) ||
input.Equals("exit", StringComparison.OrdinalIgnoreCase))
break;
Console.Write("AI: ");
await foreach (var token in chat.SendAsync(input))
Console.Write(token); // streams tokens AND remembers history
Console.WriteLine();
}
The Chat class automatically tracks the message history, so the model has context from earlier turns — exactly what you'd build manually with the raw API by maintaining a list of messages and posting to /api/chat.
Step 5: Inject Ollama into ASP.NET Core
For web apps and APIs, register the client with dependency injection and use IHttpClientFactory so you don't leak sockets. This is the idiomatic pattern for self-hosted AI in a C# backend:
// Program.cs
builder.Services.AddSingleton(sp =>
new OllamaApiClient("http://localhost:11434") { SelectedModel = "llama3.2" });
// A minimal API endpoint that streams to the browser
app.MapPost("/chat", async (ChatRequest body, OllamaApiClient ollama, HttpContext ctx) =>
{
ctx.Response.ContentType = "text/plain";
var chat = new Chat(ollama);
await foreach (var token in chat.SendAsync(body.Message))
{
await ctx.Response.WriteAsync(token);
await ctx.Response.Body.FlushAsync(); // push each token to the client
}
});
public record ChatRequest(string Message);
Best Practices for Local LLMs in C#
- Reuse a single HttpClient (or use IHttpClientFactory). Creating a new
HttpClientper request exhausts sockets. Register it once as a singleton. - Always pass a CancellationToken. LLM generation can run for many seconds. Wire cancellation through so a user closing the browser or a request timeout actually stops the work.
- Pick the right model size. Smaller models (1B–3B) are fast and fine for classification, extraction, and simple chat. Reach for 7B–8B models when you need stronger reasoning. Match the model to the hardware.
- Use quantized models. Tags like
llama3.2:3b-instruct-q4_K_Muse 4-bit quantization to cut RAM usage dramatically with minimal quality loss — essential for laptops. - Set a system prompt. Use the
/api/chatendpoint with asystemmessage to control tone, format, and guardrails instead of stuffing instructions into every user prompt. - Control determinism with options. Pass
temperature(lower = more deterministic) andnum_ctx(context window) in the request'soptionsobject to tune behavior.
Common Pitfalls to Avoid
- Forgetting to pull the model first. If the model name isn't downloaded, Ollama returns a 404. Run
ollama pull <model>before your app starts, or call/api/pullprogrammatically. - Buffering when you meant to stream. Omitting
HttpCompletionOption.ResponseHeadersReadsilently disables real streaming even thoughstream:trueis set. - Blocking the cold start. The first request after Ollama loads a model into memory is slow (model load time). Warm it up at startup with a tiny prompt so your first real user isn't penalized.
- Ignoring memory limits. Running a 70B model on 16GB RAM will swap to disk and crawl. Check the model's RAM requirement before pulling.
- Assuming the same output every time. LLMs are non-deterministic by default. For testing, set
temperatureto 0 and a fixedseed.
Bonus: OpenAI-Compatible Endpoint
If you already have C# code written against the OpenAI SDK, Ollama exposes a drop-in compatible endpoint at http://localhost:11434/v1. Point your existing client at it, use any string as the API key, and set the model name — your code keeps working while inference runs locally and free.
// Works with the OpenAI .NET SDK, pointed at Ollama
using OpenAI;
using OpenAI.Chat;
var client = new ChatClient(
model: "llama3.2",
credential: new System.ClientModel.ApiKeyCredential("ollama"), // any value
options: new OpenAIClientOptions { Endpoint = new Uri("http://localhost:11434/v1") });
ChatCompletion completion = await client.CompleteChatAsync("Summarize REST in one line.");
Console.WriteLine(completion.Content[0].Text);
Conclusion: Key Takeaways
Building local AI with Ollama and C# gives you free, private, offline LLM inference that fits naturally into the .NET ecosystem. You've seen how to run an LLM locally in C# three ways — raw HttpClient, the OllamaSharp library, and the OpenAI-compatible endpoint — plus how to stream tokens, inject the client into ASP.NET Core, and avoid the most common mistakes.
Here are the points worth remembering:
- Free and private: once a model is pulled, inference costs nothing and your data never leaves your machine.
- Streaming matters: use
ResponseHeadersReadandawait foreachfor a responsive, real-time experience. - Use the right tool: raw HTTP to learn, OllamaSharp for productivity, the
/v1endpoint to reuse OpenAI code. - Right-size your model: quantized 3B–8B models give the best balance of speed and quality on typical hardware.
- Engineer for production: reuse clients, pass cancellation tokens, warm up cold starts, and set system prompts.
Start small — pull llama3.2, run the console example above, and you'll have a working local LLM in C# in under ten minutes. From there you can layer in retrieval-augmented generation (RAG), structured JSON output, and tool calling to build full AI features that run entirely on your own infrastructure, for free.
Your go-to resource for C#, .NET, and modern software development. Follow along for daily tutorials, tips, and real-world examples.
Comments
Post a Comment