Getting Started with Semantic Kernel — Part 3: Running Local Models (LM Studio / Ollama)

Hey there, fellow .NET developers! Joche here, back with the final installment of our "Getting Started with Semantic Kernel" series. So far, we've explored building kernels and working with chat history and functions, always pointing to a cloud-based OpenAI model. But what if you want to experiment with LLMs without incurring cloud costs? Or perhaps you need to ensure data privacy by keeping everything local?

Today, I'm going to show you a neat trick to point your existing Semantic Kernel code at a local LLM served by LM Studio or Ollama. The best part? Your core kernel logic remains virtually unchanged. This is the power of Semantic Kernel's abstraction over the underlying AI service.

The Magic: CustomHttpMessageHandler

The secret sauce lies in a custom HttpClientHandler. Both LM Studio and Ollama, when running locally, expose an API endpoint that's compatible with OpenAI's API. This means we can trick Semantic Kernel into thinking it's talking to OpenAI, while in reality, it's sending requests to our local server.

Let's look at the CustomHttpMessageHandler first:

public class CustomHttpMessageHandler : HttpClientHandler
{
    string LocalModelUrl;
    public CustomHttpMessageHandler(string LocalModelUrl)
    {
        this.LocalModelUrl = LocalModelUrl;
    }
    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request,
        CancellationToken cancellationToken)
    {

        request.RequestUri = new Uri(this.LocalModelUrl);
        // Custom logic before the request
        Console.WriteLine($"Sending request to: {request.RequestUri}");
        // Call base handler
        var response = await base.SendAsync(request, cancellationToken);

        // Custom logic after the request
        Console.WriteLine($"Response status: {response.StatusCode}");

        return response;
    }
}

What's happening here? This CustomHttpMessageHandler intercepts every HTTP request made by the HttpClient it's attached to. Inside the SendAsync method, which is where the magic happens, we completely rewrite the RequestUri to point to our local model's endpoint. For example, http://localhost:1234/v1/chat/completions is the chat completions endpoint on LM Studio's default port; Ollama serves the same OpenAI-compatible path on port 11434 by default (the port might vary depending on your setup).
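If you want to watch the rewrite happen without a local server running, here is a minimal sketch. It is not the handler from this article: HostRewriteHandler and EchoHandler are hypothetical names I made up for illustration. The variant swaps only the scheme, host, and port and keeps the original path, and the stub inner handler simply echoes back the URI it received instead of touching the network.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Send a request addressed to OpenAI through the rewriting handler.
// EchoHandler answers in-process, so no network or local server is needed.
var handler = new HostRewriteHandler(new Uri("http://localhost:1234"), new EchoHandler());
using var client = new HttpClient(handler);

var response = await client.GetAsync("https://api.openai.com/v1/chat/completions");
var finalUri = await response.Content.ReadAsStringAsync();
Console.WriteLine(finalUri); // http://localhost:1234/v1/chat/completions

// Hypothetical variant of the article's handler: it replaces only the
// scheme, host, and port, and leaves the path and query untouched.
public class HostRewriteHandler : DelegatingHandler
{
    private readonly Uri localBase;

    public HostRewriteHandler(Uri localBase, HttpMessageHandler innerHandler)
        : base(innerHandler)
    {
        this.localBase = localBase;
    }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request,
        CancellationToken cancellationToken)
    {
        if (request.RequestUri is not null)
        {
            var builder = new UriBuilder(request.RequestUri)
            {
                Scheme = this.localBase.Scheme,
                Host = this.localBase.Host,
                Port = this.localBase.Port
            };
            request.RequestUri = builder.Uri;
        }

        return base.SendAsync(request, cancellationToken);
    }
}

// Stub inner handler (illustration only): replies with the URI it was asked for.
public class EchoHandler : HttpMessageHandler
{
    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request,
        CancellationToken cancellationToken)
    {
        var reply = new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent(request.RequestUri?.ToString() ?? string.Empty)
        };
        return Task.FromResult(reply);
    }
}
```

Deriving from DelegatingHandler instead of HttpClientHandler is what makes the inner handler swappable here; in real use you would pass a new HttpClientHandler() as the inner handler.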

I've also added some Console.WriteLine statements to help you see where the requests are going and what status codes you're getting back. This is super handy for debugging when you're first setting things up.

Integrating with Semantic Kernel

Now, how do we plug this custom handler into our Semantic Kernel setup? It's surprisingly straightforward. Remember when we used AddOpenAIChatCompletion? That method has an overload that accepts an HttpClient. This is our entry point.

Here's how we set it up in Program.cs:

using BuildingKernels;
using Microsoft.SemanticKernel;

// Create custom HTTP handler.
// This can redirect to any local model that uses the same API as Azure OpenAI or OpenAI.
// You can use LM Studio or Ollama.
var handler = new CustomHttpMessageHandler("http://localhost:1234/v1/chat/completions");
var httpClient = new HttpClient(handler);

string modelId = "this is going to be ignored";

// Create kernel builder with custom client
var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion(modelId, "ApiKey", "OrgId", "ServiceId", httpClient)
    .Build();


var prompt = "Write a hello world program in C#";
var result = await kernel.InvokePromptAsync(prompt);

Console.WriteLine(result);
Console.ReadKey();

Notice the key line here: new HttpClient(handler). We create an HttpClient instance, providing our CustomHttpMessageHandler to it. Then, we pass this httpClient directly into AddOpenAIChatCompletion.

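You might also notice that the modelId, ApiKey, OrgId, and ServiceId parameters are mostly placeholders. By default, a local server does not check the API key or organization ID, so dummy values work there, and ServiceId is only a label used inside Semantic Kernel. The model ID deserves a little more care: our CustomHttpMessageHandler rewrites only the request URI, so the model ID still travels to the server in the request body. LM Studio will typically answer with whichever model you have loaded, so the placeholder in the example works. Ollama, on the other hand, selects the model by that name, so set modelId to a model you have already pulled.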
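If you are not sure which model IDs your local server knows about, a quick check like the one below can help. This is a minimal sketch, assuming your server exposes the OpenAI-style GET /v1/models route and is listening on LM Studio's default port; ExtractModelIds is a helper name I made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

// Assumption: LM Studio's default port. Ollama listens on 11434 by default.
var baseUrl = "http://localhost:1234";

using var http = new HttpClient { Timeout = TimeSpan.FromSeconds(5) };
try
{
    var json = await http.GetStringAsync($"{baseUrl}/v1/models");
    foreach (var id in ExtractModelIds(json))
    {
        Console.WriteLine(id);
    }
}
catch (Exception ex) when (ex is HttpRequestException or TaskCanceledException)
{
    // The server is not running, the port is wrong, or the request timed out.
    Console.WriteLine($"Could not reach {baseUrl}: {ex.Message}");
}

// Hypothetical helper: pulls the "id" of each entry out of an OpenAI-style
// model list of the form { "object": "list", "data": [ { "id": "..." } ] }.
static List<string> ExtractModelIds(string modelListJson)
{
    using var doc = JsonDocument.Parse(modelListJson);
    var result = new List<string>();
    foreach (var model in doc.RootElement.GetProperty("data").EnumerateArray())
    {
        var modelId = model.GetProperty("id").GetString();
        if (modelId is not null)
        {
            result.Add(modelId);
        }
    }
    return result;
}
```

Any ID printed here is a candidate value for modelId in Program.cs.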

With this setup, when you invoke a prompt, Semantic Kernel will build its request as usual, but our CustomHttpMessageHandler will redirect it to your local LLM. Your kernel code, whether it's building simple prompts, using chat history, or invoking functions, doesn't need to know the difference.
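To make that concrete, here is a rough sketch of a two-turn conversation with chat history running against the local server. It assumes the CustomHttpMessageHandler from this article lives in the BuildingKernels namespace, that LM Studio is listening on its default port, and that "local-model" is a placeholder model ID (with Ollama you would use the name of a model you have pulled). The prompts are made up for illustration.

```csharp
using System;
using System.Net.Http;
using BuildingKernels; // assumed namespace of CustomHttpMessageHandler
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

// Same wiring as in Program.cs: every request goes to the local endpoint.
var handler = new CustomHttpMessageHandler("http://localhost:1234/v1/chat/completions");
var httpClient = new HttpClient(handler);

var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion(modelId: "local-model", apiKey: "ApiKey", httpClient: httpClient)
    .Build();

// From here on, nothing is specific to the local setup.
var chat = kernel.GetRequiredService<IChatCompletionService>();
var history = new ChatHistory("You are a concise C# tutor.");

// First turn
history.AddUserMessage("What is a record type?");
var first = await chat.GetChatMessageContentAsync(history, kernel: kernel);
history.AddAssistantMessage(first.Content ?? string.Empty);
Console.WriteLine(first.Content);

// Second turn: the model sees the earlier exchange through the history.
history.AddUserMessage("Show a one-line example.");
var second = await chat.GetChatMessageContentAsync(history, kernel: kernel);
Console.WriteLine(second.Content);
```

The only lines that know about the local server are the handler and the HttpClient; the chat history code is the same code you would write against OpenAI.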

Wrapping Up

This approach really highlights the power of abstraction in Semantic Kernel. By simply providing a different HttpClient, we can seamlessly switch our backend from a cloud service like OpenAI to a local LLM running on our machine, all without changing our core application logic. This opens up a ton of possibilities for local development, privacy-focused applications, and cost-effective experimentation.

I encourage you to try this out! Set up LM Studio or Ollama, load a model, and then run this code. You'll see your Semantic Kernel application interacting with your local model, giving you full control and zero cloud bills. Happy coding!