Getting Started with Microsoft.Extensions.AI — Part 3: Pipelines, Embeddings & Search

Welcome back to our journey with Microsoft.Extensions.AI! In the previous parts, we got a solid grasp of basic chat completions and handling structured outputs. Now, it's time to elevate our game by exploring two advanced, yet incredibly practical, features: building robust request pipelines using a middleware pattern, and leveraging embeddings for powerful semantic search.

These capabilities are crucial for building production-ready AI applications, allowing you to inject cross-cutting concerns into your chat interactions and unlock the true potential of vector-based understanding.

Chat Client Pipelines: Middleware for AI Requests

If you've ever worked with ASP.NET Core, you're familiar with the middleware pattern. It's a fantastic way to inject cross-cutting concerns like logging, authentication, or error handling into your request pipeline. Microsoft.Extensions.AI brings a very similar concept to IChatClient interactions through its AsBuilder() extension.

This allows you to wrap your core IChatClient implementation with custom behaviors, creating a chain of responsibility that processes your chat messages before they hit the LLM, and potentially after the response comes back.

Let's look at how we build an IChatClient with a pipeline:

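using Microsoft.Extensions.AI;
using OpenAI;

private static IChatClient GetChatClientOpenAiImp(string apiKey, string modelId)
{
    // The raw OpenAI SDK client that the pipeline ultimately calls.
    OpenAIClient openAIClient = new OpenAIClient(apiKey);

    return new OpenAIChatClient(openAIClient, modelId)
        .AsBuilder()
        .UseFunctionInvocation()                         // built-in middleware
        .UserLanguage("spanish")                         // custom: prompt augmentation
        .UseRateLimitThreading(TimeSpan.FromSeconds(5))  // custom: throttling
        .Build();
}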

Here, we start with a standard OpenAIChatClient, then call .AsBuilder(). This transforms our client into a ChatClientBuilder, which exposes Use() methods. Notice we're adding three distinct pieces of middleware:

.UseFunctionInvocation(): Enables the client to call functions (a topic for a future post!).
.UserLanguage("spanish"): A custom middleware we'll look at that augments the prompt.
.UseRateLimitThreading(TimeSpan.FromSeconds(5)): Another custom middleware to throttle requests.

Finally, .Build() gives us our IChatClient instance, complete with all these behaviors woven in.
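
To make the wrapping order concrete, here is a minimal, library-free sketch of the same decorator idea. IMiniClient, EchoModel, Tag, and MiniBuilder are hypothetical stand-ins invented for this illustration; they are not Microsoft.Extensions.AI types. The sketch assumes that the first middleware added ends up as the outermost wrapper, which is how ChatClientBuilder behaves to the best of my knowledge.

```csharp
using System;
using System.Collections.Generic;

// Build a two-step pipeline around a fake "model" and run one request.
IMiniClient client = new MiniBuilder(new EchoModel())
    .Use(inner => new Tag(inner, "A"))
    .Use(inner => new Tag(inner, "B"))
    .Build();

Console.WriteLine(client.Complete("hi"));
// prints: A(B(model:hi))

// Hypothetical stand-in for IChatClient: a single synchronous method.
public interface IMiniClient
{
    string Complete(string prompt);
}

// The innermost client; a real pipeline would call the LLM here.
public sealed class EchoModel : IMiniClient
{
    public string Complete(string prompt) => $"model:{prompt}";
}

// A middleware that wraps the inner client's answer in its own name.
public sealed class Tag(IMiniClient inner, string name) : IMiniClient
{
    public string Complete(string prompt) => $"{name}({inner.Complete(prompt)})";
}

// Hypothetical stand-in for ChatClientBuilder.
public sealed class MiniBuilder(IMiniClient innermost)
{
    private readonly List<Func<IMiniClient, IMiniClient>> _steps = new();

    public MiniBuilder Use(Func<IMiniClient, IMiniClient> step)
    {
        _steps.Add(step);
        return this;
    }

    public IMiniClient Build()
    {
        IMiniClient current = innermost;
        // Apply steps in reverse so the first Use() call becomes the outermost wrapper.
        for (int i = _steps.Count - 1; i >= 0; i--)
        {
            current = _steps[i](current);
        }
        return current;
    }
}
```

Applied to the pipeline above, and under the same ordering assumption, UseFunctionInvocation would be the outermost layer and the rate limiter would sit closest to the OpenAI client.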

Custom Middleware: Augmenting Prompts with Language

Let's dissect the UserLanguage middleware. This is a great example of how you can dynamically modify the input to the LLM.

using Microsoft.Extensions.AI;

public static class UseLanguageStep
{
    public static ChatClientBuilder UserLanguage(this ChatClientBuilder chatClientBuilder, string language)
    {
        chatClientBuilder.Use(inner => new UseLanguageClient(inner, language));
        return chatClientBuilder;
    }

    private class UseLanguageClient(IChatClient chatClient, string language) : DelegatingChatClient(chatClient)
    {
        public override async Task<ChatCompletion> CompleteAsync(IList<ChatMessage> chatMessages, ChatOptions? options = null, CancellationToken cancellationToken = default)
        {
            // Temporarily append a system message that tells the model the user's language.
            ChatMessage promptAugmentation = new ChatMessage(ChatRole.System, $"User language is {language}");
            chatMessages.Add(promptAugmentation);
            try
            {
                return await base.CompleteAsync(chatMessages, options, cancellationToken);
            }
            finally
            {
                // Restore the caller's list, even if the inner call throws.
                chatMessages.Remove(promptAugmentation);
            }
        }
    }
}

The UserLanguage extension method simply calls .Use() on the builder, passing in a lambda that creates our UseLanguageClient. The UseLanguageClient inherits from DelegatingChatClient, which is key. This base class automatically forwards calls to the inner IChatClient it wraps, allowing us to override specific methods, like CompleteAsync, to inject our logic.

Inside CompleteAsync, we create a System message indicating the user's language and add it to the chatMessages collection before calling base.CompleteAsync, so the language instruction is part of the prompt sent to the LLM. The finally block removes the message again, even if the call throws, so the caller's chatMessages list is left as it was and the hint does not pile up across calls. This pattern is incredibly flexible for injecting contextual information.
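
Adding and then removing a message means the middleware briefly mutates a list owned by the caller. One alternative, sketched below, is to pass a copy to the inner client instead. UseLanguageCopyClient is a hypothetical name, and the sketch assumes the same preview API surface used in this article (CompleteAsync, ChatCompletion, and an IList<ChatMessage> parameter).

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

// Hypothetical variant of UseLanguageClient that never mutates the caller's list.
// Assumes the preview API used in this article (CompleteAsync / ChatCompletion).
public sealed class UseLanguageCopyClient(IChatClient inner, string language)
    : DelegatingChatClient(inner)
{
    public override Task<ChatCompletion> CompleteAsync(
        IList<ChatMessage> chatMessages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        // Copy the conversation and append the language hint to the copy only.
        var augmented = new List<ChatMessage>(chatMessages)
        {
            new ChatMessage(ChatRole.System, $"User language is {language}")
        };
        return base.CompleteAsync(augmented, options, cancellationToken);
    }
}
```

The trade-off is one extra list allocation per call, and any inner client that appends messages to the list will append them to the copy rather than to the caller's list.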

Custom Middleware: Rate Limiting

Another common requirement for interacting with external AI APIs is rate limiting. We don't want to accidentally hammer an API and get throttled or incur unexpected costs.

using System.Threading.RateLimiting;
using Microsoft.Extensions.AI;

public static class UseRateLimitMiddleware
{
    public static ChatClientBuilder UseRateLimitThreading(this ChatClientBuilder chatClientBuilder, TimeSpan window)
    {
        chatClientBuilder.Use(inner => new UseRateLimitClientWindow(inner, window));
        return chatClientBuilder;
    }

    private class UseRateLimitClientWindow : DelegatingChatClient
    {
        private readonly RateLimiter rateLimiter;

        public UseRateLimitClientWindow(IChatClient innerClient, TimeSpan window) : base(innerClient)
        {
            // One permit per window. AttemptAcquire never queues, so QueueLimit is not exercised here.
            FixedWindowRateLimiterOptions options = new FixedWindowRateLimiterOptions { Window = window, QueueLimit = 1, PermitLimit = 1 };
            rateLimiter = new FixedWindowRateLimiter(options);
        }

        public override async Task<ChatCompletion> CompleteAsync(IList<ChatMessage> chatMessages, ChatOptions? options = null, CancellationToken cancellationToken = default)
        {
            // Try to take a permit without waiting.
            using RateLimitLease lease = rateLimiter.AttemptAcquire();
            if (!lease.IsAcquired)
            {
                // Short-circuit: answer locally instead of calling the external API.
                return new ChatCompletion(new ChatMessage(ChatRole.Assistant, "Rate limit exceeded"));
            }
            return await base.CompleteAsync(chatMessages, options, cancellationToken);
        }
    }
}

Similar to the language middleware, UseRateLimitThreading wraps the inner client in a UseRateLimitClientWindow. This client uses System.Threading.RateLimiting.FixedWindowRateLimiter to control the flow. Before calling the inner client, it calls AttemptAcquire() to request a lease. If no lease is available (meaning we've hit our rate limit), it returns an immediate "Rate limit exceeded" message without even touching the external API. Otherwise, it proceeds to call base.CompleteAsync. With PermitLimit = 1, at most one request per window reaches the API, so our application respects API limits gracefully.
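
The limiter's behavior is easy to observe in isolation. The sketch below uses only System.Threading.RateLimiting, with the window shortened to one second (an assumption made purely to keep the demo quick). It contrasts AttemptAcquire, which fails immediately, with AcquireAsync, which waits in the queue for the next window.

```csharp
using System;
using System.Threading.RateLimiting;

// One permit per window, as in the middleware above; the 1-second window is a demo value.
using var limiter = new FixedWindowRateLimiter(new FixedWindowRateLimiterOptions
{
    Window = TimeSpan.FromSeconds(1),
    PermitLimit = 1,
    QueueLimit = 1
});

// AttemptAcquire never waits: it either gets a permit now or fails now.
using RateLimitLease first = limiter.AttemptAcquire();
using RateLimitLease second = limiter.AttemptAcquire();

Console.WriteLine(first.IsAcquired);  // True: the window's only permit
Console.WriteLine(second.IsAcquired); // False: rejected immediately, not queued

// AcquireAsync, by contrast, joins the queue (up to QueueLimit waiters)
// and completes once the next window makes a permit available.
using RateLimitLease third = await limiter.AcquireAsync();
Console.WriteLine(third.IsAcquired);
```

If you would rather delay a request than answer with "Rate limit exceeded", swapping AttemptAcquire for AcquireAsync inside the middleware is one way to do it, at the cost of making callers wait.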

Embeddings, Tensors, and Semantic Search

Now, let's switch gears to another powerful concept: embeddings. An embedding is a numerical representation (a vector of floating-point numbers) of a piece of text (or image, or audio, etc.) that captures its semantic meaning. Texts with similar meanings will have embeddings that are "close" to each other in a multi-dimensional space. This is the foundation of many advanced AI features like semantic search, recommendation systems, and Retrieval Augmented Generation (RAG).

Microsoft.Extensions.AI provides the IEmbeddingGenerator interface for this purpose.

Generating Embeddings

To generate embeddings, we first need an IEmbeddingGenerator implementation. For local development and experimentation, Ollama is fantastic.

IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator = new OllamaEmbeddingGenerator(new Uri("http://127.0.0.1:11434"), modelId: "all-minilm:latest");

var text = "Artificial Intelligence (AI) refers to the development of computer systems capable of performing tasks " +
    "that typically require human intelligence. " +
    "These tasks include reasoning, learning, problem-solving, perception, and language understanding.";

Console.WriteLine($"Original Text:{text}");

Embedding<float> result = await embeddingGenerator.GenerateEmbeddingAsync(text);

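// Copy the vector out of the ReadOnlyMemory<float> so we can print every component.
float[] vectorData = result.Vector.Span.ToArray();
foreach (float item in vectorData)
{
    Console.Write(item + " ");
}
Console.WriteLine();
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($"vector length:{result.Vector.Length}");
Console.ResetColor();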

Here, we instantiate OllamaEmbeddingGenerator, pointing it to our local Ollama server and specifying the all-minilm:latest model, a good general-purpose embedding model. We then call GenerateEmbeddingAsync with our input text. The result is an Embedding<float>, which exposes a Vector property: a ReadOnlyMemory<float> containing the numerical representation of our text. We can inspect its length and values.

We can also generate embeddings for multiple strings efficiently:

var strings = new List<string>();
var TacosAlPastor = "Tacos al pastor are thin slices of pork marinated with spices and pineapple or orange juice." +
    " The meat is stacked on a vertical spit," +
    " cooked, and then thinly cut. It is served in corn tortillas with pineapple," +
    " green salsa, and lime wedges";

var MachineLearning = "Machine learning is a subset of artificial intelligence that focuses on the development of algorithms and statistical models that computer systems use to perform specific tasks without explicit instructions. " +
    "These tasks include pattern recognition, data analysis, and decision-making.";

strings.Add(text); // Our original AI definition
strings.Add(TacosAlPastor);
strings.Add(MachineLearning);

(string Value, Embedding<float> Embedding)[] ZipEmbeddings = await embeddingGenerator.GenerateAndZipAsync(strings);

The GenerateAndZipAsync method is super handy; it generates embeddings for a collection of strings and returns an array of tuples, pairing each original string with its generated embedding. This is perfect for building a small in-memory semantic search index.

Semantic Search with Cosine Similarity

Once we have embeddings, how do we find similar texts? We compare their vectors! A common metric for this is cosine similarity, which measures the cosine of the angle between two vectors. A value close to 1 means the vectors point in nearly the same direction (high similarity), 0 means they are orthogonal (no similarity), and -1 means they point in opposite directions (though with text embeddings, you typically see values between 0 and 1).

.NET's TensorPrimitives class, in the System.Numerics.Tensors namespace (shipped as the System.Numerics.Tensors NuGet package), provides optimized methods for tensor operations, including CosineSimilarity.
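
As a quick sanity check of the definition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on small made-up vectors (not real embeddings) and compares the result with the built-in method. The Cosine helper is a hypothetical name introduced only for this example.

```csharp
using System;
using System.Numerics.Tensors;

float[] a = { 1f, 2f, 3f };
float[] b = { 2f, 4f, 6f };    // same direction as a, twice the length
float[] c = { -1f, -2f, -3f }; // opposite direction to a

// Cosine similarity by definition: dot(x, y) / (|x| * |y|).
static float Cosine(ReadOnlySpan<float> x, ReadOnlySpan<float> y) =>
    TensorPrimitives.Dot(x, y) / (TensorPrimitives.Norm(x) * TensorPrimitives.Norm(y));

Console.WriteLine(Cosine(a, b)); // approximately 1
Console.WriteLine(Cosine(a, c)); // approximately -1

// The built-in method should agree with the hand-written formula, up to rounding.
Console.WriteLine(TensorPrimitives.CosineSimilarity((ReadOnlySpan<float>)a, b));
```

Note that b is twice as long as a yet still scores about 1: cosine similarity compares direction only, not magnitude.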

var parameterValue = "What is A.I?";

Embedding<float> InputEmbedding = await embeddingGenerator.GenerateEmbeddingAsync(parameterValue);

var Closest = from candidate in ZipEmbeddings
              let similarity = TensorPrimitives.CosineSimilarity(candidate.Embedding.Vector.Span, InputEmbedding.Vector.Span)
              orderby similarity descending
              select new { Text = candidate.Value, Similarity = similarity};

foreach (var item in Closest)
{
    Console.WriteLine($"Similarity:{item.Similarity} Text:{item.Text}");
    Console.WriteLine();
}

Here's the magic!

We generate an embedding for our query, parameterValue ("What is A.I?").
We then use a LINQ query to iterate through our ZipEmbeddings (our "documents").
For each candidate document, we calculate the TensorPrimitives.CosineSimilarity between its embedding vector and our query's embedding vector. Notice the use of .Span for efficient memory access.
Finally, we order the results by similarity in descending order and print them.

The output will clearly show that "Artificial Intelligence (AI) refers to the development..." and "Machine learning is a subset of artificial intelligence..." are much closer to "What is A.I?" than the "Tacos al pastor" description, even though the words themselves might not directly match. This is the power of semantic understanding!
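
If you want to reuse this ranking logic, it can be pulled into a small helper that returns only the best k matches. The sketch below is one possible shape, not part of Microsoft.Extensions.AI: the Search class and TopK method are hypothetical names, and the three-dimensional vectors are made-up stand-ins for real embeddings so the example runs without an Ollama server.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Numerics.Tensors;

// Toy 3-dimensional "embeddings"; real ones would come from an IEmbeddingGenerator.
var index = new (string Text, ReadOnlyMemory<float> Vector)[]
{
    ("about AI",    new float[] { 0.9f, 0.1f, 0.0f }),
    ("about tacos", new float[] { 0.0f, 0.2f, 0.9f }),
    ("about ML",    new float[] { 0.8f, 0.3f, 0.1f }),
};
ReadOnlyMemory<float> query = new float[] { 1.0f, 0.2f, 0.0f };

// Lists "about AI" first, then "about ML"; "about tacos" is cut off by k.
foreach (var (text, score) in Search.TopK(index, query, k: 2))
{
    Console.WriteLine($"{score:F3} {text}");
}

public static class Search
{
    // Returns the k entries most similar to the query, best match first.
    public static IEnumerable<(string Text, float Score)> TopK(
        IEnumerable<(string Text, ReadOnlyMemory<float> Vector)> index,
        ReadOnlyMemory<float> query,
        int k)
    {
        return index
            .Select(doc => (doc.Text, Score: TensorPrimitives.CosineSimilarity(doc.Vector.Span, query.Span)))
            .OrderByDescending(hit => hit.Score)
            .Take(k);
    }
}
```

A linear scan like this is fine for a handful of documents; for larger collections you would typically move to a vector database or an approximate nearest-neighbor index.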

Wrapping Up

Today, we explored two incredibly powerful features of Microsoft.Extensions.AI:

Pipelines (Middleware): How AsBuilder() and DelegatingChatClient allow you to inject custom logic and cross-cutting concerns into your IChatClient interactions, similar to ASP.NET Core middleware. We saw examples for prompt augmentation and rate limiting.
Embeddings & Semantic Search: We learned how IEmbeddingGenerator creates numerical representations of text, and how System.Numerics.Tensors.TensorPrimitives.CosineSimilarity helps us find semantically similar documents, forming the backbone of RAG and other intelligent systems.

These features enable you to build much more sophisticated and robust AI applications. In our next part, we'll dive into the world of Tools and Functions, where our AI models can interact with external systems and perform actions. Stay tuned!