Getting Started with Microsoft.Extensions.AI — Part 2: Messages & Strongly-Typed Output

Welcome back to "Getting Started with Microsoft.Extensions.AI"! In Part 1, we got our feet wet with basic chat completions. Today, we're diving deeper into building more sophisticated requests and, even better, getting structured, strongly-typed data back from our AI models. This is where the magic really starts to happen for building robust applications.

Building Conversational History with `ChatMessage`

When you're interacting with an AI model, especially in a chat scenario, context is everything. The Microsoft.Extensions.AI library helps us manage this with the ChatMessage class. Each ChatMessage has a ChatRole (like User or Assistant) and content.

Here's how you can send a simple prompt:

List<ChatMessage> Prompt = new List<ChatMessage>()
{
    new ChatMessage(ChatRole.User, "Describe what C# is in 100 words")
};

ChatCompletion Result = await CurrentClient.CompleteAsync(Prompt);

But a single prompt isn't a conversation, is it? To maintain context and build a chat history, you simply add more ChatMessage objects to your list, alternating roles. The model then uses this entire list to understand the ongoing discussion.

List<ChatMessage> chatMessages = new List<ChatMessage>()
{
    new ChatMessage(ChatRole.User, "What is the capital of El Salvador?"),
    new ChatMessage(ChatRole.Assistant, "The capital of El Salvador is San Salvador."),
    new ChatMessage(ChatRole.User, "What is the capital of the Dominican Republic?"),
    new ChatMessage(ChatRole.Assistant, "The capital of the Dominican Republic is Santo Domingo.")
};

chatMessages.Add(new ChatMessage(ChatRole.User, "Which countries have we mentioned in this conversation?"));

Result = await CurrentClient.CompleteAsync(chatMessages);

Notice how we add the user's new question to the existing list. When CompleteAsync is called, the entire chatMessages list is sent, giving the model the full conversation history to generate a relevant response.
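To keep a conversation going over several turns, the model's reply also has to be appended to the list before the next question is asked. Here is a minimal sketch of that loop step. The Conversation class and the send delegate are hypothetical names introduced for illustration; send stands in for whatever wraps your CurrentClient.CompleteAsync call and returns the reply text.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

public static class Conversation
{
    // Adds the user's question, sends the whole history, then records the
    // reply so that the next turn includes it as context.
    // 'send' is a hypothetical stand-in for the real client call.
    public static async Task<string> AskAsync(
        List<ChatMessage> history,
        string question,
        Func<IList<ChatMessage>, Task<string>> send)
    {
        history.Add(new ChatMessage(ChatRole.User, question));

        // The full list goes out on every call, not just the new question.
        string reply = await send(history);

        history.Add(new ChatMessage(ChatRole.Assistant, reply));
        return reply;
    }
}
```

Each call grows the history by two messages, one per role, so the list always alternates the way the hand-written example above does.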

Multimodal Magic: Attaching Images to Messages

Modern large language models aren't just about text anymore; they're multimodal! This means they can process and understand information from various sources, including images. With Microsoft.Extensions.AI, attaching images to your messages is straightforward.

Each ChatMessage has a Contents property, which is a collection where you can add different types of content. For images, we use ImageContent. You just need the image's byte array and its MIME type.

Let's see an example from 3-ChatMessage/Program.cs where we ask the model to describe a picture of a puppy:

ChatMessage Message = new ChatMessage(ChatRole.User, "Describe what is in the picture in 500 characters or fewer");
Console.WriteLine(Message.ToString() + Environment.NewLine);

// Read the image and attach it with its registered MIME type
ReadOnlyMemory<byte> Image = File.ReadAllBytes("puppy.jpg");
Message.Contents.Add(new ImageContent(Image, "image/jpeg"));

Result = await CurrentClient.CompleteAsync(new List<ChatMessage>() { Message });

Here, we read puppy.jpg into a byte array and add it to the Message.Contents list. The model then processes both the text prompt and the image to generate its description. This capability opens up a whole new world of applications, from image analysis to visual Q&A.
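One detail worth getting right is the MIME type: the registered media type for JPEG files is image/jpeg, whatever the file extension says. If you attach images of several formats, a small lookup avoids typing the string by hand each time. This is only a sketch; ImageMime is a hypothetical helper, not part of the library, and it covers just a few common formats.

```csharp
using System;
using System.IO;

public static class ImageMime
{
    // Maps a file extension to its registered image media type.
    // Only a handful of common formats are covered; extend as needed.
    public static string FromPath(string path)
    {
        string extension = Path.GetExtension(path).ToLowerInvariant();
        return extension switch
        {
            ".jpg" or ".jpeg" => "image/jpeg",
            ".png" => "image/png",
            ".gif" => "image/gif",
            ".webp" => "image/webp",
            _ => throw new NotSupportedException(
                $"Unsupported image extension: '{extension}'")
        };
    }
}
```

With that in place, the second argument to ImageContent could be ImageMime.FromPath("puppy.jpg") instead of a literal.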

Turning LLMs into Typed Functions: Strongly-Typed Output

This is, for me, one of the most powerful features. Getting raw text back from an LLM is fine for casual chat, but in an application, you often need structured data. Parsing strings is fragile and error-prone. This is where CompleteAsync<T>() shines.

The Microsoft.Extensions.AI library lets you define a C# class and ask for the response as an instance of it: the model is asked to answer with JSON that matches the class, and the library deserializes that JSON for you.

Consider this CatCollectionDescription class from 4-StructureOutput/CatCollectionDescription.cs:

using System.ComponentModel;
using System.Text.Json.Serialization;

public class CatCollectionDescription
{
    [JsonPropertyName("numberOfBlackCats")]
    [Description("Number of black cats")]
    public int NumberOfBlackCats { get; set; }

    [JsonPropertyName("numberOfWhiteCats")]
    [Description("Number of white cats")]
    public int NumberOfWhiteCats { get; set; }

    [JsonPropertyName("numberOfOtherAnimals")]
    [Description("Number of other animals that are NOT cats")]
    public int NumberOfOtherAnimals { get; set; }

    [JsonPropertyName("numberOfNotAnimals")]
    [Description("Number of other objects that are not animals")]
    public int NumberOfNotAnimals { get; set; }
}

Notice the [JsonPropertyName] attributes. These guide the model on how to name the JSON properties in its output. The [Description] attributes are also incredibly helpful; they provide a clear human-readable explanation of each property, which the model uses to understand your intent and populate the properties accurately.
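To see what [JsonPropertyName] does on its own, you can serialize an instance with System.Text.Json and look at the keys. The sketch below uses a trimmed-down, hypothetical CatCounts class rather than the full CatCollectionDescription, and it only shows the System.Text.Json side of things; how the library builds its request to the model is not shown here.

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical two-property version of the class above.
public class CatCounts
{
    [JsonPropertyName("numberOfBlackCats")]
    public int NumberOfBlackCats { get; set; }

    [JsonPropertyName("numberOfWhiteCats")]
    public int NumberOfWhiteCats { get; set; }
}

public static class JsonShape
{
    // Serializes a sample instance so the JSON key names are visible.
    public static string Sample() =>
        JsonSerializer.Serialize(
            new CatCounts { NumberOfBlackCats = 2, NumberOfWhiteCats = 1 });
}
```

The resulting JSON uses the camelCase names from the attributes as its keys, not the PascalCase C# property names. That is the shape the model's answer needs to have for it to map back onto the class.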

Now, let's combine multimodal input with strongly-typed output. In 4-StructureOutput/Program.cs, we send multiple images (cats, puppies, robots) and ask the model to analyze them, returning the results as a CatCollectionDescription object:

var Message = new ChatMessage(ChatRole.User,
    "Analyze these images to count the number of black cats, " +
    "white cats, other animals, " +
    "and objects that are NOT animals");

// Read the bytes of each image
byte[] catsBytes = File.ReadAllBytes("Cats.jpg");
byte[] puppiesBytes = File.ReadAllBytes("Puppies.jpg");
byte[] robotsBytes = File.ReadAllBytes("Robots.jpg");

// Attach all three images to the same message
Message.Contents.Add(new ImageContent(catsBytes, "image/jpeg"));
Message.Contents.Add(new ImageContent(puppiesBytes, "image/jpeg"));
Message.Contents.Add(new ImageContent(robotsBytes, "image/jpeg"));

ChatCompletion<CatCollectionDescription> Answer =
    await CurrentClient.CompleteAsync<CatCollectionDescription>(new List<ChatMessage>() { Message });

After the model processes the request, the Answer object is a ChatCompletion<CatCollectionDescription>. This means its Result property is no longer a raw string but a fully hydrated CatCollectionDescription instance!

You can then access the data using familiar C# properties:

Console.WriteLine($"Number of black cats: {Answer.Result.NumberOfBlackCats}");
Console.WriteLine($"Number of white cats: {Answer.Result.NumberOfWhiteCats}");
Console.WriteLine($"Number of other animals: {Answer.Result.NumberOfOtherAnimals}");
Console.WriteLine($"Number of other objects that are NOT animals: {Answer.Result.NumberOfNotAnimals}");
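A typed result still depends on the model producing JSON that actually parses, and that is not guaranteed on every call. If you ever handle the raw JSON text yourself, a defensive parse along these lines keeps a bad response from crashing the app. This is a sketch built directly on System.Text.Json; SafeJson and AnimalCount are hypothetical names, and it is not the library's own result handling.

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical one-property type used only to exercise the helper.
public class AnimalCount
{
    [JsonPropertyName("numberOfOtherAnimals")]
    public int NumberOfOtherAnimals { get; set; }
}

public static class SafeJson
{
    // Returns true and sets 'value' when 'text' is valid JSON for T;
    // returns false for malformed JSON or a JSON null.
    public static bool TryParse<T>(string text, out T? value) where T : class
    {
        try
        {
            value = JsonSerializer.Deserialize<T>(text);
            return value is not null;
        }
        catch (JsonException)
        {
            value = null;
            return false;
        }
    }
}
```

When TryParse returns false, the caller can retry the request, fall back to a default, or log the raw text for inspection, whichever suits the application.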

This transforms the large language model from a text generator into a highly flexible, typed "function" that you can call from your C# code, significantly reducing complexity and increasing the reliability of your AI-powered applications. It's like having a smart API endpoint that you define dynamically with a C# class!

What's Next?

We've covered building rich, multimodal messages and getting structured data back. These are fundamental building blocks for more complex AI interactions. Next up, in Part 3, we'll explore how to manage multiple completions and apply configurations like function invocation and rate limiting using pipelines.

You can find all the code for these examples in the egarim/IntroductionToMsAiExtensions repository. Go ahead, clone it, and try it out yourself!