Structured RAG for Unknown and Mixed Languages

How I stopped my multilingual activity stream from turning RAG into chaos

In the previous article (RAG with PostgreSQL and C# (pros and cons) | Joche Ojeda) I explained how naïve RAG breaks when you run it over an activity stream.

Same UI language.
Totally unpredictable content language.
Spanish, Russian, Italian… sometimes all in the same message.

Humans handle that fine.
Vector retrieval… not so much.

This is the “silent failure” scenario: retrieval looks plausible, the LLM sounds confident, and you ship nonsense.

So I had to change the game.

The Idea: Structured RAG

Structured RAG means you don’t embed raw text and pray.

You add a step before retrieval:

Extract a structured representation from each activity record
Store it as metadata (JSON)
Use that metadata to filter, route, and rank
Then do vector similarity on a cleaner, more stable representation

Think of it like this:

Unstructured text is what users write.
Structured metadata is what your RAG system can trust.

Why This Fix Works for Mixed Languages

The core problem with activity streams is not “language”.

The core problem is: you have no stable shape.

When the shape is missing, everything becomes fuzzy:

Who is speaking?
What is this about?
Which entities are involved?
Is this a reply, a reaction, a mention, a task update?
What language(s) are in here?

Structured RAG forces you to answer those questions once, at write-time, and save the answers.

PostgreSQL: Add a JSONB Column (and Keep pgvector)

We keep the previous approach (pgvector) but we add a JSONB column for structured metadata.

ALTER TABLE activities
ADD COLUMN rag_meta jsonb NOT NULL DEFAULT '{}'::jsonb;

-- Optional: if you store embeddings per activity/chunk
-- you keep your existing embedding column(s) or chunk table.

Then index it.

CREATE INDEX activities_rag_meta_gin
ON activities
USING gin (rag_meta);

Now you can filter with JSON queries before you ever touch vector similarity.

A Proposed Schema (JSON Shape You Control)

The exact schema depends on your product, but for activity streams I want at least:

language: detected languages + confidence
actors: who did it
subjects: what object is involved (ticket, order, user, document)
topics: normalized tags
relationships: reply-to, mentions, references
summary: short canonical summary (ideally in one pivot language)
signals: sentiment/intent/type if you need it

Example JSON for one activity record:

{
  "schemaVersion": 1,
  "languages": [
    { "code": "es", "confidence": 0.92 },
    { "code": "ru", "confidence": 0.41 }
  ],
  "actor": {
    "id": "user:42",
    "displayName": "Joche"
  },
  "subjects": [
    { "type": "ticket", "id": "ticket:9831" }
  ],
  "topics": ["billing", "invoice", "error"],
  "relationships": {
    "replyTo": "activity:9912001",
    "mentions": ["user:7", "user:13"]
  },
  "intent": "support_request",
  "summary": {
    "pivotLanguage": "en",
    "text": "User reports an invoice calculation error and asks for help."
  }
}

Notice what happened here: the raw multilingual chaos got converted into a stable structure.

Write-Time Pipeline (The Part That Feels Expensive, But Saves You)

Structured RAG shifts work to ingestion time.

Yes, it costs tokens.
Yes, it adds steps.

But it gives you something you never had before: predictable retrieval.

Here’s the pipeline I recommend:

Store raw activity (as-is, don’t lose the original)
Detect language(s) (fast heuristic + LLM confirmation if needed)
Extract structured metadata into your JSON schema
Generate a canonical “summary” in a pivot language (often English)
Embed the summary + key fields (not the raw messy text)
Save JSON + embedding

The key decision: embed the stable representation, not the raw stream text.

C# Conceptual Implementation

I’m going to keep the code focused on the architecture. Provider details are swappable.

Entities

public sealed class Activity
{
    public long Id { get; set; }
    public string RawText { get; set; } = "";
    public string UiLanguage { get; set; } = "en";

    // JSONB column in Postgres
    public string RagMetaJson { get; set; } = "{}";

    // Vector (pgvector) - store via your pgvector mapping or raw SQL
    public float[] RagEmbedding { get; set; } = Array.Empty<float>();

    public DateTimeOffset CreatedAt { get; set; }
}

Metadata Contract (Strongly Typed in Code, Stored as JSONB)

public sealed class RagMeta
{
    public int SchemaVersion { get; set; } = 1;
    public List<DetectedLanguage> Languages { get; set; } = new();
    public ActorMeta Actor { get; set; } = new();
    public List<SubjectMeta> Subjects { get; set; } = new();
    public List<string> Topics { get; set; } = new();
    public RelationshipMeta Relationships { get; set; } = new();
    public string Intent { get; set; } = "unknown";
    public SummaryMeta Summary { get; set; } = new();
}

public sealed class DetectedLanguage
{
    public string Code { get; set; } = "und";
    public double Confidence { get; set; }
}

public sealed class ActorMeta
{
    public string Id { get; set; } = "";
    public string DisplayName { get; set; } = "";
}

public sealed class SubjectMeta
{
    public string Type { get; set; } = "";
    public string Id { get; set; } = "";
}

public sealed class RelationshipMeta
{
    public string? ReplyTo { get; set; }
    public List<string> Mentions { get; set; } = new();
}

public sealed class SummaryMeta
{
    public string PivotLanguage { get; set; } = "en";
    public string Text { get; set; } = "";
}

Extractor + Embeddings

You need two services:

Metadata extraction (LLM fills the schema)
Embeddings (Microsoft.Extensions.AI) for the stable text

public interface IRagMetaExtractor
{
    Task<RagMeta> ExtractAsync(Activity activity, CancellationToken ct);
}

Then the ingestion pipeline:

using System.Text.Json;
using Microsoft.Extensions.AI;

public sealed class StructuredRagIngestor
{
    private readonly IRagMetaExtractor _extractor;
    private readonly IEmbeddingGenerator<string, Embedding<float>> _embeddings;

    public StructuredRagIngestor(
        IRagMetaExtractor extractor,
        IEmbeddingGenerator<string, Embedding<float>> embeddings)
    {
        _extractor = extractor;
        _embeddings = embeddings;
    }

    public async Task ProcessAsync(Activity activity, CancellationToken ct)
    {
        // 1) Extract structured JSON
        RagMeta meta = await _extractor.ExtractAsync(activity, ct);

        // 2) Create stable text for embeddings (summary + keywords)
        string stableText =
            $"{meta.Summary.Text}\n" +
            $"Topics: {string.Join(", ", meta.Topics)}\n" +
            $"Intent: {meta.Intent}";

        // 3) Embed stable text
        var emb = await _embeddings.GenerateAsync(new[] { stableText }, ct);
        float[] vector = emb.First().Vector.ToArray();

        // 4) Save into activity record
        activity.RagMetaJson = JsonSerializer.Serialize(meta);
        activity.RagEmbedding = vector;

        // db.SaveChangesAsync(ct) happens outside (unit of work)
    }
}

This is the core move: you stop embedding chaos and start embedding structure.

Query Pipeline: JSON First, Vectors Second

When querying, you don’t jump into similarity search immediately.

You do:

Parse the user question
Decide filters (actor, subject type, topic)
Filter with JSONB (fast narrowing)
Then do vector similarity on the remaining set

Example: filter by topic + intent using JSONB:

SELECT id, raw_text
FROM activities
WHERE rag_meta @> '{"intent":"support_request"}'::jsonb
  AND rag_meta->'topics' ? 'invoice'
ORDER BY rag_embedding <=> @query_embedding
LIMIT 20;

That “JSON first” step is what keeps multilingual streams from poisoning your retrieval.

Tradeoffs (Because Nothing Is Free)

Structured RAG costs more at write-time:

more tokens
more latency
more moving parts

But it saves you at query-time:

less noise
better precision
more predictable answers
debuggable failures (because you can inspect metadata)

In real systems, I’ll take predictable and debuggable over “cheap but random” every day.

Final Thought

RAG over activity streams is hard because activity streams are messy by design.

If you want RAG to behave, you need structure.

Structured RAG is how you make retrieval boring again.
And boring retrieval is exactly what you want.

In the next article, I’ll go deeper into the exact pipeline details: language routing, mixed-language detection, pivot summaries, chunk policies, and how I made this production-friendly without turning it into a token-burning machine.

Let the year begin 🚀

“`