Understanding the N+1 Database Problem using Entity Framework Core

What is the N+1 Problem?

Imagine you’re running a blog website and want to display a list of all blogs along with how many posts each one has. The N+1 problem is a common database performance issue that happens when your application makes far too many database trips to get this simple information: one query to load the list of blogs, then one extra query per blog to load its posts – N extra queries for N blogs, hence the name “N+1”.

Our Test Database Setup

Our test suite creates a realistic blog scenario with:

  • 3 different blogs
  • Multiple posts for each blog
  • Comments on posts
  • Tags associated with blogs

This mirrors real-world applications where data is interconnected and needs to be loaded efficiently.
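
For reference, the tests assume a small entity model along these lines (a sketch reconstructed from the column names in the SQL output below; the navigation properties are declared virtual so the lazy-loading examples can intercept them):

public class Blog
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string Description { get; set; }
    public DateTime CreatedDate { get; set; }
    public virtual List<Post> Posts { get; set; } = new();
    public virtual List<Tag> Tags { get; set; } = new();   // many-to-many via the BlogTag join table
}

public class Post
{
    public int Id { get; set; }
    public int BlogId { get; set; }
    public string Title { get; set; }
    public string Content { get; set; }
    public DateTime PublishedDate { get; set; }
    public virtual List<Comment> Comments { get; set; } = new();
}

public class Comment
{
    public int Id { get; set; }
    public int PostId { get; set; }
    public string Author { get; set; }
    public string Content { get; set; }
    public DateTime CreatedDate { get; set; }
}

public class Tag
{
    public int Id { get; set; }
    public string Name { get; set; }
    public virtual List<Blog> Blogs { get; set; } = new();
}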

Test Case 1: The Classic N+1 Problem (Lazy Loading)

What it does: This test demonstrates how “lazy loading” can accidentally create the N+1 problem. Lazy loading sounds helpful – it automatically fetches related data when you need it. But this convenience comes with a hidden cost.

The Code:

[Test]
public void Test_N_Plus_One_Problem_With_Lazy_Loading()
{
    var blogs = _context.Blogs.ToList(); // Query 1: Load blogs
    
    foreach (var blog in blogs)
    {
        var postCount = blog.Posts.Count; // Each access triggers a query!
        TestLogger.WriteLine($"Blog: {blog.Title} - Posts: {postCount}");
    }
}

The SQL Queries Generated:

-- Query 1: Load all blogs
SELECT "b"."Id", "b"."CreatedDate", "b"."Description", "b"."Title"
FROM "Blogs" AS "b"

-- Query 2: Load posts for Blog 1 (triggered by lazy loading)
SELECT "p"."Id", "p"."BlogId", "p"."Content", "p"."PublishedDate", "p"."Title"
FROM "Posts" AS "p"
WHERE "p"."BlogId" = 1

-- Query 3: Load posts for Blog 2 (triggered by lazy loading)
SELECT "p"."Id", "p"."BlogId", "p"."Content", "p"."PublishedDate", "p"."Title"
FROM "Posts" AS "p"
WHERE "p"."BlogId" = 2

-- Query 4: Load posts for Blog 3 (triggered by lazy loading)
SELECT "p"."Id", "p"."BlogId", "p"."Content", "p"."PublishedDate", "p"."Title"
FROM "Posts" AS "p"
WHERE "p"."BlogId" = 3

The Problem: 4 total queries (1 + 3) – Each time you access blog.Posts.Count, lazy loading triggers a separate database trip.
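
Note: for lazy loading to behave this way, it has to be enabled on the context. A minimal sketch, assuming the Microsoft.EntityFrameworkCore.Proxies package and virtual navigation properties (the SQLite provider and connection string here are just placeholders):

protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
    optionsBuilder
        .UseLazyLoadingProxies()            // generate proxies that load navigations on first access
        .UseSqlite("Data Source=blogs.db"); // placeholder provider; any relational provider works
}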

Test Case 2: Alternative N+1 Demonstration

What it does: This test manually recreates the N+1 pattern to show exactly what’s happening, even when lazy loading isn’t configured or enabled.

The Code:

[Test]
public void Test_N_Plus_One_Problem_Alternative_Approach()
{
    var blogs = _context.Blogs.ToList(); // Query 1
    
    foreach (var blog in blogs)
    {
        // This explicitly loads posts for THIS blog only (simulates lazy loading)
        var posts = _context.Posts.Where(p => p.BlogId == blog.Id).ToList();
        TestLogger.WriteLine($"Loaded {posts.Count} posts for blog {blog.Id}");
    }
}

The Lesson: This explicitly demonstrates the N+1 pattern with manual queries. The result is identical to lazy loading – one query per blog plus the initial blogs query.

Test Case 3: N+1 vs Include() – Side by Side Comparison

What it does: This is the money shot – a direct comparison showing the dramatic difference between the problematic approach and the solution.

The Bad Code (N+1):

// BAD: N+1 Problem
var blogsN1 = _context.Blogs.ToList(); // Query 1
foreach (var blog in blogsN1)
{
    var posts = _context.Posts.Where(p => p.BlogId == blog.Id).ToList(); // Queries 2,3,4...
}

The Good Code (Include):

// GOOD: Include() Solution  
var blogsInclude = _context.Blogs
    .Include(b => b.Posts)
    .ToList(); // Single query with JOIN

foreach (var blog in blogsInclude)
{
    // No additional queries needed - data is already loaded!
    var postCount = blog.Posts.Count;
}

The SQL Queries:

Bad Approach (Multiple Queries):

-- Same 4 separate queries as shown in Test Case 1

Good Approach (Single Query):

SELECT "b"."Id", "b"."CreatedDate", "b"."Description", "b"."Title", 
       "p"."Id", "p"."BlogId", "p"."Content", "p"."PublishedDate", "p"."Title"
FROM "Blogs" AS "b"
LEFT JOIN "Posts" AS "p" ON "b"."Id" = "p"."BlogId"
ORDER BY "b"."Id"

Results from our test:

  • Bad approach: 4 total queries (1 + 3)
  • Good approach: 1 total query
  • Performance improvement: 75% fewer database round trips!

Test Case 4: Guaranteed N+1 Problem

What it does: This test removes any doubt by explicitly demonstrating the N+1 pattern with clear step-by-step output.

The Code:

[Test]
public void Test_Guaranteed_N_Plus_One_Problem()
{
    var blogs = _context.Blogs.ToList(); // Query 1
    int queryCount = 1;

    foreach (var blog in blogs)
    {
        queryCount++;
        // This explicitly demonstrates the N+1 pattern
        var posts = _context.Posts.Where(p => p.BlogId == blog.Id).ToList();
        TestLogger.WriteLine($"Loading posts for blog '{blog.Title}' (Query #{queryCount})");
    }
}

Why it’s useful: This ensures we can always see the problem clearly by manually executing the problematic pattern, making it impossible to miss.

Test Case 5: Eager Loading with Include()

What it does: Shows the correct way to load related data upfront using Include().

The Code:

[Test]
public void Test_Eager_Loading_With_Include()
{
    var blogsWithPosts = _context.Blogs
        .Include(b => b.Posts)
        .ToList();

    foreach (var blog in blogsWithPosts)
    {
        // No additional queries - data already loaded!
        TestLogger.WriteLine($"Blog: {blog.Title} - Posts: {blog.Posts.Count}");
    }
}

The SQL Query:

SELECT "b"."Id", "b"."CreatedDate", "b"."Description", "b"."Title", 
       "p"."Id", "p"."BlogId", "p"."Content", "p"."PublishedDate", "p"."Title"
FROM "Blogs" AS "b"
LEFT JOIN "Posts" AS "p" ON "b"."Id" = "p"."BlogId"
ORDER BY "b"."Id"

The Benefit: One database trip loads everything. When you access blog.Posts.Count, the data is already there.

Test Case 6: Multiple Includes with ThenInclude()

What it does: Demonstrates loading deeply nested data – blogs → posts → comments – all in one query.

The Code:

[Test]
public void Test_Multiple_Includes_With_ThenInclude()
{
    var blogsWithPostsAndComments = _context.Blogs
        .Include(b => b.Posts)
            .ThenInclude(p => p.Comments)
        .ToList();

    foreach (var blog in blogsWithPostsAndComments)
    {
        foreach (var post in blog.Posts)
        {
            // All data loaded in one query!
            TestLogger.WriteLine($"Post: {post.Title} - Comments: {post.Comments.Count}");
        }
    }
}

The SQL Query:

SELECT "b"."Id", "b"."CreatedDate", "b"."Description", "b"."Title",
       "p"."Id", "p"."BlogId", "p"."Content", "p"."PublishedDate", "p"."Title",
       "c"."Id", "c"."Author", "c"."Content", "c"."CreatedDate", "c"."PostId"
FROM "Blogs" AS "b"
LEFT JOIN "Posts" AS "p" ON "b"."Id" = "p"."BlogId"
LEFT JOIN "Comments" AS "c" ON "p"."Id" = "c"."PostId"
ORDER BY "b"."Id", "p"."Id"

The Challenge: Loading three levels of data in one optimized query instead of potentially hundreds of separate queries.

Test Case 7: Projection with Select()

What it does: Shows how to load only the specific data you actually need instead of entire objects.

The Code:

[Test]
public void Test_Projection_With_Select()
{
    var blogData = _context.Blogs
        .Select(b => new
        {
            BlogTitle = b.Title,
            PostCount = b.Posts.Count(),
            RecentPosts = b.Posts
                .OrderByDescending(p => p.PublishedDate)
                .Take(2)
                .Select(p => new { p.Title, p.PublishedDate })
        })
        .ToList();
}

The SQL Query (from our test output):

SELECT "b"."Title", (
    SELECT COUNT(*)
    FROM "Posts" AS "p"
    WHERE "b"."Id" = "p"."BlogId"), "b"."Id", "t0"."Title", "t0"."PublishedDate", "t0"."Id"
FROM "Blogs" AS "b"
LEFT JOIN (
    SELECT "t"."Title", "t"."PublishedDate", "t"."Id", "t"."BlogId"
    FROM (
        SELECT "p0"."Title", "p0"."PublishedDate", "p0"."Id", "p0"."BlogId", 
               ROW_NUMBER() OVER(PARTITION BY "p0"."BlogId" ORDER BY "p0"."PublishedDate" DESC) AS "row"
        FROM "Posts" AS "p0"
    ) AS "t"
    WHERE "t"."row" <= 2
) AS "t0" ON "b"."Id" = "t0"."BlogId"
ORDER BY "b"."Id", "t0"."BlogId", "t0"."PublishedDate" DESC

Why it matters: This query only loads the specific fields needed, uses window functions for efficiency, and calculates counts in the database rather than loading full objects.

Test Case 8: Split Query Strategy

What it does: Demonstrates an alternative approach where one large JOIN is split into multiple optimized queries.

The Code:

[Test]
public void Test_Split_Query()
{
    var blogs = _context.Blogs
        .AsSplitQuery()
        .Include(b => b.Posts)
        .Include(b => b.Tags)
        .ToList();
}

The SQL Queries (from our test output):

-- Query 1: Load blogs
SELECT "b"."Id", "b"."CreatedDate", "b"."Description", "b"."Title"
FROM "Blogs" AS "b"
ORDER BY "b"."Id"

-- Query 2: Load posts (automatically generated)
SELECT "p"."Id", "p"."BlogId", "p"."Content", "p"."PublishedDate", "p"."Title", "b"."Id"
FROM "Blogs" AS "b"
INNER JOIN "Posts" AS "p" ON "b"."Id" = "p"."BlogId"
ORDER BY "b"."Id"

-- Query 3: Load tags (automatically generated)
SELECT "t"."Id", "t"."Name", "b"."Id"
FROM "Blogs" AS "b"
INNER JOIN "BlogTag" AS "bt" ON "b"."Id" = "bt"."BlogsId"
INNER JOIN "Tags" AS "t" ON "bt"."TagsId" = "t"."Id"
ORDER BY "b"."Id"

When to use it: When JOINing lots of related data creates one massive, slow query. Split queries break this into several smaller, faster queries.
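
If most of your queries benefit from this strategy, split queries can also be made the context-wide default instead of calling AsSplitQuery() on each query; individual queries can still opt back into a single JOIN with AsSingleQuery(). A minimal sketch (the provider and connection string are placeholders):

protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
    optionsBuilder.UseSqlite(
        "Data Source=blogs.db",
        o => o.UseQuerySplittingBehavior(QuerySplittingBehavior.SplitQuery)); // split is now the default
}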

Test Case 9: Filtered Include()

What it does: Shows how to load only specific related data – in this case, only recent posts from the last 15 days.

The Code:

[Test]
public void Test_Filtered_Include()
{
    var cutoffDate = DateTime.Now.AddDays(-15);
    var blogsWithRecentPosts = _context.Blogs
        .Include(b => b.Posts.Where(p => p.PublishedDate > cutoffDate))
        .ToList();
}

The SQL Query:

SELECT "b"."Id", "b"."CreatedDate", "b"."Description", "b"."Title",
       "p"."Id", "p"."BlogId", "p"."Content", "p"."PublishedDate", "p"."Title"
FROM "Blogs" AS "b"
LEFT JOIN "Posts" AS "p" ON "b"."Id" = "p"."BlogId" AND "p"."PublishedDate" > @cutoffDate
ORDER BY "b"."Id"

The Efficiency: Only loads posts that meet the criteria, reducing data transfer and memory usage.

Test Case 10: Explicit Loading

What it does: Demonstrates manually controlling when related data gets loaded.

The Code:

[Test]
public void Test_Explicit_Loading()
{
    var blogs = _context.Blogs.ToList(); // Load blogs only
    
    // Now explicitly load posts for all blogs
    foreach (var blog in blogs)
    {
        _context.Entry(blog)
            .Collection(b => b.Posts)
            .Load();
    }
}

The SQL Queries:

-- Query 1: Load blogs
SELECT "b"."Id", "b"."CreatedDate", "b"."Description", "b"."Title"
FROM "Blogs" AS "b"

-- Query 2: Explicitly load posts for blog 1
SELECT "p"."Id", "p"."BlogId", "p"."Content", "p"."PublishedDate", "p"."Title"
FROM "Posts" AS "p"
WHERE "p"."BlogId" = 1

-- Query 3: Explicitly load posts for blog 2
SELECT "p"."Id", "p"."BlogId", "p"."Content", "p"."PublishedDate", "p"."Title"
FROM "Posts" AS "p"
WHERE "p"."BlogId" = 2

-- ... and so on

When useful: When you sometimes need related data and sometimes don’t. You control exactly when the additional database trip happens.
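
Explicit loading also lets you filter or aggregate the related data at the moment you load it. A sketch reusing the 15-day cutoff idea from Test Case 9:

// Load only recent posts for a single blog (one SQL query with a WHERE clause)
var cutoffDate = DateTime.Now.AddDays(-15);
_context.Entry(blog)
    .Collection(b => b.Posts)
    .Query()
    .Where(p => p.PublishedDate > cutoffDate)
    .Load();

// Or count a blog's posts without loading them into memory at all
var recentPostCount = _context.Entry(blog)
    .Collection(b => b.Posts)
    .Query()
    .Count();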

Test Case 11: Batch Loading Pattern

What it does: Shows a clever technique to avoid N+1 by loading all related data in one query, then organizing it in memory.

The Code:

[Test]
public void Test_Batch_Loading_Pattern()
{
    var blogs = _context.Blogs.ToList(); // Query 1
    var blogIds = blogs.Select(b => b.Id).ToList();

    // Single query to get all posts for all blogs
    var posts = _context.Posts
        .Where(p => blogIds.Contains(p.BlogId))
        .ToList(); // Query 2

    // Group posts by blog in memory
    var postsByBlog = posts.GroupBy(p => p.BlogId).ToDictionary(g => g.Key, g => g.ToList());
}

The SQL Queries:

-- Query 1: Load all blogs
SELECT "b"."Id", "b"."CreatedDate", "b"."Description", "b"."Title"
FROM "Blogs" AS "b"

-- Query 2: Load ALL posts for ALL blogs in one query
SELECT "p"."Id", "p"."BlogId", "p"."Content", "p"."PublishedDate", "p"."Title"
FROM "Posts" AS "p"
WHERE "p"."BlogId" IN (1, 2, 3)

The Result: Just 2 queries total, regardless of how many blogs you have. Data organization happens in memory.
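
With the dictionary built, the per-blog lookups happen entirely in memory. A short sketch of how the grouped data might then be consumed:

foreach (var blog in blogs)
{
    // TryGetValue handles blogs that have no posts at all
    var blogPosts = postsByBlog.TryGetValue(blog.Id, out var list) ? list : new List<Post>();
    TestLogger.WriteLine($"Blog: {blog.Title} - Posts: {blogPosts.Count}");
}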

Test Case 12: Performance Comparison

What it does: Puts all the approaches head-to-head to show their relative performance.

The Code:

[Test]
public void Test_Performance_Comparison()
{
    // N+1 Problem (Multiple Queries)
    var blogs1 = _context.Blogs.ToList();
    foreach (var blog in blogs1)
    {
        var count = blog.Posts.Count(); // Triggers separate query
    }

    // Eager Loading (Single Query)
    var blogs2 = _context.Blogs
        .Include(b => b.Posts)
        .ToList();

    // Projection (Minimal Data)
    var blogSummaries = _context.Blogs
        .Select(b => new { b.Title, PostCount = b.Posts.Count() })
        .ToList();
}

The SQL Queries Generated:

N+1 Problem: 4 separate queries (as shown in previous examples)

Eager Loading: 1 JOIN query (as shown in Test Case 5)

Projection: 1 optimized query with subquery:

SELECT "b"."Title", (
    SELECT COUNT(*)
    FROM "Posts" AS "p"
    WHERE "b"."Id" = "p"."BlogId") AS "PostCount"
FROM "Blogs" AS "b"

Real-World Performance Impact

Let’s scale this up to see why it matters:

Small Application (10 blogs):

  • N+1 approach: 11 queries (≈110ms)
  • Optimized approach: 1 query (≈10ms)
  • Time saved: 100ms

Medium Application (100 blogs):

  • N+1 approach: 101 queries (≈1,010ms)
  • Optimized approach: 1 query (≈10ms)
  • Time saved: 1 second

Large Application (1000 blogs):

  • N+1 approach: 1001 queries (≈10,010ms)
  • Optimized approach: 1 query (≈10ms)
  • Time saved: 10 seconds

Key Takeaways

  1. The N+1 problem gets worse as your data grows – the number of queries scales linearly with the number of rows you iterate over
  2. Lazy loading is convenient but dangerous – it can hide performance problems
  3. Include() is your friend for loading related data efficiently
  4. Projection is powerful when you only need specific fields
  5. Different problems need different solutions – there’s no one-size-fits-all approach
  6. SQL query inspection is crucial – always check what queries your ORM generates (see the logging sketch below)
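
One straightforward way to do that inspection during development is to plug a logger into the context configuration. A minimal sketch (the test suite above may capture its SQL differently; LogTo is simply the standard EF Core hook):

protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
    optionsBuilder
        .LogTo(Console.WriteLine, LogLevel.Information) // requires Microsoft.Extensions.Logging for LogLevel
        .EnableSensitiveDataLogging();                  // include parameter values (development only)
}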

The Bottom Line

This test suite shows that small changes in how you write database queries can transform a slow, database-heavy operation into a fast, efficient one. The difference between a frustrated user waiting 10 seconds for a page to load and a happy user getting instant results often comes down to understanding and avoiding the N+1 problem.

The beauty of these tests is that they use real database queries with actual SQL output, so you can see exactly what’s happening under the hood. Understanding these patterns will make you a more effective developer and help you build applications that stay fast as they grow.

You can find the source code for this article in my repository.

SyncFramework Update: Now Supporting .NET 9 and EF Core 9!

SyncFramework is a C# library that simplifies data synchronization using delta encoding technology. Instead of transferring entire datasets, it efficiently synchronizes by tracking and transmitting only the changes between data versions, significantly reducing bandwidth and processing overhead.

What’s New

  • All packages now target .NET 9
  • BIT.Data.Sync packages updated to support the latest framework
  • Entity Framework Core packages upgraded to EF Core 9
  • Various minor fixes and improvements

Available Implementations

  • SyncFramework for XPO: For DevExpress XPO users
  • SyncFramework for Entity Framework Core: For EF Core users

Package Statistics

Our packages have been serving the community well, with steady adoption:

  • BIT.Data.Sync: 2,142 downloads
  • BIT.Data.Sync.AspNetCore: 1,064 downloads
  • BIT.Data.Sync.AspNetCore.Xpo: 521 downloads
  • BIT.Data.Sync.EfCore: 1,691 downloads
  • BIT.Data.Sync.EfCore.Npgsql: 1,120 downloads
  • BIT.Data.Sync.EfCore.Pomelo.MySql: 1,172 downloads
  • BIT.Data.Sync.EfCore.Sqlite: 887 downloads
  • BIT.Data.Sync.EfCore.SqlServer: 982 downloads

Resources

NuGet Packages
Source Code

As always, you can compile the source code yourself from our GitHub repository. The framework continues to provide reliable data synchronization across different platforms and databases.

Happy Delta Encoding!

The AnyCPU Illusion: Native Dependencies in .NET Applications

Introduction

In the .NET ecosystem, “AnyCPU” is often considered a silver bullet for cross-platform deployment. However, this assumption can lead to significant problems when your application depends on native assemblies. In this post, I want to share a personal story that highlights how I discovered these limitations and how native dependencies affect the true portability of AnyCPU applications, especially for database access through ADO.NET and popular ORMs.

My Journey to Understanding AnyCPU’s Limitations

Every year, around Thanksgiving or Christmas, I visit my friend, brother, and business partner Javier. Two years ago, during one of these visits, I made a decision that would lead me to a pivotal realization about AnyCPU architecture.

At the time, I was tired of traveling with my bulky MSI GE72 Apache Pro-24 gaming laptop. According to MSI’s official specifications, it weighed 5.95 pounds—but that number didn’t include the hefty charger, which brought the total to around 12 pounds. Later, I upgraded to an MSI GF63 Thin, which was lighter at 4.10 pounds—but with the charger, it was still around 7.5 pounds. Lugging these laptops through airports felt like a workout.

Determined to travel lighter, I purchased a MacBook Air with the M2 chip. At just 2.7 pounds, including the charger, the MacBook Air felt like a breath of fresh air. The Apple Silicon chip was incredibly fast, and I immediately fell in love with the machine.

Having used a MacBook Pro with Bootcamp and Windows 7 years ago, I thought I could recreate that experience by running a Windows virtual machine on my MacBook Air to check projects and do some light development while traveling.

The Virtualization Experiment

As someone who loves virtualization, I eagerly set up a Windows virtual machine on my MacBook Air. I grabbed my trusty Windows x64 ISO, set up the virtual machine, and attempted to boot it—but it failed. I quickly realized the issue was related to CPU architecture. My x64 ISO wasn’t compatible with the ARM-based M2 chip.

Undeterred, I downloaded a Windows 11 ISO for ARM architecture and created the VM. Success! Windows was up and running, and I installed Visual Studio along with my essential development tools, including DevExpress XPO (my favorite ORM).

The Demo Disaster

The real test came during a trip to Dubai, where I was scheduled to give a live demo showcasing how quickly you can develop Line-of-Business (LOB) apps with XAF. Everything started smoothly until I tried to connect my XAF app to the database. Despite my best efforts, the connection failed.

In the middle of the demo, I switched to an in-memory data provider to salvage the presentation. After the demo, I dug into the issue and realized the root cause was related to the CPU architecture. The native database drivers I was using weren’t compatible with the ARM architecture.

A Familiar Problem

This situation reminded me of the transition from x86 to x64 years ago. Back then, I encountered similar issues where native drivers wouldn’t load unless they matched the process architecture.

The Native Dependency Challenge

Platform-Specific Loading Requirements

Native DLLs must exactly match the CPU architecture of your application:

  • If your app runs as x86, it can only load x86 native DLLs.
  • If running as x64, it requires x64 native DLLs.
  • ARM requires ARM-specific binaries.
  • ARM64 requires ARM64-specific binaries.

There is no flexibility—attempting to load a DLL compiled for a different architecture results in an immediate failure.

How Native Libraries are Loaded

When your application loads a native DLL, the operating system follows a specific search pattern:

  1. The application’s directory
  2. System directories (System32/SysWOW64)
  3. Directories listed in the PATH environment variable

Crucially, these native libraries must match the exact architecture of the running process.

// This seemingly simple code
[DllImport("native.dll")]
static extern void NativeMethod();

// Actually requires:
// - native.dll compiled for x86 when running as 32-bit
// - native.dll compiled for x64 when running as 64-bit
// - native.dll compiled for ARM64 when running on ARM64

The SQL Server Example

Let’s look at SQL Server connectivity, a common scenario where the AnyCPU illusion breaks down:

// Traditional ADO.NET connection
using (var connection = new SqlConnection(connectionString))
{
    // This requires SQL Native Client
    // Which must match the process architecture
    await connection.OpenAsync();
}

Even though your application is compiled as AnyCPU, the SQL Native Client must match the process architecture. This becomes particularly problematic on newer architectures like ARM64, where native drivers may not be available.
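
A quick way to see which architecture you are actually running on (and therefore which native binaries will be required) is to check the runtime information at startup. A small diagnostic sketch:

using System;
using System.Runtime.InteropServices;

static class ArchitectureInfo
{
    public static void Print()
    {
        // The OS can be ARM64 while the process runs as x64 under emulation,
        // so both values matter when diagnosing native-DLL load failures.
        Console.WriteLine($"OS architecture:      {RuntimeInformation.OSArchitecture}");
        Console.WriteLine($"Process architecture: {RuntimeInformation.ProcessArchitecture}");
        Console.WriteLine($"Runtime identifier:   {RuntimeInformation.RuntimeIdentifier}");
    }
}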

Impact on ORMs

Entity Framework Core

Entity Framework Core, despite its modern design, still relies on database providers that may have native dependencies:

public class MyDbContext : DbContext
{
    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        // This configuration depends on:
        // 1. SQL Native Client
        // 2. Microsoft.Data.SqlClient native components
        optionsBuilder.UseSqlServer(connectionString);
    }
}

DevExpress XPO

DevExpress XPO faces similar challenges:

// XPO configuration
string connectionString = MSSqlConnectionProvider.GetConnectionString("server", "database");
XpoDefault.DataLayer = XpoDefault.GetDataLayer(connectionString, AutoCreateOption.DatabaseAndSchema);

// The MSSqlConnectionProvider relies on the same native SQL Server components

Solutions and Best Practices

1. Architecture-Specific Deployment

Instead of relying on AnyCPU, consider creating architecture-specific builds:

<PropertyGroup>
    <Platforms>x86;x64;arm64</Platforms>
    <RuntimeIdentifiers>win-x86;win-x64;win-arm64</RuntimeIdentifiers>
</PropertyGroup>

2. Runtime Provider Selection

Implement smart provider selection based on the current architecture:

public static class DatabaseProviderFactory
{
    public static IDbConnection GetProvider()
    {
        return RuntimeInformation.ProcessArchitecture switch
        {
            Architecture.X86 => new SqlConnection(), // x86 native provider
            Architecture.X64 => new SqlConnection(), // x64 native provider
            Architecture.Arm64 => new Microsoft.Data.SqlClient.SqlConnection(), // ARM64 support
            _ => throw new PlatformNotSupportedException()
        };
    }
}

3. Managed Fallbacks

Implement fallback strategies when native providers aren’t available:

public class DatabaseConnection
{
    private readonly string _connectionString;

    public DatabaseConnection(string connectionString) => _connectionString = connectionString;

    public async Task<IDbConnection> CreateConnectionAsync()
    {
        try
        {
            // First attempt: the provider that depends on native components
            var connection = new SqlConnection(_connectionString);
            await connection.OpenAsync();
            return connection;
        }
        catch (DllNotFoundException)
        {
            // Fall back to Microsoft.Data.SqlClient if the native provider's DLLs can't be loaded
            var managedConnection = new Microsoft.Data.SqlClient.SqlConnection(_connectionString);
            await managedConnection.OpenAsync();
            return managedConnection;
        }
    }
}

4. Deployment Considerations

  • Include all necessary native dependencies for each target architecture.
  • Use architecture-specific directories in your deployment.
  • Consider self-contained deployment to include the correct runtime.

Real-World Implications

This experience taught me that while AnyCPU provides excellent flexibility for managed code, it has limitations when dealing with native dependencies. These limitations become more apparent in scenarios like cloud deployments, ARM64 devices, and live demos.

Conclusion

The transition to ARM architecture is accelerating, and understanding the nuances of AnyCPU and native dependencies is more important than ever. By planning for architecture-specific deployments and implementing fallback strategies, you can build more resilient applications that can thrive in a multi-architecture world.

PostgreSQL Full-text search using “text search vectors”

Postgres Vector Search

Full-text search in PostgreSQL is implemented using a concept called “text search vectors” (or just “tsvector”). Let’s dive into how it works:

  1. Text Search Vectors (tsvector):
    • A tsvector is a sorted list of distinct lexemes, which are words that have been normalized to merge different forms of the same word (e.g., “run” and “running”).
    • PostgreSQL provides functions to convert plain text into tsvector format, which typically involves:
      • Parsing the text into tokens.
      • Converting tokens to lexemes.
      • Removing stop words (common words like “and” or “the” that are typically ignored in searches).
    • Example: The text “The quick brown fox” might be represented in tsvector as 'brown':3 'fox':4 'quick':2.
  2. Text Search Queries (tsquery):
    • A tsquery represents a text search query, which includes lexemes and optional operators.
    • Operators can be used to combine lexemes in different ways (e.g., AND, OR, NOT).
    • Example: The query “quick & fox” would match any tsvector containing both “quick” and “fox”.
  3. Searching:
    • PostgreSQL provides the @@ operator to search a tsvector column with a tsquery.
    • Example: WHERE column @@ to_tsquery('english', 'quick & fox').
  4. Ranking:
    • Once you’ve found matches using the @@ operator, you often want to rank them by relevance.
    • PostgreSQL provides the ts_rank function to rank results. It returns a number indicating how relevant a tsvector is to a tsquery.
    • The ranking is based on various factors, including the frequency of lexemes and their proximity to each other in the text.
  5. Indexes:
    • One of the significant advantages of tsvector is that you can create a GiST or GIN index on it.
    • These indexes significantly speed up full-text search queries.
    • GIN indexes, in particular, are optimized for tsvector and provide very fast lookups.
  6. Normalization and Configuration:
    • PostgreSQL supports multiple configurations (e.g., “english”, “french”) that determine how text is tokenized and which stop words are used.
    • This allows you to tailor your full-text search to specific languages or requirements.
  7. Highlighting and Snippets:
    • In addition to just searching, PostgreSQL provides functions like ts_headline to return snippets of the original text with search terms highlighted.

In summary, PostgreSQL’s full-text search works by converting regular text into a normalized format (tsvector) that is optimized for searching. This combined with powerful query capabilities (tsquery) and indexing options makes it a robust solution for many full-text search needs.

Implementing vector search using EF Core and PostgreSQL

Here are the steps to implement vector search in your .NET project:

Step 1: Add the required NuGet packages

<PackageReference Include="Npgsql.EntityFrameworkCore.PostgreSQL" Version="7.0.11" />
<PackageReference Include="Npgsql.EntityFrameworkCore.PostgreSQL.Design" Version="1.1.0" />
<PackageReference Include="Npgsql.EntityFrameworkCore.PostgreSQL.NetTopologySuite" Version="7.0.11" />

Step 2: Add a property of type NpgsqlTsVector to your entities, as shown below

public class Blog
{
    public int Id { get; set; }
    public string Title { get; set; }
    public NpgsqlTsVector SearchVector { get; set; }
}

Step 3: Add a computed column in your DbContext

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<Blog>()
        .Property(b => b.SearchVector)
        .HasComputedColumnSql("to_tsvector('english', \"Blogs\".\"Title\")", stored: true);
}

In this case, the vector is calculated from the value of the Title column of the Blogs table. You can calculate the vector from a single column or from a combination of columns.
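
If you later want to combine several columns, the same computed-column approach works with weights. A sketch assuming a hypothetical Content property on the Blog entity (setweight marks Title matches as more relevant than Content matches, and coalesce guards against NULLs):

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<Blog>()
        .Property(b => b.SearchVector)
        .HasComputedColumnSql(
            "setweight(to_tsvector('english', coalesce(\"Title\", '')), 'A') || " +
            "setweight(to_tsvector('english', coalesce(\"Content\", '')), 'B')",
            stored: true);
}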

Now you are ready to use vector search in your queries. Please check the example below:

var searchTerm = "Jungle"; // Example search term

var blogs = context.Blogs
    // Matches() translates to the @@ full-text search operator
    .Where(b => b.SearchVector.Matches(searchTerm))
    // Rank() translates to ts_rank, so results come back ordered by relevance
    .OrderByDescending(b => b.SearchVector.Rank(EF.Functions.ToTsQuery(searchTerm)))
    .ToList();

In real-world scenarios it’s better to build the vector from several columns and weight them according to their relevance for your business case. You can check the test project I have created here: https://github.com/egarim/PostgresVectorSearch

And that’s it for this post. Until next time, happy coding!


Alchemy Framework: 1 – Creating a framework for importing data

Last Friday, I received a message from a dear friend and colleague, Pedro Hernandez. He asked me if I had the latest compiled version of the XPO import framework we created in our office. As it turned out, I did not have it readily available and had to search extensively for it.

While conducting this search across my computers, repositories, and virtual machines, I was inspired to create another import framework — yes, another piece of code to maintain.

Most of the time, my research projects begin in the same manner, typically after a conversation with my close friend Jose Javier Columbie. As always, he would say something like this: “Jose, do you think this can be possible? Why don’t you give it a try? If we succeed, that will be el batazo (like hitting a home run).”

In this specific case, that conversation didn’t happen. However, I could hear his words echoing in my mind.

Furthermore, I want this project to be community-driven, not just a technical experiment I have to maintain alone. I truly believe in the power of a community.

First Step

So, let’s begin. I’ll start with my favorite part: naming the project. I spent all weekend pondering this, attempting to condense the concept and associate it with a literary term or a Latin word (these are my preferred methods for naming a project).

Let’s define what the ultimate goal is. In an import process, the aim is to take information from a source ‘A’, translate or transform the information, and then store it in a target ‘B’.

Growing up, I was an avid reader – and I mean, I read a lot. (Now, I’ve switched to audiobooks.) Therefore, after defining what this project is about, naming it became incredibly easy.

noun: alchemy

  1. the medieval forerunner of chemistry, concerned with the transmutation of matter, in particular with attempts to convert base metals into gold or find a universal elixir.

“occult sciences, such as alchemy and astrology”

Origin

late Middle English: via Old French and medieval Latin from Arabic al-kīmiyā’, from al ‘the’ + kīmiyā’ (from Greek khēmia, khēmeia ‘art of transmuting metals’).

The Alchemy Framework

Alchemy is a framework created for .NET, designed to import data from a data source to a data target. These sources and targets can be anything from a text file, CSV file, a database, ORMs, and so on.

The framework consists of a set of contracts, interfaces, and base classes. When implemented, these allow you to import and transform data between various sources and targets.

The requirement in a few words

As stated in the framework’s description, the requirement is only to define the contracts that represent the sources and targets, as well as a job configuration that describes how the information flows from one source to a target. The concrete implementations are not important at this point and should be discussed individually for each case.
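
Nothing is set in stone yet, but to make the idea concrete, the contracts could look roughly like this (purely illustrative names and shapes; the real interfaces should come out of the community discussion):

// Illustrative only – names and signatures are not final
public interface IDataSource<TRecord>
{
    IAsyncEnumerable<TRecord> ReadAsync(CancellationToken cancellationToken = default);
}

public interface IDataTarget<TRecord>
{
    Task WriteAsync(IAsyncEnumerable<TRecord> records, CancellationToken cancellationToken = default);
}

public interface IDataTransformation<TSource, TTarget>
{
    TTarget Transform(TSource record);
}

// A job configuration wires a source to a target through a transformation
public class ImportJobDefinition<TSource, TTarget>
{
    public IDataSource<TSource> Source { get; set; }
    public IDataTransformation<TSource, TTarget> Transformation { get; set; }
    public IDataTarget<TTarget> Target { get; set; }
}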

The design patterns

For this project, we will use the SOLID design principles and dependency injection. This will enable us to easily replace small functionalities, allowing us to mix and match different implementations depending on the data source and data target.