PostgreSQL Full-text search using “text search vectors”

PostgreSQL Full-text search using “text search vectors”

Postgres Vector Search

Full-text search in PostgreSQL is implemented using a concept called “text search vectors” (or just “tsvector”). Let’s dive into how it works:

  1. Text Search Vectors (tsvector):
    • A tsvector is a sorted list of distinct lexemes, which are words that have been normalized to merge different forms of the same word (e.g., “run” and “running”).
    • PostgreSQL provides functions to convert plain text into tsvector format, which typically involves:
      • Parsing the text into tokens.
      • Converting tokens to lexemes.
      • Removing stop words (common words like “and” or “the” that are typically ignored in searches).
    • Example: The text “The quick brown fox” might be represented in tsvector as ‘brown’:3 ‘fox’:4 ‘quick’:2.
  2. Text Search Queries (tsquery):
    • A tsquery represents a text search query, which includes lexemes and optional operators.
    • Operators can be used to combine lexemes in different ways (e.g., AND, OR, NOT).
    • Example: The query “quick & fox” would match any tsvector containing both “quick” and “fox”.
  3. Searching:
    • PostgreSQL provides the @@ operator to search a tsvector column with a tsquery.
    • Example: WHERE column @@ to_tsquery(‘english’, ‘quick & fox’).
  4. Ranking:
    • Once you’ve found matches using the @@ operator, you often want to rank them by relevance.
    • PostgreSQL provides the ts_rank function to rank results. It returns a number indicating how relevant a tsvector is to a tsquery.
    • The ranking is based on various factors, including the frequency of lexemes and their proximity to each other in the text.
  5. Indexes:
    • One of the significant advantages of tsvector is that you can create a GiST or GIN index on it.
    • These indexes significantly speed up full-text search queries.
    • GIN indexes, in particular, are optimized for tsvector and provide very fast lookups.
  6. Normalization and Configuration:
    • PostgreSQL supports multiple configurations (e.g., “english”, “french”) that determine how text is tokenized and which stop words are used.
    • This allows you to tailor your full-text search to specific languages or requirements.
  7. Highlighting and Snippets:
    • In addition to just searching, PostgreSQL provides functions like ts_headline to return snippets of the original text with search terms highlighted.

In summary, PostgreSQL’s full-text search works by converting regular text into a normalized format (tsvector) that is optimized for searching. This combined with powerful query capabilities (tsquery) and indexing options makes it a robust solution for many full-text search needs.

Implementing vector search using E.F Core and Postgres SQL

here are the steps to implement vector search in your dot net project:

Step 1: Add the required nuget packages

   
<PackageReference Include="Npgsql.EntityFrameworkCore.PostgreSQL" Version="7.0.11" />
<PackageReference Include="Npgsql.EntityFrameworkCore.PostgreSQL.Design" Version="1.1.0" />
<PackageReference Include="Npgsql.EntityFrameworkCore.PostgreSQL.NetTopologySuite" Version="7.0.11" />

Step 2: Implement a vector in your entities by implementing properties of type NpgsqlTsVector as shown below

public class Blog
{
    public int Id { get; set; }
    public string Title { get; set; }
    public NpgsqlTsVector SearchVector { get; set; }
}

Step 3: add a computed column in your DbContext

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<Blog>()
        .Property(b => b.SearchVector)
      .HasComputedColumnSql("to_tsvector('english', \"Blogs\".\"Title\")", stored: true);
}

in this case you are calculating the vector using the value of the title column of the blogs table, you can calculate the vector using a single column or a combination of columns

Now you are ready to use vector search in your queries, please check the example below

var searchTerm = "Jungle"; // Example search term
var searchVector = NpgsqlTsVector.Parse(searchTerm);

var blogs = context.Blogs
    .Where(p => p.SearchVector.Matches(searchTerm))
    .OrderByDescending(td => td.SearchVector.Rank(EF.Functions.ToTsQuery(searchTerm))).ToList();

In real world scenarios its better to create a vector by joining the values of several columns and weight them according to the relevance for your business case, you can check the test project I have created here : https://github.com/egarim/PostgresVectorSearch

and that’s it for this post, until next time, happy coding ))