Full-Text Search on File Metadata with PostgreSQL

To make uploaded files searchable by name, caption, and tags, store a tsvector column derived from your text fields, index it with GIN, and query it with tsquery ranked by ts_rank — all without leaving PostgreSQL. Combine it with JSONB filters and you get fast, relevance-ordered search over your media catalog.

This guide is part of Metadata Indexing & Search within Backend Validation & Cloud Storage Architecture. It builds on how to index file metadata in PostgreSQL.

When to use this approach

  • Users need to search files by free text (filename, title, tags) not just exact match.
  • You already store metadata in Postgres and want to avoid a separate search engine.
  • You need relevance ranking and the ability to combine text search with structured filters.

Prerequisites

  1. PostgreSQL 12+ (generated columns require 12; earlier needs a trigger).
  2. A table holding one row per uploaded file with text fields and a JSONB metadata column.
  3. Permission to create columns and indexes on that table.

Implementation

Store a generated tsvector that concatenates your searchable fields with weights, index it, then query with websearch_to_tsquery for a friendly search syntax.

-- 1. Base table for uploaded file records.
CREATE TABLE files (
  id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  filename     text NOT NULL,
  title        text,
  tags         text[] DEFAULT '{}',
  metadata     jsonb NOT NULL DEFAULT '{}',
  created_at   timestamptz NOT NULL DEFAULT now()
);

-- 2. A generated tsvector combining fields with relevance weights.
ALTER TABLE files ADD COLUMN search_vec tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(filename, '')), 'B') ||
    setweight(to_tsvector('english', array_to_string(tags, ' ')), 'C')
  ) STORED;

-- 3. A GIN index makes tsquery lookups fast.
CREATE INDEX files_search_idx ON files USING GIN (search_vec);

-- 4. A GIN index on JSONB enables containment filters alongside text.
CREATE INDEX files_metadata_idx ON files USING GIN (metadata jsonb_path_ops);

Line-by-line on the critical pieces

  • to_tsvector('english', ...) normalizes text into lexemes — lowercasing, stemming (“uploads” matches “upload”), and dropping stop words. The first argument is the text-search configuration; pick the language your content is in.
  • setweight(..., 'A') tags lexemes with a weight class A–D. A title match should rank above a filename match, so title gets A, filename B, tags C. ts_rank later turns these into a score.
  • coalesce(..., '') guards against NULL fields; concatenating a NULL tsvector would null the whole expression.
  • STORED generated column keeps search_vec automatically in sync on every insert and update, so you never maintain it by hand.
  • USING GIN is the index type built for set-membership lookups like “does this vector contain this lexeme”. A btree index cannot serve @@ queries.
  • jsonb_path_ops is a compact GIN operator class optimized for the @> containment operator you use to filter structured metadata.

Querying with ranking

websearch_to_tsquery accepts user-friendly input (quoted phrases, or, - to exclude) and the @@ operator matches it against the vector. ts_rank scores each row so you can order by relevance.

SELECT id, filename, title,
       ts_rank(search_vec, query) AS rank
FROM files,
     websearch_to_tsquery('english', $1) AS query
WHERE search_vec @@ query
  AND metadata @> $2::jsonb       -- structured filter, e.g. {"type":"video"}
ORDER BY rank DESC, created_at DESC
LIMIT 20;

Calling it from TypeScript with the pg driver:

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function searchFiles(term: string, filter: Record<string, unknown>) {
  const { rows } = await pool.query(
    `SELECT id, filename, title, ts_rank(search_vec, query) AS rank
       FROM files, websearch_to_tsquery('english', $1) AS query
      WHERE search_vec @@ query
        AND metadata @> $2::jsonb
      ORDER BY rank DESC, created_at DESC
      LIMIT 20`,
    [term, JSON.stringify(filter)],
  );
  return rows;
}

// searchFiles("annual report -draft", { type: "document" })
// => rows ordered by relevance, documents only, excluding "draft".

How the pieces fit together

Full-text search pipeline in PostgreSQL Text fields become a weighted tsvector stored and GIN-indexed; a user query becomes a tsquery matched with the match operator and ranked. title, filename, tags to_tsvector tsvector weighted A/B/C GIN index fast @@ lookups user query websearch_to_tsquery @@ match + ts_rank order ranked rows
Text fields become a weighted tsvector, the GIN index serves the match, and ts_rank orders results by relevance.

Combining with JSONB metadata

Upload metadata is often semi-structured — dimensions, codec, EXIF — and fits naturally in JSONB. The @> containment operator filters on it in the same query as the text search, and because both columns are GIN-indexed, Postgres can combine them efficiently. To search inside JSONB text values too, fold them into the tsvector:

-- Include a JSONB text field (e.g. metadata->>'description') in the vector.
ALTER TABLE files DROP COLUMN search_vec;
ALTER TABLE files ADD COLUMN search_vec tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(metadata->>'description', '')), 'B')
  ) STORED;

Configuration gotchas

Search returns nothing for partial words

Full-text search matches whole lexemes, not substrings, so “rep” will not match “report”. For prefix matching use to_tsquery('english', 'report:*'), or add a pg_trgm trigram index if you need true substring search.

Wrong language config silently hurts recall

Indexing English content with the simple configuration skips stemming, so “uploads” will not match “upload”. Use the configuration that matches your content’s language consistently in both the column and the query.

GIN index not used on small tables

On a tiny table the planner may sequential-scan instead of using the index; that is correct behavior. Verify with EXPLAIN on a realistically sized table, not a handful of test rows.

Verification

Confirm the index is used and matches behave as expected:

EXPLAIN ANALYZE
SELECT id FROM files, websearch_to_tsquery('english', 'annual report') AS query
WHERE search_vec @@ query;
-- Expected: a "Bitmap Index Scan on files_search_idx" node in the plan.

FAQ

Should I use a trigger or a generated column for the tsvector?

On PostgreSQL 12+, prefer a STORED generated column — it is declarative and cannot drift. Use a trigger only on older versions that lack generated columns.

When should I reach for a dedicated search engine instead?

Postgres full-text search comfortably handles millions of rows with ranking and filters. Move to a dedicated engine when you need fuzzy typo-tolerance, faceted aggregation at scale, or sub-millisecond latency across very large corpora.

Can I rank phrase proximity, not just term presence?

Yes. Use phraseto_tsquery or the <-> distance operator in a tsquery to require terms to appear adjacently, then rank with ts_rank_cd, which accounts for term proximity.