Text Search on PostgreSQL
PostgreSQL has out of box support for text search.
Assume we have a table of documents:
CREATE TABLE documents
(
id SERIAL NOT NULL,
doc TEXT
)
INSERT INTO documents
(doc)
VALUES ("lorem ipsum .....");
INSERT INTO documents
(doc)
VALUES ("quick brown fox .....");
Output:
id | doc -----------------
0 | "Lorem ipsum ....."
1 | "Quick brown fox ..."
A simple text search is a basic requirement in any system. This can be done using tsvector and tsquery types in PostgreSQL.
tsvector gives us the list of lexemes for any given text. tsquery helps facilitate the search by creating lexemes for search terms, combine search terms / lexemes and compare with tsvector for result.
The to_tsvector method processes text by removing stop words, stem and normalize words so that they can be used with different variants of the word. For example, precision would become precis and running would become run
On every insert of a document, we need to get the normalized text of the document and add it to the normalized text column. For this we need to create a new column of type tsvector.
ALTER TABLE documents ADD COLUMN tsv TYPE tsvector;
Next, we need to create a trigger function that will update the tsv column on every insert
CREATE TRIGGER tsvupdate before
INSERT
OR
UPDATE
ON documents FOR each row
EXECUTE PROCEDURE
tsvector_update_trigger(tsv, 'pg.catalog.english', doc);
tsvector_update_trigger() is built-in method which takes arguments -
- column to store the normalized text
- language of the text (because removing stop words and stemming is specific to a language)
- column_to_read_text_from
- column_to_read_text_from (takes multiple columns as input)
With data populated inside the documents table, we can perform a simple text search using the query:
WITH q AS
(
SELECT To_tsquery('brown:*') AS query)
SELECT id,
doc,
tsv
FROM documents,
q
WHERE q.query @@ documents.tsv;
The to_tsquery function will convert the input text to tsquery type which can be used to do logical operations — & (AND), | (OR), ! (NOT), with lexemes and perform prefix matching using “:*”
The @@ operator checks if the tsvector matches tsquery
So the above query would return us documents which contain “brown”
Limitation:
tsvector and tsquery will only help us find words from a given text but not substring matching. For substring matching we will have to use the pg_trgm extension (Trigram based text search). The pg_trgm extension can perform LIKE operation on text fields.