Home » Implementing Full Text Indexing with PostgreSQL

Implementing Full Text Indexing with PostgreSQL

ike many companies, over the years my company developed a home-grown billing/project system that allows our programmers to bill to a Web application. This system is quite capable; however one of the drawbacks to the system is that there’s no way to search the billable entries or customer notes.

Of course, we could use simple LIKE or regex queries via SQL but that would be resource-intensive and would not provide the detailed results we would like. For example, to search for all billing entries that had a description with the words ‘apache’ (to protect the customer name) and ‘php’ we could write a query like this:

   SELECT description    FROM cmd_contact_task_entries    WHERE description ~ 'php'    AND description ~ 'apache';

Author’s Note: The tilde (~) in the query represents the regular expression operator in PostgreSQL.

In our system the preceding query would return seven rows, but it would not rank the matching documents. Moreover, if we were looking for more detailed information we would have to create some fairly onerous queries. What this system really needs is full text indexing.

Although PostgreSQL is, in my opinion, the most capable open source database available, it does have its drawbacks. One such drawback is that PostgreSQL doesn’t include full text indexing as part of the core distribution. However, as with any good open source project, the PostgreSQL community has contributed many useful modules?and one such module is TSearch2.

TSearch2 is a maintained contrib module for both PostgreSQL 7.3 and 7.4. In short, TSearch2 is a new indexable data type that stores the position of words within a particular document?thus making the retrieval of those documents based on specified parameters both efficient and reliable.

Installing TSearch2
There are three ways to get TSearch2 implemented in your PostgreSQL installation.

Download and compile the source
Download and install the contrib rpm (if you are running Linux)
Purchase a binary distribution

Compiling PostgreSQL is outside the scope of this article. However, I will cover compiling and installing the TSearch2 module itself.

The TSearch2 module is included with the PostgreSQL source and can be found within the root of the tree at contrib/tsearch2.

After you have configured and installed PostgreSQL itself, you need to perform two more simple steps to get Tsearch2 installed. From the root of the PostgreSQL source tree, enter the following two commands:

   cd contrib/tsearch2   make; make install

That’s it. The module is now installed. Of course you still have to configure PostgreSQL for its use. For the examples in this article, you’ll use a development database called Fingerless (long story) to test TSearch2.

To configure the database to use TSearch2 you need to execute the tsearch2.sql file against the fingerless database. To do that, enter the following psql command:

   psql fingerless < tsearch2.sql

Executing the script produces some noisy output as it creates sample tables and indexes for testing Tsearch2. Don't worry about all this output. When the script completes it will have added the following tables to your database (see Table 1):

Table 1. A list of tables added by executing the tsearch2.sql file.

Schema	Name	Type	Owner
public	pg_ts_cfg	table	postgres
public	pg_ts_cfgmap	table	postgres
public	pg_ts_dict	table	postgres
public	pg_ts_parser	table	postgres

You will also have a detailed list of new functions and data types. For more information on what the script actually adds to your database please see the tsearch2.sql file or the Web site documentation.

The billing application keeps the main billing details in the table named public.cmd_contact_task_entries.

The table has the structure shown in Table 2:

Table 2: Structure of the public.cmd_contact_task_entries table.

Column	Type	Modifiers
entry_id	integer	not null default nextval('"cmd_contact_task_e_entry_id_seq"'::text)
task_id	integer
creator_employee_id	integer
description	text
time_spent	interval
created	timestamp with time zone
billed	timestamp with time zone
hourly_rate	numeric(6,2)
internal_notes	text	default ''::text
contact_id	integer
note_id	integer	default 0

The table also has this index:

   "cmd_contact_task_entries_pkey"    primary key, btree (entry_id)

Using TSearch2
You need to make several modifications to the table to use Tsearch2. First, add a vectors column. The vectors column carries the vector value of the column data you are going to be searching. In this case you want to calculate the vectors for the description column. To add the column to the table execute the following command.

   ALTER TABLE public.cmd_contact_task_entries   ADD COLUMN vectors tsvector;

The command adds a new type?called tsvector?to PostgreSQL via Tsearch. After adding the column you also need to add a new index. The index to be created is a gist index. Gist is an acronym for Generalized Search Tree. You can find more information on the specifics of this type of index here. Tsearch uses the gist index to store the words of the documents by lexeme?a fundamental unit of language. A lexeme of a word is based on the lexicon (the collection of words) of a particular language such as English. For example, a lexeme of the word find is found, or finding.

   CREATE INDEX description_index    ON cmd_contact_task_entries    USING gist(vectors);

You have now created the basic structure for using full text indexing with the cmd_contact_task_entries table but you need to update the vectors column to have the correct information. You can do this with the following update statement:

   UPDATE cmd_contact_task_entries    SET vectors = to_tsvector(description);

After completing these steps, you can start processing queries. Here's an example query that looks for the same terms as the regex example at the beginning of this article. The query displays matched items in the TSearch headline format:

   SELECT entry_id,headline(description,q),    rank(vectors,q)    FROM cmd_contact_task_entries,    to_tsquery('apache|php') AS q    WHERE vectors @@ q ORDER BY rank(vectors,q) DESC;

On our work system, the query returns 216 rows (versus only 7 using the regex query). The first five are displayed below:

   -[ RECORD 1]------------------------------------------    entry_id | 2435   headline | php php-devel php   -imap php-ldap pine psmisc python    python-devel   rank | 0.468559      -[ RECORD 2 ]-----------------------------------------    entry_id | 4167   headline | php for apache and we    attempted to fix the dependencies several times and    ended up   rank | 0.468559      -[ RECORD 3 ]-----------------------------------------    entry_id | 2786   headline | php 4.3.2 and compiled    (but apache wasn't starting back up    properly....redoing it with   rank | 0.40951      -[ RECORD 4 ]-----------------------------------------    entry_id | 2203   headline | apache 1.3. Installed LXP 0.9.   Apache is now running with LXP but did not    get a chance   rank | 0.3439      -[ RECORD 5 ]-----------------------------------------    entry_id | 1504   headline | Apache 1.3.20, PHP 4.0.5,    lxp 0.7.9, OpenSSL and   rank | 0.3439

Notice that using the headline function in the search query causes the results to have the search terms emphasized with the (bold) html tag. Also, the results are ranked to provide a weighted result.

You can make this more manageable by rewriting the query as a function. Because you'll need to return multiple rows, the function needs to be a SetOF function. When creating a SetOf function you need to create a custom type:

   CREATE TYPE desc_results    AS (entry_id INTEGER, headline TEXT, rank REAL);

After creating the custom type you create a function that calls that type:

   CREATE OR REPLACE FUNCTION dresults(text)    RETURNS SETOF desc_results AS '   SELECT entry_id,headline(description,q),   rank(vectors,q)    FROM cmd_contact_task_entries,    to_tsquery($1) AS q    WHERE vectors @@ q ORDER BY rank(vectors,q)    DESC;' LANGUAGE 'SQL';

Now, rather than writing the full search query, you can use a much simpler query such as:

   SELECT * FROM dresults('apache|php');

TSearch2 has many capabilities that were not discussed in this article. If after reading this, you're still thirsting for more information I suggest you review the excellent project documentation listed in the resources column.

About Our Editorial Process

At DevX, we’re dedicated to tech entrepreneurship. Our team closely follows industry shifts, new products, AI breakthroughs, technology trends, and funding announcements. Articles undergo thorough editing to ensure accuracy and clarity, reflecting DevX’s style and supporting entrepreneurs in the tech sphere.

See our full editorial policy.

About Our Journalist

Charlie Frank

Charlie has over a decade of experience in website administration and technology management. As the site admin, he oversees all technical aspects of running a high-traffic online platform, ensuring optimal performance, security, and user experience.
View Author

Netherlands thriving in European venture capital growth
Noah Nguyen April 23, 2024 5:42 PM

Fintech startup Ramp sees formidable growth
Cameron Wiggins April 23, 2024 5:14 PM

Is Bitcoin the New Gold? Understanding Cryptocurrency as a Hedge Against Inflation
Rashan Dixon April 23, 2024 4:05 PM

Zipline delivers millionth order, aims higher
Rashan Dixon April 23, 2024 3:49 PM

Embracer Group splits into three, boosts efficacy
Johannah Lopez April 23, 2024 1:58 PM

Predicted 2025 COLA sees modest growth
Noah Nguyen April 23, 2024 1:50 PM

Unexpected eviction unsettles Indianapolis businesses
Cameron Wiggins April 23, 2024 11:46 AM

Final April disability payment scheduled for release
Johannah Lopez April 23, 2024 8:08 AM

Social Security amends SSI eligibility for 2024
Johannah Lopez April 23, 2024 8:01 AM

Silver tsunami presents opportunity for young entrepreneurs
Johannah Lopez April 23, 2024 8:00 AM

Unexpected eviction disrupts Indianapolis businesses
Cameron Wiggins April 23, 2024 7:51 AM

Embracer Group restructures into three entities
April Isaacs April 23, 2024 7:07 AM

Mercedes-AMG unveils powerful, innovative 2025 GT 63 S E Performance
Johannah Lopez April 22, 2024 5:45 PM

Google poised for notable organizational changes
Johannah Lopez April 22, 2024 5:29 PM

Merging Catholic faith with entrepreneurial education
April Isaacs April 22, 2024 3:59 PM

AI entrepreneur raises $13.8 million via LinkedIn
April Isaacs April 22, 2024 3:35 PM

Why ChatGPT Is So Important Today
Grace Phillips April 22, 2024 3:07 PM

How Tech Advancements and Tools Have Enhanced Business
Kyle Lewis April 22, 2024 2:22 PM

Mountain biking’s technological advances highlighted in 2024 event
Johannah Lopez April 22, 2024 1:22 PM

St. Louis mourns closure of cherished local establishments
Noah Nguyen April 22, 2024 1:05 PM

Anker’s PowerHouse 90: high-capacity, versatile charger now discounted
Noah Nguyen April 22, 2024 8:06 AM

Marketing executives embracing entrepreneurial approach for growth
April Isaacs April 22, 2024 7:42 AM

Arcana: Overcoming challenges to thrive in San Francisco
Cameron Wiggins April 22, 2024 7:03 AM

Anticipating Bitcoin’s halving: impacts and strategies
Rashan Dixon April 19, 2024 5:39 PM

Nothing startup integrates ChatGPT into devices
Rashan Dixon April 19, 2024 5:06 PM

Show More