devxlogo

Implementing Full Text Indexing with PostgreSQL

Implementing Full Text Indexing with PostgreSQL

ike many companies, over the years my company developed a home-grown billing/project system that allows our programmers to bill to a Web application. This system is quite capable; however one of the drawbacks to the system is that there’s no way to search the billable entries or customer notes.

Of course, we could use simple LIKE or regex queries via SQL but that would be resource-intensive and would not provide the detailed results we would like. For example, to search for all billing entries that had a description with the words ‘apache’ (to protect the customer name) and ‘php’ we could write a query like this:

   SELECT description    FROM cmd_contact_task_entries    WHERE description ~ 'php'    AND description ~ 'apache';

Author’s Note: The tilde (~) in the query represents the regular expression operator in PostgreSQL.

In our system the preceding query would return seven rows, but it would not rank the matching documents. Moreover, if we were looking for more detailed information we would have to create some fairly onerous queries. What this system really needs is full text indexing.

Although PostgreSQL is, in my opinion, the most capable open source database available, it does have its drawbacks. One such drawback is that PostgreSQL doesn’t include full text indexing as part of the core distribution. However, as with any good open source project, the PostgreSQL community has contributed many useful modules?and one such module is TSearch2.

TSearch2 is a maintained contrib module for both PostgreSQL 7.3 and 7.4. In short, TSearch2 is a new indexable data type that stores the position of words within a particular document?thus making the retrieval of those documents based on specified parameters both efficient and reliable.

Installing TSearch2
There are three ways to get TSearch2 implemented in your PostgreSQL installation.

  1. Download and compile the source
  2. Download and install the contrib rpm (if you are running Linux)
  3. Purchase a binary distribution

Compiling PostgreSQL is outside the scope of this article. However, I will cover compiling and installing the TSearch2 module itself.

The TSearch2 module is included with the PostgreSQL source and can be found within the root of the tree at contrib/tsearch2.

After you have configured and installed PostgreSQL itself, you need to perform two more simple steps to get Tsearch2 installed. From the root of the PostgreSQL source tree, enter the following two commands:

   cd contrib/tsearch2   make; make install

That’s it. The module is now installed. Of course you still have to configure PostgreSQL for its use. For the examples in this article, you’ll use a development database called Fingerless (long story) to test TSearch2.

To configure the database to use TSearch2 you need to execute the tsearch2.sql file against the fingerless database. To do that, enter the following psql command:

   psql fingerless < tsearch2.sql

Executing the script produces some noisy output as it creates sample tables and indexes for testing Tsearch2. Don't worry about all this output. When the script completes it will have added the following tables to your database (see Table 1):

Table 1. A list of tables added by executing the tsearch2.sql file.

SchemaNameTypeOwner
publicpg_ts_cfgtablepostgres
publicpg_ts_cfgmaptablepostgres
publicpg_ts_dicttablepostgres
publicpg_ts_parsertablepostgres

You will also have a detailed list of new functions and data types. For more information on what the script actually adds to your database please see the tsearch2.sql file or the Web site documentation.

The billing application keeps the main billing details in the table named public.cmd_contact_task_entries.

The table has the structure shown in Table 2:

Table 2: Structure of the public.cmd_contact_task_entries table.

ColumnTypeModifiers
entry_idintegernot null default nextval('"cmd_contact_task_e_entry_id_seq"'::text)
task_idinteger 
creator_employee_idinteger 
descriptiontext 
time_spentinterval 
created timestamp with time zone 
billedtimestamp with time zone 
hourly_rate numeric(6,2) 
internal_notes textdefault ''::text
contact_idinteger 
note_idintegerdefault 0

The table also has this index:

   "cmd_contact_task_entries_pkey"    primary key, btree (entry_id)

Using TSearch2
You need to make several modifications to the table to use Tsearch2. First, add a vectors column. The vectors column carries the vector value of the column data you are going to be searching. In this case you want to calculate the vectors for the description column. To add the column to the table execute the following command.

   ALTER TABLE public.cmd_contact_task_entries   ADD COLUMN vectors tsvector;

The command adds a new type?called tsvector?to PostgreSQL via Tsearch. After adding the column you also need to add a new index. The index to be created is a gist index. Gist is an acronym for Generalized Search Tree. You can find more information on the specifics of this type of index here. Tsearch uses the gist index to store the words of the documents by lexeme?a fundamental unit of language. A lexeme of a word is based on the lexicon (the collection of words) of a particular language such as English. For example, a lexeme of the word find is found, or finding.

   CREATE INDEX description_index    ON cmd_contact_task_entries    USING gist(vectors);

You have now created the basic structure for using full text indexing with the cmd_contact_task_entries table but you need to update the vectors column to have the correct information. You can do this with the following update statement:

   UPDATE cmd_contact_task_entries    SET vectors = to_tsvector(description);

After completing these steps, you can start processing queries. Here's an example query that looks for the same terms as the regex example at the beginning of this article. The query displays matched items in the TSearch headline format:

   SELECT entry_id,headline(description,q),    rank(vectors,q)    FROM cmd_contact_task_entries,    to_tsquery('apache|php') AS q    WHERE vectors @@ q ORDER BY rank(vectors,q) DESC;

On our work system, the query returns 216 rows (versus only 7 using the regex query). The first five are displayed below:

   -[ RECORD 1]------------------------------------------    entry_id | 2435   headline | php php-devel php   -imap php-ldap pine psmisc python    python-devel   rank | 0.468559      -[ RECORD 2 ]-----------------------------------------    entry_id | 4167   headline | php for apache and we    attempted to fix the dependencies several times and    ended up   rank | 0.468559      -[ RECORD 3 ]-----------------------------------------    entry_id | 2786   headline | php 4.3.2 and compiled    (but apache wasn't starting back up    properly....redoing it with   rank | 0.40951      -[ RECORD 4 ]-----------------------------------------    entry_id | 2203   headline | apache 1.3. Installed LXP 0.9.   Apache is now running with LXP but did not    get a chance   rank | 0.3439      -[ RECORD 5 ]-----------------------------------------    entry_id | 1504   headline | Apache 1.3.20, PHP 4.0.5,    lxp 0.7.9, OpenSSL and   rank | 0.3439

Notice that using the headline function in the search query causes the results to have the search terms emphasized with the (bold) html tag. Also, the results are ranked to provide a weighted result.

You can make this more manageable by rewriting the query as a function. Because you'll need to return multiple rows, the function needs to be a SetOF function. When creating a SetOf function you need to create a custom type:

   CREATE TYPE desc_results    AS (entry_id INTEGER, headline TEXT, rank REAL);

After creating the custom type you create a function that calls that type:

   CREATE OR REPLACE FUNCTION dresults(text)    RETURNS SETOF desc_results AS '   SELECT entry_id,headline(description,q),   rank(vectors,q)    FROM cmd_contact_task_entries,    to_tsquery($1) AS q    WHERE vectors @@ q ORDER BY rank(vectors,q)    DESC;' LANGUAGE 'SQL';

Now, rather than writing the full search query, you can use a much simpler query such as:

   SELECT * FROM dresults('apache|php');

TSearch2 has many capabilities that were not discussed in this article. If after reading this, you're still thirsting for more information I suggest you review the excellent project documentation listed in the resources column.

devxblackblue

About Our Editorial Process

At DevX, we’re dedicated to tech entrepreneurship. Our team closely follows industry shifts, new products, AI breakthroughs, technology trends, and funding announcements. Articles undergo thorough editing to ensure accuracy and clarity, reflecting DevX’s style and supporting entrepreneurs in the tech sphere.

See our full editorial policy.

About Our Journalist