Tuesday, January 05, 2010

"My database is slow" - part 2: indexes.

Note: the description below uses PostgreSQL 8.3.7.

This article is a continuation of the previous one about database performance.
Suppose we have a database that is normalized to our needs, the server memory has been allocated quite reasonably, we are using a connection pool and we did not change the default transaction isolation level - but the database still is not a "speed demon". What can you do in this situation? Did we miss something?

You can look inside the database and take a closer look at the specific operations performed on the data, to determine which operations cause performance degradation and possibly "help" them run faster. As the sample database I will use PostgreSQL - I will examine its specific mechanisms for checking and improving performance. This will be only an outline of what can be done (based on a crude example), because the topic is so large that it could easily fill a separate, dedicated blog.

Requirements:
  • installed PostgreSQL (description here)
  • a test database with sample data - just to show how PostgreSQL performs very simple SQL operations
Let's consider a very simple database for an online store with two tables: a customer table (called customers) and a table with the customers' orders (called orders). Just open pgAdmin and execute this SQL command:


CREATE DATABASE store ENCODING = 'UTF8';


Then connect to the database "store" and run these SQL commands:


CREATE TABLE customers
(
    customer_id INTEGER,
    fname VARCHAR(32),
    lname VARCHAR(32) NOT NULL,
    CONSTRAINT customer_pk PRIMARY KEY(customer_id)
);

CREATE TABLE orders
(
    order_id INTEGER,
    customer_id INTEGER NOT NULL CONSTRAINT customer_order_fk REFERENCES customers(customer_id),
    description TEXT,
    price NUMERIC(7,2),
    CONSTRAINT order_pk PRIMARY KEY(order_id)
);

For simplicity I do not use a SERIAL column to hold the primary keys here - just an ordinary integer with a constraint. Now we should fill the database with some sample data: let's put about 10 000 records into the customers table and 30 000 records into the orders table. That is a fairly large amount of data to type by hand. How can we do it?
The best way is to write a Java program that generates a file with proper SQL INSERT statements for each table - this is homework for you ;-)
I wrote a simple program which takes popular first names for males and for females, takes popular surnames from here, and combines them to produce 10 000 "people". Having those records and knowing their customer_id values (that's why I used an ordinary integer column - just to make writing the program easier), my program generated 30 000 random orders for those customers.
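If you do not feel like writing a generator program, you can also create throw-away test data directly in SQL. The sketch below is only an illustration of that alternative - it produces synthetic names ('fname_1', 'lname_42', ...) instead of realistic ones, using PostgreSQL's generate_series() and random() functions:

-- 10 000 synthetic customers; repeating the surname part (i % 1000)
-- guarantees that each surname occurs for several customers
INSERT INTO customers (customer_id, fname, lname)
SELECT i, 'fname_' || i, 'lname_' || (i % 1000)
FROM generate_series(1, 10000) AS i;

-- 30 000 orders assigned to random customers
INSERT INTO orders (order_id, customer_id, description, price)
SELECT i,
       (floor(random() * 10000) + 1)::integer,
       'test order ' || i,
       round((random() * 500)::numeric, 2)
FROM generate_series(1, 30000) AS i;

With such synthetic data you would of course search for a surname like 'lname_42' instead of 'Smith' in the queries below.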

And now suppose we want to retrieve all customers named Smith, and also to see all orders placed by each of those customers. We need two queries:


select * from customers where lname = 'Smith';
select * from orders join customers on orders.customer_id=customers.customer_id where customers.lname = 'Smith';

In my case the first query returns just 9 customers named Smith (out of 10 000 in total), and the second returns 30 orders (out of 30 000 in total) placed by those 9 "Smiths".

Okay, we have the SQL commands - but how can we check their execution time on the database? PostgreSQL provides two commands that show details about executed statements:
a) EXPLAIN SQL_command
b) EXPLAIN ANALYZE SQL_command - a variant of the first with one small difference: the examined query is actually executed and EXPLAIN ANALYZE returns the real execution time.
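For example, to examine our first query you simply prefix it with the chosen keyword. Note that, unlike plain EXPLAIN, EXPLAIN ANALYZE really executes the statement, so for data-modifying commands you may want to wrap it in a transaction and roll it back:

EXPLAIN select * from customers where lname = 'Smith';

EXPLAIN ANALYZE select * from customers where lname = 'Smith';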

Before we look at the output of these commands, some technical information about how PostgreSQL processes a query: when the PostgreSQL server receives an SQL command from a client application, the query string is passed to the parser, which checks it for syntax errors. If everything is ok, the parser builds a structure called the query tree from the query. This structure is then handed to the query planner, which generates the possible query plans and chooses the cheapest one (the one with the lowest estimated cost). That plan is passed to the executor, which performs the query and returns the records in the way described by the plan.
The EXPLAIN command mentioned above displays the details of the plan chosen by the planner as the cheapest one. Let's see the result of executing EXPLAIN ANALYZE for the queries about the Smith customers:

[screenshots: EXPLAIN ANALYZE output of both queries, before creating the indexes]
Because reading query plans and their details is an art of its own, we will focus on the actual run time (visible as "Total runtime": respectively 1,482 ms for the first query and 16,552 ms for the second one).

Let's provide indexes for the columns we search on and for the columns on which the join is performed. In our case these will be:
a) the lname column of the customers table - create an index with the command:

CREATE INDEX lname_index ON customers USING btree (lname);

b) the foreign key column in the orders table (the customer_id column) - create an index with the command:

CREATE INDEX customer_fk_index ON orders USING btree (customer_id);

Note:
After you create the indexes, you have to refresh the statistics used by the database planner to calculate the optimal query plan. This is done by running the VACUUM ANALYZE command. After executing it, once again run the EXPLAIN ANALYZE command for our "Smith" queries:
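A minimal sketch of the whole sequence - the statistics refresh followed by the repeated measurements (nothing new is assumed here, only the tables and indexes created above):

VACUUM ANALYZE customers;
VACUUM ANALYZE orders;

EXPLAIN ANALYZE select * from customers where lname = 'Smith';
EXPLAIN ANALYZE select * from orders join customers
    on orders.customer_id = customers.customer_id
    where customers.lname = 'Smith';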

[screenshots: EXPLAIN ANALYZE output of both queries, after creating the indexes]

The queries now run many times faster than before: respectively 0,123 ms (1,482 ms previously) and 0,507 ms (16,552 ms previously).

The advantage is obvious. But will an index solve all our problems? Not always. There is no guarantee that the planner will actually use the index. Sometimes a sequential scan of the table turns out to be faster than using an index (such behaviour may depend on the "selectivity" of the query). In most cases, though, it is good to have indexes on the columns you search and join on - they can only help.
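If you want to check whether the planner actually uses an index for a particular query, one handy diagnostic trick is to compare the default plan with the plan chosen when sequential scans are discouraged via the enable_seqscan setting. This is only a sketch for experimenting in a session, not something to leave switched off in production:

-- plan chosen by default
EXPLAIN ANALYZE select * from customers where lname = 'Smith';

-- discourage sequential scans and compare the plan and the run time
SET enable_seqscan = off;
EXPLAIN ANALYZE select * from customers where lname = 'Smith';
SET enable_seqscan = on;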