Monday, October 12, 2009

"My database is slow" - part 1

Note: below description uses PostgreSQL 8.3.7.

Perhaps You often heard complaints that some application is veeeery sloooow. Despite the technology used, bought another X GB of RAM and so on the problem still exists. Sometimes the cause can be very simple - interaction with the database can be a bottleneck and a source of all problems. This can be checked and possibly improved.
Often the developer itself must change something in the database - do not be afraid to say it - hire a database magician cost a lot of money for the company and often the developer becomes a database specialist ;-). Therefore, anyone who changes something in the database must have some knowledge. Otherwise, some ad-hoc changes can cause a lot of performance problems.
What can you do then? Having the database, You can take a closer look on some points:

1. Database normalization (decomposition).

Simply said - making a database to be in the n-th normal form. You may naively believe that "the database engine can handle" somehow our SQL query, without worrying ourselves about performance. Theoretically yes, even nightmarish database schema will work somehow - after all, a bit smarter (than the rest) people formed the principles of relational databases. The problem starts when our application is no longer a primitive data browser which executes the same "select" SQL query all the time. That's why the other smarter people have invented a normalization. For example this allows the database engine not to retrieve 20 columns from one table just to show us only those 2 which we are interested.
For example let's assume that we have a company and we store all the employees in a one big table:

CREATE TABLE employees ( 
id                int4 PRIMARY KEY,
active            boolean, 
first_name        varchar, 
middle_name       varchar, 
last_name         varchar, 
ssn               varchar, 
address1          varchar, 
address2          varchar, 
city              varchar, 
state             varchar(2), 
zip               varchar, 
home_phone        varchar, 
work_phone        varchar, 
cell_phone        varchar, 
fax_phone         varchar, 
pager_number      varchar, 
business_email    varchar, 
personal_email    varchar, 
salary            int4, 
vacation_days     int2, 
sick_days         int2, 
employee_number   int4, 
office_addr_1     varchar,
office_addr_2     varchar, 
office_city       varchar, 
office_state      varchar(2), 
office_zip        varchar, 
department        varchar,
title             varchar,
supervisor_id     int4
); 

And now some Secretary wants to find those people who earn more than 5000 USD:

SELECT first_name, last_name from employees where salary > 5000 ;

Let's assume that the company is a big corporation which employs thousands of people. Database engine must scan entire table, retrieve all columns for people who earn more than 5000 and at the end pefrorm a projection and return only 2 columns - first name and last name.

Note: do not kid ourselves - normalization is not an ideal remedy for performance problems, and sometimes may even contribute to decrease a performance. Why? For example let's take huge reporting queries that collect a lot of cross-sectional data. In the fully normalized database, the amount of join operations between tables can be very large and could have significant impact on performance. Sometimes it is better to have non-normalized parts of database schema, or even have non-normalized redundant data tables (for generating reports) next to normalized parts. There is no silver bullet solution for that - the approach depends on specific problem. 

2. Memory configuration for the server.  

The fact that Your hardware allows You to play latest games, does not mean that the database will work fast. Even 10-th normal form of a database schema and super-optimal queries will not help when server has total memory similar to required by old DOS games. Sooner or later because of lack of memory many I/O disk operation will be performed to compensate memory request. System resources are important. Unfortunately, at this point databases vary widely in terms of available configuration options. "No silver bullet" also applies here - the only way to discover best performance is to tune database memory settings and test it against SQL queries. 

3. Opening connection to the database.

Although the authors of JDBC drivers maintain that connection cost is small and well optimized, we must not forget that after all this is some expense. As an example, consider a typical web application where in every moment a user clicks on different screens, which show him various data. We have to approaches here:
  • maintaining an open connection during user's working time
  • opening connection for each request, performing some database query, returns the result (which is presented to the user) and closing the connection.
First approach is expensive (in terms of resources and performance), second is much better. But when we take into account the large number of users and their actions, total cost of opening and closing connections may have considerable importance. It is worth here to consider using the so-called connection pool - such as Apache DBCP or C3P0. Responsibility for creating/closing of connections goes to connection pool. The pool itself has various configuration options that can increase the efficiency of cooperation with the database application. For example some connections may have been opened after startup and ready to immediate use and so on. 

4. Transaction isolation levels and locks.

Transaction isolation levels have a direct impact on the types of locks used. If You are not sure how to choose the level of transaction isolation, it is better to remain at the default level for a particular database. You have to remember one important thing: the higher isolation level you choose the more subtle and complex lock must be use by the database engine to provide concurrent access to data. Using higher levels of isolation can also cause deadlock occurrence.


Note: You may not agree with mentioned points. They are only my personal subjective choice based on problems and observations during my work. Database performance is a very broad issue dependent on many factors - I pointed only four general issues, but You may find another four or more which will be just as important and necessary.