Wednesday, February 24, 2010

"My databse is slow" - part 3: some technical stuff

Note: the description below uses PostgreSQL 8.3.7.

In the previous two articles I tried to show what can affect the overall performance of the database and how important indexes are.
If we assume that the database is normalized, the server's memory settings are fine and we have indexes, the only way left to check performance is to monitor the query plans. Therefore I would like to return once again to query plans and look at them in more detail.

From the previous post we know that query plans can be generated using two commands:
  • EXPLAIN SQL_command
  • EXPLAIN ANALYZE SQL_command - a variant of the first with one small difference: the examined query is actually executed, and EXPLAIN ANALYZE returns the real execution time.
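Since EXPLAIN ANALYZE really executes the statement, it is worth wrapping a data-modifying command in a transaction that is rolled back (the delete below is only an illustration on our test table):

begin;
explain analyze delete from customers where lname = 'Cervantes';
rollback;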
Using the test database from the previous post, let's see the query plan for the following SQL command:

select * from customers where lname = 'Cervantes';

To see the query plan, execute these two commands:

vacuum analyze;
explain select * from customers where lname = 'Cervantes';

The query plan generated for this query is as follows:
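(The plan shape and the index name, here customers_lname_idx, are only a sketch of what the test schema would produce; the numbers are the ones discussed below.)

                                      QUERY PLAN
-----------------------------------------------------------------------------------------
 Bitmap Heap Scan on customers  (cost=4.51..59.47 rows=33 width=18)
   Recheck Cond: (lname = 'Cervantes'::text)
   ->  Bitmap Index Scan on customers_lname_idx  (cost=0.00..4.50 rows=33 width=0)
         Index Cond: (lname = 'Cervantes'::text)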


We are always most interested in the topmost line of the displayed plan - it shows the total cost of executing the query, including all the internal operations of more complex queries (e.g. joins, sorts, etc.).

In our case we have: (cost=4.51..59.47 rows=33 width=18). What do these values mean?
  1. cost=4.51..59.47 - an interval describing the estimated cost of execution, measured in arbitrary planner units (traditionally "disk page fetches") - these units are not seconds or any other unit of time. The first number estimates the cost of returning the first row of the result, the second estimates the total cost of returning all rows. Calculating these values is "higher mathematics", especially when the query uses anything more than a primitive sequential scan of the table - which is the simplest case and can easily be computed by hand. For primitive examples like "select * from some_table" (with no indexes and no "where" conditions, so we can be 100% sure a sequential scan will be performed) you can find examples of these calculations on the Web. Those interested in more details and technical stuff should visit this page.
  2. rows=33 - the estimated number of rows returned by the query.
  3. width=18 - the estimated width (in bytes) of a single returned row.

The data above are derived from statistics kept in special system tables. Two of them are crucial for the query planner:
  1. pg_class - contains information about database relations (tables, indexes, etc.), such as their location and size.
  2. pg_stats - contains information about the data stored in the tables, such as the most common values and how frequently they occur.
Let's see what the statistics for the "customers" table look like and how they influence some elements of the query plan for the above query about the surname "Cervantes".



For the "customers" table, pg_class shows that it has 10,000 records and is stored on 62 disk "pages".
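These numbers can be read directly from the relpages and reltuples columns of pg_class:

select relname, relpages, reltuples from pg_class where relname = 'customers';

  relname  | relpages | reltuples
-----------+----------+-----------
 customers |       62 |     10000
(1 row)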


pg_stats, in turn, holds statistics about the columns of the "customers" table and their content: which values are the most common, how frequently they occur, and how many distinct values there are.
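These per-column statistics can be inspected with a query along these lines (the columns that matter for our example are most_common_vals, most_common_freqs and n_distinct):

select attname, n_distinct, most_common_vals, most_common_freqs
from pg_stats
where tablename = 'customers' and attname = 'lname';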
What does this data mean in the context of our query for "Cervantes"?

pg_class - this is the basis for calculating the cost of query execution. For our "Cervantes" query, in the case of a sequential scan (i.e. if there were no index on the "lname" field in the "customers" table), we would calculate the cost from the formula:
(relpages * SEQ_PAGE_COST) + (reltuples * CPU_TUPLE_COST) + (reltuples * CPU_OPERATOR_COST)
The capitalized names are the costs of elementary operations - they are defined in postgresql.conf. Unfortunately (or "fortunately"), our query uses an index, so calculating the cost is a bit more complicated (a task for volunteers, the formulas are here, you only need to be familiar with the concept of least recently used (LRU) buffers and the Mackert-Lohman formula ;-)).
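Plugging in the values from pg_class and the default cost settings of version 8.3 (seq_page_cost = 1.0, cpu_tuple_cost = 0.01, cpu_operator_cost = 0.0025), a sequential scan of "customers" with this where-condition would be estimated at roughly:

(62 * 1.0) + (10000 * 0.01) + (10000 * 0.0025) = 62 + 100 + 25 = 187

which is noticeably more than the 59.47 estimated for the index-based plan above.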

pg_stats - allows the planner to estimate the number of rows returned by a query with a given logical condition (in our case: "where lname = 'Cervantes'"). For the "Cervantes" query we are lucky - "Cervantes" appears directly in the list of the most common values, together with its frequency. The number of returned rows is calculated using the formula:
reltuples * selectivity
where:
selectivity = most_common_freqs["Cervantes"]
Hence we get:
10000 * 0.00333333 = 33
And what if we are looking for a value that does not appear in the most_common_vals list? Then the selectivity is computed differently:
selectivity = (1 - sum(most_common_freqs)) / (n_distinct - num_most_common_vals)
Calculating the selectivity for the surname "Smith" I leave to volunteers, as usual ;-). Remember to run "vacuum analyze" first, and only then look at the statistics and the query plan.
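Just to illustrate the formula with purely made-up numbers: if pg_stats listed 10 most common surnames whose frequencies sum to 0.05, and n_distinct were 500, we would get:

selectivity = (1 - 0.05) / (500 - 10) ≈ 0.00194
10000 * 0.00194 ≈ 19 estimated rows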

-------------------------------------------- 
Some notes about the statistics:

The content of the statistics tables is generated from random samples of the data in the database tables. Full statistics are generated (and therefore refreshed) during execution of the VACUUM ANALYZE command, so it is very important to run this command periodically to keep the planner's statistics up to date. That is why the first command executed in the example above was VACUUM ANALYZE - so that the query plan would be based on the most reliable statistics.
Note that the next execution of this command may generate slightly different statistics, so the query plan for the same query may also differ slightly (in cost and estimated number of rows). This is a consequence of the random sampling mentioned above.
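One way to check whether the statistics for a table are fresh is to look at the pg_stat_user_tables view:

select relname, last_analyze, last_autoanalyze
from pg_stat_user_tables
where relname = 'customers';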

--------------------------------------------
A few final notes about performance

I can say that the general rule for optimizing query execution time is as follows:
"There is no certain way to increase performance in all cases. The best method is trial and error."
In practice this means trying to generate different query plans (besides the one generated by default) and choosing the one with the best execution time. These attempts usually involve forcing the query planner to behave differently, which leads to a different query plan. This can be achieved by:

1. Forcing the query planner to use an index (often the index exists but is not used by the planner). This can be done, for example, by the techniques below (a few sketches follow this list):

  • switching off sequential scans globally by setting enable_seqscan = off in the postgresql.conf file
  • partial (conditional) indexes - you can create a conditional index covering the most frequently queried data in a column
  • multicolumn indexes (watch out for the order - the order of the fields in the index should match the order in which they are used in queries)
  • some SQL tricks that change the query's selectivity, for example using LIMIT, ORDER BY or a trailing condition AND TABLE_ID = TABLE_ID (where TABLE_ID is the primary key of the table)
2. Changing the order in which multiple tables are joined - for example, executing "(A join B) join C" can perform differently than "A join (B join C)". This is similar to the problem of optimal matrix-chain multiplication.
3. Using stored procedures and views, or breaking a huge query into a number of smaller ones.
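A few sketches of the techniques from points 1 and 2 (the index names and the fname column are only illustrative, they are not part of the test schema):

set enable_seqscan = off;            -- discourage sequential scans in the current session
create index customers_lname_part_idx
    on customers (lname)
    where lname = 'Cervantes';       -- partial (conditional) index
create index customers_lname_fname_idx
    on customers (lname, fname);     -- multicolumn index, lname first
set join_collapse_limit = 1;         -- make the planner keep the JOIN order written in the query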

An interesting trick (found on the Web):
To monitor the progress of a long-running query (one that takes a really long time, not just a few seconds), you can create a sequence starting at 1 (e.g. a sequence called "monitor_seq") and add it to the executed query, as shown below.
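The sequence itself can be created like this (the default starting value is 1):

create sequence monitor_seq;

A call to nextval is then added to the long query: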


select nextval('monitor_seq'), column1, column2....

Because currval() is visible only in the session that has called nextval(), the progress is checked from another session directly on the sequence. Executing:

select last_value from monitor_seq;

while the long query is running will show how many rows have been processed so far.