MySQL/Optimization
Before Starting To Optimise
When the database seems to be "slow", first consider all of the following points, as e.g. making a certain query entirely unnecessary by using a more sophisticated algorithm in the application is always the most elegant way of optimising it :)
- Finding the bottleneck (CPU, memory, I/O, which queries)
- Optimising the application (remove unnecessary queries or cache PHP generated web pages)
- Optimising the queries (using indices, temporary tables or different ways of joining)
- Optimising the database server (cache sizes etc)
- Optimising the system (different filesystem types, swap space and kernel versions)
- Optimising the hardware (sometimes indeed the cheapest and fastest way)
To find those bottlenecks the following tools have been found to be helpful:
- vmstat
- to quickly monitor cpu, memory and I/O usage and decide which is the bottleneck
- top
- to check the current memory and cpu usage of mysqld as well as of the applications
- mytop
- to figure out which queries cause trouble
- mysql-admin (the GUI application, not to be confused with mysqladmin)
- to monitor and tune mysql in a very convenient way
- mysqlreport
- whose output can be used as a kind of step-by-step checklist
Using these tools, most applications can also be categorised very broadly into the following groups:
- I/O based and reading (blogs, news)
- I/O based and writing (web access tracker, accounting data collection)
- CPU based (complex content management systems, business apps)
Optimising the Tables
Use the following command regularly to reorganize the disk space, which reduces the table size without deleting any records[1]:
OPTIMIZE TABLE MyTable1
Moreover, when creating tables, prefer the smallest suitable column types. For example:
- if a number is always positive, choose an unsigned type to be able to store twice as large a range in the same number of bytes.
- to store contemporary dates (from 1970 to 2038), it is better to take a timestamp on four bytes than a datetime on eight.[2]
Optimising the Queries
Comparing functions with BENCHMARK
The BENCHMARK() function can be used to compare the speed of MySQL functions or operators. For example:
mysql> SELECT BENCHMARK(100000000, CONCAT('a','b'));
+---------------------------------------+
| BENCHMARK(100000000, CONCAT('a','b')) |
+---------------------------------------+
|                                     0 |
+---------------------------------------+
1 row in set (21.30 sec)
However, this cannot be used to compare queries:
mysql> SELECT BENCHMARK(100, SELECT `id` FROM `lines`);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that
corresponds to your MySQL server version for the right syntax to use near
'SELECT `id` FROM `lines`)' at line 1
As MySQL needs a fraction of a second just to parse the query, and the system is probably busy doing other things too, benchmarks with runtimes of less than 5-10 seconds can be considered meaningless, and runtime differences of that order of magnitude pure chance.
Analysing functions with EXPLAIN
When you precede a SELECT statement with the keyword EXPLAIN, MySQL explains how it would process the SELECT, providing information about how tables are joined and in which order. This makes it possible to see where indices are missing and to place possible optimizer hints accordingly.
Using and understanding EXPLAIN is essential when aiming for good performance; the relevant chapters of the official documentation are therefore mandatory reading!
A simple example
The join of two tables, neither of which has an index:
mysql> explain SELECT * FROM a left join b using (i) WHERE a.i < 2;
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | a | ALL | NULL | NULL | NULL | NULL | 4 | Using where |
| 1 | SIMPLE | b | ALL | NULL | NULL | NULL | NULL | 3 | |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
2 rows in set (0.01 sec)
Now the second table gets an index, and the EXPLAIN output shows that MySQL now knows that only 2 of the 3 rows have to be used.
mysql> ALTER TABLE b ADD KEY(i);
Query OK, 3 rows affected (0.01 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> explain SELECT * FROM a left join b using (i) WHERE a.i < 2;
+----+-------------+-------+------+---------------+------+---------+----------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+----------+------+-------------+
| 1 | SIMPLE | a | ALL | NULL | NULL | NULL | NULL | 4 | Using where |
| 1 | SIMPLE | b | ref | i | i | 5 | test.a.i | 2 | |
+----+-------------+-------+------+---------------+------+---------+----------+------+-------------+
2 rows in set (0.00 sec)
Now the first table also gets an index so that the WHERE condition can be evaluated better, and MySQL knows that only 1 row from the first table is relevant before even searching for it in the data file.
mysql> ALTER TABLE a ADD KEY(i);
Query OK, 4 rows affected (0.00 sec)
Records: 4 Duplicates: 0 Warnings: 0
mysql> explain SELECT * FROM a left join b using (i) WHERE a.i < 2;
+----+-------------+-------+-------+---------------+------+---------+----------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+----------+------+-------------+
| 1 | SIMPLE | a | range | i | i | 5 | NULL | 1 | Using where |
| 1 | SIMPLE | b | ref | i | i | 5 | test.a.i | 2 | |
+----+-------------+-------+-------+---------------+------+---------+----------+------+-------------+
2 rows in set (0.02 sec)
Optimising The MySQL Server
Status and server variables
MySQL can be monitored and tuned by watching the status variables and setting the server variables, both of which can be global or per session. The status variables can be monitored with SHOW [GLOBAL|SESSION] STATUS [LIKE '%foo%'] or mysqladmin [extended-]status. The server variables can be set in the /etc/mysql/my.cnf file or via SET [GLOBAL|SESSION] foo = bar, and shown with mysqladmin variables or SHOW [GLOBAL|SESSION] VARIABLES [LIKE '%foo%'].
Generally status variables start with a capital letter and server variables with a lowercase one.
When dealing with the above-mentioned per-session system variables, it should always be considered that those have to be multiplied by max_connections to estimate the maximal memory consumption. Failing to do so can easily lead to server crashes during load peaks, when more clients than usual connect to the server! A quick and dirty estimate can be made with the following formula:
min_memory_needed = global_buffers + (thread_buffers * max_connections)
global_buffers: key_buffer innodb_buffer_pool innodb_log_buffer innodb_additional_mem_pool net_buffer
thread_buffers: sort_buffer myisam_sort_buffer read_buffer join_buffer read_rnd_buffer
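The estimate can be sketched as a small script. The buffer names mirror the settings listed above; the sample sizes are made-up example values and must be replaced with the settings from your own my.cnf.

```python
# Rough upper-bound estimate of mysqld memory consumption.
# All sizes are example values in bytes; substitute your own settings.

MB = 1024 * 1024

global_buffers = {
    "key_buffer_size": 128 * MB,
    "innodb_buffer_pool_size": 512 * MB,
    "innodb_log_buffer_size": 8 * MB,
    "innodb_additional_mem_pool_size": 8 * MB,
    "net_buffer_length": 16 * 1024,
}

per_thread_buffers = {
    "sort_buffer_size": 2 * MB,
    "myisam_sort_buffer_size": 8 * MB,
    "read_buffer_size": 128 * 1024,
    "join_buffer_size": 256 * 1024,
    "read_rnd_buffer_size": 256 * 1024,
}

max_connections = 100

def min_memory_needed(global_bufs, thread_bufs, max_conns):
    """min_memory_needed = global_buffers + (thread_buffers * max_connections)"""
    return sum(global_bufs.values()) + sum(thread_bufs.values()) * max_conns

print(f"{min_memory_needed(global_buffers, per_thread_buffers, max_connections) / MB:.0f} MB")
```

With these example settings the worst case is well over 1.5 GB, which shows why a harmless-looking per-thread buffer can sink a server with many connections.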
Note: Especially when dealing with server settings, all information should be verified against the respective chapters of the official documentation, as these are subject to change and the authors of this text lack confirmed knowledge about how the server works internally.
Index / Indices
Indices are a way to locate elements faster. This works for single elements as well as for ranges of elements.
Experiment
Note: when you run your timing tests, make sure the query cache is disabled (query_cache_type=0
in my.cnf) to force MySQL to recompute your queries each time you type them instead of just returning the pre-computed results from the cache.
Let's run the following Perl program:
#!/usr/bin/perl
use strict;
print "DROP TABLE IF EXISTS weightin;\n";
print "CREATE TABLE weightin (
id INT PRIMARY KEY auto_increment,
line TINYINT,
date DATETIME,
weight FLOAT(8,3)
);\n";
# 2 million records, interval = 100 s
for (my $timestamp = 1000000000; $timestamp < 1200000000; $timestamp += 100) {
my $date = int($timestamp + rand(1000) - 500);
my $weight = rand(1000);
my $line = int(rand(3)) + 1;
print "INSERT INTO weightin (date, line, weight) VALUES (FROM_UNIXTIME($date), $line, $weight);\n";
}
What does it do? It simulates the data feed from industrial lines that weigh material at regular intervals so that we can compute the average material usage. Over time, lots of records pile up.
How to use it?
mysql> CREATE DATABASE industrial;
$ time perl generate_huge_db.pl | mysql industrial
real 6m21.042s
user 0m37.282s
sys 0m51.467s
We can check the number of elements with:
mysql> SELECT COUNT(*) FROM weightin;
+----------+
| count(*) |
+----------+
| 2000000 |
+----------+
1 row in set (0.00 sec)
The resulting files are fairly large:
$ perl generate_huge_db.pl > import.sql
$ ls -lh import.sql
-rw-r--r-- 1 root root 189M jun 15 22:08 import.sql
$ ls -lh /var/lib/mysql/industrial/weightin.MYD
-rw-rw---- 1 mysql mysql 35M jun 15 22:17 /var/lib/mysql/industrial/weightin.MYD
$ time mysqldump industrial > dump.sql
real 0m9.599s
user 0m3.792s
sys 0m0.616s
$ ls -lh dump.sql
-rw-r--r-- 1 root root 79M jun 15 22:18 dump.sql
$ time mysqldump industrial | gzip > dump.sql.gz
real 0m17.339s
user 0m11.897s
sys 0m0.488s
$ ls -lh dump.sql.gz
-rw-r--r-- 1 root root 22M jun 15 22:19 dump.sql.gz
Incidentally, restoring from the dump is way faster, because it uses extended inserts!
# time zcat dump.sql.gz | mysql industrial
real 0m31.772s
user 0m3.436s
sys 0m0.580s
This SQL command will scan all records to compute a total sum:
mysql> SELECT SUM(weight) FROM weightin;
Let's say we need to compute the total material used during January 1st 2008:
mysql> SELECT COUNT(*), SUM(weight) FROM weightin WHERE date >= '2008-01-01' AND date < '2008-01-02';
MySQL will again need to scan the entire table, even for this tiny number of records. This is because the matching records can be anywhere: at the beginning, at the end, in the middle; nothing guarantees that the records are ordered.
To improve this, we can add an index on the 'date' field. MySQL will then maintain a hidden structure with all the dates sorted chronologically, storing for each its offset (position) in the 'weightin' table so that the full record can be retrieved.
Because the index is sorted, it is way faster for MySQL to locate a single record (using a binary search algorithm) or even a range of data (find the first and last elements; the range lies in between).
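The lookup strategy can be illustrated in a few lines of Python. This is a simplified model, not MySQL's actual on-disk format: `bisect` performs the same kind of binary search on a sorted list of (date, offset) pairs that MySQL performs on its index structure.

```python
import bisect

# Simplified model of an index: (key, row_offset) pairs kept sorted by key.
# The offsets are made-up positions of the full records in the data file.
index = sorted([("2008-01-01 08:00", 17), ("2007-12-31 23:00", 3),
                ("2008-01-01 12:00", 42), ("2008-01-02 00:30", 58)])

def range_lookup(index, lo, hi):
    """Return row offsets for lo <= key < hi using two binary searches."""
    start = bisect.bisect_left(index, (lo,))
    end = bisect.bisect_left(index, (hi,))
    return [offset for _, offset in index[start:end]]

# All rows from January 1st 2008, without scanning the whole "table":
print(range_lookup(index, "2008-01-01", "2008-01-02"))  # [17, 42]
```

Two O(log n) searches bound the range; everything in between is guaranteed to match, which is why sorted indices make range conditions like `WHERE date >= ... AND date < ...` cheap.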
To add the index:
ALTER TABLE weightin ADD INDEX (date);
The index is not used if the query needs a computation on the field (e.g. TIME(date)
), but it works for ranges (e.g. WHERE date < '2008-01-02'
).
You can notice that the .MYI file grew:
$ ls -lh /var/lib/mysql/industrial/
-rw-rw---- 1 mysql mysql 49M jun 15 22:36 weightin.MYI
That's where MySQL stores the indices. Initially there was already an index for the 'id' field, which is the case for all primary keys.
Another example
Another example: let's say we want to optimise this query:
mysql> SELECT DISTINCT line FROM weightin;
We can do so by adding an index on the 'line' field, which groups the duplicates together and saves the query from rescanning the whole table to locate them:
ALTER TABLE weightin ADD INDEX (line);
The index file grew:
-rw-rw---- 1 mysql mysql 65M jun 15 22:38 weightin.MYI
General considerations
The first and foremost question for SELECT queries is always whether indices (aka "keys") are configured and, if they are, whether they are actually used by the database server.
- 1. Check if the indices are actually used
Individual queries can be checked with the "EXPLAIN" command. For the whole server the "Sort_%" variables should be monitored as they indicate how often MySQL had to browse through the whole data file because there was no usable index available.
- 2. Are the indices buffered?
Keeping the indices in memory improves read performance a lot. The quotient Key_reads / Key_read_requests tells how often MySQL actually had to access the index file on disk when it needed a key. The same goes for Key_writes; use mysqlreport to do the math for you here. If the percentage is too high, key_buffer_size for MyISAM and innodb_buffer_pool_size for InnoDB are the corresponding variables to tune.
The Key_blocks_% variables can be used to see how much of the configured key buffer is actually used. The unit is 1 KB unless set otherwise in key_cache_block_size. As MySQL uses some blocks internally, Key_blocks_unused has to be checked. To estimate how big the buffer should be, the sizes of the relevant .MYI files can be summed up. For InnoDB there is innodb_buffer_pool_size, although in that case not only the indices but also the data is buffered.
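The miss ratio described above is easy to compute by hand from the two status variables. A sketch (the sample numbers are made up):

```python
def key_cache_miss_ratio(key_reads, key_read_requests):
    """Fraction of key requests that had to go to the index file on disk.
    Key_reads / Key_read_requests, as reported by SHOW GLOBAL STATUS.
    A common rule of thumb is to keep this well below a few percent."""
    if key_read_requests == 0:
        return 0.0
    return key_reads / key_read_requests

# Example values (made up): 6,000 disk reads for 1,000,000 key requests.
print(f"{key_cache_miss_ratio(6_000, 1_000_000):.2%}")  # 0.60%
```

If the ratio stays high even with a generous key_buffer_size, the working set of indices simply does not fit in memory and either the buffer or the indices themselves need rethinking.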
- 3. Further settings
sort_buffer_size (per-thread) is the memory that is used for ORDER BY and GROUP BY. myisam_sort_buffer_size is something completely different and should not be altered.
read_buffer_size (per-thread) is the size of the memory chunks that are read from disk into memory at once when doing a full table scan, as big tables do not fit into memory completely. This seldom needs tuning.
Query cache
The main reason not to use any MySQL version below 4.0.1 for read-heavy applications is that beginning with that version, MySQL can store the results of SELECT queries until their tables are modified.
The query cache can be configured with the query_cache_% variables. Most important here are the global query_cache_size and query_cache_limit, which prevents single queries with unusually big results (larger than this size) from using up the whole cache.
Note that the query cache blocks have a variable size, the minimum being query_cache_min_res_unit; after a complete cache flush the number of free blocks is therefore ideally just one. A large value of Qcache_free_blocks simply indicates high fragmentation.
Worth monitoring are the following variables:
- Qcache_free_blocks
- If this value is high, it indicates high fragmentation, which need not be a bad thing though.
- Qcache_not_cached
- If this value is high, there are either many uncachable queries (e.g. because they use functions like NOW()) or the value of query_cache_limit is too low.
- Qcache_lowmem_prunes
- This is the number of old results that were purged because the cache was full, not because their underlying tables were modified. query_cache_size must be increased to lower this variable.
Examples:
An empty cache:
mysql> SHOW VARIABLES LIKE 'query_cache_type';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| query_cache_type | ON    |
+------------------+-------+
1 row in set (0.00 sec)

mysql> SHOW VARIABLES LIKE 'query_cache_size';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| query_cache_size | 0     |
+------------------+-------+
1 row in set (0.00 sec)

mysql> SHOW STATUS LIKE 'Qcache%';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Qcache_free_blocks      | 0     |
| Qcache_free_memory      | 0     |
| Qcache_hits             | 0     |
| Qcache_inserts          | 0     |
| Qcache_lowmem_prunes    | 0     |
| Qcache_not_cached       | 0     |
| Qcache_queries_in_cache | 0     |
| Qcache_total_blocks     | 0     |
+-------------------------+-------+
8 rows in set (0.00 sec)
A used cache (savannah.gnu.org):
mysql> SHOW VARIABLES LIKE "query_cache_size";
+------------------+----------+
| Variable_name    | Value    |
+------------------+----------+
| query_cache_size | 33554432 |
+------------------+----------+
1 row in set (0.00 sec)

mysql> SHOW STATUS LIKE "Qcache%";
+-------------------------+----------+
| Variable_name           | Value    |
+-------------------------+----------+
| Qcache_free_blocks      | 1409     |
| Qcache_free_memory      | 27629552 |
| Qcache_hits             | 7925191  |
| Qcache_inserts          | 3400435  |
| Qcache_lowmem_prunes    | 2946778  |
| Qcache_not_cached       | 71255    |
| Qcache_queries_in_cache | 4546     |
| Qcache_total_blocks     | 10575    |
+-------------------------+----------+
8 rows in set (0.00 sec)
The matching my.cnf
configuration parameter is:
query_cache_size = 32M
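From status numbers like those shown above, the cache hit rate can be computed by hand. A sketch, using the common approximation that every SELECT either hits the cache, gets inserted into it, or is not cacheable (the savannah.gnu.org figures serve as input):

```python
def qcache_hit_rate(hits, inserts, not_cached):
    """Approximate fraction of SELECTs answered from the query cache,
    from the Qcache_hits, Qcache_inserts and Qcache_not_cached variables."""
    total = hits + inserts + not_cached
    return hits / total if total else 0.0

# Values from the savannah.gnu.org example above:
print(f"{qcache_hit_rate(7_925_191, 3_400_435, 71_255):.1%}")  # 69.5%
```

Roughly 70% of SELECTs served from the cache is decent; the high Qcache_lowmem_prunes in the same output nevertheless suggests that query_cache_size could be raised further.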
To clear the cache (useful when testing a new query's efficiency):
mysql> RESET QUERY CACHE;
Query OK, 0 rows affected (0.00 sec)
Waiting for locks
The Table_locks_% variables show the number of queries that had to wait because the tables they tried to access were currently locked by other queries. Such situations can be caused by "LOCK TABLE" statements as well as by e.g. simultaneous write accesses to the same table.
Table cache
MySQL needs a certain time just to "open" a table and read its metadata, like column names etc.
If many threads are trying to access the same table, it is opened multiple times.
To speed this up the meta data can be cached in the table_cache (alias table_open_cache since MySQL 5.1.3).
A good value for this setting is max_connections multiplied by the number of tables usually used per SELECT.
Using mysqlreport, or by looking at the currently Open_tables, the Opened_tables since server start, and the Uptime, the number of necessary table opens per second can be calculated (consider off-peak times like nights, though).
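The calculation just described is a one-liner; a sketch with made-up sample numbers:

```python
def table_opens_per_second(opened_tables, uptime_seconds):
    """Average rate of table opens since server start, derived from the
    Opened_tables and Uptime status variables."""
    return opened_tables / uptime_seconds if uptime_seconds else 0.0

# Example (made up): 864,000 table opens over a 10-day uptime.
rate = table_opens_per_second(864_000, 10 * 24 * 3600)
print(f"{rate:.2f} opens/s")  # 1.00 opens/s
```

A sustained rate of around one open per second or more usually means the table cache is too small for the working set of tables.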
Connections and threads
For every client connection (aka session), MySQL creates a separate thread under the main mysqld process. For big sites with several hundred new connections per second, creating the threads itself can consume a significant amount of time. To speed things up, idle threads can be cached after their client disconnects. As a rule of thumb, not more than one thread per second should be newly created. Clients that send several queries to the server should use persistent connections, e.g. via PHP's mysql_pconnect() function.
This cache can be configured with thread_cache_size and monitored with the Threads_% variables.
To avoid overload, MySQL blocks new connections if more than max_connections are currently in use. Start with max_used_connections and monitor the number of connections that were aborted in Aborted_clients and those that failed or timed out in Aborted_connects. Forgotten disconnects from clients that use persistent connections can easily lead to a denial-of-service situation, so be aware! Normally connections are closed after wait_timeout seconds of being idle.
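The rule of thumb above ("not more than one newly created thread per second") can likewise be checked from the status variables; the sample numbers here are made up:

```python
def threads_created_per_second(threads_created, uptime_seconds):
    """Rate of thread creation since server start, from the Threads_created
    and Uptime status variables. Should stay below roughly 1/s; otherwise
    thread_cache_size is probably too small."""
    return threads_created / uptime_seconds if uptime_seconds else 0.0

# Example (made up): 5,400 threads created during one hour of uptime.
print(f"{threads_created_per_second(5_400, 3_600):.2f}/s")  # 1.50/s
```

In this example the server creates 1.5 threads per second, so raising thread_cache_size (or switching clients to persistent connections) would be worth trying.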
Temporary tables
It is perfectly normal that MySQL creates temporary tables while sorting or grouping results. Those tables are either held in memory or, if too large, written to disk, which is naturally much slower. The number of disk tables among the Created_tmp_% variables should be negligible; otherwise the settings max_heap_table_size and tmp_table_size should be reconsidered.
Delayed writes
In situations like writing webserver access logs to a database, with many subsequent INSERT queries of rather unimportant data into the same table, performance can be improved by advising the server to cache the write requests for a little while and then write a whole batch of data to disk at once.
Be aware though that all the methods mentioned here contradict ACID compliance, because INSERT queries are acknowledged with OK to the client before the data has actually been written to disk and can thus still get lost in case of a power outage or server crash. Additionally, the side effects mentioned in the documentation often read like the patient information leaflet of a modern medicament...
MyISAM tables can be given the DELAY_KEY_WRITE option using CREATE or ALTER TABLE. The drawback is that after a crash the table is automatically marked as corrupt and has to be checked/repaired which can take some time.
InnoDB can be told with innodb_flush_log_at_trx_commit to delay writing the data a bit. In case of a server crash the data itself is supposed to be still consistent, just the indices have to be rebuilt.
INSERT DELAYED works with some of the main storage engines on a per-query basis.
Further reading
Useful links regarding the optimisation of MySQL servers:
- Various newsgroups and the MySQL mailing lists
- A guide to mysqlreport
- The book High Performance MySQL
- Tuning tips from the company EZ
- MySysop, a PHP script for MySQL optimisation and tuning