Saturday 17 August 2013

mysql connection cost in a simple python script

Using New Relic to profile one of my company's Django websites, I found out that a non-negligible amount of time was spent in mysqldb:connect: the application is not using a connection pool.

On my local computer I tested the cost of MySQL connections; here is the script:

import cProfile, pstats, MySQLdb

def connect():
    # A brand new connection for every call: no pooling
    return MySQLdb.connect('127.0.0.1', 'root', 'root')

def close(connection):
    connection.close()

def query(connection, query_string):
    cursor = connection.cursor()
    cursor.execute(query_string)
    res = cursor.fetchall()
    cursor.close()
    return res

def do_something_simple_on_result(res):
    for row in res:
        a = row[0].lower()
        
def test_multi_connect():       
    for i in range(0, 10000): 
        conn = connect()
        res = query(conn, "select SQL_NO_CACHE host from mysql.user where user='root'")
        do_something_simple_on_result(res)
        close(conn)
        
def test_one_connect():       
    conn = connect()
    for i in range(0, 10000): 
        res = query(conn, "select SQL_NO_CACHE host from mysql.user where user='root'")
        do_something_simple_on_result(res)
    close(conn)

if __name__ == '__main__':
    cProfile.run('test_multi_connect()', 'restats')
    p = pstats.Stats('restats')
    p.sort_stats('cumulative').print_stats('profile_db.py')
    
    cProfile.run('test_one_connect()', 'restats')
    p = pstats.Stats('restats')
    p.sort_stats('cumulative').print_stats('profile_db.py')

And the result:
         1369839 function calls (1369578 primitive calls) in 17.390 seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.237    0.237   17.390   17.390 profile_db.py:20(test_multi_connect)
    10000    0.045    0.000   10.750    0.001 profile_db.py:3(connect)
    10000    0.111    0.000    5.517    0.001 profile_db.py:9(query)
    10000    0.029    0.000    0.744    0.000 profile_db.py:6(close)
    10000    0.076    0.000    0.110    0.000 profile_db.py:16(do_something_simple_on_result)

         440096 function calls in 4.263 seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.081    0.081    4.263    4.263 profile_db.py:27(test_one_connect)
    10000    0.076    0.000    4.071    0.000 profile_db.py:9(query)
    10000    0.055    0.000    0.083    0.000 profile_db.py:16(do_something_simple_on_result)
        1    0.000    0.000    0.001    0.001 profile_db.py:3(connect)
        1    0.000    0.000    0.000    0.000 profile_db.py:6(close)

Whoa, 66% of the time is spent opening and closing connections. In simple terms, opening and closing a connection with MySQLdb takes about the same time as two simple queries.
If your application opens and closes a connection for each query, you may well be tripling the time spent in your database.

The average time to connect is about 1 ms, while the average time for a simple query is about 0.5 ms.

Even if you open only one connection per page view and run 10 queries per page, the cost of the connection is still around 20% of your total database time.

A DB connection pool is the solution (SQLAlchemy's pool, for example).
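
A minimal sketch of what pooling could look like with SQLAlchemy's QueuePool wrapping MySQLdb (pool sizes and credentials here are arbitrary placeholders):

import MySQLdb
from sqlalchemy import pool

# Keep up to 5 connections around instead of reconnecting for every request
mysql_pool = pool.QueuePool(
    lambda: MySQLdb.connect('127.0.0.1', 'root', 'root'),
    pool_size=5,
    max_overflow=10,
)

conn = mysql_pool.connect()   # checked out from the pool (or freshly opened)
cursor = conn.cursor()
cursor.execute("select host from mysql.user where user='root'")
print(cursor.fetchall())
cursor.close()
conn.close()                  # returned to the pool, not really closed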

Thursday 9 May 2013

Comparing python/mysql drivers

I made a small script to compare MySQLdb, oursql, mysql.connector and PyMySQL.
[Edit] Added PyPy 2.0 + PyMySQL results.

Here are the results (times in seconds):

Simple Select 1 record 10000x

  • 1.33500003815 MySQLdb
  • 2.92000007629 oursql
  • 4.75300002098 mysql.connector
  • 6.37299990654 pymysql
  • 3.05999994278 pypy + pymysql

Select * 500 rows 1000x

  • 5.27799987793 MySQLdb
  • 4.88400006294 oursql
  • 35.1779999733 mysql.connector
  • 45.4629998207 pymysql
  • 6.82800006866 pypy + pymysql 

Select 500 rows with 500 args 1000x

  • 2.97600007057 MySQLdb
  • 3.20799994469 oursql
  • 4.76099991798 mysql.connector
  • 4.15899991989 pymysql
  • 2.95799994469 pypy + pymysql 

Select 300k ids 5x

  • 1.84399986267 MySQLdb
  • 1.62400007248 oursql
  • 24.6400001049 mysql.connector
  • 35.6339998245 pymysql
  • 4.90300011635 pypy + pymysql 

Simple Select 1.3M ids 1x

  • 16.1340000629 MySQLdb
  • 70.8839998245 oursql
  • 247.088000059 mysql.connector
  • 339.72300005 pymysql
  • 44.5699999332 pypy + pymysql 
The script can be found here: https://github.com/Benoss/PythonMysqlDriversTest
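
The real measurements come from the script in that repo; the timing approach boils down to something like the sketch below (credentials, query and iteration counts are illustrative, and each driver is compared by swapping the import):

import time
import MySQLdb  # swap for pymysql, oursql or mysql.connector to compare drivers

def bench(connection, query_string, iterations):
    # Time `iterations` executions of the same query, fetching every row each time
    start = time.time()
    cursor = connection.cursor()
    for _ in range(iterations):
        cursor.execute(query_string)
        cursor.fetchall()
    cursor.close()
    return time.time() - start

conn = MySQLdb.connect('127.0.0.1', 'root', 'root')  # placeholder credentials
print(bench(conn, "select host from mysql.user where user='root'", 10000))
conn.close()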

Friday 26 April 2013

More complicated data profiling query

This is the extended version of basic-data-profiling; it is slower but gives you much more interesting information.

When you do data profiling you need to give examples for each data grouping; MySQL's GROUP_CONCAT function is here for you:

SELECT COUNT(*) AS nb, tablename.title AS datas, 
 SUBSTRING_INDEX(GROUP_CONCAT(tablename.emp_no ORDER BY RAND(42)), ',', 10) AS Examples
FROM titles AS tablename
GROUP BY datas
ORDER BY nb DESC;

Result:
nb      datas               Examples
115003  Engineer            222309,64167,287849,222887,256836,237309,262823,242428,222680,225055
107391  Staff               427715,209965,36589,74250,407627,251530,409579,254748,456807,204250
97750   Senior Engineer     238256,468681,477369,205608,224208,263874,251767,82661,245168,213359
92853   Senior Staff        83208,45313,211968,264723,36232,263944,46428,471920,66956,442742
15159   Technique Leader    260169,436055,295995,251618,434080,492772,54096,18718,97636,496663
15128   Assistant Engineer  448269,461300,443947,417358,21784,437352,412830,94900,262281,98231
24      Manager             110511,110800,110022,110303,111133,110420,110567,110854,111784,111692


Explanations:
  • COUNT(*) counts the number of records for each value grouped by the GROUP BY clause
  • tablename.title AS datas shows the name of the current group in the column "datas"
  • GROUP_CONCAT joins the string values of its first argument with commas between them; the maximum size of the result is set by the group_concat_max_len variable (default 1024 characters)
  • ORDER BY RAND(42) is the internal order of the GROUP_CONCAT; here we order randomly with the seed 42, so rerunning the query with the same seed gives exactly the same random order. If you want different examples each time, change the seed or use ORDER BY RAND() instead
  • SUBSTRING_INDEX looks for the comma separator and keeps only everything left of the 10th comma
  • ORDER BY nb DESC orders the final result by the number of occurrences of each value, putting the most frequent data at the top of the table
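
If you want to run this profiling from Python (to dump it into a report, for example), a minimal sketch with MySQLdb could look like this; the credentials and the employees sample database are assumptions:

import MySQLdb

conn = MySQLdb.connect('127.0.0.1', 'root', 'root', 'employees')  # placeholder credentials
cursor = conn.cursor()
# Give GROUP_CONCAT some room in case the default 1024-char limit is too small
cursor.execute("SET SESSION group_concat_max_len = 8192")
cursor.execute("""
    SELECT COUNT(*) AS nb, tablename.title AS datas,
           SUBSTRING_INDEX(GROUP_CONCAT(tablename.emp_no ORDER BY RAND(42)), ',', 10) AS Examples
    FROM titles AS tablename
    GROUP BY datas
    ORDER BY nb DESC
""")
for nb, datas, examples in cursor.fetchall():
    print("%s %s %s" % (nb, datas, examples))
cursor.close()
conn.close()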

Thursday 25 April 2013

Create a Website in Python in 30sec

With a small single-file library called BottlePy and nothing but a standard Python install, you can create a website in less than 30 seconds.

  • Download bottle.py file here bottle.py on GitHub
  • Create a python script example.py
  • Put bottle.py in the same folder
  • Copy-paste the Hello World example:
from bottle import route, run, template

@route('/hello/:name')
def index(name='World'):
    return template('Hello {{name}}!', name=name)

run(host='localhost', port=8080)
 


Bottle can even sit behind an Apache server as a WSGI app.
Why use Tk, GTK or Qt for your scripts' UI when you can have a lightweight, multiplatform web UI in four lines of code?
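
For the Apache/WSGI case, a minimal sketch of the entry point could look like the file below; the name app.wsgi and the path handling are assumptions about how mod_wsgi is configured:

# app.wsgi -- hypothetical entry point loaded by Apache's mod_wsgi
import os, sys
sys.path.insert(0, os.path.dirname(__file__))  # so the import finds bottle.py next to this file

import bottle
from bottle import route, template

@route('/hello/:name')
def index(name='World'):
    return template('Hello {{name}}!', name=name)

# mod_wsgi looks for a module-level callable named "application"
application = bottle.default_app()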


Tuesday 23 April 2013

How the MySQL query cache can kill your perf

query_cache_size is a tricky tricky variable to tune.

Here is how the query cache works, in four lines (a small script to watch it in action follows the list):
  • For every SELECT, MySQL looks into the cache for an entry; if one is found, it is returned immediately
  • If there is no entry for a query, the SELECT string is stored along with the list of tables involved and the result
  • For every UPDATE/INSERT/DELETE on a table, MySQL deletes from the cache every query that uses this table
  • While a thread is looking for something in the cache, nobody else can
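
A quick way to watch those four rules in action is to poll the Qcache counters (hits, inserts, prunes); here is a small MySQLdb sketch, credentials being placeholders:

import MySQLdb

conn = MySQLdb.connect('127.0.0.1', 'root', 'root')  # placeholder credentials
cursor = conn.cursor()

# Current cache configuration (size, type, limits)
cursor.execute("SHOW VARIABLES LIKE 'query_cache%'")
for name, value in cursor.fetchall():
    print("%s = %s" % (name, value))

# Hit/insert/prune counters: a low hit rate with lots of inserts and
# lowmem_prunes is a sign the cache is mostly churning
cursor.execute("SHOW GLOBAL STATUS LIKE 'Qcache%'")
for name, value in cursor.fetchall():
    print("%s = %s" % (name, value))

cursor.close()
conn.close()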

I have seen the cost of those cache lookups, when query_cache_size > 0, add around 15% more CPU, but it can be a LOT more than that.
In my case query_cache_size was set to 512MB and was not a problem for a month; then the system reached a critical point where very small queries were spending 20x more time waiting for their turn to look into the cache than executing the query itself, even when the query was in the cache.

The system is a SugarCRM database for a big company; between 9am and 5pm there are around 1M database operations per hour. In seconds, that is around 300 selects/s and 17 updates/s. The problem is that with 17 updates per second the cache is invalidated almost immediately from a front-end perspective.
I changed this value to 16MB yesterday and, even with more operations per hour, the server is using three times less CPU: 100% down to 30%!! The average response time of the front-end web pages dropped from 1500 ms to 300 ms; as a result, people spent less time waiting and more time working, which is why the number of operations rose that day.
I will continue to monitor the CPU use and the average response time, and next week I will try to turn the query cache off completely to see if there are any further gains.

Conclusions:
The MySQL query cache is really fantastic for a server running long SELECTs on a database that is not updated often, but even then it is usually better to do this caching in an application layer, with memcache for example.
The MySQL query cache is horrible on a high-concurrency system with a lot of updates; in that case, test with a small value (16MB for example) or with no cache at all.
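
For reference, shrinking or disabling the cache is a single GLOBAL variable change (it needs the SUPER privilege and is lost on restart unless you also update my.cnf); a minimal sketch:

import MySQLdb

conn = MySQLdb.connect('127.0.0.1', 'root', 'root')  # placeholder credentials, SUPER needed
cursor = conn.cursor()
cursor.execute("SET GLOBAL query_cache_size = 16777216")  # 16MB
# ...or starve it completely: cursor.execute("SET GLOBAL query_cache_size = 0")
cursor.close()
conn.close()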

Basic Data Profiling

As a DBA I am very often asked to provide profiling information from the database. The results need to be understandable by non-technical people.

The basic query:
SELECT COUNT(*) AS Nb, the_field AS PrettyName
FROM the_table
GROUP BY the_field;
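
A minimal sketch of turning that result into something a non-technical person can open (a CSV file), assuming MySQLdb under Python 2 and placeholder credentials and table/field names:

import csv
import MySQLdb

conn = MySQLdb.connect('127.0.0.1', 'root', 'root', 'the_database')  # placeholder credentials
cursor = conn.cursor()
cursor.execute("""
    SELECT COUNT(*) AS Nb, the_field AS PrettyName
    FROM the_table
    GROUP BY the_field
    ORDER BY Nb DESC
""")
with open('profiling.csv', 'wb') as f:  # 'wb' for the csv module under Python 2
    writer = csv.writer(f)
    writer.writerow(['Nb', 'PrettyName'])
    writer.writerows(cursor.fetchall())
cursor.close()
conn.close()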