Python Programming/Databases


Python has support for working with databases via a simple API. Modules included with Python include modules for SQLite and Berkeley DB. Modules for MySQL , PostgreSQL , FirebirdSQL and others are available as third-party modules. The latter have to be downloaded and installed before use. The package MySQLdb can be installed, for example, using the Debian package "python-mysqldb".

DBMS Specifics

edit

MySQL

edit

An Example with MySQL would look like this:

import MySQLdb
db = MySQLdb.connect("host machine", "dbuser", "password", "dbname")
cursor = db.cursor()
query = """SELECT * FROM sampletable"""
lines = cursor.execute(query)
data = cursor.fetchall()
db.close()

On the first line, the Module MySQLdb is imported. Then a connection to the database is set up and on line 4, we save the actual SQL statement to be executed in the variable query. On line 5 we execute the query and on line 6 we fetch all the data. After the execution of this piece of code, lines contains the number of lines fetched (e.g. the number of rows in the table sampletable). The variable data contains all the actual data, e.g. the content of sampletable. In the end, the connection to the database would be closed again. If the number of lines are large, it is better to use row = cursor.fetchone() and process the rows individually:

  #first 5 lines are the same as above
  while True:
    row = cursor.fetchone()
    if row == None: break
    #do something with this row of data
  db.close()

Obviously, some kind of data processing has to be used on a row, otherwise the data will not be stored. The result of the fetchone() command is a Tuple.

In order to make the initialization of the connection easier, a configuration file can be used:

import MySQLdb
db = MySQLdb.connect(read_default_file="~/.my.cnf")
...

Here, the file .my.cnf in the home directory contains the necessary configuration information for MySQL.

Sqlite

edit

An example with SQLite is very similar to the one above and the cursor provides many of the same functionalities.

import sqlite3
db = sqlite3.connect("/path/to/file")
cursor = db.cursor()
query = """SELECT * FROM sampletable"""
lines = cursor.execute(query)
data = cursor.fetchall()
db.close()

When writing to the db, one has to remember to call db.commit(), otherwise the changes are not saved:

import sqlite3
db = sqlite3.connect("/path/to/file")
cursor = db.cursor()
query = """INSERT INTO sampletable (value1, value2) VALUES (1,'test')"""
cursor.execute(query)
db.commit()
db.close()

Postgres

edit
import psycopg2
conn = psycopg2.connect("dbname=test")
cursor = conn.cursor()
cursor.execute("select * from test");
for i in cursor.next():
    print(i)
conn.close()

Firebird

edit
import firebirdsql
conn = firebirdsql.connect(dsn='localhost/3050:/var/lib/firebird/2.5/test.fdb', user='alice', password='wonderland')
cur = conn.cursor()
cur.execute("select * from baz")
for c in cur.fetchall():
    print(c)
conn.close()

General Principles

edit

Parameter Quoting

edit

You will frequently need to substitute dynamic data into a query string. It is important to ensure this is done correctly.

# Do not do this!
result = db.execute("SELECT name FROM employees WHERE location = '" + location + "'")

This example is wrong, because it doesn’t correctly deal with special characters, like apostrophes, in the string being substituted. If your code has to deal with potentially hostile users (like on a public-facing Web server), this could leave you open to an SQL injection attack.

For simple cases, use the automatic parameter substitution provided by the execute method, e.g.

result = db.execute("SELECT name FROM employees WHERE location = ?", [location])

The DBMS interface itself will automatically convert the values you pass into the correct SQL syntax.

For more complex cases, the DBMS module should provide a quoting function that you can explicitly call. For example, MySQLdb provides the escape_string method, while APSW (for SQLite3) provides format_sql_value. This is necessary where the query structure takes a more dynamic form:

criteria = [("company", company)] # list of tuples (fieldname, value)
if department != None :
    criteria.append(("department", department))
# ... append other optional criteria as appropriate ...

result = db.execute(
        "SELECT name FROM employees WHERE "
    +
        " and ".join(
            "%s = %s" % (criterion[0], MySQLdb.escape_string(criterion[1]))
            for criterion in criteria
          )
  )

This will dynamically construct queries like “select name from employees where company = 'some company'” or “select name from employees where company = 'some company' and department = 'some department'”, depending on which fields have been filled in by the user.

Use Iterators

edit

Python iterators are a natural fit for the problem of iterating over lots of database records. Here is an example of a function that performs a database query and returns an iterator for the results, instead of returning them all at once. It relies on the fact that, in APSW (the Python 3 interface library for SQLite), the cursor.execute method itself returns an iterator for the result records. The result is that you can write very concise code for doing complex database queries in Python.

def db_iter(db, cmd, mapfn = lambda x : x) :
    "executes cmd on a new cursor from connection db and yields the results in turn."
    cu = db.cursor()
    result = cu.execute(cmd)
    while True:
        yield mapfn(next(result))

Example uses of this function:

for artist, publisher in db_iter(
        db = db,
        cmd =
                "SELECT artist, publisher FROM artists WHERE location = %s"
            %
                 apsw.format_sql_value(location)
      ):
    print(artist, publisher)

and

for location in db_iter(
        db = db,
        cmd = "SELECT DISTINCT location FROM artists",
        mapfn = lambda x : x[0]
      ):
    print(location)

In the first example, since db_iter returns a tuple for each record, this can be directly assigned to individual variables for the record fields. In the second example, the tuple has only one element, so a custom mapfn is used to extract this element and return it instead of the tuple.

Never Use “SELECT *” in a Script

edit

Database table definitions are frequently subject to change. As application requirements evolve, fields and even entire tables are often added, or sometimes removed. Consider a statement like

result = db.execute("select * from employees")

You may happen to know that the employees table currently contains, say, 4 fields. But tomorrow someone may add a fifth field. Did you remember to update your code to deal with this? If not, it’s liable to crash. Or even worse, produce an incorrect result!

Better to always list the specific fields you’re interested in, no matter how many there are:

result = db.execute("select name, address, department, location from employees")

That way, any extra fields added will simply be ignored. And if any of the named fields are removed, the code will at least fail with a runtime error, which is a good reminder that you forgot to update it!

Looping on Field Breaks

edit

Consider the following scenario: your sales company database has a table of employees, and also a table of sales made by each employee. You want to loop over these sale entries, and produce some per-employee statistics. A naïve approach might be:

  • Query the database to get a list of employees
  • For each employee, do a database query to get the list of sales for each employee.

If you have a lot of employees, then the first query may produce a large list, and the second step will involve a correspondingly large number of database queries.

In fact, the entire processing loop can run off a single database query, using the standard SQL construct called a join.

Note:
SQL programming is a specialty skill in its own right. To learn more about this, start with the Wikipedia article.

Here is what an example of such a loop could look like:

rows = db_iter \
  (
    db = db,
    cmd =
        "select employees.name, sales.amount, sales.date from"
        " employees left join sales on employees.id = sales.employee_id"
        " order by employees.name, sales.date"
  )
prev_employee_name = None
while True:
    row = next(rows, None)
    if row != None :
        employee_name, amount, date = row
    if row == None or employee_name != prev_employee_name :
         if prev_employee_name != None :
              # done stats for this employee
              report(prev_employee_name, employee_stats)
         if row == None :
              break
         # start stats for a new employee
         prev_employee_name = employee_name
         employee_stats = {"total_sales" : 0, "number_of_sales" : 0}
         if date != None :
               employee_stats["earliest_sale"] = date
    # another row of stats for this employee
    if amount != None :
         employee_stats["total_sales"] += amount
         employee_stats["number_of_sales"] += 1
    if date != None :
         employee_stats["latest_sale"] = date

Here the statistics are quite simple: earliest and latest sale, and number and total amount of sales, and could be computed directly within the SQL query. But the same loop could compute more complex statistics (like standard deviation) that cannot be represented directly within a simple SQL query.

Note how the statistics for each employee are written out under either of two conditions:

  • The employee name of the next record is different from the previous one
  • The end of the query results has been reached.

Both conditions are tested with row == None or employee_name != prev_employee_name; after writing out the employee statistics, a separate check for the second condition row == None is used to terminate the loop. If the loop doesn’t terminate, then processing is initialized for the new employee.

Note also the use of a left join in this case: if an employee has had no sales, then the join will return a single row for that employee, with SQL null values (represented by None in Python) for the fields from the sales table. This is why we need checks for such None values before processing those fields.

Alternatively, we could have used an inner join, which would have returned no results for an employee with no sales. Whether you want to omit such an employee from your report, or include them with totals of zero, is really up to your application.

See Also

edit
edit