Databases Python
Databases
• As our data sets become larger and more complex, we need something that will:
• let us search for data in many different ways,
• control who can view and modify the data,
• and ensure that the data is correctly formatted.
• A relational database is a collection of tables, each of which has a
fixed number of columns and a variable number of rows.
• Each column in a table has a name and contains values of the same
data type, such as integer or string.
• Each row, or record, contains values that are related to each other,
such as a particular patient’s name, date of birth, and blood type.
• Superficially, each table looks like a spreadsheet or a file with one
record per line
• Many different brands of databases are available to choose from,
including commercial systems like Oracle, IBM’s DB2, and Microsoft
Access, and open source databases like MySQL and PostgreSQL.
• Our examples use one called SQLite.
• It is free, it is simple to use, and the Python standard library
includes a module called sqlite3 for working with it (the module has
shipped with Python since version 2.5).
• A database is usually stored in a file or in a collection of files. These
files aren’t formatted as plain text.
• We can interact with the database in one of two ways:
• By typing commands into a database GUI, just as you type commands into a
Python interpreter.
• By writing programs in Python (or some other language). These programs
import a library that knows how to work with the kind of database you are
using.
• Our programs all start with this line:
• >>> import sqlite3
• To put data into a database or to get information out, we’ll write
commands in a special-purpose language called SQL, which stands for
Structured Query Language.
• SQL is pronounced either as the three letters “S-Q-L” or as the word
“sequel.”
Creating and Populating
• We start by telling Python that we want to use sqlite3:
• >>> import sqlite3
• We must make a connection to our database by calling the database
module’s connect function. Its first argument is a string that
identifies the database to connect to.
• >>> con = sqlite3.connect('population.db')
• Once we have a connection, we need to get a cursor.
• Like the cursor in an editor, this keeps track of where we are in the
database so that if several programs are accessing the database at the
same time, the database can keep track of who is trying to do what:
• >>> cur = con.cursor()
• The first step is to create a database table to store the population
data.
• The general form of an SQL statement for table creation is as follows:
• CREATE TABLE «TableName»(«ColumnName» «Type», ...)
• The types of the data in each of the table’s columns are chosen from
the types the database supports:
• CREATE TABLE PopByRegion(Region TEXT, Population INTEGER)
• Now, we put that SQL statement in a string and pass it as an argument
to a Python method that will execute the SQL command:
• >>> cur.execute('CREATE TABLE PopByRegion(Region TEXT, Population INTEGER)')
• <sqlite3.Cursor object at 0x102e3e490>
• When method execute is called, it returns the cursor object that it
was called on.
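• The connect, cursor, and CREATE TABLE steps above can be tried end to end with a small sketch. It assumes an in-memory database (':memory:') instead of population.db so that nothing is written to disk, and it reads SQLite's built-in sqlite_master catalog to confirm the table exists.

```python
import sqlite3

# ':memory:' is an assumption made for this sketch: it creates a throwaway
# database in RAM instead of the population.db file used in these notes.
con = sqlite3.connect(':memory:')
cur = con.cursor()

# execute returns the same cursor object it was called on.
result = cur.execute('CREATE TABLE PopByRegion(Region TEXT, Population INTEGER)')
same_cursor = result is cur

# sqlite_master is SQLite's own catalog of the tables in a database.
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
tables = cur.fetchall()
print(same_cursor, tables)

con.close()
```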
• After we create a table, our next task is to insert data into it.
• We do this one record at a time using the INSERT command, whose
general form is as follows:
• INSERT INTO «TableName» VALUES(«Value», ...)
• >>> cur.execute('INSERT INTO PopByRegion VALUES("Central Africa", 330993)')
• <sqlite3.Cursor object at 0x102e3e490>
• Another format for the INSERT SQL command uses placeholders for
the values to be inserted.
• >>> cur.execute('INSERT INTO PopByRegion VALUES (?, ?)', ("Japan", 100562))
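• Placeholders also make it easy to insert many records at once. The sketch below assumes an in-memory database and uses the cursor's executemany method. The row containing an apostrophe is a made-up record, included only to show that placeholders handle awkward text safely.

```python
import sqlite3

con = sqlite3.connect(':memory:')  # in-memory database, assumed for this sketch
cur = con.cursor()
cur.execute('CREATE TABLE PopByRegion(Region TEXT, Population INTEGER)')

# Each tuple fills the two ? placeholders in order.
rows = [('Central Africa', 330993), ('Japan', 100562)]
cur.executemany('INSERT INTO PopByRegion VALUES (?, ?)', rows)

# Hypothetical record: the apostrophe would need escaping if we built the SQL
# string by hand, but a placeholder passes it through unchanged.
cur.execute('INSERT INTO PopByRegion VALUES (?, ?)', ("Cote d'Ivoire", 0))

cur.execute('SELECT COUNT(*) FROM PopByRegion')
count = cur.fetchone()[0]
print(count)

con.close()
```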
Saving Changes
• After inserting data, we must commit the changes using the
connection’s commit method:
• >>> con.commit()
• Committing to a database is like saving the changes made to a file in a
text editor.
• Until we do it, our changes are not actually stored and are not visible
to anyone else who is using the database at the same time.
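• The effect of commit can be observed with two connections to the same file. This is a sketch under stated assumptions: the database lives in a temporary folder, the names writer and reader are hypothetical, and the sqlite3 module is left in its default transaction mode.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'population.db')
    writer = sqlite3.connect(path)  # the program making changes
    reader = sqlite3.connect(path)  # someone else using the same database

    writer.execute('CREATE TABLE PopByRegion(Region TEXT, Population INTEGER)')
    writer.commit()

    # The insert is pending until the writer commits.
    writer.execute('INSERT INTO PopByRegion VALUES (?, ?)', ('Japan', 100562))
    before = reader.execute('SELECT COUNT(*) FROM PopByRegion').fetchall()[0][0]

    writer.commit()
    after = reader.execute('SELECT COUNT(*) FROM PopByRegion').fetchall()[0][0]
    print(before, after)

    # Close both connections before the temporary folder is removed.
    writer.close()
    reader.close()
```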
Retrieving Data
• We can run queries to search for data that meets specified criteria.
The general form of a query is as follows:
• SELECT «ColumnName», ... FROM «TableName»
• The TableName is the name of the table to get the data from, and the
column names specify which columns to get values from.
• >>> cur.execute('SELECT Region, Population FROM PopByRegion')
• We can access the results one record at a time by calling the cursor’s
fetchone method, just as we can read one line at a time from a file using
readline:
• >>> cur.fetchone()
('Central Africa', 330993)
• The fetchone method returns each record as a tuple whose elements are
in the order specified in the query.
• Database cursors have a fetchall method that returns all the data
produced by a query that has not yet been fetched as a list of tuples:
• >>> cur.fetchall()
[('Southeastern Africa', 743112), ('Northern Africa', 1037463), ('Southern Asia', 2051941),
('Asia Pacific', 785468), ('Middle East', 687630), ('Eastern Asia', 1362955),
('South America', 593121), ('Eastern Europe', 223427), ('North America', 661157),
('Western Europe', 387933), ('Japan', 100562)]
• Once all of the data produced by the query has been fetched, any subsequent
calls on fetchone and fetchall return None and the empty list, respectively:
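• The behavior of fetchone and fetchall, including what they return once the results are used up, can be checked with a short sketch. It assumes an in-memory database holding just two of the regions.

```python
import sqlite3

con = sqlite3.connect(':memory:')  # in-memory database, assumed for this sketch
cur = con.cursor()
cur.execute('CREATE TABLE PopByRegion(Region TEXT, Population INTEGER)')
cur.executemany('INSERT INTO PopByRegion VALUES (?, ?)',
                [('Central Africa', 330993), ('Japan', 100562)])

cur.execute('SELECT Region, Population FROM PopByRegion')
first = cur.fetchone()    # one record, as a tuple
rest = cur.fetchall()     # every record not yet fetched, as a list of tuples
nothing = cur.fetchone()  # None once the results are used up
empty = cur.fetchall()    # an empty list once the results are used up
print(first, rest, nothing, empty)

con.close()
```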
• >>> cur.fetchone()
• To sort the results, we add an ORDER BY clause. Adding DESC sorts in descending order:
• >>> cur.execute('SELECT Region, Population FROM PopByRegion ORDER BY Region')
• >>> cur.fetchall()
• >>> cur.execute('''SELECT Region, Population FROM PopByRegion
ORDER BY Population DESC''')
• <sqlite3.Cursor object at 0x102e3e490>
• >>> cur.fetchall()
• >>> cur.execute('SELECT Region FROM PopByRegion')
• <sqlite3.Cursor object at 0x102e3e490>
• >>> cur.fetchall()
• >>> cur.execute('SELECT * FROM PopByRegion')
• <sqlite3.Cursor object at 0x102e3e490>
• >>> cur.fetchall()
• [('Central Africa',
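• The sorting and column-selection queries in this section can be compared side by side. The sketch below assumes an in-memory database with three of the regions. It also reads the cursor's description attribute, whose entries begin with the column name, to show which columns SELECT * returns.

```python
import sqlite3

con = sqlite3.connect(':memory:')  # in-memory database, assumed for this sketch
cur = con.cursor()
cur.execute('CREATE TABLE PopByRegion(Region TEXT, Population INTEGER)')
cur.executemany('INSERT INTO PopByRegion VALUES (?, ?)',
                [('Middle East', 687630), ('Japan', 100562),
                 ('Eastern Asia', 1362955)])

# Alphabetical order by region name.
cur.execute('SELECT Region FROM PopByRegion ORDER BY Region')
by_name = [row[0] for row in cur.fetchall()]

# Largest population first.
cur.execute('SELECT Region FROM PopByRegion ORDER BY Population DESC')
by_size = [row[0] for row in cur.fetchall()]

# * selects every column, in the order the table defines them.
cur.execute('SELECT * FROM PopByRegion')
columns = [entry[0] for entry in cur.description]
print(by_name, by_size, columns)

con.close()
```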
Query Conditions
• We can select a subset of the data by using the keyword WHERE to
specify conditions that the rows we want must satisfy.
• >>> cur.execute('SELECT Region FROM PopByRegion WHERE Population > 1000000')
• <sqlite3.Cursor object at 0x102e3e490>
• >>> cur.fetchall()
• [('Northern Africa',), ('Southern Asia',), ('Eastern Asia',)]
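• These are the relational operators that may be used with WHERE: = (equal to), != (not equal to), > (greater than), < (less than), >= (greater than or equal to), and <= (less than or equal to).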
• As well as these relational operators, we can also use the AND, OR, and
NOT operators.
• >>> cur.execute('''SELECT Region FROM PopByRegion
WHERE Population > 1000000 AND Region < "L"''')
• <sqlite3.Cursor object at 0x102e3e490>
• >>> cur.fetchall()
• [('Eastern Asia',)]
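• Conditions combine well with placeholders, so the threshold does not have to be written into the SQL text. A minimal sketch, assuming an in-memory database with five of the regions and a hypothetical variable named limit:

```python
import sqlite3

con = sqlite3.connect(':memory:')  # in-memory database, assumed for this sketch
cur = con.cursor()
cur.execute('CREATE TABLE PopByRegion(Region TEXT, Population INTEGER)')
cur.executemany('INSERT INTO PopByRegion VALUES (?, ?)',
                [('Northern Africa', 1037463), ('Southern Asia', 2051941),
                 ('Eastern Asia', 1362955), ('Japan', 100562),
                 ('Middle East', 687630)])

limit = 1000000  # hypothetical threshold supplied by the program

# AND: both conditions must hold. Text is compared alphabetically.
cur.execute('''SELECT Region FROM PopByRegion
               WHERE Population > ? AND Region < ?
               ORDER BY Region''', (limit, 'L'))
big_and_early = cur.fetchall()

# NOT: reverses a condition.
cur.execute('''SELECT Region FROM PopByRegion
               WHERE NOT (Population > ?)
               ORDER BY Region''', (limit,))
not_big = cur.fetchall()
print(big_and_early, not_big)

con.close()
```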
Updating and Deleting
• Data often changes over time, so we need to be able to change the
information stored in databases.
• To do that, we can use the UPDATE command, as shown in the
following code:
• >>> cur.execute('SELECT * FROM PopByRegion WHERE Region = "Japan"')
• <sqlite3.Cursor object at 0x102e3e490>
• >>> cur.fetchone()
• ('Japan', 100562)
• >>> cur.execute('''UPDATE PopByRegion SET Population = 100600
WHERE Region = "Japan"''')
• <sqlite3.Cursor object at 0x102e3e490>
• >>> cur.execute('SELECT * FROM PopByRegion WHERE Region = "Japan"')
• <sqlite3.Cursor object at 0x102e3e490>
• >>> cur.fetchone()
• ('Japan', 100600)
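• After an UPDATE it is often useful to know how many records were changed. The sketch below assumes an in-memory database and reads the cursor's rowcount attribute. The values are passed as placeholders rather than written into the SQL.

```python
import sqlite3

con = sqlite3.connect(':memory:')  # in-memory database, assumed for this sketch
cur = con.cursor()
cur.execute('CREATE TABLE PopByRegion(Region TEXT, Population INTEGER)')
cur.executemany('INSERT INTO PopByRegion VALUES (?, ?)',
                [('Japan', 100562), ('Middle East', 687630)])

# Only rows matching the WHERE clause are modified.
cur.execute('UPDATE PopByRegion SET Population = ? WHERE Region = ?',
            (100600, 'Japan'))
changed = cur.rowcount  # number of rows the UPDATE modified

cur.execute('SELECT * FROM PopByRegion ORDER BY Region')
after_update = cur.fetchall()
print(changed, after_update)

con.commit()
con.close()
```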
DELETE
• We can also delete records from the database:
• >>> cur.execute('DELETE FROM PopByRegion WHERE Region < "L"')
<sqlite3.Cursor object at 0x102e3e490>
• >>> cur.execute('SELECT * FROM PopByRegion')
<sqlite3.Cursor object at 0x102e3e490>
• >>> cur.fetchall()
[('Southeastern Africa', 743112), ('Northern Africa', 1037463),
('Southern Asia', 2051941), ('Middle East', 687630), ('South America',
593121), ('North America', 661157), ('Western Europe', 387933)]
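• The same deletion can be sketched on a smaller table. This assumes an in-memory database with four of the regions. Note that a DELETE with no WHERE clause removes every record in the table.

```python
import sqlite3

con = sqlite3.connect(':memory:')  # in-memory database, assumed for this sketch
cur = con.cursor()
cur.execute('CREATE TABLE PopByRegion(Region TEXT, Population INTEGER)')
cur.executemany('INSERT INTO PopByRegion VALUES (?, ?)',
                [('Central Africa', 330993), ('Japan', 100562),
                 ('Middle East', 687630), ('North America', 661157)])

# Remove every region whose name sorts before "L".
cur.execute('DELETE FROM PopByRegion WHERE Region < ?', ('L',))
removed = cur.rowcount  # number of rows the DELETE removed

cur.execute('SELECT Region FROM PopByRegion ORDER BY Region')
remaining = cur.fetchall()
print(removed, remaining)

con.close()
```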
Drop table
• To remove an entire table from the database, we can use the DROP
command:
• DROP TABLE «TableName»
• For example, if we no longer want the table PopByRegion, we would
execute this:
• >>> cur.execute('DROP TABLE PopByRegion')
• When a table is dropped, all the data it contained is lost.
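• What happens after a table is dropped can be sketched as follows, assuming an in-memory database. Querying the missing table raises sqlite3.OperationalError, and the optional IF EXISTS clause lets a second DROP run without error.

```python
import sqlite3

con = sqlite3.connect(':memory:')  # in-memory database, assumed for this sketch
cur = con.cursor()
cur.execute('CREATE TABLE PopByRegion(Region TEXT, Population INTEGER)')
cur.execute('DROP TABLE PopByRegion')

# The table and all of its data are gone, so a query now fails.
try:
    cur.execute('SELECT * FROM PopByRegion')
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)
print(error)

# IF EXISTS turns the statement into a no-op when the table is already gone.
cur.execute('DROP TABLE IF EXISTS PopByRegion')

con.close()
```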
Using NULL for Missing Data
• We often don’t have all the data we want.
• Rather than leave an incomplete record out, we may choose to insert it
and use the value NULL to represent the missing values.
• For example, if there is a region whose population we don’t know, we
could insert this into our database:
• >>> cur.execute('INSERT INTO PopByRegion VALUES ("Mars", NULL)')
• On the other hand, we probably don’t ever want a record in the
database that has a NULL region name. We can prevent this from ever
happening by stating that the column is NOT NULL when the table is
created:
• >>> cur.execute('CREATE TABLE Test (Region TEXT NOT NULL, '
'Population INTEGER)')
• Now when we try to insert a NULL region into our new Test table, we
get an error message:
• >>> cur.execute('INSERT INTO Test VALUES (NULL, 456789)')
Traceback (most recent call last):
  File "<pyshell#45>", line 1, in <module>
    cur.execute('INSERT INTO Test VALUES (NULL, 456789)')
sqlite3.IntegrityError: Test.Region may not be NULL
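• From Python, NULL corresponds to None. The sketch below assumes an in-memory database. It inserts None through a placeholder, finds the record with IS NULL (a comparison written as = NULL never matches), and catches the error raised by the NOT NULL column. The exact error text varies between SQLite versions, so the sketch does not rely on it.

```python
import sqlite3

con = sqlite3.connect(':memory:')  # in-memory database, assumed for this sketch
cur = con.cursor()
cur.execute('CREATE TABLE Test (Region TEXT NOT NULL, Population INTEGER)')

# None is stored as NULL and comes back as None.
cur.execute('INSERT INTO Test VALUES (?, ?)', ('Mars', None))
cur.execute('SELECT Population FROM Test WHERE Region = ?', ('Mars',))
mars = cur.fetchone()

# IS NULL is the test for missing values.
cur.execute('SELECT Region FROM Test WHERE Population IS NULL')
unknown = cur.fetchall()

# The NOT NULL constraint rejects a record with no region name.
try:
    cur.execute('INSERT INTO Test VALUES (?, ?)', (None, 456789))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(mars, unknown, rejected)

con.close()
```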
Using Joins to Combine Tables
• When designing a database, it often makes sense to divide data
between two or more tables.
• We could store all of this in one table, but then a lot of information
would be needlessly duplicated.
• If we divide information between tables, though, we need some way
to pull that information back together.
• The right way to do this in a relational database is to use a join.
• As the name suggests, a join combines information from two or more
tables to create a new set of records, each of which can contain some
or all of the information in the tables involved.
• Several types of joins exist, including inner joins and self-joins.
• We’ll begin with inner joins, which involve the following:
• 1. Constructing the cross product of the tables
• 2. Discarding rows that do not meet the selection criteria
• 3. Selecting columns from the remaining rows
• >>> cur.execute('''SELECT PopByRegion.Region, PopByCountry.Country
FROM PopByRegion INNER JOIN PopByCountry
WHERE (PopByRegion.Region = PopByCountry.Region)
AND (PopByRegion.Population > 1000000)
''')
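• The three steps of an inner join can be traced on two tiny tables. This is a sketch under stated assumptions: the database is in memory, and the three country rows are assumed sample values chosen for illustration. The last query writes the matching condition in an ON clause, a common alternative that returns the same rows here.

```python
import sqlite3

con = sqlite3.connect(':memory:')  # in-memory database, assumed for this sketch
cur = con.cursor()
cur.execute('CREATE TABLE PopByRegion(Region TEXT, Population INTEGER)')
cur.execute('CREATE TABLE PopByCountry(Region TEXT, Country TEXT, Population INTEGER)')
cur.executemany('INSERT INTO PopByRegion VALUES (?, ?)',
                [('Eastern Asia', 1362955), ('North America', 661157)])
# Assumed sample rows, for illustration only.
cur.executemany('INSERT INTO PopByCountry VALUES (?, ?, ?)',
                [('Eastern Asia', 'Mongolia', 3407),
                 ('North America', 'Canada', 40876),
                 ('North America', 'Greenland', 43)])

# Step 1: with no condition, the join is the full cross product (2 x 3 rows).
cur.execute('SELECT COUNT(*) FROM PopByRegion INNER JOIN PopByCountry')
cross_size = cur.fetchone()[0]

# Steps 2 and 3: discard non-matching rows, then keep two columns.
cur.execute('''SELECT PopByRegion.Region, PopByCountry.Country
               FROM PopByRegion INNER JOIN PopByCountry
               WHERE (PopByRegion.Region = PopByCountry.Region)
               AND (PopByRegion.Population > 1000000)''')
with_where = cur.fetchall()

# Same result with the matching condition in an ON clause.
cur.execute('''SELECT PopByRegion.Region, PopByCountry.Country
               FROM PopByRegion INNER JOIN PopByCountry
               ON PopByRegion.Region = PopByCountry.Region
               WHERE PopByRegion.Population > 1000000''')
with_on = cur.fetchall()
print(cross_size, with_where, with_on)

con.close()
```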
Removing Duplicates
• To remove the duplicates, we add the keyword DISTINCT to the query:
• >>> cur.execute('''
SELECT DISTINCT PopByRegion.Region
FROM PopByRegion INNER JOIN PopByCountry
WHERE (PopByRegion.Region = PopByCountry.Region)
AND ((PopByCountry.Population * 1.0) / PopByRegion.Population >
0.10)''')
>>> cur.fetchall()
[('Eastern Asia',), ('North America',)]
• For comparison, the same query without DISTINCT returns North America twice, once for each matching country:
• >>> cur.execute('''
SELECT PopByRegion.Region
FROM PopByRegion INNER JOIN PopByCountry
WHERE (PopByRegion.Region = PopByCountry.Region)
AND ((PopByCountry.Population * 1.0) / PopByRegion.Population >
0.10)''')
• <sqlite3.Cursor object at 0x102e3e490>
• >>> cur.fetchall()
• [('Eastern Asia',), ('North America',), ('North America',)]
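• The effect of DISTINCT can be reproduced on a small data set. The sketch assumes an in-memory database, and the four country rows are assumed sample values. Multiplying by 1.0 matters because SQLite divides two integers as integers, which would make every ratio here 0.

```python
import sqlite3

con = sqlite3.connect(':memory:')  # in-memory database, assumed for this sketch
cur = con.cursor()
cur.execute('CREATE TABLE PopByRegion(Region TEXT, Population INTEGER)')
cur.execute('CREATE TABLE PopByCountry(Region TEXT, Country TEXT, Population INTEGER)')
cur.executemany('INSERT INTO PopByRegion VALUES (?, ?)',
                [('Eastern Asia', 1362955), ('North America', 661157)])
# Assumed sample rows, for illustration only.
cur.executemany('INSERT INTO PopByCountry VALUES (?, ?, ?)',
                [('Eastern Asia', 'China', 1285238),
                 ('North America', 'Mexico', 126875),
                 ('North America', 'United States', 493038),
                 ('North America', 'Canada', 40876)])

# {distinct} is filled in with either '' or 'DISTINCT' below.
query = '''SELECT {distinct} PopByRegion.Region
           FROM PopByRegion INNER JOIN PopByCountry
           WHERE (PopByRegion.Region = PopByCountry.Region)
           AND ((PopByCountry.Population * 1.0) / PopByRegion.Population > 0.10)
           ORDER BY PopByRegion.Region'''

cur.execute(query.format(distinct=''))
with_duplicates = cur.fetchall()

cur.execute(query.format(distinct='DISTINCT'))
unique = cur.fetchall()
print(with_duplicates, unique)

con.close()
```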
Keys and Constraints
• A column (or set of columns) whose values are used to identify the rows of a table is called a key.
• Ideally, a key’s values should be unique, just like the keys in a dictionary.
• We can tell the database to enforce this constraint by adding a PRIMARY
KEY clause when we create the table.
• >>> cur.execute('''CREATE TABLE PopByRegion (
Region TEXT NOT NULL,
Population INTEGER NOT NULL,
PRIMARY KEY (Region))''')
• The primary key for a database table can consist of multiple columns.
• The following code uses the CONSTRAINT keyword to specify that no two
entries in the table being created will ever have the same values for region
and country:
• >>> cur.execute('''
CREATE TABLE PopByCountry(
Region TEXT NOT NULL,
Country TEXT NOT NULL,
Population INTEGER NOT NULL,
CONSTRAINT CountryKey PRIMARY KEY (Region, Country))''')
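• To see the constraint being enforced, we can try to insert the same region and country twice. A minimal sketch, assuming an in-memory database and assumed sample rows. The second Canada record is deliberately invalid.

```python
import sqlite3

con = sqlite3.connect(':memory:')  # in-memory database, assumed for this sketch
cur = con.cursor()
cur.execute('''CREATE TABLE PopByCountry(
    Region TEXT NOT NULL,
    Country TEXT NOT NULL,
    Population INTEGER NOT NULL,
    CONSTRAINT CountryKey PRIMARY KEY (Region, Country))''')

insert = 'INSERT INTO PopByCountry VALUES (?, ?, ?)'
# Assumed sample rows: same region, different countries, so both are accepted.
cur.execute(insert, ('North America', 'Canada', 40876))
cur.execute(insert, ('North America', 'Greenland', 43))

# Repeating a (Region, Country) pair violates the primary key.
try:
    cur.execute(insert, ('North America', 'Canada', 1))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

cur.execute('SELECT COUNT(*) FROM PopByCountry')
stored = cur.fetchone()[0]
print(duplicate_rejected, stored)

con.close()
```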
Advanced Features
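• Aggregation
• SQL provides several aggregate functions, such as SUM, AVG, MIN, MAX,
and COUNT, which combine the values in a column into a single result.
• For example, we can add up the values in PopByRegion’s Population
column using the SQL aggregate function SUM: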
• >>> cur.execute('SELECT SUM (Population) FROM PopByRegion')
• <sqlite3.Cursor object at 0x102e3e490>
• >>> cur.fetchone()
• (8965762,)
• Grouping
• >>> cur.execute('SELECT SUM (Population) FROM PopByCountry GROUP BY Region')
• <sqlite3.Cursor object at 0x102e3e490>
• >>> cur.fetchall()
• [(1364389,), (661200,)]
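• Grouped totals are easier to read when the grouping column is selected as well. The sketch below assumes an in-memory database, and the four country rows are assumed sample values. It compares one overall SUM with one SUM per region.

```python
import sqlite3

con = sqlite3.connect(':memory:')  # in-memory database, assumed for this sketch
cur = con.cursor()
cur.execute('CREATE TABLE PopByCountry(Region TEXT, Country TEXT, Population INTEGER)')
# Assumed sample rows, for illustration only.
cur.executemany('INSERT INTO PopByCountry VALUES (?, ?, ?)',
                [('Eastern Asia', 'Mongolia', 3407),
                 ('Eastern Asia', 'Taiwan', 1433),
                 ('North America', 'Canada', 40876),
                 ('North America', 'Greenland', 43)])

# Without GROUP BY, SUM collapses the whole column into one value.
cur.execute('SELECT SUM(Population) FROM PopByCountry')
overall = cur.fetchone()[0]

# With GROUP BY, SUM is computed once per distinct Region.
cur.execute('''SELECT Region, SUM(Population) FROM PopByCountry
               GROUP BY Region ORDER BY Region''')
per_region = cur.fetchall()
print(overall, per_region)

con.close()
```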
Self joins
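• Suppose we want to find pairs of countries whose populations are within 1,000 of each other. A first attempt compares the Population column with itself, so the difference is always 0 and every country matches:
• >>> cur.execute('''SELECT Country FROM PopByCountry
WHERE (ABS(Population - Population) < 1000)''')
• <sqlite3.Cursor object at 0x102e3e490>
• >>> cur.fetchall()
[('China',), ('DPR Korea',), ('Hong Kong (China)',), ('Mongolia',),
('Republic of Korea',), ('Taiwan',), ('Bahamas',), ('Canada',),
('Greenland',), ('Mexico',), ('United States',)]
• Instead, we need to join PopByCountry with itself using an INNER JOIN: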
• >>> cur.execute('''
SELECT A.Country, B.Country
FROM PopByCountry A INNER JOIN PopByCountry B
WHERE (ABS(A.Population - B.Population) <= 1000)
AND (A.Country != B.Country)''')
• <sqlite3.Cursor object at 0x102e3e490>
• >>> cur.fetchall()
[('Republic of Korea', 'Canada'), ('Bahamas', 'Greenland'), ('Canada',
'Republic of Korea'), ('Greenland', 'Bahamas')]
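• Each pair in that result appears twice, once in each order. One possible way to keep a single copy of each pair is to require A.Country < B.Country instead of A.Country != B.Country. The sketch assumes an in-memory database and a trimmed two-column PopByCountry table with assumed sample values.

```python
import sqlite3

con = sqlite3.connect(':memory:')  # in-memory database, assumed for this sketch
cur = con.cursor()
# Trimmed table (no Region column) with assumed sample rows.
cur.execute('CREATE TABLE PopByCountry(Country TEXT, Population INTEGER)')
cur.executemany('INSERT INTO PopByCountry VALUES (?, ?)',
                [('Republic of Korea', 41491), ('Canada', 40876),
                 ('Bahamas', 368), ('Greenland', 43), ('Mongolia', 3407)])

# A and B are two names for the same table, so rows can be compared in pairs.
cur.execute('''SELECT A.Country, B.Country
               FROM PopByCountry A INNER JOIN PopByCountry B
               WHERE (ABS(A.Population - B.Population) <= 1000)
               AND (A.Country != B.Country)''')
mirrored = cur.fetchall()  # every pair appears in both orders

# Keeping only the alphabetically ordered pairs leaves one copy of each.
cur.execute('''SELECT A.Country, B.Country
               FROM PopByCountry A INNER JOIN PopByCountry B
               WHERE (ABS(A.Population - B.Population) <= 1000)
               AND (A.Country < B.Country)
               ORDER BY A.Country''')
pairs = cur.fetchall()
print(len(mirrored), pairs)

con.close()
```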