Lesson 10: Working With External Data Files
Introduction to Python Programming
Lesson Overview
➢
Python File Basics
➢
More File Operations
➢
Pickle and Shelve
➢
Parsing CSV Files With Python’s
Built-in CSV Library
➢
Parsing CSV Files With the pandas
Library
Introduction
➢
Imagine if you
decided to write a
program to keep
track of orders for
your company.
➢
You'd probably
need some kind of
permanent record
of those
transactions, right?
➢
In this lesson, you'll learn how to open data files, read from them,
write to them, and close them.
➢
Then you’ll take a look at file elements that are more advanced, like
creating random access files with Python's pickle and shelve modules.
Python File Basics: Opening a Data File
➢
To open a file, you use the
open( ) function, provide the
name of the file, and then
specify whether you'll be
reading from or writing to
the file.
➢
Note that the open( )
function returns a file object
that you'll store in a
variable to be used in your
output statements.
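As a minimal sketch of the call described above (the filename mydata.txt and the variable name out_file follow the names used elsewhere in this lesson; the slide's own code is not reproduced here), opening a file for writing might look like this:

```python
# 'w' opens the file for writing: it is created if missing, emptied if it exists
out_file = open("mydata.txt", "w")

print(out_file.name)   # the filename that was passed to open( )
print(out_file.mode)   # the mode the file was opened in

out_file.close()       # release the file when you are done with it
```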
Python File Basics: Where is the Data
File Saved?
➢
The save location for the data file depends on how
you create it.
➢
When the above line of code is executed directly
at the interpreter's prompt, the new file
(mydata.txt) is saved in the same location as your
Python executable program (python.exe) when
running from the IDLE prompt.
➢
However, when running using Run as Module, it’s
saved in the same location as your source code file.
➢
And if you save all your Python statements into their own source code file,
when you run the program, the new file is saved in the same directory as
your source code file.
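What changes between these cases is the current working directory, the folder Python resolves a bare filename against. A quick way to check where a file would land, sketched with the standard os module:

```python
import os

# The folder that relative filenames are resolved against
print(os.getcwd())

# The full path that open("mydata.txt", "w") would create the file at
print(os.path.abspath("mydata.txt"))
```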
➢
So if you want to be sure of your file's location,
you can simply list the full path instead of just a
filename in your open statement.
➢
For example, to save this file on your Desktop, give open( ) the
Desktop folder's full path followed by the filename.
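One possible way to build such a full path is sketched below. It assumes the folder is named "Desktop" and sits directly in your home directory, which is not true on every system, so the sketch falls back to the current folder when no such folder exists.

```python
from pathlib import Path

# Assumption: the Desktop folder is <home>/Desktop on this machine
desktop = Path.home() / "Desktop"
target_dir = desktop if desktop.is_dir() else Path.cwd()   # fallback folder

full_path = (target_dir / "mydata.txt").resolve()
out_file = open(full_path, "w")          # open( ) accepts a full path
out_file.write("saved with a full path\n")
out_file.close()

print(full_path)
```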
Python File Basics: Writing to a Data
File
➢
There are two functions that you can use to write the data: write( )
and writelines( ).
➢
The write( ) function writes a single string to a
file. writelines( ) writes a list of strings.
➢
But be aware that neither function adds line breaks for you: not at the
end of a single string, and not between the strings in a list.
➢
If you issue a command to write data to a file,
the data might not immediately appear there.
➢
This is because file access is a time-consuming operation.
Therefore, the computer might hold the data in memory and wait for more
before actually writing it to the file.
➢
If you want to force the data to be written
immediately, you can use the flush( ) function.
Here's an example of the code:
out_file.flush( )
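A small sketch of flush( ) in context, reusing the mydata.txt name from earlier. Checking the file size with os.path.getsize( ) is only an illustrative way to confirm the data has left Python's buffer.

```python
import os

out_file = open("mydata.txt", "w")
out_file.write("Saturday")

out_file.flush()                                   # write buffered data now
size_after_flush = os.path.getsize("mydata.txt")   # size on disk, in bytes
print(size_after_flush)

out_file.close()
```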
➢
The last important part in this cycle is to close
the file. For that, you'll use the close( ) function.
For example, the following code closes my
out_file object:
out_file.close()
➢
There is no output because instead of calling
the print() function we wrote the data into the
file.
Python File Basics: Adding Line Breaks
➢
You include a newline character, \n, every time you want
the text that follows to start on the next line in
a file.
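For illustration, here is a short sketch that writes two lines using \n. The file is read back at the end only to show what was stored; reading is covered later in the lesson.

```python
out_file = open("mydata.txt", "w")
out_file.write("Saturday\n")   # \n ends the first line
out_file.write("Sunday\n")     # \n ends the second line
out_file.close()

in_file = open("mydata.txt", "r")
contents = in_file.read()
in_file.close()
print(contents)
```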
Python File Basics: The writelines()
Function
➢
This function will also
enable you to write
content to a file, but
instead of passing a
single value, you’ll need
to pass it some type of
collection of values, like
a list you created in the
previous lesson.
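A minimal sketch of writelines( ) with a list. The list name days is a hypothetical example; note that each string carries its own \n, because writelines( ) does not add line breaks.

```python
days = ["Saturday\n", "Sunday\n"]   # hypothetical sample list

out_file = open("mydata.txt", "w")
out_file.writelines(days)           # writes every string in the list, in order
out_file.close()

in_file = open("mydata.txt", "r")
written = in_file.read()
in_file.close()
print(written)
```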
Python File Basics: Reading from Files
with read()
➢
To read a file, instead of opening it with 'w',
you open it with 'r'.
➢
The variable names out_file and in_file differ
only by a convention adopted by many Python
programmers.
➢
Once the file is opened, there are three functions
you can use to read from a file: read( ),
readline( ), and readlines( ).
➢
Although each function will read from the file,
each works a little differently so you’ll learn about
each one separately.
Python File Basics: The read( ) Function
➢
When using the read( ) function, you can provide
a maximum number of characters to be read in
(a number of bytes, if the file is opened in
binary mode).
➢
However, Python also
allows you to leave the
parentheses empty.
When you do, the rest of
the data from the file will
be read.
In the example:
➢
The first line of code will read in (at the most) 1 byte of data from
the file (that is, one character) and store the result in a variable
named first.
➢
The second line of code, with the parentheses left empty, will read in
the rest of the file and store it in a variable named second.
➢
A pointer keeps track of what has been read in, and this pointer
advances after each read.
➢
With a file that contains "SaturdaySundaySaturdaySunday", my two lines
of read code would result in "S" being printed first and
"aturdaySundaySaturdaySunday" being printed next.
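The two reads described above can be sketched as follows. The file contents are inferred from the printed result quoted on the slide; the first three lines simply create that file so the sketch runs on its own.

```python
out_file = open("mydata.txt", "w")
out_file.write("SaturdaySundaySaturdaySunday")
out_file.close()

in_file = open("mydata.txt", "r")
first = in_file.read(1)    # at most one character
second = in_file.read()    # everything after the current position
in_file.close()

print(first)    # S
print(second)   # aturdaySundaySaturdaySunday
```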
Python File Basics: The readline( )
Function
➢
The readline( ) function
reads an entire line of data
from a file if the
parentheses are empty, and
optionally accepts a
maximum number of bytes
in the parentheses.
➢
You might want to use readline( ) for entire
lines and read( ) for certain numbers of
characters in your code; it can help make
your code a little easier to understand.
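A short sketch of readline( ) with and without a maximum, using a two-line sample file created on the spot:

```python
out_file = open("mydata.txt", "w")
out_file.write("Saturday\nSunday\n")
out_file.close()

in_file = open("mydata.txt", "r")
whole_line = in_file.readline()    # the entire first line, newline included
part = in_file.readline(3)         # at most 3 characters of the next line
rest = in_file.readline()          # whatever is left of that line
in_file.close()

print(repr(whole_line), repr(part), repr(rest))
```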
Python File Basics: The readlines()
Function
➢
Like the other two,
readlines( ) gives you the
ability to provide a
maximum number of
bytes to be read.
➢
However, if you don't
provide a number of
bytes, it'll read to the end
of the file, not just to the
end of the line.
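As a brief illustration, readline( ) returns one string while readlines( ) returns a list with one string per line:

```python
out_file = open("mydata.txt", "w")
out_file.write("Saturday\nSunday\n")
out_file.close()

in_file = open("mydata.txt", "r")
lines = in_file.readlines()   # reads to the end of the file
in_file.close()

print(lines)
for line in lines:
    print(line.rstrip("\n"))  # strip the newline before printing
```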
More File Operations: Appending to an
Existing File
➢
If you want a file that logs information about
the user each time they work with your
program, without erasing the previous data
each time, you'll need to open the file in
append mode with the 'a' argument.
➢
This opens that same data
file you were using before,
but this time it'll keep all the
existing data and add in the
new data at the end of the
file.
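A minimal sketch of append mode. The filename log.txt and the logged text are hypothetical; the first open with 'w' only resets the file so the example gives the same result every time it runs.

```python
log = open("log.txt", "w")      # start from a known state
log.write("first run\n")
log.close()

log = open("log.txt", "a")      # 'a' keeps existing data, writes at the end
log.write("second run\n")
log.close()

in_file = open("log.txt", "r")
history = in_file.read()
in_file.close()
print(history)
```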
More File Operations: Other Options
when Opening Files
➢
As you have seen, opening a file for output
means you can only write to it, and opening a file
for input means you can only read from it.
➢
Although you can always just close a file and then reopen it in the
other mode, that extra set of steps can be a hassle.
➢
For this reason, Python provides two other ways
of opening our files: 'r+' and 'w+'.
➢
While both ways give you the ability to both
read and write your files, there is a difference.
➢
If you attempt to open your file with 'r+' and
that file doesn't exist, then Python will raise
a FileNotFoundError exception (a kind of OSError, which older code
calls IOError), and your program will stop unless it handles the
exception.
➢
That’s because you can’t read from a file that doesn’t exist.
➢
On the other hand, if you open a nonexistent file
with 'w+', then Python will simply create one for
you.
➢
Note, however, that if that file did exist, then its data would be erased,
just as if you had opened the file with 'w'.
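The difference between the two modes can be sketched like this. The filename missing.txt is hypothetical, and the sketch deletes it first so the 'r+' attempt really meets a nonexistent file.

```python
import os

if os.path.exists("missing.txt"):
    os.remove("missing.txt")          # make sure the file is not there

try:
    f = open("missing.txt", "r+")     # needs an existing file
except FileNotFoundError as err:
    opened_with_r_plus = False
    print("r+ failed:", err)
else:
    opened_with_r_plus = True
    f.close()

f = open("missing.txt", "w+")         # creates the file (or empties it)
f.close()
created_with_w_plus = os.path.exists("missing.txt")
print(created_with_w_plus)
```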
More File Operations: The tell( ) Function
➢
To read and write to the same
file at the same time you can
still use the read( ) and
write( ) functions that you
learned earlier.
➢
The only difference is that now you have to
keep track of your current position in the
file as you do your reading and writing.
➢
You can always find out where
you are in your file with the
tell( ) function. This will give
you your current file position
as the number of bytes from
the start of the file.
➢
One thing to realize is that
when you read in an entire
line of text, the newline
character is also read in.
➢
And while we call this a character, on Windows it's actually stored as
two characters inside the file (a carriage return followed by a line
feed).
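A small sketch of tell( ) before and after reading a line. The position reported after the first line depends on how the platform stores the newline, so no exact value is claimed for it here.

```python
out_file = open("mydata.txt", "w")
out_file.write("Saturday\nSunday\n")
out_file.close()

in_file = open("mydata.txt", "r")
start = in_file.tell()         # 0: nothing has been read yet
line = in_file.readline()      # "Saturday\n", newline included
after_line = in_file.tell()    # position just past the first line
in_file.close()

print(start, repr(line), after_line)
```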
More File Operations: The seek()
Function
➢
Of course, if there's a way
to determine where you're
located in a file, there's
also a way to change that
location. You can do this
with the seek( ) function.
➢
When you use seek( ), you
must provide where in the
file you want to move by
specifying a number that
represents the number of
bytes from the beginning
of the file.
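As an illustration, the sketch below jumps to a position with seek( ) and reads from there. The sample text has no newlines, so byte positions and character positions coincide.

```python
out_file = open("mydata.txt", "w")
out_file.write("SaturdaySunday")
out_file.close()

in_file = open("mydata.txt", "r")
in_file.seek(8)                  # 8 bytes from the beginning of the file
from_eight = in_file.read()      # reads "Sunday"
in_file.seek(0)                  # back to the start
first_eight = in_file.read(8)    # reads "Saturday"
in_file.close()

print(from_eight, first_eight)
```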
More File Operations: Reading and
Writing at the Same Time
➢
Now that you know how to move around inside a file, you're ready to
start reading and writing to a file at the same time.
➢
Be aware that when you're writing to a file that already has data in
it, you're going to be overwriting its characters; existing characters
don’t automatically move over to accommodate new text the way they do
in a word processor.
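The overwriting behavior can be seen in a short sketch that opens a file with 'r+', writes three characters at the start, and then reads the whole file back:

```python
out_file = open("mydata.txt", "w")
out_file.write("SaturdaySunday")
out_file.close()

both = open("mydata.txt", "r+")   # read and write, file must exist
both.seek(0)
both.write("Mon")                 # replaces "Sat"; nothing shifts over
both.seek(0)                      # reposition before switching to reading
contents = both.read()
both.close()

print(contents)   # MonurdaySunday
```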
More File Operations:
➢
Now that you understand how to work with
basic files, let’s next explore how to work with
a database-like file.
➢
For that you'll need to learn about a couple more Python functions.
Pickle and Shelve: Introduction to Pickle
➢
The pickling process simply converts an object to a stream of bytes.
➢
This stream can then be reconverted to the original object later.
➢
This is a useful thing to do for a couple of reasons.
➢
First, the stream of bytes is a compact binary format, so it may save a little bit of space on the disk.
➢
But possibly more important than this is that we can store an entire object using just a
single line of code.
➢
That is, without the ability to use pickle, we would need to store every field of an object
one at a time in the file, and then when we restored the data, we would need to read in
each piece of data and create a new object. Even for something like an object from our
little Time class, we’re talking about replacing three lines of code (one each for the
hour, minute, and second) with a single line.
➢
There are two different ways to pickle an object,
depending on where you want the result to be
stored.
➢
Note, however, that in order to use any of the
pickling functions, you need to have an import
statement to import pickle.
➢
You use the first function, dumps( ), if you want to store the result in a
bytes object in memory.
➢
The other function, dump( ), stores the result in a file.
➢
Example
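The slide's own example is not reproduced here. As a stand-in sketch, the code below pickles a hypothetical list named groceries with dumps( ) and shows that the result is a bytes object:

```python
import pickle

groceries = ["milk", "eggs", "bread"]   # hypothetical sample list

pickled = pickle.dumps(groceries)       # the whole list in one call
print(type(pickled))                    # <class 'bytes'>
print(pickled)                          # not meant to be human-readable
```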
➢
One of the reasons pickling is so important is
because we’re actually storing the values of all
instance variables in the object in a single line of
code.
➢
If we didn’t have the ability to use pickle, then we would need to access
each data member, one at a time.
➢
Pickling converts your list to a stream of bytes
that you store in a bytes object.
➢
It shouldn't be a surprise that the stream is hard to read. However,
putting that stream in a data file for later use can be quite helpful in
certain situations, as you see next.
Pickle and Shelve: Pickling to a File
➢
Sending the pickled result to a file is very similar
to sending it to a bytes object, with two differences.
➢
First, the function name is different: remember, we use dump( ) for
files.
➢
Second, you need to provide the open file object (the variable returned
by open( ), not the filename) as a second argument to the function call.
➢
This file needs to be opened so you can write a set of bytes to it—
because our pickled object is now a set of bytes, not characters.
➢
This is easy in Python: just use 'wb', instead of 'w' as you were using
before.
➢
For example, if you want to send the list above
to a data file named data.txt, you'd open
data.txt with 'wb' and pass the list and the
file object to dump( ).
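A minimal sketch of that step, assuming the list is the hypothetical groceries list (the slide's actual list is not shown here):

```python
import pickle

groceries = ["milk", "eggs", "bread"]   # hypothetical sample list

out_file = open("data.txt", "wb")       # 'wb' = write bytes
pickle.dump(groceries, out_file)        # object first, file object second
out_file.close()
```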
Pickle and Shelve: Pickling from a File
➢
To get the data back to its original form,
you'll need either the loads( ) or the
load( ) function.
➢
Notice how loads( ) works
just like dumps( ).
➢
That is, you place the variable that's
holding the pickled data inside the
parentheses.
➢
The result is returned, and
in this case, we're storing
it in another variable.
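The round trip through dumps( ) and loads( ) can be sketched as follows; the variable names are hypothetical:

```python
import pickle

groceries = ["milk", "eggs", "bread"]   # hypothetical sample list
pickled = pickle.dumps(groceries)

restored = pickle.loads(pickled)        # pickled data goes in the parentheses
print(restored)
print(restored == groceries)            # True
```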
➢
The same idea works with load( ) and data files,
except that you need to remember to first open
the data file in a mode that reads bytes ('rb').
➢
What if there's more than one pickled object in
the data file?
➢
The load( ) function will read these
objects one at a time. The first call to
load will get the first object, the next call
will get the second object, and so on.
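Reading several pickled objects back in order might look like this. Both objects are hypothetical samples.

```python
import pickle

out_file = open("data.txt", "wb")
pickle.dump(["a", "e", "i"], out_file)      # first object
pickle.dump({"hour": 9, "minute": 30}, out_file)   # second object
out_file.close()

in_file = open("data.txt", "rb")            # 'rb' = read bytes
first_obj = pickle.load(in_file)            # first call, first object
second_obj = pickle.load(in_file)           # next call, next object
in_file.close()

print(first_obj, second_obj)
```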
Pickle and Shelve: Python Shelves
➢
As you can see, pickling is a handy way of
converting your data into bytes and cramming
them into an external data file. It’s even more
useful when used in conjunction with shelves.
➢
A shelf is a database-like object that can
efficiently store pickled values.
➢
In actuality, a shelf is an external data file that is
used the same way as a Python dictionary.
➢
The only difference is that in a shelf the keys
must be strings and the values must be objects
that can be pickled.
➢
To use a shelf in your program, you first add the
import shelve line.
➢
Next, you can open the shelf file by using the
shelve.open( ) function.
➢
This works much like the open( ) function for regular files with the name
of the file as the first argument and a flag to tell the computer how the
file should be opened as the second argument.
➢
However, the flags for shelves are a little different.
➢
You can use the 'r' flag to open an existing shelf for reading only, or
the 'w' flag to open an existing shelf for both reading and writing.
➢
Alternatively you can use the 'c' flag (the default), which also
opens the shelf for both reading and writing.
➢
Using it creates a new file if it doesn't already
exist.
➢
The last flag is 'n', which creates a new, empty
file no matter what.
➢
The following code will open
the file letters.txt and write
two different records.
➢
The first one will have the vowels a, e, i, o,
and u.
➢
The second will have the key 'end' that
contains the letters x, y, and z.
➢
Note that when you run this code, Python may produce additional files
with extensions .bak, .dat, and .dir.
➢
Just know that these files are the shelf and are not intended to be
human-readable.
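The slide's code is not reproduced here; a sketch of what it describes could look like the following. The key name 'vowels' for the first record is an assumption (only the 'end' key is named in the text), and the 'n' flag is used so the example starts from an empty shelf on every run.

```python
import shelve

shelf = shelve.open("letters.txt", "n")        # new, empty shelf

shelf["vowels"] = ["a", "e", "i", "o", "u"]    # 'vowels' is an assumed key
shelf["end"] = ["x", "y", "z"]

shelf.close()
```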
Pickle and Shelve: Interacting with the
Shelf
➢
Notice how the syntax for adding an item is the
same as adding an item to a regular dictionary.
➢
The difference, of course, is that this data is
being sent to a file.
➢
Other operations that are possible on your shelf
are accessing the value by providing the key,
using the in operator, and using the keys( )
function.
➢
If you decide that you want to remove a record
from the file, you can use del, just like you did
with dictionaries.
➢
Demonstration:
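The original demonstration is not reproduced here. The operations listed above (lookup by key, the in operator, keys( ), and del) can be sketched on a small shelf; the 'vowels' key is a hypothetical name.

```python
import shelve

shelf = shelve.open("letters.txt", "n")        # start with an empty shelf
shelf["vowels"] = ["a", "e", "i", "o", "u"]    # hypothetical key
shelf["end"] = ["x", "y", "z"]

print(shelf["end"])            # access a value by its key
print("vowels" in shelf)       # membership test with the in operator
print(sorted(shelf.keys()))    # all keys currently on the shelf

del shelf["end"]               # remove a record, as with a dictionary
remaining = sorted(shelf.keys())
print(remaining)

shelf.close()
```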
Pickle and Shelve: The sync( ) Function
➢
Because file operations are time consuming,
these files don't always write the data to the file
immediately.
➢
If you want to immediately write the data, you
can use the sync( ) function. It is similar to the
flush() function you used with the simple files
earlier.
➢
We’ll now see some examples to better acquaint
you with all we’ve covered so far.
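The sync( ) call described above can be sketched as:

```python
import shelve

shelf = shelve.open("letters.txt", "c")   # create the shelf if needed
shelf["end"] = ["x", "y", "z"]

shelf.sync()     # write pending changes to the file now, like flush( )

shelf.close()    # closing also writes any remaining changes
```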
Reading and Writing CSV Files in Python
➢
Exchanging information through text files is a
common way to share info between programs.
➢
One of the most popular formats for exchanging
data is the CSV format. But how do you use it?
➢
Let’s get one thing clear: you don’t have to (and
you won’t) build your own CSV parser from scratch.
➢
There are several perfectly acceptable libraries
you can use. The Python csv library will work for
most cases.
➢
If your work requires lots of data or numerical analysis, the pandas library
has CSV parsing capabilities as well, which should handle the rest.
➢
In this section we will see:
➢
How to read, process, and parse CSV from text files using Python.
➢
How CSV files work, and how to use the all-important csv library built
into Python.
➢
How CSV parsing works using the pandas library.
➢
How to install libraries using pip.
➢
But first let’s get acquainted with CSV.
What Is a CSV File?
➢
A CSV file (Comma Separated Values file) is a
type of plain text file that uses specific
structuring to arrange tabular data.
➢
Because it’s a plain text file, it can contain only actual text data—in
other words, printable ASCII or Unicode characters.
➢
Normally, CSV files use a comma to separate
each specific data value. Here’s what that
structure looks like:
➢
Notice how each piece of data is separated by a
comma.
➢
Normally, the first line identifies each piece of data—in other words, the name
of a data column.
➢
Every subsequent line after that is actual data and is limited only by file size
constraints.
➢
In general, the separator character is called a
delimiter, and the comma is not the only one used.
➢
Other popular delimiters include the tab (\t), colon (:) and semi-colon (;)
characters.
➢
Properly parsing a CSV file requires us to know which
delimiter is being used.
➢
CSV files are normally created by programs that
handle large amounts of data.
➢
They are a convenient way to export data from spreadsheets and databases as
well as import or use it in other programs.
➢
For example, you might export the results of a data mining program to a CSV
file and then import that into a spreadsheet to analyze the data, generate
graphs for a presentation, or prepare a report for publication.
➢
CSV files are very easy to work with
programmatically.
➢
Any language that supports text file input and string
manipulation (like Python) can work with CSV files
directly.
Parsing CSV Files With Python’s Built-in
CSV Library
➢
The csv library provides functionality to both
read from and write to CSV files.
➢
Designed to work out of the box with Excel-generated
CSV files, it is easily adapted to work
with a variety of CSV formats.
➢
The csv library contains objects and other code
to read, write, and process data from and to CSV
files.
Reading CSV Files With csv
➢
Reading from a CSV file is done using the reader object.
➢
The CSV file is opened as a text file with Python’s
built-in open() function, which returns a file object.
➢
This is then passed to the reader, which does the heavy
lifting.
➢
For example, here’s a file called employee_birthday.txt.
➢
To read the CSV file, pass the open file object to csv.reader( ) and
loop over the rows it returns.
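A sketch of reading with csv.reader. The contents of employee_birthday.txt are not shown in this text, so the three lines written at the top are hypothetical sample data, including the column names.

```python
import csv

# Hypothetical sample data standing in for employee_birthday.txt
out_file = open("employee_birthday.txt", "w", newline="")
out_file.write("name,department,birthday month\n")
out_file.write("John Smith,Accounting,November\n")
out_file.write("Erica Meyers,IT,March\n")
out_file.close()

in_file = open("employee_birthday.txt", "r", newline="")
csv_reader = csv.reader(in_file, delimiter=",")
rows = []
for row in csv_reader:
    rows.append(row)          # each row is a list of strings
in_file.close()

print("Columns:", rows[0])
for row in rows[1:]:
    print(row[0], "works in", row[1], "and was born in", row[2])
```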
Reading CSV Files Into a Dictionary With
csv
➢
Rather than deal with a list of individual string elements,
you can read CSV data directly into a dictionary.
➢
We will be working with the same file, employee_birthday.txt.
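One way to do this is with csv.DictReader, which uses the first line of the file as the dictionary keys. The file contents below are again hypothetical sample data.

```python
import csv

# Hypothetical sample data standing in for employee_birthday.txt
out_file = open("employee_birthday.txt", "w", newline="")
out_file.write("name,department,birthday month\n")
out_file.write("John Smith,Accounting,November\n")
out_file.write("Erica Meyers,IT,March\n")
out_file.close()

in_file = open("employee_birthday.txt", "r", newline="")
records = list(csv.DictReader(in_file))   # one dictionary per data row
in_file.close()

for record in records:
    print(record["name"], "was born in", record["birthday month"])
```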
Optional Python CSV reader Parameters
➢
The reader object can handle different styles of
CSV files by specifying additional parameters,
some of which are shown below:
➢
delimiter specifies the character used to separate each field. The
default is the comma (',').
➢
quotechar specifies the character used to surround fields that contain
the delimiter character. The default is a double quote (' " ').
➢
escapechar specifies the character used to escape the delimiter
character, in case quotes aren’t used. The default is no escape character.
➢
For example, suppose we have a file called employee_addresses.txt
in which each record includes an address field.
➢
The problem is that the data for the address field
also contains commas, the same character that signifies
separation between the fields.
➢
There are three different ways to handle this situation:
➢
Use a different delimiter
➢
That way, the comma can safely be used in the data itself. You use the delimiter optional
parameter to specify the new delimiter.
➢
Wrap the data in quotes
➢
The special nature of your chosen delimiter is ignored in quoted strings. Therefore, you can
specify the character used for quoting with the quotechar optional parameter. As long as that
character also doesn’t appear in the data, you’re fine.
➢
Escape the delimiter characters in the data
➢
Escape characters work just as they do in format strings, nullifying the interpretation of the
character being escaped (in this case, the delimiter). If an escape character is used, it must be
specified using the escapechar optional parameter.
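The three approaches above can be sketched side by side. The name and address are hypothetical, and each line is passed to csv.reader as a one-item list of strings (the reader accepts any iterable of lines, not only file objects) to keep the sketch short.

```python
import csv

# 1. A different delimiter: semicolons separate the fields
rows_a = list(csv.reader(
    ["John Smith;4860 Sunset Boulevard, San Francisco"], delimiter=";"))

# 2. Quotes: the comma inside the quoted field is ordinary data
rows_b = list(csv.reader(
    ['John Smith,"4860 Sunset Boulevard, San Francisco"'], quotechar='"'))

# 3. Escaping: a backslash in the data marks the comma as ordinary data
rows_c = list(csv.reader(
    ["John Smith,4860 Sunset Boulevard\\, San Francisco"], escapechar="\\"))

print(rows_a)
print(rows_b)
print(rows_c)
```

All three calls produce the same two fields: the name and the complete address.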
Writing CSV Files With csv
➢
You can also write to a CSV file using a writer
object and the .writerow() method.
➢
The quotechar optional parameter tells the writer
which character to use to quote fields when writing.
➢
If quoting is set to csv.QUOTE_MINIMAL, then .writerow() will quote fields only
if they contain the delimiter or the quotechar. This is the default case.
➢
If quoting is set to csv.QUOTE_ALL, then .writerow() will quote all fields.
➢
If quoting is set to csv.QUOTE_NONNUMERIC, then .writerow() will quote all
fields containing text data. (When the same setting is used with a reader, all
unquoted fields are converted to the float data type.)
➢
If quoting is set to csv.QUOTE_NONE, then .writerow() will escape delimiters
instead of quoting them. In this case, you also must provide a value for the
escapechar optional parameter.
➢
Reading the file back in plain text shows that the
file is created as follows:
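The slide's code and output are not reproduced here. A sketch of writing with csv.writer and then reading the file back as plain text is shown below; the filename employee_file.csv and both rows are hypothetical.

```python
import csv

out_file = open("employee_file.csv", "w", newline="")
employee_writer = csv.writer(
    out_file, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
employee_writer.writerow(["John Smith", "Accounting", "November"])
employee_writer.writerow(["Meyers, Erica", "IT", "March"])  # contains a comma
out_file.close()

in_file = open("employee_file.csv", "r", newline="")
text = in_file.read()
in_file.close()
print(text)
```

With QUOTE_MINIMAL, only the field that contains the delimiter ends up quoted.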
Parsing CSV Files With the pandas
Library
➢
Reading CSV files is possible in pandas as well. It is
highly recommended if you have a lot of data to
analyze.
➢
pandas is an open-source Python library that
provides high-performance data analysis tools and
easy-to-use data structures.
➢
pandas is available for all Python installations, but
it is a key part of the Anaconda distribution and
works extremely well in Jupyter notebooks to share
data, code, analysis results, visualizations, and
narrative text.
➢
You can install pandas either through conda or
pip.
➢
You can read up on how to use conda if you work
with Anaconda or Jupyter; for now we will see
how to use pip.
➢
Enter a command like this in cmd: pip install pandas
Reading CSV Files With pandas
➢
Once you have pandas installed, you can use it to
read CSV files and much more.
➢
For example if we have a data file like this:
➢
You can easily read the hrdata.csv file into a
pandas DataFrame with pandas.read_csv( ).
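A sketch of that call follows. The column names come from the notes in this lesson, but the two data rows are hypothetical, since the sample file itself is not shown here.

```python
import pandas

# Hypothetical sample data standing in for hrdata.csv
out_file = open("hrdata.csv", "w")
out_file.write("Name,Hire Date,Salary,Sick Days remaining\n")
out_file.write("Graham Chapman,2014-03-15,50000.00,10\n")
out_file.write("John Cleese,2015-06-01,65000.00,8\n")
out_file.close()

df = pandas.read_csv("hrdata.csv")   # first line becomes the column names
print(df)
print(type(df["Hire Date"][0]))      # still a string at this point
```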
➢
Some notes
➢
First, pandas recognized that the first line of the CSV contained column
names, and used them automatically.
➢
However, pandas is also using zero-based integer indices in the
DataFrame. That’s because we didn’t tell it what our index should be.
➢
Further, if you look at the data types of our columns, you’ll see pandas
has properly converted the Salary and Sick Days remaining columns to
numbers, but the Hire Date column is still a string. This is easily
confirmed in interactive mode:
print(type(df['Hire Date'][0]))
➢
To use a different column as the DataFrame
index, add the index_col optional parameter:
➢
You can force pandas to read data as a date with
the parse_dates optional parameter, which is
defined as a list of column names to treat as
dates:
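The index_col and parse_dates parameters can be combined in one call, as in this sketch. The data rows are hypothetical, and .iloc[0] is used to pick the first value by position because the index is no longer made of integers.

```python
import pandas

# Hypothetical sample data standing in for hrdata.csv
out_file = open("hrdata.csv", "w")
out_file.write("Name,Hire Date,Salary,Sick Days remaining\n")
out_file.write("Graham Chapman,2014-03-15,50000.00,10\n")
out_file.write("John Cleese,2015-06-01,65000.00,8\n")
out_file.close()

df = pandas.read_csv(
    "hrdata.csv",
    index_col="Name",            # use the Name column as the index
    parse_dates=["Hire Date"],   # treat this column as dates
)
print(df)
first_hire = df["Hire Date"].iloc[0]
print(type(first_hire))
```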
➢
You can check that the date is parsed
appropriately by typing at the prompt:
print(type(df['Hire Date'][0]))
➢
You will get a result like:
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
➢
If your CSV file doesn’t have column names in
the first line, you can use the names optional
parameter to provide a list of column names.
➢
You can also use this if you want to override the
column names provided in the first line. In this
case, you must also tell pandas.read_csv() to
ignore existing column names using the
header=0 optional parameter:
➢
Notice that, since the column names changed,
the columns specified in the index_col and
parse_dates optional parameters must also be
changed.
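A sketch of overriding the column names with names and header=0. The replacement names (Employee, Hired, Salary, Sick Days) and the data rows are hypothetical.

```python
import pandas

# Hypothetical sample data standing in for hrdata.csv
out_file = open("hrdata.csv", "w")
out_file.write("Name,Hire Date,Salary,Sick Days remaining\n")
out_file.write("Graham Chapman,2014-03-15,50000.00,10\n")
out_file.write("John Cleese,2015-06-01,65000.00,8\n")
out_file.close()

df = pandas.read_csv(
    "hrdata.csv",
    header=0,                                         # skip the old names
    names=["Employee", "Hired", "Salary", "Sick Days"],
    index_col="Employee",                             # new name, not "Name"
    parse_dates=["Hired"],                            # new name here too
)
print(df)
```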
Writing CSV Files With pandas
➢
Writing a DataFrame to a CSV file is just as
easy as reading one in. Let’s write the data
with the new column names to a new CSV file:
➢
The only difference between this code and the
reading code above is that the print(df) call
was replaced with df.to_csv(), providing the file
name. The new CSV file looks like this:
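The slide's code and the resulting file are not reproduced here. As a sketch under the same hypothetical data and column names as before, writing the DataFrame out might look like this; the output filename hrdata_modified.csv is an assumption.

```python
import pandas

# Hypothetical sample data standing in for hrdata.csv
out_file = open("hrdata.csv", "w")
out_file.write("Name,Hire Date,Salary,Sick Days remaining\n")
out_file.write("Graham Chapman,2014-03-15,50000.00,10\n")
out_file.write("John Cleese,2015-06-01,65000.00,8\n")
out_file.close()

df = pandas.read_csv(
    "hrdata.csv",
    header=0,
    names=["Employee", "Hired", "Salary", "Sick Days"],
    index_col="Employee",
    parse_dates=["Hired"],
)
df.to_csv("hrdata_modified.csv")     # write the DataFrame to a new CSV file

in_file = open("hrdata_modified.csv", "r")
header_line = in_file.readline()
in_file.close()
print(header_line)
```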
Lesson 10 Review
➢
This lesson started by discussing external data
files.
➢
You learned how to create, open, write, and close
a data file. Then you went on to learn how to open
the same file and read the data out of it.
➢
Although working with simple, sequential files one line at a time can be
easy, you also learned that Python enables you to move around in
the file wherever you want with some additional functions.
➢
You also saw how to work with CSV files with the
built-in csv library as well as with pandas.
Some exercises:
➢
Write a simple library management system. It
should have one class called Book, which
contains title, author name, publisher, ISBN,
edition, etc.
➢
The main program should allow the user to create new books or view the
books already there.
➢
You should be able to see the list of books even when you restart the
application.