Python Data Science Toolbox
Both flash1 and flash2 are iterables.
flash1 is an iterable and flash2 is an iterator.

Iterating over iterables (1)
Create a for loop to loop over flash and print the values in the list. Use person as the loop variable.
Create an iterator for the list flash and assign the result to superhero.
Print each of the items from superhero using next() 4 times.

# Create a list of strings: flash
flash = ['jay garrick', 'barry allen', 'wally west', 'bart allen']

# Print each list item in flash using a for loop
for person in flash:
    print(person)

# Create an iterator for flash: superhero
superhero = iter(flash)

# Print each item from the iterator
print(next(superhero))
print(next(superhero))
print(next(superhero))
print(next(superhero))

# Create an iterator for range(3): small_value
small_value = iter(range(3))

# Print the values in small_value
print(next(small_value))
print(next(small_value))
print(next(small_value))

# Loop over range(3) and print the values
for num in range(3):
    print(num)

# Create an iterator for range(10 ** 100): googol
googol = iter(range(10 ** 100))

# Print the first 5 values from googol
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))
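
As a small illustration of the iterable/iterator distinction exercised above, the sketch below shows that an iterator is consumed as next() is called, and that next() accepts a default value so that an exhausted iterator does not raise StopIteration. The names first, remaining, exhausted and first_five are illustrative only and do not come from the exercises.

```python
# Sketch: a list is an iterable; iter() turns it into a one-shot iterator.
flash = ['jay garrick', 'barry allen', 'wally west', 'bart allen']
superhero = iter(flash)

first = next(superhero)            # consumes the first item
remaining = list(superhero)        # consumes everything that is left
exhausted = next(superhero, None)  # default is returned instead of raising StopIteration

# range() is lazy, so even an enormous range can be iterated item by item.
googol = iter(range(10 ** 100))
first_five = [next(googol) for _ in range(5)]

print(first, remaining, exhausted, first_five)
```

Because the iterator keeps its position, calling list() on it after one next() call returns only the three names that have not yet been consumed.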

Define the function count_entries(), which has 3 parameters. The first parameter is csv_file for the filename, the second is c_size for the chunk size, and the last is colname for the column name.
Iterate over the file in csv_file by using a for loop. Use the loop variable chunk and iterate over the call to pd.read_csv(), passing c_size to chunksize.
In the inner loop, iterate over the column given by colname in chunk by using a for loop. Use the loop variable entry.
Call the count_entries() function by passing to it the filename 'tweets.csv', the size of chunks 10, and the name of the column to count, 'lang'. Assign the result of the call to the variable result_counts.

import pandas as pd

# Define count_entries()
def count_entries(csv_file, c_size, colname):

    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Iterate over the file chunk by chunk
    for chunk in pd.read_csv(csv_file, chunksize=c_size):

        # Iterate over the column in DataFrame
        for entry in chunk[colname]:
            if entry in counts_dict:
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1

    # Return counts_dict
    return counts_dict

# Call count_entries(): result_counts
result_counts = count_entries('tweets.csv', 10, 'lang')

# Print result_counts
print(result_counts)

Write a basic list comprehension
The following list has been pre-loaded in the environment.
The list comprehension is [for doc in doctor: doc[0]] and produces the list ['h', 'c', 'c', 't', 'w'].
The list comprehension is [doc[0] for doc in doctor] and produces the list ['h', 'c', 'c', 't', 'w'].
The list comprehension is [doc[0] in doctor] and produces the list ['h', 'c', 'c', 't', 'w'].
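
To check which of the three candidate expressions is valid Python, here is a minimal sketch using the doctor list from these notes. The names first_chars, loop_first_is_valid and membership_form_runs are illustrative, and compile() and eval() are used only so the invalid forms can be tested without stopping the script.

```python
doctor = ['house', 'cuddy', 'chase', 'thirteen', 'wilson']

# Valid form: the output expression comes first, then the for clause.
first_chars = [doc[0] for doc in doctor]
print(first_chars)

# Loop-first form: not valid syntax, so compiling it raises SyntaxError.
try:
    compile("[for doc in doctor: doc[0]]", "<sketch>", "eval")
    loop_first_is_valid = True
except SyntaxError:
    loop_first_is_valid = False

# Form without a for clause: 'doc' is never bound, so evaluating it raises NameError.
try:
    eval("[doc[0] in doctor]", {"doctor": doctor})
    membership_form_runs = True
except NameError:
    membership_form_runs = False

print(loop_first_is_valid, membership_form_runs)
```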

List comprehension over iterables
You know that list comprehensions can be built over iterables. Given the objects below, which of these can we build list comprehensions over?

doctor = ['house', 'cuddy', 'chase', 'thirteen', 'wilson']
range(50)
underwood = 'After all, we are nothing more or less than what we choose to reveal.'
jean = '24601'
flash = ['jay garrick', 'barry allen', 'wally west', 'bart allen']
valjean = 24601

You can build list comprehensions over all the objects except the string of number characters jean.
You can build list comprehensions over all the objects except the string lists doctor and flash.
You can build list comprehensions over all the objects except range(50).
You can build list comprehensions over all the objects except the integer object valjean.

Writing list comprehensions
Using the range of numbers from 0 to 9 as your iterable and i as your iterator variable, write a list comprehension that produces a list of numbers consisting of the squared values of i.

# Create list comprehension: squares
squares = [i**2 for i in range(0, 10)]

In the inner list comprehension - that is, the output expression of the nested list comprehension - create a list of values from 0 to 4 using range(). Use col as the iterator variable.
In the iterable part of your nested list comprehension, use range() to count 5 rows - that is, create a list of values from 0 to 4. Use row as the iterator variable; note that you won’t be needing this variable to create values in the list of lists.

# Create a 5 x 5 matrix using a list of lists: matrix
matrix = [[col for col in range(5)] for row in range(5)]

# Print the matrix
for row in matrix:
    print(row)

Using conditionals in comprehensions (1)
Use member as the iterator variable in the list comprehension. For the conditional, use len() to evaluate the iterator variable. Note that you only want strings with 7 characters or more.

# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# Create list comprehension: new_fellowship
new_fellowship = [member for member in fellowship if len(member) >= 7]

# Print the new list
print(new_fellowship)

Using conditionals in comprehensions (2)

Dict comprehensions
Create a dict comprehension where the key is a string in fellowship and the value is the length of the string. Remember to use the syntax <key> : <value> in the output expression part of the comprehension to create the members of the dictionary. Use member as the iterator variable.

# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# Create dict comprehension: new_fellowship
new_fellowship = {member: len(member) for member in fellowship}

# Print the new dictionary
print(new_fellowship)

List comprehensions vs. generators
To help with that task, the following code has been pre-loaded in the environment:

# List of strings
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

List comprehensions and generators are not different at all; they are just different ways of writing the same thing.
A list comprehension produces a list as output, a generator produces a generator object.
A list comprehension produces a list as output that can be iterated over, a generator produces a generator object that can’t be iterated over.

Write your own generator expressions
Create a generator object that will produce values from 0 to 30. Assign the result to result and use num as the iterator variable in the generator expression.
Print the first 5 values by using next() appropriately in print().
Print the rest of the values by using a for loop to iterate over the generator object.

# Create generator object: result
result = (num for num in range(31))

# Print the first 5 values
print(next(result))
print(next(result))
print(next(result))
print(next(result))
print(next(result))

# Print the rest of the values
for value in result:
    print(value)

Changing the output in generator expressions
Write a generator expression that will generate the lengths of each string in lannister. Use person as the iterator variable. Assign the result to lengths.
Supply the correct iterable in the for loop for printing the values in the generator object.

# Create a list of strings: lannister
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Create a generator object: lengths
lengths = (len(person) for person in lannister)

# Iterate over and print the values in lengths
for value in lengths:
    print(value)

Build a generator
Complete the function header for the function get_lengths() that has a single parameter, input_list.
In the for loop in the function definition, yield the length of the strings in input_list.
Complete the iterable part of the for loop for printing the values generated by the get_lengths() generator function. Supply the call to get_lengths(), passing in the list lannister.

# Create a list of strings
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Define generator function get_lengths
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""

    # Yield the length of a string
    for person in input_list:
        yield len(person)

# Print the values generated by get_lengths()
for value in get_lengths(lannister):
    print(value)

List comprehensions for time-stamped data
Extract the column 'created_at' from df and assign the result to tweet_time. Fun fact: the extracted column in tweet_time here is a Series data structure!
Create a list comprehension that extracts the time from each row in tweet_time. Each row is a string that represents a timestamp, and you will access the 12th to 19th characters in the string to extract the time. Use entry as the iterator variable and assign the result to tweet_clock_time. Remember that Python uses 0-based indexing!
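
The slice bounds implied by "the 12th to 19th characters" can be checked on a single string. The sketch below assumes a timestamp of the form 'Tue Mar 29 23:40:17 +0000 2016'; that sample value is hypothetical and chosen only to illustrate the indexing, since the actual contents of tweets.csv are not shown in these notes.

```python
# Hypothetical timestamp string (assumed format; not taken from tweets.csv).
entry = 'Tue Mar 29 23:40:17 +0000 2016'

# The 12th to 19th characters (1-based) sit at indices 11 to 18 (0-based);
# the end of a slice is exclusive, hence entry[11:19].
clock_time = entry[11:19]

# Within the same string, indices 17 and 18 hold the seconds.
seconds = entry[17:19]

print(clock_time, seconds)
```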

# edited/added
df = pd.read_csv('tweets.csv')

# Extract the created_at column from df: tweet_time
tweet_time = df['created_at']

# Extract the clock time: tweet_clock_time
tweet_clock_time = [entry[11:19] for entry in tweet_time]

# Print the extracted times
print(tweet_clock_time)

Conditional list comprehensions for time-stamped data
Extract the column 'created_at' from df and assign the result to tweet_time.
Create a list comprehension that extracts the time from each row in tweet_time. Each row is a string that represents a timestamp, and you will access the 12th to 19th characters in the string to extract the time. Use entry as the iterator variable and assign the result to tweet_clock_time. Additionally, add a conditional expression that checks whether entry[17:19] is equal to '19'.

# Extract the created_at column from df: tweet_time
tweet_time = df['created_at']

# Extract the clock time: tweet_clock_time
tweet_clock_time = [entry[11:19] for entry in tweet_time if entry[17:19] == '19']

# Print the extracted times
print(tweet_clock_time)

Dictionaries for data science
Create a zip object by calling zip() and passing to it feature_names and row_vals. Assign the result to zipped_lists.
Create a dictionary from the zipped_lists zip object by calling dict() with zipped_lists. Assign the resulting dictionary to rs_dict.

# edited/added
feature_names = ['CountryName', 'CountryCode', 'IndicatorName', 'IndicatorCode', 'Year', 'Value']
row_vals = ['Arab World', 'ARB', 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'SP.ADO.TFRT', '1960', '133.56090740552298']

# Zip lists: zipped_lists
zipped_lists = zip(feature_names, row_vals)

# Create a dictionary: rs_dict
rs_dict = dict(zipped_lists)

# Print the dictionary
print(rs_dict)

Writing a function to help you
Define the function lists2dict() with two parameters: first is list1 and second is list2.
Return the resulting dictionary rs_dict in lists2dict().
Call the lists2dict() function with the arguments feature_names and row_vals. Assign the result of the function call to rs_fxn.

# Define lists2dict()
def lists2dict(list1, list2):
    """Return a dictionary where list1 provides
    the keys and list2 provides the values."""

    # Zip lists: zipped_lists
    zipped_lists = zip(list1, list2)

    # Create a dictionary: rs_dict
    rs_dict = dict(zipped_lists)

    # Return the dictionary
    return rs_dict

# Call lists2dict: rs_fxn
rs_fxn = lists2dict(feature_names, row_vals)

# Print rs_fxn
print(rs_fxn)

Using a list comprehension
Inspect the contents of row_lists by printing the first two lists in row_lists.
Create a list comprehension that generates a dictionary using lists2dict() for each sublist in row_lists. The keys are from the feature_names list and the values are the row entries in row_lists. Use sublist as your iterator variable and assign the resulting list of dictionaries to list_of_dicts.
Look at the first two dictionaries in list_of_dicts by printing them out.

# edited/added
import csv
with open('row_lists.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
    row_lists = [row for row in reader]

# Print the first two lists in row_lists
print(row_lists[0])
print(row_lists[1])

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Print the first two dictionaries in list_of_dicts
print(list_of_dicts[0])
print(list_of_dicts[1])

Turning this all into a DataFrame
To use the DataFrame() function, you first need to import the pandas package with the alias pd.
Create a DataFrame from the list of dictionaries in list_of_dicts by calling pd.DataFrame(). Assign the resulting DataFrame to df.
Inspect the contents of df by printing the head of the DataFrame. The head of the DataFrame df can be accessed by calling df.head().

# Import the pandas package
import pandas as pd

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Turn list of dicts into a DataFrame: df
df = pd.DataFrame(list_of_dicts)

# Print the head of the DataFrame
print(df.head())

Processing data in chunks (1)
Use open() to bind the csv file 'world_dev_ind.csv' as file in the context manager.
Complete the for loop so that it iterates 1000 times to perform the loop body and process only the first 1000 rows of data of the file.

# Open a connection to the file
with open('world_dev_ind.csv') as file:

Writing a generator to load data in chunks (2)
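
As a hedged sketch of the idea named in this heading, a generator function can yield a file's lines one at a time so that only a small part of the file is held in memory. The function name read_lines_lazily, the sample data, and the naive comma split are assumptions made for illustration; io.StringIO stands in for an opened file such as 'world_dev_ind.csv'.

```python
import io
from itertools import islice

def read_lines_lazily(file_object):
    """Yield one line at a time until the file is exhausted."""
    while True:
        line = file_object.readline()
        if not line:          # readline() returns '' at end of file
            break
        yield line

# Hypothetical stand-in for an opened CSV file.
sample_file = io.StringIO(
    "CountryName,CountryCode\n"
    "Arab World,ARB\n"
    "Caribbean small states,CSS\n"
)

gen = read_lines_lazily(sample_file)
header = next(gen)            # the first line holds the column names

# Count first-column entries for at most the next 1000 rows.
# Note: split(',') is a naive parse that ignores quoted commas.
counts = {}
for line in islice(gen, 1000):
    first_column = line.rstrip('\n').split(',')[0]
    counts[first_column] = counts.get(first_column, 0) + 1

print(header.strip(), counts)
```

Using islice() caps the number of rows processed without reading the remainder of the file, which mirrors the "first 1000 rows" limit described in the preceding exercise.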

# Zip DataFrame columns of interest: pops
pops = zip(df_pop_ceb['Total Population'],
           df_pop_ceb['Urban population (% of total)'])

# Turn zip object into list: pops_list
pops_list = list(pops)

# Print pops_list
print(pops_list)

Writing an iterator to load data in chunks (3)
Write a list comprehension to generate a list of values from pops_list for the new column 'Total Urban Population'. The output expression should be the product of the first and second element in each tuple in pops_list. Because the 2nd element is a percentage, you also need to either multiply the result by 0.01 or divide it by 100. In addition, note that the column 'Total Urban Population' should only be able to take on integer values. To ensure this, make sure you cast the output expression to an integer with int().
Create a scatter plot where the x-axis shows values from the 'Year' column and the y-axis shows values from the 'Total Urban Population' column.

# edited/added
import matplotlib.pyplot as plt

# Code from previous exercise
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)

# Get the first DataFrame chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)

# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']

# Zip DataFrame columns of interest: pops
pops = zip(df_pop_ceb['Total Population'],
           df_pop_ceb['Urban population (% of total)'])

# Turn zip object into list: pops_list
pops_list = list(pops)

# Use list comprehension to create new DataFrame column 'Total Urban Population'
df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]

# Plot urban population data
df_pop_ceb.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()

Writing an iterator to load data in chunks (4)
Initialize an empty DataFrame data using pd.DataFrame().
In the for loop, iterate over urb_pop_reader to be able to process all the DataFrame chunks in the dataset.
Concatenate data and df_pop_ceb by passing a list of the DataFrames to pd.concat().

# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)

# Initialize empty DataFrame: data
data = pd.DataFrame()

# Iterate over each DataFrame chunk
for df_urb_pop in urb_pop_reader:

    # Check out specific country: df_pop_ceb
    df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']

# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv(filename, chunksize=1000)

# Initialize empty DataFrame: data
data = pd.DataFrame()

# Iterate over each DataFrame chunk
for df_urb_pop in urb_pop_reader:
    # Check out specific country: df_pop_ceb
    df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == country_code]
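
The chunk-by-chunk pattern used in these exercises (read a chunk, keep the rows for one country code, accumulate them, then derive the urban population) can also be sketched without pandas. This is a standard-library analogue, not the pandas code from the exercises: the helper iter_chunks, the sample rows, and the chunk size of 2 are all hypothetical.

```python
import csv
import io
from itertools import islice

def iter_chunks(rows, chunk_size):
    """Yield lists of up to chunk_size rows from an iterable of rows."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk

# Hypothetical sample data standing in for 'ind_pop_data.csv'.
sample_csv = io.StringIO(
    "CountryCode,Year,Total Population,Urban population (% of total)\n"
    "CEB,1960,1001,44.5\n"
    "ARB,1960,5000,31.0\n"
    "CEB,1961,2003,45.5\n"
)

reader = csv.DictReader(sample_csv)
country_code = 'CEB'
data = []

# Process the file chunk by chunk, keeping only the rows of interest.
for chunk in iter_chunks(reader, 2):
    data.extend(row for row in chunk if row['CountryCode'] == country_code)

# Pair total population with urban percentage, as zip() does in the exercises.
pops_list = [(float(row['Total Population']),
              float(row['Urban population (% of total)'])) for row in data]

# The percentage is scaled by 0.01 and the result cast to int.
total_urban = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]
print(total_urban)
```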