This document explores and summarizes a dataset containing information about US colleges. It loads the dataset, cleans it by removing carriage returns from column names and replacing null values with column means. It then provides some initial analysis, showing the dataset has 1133 entries across 51 columns containing information like test scores, tuition costs, faculty details, and more. The document examines the dataset types and concludes by noting the cleaning removed null values.

Uploaded by

Peper12345
7/30/2020 Exploratory Data Analysis

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:

# Read in the dataset:


df = pd.read_table("merged_uscol.txt", sep=",")
df.head()

Out[2]:

   FICE  College_name.x                 States  Public_indicator  Average_M\r\nath_SAT_score  Average_Verb
0  1002  Alabama Agri. & Mech. Univ.    AL      1                 NaN
1  1004  University of Montevallo       AL      1                 NaN
2  1009  Auburn University-Main Campus  AL      1                 575.0
3  1012  Birmingham-Southern College    AL      2                 575.0
4  1016  University of North Alabama    AL      1                 NaN

5 rows × 51 columns

In [3]:

# Remove carriage return and newline sequences in column names:


df.columns = list(map(lambda x: x.replace("\r\n", ""), df.columns.tolist()))
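
The same cleanup can also be written with pandas' vectorised string methods. The following is only a minimal sketch on a hypothetical two-column frame (`toy` is not part of the notebook); `regex=False` makes the replacement a literal one rather than a pattern match.

```python
import pandas as pd

# Hypothetical toy frame whose header contains an embedded "\r\n",
# mimicking the damaged column name seen in Out[2].
toy = pd.DataFrame({"FICE": [1002], "Average_M\r\nath_SAT_score": [575.0]})

# Literal (non-regex) replacement applied to every column label at once.
toy.columns = toy.columns.str.replace("\r\n", "", regex=False)

print(toy.columns.tolist())
```

Both forms produce the same labels; the vectorised one simply avoids the explicit `map`/`lambda`.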


In [4]:

# Return first 5 rows of our dataframe:


df.head()

Out[4]:

   FICE  College_name.x                 States  Public_indicator  Average_Math_SAT_score  Average_Verbal_
0  1002  Alabama Agri. & Mech. Univ.    AL      1                 NaN
1  1004  University of Montevallo       AL      1                 NaN
2  1009  Auburn University-Main Campus  AL      1                 575.0
3  1012  Birmingham-Southern College    AL      2                 575.0
4  1016  University of North Alabama    AL      1                 NaN

5 rows × 51 columns

In [5]:

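# Let's replace our NaN values with the mean of the corresponding numeric column.
# numeric_only=True skips the text columns, which recent pandas versions refuse to average:
df.fillna(df.mean(numeric_only=True), inplace=True)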
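
To make the effect of mean imputation visible, here is a small sketch on a hypothetical three-row frame (the column names and values are made up, not taken from the dataset). It also records which entries were missing before they are overwritten, which the notebook itself does not do.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the college table (values are illustrative only).
toy = pd.DataFrame({
    "College": ["A", "B", "C"],
    "Math_SAT": [500.0, np.nan, 600.0],
})

# Remember which scores were missing before imputation overwrites them.
was_missing = toy["Math_SAT"].isna()

# Column means of the numeric columns only; the text column is skipped.
numeric_means = toy.mean(numeric_only=True)

# Each NaN is replaced by the mean of its own column.
filled = toy.fillna(numeric_means)

print(filled)
print(was_missing.tolist())
```

Every imputed cell in a column receives the same value, so a heavily imputed column piles up at its mean. That is consistent with 512.605144 showing up as both the mean and the median of Average_Math_SAT_score in the `describe()` output.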

In [6]:

df.head()

Out[6]:

   FICE  College_name.x                 States  Public_indicator  Average_Math_SAT_score  Average_Verbal_
0  1002  Alabama Agri. & Mech. Univ.    AL      1                 512.605144
1  1004  University of Montevallo       AL      1                 512.605144
2  1009  Auburn University-Main Campus  AL      1                 575.000000
3  1012  Birmingham-Southern College    AL      2                 575.000000
4  1016  University of North Alabama    AL      1                 512.605144

5 rows × 51 columns


In [7]:

# Let's explore our dataset now !


df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1133 entries, 0 to 1132
Data columns (total 51 columns):
 #   Column                                             Non-Null Count  Dtype
---  ------                                             --------------  -----
 0   FICE                                               1133 non-null   int64
 1   College_name.x                                     1133 non-null   object
 2   States                                             1133 non-null   object
 3   Public_indicator                                   1133 non-null   int64
 4   Average_Math_SAT_score                             1133 non-null   float64
 5   Average_Verbal_SAT_score                           1133 non-null   float64
 6   Average_Combined_SAT_score                         1133 non-null   float64
 7   Average_ACT_score                                  1133 non-null   float64
 8   First_quartile_Math_SAT                            1133 non-null   float64
 9   Third_quartile_Math_SAT                            1133 non-null   float64
 10  First_quartile_Verbal_SAT                          1133 non-null   float64
 11  Third_quartile_Verbal_SAT                          1133 non-null   float64
 12  First_quartile_ACT                                 1133 non-null   float64
 13  Third_quartile_ACT                                 1133 non-null   float64
 14  Number_applications_received                       1133 non-null   float64
 15  Number_applicants_accepted                         1133 non-null   float64
 16  Number_new_students_enrolled                       1133 non-null   float64
 17  new_students_from_top_ten_percent_HS_class         1133 non-null   float64
 18  students_from_top_twenty_five_percent_of_HS_class  1133 non-null   float64
 19  Number_fulltime_undergraduates                     1133 non-null   float64
 20  Number_parttime_undergraduates                     1133 non-null   float64
 21  In_state_tuition                                   1133 non-null   float64
 22  Out_state_tuition                                  1133 non-null   float64
 23  Room_and_board_costs                               1133 non-null   float64
 24  Room_costs                                         1133 non-null   float64
 25  Board_costs                                        1133 non-null   float64
 26  Additional_fees                                    1133 non-null   float64
 27  Estimated_book_costs                               1133 non-null   float64
 28  Estimated_personal_spending                        1133 non-null   float64
 29  Pct_of_faculty_with_PhD                            1133 non-null   float64
 30  Pct_of_faculty_with_terminal_degree                1133 non-null   float64
 31  Student_and_faculty_ratio                          1133 non-null   float64
 32  Pct_alumni_who_donate                              1133 non-null   float64
 33  Instructional_expenditure_per_student              1133 non-null   float64
 34  Graduation_rate                                    1133 non-null   float64
 35  College_name.y                                     1133 non-null   object
 36  State                                              1133 non-null   object
 37  Type                                               1133 non-null   object
 38  Average_salary_full_professors                     1133 non-null   float64
 39  Average_salary_associate_professors                1133 non-null   float64
 40  Average_salary_assistant_professors                1133 non-null   float64
 41  Average_salary_all_ranks                           1133 non-null   int64
 42  Average_compensation_full_professors               1133 non-null   float64
 43  Average_compensation_associate_professors          1133 non-null   float64
 44  Average_compensation_assistant_professors          1133 non-null   float64
 45  Average_compensation_all_ranks                     1133 non-null   int64
 46  Number_of_full_professors                          1133 non-null   int64
 47  Number_of_associate_professors                     1133 non-null   int64
 48  Number_of_assistant_professors                     1133 non-null   int64
 49  Number_of_instructors                              1133 non-null   int64
 50  Number_of_faculty_all_ranks                        1133 non-null   int64
dtypes: float64(37), int64(9), object(5)
memory usage: 451.6+ KB

We now have no null values, which is great! Our data is clean, so let's explore the cleaned data further.
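
Besides reading the "Non-Null Count" column of `df.info()`, the absence of nulls can be checked directly by counting them. This is a minimal sketch on a hypothetical frame (`toy`, with made-up values), not a step from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in column "a".
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Per-column count of missing entries before imputation.
missing_before = toy.isna().sum()

# Same mean-fill strategy as in the notebook.
toy = toy.fillna(toy.mean(numeric_only=True))

# Per-column count after imputation; every entry should now be zero.
missing_after = toy.isna().sum()

print(missing_before.to_dict())
print(missing_after.to_dict())
```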


In [8]:

# Return information about our dataset columns/features:


df.describe()

Out[8]:

       FICE          Public_indicator  Average_Math_SAT_score  Average_Verbal_SAT_score  A
count  1133.000000   1133.00000        1133.000000             1133.000000
mean   2955.491615   1.61165           512.605144              465.243570
std    2136.044239   0.48759           51.038802               43.852767
min    1002.000000   1.00000           320.000000              280.000000
25%    1893.000000   1.00000           496.000000              450.000000
50%    2638.000000   2.00000           512.605144              465.243570
75%    3406.000000   2.00000           520.000000              468.000000
max    29261.000000  2.00000           750.000000              665.000000

8 rows × 46 columns


In [9]:

# Return the number of unique values in each column:


df.nunique()

Out[9]:

FICE 1132
College_name.x 1110
States 51
Public_indicator 2
Average_Math_SAT_score 227
Average_Verbal_SAT_score 206
Average_Combined_SAT_score 315
Average_ACT_score 17
First_quartile_Math_SAT 83
Third_quartile_Math_SAT 80
First_quartile_Verbal_SAT 65
Third_quartile_Verbal_SAT 82
First_quartile_ACT 21
Third_quartile_ACT 20
Number_applications_received 1007
Number_applicants_accepted 974
Number_new_students_enrolled 804
new_students_from_top_ten_percent_HS_class 89
students_from_top_twenty_five_percent_of_HS_class 91
Number_fulltime_undergraduates 1018
Number_parttime_undergraduates 809
In_state_tuition 850
Out_state_tuition 868
Room_and_board_costs 735
Room_costs 548
Board_costs 434
Additional_fees 410
Estimated_book_costs 157
Estimated_personal_spending 376
Pct_of_faculty_with_PhD 88
Pct_of_faculty_with_terminal_degree 75
Student_and_faculty_ratio 197
Pct_alumni_who_donate 62
Instructional_expenditure_per_student 1051
Graduation_rate 89
College_name.y 1112
State 52
Type 4
Average_salary_full_professors 428
Average_salary_associate_professors 303
Average_salary_assistant_professors 235
Average_salary_all_ranks 343
Average_compensation_full_professors 486
Average_compensation_associate_professors 373
Average_compensation_assistant_professors 305
Average_compensation_all_ranks 431
Number_of_full_professors 298
Number_of_associate_professors 255
Number_of_assistant_professors 241
Number_of_instructors 83
Number_of_faculty_all_ranks 493
dtype: int64


In [10]:

# Return the counts of all the categorical values in the "Type" column:
df["Type"].value_counts()

Out[10]:

IIB 598
IIA 356
I 178
VIIB 1
Name: Type, dtype: int64
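
One category above occurs only once. A hedged sketch of how such rare categories could be listed programmatically follows; the counts are copied from the output above, while the `min_rows` cut-off is a hypothetical choice and not something the notebook defines.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
type_counts = pd.Series({"IIB": 598, "IIA": 356, "I": 178, "VIIB": 1}, name="Type")

# Hypothetical threshold below which a category is treated as rare.
min_rows = 5

# Boolean mask keeps only the categories with fewer rows than the threshold.
rare = type_counts[type_counts < min_rows]

print(rare.index.tolist())
```

A one-row category becomes a dummy column that is almost entirely zero after one-hot encoding, so it is worth noticing before the encoding step.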

In [11]:

# Drop every categorical column except "Type", which we convert to numerical columns next:
df.drop(["College_name.x", "States", "College_name.y", "State"], axis=1, inplace=True)

In [12]:

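# Convert "Type" to numerical (one-hot) columns.
# dtype="uint8" keeps the indicator columns numeric on recent pandas versions:
df = pd.get_dummies(df, columns=["Type"], dtype="uint8")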
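
What the encoding step does can be seen on a small example. This is a sketch on a hypothetical three-row frame (`toy`); naming the column explicitly and fixing the dtype are choices made here for illustration.

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical column.
toy = pd.DataFrame({"FICE": [1, 2, 3], "Type": ["I", "IIA", "IIB"]})

# One indicator column per category; the original "Type" column is dropped.
encoded = pd.get_dummies(toy, columns=["Type"], dtype="uint8")

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one indicator set to 1, so the indicator columns of a row always sum to 1.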


In [13]:

# Check that our "Type" column has been replaced with numerical columns for "Type":
df.columns.tolist()

Out[13]:

['FICE',
'Public_indicator',
'Average_Math_SAT_score',
'Average_Verbal_SAT_score',
'Average_Combined_SAT_score',
'Average_ACT_score',
'First_quartile_Math_SAT',
'Third_quartile_Math_SAT',
'First_quartile_Verbal_SAT',
'Third_quartile_Verbal_SAT',
'First_quartile_ACT',
'Third_quartile_ACT',
'Number_applications_received',
'Number_applicants_accepted',
'Number_new_students_enrolled',
'new_students_from_top_ten_percent_HS_class',
'students_from_top_twenty_five_percent_of_HS_class',
'Number_fulltime_undergraduates',
'Number_parttime_undergraduates',
'In_state_tuition',
'Out_state_tuition',
'Room_and_board_costs',
'Room_costs',
'Board_costs',
'Additional_fees',
'Estimated_book_costs',
'Estimated_personal_spending',
'Pct_of_faculty_with_PhD',
'Pct_of_faculty_with_terminal_degree',
'Student_and_faculty_ratio',
'Pct_alumni_who_donate',
'Instructional_expenditure_per_student',
'Graduation_rate',
'Average_salary_full_professors',
'Average_salary_associate_professors',
'Average_salary_assistant_professors',
'Average_salary_all_ranks',
'Average_compensation_full_professors',
'Average_compensation_associate_professors',
'Average_compensation_assistant_professors',
'Average_compensation_all_ranks',
'Number_of_full_professors',
'Number_of_associate_professors',
'Number_of_assistant_professors',
'Number_of_instructors',
'Number_of_faculty_all_ranks',
'Type_I',
'Type_IIA',
'Type_IIB',
'Type_VIIB']


In [14]:

df.head()

Out[14]:

   FICE  Public_indicator  Average_Math_SAT_score  Average_Verbal_SAT_score  Average_Com
0  1002  1                 512.605144              465.24357
1  1004  1                 512.605144              465.24357
2  1009  1                 575.000000              501.00000
3  1012  2                 575.000000              525.00000
4  1016  1                 512.605144              465.24357

5 rows × 50 columns

In [15]:

# Let's create an intuitive dataset that keeps only the features we think are significant:
labels_to_drop = ['First_quartile_Math_SAT',
'Third_quartile_Math_SAT',
'First_quartile_Verbal_SAT',
'Third_quartile_Verbal_SAT',
'First_quartile_ACT',
'Third_quartile_ACT',
'Average_salary_full_professors',
'Average_salary_associate_professors',
'Average_salary_assistant_professors',
'Average_compensation_full_professors',
'Average_compensation_associate_professors',
'Average_compensation_assistant_professors',
'Number_of_full_professors',
'Number_of_associate_professors',
'Number_of_assistant_professors',
'Number_of_instructors',
'Pct_alumni_who_donate'
]
df_intuitive = df.drop(labels=labels_to_drop, axis=1)
df_intuitive.to_csv('intuitive_data.txt', header=True, index=None, sep=',')


In [16]:

# Let's visualize our correlation matrix using a heatmap:


plt.figure(figsize=(12,12))
sns.heatmap(df.corr(), cmap="coolwarm")
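# The `quality` argument only ever applied to JPEG output and is no longer accepted by recent Matplotlib releases, so it is omitted:
plt.savefig("correlation matrix", dpi=300, bbox_inches="tight")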

Let's examine the explanatory variables that have a strong correlation (positive or negative) with Graduation_rate.
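
Reading the strongest correlations off a heatmap by eye is error-prone, so the correlations with the target could also be ranked numerically. The sketch below does this on a hypothetical five-row frame (the values are invented for illustration and are not rows of the dataset); only the column names are borrowed from the notebook.

```python
import pandas as pd

# Hypothetical data: tuition rises with graduation rate, the ratio falls.
toy = pd.DataFrame({
    "Graduation_rate": [40.0, 55.0, 60.0, 75.0, 90.0],
    "In_state_tuition": [2000.0, 5000.0, 7000.0, 11000.0, 15000.0],
    "Student_and_faculty_ratio": [22.0, 18.0, 17.0, 12.0, 9.0],
})

# Pearson correlation of every feature with the target, target itself removed.
corr_with_target = toy.corr()["Graduation_rate"].drop("Graduation_rate")

# Order the features by absolute correlation, strongest first, keeping the sign.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)

print(ranked)
```

Sorting by absolute value keeps strongly negative features near the top instead of pushing them to the bottom of the list.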


In [17]:

# We see a positive correlation here:


sns.lmplot(x="In_state_tuition", y="Graduation_rate", data=df)
plt.title("Graduation Rate vs In State Tuition")
plt.savefig("grad_rate vs in_state_tuition", dpi=300, bbox_inches="tight")

In [18]:

# We see a positive correlation here:


sns.lmplot(x="Out_state_tuition", y="Graduation_rate", data=df)
plt.title("Graduation Rate vs Out State Tuition")
plt.savefig("grad_rate vs out_state_tuition", dpi=300, bbox_inches="tight")


In [19]:

# We see a negative correlation here:


sns.lmplot(x="Number_parttime_undergraduates", y="Graduation_rate", data=df)
plt.title("Graduation Rate vs Number of part time undergraduates")
plt.savefig("grad_rate vs Number_parttime_undergraduate", dpi=300, bbox_inches="tight")

In [20]:

# We see a negative correlation here:


sns.lmplot(x="Student_and_faculty_ratio", y="Graduation_rate", data=df)
plt.title("Graduation Rate vs Student and Faculty Ratio")
plt.savefig("grad_rate vs Student_and_faculty_ratio", dpi=300, bbox_inches="tight")


In [21]:

# Our data apparently contains some outliers; let's proceed to remove them:
plt.figure(figsize=(12,12))
sns.boxplot(x="Public_indicator", y="Graduation_rate", data=df)
plt.savefig("grad_rate vs public_indicator", dpi=300, bbox_inches="tight")


In [22]:

# We now remove every row that contains a value 3 or more standard deviations away from its column mean:
df = df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 832 entries, 1 to 1104
Data columns (total 50 columns):
 #   Column                                             Non-Null Count  Dtype
---  ------                                             --------------  -----
 0   FICE                                               832 non-null    int64
 1   Public_indicator                                   832 non-null    int64
 2   Average_Math_SAT_score                             832 non-null    float64
 3   Average_Verbal_SAT_score                           832 non-null    float64
 4   Average_Combined_SAT_score                         832 non-null    float64
 5   Average_ACT_score                                  832 non-null    float64
 6   First_quartile_Math_SAT                            832 non-null    float64
 7   Third_quartile_Math_SAT                            832 non-null    float64
 8   First_quartile_Verbal_SAT                          832 non-null    float64
 9   Third_quartile_Verbal_SAT                          832 non-null    float64
 10  First_quartile_ACT                                 832 non-null    float64
 11  Third_quartile_ACT                                 832 non-null    float64
 12  Number_applications_received                       832 non-null    float64
 13  Number_applicants_accepted                         832 non-null    float64
 14  Number_new_students_enrolled                       832 non-null    float64
 15  new_students_from_top_ten_percent_HS_class         832 non-null    float64
 16  students_from_top_twenty_five_percent_of_HS_class  832 non-null    float64
 17  Number_fulltime_undergraduates                     832 non-null    float64
 18  Number_parttime_undergraduates                     832 non-null    float64
 19  In_state_tuition                                   832 non-null    float64
 20  Out_state_tuition                                  832 non-null    float64
 21  Room_and_board_costs                               832 non-null    float64
 22  Room_costs                                         832 non-null    float64
 23  Board_costs                                        832 non-null    float64
 24  Additional_fees                                    832 non-null    float64
 25  Estimated_book_costs                               832 non-null    float64
 26  Estimated_personal_spending                        832 non-null    float64
 27  Pct_of_faculty_with_PhD                            832 non-null    float64
 28  Pct_of_faculty_with_terminal_degree                832 non-null    float64
 29  Student_and_faculty_ratio                          832 non-null    float64
 30  Pct_alumni_who_donate                              832 non-null    float64
 31  Instructional_expenditure_per_student              832 non-null    float64
 32  Graduation_rate                                    832 non-null    float64
 33  Average_salary_full_professors                     832 non-null    float64
 34  Average_salary_associate_professors                832 non-null    float64
 35  Average_salary_assistant_professors                832 non-null    float64
 36  Average_salary_all_ranks                           832 non-null    int64
 37  Average_compensation_full_professors               832 non-null    float64
 38  Average_compensation_associate_professors          832 non-null    float64
 39  Average_compensation_assistant_professors          832 non-null    float64
 40  Average_compensation_all_ranks                     832 non-null    int64
 41  Number_of_full_professors                          832 non-null    int64
 42  Number_of_associate_professors                     832 non-null    int64
 43  Number_of_assistant_professors                     832 non-null    int64
 44  Number_of_instructors                              832 non-null    int64
 45  Number_of_faculty_all_ranks                        832 non-null    int64
 46  Type_I                                             832 non-null    uint8
 47  Type_IIA                                           832 non-null    uint8
 48  Type_IIB                                           832 non-null    uint8
 49  Type_VIIB                                          832 non-null    uint8
dtypes: float64(37), int64(9), uint8(4)
memory usage: 308.8 KB
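
The filter in In [22] keeps a row only if every one of its values lies within 3 standard deviations of its column mean, which is why the frame shrinks from 1133 to 832 rows. A minimal sketch of the same rule on a hypothetical two-column frame (invented values, one planted outlier) shows the mechanics:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "x" holds 19 ordinary values and one extreme one.
toy = pd.DataFrame({
    "x": [10.0] * 19 + [1000.0],
    "y": np.arange(20, dtype=float),
})

# Column-wise z-scores (pandas uses the sample standard deviation, ddof=1).
z = (toy - toy.mean()) / toy.std()

# Keep a row only if all of its z-scores are below 3 in absolute value.
kept = toy[(z.abs() < 3).all(axis=1)]

print(len(toy), "->", len(kept))
```

Because the test is applied to all columns at once, a single extreme value in any column removes the whole row; with 50 columns this can discard a sizeable share of the data, as the drop of 301 rows above illustrates.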


In [23]:

df.describe()

Out[23]:

       FICE         Public_indicator  Average_Math_SAT_score  Average_Verbal_SAT_score  A
count  832.000000   832.000000        832.000000              832.000000
mean   2752.307692  1.639423          507.004439              461.343310
std    1192.384792  0.480457          37.914697               32.679401
min    1004.000000  1.000000          380.000000              350.000000
25%    1939.250000  1.000000          495.000000              450.000000
50%    2653.500000  2.000000          512.605144              465.243570
75%    3388.250000  2.000000          512.605144              465.243570
max    9345.000000  2.000000          655.000000              579.000000

8 rows × 50 columns
