使用Pandas、Numpy等工具库,完成以下数据操作
一、CSV数据读入
- 从指定路径下读取CSV数据文件,并将Loan_ID设为Index
- 数据文件train.csv在"./data/"路径下
- 打印输出前该数据集10行
import pandas as pd
import numpy as np
data=pd.read_csv("train.csv",index_col=0)
data.head(10)
|
Gender |
Married |
Dependents |
Education |
Self_Employed |
ApplicantIncome |
CoapplicantIncome |
LoanAmount |
Loan_Amount_Term |
Credit_History |
Property_Area |
Loan_Status |
Loan_ID |
|
|
|
|
|
|
|
|
|
|
|
|
LP001002 |
Male |
No |
0 |
Graduate |
No |
5849 |
0.0 |
NaN |
360.0 |
1.0 |
Urban |
Y |
LP001003 |
Male |
Yes |
1 |
Graduate |
No |
4583 |
1508.0 |
128.0 |
360.0 |
1.0 |
Rural |
N |
LP001005 |
Male |
Yes |
0 |
Graduate |
Yes |
3000 |
0.0 |
66.0 |
360.0 |
1.0 |
Urban |
Y |
LP001006 |
Male |
Yes |
0 |
Not Graduate |
No |
2583 |
2358.0 |
120.0 |
360.0 |
1.0 |
Urban |
Y |
LP001008 |
Male |
No |
0 |
Graduate |
No |
6000 |
0.0 |
141.0 |
360.0 |
1.0 |
Urban |
Y |
LP001011 |
Male |
Yes |
2 |
Graduate |
Yes |
5417 |
4196.0 |
267.0 |
360.0 |
1.0 |
Urban |
Y |
LP001013 |
Male |
Yes |
0 |
Not Graduate |
No |
2333 |
1516.0 |
95.0 |
360.0 |
1.0 |
Urban |
Y |
LP001014 |
Male |
Yes |
3+ |
Graduate |
No |
3036 |
2504.0 |
158.0 |
360.0 |
0.0 |
Semiurban |
N |
LP001018 |
Male |
Yes |
2 |
Graduate |
No |
4006 |
1526.0 |
168.0 |
360.0 |
1.0 |
Urban |
Y |
LP001020 |
Male |
Yes |
1 |
Graduate |
No |
12841 |
10968.0 |
349.0 |
360.0 |
1.0 |
Semiurban |
N |
二、数据选择
- 从数据集中得到“所有没有毕业”(Education: Not Graduate)并且“获得贷款”(Loan_Status: Y)的“女性”(Gender: Female),并输出“性别”(Gender)、“教育状况”(Education)及“贷款状态”(Loan_Status)。
|
Gender |
Education |
Loan_Status |
Loan_ID |
|
|
|
LP001155 |
Female |
Not Graduate |
Y |
LP001669 |
Female |
Not Graduate |
Y |
LP001692 |
Female |
Not Graduate |
Y |
LP001908 |
Female |
Not Graduate |
Y |
LP002300 |
Female |
Not Graduate |
Y |
LP002314 |
Female |
Not Graduate |
Y |
LP002407 |
Female |
Not Graduate |
Y |
LP002489 |
Female |
Not Graduate |
Y |
LP002502 |
Female |
Not Graduate |
Y |
LP002534 |
Female |
Not Graduate |
Y |
LP002582 |
Female |
Not Graduate |
Y |
LP002731 |
Female |
Not Graduate |
Y |
LP002757 |
Female |
Not Graduate |
Y |
LP002917 |
Female |
Not Graduate |
Y |
三、使用apply对数据集应用自定义函数
def num_missing(x):
return sum(x.isnull())
3.1 使用apply函数将num_missing函数用于统计数据集的每列缺失值数量
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64
3.2 使用apply函数将num_missing函数用于统计数据集每行缺失值数量,并打印前10行
Loan_ID
LP001002 1
LP001003 0
LP001005 0
LP001006 0
LP001008 0
LP001011 0
LP001013 0
LP001014 0
LP001018 0
LP001020 0
dtype: int64
四、缺失值填充
4.1 对于Gender、Married、Self_Employed三个因子型变量,使用各自最常见的因子进行缺失值填充
4.2 对于LoanAmount变量进行缺失值填充处理
- 按照“Gender”、“Married”及“Self_Employed”的组合下的每个组群进行LoanAmount变量的均值统计
- 按照每组统计得到的平均值,对“LoanAmount”中缺失值进行填充
五、数据透视表
基于data数据,得到下表:
Loan_Status |
N |
Y |
ALL |
Credit_History |
|
|
|
0.0 |
num |
num |
num |
1.0 |
num |
num |
num |
All |
num |
num |
num |
其中num代表统计数量。
pd.crosstab(data['Credit_History'],data['Loan_Status'],margins=True)
Loan_Status |
N |
Y |
All |
Credit_History |
|
|
|
0.0 |
82 |
7 |
89 |
1.0 |
97 |
378 |
475 |
All |
179 |
385 |
564 |
六、合并数据集
- 将prop_rates数据集与data数据集合并
- 基于合并后的数据集,按照“Property_Area”、“rates”的组合下的每个组群下Credit_History变量的样本数量统计
prop_rates = pd.DataFrame([1000, 5000, 12000], index=['Rural','Semiurban','Urban'],columns=['rates'])
prop_rates
|
rates |
Rural |
1000 |
Semiurban |
5000 |
Urban |
12000 |
data1=data.iloc[:5,:2]
data1