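Converting categorical data into a dummy-variable matrix with get_dummies
Load the data, then convert the categorical `dishes_name` column into dummy variables.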
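    import numpy as np
    import pandas as pd

    detail = pd.read_excel('./meal_order_detail.xlsx')
    # print(detail.loc[:, 'dishes_name'])

    # One indicator column per distinct dish name, labeled '菜品:<name>' ('菜品' means 'dish')
    res = pd.get_dummies(detail.loc[:, 'dishes_name'], prefix='菜品', prefix_sep=':')
    print(res)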
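As a minimal sketch of what `get_dummies` produces, the snippet below runs it on a tiny hand-made Series instead of the Excel file. The dish names and the `dish` prefix are made up for illustration and are not taken from the original data.

```python
import pandas as pd

# Hypothetical stand-in for the 'dishes_name' column.
dishes = pd.Series(['rice', 'soup', 'rice', 'tea'], name='dishes_name')

# One indicator column per distinct category, named '<prefix><prefix_sep><category>'.
dummies = pd.get_dummies(dishes, prefix='dish', prefix_sep=':')

print(dummies.columns.tolist())  # ['dish:rice', 'dish:soup', 'dish:tea']
print(dummies.shape)             # (4, 3)
```

Each row has exactly one active indicator, because every order line names exactly one dish.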
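The categorical data has now been converted into numeric data.
Discretizing continuous data means grouping it: each specific value is replaced by the interval it falls into. In `pd.cut`, an integer `bins` sets the number of equal-width groups, and `include_lowest=True` makes the first interval include the minimum value of the data.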
The column being discretized is `amounts`.
Equal-width discretization into five groups:
    res_cut = pd.cut(detail.loc[:, 'amounts'], bins=5, include_lowest=True)
    print(res_cut)
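To see what an integer `bins` does without the Excel file, here is a small sketch on made-up amounts. The values are hypothetical; the point is only that `pd.cut` returns a categorical Series with five interval categories and that both the smallest and the largest value land in a bin.

```python
import pandas as pd

# Made-up amounts standing in for detail['amounts'].
amounts = pd.Series([1, 10, 25, 40, 55, 70, 85, 100])

# Five equal-width intervals spanning the observed range of the data.
binned = pd.cut(amounts, bins=5, include_lowest=True)

print(binned.cat.categories)            # the five intervals
print(binned.value_counts(sort=False))  # rows per interval, in interval order
```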
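    # Custom equal-width grouping: build the bin edges by hand
    ptp = detail.loc[:, 'amounts'].max() - detail.loc[:, 'amounts'].min()
    step = np.ceil(ptp / 5)
    # Stopping at max + step ensures the last edge is at or above the maximum
    bins = np.arange(detail.loc[:, 'amounts'].min(), detail.loc[:, 'amounts'].max() + step, step)
    res_cut = pd.cut(detail.loc[:, 'amounts'], bins=bins, include_lowest=True)
    print(res_cut)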
Equal-frequency grouping, using the quantiles [0, 0.2, 0.4, 0.6, 0.8, 1.0] as bin edges
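    # Quantile levels 0, 0.2, ..., 1.0; np.linspace avoids floating-point endpoint issues of np.arange
    bins = detail.loc[:, 'amounts'].quantile(np.linspace(0, 1, 6))
    res_cut = pd.cut(detail.loc[:, 'amounts'], bins=bins, include_lowest=True)
    print(res_cut)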
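pandas also provides `pd.qcut`, which computes the quantile edges and applies them in one call. The sketch below swaps it in for the manual quantile-plus-`pd.cut` approach and runs it on made-up data (the integers 1 to 100) so that the equal-frequency property is easy to check.

```python
import pandas as pd

# 100 distinct made-up values standing in for detail['amounts'].
amounts = pd.Series(range(1, 101))

# q=5 splits the data at the 0.2, 0.4, 0.6 and 0.8 quantiles.
by_quantile = pd.qcut(amounts, q=5)

counts = by_quantile.value_counts(sort=False)
print(counts)  # each of the five bins holds 20 values
```

With heavily repeated values, several quantiles can coincide and the bin edges stop being unique. Both `pd.cut` and `pd.qcut` accept `duplicates='drop'` to merge such edges instead of raising an error.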
    # Custom grouping with explicit bin edges
    bins = [0, 40, 80, 120, 160, 200]
    res_cut = pd.cut(detail.loc[:, 'amounts'], bins=bins, include_lowest=True)
    print(res_cut)
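With explicit edges, two details matter: intervals are right-closed by default, and values outside the edges become `NaN`. A minimal sketch on made-up amounts, using the same edges as above:

```python
import pandas as pd

# Made-up amounts, including one value above the last edge.
amounts = pd.Series([0, 35, 40, 150, 250])
bins = [0, 40, 80, 120, 160, 200]

res = pd.cut(amounts, bins=bins, include_lowest=True)
print(res.cat.codes.tolist())  # [0, 0, 0, 3, -1]; -1 marks the out-of-range 250 (NaN)

# Without include_lowest, the first interval is (0, 40], so the value 0 falls outside it.
strict = pd.cut(amounts, bins=bins)
print(strict.isna().tolist())  # [True, False, False, False, True]
```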
Converting continuous data into dummy variables
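    bins = [0, 40, 80, 120, 160, 200]
    res_cut = pd.cut(detail.loc[:, 'amounts'], bins=bins, include_lowest=True)
    # Rows per interval (Series.value_counts replaces the deprecated pd.value_counts)
    res_counts = res_cut.value_counts()
    # One indicator column per interval, labeled '区间:<interval>' ('区间' means 'interval')
    res_dum = pd.get_dummies(res_cut, prefix='区间', prefix_sep=':')
    print(res_dum)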
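The two steps chain naturally: `pd.cut` turns each number into an interval category, and `get_dummies` turns each interval into an indicator column. Here is a self-contained sketch with made-up amounts and a hypothetical `bin` prefix:

```python
import pandas as pd

# Made-up amounts; every one of the five intervals receives at least one value.
amounts = pd.Series([5, 45, 50, 100, 130, 199])
bins = [0, 40, 80, 120, 160, 200]

# Step 1: continuous value -> interval category.
intervals = pd.cut(amounts, bins=bins, include_lowest=True)

# Step 2: interval category -> one indicator column per interval.
dummies = pd.get_dummies(intervals, prefix='bin', prefix_sep=':')

print(dummies.shape)           # (6, 5)
print(dummies.sum().tolist())  # [1, 2, 1, 1, 1] rows per interval
```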