【天池智慧海洋建设】Topline Source Code: A Study of the Feature Engineering
Team: 大白
Link:
https://github.com/Ai-Light/2020-zhihuihaiyang
Preface
An open-source study of the topline code, focusing only on the feature engineering part: its inputs, outputs, purpose, underlying ideas, and some personal interpretation.
I Data
Description of the raw data:
渔船ID (ship ID): unique identifier of the fishing vessel; the result file is keyed by this ID
x: x-coordinate of the vessel in a planar coordinate system
y: y-coordinate of the vessel in a planar coordinate system
速度 (speed): the vessel's speed at the current moment, in knots
方向 (direction): the vessel's heading at the current moment, in degrees
time: timestamp of the report, formatted as month-day hour:minute
type: the vessel's label, one of three fishing methods (purse seine 围网, gill net 刺网, trawl 拖网)
II Feature Engineering
Approach:
Background research shows that trawling, purse seining, and gill netting are fishing methods designed around different routes, different speeds, and different working hours, depending on the fish schools and waters being targeted. Based on these characteristics, the team extracts features from three angles and trains a LightGBM GBDT model on them; no model ensembling is used.
2.0 Basic Feature Engineering
def extract_dt(df):
    df['time'] = pd.to_datetime(df['time'], format='%m%d %H:%M:%S')
    df['date'] = df['time'].dt.date
    df['hour'] = df['time'].dt.hour
    # Euclidean distance to a fixed reference point
    df['x_dis_diff'] = (df['x'] - 6165599).abs()
    df['y_dis_diff'] = (df['y'] - 5202660).abs()
    df['base_dis_diff'] = ((df['x_dis_diff'] ** 2) + (df['y_dis_diff'] ** 2)) ** 0.5
    del df['x_dis_diff'], df['y_dis_diff']
    df['x'] = df['x'] / 1e6
    df['y'] = df['y'] / 1e6
    # day/night flag: 1 for daytime, 0 for night
    df['day_nig'] = 0
    df.loc[(df['hour'] > 5) & (df['hour'] < 20), 'day_nig'] = 1
    return df
Standard datetime conversion: parse the timestamp, then extract the date and hour. Hours strictly between 5 and 20 (i.e. 6:00-19:59) get day_nig = 1 for daytime; all other hours keep 0 for night. The function also computes base_dis_diff, the Euclidean distance to the fixed reference point (6165599, 5202660), presumably a port or base location, and rescales x and y by 1e6.
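A quick standalone check of the core of this logic on a toy frame (the timestamps and coordinates below are invented; the reference point is the one hard-coded in extract_dt):

```python
import pandas as pd

# Invented sample rows in the competition's "%m%d %H:%M:%S" format.
df = pd.DataFrame({
    'x': [6165599.0, 6265599.0],
    'y': [5202660.0, 5302660.0],
    'time': ['0101 03:15:00', '0101 12:30:00'],
})
df['time'] = pd.to_datetime(df['time'], format='%m%d %H:%M:%S')
df['hour'] = df['time'].dt.hour
# distance to the hard-coded reference point
df['base_dis_diff'] = (((df['x'] - 6165599) ** 2) + ((df['y'] - 5202660) ** 2)) ** 0.5
# 1 for daytime (hour 6-19), 0 for night
df['day_nig'] = ((df['hour'] > 5) & (df['hour'] < 20)).astype(int)
print(df[['hour', 'day_nig', 'base_dis_diff']])
```
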
2.1 Speed-Based Split and Basic Statistical Features
The data set is first split into a zero-speed part and a non-zero-speed part:
data = train
data_label = train_label
data_1 = data[data['speed']==0]
data_2 = data[data['speed']!=0]
(flag = "0" marks the stationary subset, on which direction statistics are computed; flag = "1" marks the moving subset, on which both speed and direction statistics are computed)
Then, with ship as the grouping key, the extract_feature function builds aggregate statistics of the target speed (or direction), along with the max-min ranges in the x and y directions, the bounding-box area, the ratio of the box's side lengths (a slope), and statistics of hour.
The group_feature function inside extract_feature is built on this basic aggregation pattern:
df.groupby(key)[target].agg(agg_dict).reset_index()
key: the grouping key
target: the column being aggregated
agg_dict: the aggregation methods (functions) to apply, keyed by output column name (note that pandas >= 1.0 no longer accepts a renaming dict here; unpack it as keyword arguments instead, .agg(**agg_dict))
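A minimal sketch of this pattern on invented data, using the keyword-unpacking form that current pandas requires:

```python
import pandas as pd

# Hypothetical ship IDs and speeds.
df = pd.DataFrame({
    'ship': [1, 1, 1, 2, 2],
    'speed': [0.0, 3.5, 4.5, 2.0, 2.0],
})
# One renamed output column per statistic, e.g. speed_max_1, speed_mean_1.
agg_dict = {'{}_{}_{}'.format('speed', ag, '1'): ag for ag in ['max', 'mean']}
# pandas >= 1.0 removed dict-based renaming on a SeriesGroupBy,
# so unpack the dict as named-aggregation keyword arguments.
t = df.groupby('ship')['speed'].agg(**agg_dict).reset_index()
print(t)
```
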
The complete code is as follows.
def group_feature(df, key, target, aggs, flag):
    agg_dict = {}
    for ag in aggs:
        agg_dict['{}_{}_{}'.format(target, ag, flag)] = ag
    print(agg_dict)
    # pandas >= 1.0 removed dict-based renaming, so unpack the dict
    # as named-aggregation keyword arguments
    t = df.groupby(key)[target].agg(**agg_dict).reset_index()
    return t
def extract_feature(df, train, flag):
    '''
    aggregate statistics per ship
    '''
    if (flag == 'on_night') or (flag == 'on_day'):
        t = group_feature(df, 'ship', 'speed', ['max', 'mean', 'median', 'std', 'skew'], flag)
        train = pd.merge(train, t, on='ship', how='left')
        # return train
    if flag == "0":
        # stationary subset: only direction statistics
        t = group_feature(df, 'ship', 'direction', ['max', 'median', 'mean', 'std', 'skew'], flag)
        train = pd.merge(train, t, on='ship', how='left')
    elif flag == "1":
        # moving subset: speed and direction statistics
        t = group_feature(df, 'ship', 'speed', ['max', 'mean', 'median', 'std', 'skew'], flag)
        train = pd.merge(train, t, on='ship', how='left')
        t = group_feature(df, 'ship', 'direction', ['max', 'median', 'mean', 'std', 'skew'], flag)
        train = pd.merge(train, t, on='ship', how='left')
    # number of distinct speed / direction values per ship
    hour_nunique = df.groupby('ship')['speed'].nunique().to_dict()
    train['speed_nunique_{}'.format(flag)] = train['ship'].map(hour_nunique)
    hour_nunique = df.groupby('ship')['direction'].nunique().to_dict()
    train['direction_nunique_{}'.format(flag)] = train['ship'].map(hour_nunique)
    # position and distance-to-reference statistics
    t = group_feature(df, 'ship', 'x', ['max', 'min', 'mean', 'median', 'std', 'skew'], flag)
    train = pd.merge(train, t, on='ship', how='left')
    t = group_feature(df, 'ship', 'y', ['max', 'min', 'mean', 'median', 'std', 'skew'], flag)
    train = pd.merge(train, t, on='ship', how='left')
    t = group_feature(df, 'ship', 'base_dis_diff', ['max', 'min', 'mean', 'std', 'skew'], flag)
    train = pd.merge(train, t, on='ship', how='left')
    # bounding-box ranges, cross ranges, slope, and area
    train['x_max_x_min_{}'.format(flag)] = train['x_max_{}'.format(flag)] - train['x_min_{}'.format(flag)]
    train['y_max_y_min_{}'.format(flag)] = train['y_max_{}'.format(flag)] - train['y_min_{}'.format(flag)]
    train['y_max_x_min_{}'.format(flag)] = train['y_max_{}'.format(flag)] - train['x_min_{}'.format(flag)]
    train['x_max_y_min_{}'.format(flag)] = train['x_max_{}'.format(flag)] - train['y_min_{}'.format(flag)]
    train['slope_{}'.format(flag)] = train['y_max_y_min_{}'.format(flag)] / np.where(train['x_max_x_min_{}'.format(flag)] == 0, 0.001, train['x_max_x_min_{}'.format(flag)])
    train['area_{}'.format(flag)] = train['x_max_x_min_{}'.format(flag)] * train['y_max_y_min_{}'.format(flag)]
    # most frequent reporting hour per ship
    mode_hour = df.groupby('ship')['hour'].agg(lambda x: x.value_counts().index[0]).to_dict()
    train['mode_hour_{}'.format(flag)] = train['ship'].map(mode_hour)
    train['slope_median_{}'.format(flag)] = train['y_median_{}'.format(flag)] / np.where(train['x_median_{}'.format(flag)] == 0, 0.001, train['x_median_{}'.format(flag)])
    return train
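The mode_hour line leans on the fact that value_counts() sorts by descending frequency, so .index[0] is the modal value per group. A quick standalone check on invented ship IDs and hours:

```python
import pandas as pd

# Toy data: ship 1 reports mostly at hour 6, ship 2 only at hour 22.
df = pd.DataFrame({
    'ship': [1, 1, 1, 2, 2],
    'hour': [6, 6, 14, 22, 22],
})
# value_counts() is frequency-sorted, so index[0] is the most common hour
mode_hour = df.groupby('ship')['hour'].agg(lambda x: x.value_counts().index[0]).to_dict()
print(mode_hour)
```
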
2.2 Time-Based Split
Likewise, the data is split into day and night by time, and the same basic statistical features are built; the method is identical to the speed-based split.
# statistics on the stationary ("0") and moving ("1") subsets
data_label = extract_feature(data_1, data_label, "0")
data_label = extract_feature(data_2, data_label, "1")
# re-split by day and night, then repeat
data_1 = data[data['day_nig'] == 0]
data_2 = data[data['day_nig'] == 1]
data_label = extract_feature(data_1, data_label, "on_night")
data_label = extract_feature(data_2, data_label, "on_day")
# ratio features: moving vs. stationary
first = "0"
second = "1"
data_label['direction_median_ratio'] = data_label['direction_median_{}'.format(second)] / data_label['direction_median_{}'.format(first)]
data_label['slope_ratio'] = data_label['slope_{}'.format(second)] / data_label['slope_{}'.format(first)]
data_label['slope_mean_ratio'] = data_label['slope_median_{}'.format(second)] / data_label['slope_median_{}'.format(first)]
# ratio features: day vs. night
first = "on_night"
second = "on_day"
data_label['speed_median_ratio'] = data_label['speed_median_{}'.format(second)] / data_label['speed_median_{}'.format(first)]
data_label['speed_std_ratio'] = data_label['speed_std_{}'.format(second)] / data_label['speed_std_{}'.format(first)]
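A toy version of the day/night ratio features, with invented values, showing what pandas produces when a denominator is zero; LightGBM tolerates the resulting inf/NaN, but it is worth knowing they end up in the features:

```python
import pandas as pd

# Invented day/night speed medians for three hypothetical ships.
toy = pd.DataFrame({
    'ship': [1, 2, 3],
    'speed_median_on_day': [4.0, 3.0, 0.0],
    'speed_median_on_night': [2.0, 0.0, 0.0],
})
# pandas division gives inf for x/0 and NaN for 0/0
toy['speed_median_ratio'] = toy['speed_median_on_day'] / toy['speed_median_on_night']
print(toy['speed_median_ratio'].tolist())
```
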
2.3 NLP-Style Features
Here the team mainly used TfidfVectorizer and NMF, in the following steps:
First, build word frequencies.
Next, use those frequencies to pick the high-frequency words (top 100) and the stop words (words used by more than 50% of ships).
Then, treat each ship's target feature sequence as a single sentence for text processing.
Finally, vectorize the sentences with TfidfVectorizer and extract topic distributions with NMF.
The main code is as follows.
print('build word_fre')
# build word frequencies: {word: [total count, number of docs containing it]}
def word_fre(x):
    word_dict = []
    x = x.split('|')
    docs = []
    for doc in x:
        doc = doc.split()
        docs.append(doc)
        word_dict.extend(doc)
    word_dict = Counter(word_dict)
    new_word_dict = {}
    for key, value in word_dict.items():
        new_word_dict[key] = [value, 0]
    del word_dict
    del x
    for doc in docs:
        doc = Counter(doc)
        for word in doc.keys():
            new_word_dict[word][1] += 1
    return new_word_dict
self.data['word_fre'] = self.data[self.to_list].apply(word_fre)
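To make the [total count, document count] structure concrete, here is a self-contained re-implementation of the same word_fre logic, exercised on a toy '|'-separated string:

```python
from collections import Counter

# Same logic as the method above: the '|' splits documents, whitespace
# splits words; each word maps to [total count, number of docs containing it].
def word_fre(x):
    all_words = []
    docs = []
    for doc in x.split('|'):
        doc = doc.split()
        docs.append(doc)
        all_words.extend(doc)
    counts = Counter(all_words)
    result = {w: [c, 0] for w, c in counts.items()}
    for doc in docs:
        for w in Counter(doc):
            result[w][1] += 1
    return result

print(word_fre('a b|a c'))
```
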
print('build top_' + str(self.top_n))
# keep the top_n high-frequency words, ranked by (doc count, total count)
def top_100(word_dict):
    return sorted(word_dict.items(), key=lambda x: (x[1][1], x[1][0]), reverse=True)[:self.top_n]
self.data['top_' + str(self.top_n)] = self.data['word_fre'].apply(top_100)

def top_100_word(word_list):
    words = []
    for i in word_list:
        i = list(i)
        words.append(i[0])
    return words
self.data['top_' + str(self.top_n) + '_word'] = self.data['top_' + str(self.top_n)].apply(top_100_word)
print(self.data.shape)

word_list = []
for i in self.data['top_' + str(self.top_n) + '_word'].values:
    word_list.extend(i)
word_list = Counter(word_list)
word_list = sorted(word_list.items(), key=lambda x: x[1], reverse=True)
# fraction of ships whose top list contains each word
user_fre = []
for i in word_list:
    i = list(i)
    user_fre.append(i[1] / self.data[self.by_name].nunique())
# words used by more than half the ships become stop words
stop_words = []
for i, j in zip(word_list, user_fre):
    if j > 0.5:
        i = list(i)
        stop_words.append(i[0])
print('start title_feature')
# treat each merged tag list as one sentence, dropping the stop words
self.data['title_feature'] = self.data[self.to_list].apply(lambda x: x.split('|'))
self.data['title_feature'] = self.data['title_feature'].apply(lambda line: [w for w in line if w not in stop_words])
self.data['title_feature'] = self.data['title_feature'].apply(lambda x: ' '.join(x))
print('start NMF')
# vectorize the sentences with tf-idf
tfidf_vectorizer = TfidfVectorizer(ngram_range=(tf_n, tf_n))
tfidf = tfidf_vectorizer.fit_transform(self.data['title_feature'].values)
# extract each ship's topic distribution with NMF
text_nmf = NMF(n_components=self.nmf_n).fit_transform(tfidf)
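The TF-IDF + NMF step can be miniaturized end to end as follows. The token strings are invented stand-ins for a ship's discretized trajectory, and the init/max_iter settings are choices made here for stable convergence, not taken from the original code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Each "sentence" stands in for one ship's sequence of trajectory tokens.
sentences = [
    'w10 w10 w12 w12',
    'w10 w12 w12 w12',
    'w90 w95 w95 w90',
]
# unigram tf-idf, then a 2-topic non-negative factorization;
# each row of text_nmf becomes a 2-dimensional feature vector per ship
tfidf = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(sentences)
text_nmf = NMF(n_components=2, init='nndsvda', max_iter=500).fit_transform(tfidf)
print(text_nmf.shape)
```
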