[Tianchi Smart Ocean] Topline Source Code: Feature Engineering Notes (Team Dabai)

This post walks through the team's feature-engineering methods for the Tianchi Smart Ocean competition, covering features extracted from time, speed, and NLP techniques. Data preprocessing such as time-based splits and speed statistics yields a rich set of per-vessel features. In addition, TfidfVectorizer and NMF are applied to text-encoded tracks to extract latent patterns of vessel activity. The resulting features are then fed to a LightGBM gradient-boosted decision tree model.



Team: Dabai (大白)
Link:
https://github.com/Ai-Light/2020-zhihuihaiyang


Preface

These are study notes on the open-sourced topline code, focusing only on the feature-engineering part: the inputs, outputs, purpose, underlying idea, and some personal understanding of each step.

I Data


Description of the raw data fields:

ship ID (渔船ID): unique identifier of the vessel; the submission file is keyed on this ID
x: the vessel's x coordinate in a planar coordinate system
y: the vessel's y coordinate in a planar coordinate system
speed (速度): vessel speed at the current timestamp, in knots
direction (方向): vessel heading at the current timestamp, in degrees
time: report timestamp, formatted as "MMDD HH:MM"
type: vessel label, one of three fishing methods (purse seine 围网, gillnet 刺网, trawl 拖网)

II Feature Engineering

The approach in detail:
Background research shows that trawling, purse seining, and gillnetting are fishing methods designed around different fish aggregations and waters, so vessels follow different routes, travel at different speeds, and work at different hours. Based on these characteristics we extract features along three lines and train a single LightGBM GBDT model on them, with no model ensembling.

2.0 Basic feature engineering

import pandas as pd

def extract_dt(df):
    # Parse the "MMDD HH:MM:SS" timestamps and derive date and hour
    df['time'] = pd.to_datetime(df['time'], format='%m%d %H:%M:%S')
    df['date'] = df['time'].dt.date
    df['hour'] = df['time'].dt.hour

    # Euclidean distance to a fixed reference point (6165599, 5202660)
    df['x_dis_diff'] = (df['x'] - 6165599).abs()
    df['y_dis_diff'] = (df['y'] - 5202660).abs()
    df['base_dis_diff'] = ((df['x_dis_diff']**2) + (df['y_dis_diff']**2))**0.5
    del df['x_dis_diff'], df['y_dis_diff']

    # Rescale coordinates and flag daytime records (5 < hour < 20)
    df["x"] = df["x"] / 1e6
    df["y"] = df["y"] / 1e6
    df['day_nig'] = 0
    df.loc[(df['hour'] > 5) & (df['hour'] < 20), 'day_nig'] = 1
    return df

The timestamp is parsed into a standard datetime, from which the date and hour are extracted. Records whose hour lies strictly between 5 and 20 get day_nig = 1 (daytime); all others get day_nig = 0 (night). base_dis_diff is the Euclidean distance to a fixed reference point.
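The parsing and day/night flag can be checked on a couple of toy records (the timestamps below are made up):

```python
import pandas as pd

# Two invented records in the competition's "MMDD HH:MM:SS" format
df = pd.DataFrame({'time': ['1110 06:30:00', '1110 21:15:00']})
df['time'] = pd.to_datetime(df['time'], format='%m%d %H:%M:%S')
df['hour'] = df['time'].dt.hour

# Daytime flag: 1 when 5 < hour < 20, otherwise 0 (night)
df['day_nig'] = 0
df.loc[(df['hour'] > 5) & (df['hour'] < 20), 'day_nig'] = 1
print(df['day_nig'].tolist())  # [1, 0]
```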

2.1 Speed-based split and basic statistical features

The data is first split into two subsets: records with zero speed and records with non-zero speed.

data  = train
data_label = train_label

data_1 = data[data['speed']==0]
data_2 = data[data['speed']!=0]

(flag = "0" marks statistics built on the zero-speed subset, where only direction is informative; flag = "1" marks the non-zero-speed subset, where both speed and direction statistics are built.)

Then, grouping by ship as the key, the extract_feature function builds statistics of the target speed (or direction), as well as max-min ranges in the x and y directions, the bounding-box area, the ratio of the box's edges (a slope), and hour statistics.

The group_feature function used inside extract_feature is built on this basic aggregation pattern:

df.groupby(key)[target].agg(agg_dict).reset_index()

key: the grouping key
target: the column being aggregated
agg_dict: the aggregation methods (functions) to apply
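As a minimal sketch of this groupby-agg pattern on made-up numbers (note that modern pandas no longer accepts the dict-renamer form `agg({'new_name': func})` on a single column, so a list of aggregation names plus a column rename achieves the same effect):

```python
import pandas as pd

# Hypothetical mini track table: two ships with a few speed readings each
df = pd.DataFrame({'ship': [1, 1, 1, 2, 2],
                   'speed': [0.0, 2.5, 3.5, 1.0, 1.0]})

# key='ship', target='speed'; columns follow the "<target>_<agg>_<flag>" naming
t = df.groupby('ship')['speed'].agg(['max', 'mean']).reset_index()
t.columns = ['ship', 'speed_max_1', 'speed_mean_1']
print(t)
```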

The full code is as follows:

import numpy as np
import pandas as pd

def group_feature(df, key, target, aggs, flag):
    # Aggregate `target` per `key`; columns are named "<target>_<agg>_<flag>".
    # (A list of agg names plus a rename is used because the old dict-renamer
    # form agg({'new_name': func}) was removed from pandas.)
    t = df.groupby(key)[target].agg(aggs).reset_index()
    t.columns = [key] + ['{}_{}_{}'.format(target, ag, flag) for ag in aggs]
    return t

def extract_feature(df, train, flag):
    '''
    Build per-ship statistical features and merge them into `train`.
    '''
    # Day/night subsets: speed statistics
    if (flag == 'on_night') or (flag == 'on_day'):
        t = group_feature(df, 'ship', 'speed', ['max','mean','median','std','skew'], flag)
        train = pd.merge(train, t, on='ship', how='left')

    if flag == "0":
        # Zero-speed subset: only direction statistics are informative
        t = group_feature(df, 'ship', 'direction', ['max','median','mean','std','skew'], flag)
        train = pd.merge(train, t, on='ship', how='left')
    elif flag == "1":
        # Non-zero-speed subset: speed and direction statistics plus nunique counts
        t = group_feature(df, 'ship', 'speed', ['max','mean','median','std','skew'], flag)
        train = pd.merge(train, t, on='ship', how='left')
        t = group_feature(df, 'ship', 'direction', ['max','median','mean','std','skew'], flag)
        train = pd.merge(train, t, on='ship', how='left')
        speed_nunique = df.groupby('ship')['speed'].nunique().to_dict()
        train['speed_nunique_{}'.format(flag)] = train['ship'].map(speed_nunique)
        direction_nunique = df.groupby('ship')['direction'].nunique().to_dict()
        train['direction_nunique_{}'.format(flag)] = train['ship'].map(direction_nunique)

    # Position statistics for every split
    t = group_feature(df, 'ship', 'x', ['max','min','mean','median','std','skew'], flag)
    train = pd.merge(train, t, on='ship', how='left')
    t = group_feature(df, 'ship', 'y', ['max','min','mean','median','std','skew'], flag)
    train = pd.merge(train, t, on='ship', how='left')
    t = group_feature(df, 'ship', 'base_dis_diff', ['max','min','mean','std','skew'], flag)
    train = pd.merge(train, t, on='ship', how='left')

    # Bounding-box ranges, slope and area of each ship's track
    train['x_max_x_min_{}'.format(flag)] = train['x_max_{}'.format(flag)] - train['x_min_{}'.format(flag)]
    train['y_max_y_min_{}'.format(flag)] = train['y_max_{}'.format(flag)] - train['y_min_{}'.format(flag)]
    train['y_max_x_min_{}'.format(flag)] = train['y_max_{}'.format(flag)] - train['x_min_{}'.format(flag)]
    train['x_max_y_min_{}'.format(flag)] = train['x_max_{}'.format(flag)] - train['y_min_{}'.format(flag)]
    # np.where guards against a zero-width box before dividing
    train['slope_{}'.format(flag)] = train['y_max_y_min_{}'.format(flag)] / np.where(train['x_max_x_min_{}'.format(flag)] == 0, 0.001, train['x_max_x_min_{}'.format(flag)])
    train['area_{}'.format(flag)] = train['x_max_x_min_{}'.format(flag)] * train['y_max_y_min_{}'.format(flag)]

    # Most frequent hour, and a median-based slope
    mode_hour = df.groupby('ship')['hour'].agg(lambda x: x.value_counts().index[0]).to_dict()
    train['mode_hour_{}'.format(flag)] = train['ship'].map(mode_hour)
    train['slope_median_{}'.format(flag)] = train['y_median_{}'.format(flag)] / np.where(train['x_median_{}'.format(flag)] == 0, 0.001, train['x_median_{}'.format(flag)])

    return train

2.2 Day/night time split

Likewise, the data is split into day and night by time, and the same basic statistical features are built, exactly as in the speed-based split; ratio features between the subsets are then added:

# Statistics on the zero-speed / non-zero-speed subsets
data_label = extract_feature(data_1, data_label, "0")
data_label = extract_feature(data_2, data_label, "1")

# Statistics on the night / day subsets
data_1 = data[data['day_nig'] == 0]
data_2 = data[data['day_nig'] == 1]
data_label = extract_feature(data_1, data_label, "on_night")
data_label = extract_feature(data_2, data_label, "on_day")

# Ratios between the non-zero-speed ("1") and zero-speed ("0") statistics
first = "0"
second = "1"
data_label['direction_median_ratio'] = data_label['direction_median_{}'.format(second)] / data_label['direction_median_{}'.format(first)]
data_label['slope_ratio'] = data_label['slope_{}'.format(second)] / data_label['slope_{}'.format(first)]
data_label['slope_mean_ratio'] = data_label['slope_median_{}'.format(second)] / data_label['slope_median_{}'.format(first)]

# Ratios between the daytime and night-time statistics
first = "on_night"
second = "on_day"
data_label['speed_median_ratio'] = data_label['speed_median_{}'.format(second)] / data_label['speed_median_{}'.format(first)]
data_label['speed_std_ratio'] = data_label['speed_std_{}'.format(second)] / data_label['speed_std_{}'.format(first)]
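One caveat with these ratio features, sketched here on invented medians (this guard is not part of the original code): when the denominator statistic is 0, for example a ship whose median night speed is 0, the division yields inf, which may be worth replacing before training:

```python
import numpy as np
import pandas as pd

# Invented per-ship medians; the second ship's night median is 0
d = pd.DataFrame({'speed_median_on_day': [4.0, 3.0],
                  'speed_median_on_night': [2.0, 0.0]})

d['speed_median_ratio'] = d['speed_median_on_day'] / d['speed_median_on_night']
# A zero denominator produces inf; mapping it to NaN keeps the feature
# well-behaved for tree models that handle missing values natively
d['speed_median_ratio'] = d['speed_median_ratio'].replace([np.inf, -np.inf], np.nan)
```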

2.3 NLP-based features

Here the team mainly uses TfidfVectorizer and NMF. The procedure is:
First, build word frequencies.
Then, from those frequencies pick the high-frequency words (top 100) and the stop words (used by more than 50% of ships).
Next, treat each ship's target feature sequence as a single sentence of text.
Finally, vectorize the text with TfidfVectorizer and extract a topic distribution with NMF.

The main code is as follows (it is excerpted from a class, hence the self. references):

from collections import Counter
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

print('build word_fre')
# Build word frequencies: for each word, [total count, document frequency]
def word_fre(x):
    word_dict = []
    x = x.split('|')
    docs = []
    for doc in x:
        doc = doc.split()
        docs.append(doc)
        word_dict.extend(doc)
    word_dict = Counter(word_dict)
    new_word_dict = {}
    for key, value in word_dict.items():
        new_word_dict[key] = [value, 0]
    del word_dict
    del x
    for doc in docs:
        doc = Counter(doc)
        for word in doc.keys():
            new_word_dict[word][1] += 1
    return new_word_dict
self.data['word_fre'] = self.data[self.to_list].apply(word_fre)

print('build top_' + str(self.top_n))
# Keep the top_n (100) highest-frequency words
def top_100(word_dict):
    return sorted(word_dict.items(), key=lambda x: (x[1][1], x[1][0]), reverse=True)[:self.top_n]
self.data['top_'+str(self.top_n)] = self.data['word_fre'].apply(top_100)
def top_100_word(word_list):
    words = []
    for i in word_list:
        i = list(i)
        words.append(i[0])
    return words
self.data['top_'+str(self.top_n)+'_word'] = self.data['top_' + str(self.top_n)].apply(top_100_word)
print(self.data.shape)

# Words used by more than half of the ships are treated as stop words
word_list = []
for i in self.data['top_'+str(self.top_n)+'_word'].values:
    word_list.extend(i)
word_list = Counter(word_list)
word_list = sorted(word_list.items(), key=lambda x: x[1], reverse=True)
user_fre = []
for i in word_list:
    i = list(i)
    user_fre.append(i[1] / self.data[self.by_name].nunique())
stop_words = []
for i, j in zip(word_list, user_fre):
    if j > 0.5:
        i = list(i)
        stop_words.append(i[0])

print('start title_feature')
# Treat each ship's merged tag list as one sentence; drop stop words
self.data['title_feature'] = self.data[self.to_list].apply(lambda x: x.split('|'))
self.data['title_feature'] = self.data['title_feature'].apply(lambda line: [w for w in line if w not in stop_words])
self.data['title_feature'] = self.data['title_feature'].apply(lambda x: ' '.join(x))

print('start NMF')
# Vectorize the sentences with TF-IDF over n-grams
tfidf_vectorizer = TfidfVectorizer(ngram_range=(tf_n, tf_n))
tfidf = tfidf_vectorizer.fit_transform(self.data['title_feature'].values)
# Extract each ship's topic distribution with NMF
text_nmf = NMF(n_components=self.nmf_n).fit_transform(tfidf)
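Stripped of the class plumbing, the TF-IDF + NMF step can be sketched end to end on made-up token sentences (in the original pipeline each sentence comes from a ship's discretized track; the token names, ngram_range=(1, 1), and n_components=2 below are purely illustrative):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# One invented "sentence" per ship, built from discretized track tokens
docs = ['x1 y2 x1 y3',
        'x1 y2 x1 y2',
        'y3 x4 y3 x4',
        'x4 y3 x4 x1']

# Vectorize with TF-IDF, then factor into a low-rank topic representation
tfidf = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(docs)
topics = NMF(n_components=2, random_state=0).fit_transform(tfidf)
print(topics.shape)  # (4, 2): one 2-dimensional topic vector per ship
```

Each row of `topics` can then be appended to that ship's feature vector before training.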