Kaggle for Beginners: Mercari

一米三的老阿姨

Licensed under CC 4.0 BY-SA

Article link: https://blog.csdn.net/qq_33323162/article/details/78954263

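This is an introductory tutorial for Kaggle newcomers. Using the Mercari item price prediction competition as an example, it walks through data preprocessing, feature engineering, and model training. By looking at each item's id, name, condition, category, brand, price, and shipping information, and by working through code shared by an expert, beginners can learn the basic workflow and techniques of a Kaggle competition.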

Getting started on Kaggle as a beginner

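This is a featured-tier competition on Kaggle about predicting item prices.
Task: predict the price of an item.
Data fields:
train_id or test_id - the id of a row in the training or test data
name - the item name
item_condition_id - the condition of the item as provided by the seller
category_name - the item category
brand_name - the brand name
price - the price (the value to predict)
shipping - whether shipping is included
item_description - the item description
The data looks like this:
[Image: sample rows of the data]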
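
Since the screenshot is not available, here is a minimal sketch of the column layout listed above. The two rows are made up for illustration and are not taken from the Mercari files; the names `sample_tsv`, `sample`, and `missing_counts` are my own. It also shows why the script later has to fill in missing values: empty fields in a TSV are read as NaN.

```python
import io

import pandas as pd

# Made-up rows that only mimic the column layout described above
sample_tsv = (
    "train_id\tname\titem_condition_id\tcategory_name\tbrand_name\t"
    "price\tshipping\titem_description\n"
    "0\tRed summer dress\t1\tWomen/Dresses/Casual\t\t25.0\t1\tWorn once\n"
    "1\tRunning shoes\t3\tMen/Shoes/Athletic\tNike\t40.0\t0\t\n"
)

# The competition files are tab-separated, hence sep="\t"
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Count missing values per column; empty fields are parsed as NaN
missing_counts = sample.isna().sum()
print(sample.dtypes)
print(missing_counts)
```

On the real `train.tsv` the same `isna().sum()` call is a quick way to see which columns need a `fillna` before vectorizing.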

The code below was shared by an expert Kaggler:

# Import the required modules
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# List the files in the input directory
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import mean_squared_log_error

PATH = "../input/"
# Read the files
train = pd.read_csv(f'{PATH}train.tsv', sep='\t')
test = pd.read_csv(f'{PATH}test.tsv', sep='\t')
# The sample submission is comma-separated, so the default separator is used
submiss = pd.read_csv(f'{PATH}sample_submission.csv')
# Process the training and test data together
df = pd.concat([train, test], axis=0)
# Number of rows in the training data
nrow_train = train.shape[0]
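# Log-transform the price (undone with np.expm1 after prediction)
y_train = np.log1p(train['price'])
# Keep the ids of the test rows
y_test = test['test_id']
# Drop the columns that are not used as features
df = df.drop(['price', 'test_id', 'train_id'], axis=1)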
# Fill in missing values
df['category_name'] = df['category_name'].fillna('MISS').astype(str)
df['brand_name'] = df['brand_name'].fillna('missing').astype(str)
df['item_description'] = df['item_description'].fillna('No')
# Convert the categorical columns to strings so they can be vectorized
df['shipping'] = df['shipping'].astype(str)
df['item_condition_id'] = df['item_condition_id'].astype(str)
# Text processing: each vectorizer picks its own column out of a row
default_preprocessor = CountVectorizer().build_preprocessor()
def build_preprocessor(field):
    field_idx = list(df.columns).index(field)
    return lambda x: default_preprocessor(x[field_idx])

vectorizer = FeatureUnion([
    ('name', CountVectorizer(
        ngram_range=(1, 2),
        max_features=50000,
        preprocessor=build_preprocessor('name'))),
    ('category_name', CountVectorizer(
        token_pattern='.+',
        preprocessor=build_preprocessor('category_name'))),
    ('brand_name', CountVectorizer(
        token_pattern='.+',
        preprocessor=build_preprocessor('brand_name'))),
    ('shipping', CountVectorizer(
        token_pattern=r'\d+',
        preprocessor=build_preprocessor('shipping'))),
    ('item_condition_id', CountVectorizer(
        token_pattern=r'\d+',
        preprocessor=build_preprocessor('item_condition_id'))),
    ('item_description', TfidfVectorizer(
        ngram_range=(1, 3),
        max_features=100000,
        preprocessor=build_preprocessor('item_description'))),
])
# Vectorize the combined data set
X = vectorizer.fit_transform(df.values)
# Vectorized training data
X_train = X[:nrow_train]
# Vectorized test data
X_test = X[nrow_train:]

# Model
model = Ridge(
        solver='auto',
        fit_intercept=True,
        alpha=0.5,
        max_iter=100,
        tol=0.05)
# Train
model.fit(X_train, y_train)
# Predict on the test data
preds = model.predict(X_test)
# Save the result, converting the predictions back from log space
test["price"] = np.expm1(preds)
test[["test_id", "price"]].to_csv("submission_ridge.csv", index=False)
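
The least obvious part of the script is the preprocessor trick: every vectorizer receives a whole row of `df.values`, and its preprocessor picks out just one column before lowercasing it. Here is a minimal sketch of that idea on a toy table. The rows are made up, and `column_preprocessor` is my own name for a variant of `build_preprocessor` that takes the frame as an argument instead of reading a global.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Toy rows, made up for illustration (not Mercari data)
toy = pd.DataFrame({
    "name": ["red dress", "blue jeans", "red shoes"],
    "brand_name": ["Nike", "missing", "Nike"],
})

# The default preprocessor of CountVectorizer lowercases its input
default_preprocessor = CountVectorizer().build_preprocessor()


def column_preprocessor(frame, field):
    """Return a preprocessor that selects one column from a row array."""
    field_idx = list(frame.columns).index(field)
    return lambda row: default_preprocessor(row[field_idx])


union = FeatureUnion([
    # Word tokens from the name column
    ("name", CountVectorizer(
        preprocessor=column_preprocessor(toy, "name"))),
    # token_pattern=".+" keeps the whole brand string as a single token
    ("brand_name", CountVectorizer(
        token_pattern=".+",
        preprocessor=column_preprocessor(toy, "brand_name"))),
])

# Each row of toy.values is passed to every vectorizer as one "document"
X_toy = union.fit_transform(toy.values)

name_vocab = sorted(union.transformer_list[0][1].vocabulary_)
brand_vocab = sorted(union.transformer_list[1][1].vocabulary_)
print(X_toy.shape, name_vocab, brand_vocab)
```

The resulting sparse matrix has one block of columns per vectorizer, stacked side by side, which is what the full script feeds into `Ridge`.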

Submission result
[Image: screenshot of the submission result]
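
The script imports `KFold` and `mean_squared_log_error` but never uses them. One plausible use is to estimate the score locally before submitting, since the competition is scored by RMSLE and minimizing squared error on `log1p(price)` matches that metric. Below is a rough sketch under that assumption. The function `cv_rmsle` and the synthetic data are mine, not part of the original kernel, and the fold count and seed are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import KFold


def cv_rmsle(X, prices, n_splits=3, alpha=0.5, seed=0):
    """Mean RMSLE over K folds for a Ridge model fit on log1p(price)."""
    y = np.log1p(prices)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, valid_idx in folds.split(X):
        model = Ridge(alpha=alpha)
        model.fit(X[train_idx], y[train_idx])
        # Undo the log transform; clip at 0 because MSLE rejects negatives
        pred = np.clip(np.expm1(model.predict(X[valid_idx])), 0, None)
        msle = mean_squared_log_error(prices[valid_idx], pred)
        scores.append(np.sqrt(msle))
    return float(np.mean(scores))


# Synthetic data for illustration: log prices are linear in the features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
weights = np.array([0.5, 0.2, 0.0, 0.1, 0.3])
prices_demo = np.expm1(np.abs(X_demo @ weights + 3.0))

score = cv_rmsle(X_demo, prices_demo)
print(f"mean RMSLE over folds: {score:.4f}")
```

With the real data you would pass `X_train` and `train['price'].values` (a NumPy array, so that integer index arrays work), and compare the local score with the one shown on the leaderboard.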