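🚀 ML.NET in Practice: A Full-Pipeline Binary Classification Template, from Data to Containerized Deployment
This tutorial uses the open UCI Adult dataset to build an enterprise-grade ML.NET classification system that covers preprocessing, model training, evaluation and visualization, deployment, CI/CD, monitoring, and model version management.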
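🧭 Why UCI Adult?
The UCI Adult dataset (also known as Census Income) is one of the most widely used benchmarks for classification tasks. It has the following properties:
- 📉 An imbalanced target (>50K accounts for about 24% of the samples)
- 🔀 A mix of numeric and categorical features (age, education, marital status, and so on)
- ❓ Missing values (marked with ?)
- 🧪 Broad reuse for comparing binary classification algorithms and practicing feature engineering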
📊 1. Dataset Overview and Cleaning
The raw UCI Adult data has 15 columns (14 features plus the label):
Index | Field | Type | Example | Description |
---|---|---|---|---|
0 | age | int | 39 | Age |
1 | workclass | string | State-gov | Type of employment |
2 | fnlwgt | int | 77516 | Census sampling weight (final weight) |
3 | education | string | Bachelors | Education level |
4 | education-num | int | 13 | Education level (numeric encoding) |
5 | marital-status | string | Never-married | Marital status |
6 | occupation | string | Adm-clerical | Occupation |
7 | relationship | string | Not-in-family | Family relationship |
8 | race | string | White | Race |
9 | sex | string | Male | Sex |
10 | capital-gain | int | 2174 | Capital gain |
11 | capital-loss | int | 0 | Capital loss |
12 | hours-per-week | int | 40 | Hours worked per week |
13 | native-country | string | United-States | Country of origin |
14 | income | string | <=50K / >50K | Annual income (binary label) |
Among these fields, workclass, occupation, and native-country use ? to mark missing values.
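Before dropping rows, it can help to see how many ? markers each column carries. The following is a minimal sketch using only the standard library. The three-row SAMPLE is a made-up excerpt in the adult.data layout, and the helper name count_missing is hypothetical.

```python
import csv
import io
from collections import Counter

# Column names in file order (taken from the table above).
COLS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]

# Illustrative rows only (an assumption, not real statistics): fields are
# separated by ", " just like in adult.data.
SAMPLE = """\
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K
32, Private, 293936, 7th-8th, 4, Married-spouse-absent, ?, Not-in-family, White, Male, 0, 0, 40, ?, <=50K
"""


def count_missing(text, marker="?"):
    """Return {column name: number of rows whose value equals the marker}."""
    counts = Counter()
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        if not row:  # tolerate blank lines
            continue
        for name, value in zip(COLS, row):
            if value.strip() == marker:
                counts[name] += 1
    return dict(counts)


missing = count_missing(SAMPLE)
print(missing)
```

To run the same count on the real file, pass its contents instead of SAMPLE, for example count_missing(open("adult.data").read()).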
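Python cleaning code
import pandas as pd
# Full column names, in file order
cols = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"
]
# Read the data. skipinitialspace=True strips the blank after each comma,
# so the missing-value marker arrives as "?" rather than " ?"
df = pd.read_csv("adult.data", names=cols, na_values="?", sep=",", skipinitialspace=True)
# Drop rows that contain missing values
df = df.dropna()
# Save without header or index so the ML.NET TextLoader (HasHeader = false) can read the file
df.to_csv("adult_clean.csv", index=False, header=False)
# Show the first rows to verify the cleaning result
print(df.head())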
Installing the dependencies
pip install pandas matplotlib numpy seaborn scikit-learn
🧼 Figure: data preprocessing flow
📈 2. Visualization in Python
import matplotlib.pyplot as plt
import seaborn as sns
# Pie chart of the label distribution; labels come from the same Series as the counts so they stay aligned
counts = df['income'].value_counts()
plt.pie(counts, labels=counts.index, autopct='%1.1f%%')
plt.title("Income Distribution")
plt.show()
# Histogram of the age distribution
sns.histplot(df['age'], bins=20)
plt.title("Age Histogram")
plt.show()
🧱 3. TextLoader Configuration and Mapping (C#)
using Microsoft.ML;
using Microsoft.ML.Data;
var mlContext = new MLContext(seed: 123);
// Define the column layout for TextLoader (Income is not converted to bool at load time)
var loader = mlContext.Data.CreateTextLoader(new TextLoader.Options
{
    Separators = new[] { ',' },
    HasHeader = false,
    TrimWhitespace = true,
    AllowQuoting = true,
    AllowSparse = false,
    Columns = new[]
    {
        new TextLoader.Column("Age", DataKind.Single, 0),
        new TextLoader.Column("Workclass", DataKind.String, 1),
        new TextLoader.Column("Fnlwgt", DataKind.Single, 2),
        new TextLoader.Column("Education", DataKind.String, 3),
        new TextLoader.Column("EducationNum", DataKind.Single, 4),
        new TextLoader.Column("MaritalStatus", DataKind.String, 5),
        new TextLoader.Column("Occupation", DataKind.String, 6),
        new TextLoader.Column("Relationship", DataKind.String, 7),
        new TextLoader.Column("Race", DataKind.String, 8),
        new TextLoader.Column("Sex", DataKind.String, 9),
        new TextLoader.Column("CapitalGain", DataKind.Single, 10),
        new TextLoader.Column("CapitalLoss", DataKind.Single, 11),
        new TextLoader.Column("HoursPerWeek", DataKind.Single, 12),
        new TextLoader.Column("NativeCountry", DataKind.String, 13),
        new TextLoader.Column("IncomeRaw", DataKind.String, 14) // keep the raw income field as a string
    }
});
var data = loader.Load("adult_clean.csv");
// Label conversion: map the income string to a boolean (true means >50K)
var pipeline = mlContext.Transforms.CustomMapping<IncomeMapInput, IncomeMapOutput>(
        (input, output) => output.Income = input.IncomeRaw == ">50K", contractName: null)
    .Append(mlContext.Transforms.CopyColumns("Label", nameof(IncomeMapOutput.Income)));
var mappedData = pipeline.Fit(data).Transform(data);
// Input and output types for the custom mapping (with top-level statements, declare them at the end of the file)
public class IncomeMapInput { public string IncomeRaw { get; set; } }
public class IncomeMapOutput { public bool Income { get; set; } }
🔧 4. Feature Engineering Pipeline (with Normalization and Encoding)
// Named trainingPipeline so it does not clash with the label-mapping pipeline from the previous section
var trainingPipeline = mlContext.Transforms.Categorical.OneHotEncoding("WorkclassEncoded", "Workclass")
    .Append(mlContext.Transforms.Concatenate("Features", "Age", "WorkclassEncoded"))
    .Append(mlContext.Transforms.NormalizeMinMax("Features"))
    .AppendCacheCheckpoint(mlContext)
    .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression());
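🔍 Notes:
- OneHotEncoding: maps each category string to a one-hot indicator vector
- NormalizeMinMax: rescales feature values into [0, 1]
- AppendCacheCheckpoint: caches the transformed data in memory so that multi-pass trainers such as SDCA do not recompute the upstream transforms on every pass; it only matters during training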
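To make the two transforms in the notes above concrete, here is a small standard-library sketch of one-hot encoding followed by textbook min-max scaling. The helper names one_hot and min_max are hypothetical and the four rows are illustrative. ML.NET's own normalizer exposes options (for example fixZero) that can change the exact mapping, so treat this as the general idea rather than a reproduction of ML.NET's output.

```python
def one_hot(values):
    """Map each category to an indicator vector over the sorted distinct categories."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vec = [0.0] * len(categories)
        vec[index[value]] = 1.0
        vectors.append(vec)
    return categories, vectors


def min_max(values):
    """Textbook min-max scaling: (x - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative rows: one numeric and one categorical feature.
ages = [39, 50, 38, 53]
workclass = ["State-gov", "Self-emp-not-inc", "Private", "Private"]

categories, encoded = one_hot(workclass)
scaled_ages = min_max(ages)

# Concatenate into one feature vector per row, mirroring the "Features" column.
features = [[age] + vec for age, vec in zip(scaled_ages, encoded)]
for row in features:
    print(row)
```

Each resulting row has four slots: the scaled age followed by one indicator per distinct workclass value.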
🚀 Figure: pipeline construction flow
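🤖 5. Hyperparameter Tuning and AutoML
GridSearch example (one candidate configuration)
using Microsoft.ML.Trainers;
var options = new SdcaLogisticRegressionBinaryTrainer.Options {
    L2Regularization = 0.1f,
    ConvergenceTolerance = 0.01f,
    MaximumNumberOfIterations = 100
};
var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(options);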
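The options object above is a single point in the search space. A grid search enumerates every combination of candidate values and keeps the best-scoring one. The sketch below shows only that enumeration logic. The candidate values in GRID are assumptions chosen for illustration, and score_fn is a hypothetical callback that would train and evaluate one configuration (for example by invoking the ML.NET training run and returning its F1 score).

```python
from itertools import product

# Candidate values per hyperparameter. The names follow the SDCA options above;
# the values themselves are illustrative assumptions.
GRID = {
    "L2Regularization": [0.01, 0.1, 1.0],
    "ConvergenceTolerance": [0.01, 0.001],
    "MaximumNumberOfIterations": [50, 100],
}


def expand_grid(grid):
    """Return one dict per combination of candidate values."""
    names = list(grid)
    combos = product(*(grid[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def grid_search(grid, score_fn):
    """Score every combination with score_fn and return (best_params, best_score)."""
    best_params, best_score = None, float("-inf")
    for params in expand_grid(grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


candidates = expand_grid(GRID)
print(len(candidates), "configurations to evaluate")
```

The grid grows multiplicatively (3 × 2 × 2 = 12 here), which is the main reason to hand the search to AutoML once more than a few parameters are involved.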
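AutoML experiment
using Microsoft.ML.AutoML;
// Let AutoML search trainers and hyperparameters for at most 60 seconds
var experiment = mlContext.Auto().CreateBinaryClassificationExperiment(60);
var result = experiment.Execute(trainSet);
📦 Required package: Microsoft.ML.AutoML
📁 Cache configuration: set CacheDirectory = "cache"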
🔁 Figure: AutoML experiment flow
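📉 6. Model Evaluation and Visualization
// predictions: the scored held-out data, i.e. the output of the trained model's Transform(...)
var metrics = mlContext.BinaryClassification.Evaluate(predictions);
Console.WriteLine($"F1: {metrics.F1Score}, Accuracy: {metrics.Accuracy}, AUC: {metrics.AreaUnderRocCurve}");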
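Because the target is imbalanced, accuracy alone can look good for a model that never predicts >50K. The sketch below computes accuracy, precision, recall, and F1 from a confusion matrix using the standard definitions. The eight labels are made up for illustration and binary_metrics is a hypothetical helper, not an ML.NET API.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for boolean labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels: 3 positives (>50K) out of 8.
y_true = [True, False, False, False, True, False, False, True]

# A baseline that always predicts the majority class (<=50K).
baseline = binary_metrics(y_true, [False] * len(y_true))

# A classifier that finds two of the three positives and raises one false alarm.
scored = binary_metrics(y_true, [True, False, False, True, False, False, False, True])

print("baseline:", baseline)
print("model:   ", scored)
```

The always-negative baseline still reaches 62.5% accuracy on this toy sample while its F1 is zero, which is why F1 and AUC are reported alongside accuracy.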
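ROC / PR curves (Python example)
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve, auc
fpr, tpr, _ = roc_curve(y_true, y_scores)  # y_true: 0/1 labels, y_scores: predicted probabilities
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}"); plt.title("ROC Curve"); plt.legend(); plt.show()
precision, recall, _ = precision_recall_curve(y_true, y_scores)
plt.plot(recall, precision); plt.title("PR Curve"); plt.show()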
📌 Fix the seed so that results are reproducible: mlContext = new MLContext(seed: 123);
🛡 7. Middleware and Swagger
ExceptionHandlerMiddleware
public class ExceptionHandlerMiddleware
{
    private readonly RequestDelegate _next;
    public ExceptionHandlerMiddleware(RequestDelegate next) => _next = next;
    public async Task Invoke(HttpContext context) {
        try { await _next(context); }
        catch (Exception ex) {
            context.Response.StatusCode = 500;
            await context.Response.WriteAsJsonAsync(new { error = ex.Message });
        }
    }
}
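SwaggerUI integration (Minimal API; requires AddEndpointsApiExplorer() and AddSwaggerGen() on builder.Services)
app.UseSwagger(); // serves /swagger/v1/swagger.json, which the UI below loads
app.UseSwaggerUI(c => c.SwaggerEndpoint("/swagger/v1/swagger.json", "API V1"));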
🐳 8. Dockerfile
# Build
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
COPY . .
RUN dotnet publish -c Release -o /app
# Runtime
FROM mcr.microsoft.com/dotnet/aspnet:8.0
WORKDIR /app
COPY --from=build /app .
ENV ASPNETCORE_URLS=http://+:80
ENTRYPOINT ["dotnet", "App.dll"]
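📦 Tip: drop the .pdb and .xml files and COPY only what the app needs at runtime (the .dll files with their .runtimeconfig.json and .deps.json, plus model.zip)
🌐 Configure multiple ports through ENV (for example ASPNETCORE_URLS) and document them with EXPOSE 5000 5001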
🔁 9. GitHub Actions + Azure Deployment
- name: Build
  run: docker build -t myuser/mlnet-api:latest .
- name: Push
  run: docker push myuser/mlnet-api:latest
- name: Azure Deploy
  uses: azure/webapps-deploy@v2
  with:
    app-name: mlnet-app
    publish-profile: ${{ secrets.AZURE_WEBAPP_PUBLISH_PROFILE }}
    images: 'myuser/mlnet-api:latest'
- name: Smoke Test
  # Target the deployed app, not localhost on the runner (assumes the default azurewebsites.net hostname)
  run: curl --fail https://mlnet-app.azurewebsites.net/healthz
🔐 The repository secrets hold the DockerHub login, the Azure publish profile, the model path, and similar configuration
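A freshly deployed container often needs a few seconds before it answers, so a single curl call can fail even when the deployment is fine. The sketch below polls a health endpoint with retries using only the standard library. The function names, retry counts, and the injectable fetch and sleep parameters are assumptions made for illustration and testability. They are not part of any GitHub Actions or Azure API.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the request could not complete."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def wait_until_healthy(url, attempts=10, delay=3.0, fetch=http_status, sleep=time.sleep):
    """Poll url until it returns 200.

    Returns the 1-based attempt number that succeeded, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

In a workflow step this could be called as wait_until_healthy("https://<your-app-host>/healthz"), exiting with a non-zero code when it returns None so that the job fails.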
🧭 Figure: CI/CD automated deployment flow
📘 10. Logging and Version Management
Application Insights
builder.Services.AddApplicationInsightsTelemetry();
Model version control
version.json:
{ "ModelVersion": "1.0.0", "Hash": "abc123" }
At service startup, compute the hash of model.zip and compare it with the value in version.json; on a mismatch, prompt for an update or log a warning.
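The startup check described above can be sketched as follows. The use of SHA-256 is an assumption, because the source only shows a placeholder hash. The function names are hypothetical, and the demo writes stand-in files to a temporary directory rather than touching a real model.

```python
import hashlib
import json
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so that large model files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_model_version(model_path, version_path):
    """Compare the model file's hash with the "Hash" field recorded in version.json."""
    with open(version_path, encoding="utf-8") as fh:
        manifest = json.load(fh)
    actual = sha256_of_file(model_path)
    return {
        "version": manifest["ModelVersion"],
        "expected": manifest["Hash"],
        "actual": actual,
        "matches": actual == manifest["Hash"],
    }


# Demo with stand-in files (assumption: Hash stores a SHA-256 hex digest).
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.zip")
    version_path = os.path.join(tmp, "version.json")
    with open(model_path, "wb") as fh:
        fh.write(b"stand-in model bytes")
    with open(version_path, "w", encoding="utf-8") as fh:
        json.dump({"ModelVersion": "1.0.0", "Hash": sha256_of_file(model_path)}, fh)

    ok = check_model_version(model_path, version_path)

    # Simulate a model file that changed without a manifest update.
    with open(model_path, "ab") as fh:
        fh.write(b"!")
    tampered = check_model_version(model_path, version_path)

print("fresh model matches:", ok["matches"])
print("modified model matches:", tampered["matches"])
```

A service would typically log a warning or refuse to load the model when matches is False. Which of the two is appropriate depends on how strictly deployments are controlled.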
♻️ Figure: model version hot-update flow
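🏁 Summary
This article walks through the complete workflow of building a real binary classification model on UCI Adult. Combined with training visualization, AutoML, exception handling, deployment, version management, and operational monitoring, it gives you a reusable, engineering-oriented, production-ready ML.NET template.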