[Machine Learning] Understanding Binary Logistic Regression in Depth: Principles, Implementation, and Applications

Article link: https://blog.csdn.net/summer6c/article/details/148068241

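Understanding Binary Logistic Regression in Depth

Introduction
1. Background of Logistic Regression
- 1.1 From Linear Regression to Logistic Regression
- 1.2 Application Scenarios
2. Mathematical Principles of Logistic Regression
3. Cost Function and Optimization
- 3.1 Cross-Entropy Loss Function
- 3.2 Gradient Descent
4. Python Implementation and Result Analysis
5. Strengths and Limitations of Logistic Regression
6. Complete Code: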

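Introduction

Classification is one of the core problems in machine learning. Binary logistic regression is among the most basic and widely used classification algorithms, with applications in medical diagnosis, credit scoring, marketing, and many other fields. This article walks through the principles and mathematical background of binary logistic regression and demonstrates it with a Python implementation.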

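1. Background of Logistic Regression

1.1 From Linear Regression to Logistic Regression

Linear regression is suited to predicting continuous values. For classification problems, and binary classification in particular, we need a method that outputs a probability estimate. That is where logistic regression comes from: despite the word "regression" in its name, it is used as a classification algorithm.
Key difference:

Linear regression: outputs a continuous value in (-∞, +∞)

Logistic regression: outputs a probability in (0, 1)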

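1.2 Application Scenarios

Logistic regression is a common choice for tasks such as:

Predicting whether a customer will buy a product (yes/no)
Deciding whether an email is spam (yes/no)
Diagnosing whether a patient has a given disease (positive/negative)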

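2. Mathematical Principles of Logistic Regression

2.1 Sigmoid Function

At the core of logistic regression is the Sigmoid function (also called the logistic function):

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This S-shaped function maps any real number into the interval (0, 1), which makes it a natural fit for probability estimates.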
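
As a quick check of the mapping into (0, 1), here is a minimal NumPy sketch of the Sigmoid function. The piecewise evaluation and the name `stable_sigmoid` are assumptions added for illustration and are not part of the article's code; the piecewise form is mathematically identical to 1 / (1 + e^(-z)) and only avoids overflow in `np.exp` for inputs of large magnitude.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid 1 / (1 + exp(-z)) for an array z, evaluated without overflow."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the direct formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)) so the exponent stays negative.
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

values = stable_sigmoid(np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0]))
print(values)  # increasing values; the middle entry (z = 0) is exactly 0.5
```

In floating point the two extreme inputs saturate to 0.0 and 1.0, even though the exact function never reaches either endpoint.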

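2.2 Hypothesis Function

The hypothesis function passes a linear combination of the features through the Sigmoid:

$$h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

where θ is the parameter vector and x is the feature vector (with a leading 1 for the bias term).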
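
To make the hypothesis function concrete, the sketch below evaluates it for a single sample. The parameter values and the sample are hypothetical, chosen so that the linear combination is exactly zero.

```python
import numpy as np

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x); x must already include the leading 1."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

# Hypothetical parameters and one sample, chosen only for illustration.
theta = np.array([-1.0, 2.0, 0.5])   # [bias, weight for x1, weight for x2]
x = np.array([1.0, 0.5, 0.0])        # the leading 1.0 is the bias feature

p = hypothesis(theta, x)             # theta^T x = -1 + 1 + 0 = 0
print(p)                             # 0.5
```

A probability of 0.5 means the model is undecided for this sample, which is the case examined in the decision boundary discussion.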

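2.3 Decision Boundary

When hθ(x) ≥ 0.5 we predict y = 1; otherwise we predict y = 0. Because the Sigmoid equals 0.5 at z = 0, the decision boundary is:

$$\theta^T x = 0$$

This boundary is linear in the features: a straight line for two features, and a hyperplane in higher dimensions.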
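
For two features the boundary can be solved for the second feature, which is also how a boundary line is typically drawn. The following sketch uses hypothetical parameters for which the boundary is x1 + x2 = 3; the helper names are illustrative.

```python
import numpy as np

theta = np.array([-3.0, 1.0, 1.0])  # hypothetical: boundary is x1 + x2 = 3

def predict_label(theta, x1, x2):
    """Return 1 when theta^T x >= 0, which is equivalent to h_theta(x) >= 0.5."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return int(z >= 0)

def boundary_x2(theta, x1):
    """Solve theta0 + theta1*x1 + theta2*x2 = 0 for x2 (requires theta2 != 0)."""
    return -(theta[0] + theta[1] * x1) / theta[2]

print(boundary_x2(theta, 1.0))          # 2.0
print(predict_label(theta, 1.0, 2.5))   # 1 (above the line)
print(predict_label(theta, 1.0, 1.5))   # 0 (below the line)
```

Comparing the linear score with 0 avoids evaluating the Sigmoid at all when only the class label is needed.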

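3. Cost Function and Optimization

3.1 Cross-Entropy Loss Function

Unlike linear regression, which uses mean squared error, logistic regression uses the cross-entropy loss:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\bigl(1-h_\theta(x^{(i)})\bigr)\right]$$

where m is the number of training samples. This cost is convex in θ, so it has no spurious local minima, and gradient descent with a suitable learning rate approaches the global optimum.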
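
The sketch below evaluates the cross-entropy loss on three hypothetical samples to show how it rewards confident correct predictions and heavily penalizes confident wrong ones. The clipping constant `eps` is an assumption added here to keep the logarithm finite.

```python
import numpy as np

def cross_entropy(y, h, eps=1e-12):
    """Mean binary cross-entropy between labels y and predicted probabilities h."""
    h = np.clip(h, eps, 1 - eps)  # assumption: clip to keep log() finite
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

y = np.array([1.0, 0.0, 1.0])
confident_right = np.array([0.9, 0.1, 0.8])   # probabilities close to the labels
confident_wrong = np.array([0.1, 0.9, 0.2])   # probabilities far from the labels

low = cross_entropy(y, confident_right)
high = cross_entropy(y, confident_wrong)
print(low, high)   # roughly 0.145 and 2.07
```

The second value is more than ten times the first, even though both sets of predictions are equally far from 0.5.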

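3.2 Gradient Descent

The parameter update rule has the same form as in linear regression, although the derivation is different:

$$\theta_j := \theta_j - \alpha\,\frac{\partial J(\theta)}{\partial \theta_j}$$

Carrying out the derivation gives:

$$\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}$$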
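
The derivation behind this update can be sketched as follows, taking J(θ) to be the mean cross-entropy cost of section 3.1 over m samples and writing h^(i) for the predicted probability of sample i. The key fact is the derivative of the Sigmoid, σ'(z) = σ(z)(1 − σ(z)); the notation is the standard one and is assumed here rather than taken from the article.

```latex
\begin{aligned}
\sigma'(z) &= \sigma(z)\,\bigl(1-\sigma(z)\bigr), \qquad h^{(i)} = \sigma\!\bigl(\theta^{T}x^{(i)}\bigr) \\
\frac{\partial J}{\partial \theta_j}
 &= -\frac{1}{m}\sum_{i=1}^{m}\left[\frac{y^{(i)}}{h^{(i)}} - \frac{1-y^{(i)}}{1-h^{(i)}}\right] h^{(i)}\bigl(1-h^{(i)}\bigr)\,x_j^{(i)} \\
 &= -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\bigl(1-h^{(i)}\bigr) - \bigl(1-y^{(i)}\bigr)h^{(i)}\right] x_j^{(i)} \\
 &= \frac{1}{m}\sum_{i=1}^{m}\bigl(h^{(i)} - y^{(i)}\bigr)\,x_j^{(i)}
\end{aligned}
```

In vector form this is X^T (h − y) / m, which has the same shape as the gradient of linear regression with mean squared error, except that h is a Sigmoid output rather than a linear one.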

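4. Python Implementation and Result Analysis

4.1 Data Preparation

The dataset has two features and one binary label. Standardizing the features first usually helps gradient descent converge faster, although the snippet here only loads the data and adds the bias column.

X, y = load_dataset('testSet.txt')
X = np.hstack((np.ones((X.shape[0], 1)), X))  # add the bias (intercept) column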
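
The standardization step mentioned above is not shown in the article's code. A minimal sketch of z-score standardization follows; the helper name `standardize` and the sample values are hypothetical.

```python
import numpy as np

def standardize(X):
    """Z-score each column: subtract the mean, divide by the standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (X - mu) / sigma, mu, sigma

# Hypothetical raw features standing in for the two columns of the dataset.
X_raw = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])

X_std, mu, sigma = standardize(X_raw)
# Add the bias column only after scaling, so the column of ones is left untouched.
X_design = np.hstack((np.ones((X_std.shape[0], 1)), X_std))
print(X_design)
```

The returned `mu` and `sigma` would need to be reused to scale any new samples before prediction, and a decision boundary learned this way lives in the standardized feature space.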

4.2 Model Training

Set the learning rate α = 0.01 and run 1000 iterations:

theta = np.zeros(X.shape[1])
alpha = 0.01
num_iters = 1000
theta, cost_history = gradient_descent(X, y, theta, alpha, num_iters)
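
The file testSet.txt is not reproduced in this article, so the training step can be tried on synthetic data instead. The sketch below is self-contained and uses two Gaussian blobs as an assumed stand-in for the dataset, with the same learning rate and iteration count as above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the dataset: two Gaussian blobs, 50 points per class.
X0 = rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(50, 2))
X1 = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(50, 2))
X = np.vstack((X0, X1))
y = np.concatenate((np.zeros(50), np.ones(50)))
X = np.hstack((np.ones((X.shape[0], 1)), X))  # bias column

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, theta, eps=1e-12):
    """Mean cross-entropy, with clipping so log() never sees 0."""
    h = np.clip(sigmoid(X @ theta), eps, 1 - eps)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

theta = np.zeros(X.shape[1])
alpha, num_iters = 0.01, 1000
cost_history = []
for _ in range(num_iters):
    gradient = X.T @ (sigmoid(X @ theta) - y) / len(y)
    theta = theta - alpha * gradient
    cost_history.append(cost(X, y, theta))

print(f"first cost: {cost_history[0]:.4f}, last cost: {cost_history[-1]:.4f}")
```

With all parameters at zero every prediction is 0.5 and the cost is ln 2 ≈ 0.693, so the recorded costs start just below that value and decrease from there.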

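4.3 Result Visualization

Cost curve:
The cost decreases steadily as the number of iterations grows, which indicates that learning is working.
Decision boundary plot:
The plot shows how the model separates the two classes.
[Figure: decision boundary plot]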

4.4 Model Evaluation

Compute the accuracy on the training set:

predictions = predict(X, theta)
accuracy = np.mean(predictions == y) * 100
print(f"Training accuracy: {accuracy:.2f}%")
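
Accuracy alone can hide how the errors are distributed between the two classes. As a complement, here is a small sketch that counts the four confusion-matrix cells and derives precision and recall; the labels and predictions are hypothetical, and the helper `confusion_counts` is not part of the article's code.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels encoded as 0/1."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return tp, fp, tn, fn

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(tp, fp, tn, fn)                  # 2 1 4 1
print(accuracy, precision, recall)     # 0.75, then 2/3 twice
```

Because these numbers, like the accuracy above, are computed on the data the model was trained on, they tend to be optimistic; a held-out test set gives a fairer estimate.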

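5. Strengths and Limitations of Logistic Regression

Strengths:

Cheap to compute, and easy to implement and understand

Outputs can be interpreted as probabilities

Very effective when the classes can be separated by a linear decision boundary

Less prone to overfitting than more flexible models, especially once regularization is added

Limitations:

The decision boundary is linear in the features, so it handles only linearly separable or approximately linearly separable data

Sensitive to outliers

Works best when the features are not strongly correlated with each other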
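
The regularization mentioned among the strengths can be sketched as an L2 penalty added to the gradient. This is one common formulation and an assumption here, not the article's implementation: the penalty is (lam / (2m)) times the squared norm of the weights, the bias term is left unpenalized, and the name `gradient_descent_l2` is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_l2(X, y, alpha=0.1, lam=1.0, num_iters=2000):
    """Gradient descent on cross-entropy + (lam / (2m)) * ||theta[1:]||^2."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        gradient = X.T @ (sigmoid(X @ theta) - y) / m
        penalty = (lam / m) * theta
        penalty[0] = 0.0  # convention assumed here: do not penalize the bias term
        theta = theta - alpha * (gradient + penalty)
    return theta

# Tiny linearly separable toy set (first column is the bias feature).
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta_plain = gradient_descent_l2(X, y, lam=0.0)
theta_l2 = gradient_descent_l2(X, y, lam=1.0)
print(theta_plain, theta_l2)
```

On separable data the unregularized weight keeps growing with more iterations, while the penalized weight settles at a finite value, which is the sense in which regularization limits overfitting.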

6. Complete Code:

import numpy as np
import matplotlib.pyplot as plt

def load_dataset(testSet):
    """Load a whitespace-delimited text file: two feature columns, then the label."""
    data = np.loadtxt(testSet)
    X = data[:, :2]
    y = data[:, 2]
    return X, y

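def sigmoid(z):
    """Sigmoid function: maps any real z into (0, 1)."""
    return 1 / (1 + np.exp(-z))

def compute_cost(X, y, theta):
    """Mean cross-entropy cost J(theta)."""
    m = len(y)
    h = sigmoid(X.dot(theta))
    # Clip h away from 0 and 1 so that np.log never receives 0.
    h = np.clip(h, 1e-12, 1 - 1e-12)
    cost = (-y.dot(np.log(h)) - (1-y).dot(np.log(1-h))) / m
    return cost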

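def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent; returns the final theta and the cost after each step."""
    m = len(y)
    cost_history = []
    
    for _ in range(num_iters):
        h = sigmoid(X.dot(theta))
        gradient = X.T.dot(h - y) / m
        # Rebind instead of using -= so the caller's theta array is not modified in place.
        theta = theta - alpha * gradient
        cost = compute_cost(X, y, theta)
        cost_history.append(cost)
        
    return theta, cost_history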

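def predict(X, theta):
    """Predict 1 when h_theta(x) >= 0.5, otherwise 0."""
    # np.round rounds 0.5 to 0 (round half to even), so compare with the threshold explicitly.
    return (sigmoid(X.dot(theta)) >= 0.5).astype(int)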

def plot_decision_boundary(X, y, theta):
    """Plot the two classes and the learned decision boundary (X has no bias column)."""
    plt.figure(figsize=(10, 6))
    
    # Scatter the data points, one color per class
    plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='blue', label='Class 0')
    plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='red', label='Class 1')
    
    # Boundary: theta0 + theta1*x1 + theta2*x2 = 0, solved for x2 (assumes theta[2] != 0)
    plot_x = np.array([min(X[:, 0]) - 2, max(X[:, 0]) + 2])
    plot_y = (-1/theta[2]) * (theta[1] * plot_x + theta[0])
    plt.plot(plot_x, plot_y, color='green', label='Decision Boundary')
    
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('Logistic Regression Decision Boundary')
    plt.legend()
    plt.grid(True)
    plt.show()

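def logistic_regression():
    """Main routine: load data, train, plot, and report training accuracy."""
    # Load the data
    X, y = load_dataset('testSet.txt')
    
    # Add the bias (intercept) column
    X = np.hstack((np.ones((X.shape[0], 1)), X))
    
    # Initialize the parameters
    theta = np.zeros(X.shape[1])
    
    # Set the learning rate and the number of iterations
    alpha = 0.01
    num_iters = 1000
    
    # Run gradient descent
    theta, cost_history = gradient_descent(X, y, theta, alpha, num_iters)
    
    # cost_history[0] is recorded after the first update, not before training starts
    print(f"Learned parameters: {theta}")
    print(f"Cost after first iteration: {cost_history[0]:.4f}")
    print(f"Final cost: {cost_history[-1]:.4f}")
    
    # Plot the cost over the iterations
    plt.plot(cost_history)
    plt.xlabel('Iterations')
    plt.ylabel('Cost')
    plt.title('Cost Function over Iterations')
    plt.show()
    
    # Plot the decision boundary (pass the features without the bias column)
    plot_decision_boundary(X[:, 1:], y, theta)
    
    # Compute the training accuracy
    predictions = predict(X, theta)
    accuracy = np.mean(predictions == y) * 100
    print(f"Training accuracy: {accuracy:.2f}%")

if __name__ == "__main__":
    logistic_regression()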