CheckpointConfig class parameters

This article walks through the default checkpointing settings defined in the Java source of Flink's CheckpointConfig class: the checkpointing mode (exactly-once), the checkpoint timeout, the triggering interval and minimum pause, the concurrency limit, and a few deprecated options. The focus is on understanding how these parameters are configured to guarantee data consistency.


Checkpoint mode
```java
/** The default checkpoint mode: exactly once. */
public static final CheckpointingMode DEFAULT_MODE = CheckpointingMode.EXACTLY_ONCE;
```
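The constant above only defines the default; the mode actually takes effect once checkpointing is enabled on the execution environment. The minimal sketch below assumes Flink's streaming API is on the classpath, and the 60-second interval is an illustrative value rather than anything defined in this class:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointModeExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // EXACTLY_ONCE is already DEFAULT_MODE; passing it explicitly just documents the intent.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
    }
}
```

The later snippets reuse this env variable instead of repeating the setup boilerplate.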

Checkpoint timeout
```java
/** The default timeout of a checkpoint attempt: 10 minutes. */
public static final long DEFAULT_TIMEOUT = 10 * 60 * 1000;
```
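A checkpoint attempt that runs longer than this timeout is discarded. Jobs with large state sometimes need a higher limit; the sketch below reuses the env from the first example, and the 20-minute value is purely illustrative:

```java
// Default is 10 * 60 * 1000 ms (DEFAULT_TIMEOUT); raise it to 20 minutes for this job.
env.getCheckpointConfig().setCheckpointTimeout(20 * 60 * 1000);
```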

Minimum pause between checkpoints
```java
/** The default minimum pause to be made between checkpoints: none. */
public static final long DEFAULT_MIN_PAUSE_BETWEEN_CHECKPOINTS = 0;
```
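A default of 0 means the next checkpoint may start as soon as the previous one completes. Configuring a pause reserves time for regular processing between checkpoints; the 500 ms below is an illustrative value, again applied to the same env:

```java
// Guarantee at least 500 ms between the end of one checkpoint and the start of the next.
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
```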

Maximum concurrent checkpoints
```java
/** The default limit of concurrently happening checkpoints: one. */
public static final int DEFAULT_MAX_CONCURRENT_CHECKPOINTS = 1;
```
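With the default of one, a new checkpoint is only triggered after the previous attempt has finished. Allowing overlap can reduce end-to-end checkpoint latency at the cost of extra load; a sketch with an illustrative value of two:

```java
// Allow up to two checkpoint attempts to be in progress at the same time.
env.getCheckpointConfig().setMaxConcurrentCheckpoints(2);
```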

Marker for an unset tolerable checkpoint failure number
```java
public static final int UNDEFINED_TOLERABLE_CHECKPOINT_NUMBER = -1;
```

The instance fields below hold the values actually used by a job; each one is initialised from the corresponding default above.

```java
/** Checkpointing mode (exactly-once vs. at-least-once). */
private CheckpointingMode checkpointingMode = DEFAULT_MODE;

/** Periodic checkpoint triggering interval. */
private long checkpointInterval = -1; // disabled
```
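Because the interval defaults to -1, periodic checkpointing stays disabled until an interval is chosen, which is why enableCheckpointing is usually the first checkpoint-related call in a job. The interval below is illustrative:

```java
// Trigger a checkpoint every 60 seconds (the field can also be set directly
// via env.getCheckpointConfig().setCheckpointInterval(60_000)).
env.enableCheckpointing(60_000);
```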

```java
/** Maximum time checkpoint may take before being discarded. */
private long checkpointTimeout = DEFAULT_TIMEOUT;

/** Minimal pause between checkpointing attempts. */
private long minPauseBetweenCheckpoints = DEFAULT_MIN_PAUSE_BETWEEN_CHECKPOINTS;

/** Maximum number of checkpoint attempts in progress at the same time. */
private int maxConcurrentCheckpoints = DEFAULT_MAX_CONCURRENT_CHECKPOINTS;

/** Flag to force checkpointing in iterative jobs. */
private boolean forceCheckpointing;

/** Cleanup behaviour for persistent checkpoints. */
private ExternalizedCheckpointCleanup externalizedCheckpointCleanup;
```
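ExternalizedCheckpointCleanup is a nested enum of CheckpointConfig; it controls whether the data of a retained (externalized) checkpoint survives a job cancellation. A sketch of enabling retained checkpoints on the same env:

```java
// Keep the checkpoint data even when the job is cancelled, so it can be used for a manual restart.
// The alternative value is DELETE_ON_CANCELLATION.
env.getCheckpointConfig().enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
```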

```java
/**
 * Task would not fail if there is an error in their checkpointing.
 *
 * <p>{@link #tolerableCheckpointFailureNumber} would always overrule this deprecated field if they have conflicts.
 *
 * @deprecated Use {@link #tolerableCheckpointFailureNumber}.
 */
@Deprecated
private boolean failOnCheckpointingErrors = true;
```
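Because this flag is deprecated and is overruled by tolerableCheckpointFailureNumber anyway, new code should configure the failure-count threshold instead; the value 3 below is illustrative:

```java
// Tolerate up to 3 checkpoint failures before the job itself is failed.
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
```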

```java
/** Determines if a job will fallback to checkpoint when there is a more recent savepoint. */
private boolean preferCheckpointForRecovery = false;
```
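This flag matters at recovery time, when a more recent savepoint competes with the latest checkpoint; a sketch of switching it on for the same env:

```java
// Prefer the latest checkpoint for recovery even if a more recent savepoint exists.
env.getCheckpointConfig().setPreferCheckpointForRecovery(true);
```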

```java
/**
 * Determines the threshold that we tolerance declined checkpoint failure number.
 * The default value is -1 meaning undetermined and not set via {@link #setTolerableCheckpointFailureNumber(int)}.
 */
private int tolerableCheckpointFailureNumber = UNDEFINED_TOLERABLE_CHECKPOINT_NUMBER;
```
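Putting the pieces together, the following self-contained sketch configures all of the parameters above through a job's CheckpointConfig instance; the class name and all concrete values are illustrative, only the API calls come from CheckpointConfig itself:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // checkpointInterval: disabled (-1) until enabled; here, every 60 seconds, exactly-once.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig config = env.getCheckpointConfig();

        // checkpointTimeout: discard a checkpoint attempt after 10 minutes (the default).
        config.setCheckpointTimeout(10 * 60 * 1000);

        // minPauseBetweenCheckpoints: leave at least 500 ms of processing between checkpoints.
        config.setMinPauseBetweenCheckpoints(500);

        // maxConcurrentCheckpoints: at most one checkpoint in flight (the default).
        config.setMaxConcurrentCheckpoints(1);

        // tolerableCheckpointFailureNumber: fail the job only after 3 checkpoint failures.
        config.setTolerableCheckpointFailureNumber(3);

        // externalizedCheckpointCleanup: retain checkpoint data when the job is cancelled.
        config.enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Sources, transformations and sinks would be defined here, followed by env.execute(...).
    }
}
```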