Releases: instructlab/training

v0.13.0 - Pretraining Support & Optimizer Configuration

08 Jan 19:48
574f946

What's New

Features

  • Pretraining Data Processing API (#672)

    • Added a new API for processing pretraining-style datasets
    • Documents are now chunked by a configurable block_size
    • Chunks are treated as independent, fully-unmasked samples
    • Updated training loop to ingest pretraining-style datasets
    • Includes comprehensive test coverage (test_pretraining_data_process.py, test_pretraining_mode.py, test_pretraining_sampler.py)
  • AdamW Optimizer Configuration (#674)

    • Exposed weight_decay, betas, and eps parameters in TrainingArgs
    • Users can now tune AdamW hyperparameters through the run_training() API
    • Provides more control over optimizer behavior
  • Granite 4 Model Support (#669)

    • Added support for Granite 4 models as Mixture of Experts (MoE) models in training
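
The chunking behaviour described under the Pretraining Data Processing API can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the function name chunk_document is hypothetical, and keeping the trailing short chunk is an assumption, since the notes do not say how a final partial block is handled.

```python
from typing import Dict, List


def chunk_document(token_ids: List[int], block_size: int) -> List[Dict[str, List[int]]]:
    """Split one tokenized document into independent, fully unmasked samples.

    Hypothetical helper for illustration only; not part of instructlab/training.
    """
    if block_size <= 0:
        raise ValueError("block_size must be a positive integer")

    samples = []
    for start in range(0, len(token_ids), block_size):
        block = token_ids[start:start + block_size]
        # Fully unmasked: every token contributes to the loss, so the labels
        # are simply a copy of the inputs (no positions are masked out).
        # Assumption: a final chunk shorter than block_size is kept as-is.
        samples.append({"input_ids": block, "labels": list(block)})
    return samples


samples = chunk_document(list(range(10)), block_size=4)
for sample in samples:
    print(sample["input_ids"])
```

Each returned sample stands on its own, which matches the note that chunks are treated as independent samples rather than as parts of a longer sequence.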

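To show what the three newly exposed AdamW hyperparameters control, here is a rough sketch of a single AdamW update on one scalar parameter. It is written in plain Python for readability and is not the code path the library uses; the function name adamw_step and the default values are illustrative assumptions.

```python
import math
from typing import Tuple


def adamw_step(
    param: float,
    grad: float,
    m: float,
    v: float,
    step: int,
    lr: float = 1e-3,
    betas: Tuple[float, float] = (0.9, 0.999),
    eps: float = 1e-8,
    weight_decay: float = 0.01,
) -> Tuple[float, float, float]:
    """Apply one AdamW update to a scalar (illustrative helper, step starts at 1)."""
    beta1, beta2 = betas

    # weight_decay: decoupled decay shrinks the parameter directly,
    # independently of the gradient-based update below.
    param = param * (1.0 - lr * weight_decay)

    # betas: decay rates for the running averages of the gradient (beta1)
    # and of the squared gradient (beta2).
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad

    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)

    # eps: keeps the denominator away from zero when v_hat is tiny.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


new_param, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1)
print(new_param)
```

Setting weight_decay to 0.0 removes the shrinkage term, and a larger eps damps the update when the squared-gradient average is small.
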
Bug Fixes

  • Process Timing Fix (#675)

    • Fixed a race condition in which a process was read before it had completed
  • Variable Access Fix (#668)

    • Fixed a stray invalid variable access

Dependencies

  • Build Dependency Update (#670)
    • Updated hynek build dependency

Files Changed

17 files changed with 1,642 insertions and 52 deletions:

  • Core training modules: data_process.py, main_ds.py, sampler.py, model.py, config.py
  • New test suites for pretraining functionality
  • Updated README with new capabilities

Full Changelog

All Changes:

  • 574f946 Exposes API for processing pretraining data (#672)
  • 638a753 fixes bug where process isn't completed by the time the process gets read (#675)
  • c495035 Expose AdamW optimizer parameters in training API (#674)
  • 3d05302 Handle granite 4 as MoE models in training (#669)
  • 781c36f fixes stray invalid variable access bug (#668)
  • 529c2f7 bumps hynek build dep (#670)

Full Diff: v0.12.1...v0.13.0

v0.12.1 - Granite 4 support, and adding extended env var and torchrun arg support

14 Oct 20:47
637afae

What's Changed

  • Update requirements-cuda.txt to increase liger-kernel minimum by @Maxusmusti in #659
  • Adds mamba-ssm[causal-conv1d] to CUDA requirements by @RobotSail in #663
  • Removes Numpy version cap by @RobotSail in #664
  • fix(torchrun): Omit empty arguments and correct nproc_per_node type by @szaher in #661
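
The torchrun fix in #661 is described only by its title, so the following is a sketch of the general idea rather than the change itself: when assembling a torchrun command line, options that are unset or empty are dropped instead of being passed as blank flags. The helper name build_torchrun_command and the script name train.py are hypothetical. For --nproc_per_node, torchrun accepts either an integer or a string such as gpu, so the value is passed through without forcing a type.

```python
from typing import Dict, List, Optional, Union

OptionValue = Optional[Union[str, int]]


def build_torchrun_command(options: Dict[str, OptionValue], script: str) -> List[str]:
    """Assemble a torchrun argv list, skipping options with no value.

    Hypothetical helper for illustration; not the library's code.
    """
    command = ["torchrun"]
    for name, value in options.items():
        # Skip None and empty strings, but keep legitimate falsy values
        # such as 0 (for example a node rank of 0).
        if value is None or value == "":
            continue
        command.append(f"--{name}={value}")
    command.append(script)
    return command


command = build_torchrun_command(
    {
        "nnodes": 1,
        "nproc_per_node": "gpu",  # an integer such as 8 would also be valid
        "rdzv_id": None,          # unset, so omitted
        "rdzv_endpoint": "",      # empty, so omitted
    },
    script="train.py",
)
print(command)
```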

Full Changelog: v0.12.0...v0.12.1

v0.12.0 - GPT-OSS Support

17 Sep 17:10
536ebfb

Full fine-tuning now supports gpt-oss models. This release also includes minor bug fixes that keep loss calculations correct at higher gradient accumulation settings.
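
The note does not say which calculation was wrong. As general background, and not as a description of this fix, one common way for a loss to drift under gradient accumulation is averaging per-micro-batch means when the micro-batches contain different numbers of tokens. The sketch below compares that with a token-weighted mean; both function names are hypothetical.

```python
from typing import List


def mean_of_means(micro_batches: List[List[float]]) -> float:
    """Average each micro-batch's mean loss; biased when token counts differ."""
    per_batch = [sum(losses) / len(losses) for losses in micro_batches]
    return sum(per_batch) / len(per_batch)


def token_weighted_mean(micro_batches: List[List[float]]) -> float:
    """Sum every per-token loss and divide by the total token count."""
    total_loss = sum(sum(losses) for losses in micro_batches)
    total_tokens = sum(len(losses) for losses in micro_batches)
    return total_loss / total_tokens


# Two accumulated micro-batches with very different token counts.
micro_batches = [[2.0, 2.0, 2.0, 2.0], [4.0]]
naive = mean_of_means(micro_batches)
weighted = token_weighted_mean(micro_batches)
print(naive, weighted)
```

The two values agree only when every micro-batch holds the same number of loss-bearing tokens, and the gap can grow as more micro-batches are accumulated per optimizer step.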

Full Changelog: v0.11.1...v0.12.0

v0.11.1

05 Aug 19:34

Full Changelog: v0.11...v0.11.1

v0.10.4

07 Jul 13:33
0cc2e30

Full Changelog: v0.10.3...v0.10.4

v0.10.3

08 May 19:50
40e1e8c

What's Changed

  • moves deepspeed requirements into their own file; add deepspeed extras (backport #455) by @mergify in #546

Full Changelog: v0.10.2...v0.10.3

v0.11

08 May 19:23
e8eb284

What's Changed

  • ci: Remove workflow that doesn't utilize training library (medium, -mp) by @booxter in #478
  • Obey the FSDP sharding option default by @Maxusmusti in #486
  • Change default internal sharding strategy to HYBRID_SHARD by @Maxusmusti in #488
  • chore: Update the large e2e job to use fallback logic for selecting EC2 instances by @courtneypacheco in #491
  • moves deepspeed requirements into their own file; add deepspeed extras by @JamesKunstle in #455
  • chore: introduce dummy workflow by @cdoern in #497
  • ci: Search for necessary instance for smoke job in multiple AZs by @booxter in #481
  • ci: Fix -sdk fake workflow failure on actionlint by @booxter in #501
  • build(deps): Bump actions/setup-python from 5.5.0 to 5.6.0 by @dependabot in #493
  • use instructlab constraints-dev.txt in e2e test by @ktdreyer in #499
  • build(deps): Bump step-security/harden-runner from 2.11.1 to 2.12.0 by @dependabot in #490
  • ci: Use tox-current-env to reuse prepared venv with torch by @booxter in #482
  • fix: extend nccl timeout by @cdoern in #507
  • always log storage by @RobotSail in #510
  • deps: Remove caps on ROCm dependencies by @courtneypacheco in #517
  • ci: don't trigger pull_request_target job on its own workflow by @booxter in #519
  • Enable pylint 'unused-argument' check by @fynnsu in #528

Full Changelog: v0.10.0...v0.11

v0.10.2 - Remove ROCm dependency caps

01 May 14:49
a9a69e9

Full Changelog: v0.10.1...v0.10.2

v0.10.1 - Updating Default FSDP Sharding

21 Apr 20:57
a4d52a5

What's Changed

  • ci: Remove workflow that doesn't utilize training library (medium, -mp) by @booxter in #478
  • Obey the FSDP sharding option default (backport #486) by @mergify in #487
  • Change default internal sharding strategy to HYBRID_SHARD (backport #488) by @mergify in #489

Full Changelog: v0.10.0...v0.10.1

v0.10.0 - Updated FSDP Mixed Precision and Liger Kernel Model Option Support

17 Apr 21:26
be01c2c

What's Changed

  • disables e2e-nvidia-l4-x1 test by @JamesKunstle in #454
  • ci: Fix unit test run due to no tests found to execute by @booxter in #466
  • ci: Don't run smoke tests when only irrelevant files are touched by @booxter in #460
  • ci: don't waste ec2 resources on unit tests by @booxter in #464
  • ci: Trigger unit test run on tox.ini change by @booxter in #469
  • ci: Fix path filter for unit tests for the workflow file by @booxter in #461
  • chore: Don't install pytest dependencies for coverage reports by @booxter in #468
  • chore: Remove spell checks from the repo by @booxter in #458
  • chore: Don't set ec2_runner_variant for unit tests by @booxter in #475
  • Remove CHANGELOG.md by @booxter in #457
  • Fix FSDP mixed precision setting and loss w/ accelerate by @Maxusmusti in #465
  • fixes non-granite model instantiation with Liger Kernel by @JamesKunstle in #476
  • ci: Install torch before flash-attn by @booxter in #474
  • ci: Use pull_request as trigger for unit tests by @booxter in #473
  • ci: Run unit tests for all supported python version, 3.11+ by @booxter in #472
  • chore: Require python3.11+ by @booxter in #470
  • chore: Drop pytest-asyncio by @booxter in #467
  • chore: don't trigger unit tests for cuda and rocm requirements changes by @booxter in #463
  • build(deps): Bump step-security/harden-runner from 2.10.4 to 2.11.1 by @dependabot in #452
  • build(deps): Bump machulav/ec2-github-runner from 2.3.8 to 2.3.9 by @dependabot in #450
  • build(deps): Bump aws-actions/configure-aws-credentials from 4.0.2 to 4.1.0 by @dependabot in #451

Full Changelog: v0.9.0...v0.10.0