rfam-msa-qa

Tools for ensuring the quality of multiple sequence alignments (MSAs) in Rfam

Stockholm File Validation

This repository includes a modular validation script for Stockholm format alignment files (.so, .sto, .stk).

Usage

python3 validate_stockholm.py [-v] [--fix] [--output-mode {stdout,file}] <file1.so> [file2.so ...]

Options:

  • -v, --verbose: Print detailed validation information
  • --fix: Attempt to fix fixable errors automatically
  • --output-mode {stdout,file}: Output mode for fixed files (default: file)
    • file: Create a new file with a _corrected suffix
    • stdout: Print corrected content to stdout
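
The options above can be modelled with argparse. This is a minimal sketch of the documented command line only, not the script's actual source; the function name build_parser and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="validate_stockholm.py")
    parser.add_argument("files", nargs="+", help="Stockholm files to validate")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Print detailed validation information")
    parser.add_argument("--fix", action="store_true",
                        help="Attempt to fix fixable errors automatically")
    parser.add_argument("--output-mode", choices=["stdout", "file"],
                        default="file", help="Output mode for fixed files")
    return parser


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--fix", "--output-mode", "stdout", "file.so"])
print(args.fix, args.output_mode, args.files)
```

With no --output-mode given, args.output_mode falls back to "file", matching the documented default.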

What is validated?

Fatal Errors (must be fixed manually):

  • Missing # STOCKHOLM 1.0 header
  • Missing // terminator
  • No sequences found in alignment
  • Sequences of unequal length (all sequences must have the same length)
  • Sequences containing whitespace characters
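
The fatal checks listed above could be implemented roughly as follows. This is a hedged sketch, not the code in scripts/fatal_errors.py; the function name find_fatal_errors and the message wording are assumptions.

```python
def find_fatal_errors(text: str) -> list[str]:
    """Return descriptions of fatal problems found in Stockholm-formatted text."""
    errors = []
    non_blank = [ln for ln in text.splitlines() if ln.strip()]

    if not non_blank or non_blank[0].strip() != "# STOCKHOLM 1.0":
        errors.append("missing '# STOCKHOLM 1.0' header")
    if not non_blank or non_blank[-1].strip() != "//":
        errors.append("missing '//' terminator")

    # Sequence lines are non-blank lines that are neither markup ('#...') nor '//'.
    seq_lines = [ln for ln in non_blank
                 if not ln.startswith("#") and ln.strip() != "//"]
    if not seq_lines:
        errors.append("no sequences found")
        return errors

    lengths = set()
    for ln in seq_lines:
        fields = ln.split()
        if len(fields) != 2:
            # Anything other than "name data" implies stray whitespace or missing data.
            errors.append(f"malformed sequence line: {ln!r}")
            continue
        lengths.add(len(fields[1]))
    if len(lengths) > 1:
        errors.append("sequences differ in length")
    return errors
```

An empty return value means none of the five fatal conditions was detected.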

Fixable Errors (can be auto-corrected with --fix):

  • Duplicate sequences (same accession, coordinates, and sequence data)

Warnings (non-critical):

  • Missing 2D structure consensus annotation (#=GC SS_cons)
  • Lines exceeding the 10,000-character limit
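
Both warnings can be checked in a single pass over the file's lines. The sketch below is illustrative and is not taken from scripts/warnings.py; find_warnings and MAX_LINE_LENGTH are hypothetical names.

```python
MAX_LINE_LENGTH = 10_000  # the limit quoted in the list above


def find_warnings(text: str) -> list[str]:
    """Return non-critical issues found in Stockholm-formatted text."""
    warnings = []
    lines = text.splitlines()

    # A consensus structure line starts with the two tokens '#=GC' and 'SS_cons'.
    if not any(ln.split()[:2] == ["#=GC", "SS_cons"] for ln in lines):
        warnings.append("missing #=GC SS_cons annotation")

    for number, ln in enumerate(lines, start=1):
        if len(ln) > MAX_LINE_LENGTH:
            warnings.append(f"line {number} exceeds {MAX_LINE_LENGTH} characters")
    return warnings
```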

Modular Architecture

The validation logic is split into separate modules in the scripts/ directory:

  • fatal_errors.py: Errors that cannot be automatically fixed
  • fixable_errors.py: Errors that can be automatically corrected
  • warnings.py: Non-critical issues
  • parser.py: Stockholm file parsing utilities

Sequence Format

Sequence names should follow the format ACCESSION/START-END, where:

  • ACCESSION is the sequence identifier (e.g., a GenBank accession such as AF228364.1)
  • START-END is the coordinate range indicating which portion of the original sequence is included (e.g., 1-74)

Example: AF228364.1/1-74

Sequence data:

  • May contain any characters except whitespace
  • Gaps may be indicated by . or -
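
A regular expression is one way to split a name into its parts. In this sketch the pattern, the SeqName type, and parse_seq_name are all assumptions for illustration; in particular, the accession is assumed to contain no whitespace or slashes, and START is not required to be smaller than END.

```python
import re
from typing import NamedTuple, Optional

# ACCESSION/START-END, with the accession assumed to be free of whitespace and '/'.
NAME_RE = re.compile(r"^(?P<accession>[^\s/]+)/(?P<start>\d+)-(?P<end>\d+)$")


class SeqName(NamedTuple):
    accession: str
    start: int
    end: int


def parse_seq_name(name: str) -> Optional[SeqName]:
    """Return the parsed name, or None if it does not match ACCESSION/START-END."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    return SeqName(match["accession"], int(match["start"]), int(match["end"]))


def has_whitespace(sequence: str) -> bool:
    """Sequence data may contain any characters except whitespace."""
    return any(ch.isspace() for ch in sequence)


print(parse_seq_name("AF228364.1/1-74"))
```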

Duplicate Detection and Removal

With the --fix flag, the script can detect and remove duplicate sequences. Two sequences are duplicates when they have:

  1. The same accession/identifier
  2. The same coordinates
  3. The exact same sequence data
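
Because the name already encodes the accession and coordinates, the three criteria above reduce to comparing (name, sequence) pairs. The following is a minimal sketch under that reading; remove_duplicates is a hypothetical helper, and keeping the first occurrence is an assumption rather than documented behaviour.

```python
def remove_duplicates(records):
    """Split (name, sequence) records into unique and removed lists.

    The first occurrence of each record is kept (an assumption of this sketch).
    """
    seen = set()
    unique, removed = [], []
    for name, seq in records:
        key = (name, seq)  # name = ACCESSION/START-END, so this covers all three criteria
        if key in seen:
            removed.append((name, seq))
        else:
            seen.add(key)
            unique.append((name, seq))
    return unique, removed


records = [
    ("A/1-4", "ACGU"),
    ("B/1-4", "ACGU"),  # same data, different accession: kept
    ("A/1-4", "ACGU"),  # exact duplicate of the first record: removed
    ("A/1-4", "AC-U"),  # same name, different data: kept
]
unique, removed = remove_duplicates(records)
print(len(unique), len(removed))
```

Records that share a name but differ in sequence data are not treated as duplicates here, in line with the definition above.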

Examples

# Validate a single file
python3 validate_stockholm.py example_valid.so

# Validate multiple files with verbose output
python3 validate_stockholm.py -v file1.so file2.so file3.so

# Fix duplicate sequences and create corrected file
python3 validate_stockholm.py --fix file.so

# Fix and output to stdout
python3 validate_stockholm.py --fix --output-mode stdout file.so

Continuous Integration

The repository includes a GitHub Action that automatically validates Stockholm files in pull requests. The CI check will:

  • Trigger when a PR modifies .so, .sto, or .stk files
  • Run the validation script on all changed files
  • Pass only if all files are valid
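
The logic of the check can be summarised in a few lines of Python. This sketch does not reproduce the repository's workflow file; stockholm_files, ci_passes, and the is_valid callback are hypothetical, and extension matching is assumed to be case-sensitive.

```python
from pathlib import PurePosixPath

STOCKHOLM_SUFFIXES = {".so", ".sto", ".stk"}


def stockholm_files(changed_paths):
    """Keep only the changed paths that end in a Stockholm extension."""
    return [p for p in changed_paths
            if PurePosixPath(p).suffix in STOCKHOLM_SUFFIXES]


def ci_passes(changed_paths, is_valid):
    """Pass only if every changed Stockholm file is valid.

    is_valid is a caller-supplied function taking a path and returning a bool.
    """
    return all(is_valid(p) for p in stockholm_files(changed_paths))
```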

Stockholm Format Reference

Stockholm format is used for multiple sequence alignments. Basic structure:

# STOCKHOLM 1.0
AF228364.1/1-74    CGGCAGAUGAUGAU-UUUACUUGGAUUCCCCUUCAGAACAUUUA
AF228365.1/1-73    CGGCAGAUGAUGAU-UUUACUUGGAUUCCCCUUCAGAACAUUU-
#=GC SS_cons       <<<<_______..________.__._.______.___.___.__
//

The #=GC SS_cons line is the 2D structure consensus annotation, which describes the consensus secondary structure of the aligned RNAs. It is not strictly required, but it is recommended for Rfam alignments.
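
Reading this structure takes only a few lines of Python. The sketch below is a simplified reader written for illustration, not the repository's parser.py; parse_stockholm is a hypothetical name, and markup other than #=GC lines is ignored.

```python
def parse_stockholm(text):
    """Split Stockholm text into sequence records and #=GC annotations."""
    sequences = []  # (name, aligned sequence) pairs in file order
    gc = {}         # per-column annotations, e.g. {"SS_cons": "<..>"}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "//" or line.startswith("# STOCKHOLM"):
            continue
        if line.startswith("#=GC"):
            parts = line.split(None, 2)  # '#=GC', tag, annotation string
            if len(parts) == 3:
                gc[parts[1]] = gc.get(parts[1], "") + parts[2]
        elif not line.startswith("#"):
            parts = line.split(None, 1)  # name, aligned sequence
            if len(parts) == 2:
                sequences.append((parts[0], parts[1]))
    return sequences, gc


example = """# STOCKHOLM 1.0
A/1-4         ACGU
B/1-3         AC-U
#=GC SS_cons  <..>
//
"""
sequences, gc = parse_stockholm(example)
print(sequences, gc)
```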

For more information, see the Stockholm format specification.
