Tools for ensuring the quality of multiple sequence alignments (MSAs) in Rfam
This repository includes a modular validation script for Stockholm format alignment files (.so, .sto, .stk).
python3 validate_stockholm.py [-v] [--fix] [--output-mode {stdout,file}] <file1.so> [file2.so ...]

Options:
- -v, --verbose: Print detailed validation information
- --fix: Attempt to fix fixable errors automatically
- --output-mode {stdout,file}: Output mode for fixed files (default: file)
  - file: Create a new file with a _corrected suffix
  - stdout: Print corrected content to stdout
Fatal Errors (must be fixed manually):
- Missing # STOCKHOLM 1.0 header
- Missing // terminator
- No sequences found in alignment
- Sequences of unequal length (all sequences must have the same length)
- Whitespace characters inside sequence data
Fixable Errors (can be auto-corrected with --fix):
- Duplicate sequences (same accession, coordinates, and sequence data)
Warnings (non-critical):
- Missing 2D structure consensus annotation (#=GC SS_cons)
- Lines exceeding the 10,000-character limit
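
The fatal errors and warnings listed above can be illustrated with a minimal sketch. This is not the code in the scripts/ directory: the function name check_alignment and the message strings are hypothetical, and the sketch assumes a non-interleaved file where each sequence sits on a single line.

```python
MAX_LINE_LENGTH = 10_000  # character limit quoted in the warnings list


def check_alignment(text):
    """Return (fatal_errors, warnings) for one Stockholm-formatted string.

    Hypothetical helper for illustration; assumes one line per sequence.
    """
    fatal, warnings = [], []
    lines = text.splitlines()
    non_blank = [line.strip() for line in lines if line.strip()]

    if not non_blank or non_blank[0] != "# STOCKHOLM 1.0":
        fatal.append("Missing # STOCKHOLM 1.0 header")
    if not non_blank or non_blank[-1] != "//":
        fatal.append("Missing // terminator")

    lengths = []
    has_ss_cons = False
    for number, line in enumerate(lines, start=1):
        if len(line) > MAX_LINE_LENGTH:
            warnings.append(f"Line {number} exceeds {MAX_LINE_LENGTH} characters")
        fields = line.split()
        if fields[:2] == ["#=GC", "SS_cons"]:
            has_ss_cons = True
        # Skip blank lines, markup/comment lines, and the terminator.
        if not fields or fields[0].startswith("#") or fields[0] == "//":
            continue
        if len(fields) < 2:
            continue  # a name with no data; ignored in this sketch
        if len(fields) > 2:
            fatal.append(f"Line {number}: sequence contains whitespace")
        lengths.append(len("".join(fields[1:])))

    if not lengths:
        fatal.append("No sequences found in alignment")
    elif len(set(lengths)) > 1:
        fatal.append("Sequences do not all have the same length")
    if not has_ss_cons:
        warnings.append("Missing #=GC SS_cons annotation")
    return fatal, warnings
```

Returning two separate lists mirrors the split between errors that block validation and issues that are only reported.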
The validation logic is split into separate modules in the scripts/ directory:
- fatal_errors.py: Errors that cannot be automatically fixed
- fixable_errors.py: Errors that can be automatically corrected
- warnings.py: Non-critical issues
- parser.py: Stockholm file parsing utilities
Sequences should follow the format: ACCESSION/START-END where:
- ACCESSION is the sequence identifier (e.g., from GenBank, like AF228364.1)
- START-END are the coordinates indicating which portion of the original sequence is included (e.g., 1-74)
Example: AF228364.1/1-74
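
As an illustration of the naming convention, a sequence name could be split into its parts with a regular expression. This is a sketch only: NAME_RE and parse_name are hypothetical names, and the pattern assumes the coordinates are plain non-negative integers.

```python
import re

# ACCESSION/START-END, where ACCESSION contains no whitespace (assumed pattern).
NAME_RE = re.compile(r"^(?P<accession>\S+)/(?P<start>\d+)-(?P<end>\d+)$")


def parse_name(name):
    """Return (accession, start, end), or None if the name does not match."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    return match.group("accession"), int(match.group("start")), int(match.group("end"))


print(parse_name("AF228364.1/1-74"))  # ('AF228364.1', 1, 74)
```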
Sequence data:
- May contain any characters except whitespace
- Gaps may be indicated by . or -
The script can detect and remove duplicate sequences using the --fix flag. Duplicates are defined as sequences that have:
- The same accession/identifier
- The same coordinates
- The exact same sequence data
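
The duplicate definition above can be sketched as follows. The function remove_duplicates is a hypothetical helper, not the repository's implementation. It assumes records are (name, sequence) pairs in file order and that the name already carries both the accession and the coordinates, so comparing the pair covers all three criteria.

```python
def remove_duplicates(records):
    """Drop repeated (name, sequence) pairs, keeping the first occurrence.

    Hypothetical helper for illustration; `records` is a list of
    (name, sequence) tuples where name looks like ACCESSION/START-END.
    """
    seen = set()
    unique = []
    for name, sequence in records:
        key = (name, sequence)
        if key in seen:
            continue  # same accession, coordinates, and sequence data
        seen.add(key)
        unique.append((name, sequence))
    return unique
```

Two records with the same name but different sequence data are both kept, since they do not meet the definition of a duplicate.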
# Validate a single file
python3 validate_stockholm.py example_valid.so
# Validate multiple files with verbose output
python3 validate_stockholm.py -v file1.so file2.so file3.so
# Fix duplicate sequences and create corrected file
python3 validate_stockholm.py --fix file.so
# Fix and output to stdout
python3 validate_stockholm.py --fix --output-mode stdout file.so

The repository includes a GitHub Action that automatically validates Stockholm files in pull requests. The CI check will:
- Trigger when a PR modifies .so, .sto, or .stk files
- Run the validation script on all changed files
- Pass only if all files are valid
Stockholm format is used for multiple sequence alignments. Basic structure:
# STOCKHOLM 1.0
AF228364.1/1-74 CGGCAGAUGAUGAU-UUUACUUGGAUUCCCCUUCAGAACAUUUA
AF228365.1/1-73 CGGCAGAUGAUGAU-UUUACUUGGAUUCCCCUUCAGAACAUUU
#=GC SS_cons <<<<_______..________.__._.______.___.___.___
//
The #=GC SS_cons line is the 2D structure consensus annotation, which represents the secondary structure of the RNA alignment. While not strictly required, it is recommended for Rfam alignments.
For more information, see the Stockholm format specification.