# Liftover
NOTE
This documentation is a work in progress and is incomplete.
Please contact developers for more details.
This page describes the utility for genomic coordinate liftover and annotation.
# VCF Liftover Tool - user guide
# Overview
The VCF Liftover Tool annotates Variant Call Format (VCF) files with coordinates from a different reference genome (the target is currently hard-coded to GRCh38/hg38). This is useful when variants need to be analyzed in the context of a reference genome other than the one used for variant calling.
# Key features
- Coordinate conversion: Converts genomic coordinates from one reference genome to another using CrossMap
- Annotation: Adds hg38 coordinates to the INFO field of VCF entries
- Structural Variant support: Handles structural variants that carry END and SVLEN annotations
- Multiple mapping resolution: Resolves coordinates that map to multiple locations by running a second, region-based mapping pass
- BED output option: Optionally outputs the lifted coordinates in BED format with tabix indexing
# Prerequisites
Before using the tool, ensure you have:
- Python 3.12+
- The following Python modules (all are part of the standard library except pysam, which must be installed separately):
  - argparse: Command-line argument parsing
  - subprocess: Executing external commands like CrossMap
  - tempfile: Creating temporary directories and files
  - pathlib (Path): Platform-independent path handling
  - pysam: Handling VCF/BCF files and tabix indexing
  - logging: Structured logging
  - shutil: File operations like copying
  - re: Regular expression support
  - time: Handling delays in retry logic
  - typing: Type annotation support
  - os: File system operations
  - dataclasses: Data class functionality
- External tools:
  - CrossMap
  - bgzip
  - tabix
  - sort (standard Unix utility)
- A chain file for the reference genome conversion (e.g., hg19ToHg38.over.chain)
# Basic usage
python annotate_with_liftover.py input.vcf hg19ToHg38.over.chain output.vcf
# Command line arguments
Argument | Description |
---|---|
input_vcf | Path to the input VCF file to be annotated |
chain_file | Path to the chain file used for coordinate liftover |
outfile | Path where the annotated VCF file will be saved |
--debug | Enable debug mode for verbose logging |
--output-bed | Also output the annotated coordinates as a BED file |
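
The argument table above could correspond to an `argparse` setup along the following lines. This is a minimal sketch, not the tool's actual parser: only the argument names come from the table, while the `build_parser` function, the help strings, and the mapping of `--debug` to a log level are assumptions made for illustration.

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(
        description="Annotate a VCF with lifted-over coordinates"
    )
    parser.add_argument("input_vcf", help="Input VCF file to be annotated")
    parser.add_argument("chain_file", help="Chain file used for liftover")
    parser.add_argument("outfile", help="Path of the annotated VCF to write")
    parser.add_argument("--debug", action="store_true", help="Verbose logging")
    parser.add_argument(
        "--output-bed",
        action="store_true",
        help="Also write the annotated coordinates as a BED file",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is runnable.
args = build_parser().parse_args(
    ["sample.vcf", "hg19ToHg38.over.chain", "sample.hg38.vcf", "--output-bed"]
)
# Assumption: --debug switches the log level from INFO to DEBUG.
log_level = logging.DEBUG if args.debug else logging.INFO
```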
# Examples
# Basic annotation
python annotate_with_liftover.py sample.vcf hg19ToHg38.over.chain sample.hg38.vcf
# Annotation with BED output
python annotate_with_liftover.py sample.vcf hg19ToHg38.over.chain sample.hg38.vcf --output-bed
This will create both `sample.hg38.vcf` and `sample.hg38.vcf.bed.gz` (with a tabix index).
# Debug mode
python annotate_with_liftover.py sample.vcf hg19ToHg38.over.chain sample.hg38.vcf --debug
# Understanding the output
The annotated VCF will include the following new INFO fields:
Field | Description |
---|---|
hg38_chr | Chromosome in hg38 |
hg38_start | Start position in hg38 (1-based) |
hg38_end | End position in hg38 |
hg38_coord | Coordinates in hg38 (chromosome:start-end format) |
hg38_map | Mapping status (UNIQUE, REGION, or FAILED) |
# Mapping status values
- UNIQUE: The original coordinates mapped uniquely to a single location in hg38
- REGION: The original coordinates mapped to multiple locations, but a primary mapping was determined
- FAILED: The original coordinates could not be mapped to hg38
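
Downstream scripts can read these annotations straight from the INFO column. The sketch below parses a single INFO string with plain string handling; `parse_info` is a hypothetical helper and the coordinate values in the example are invented for illustration.

```python
def parse_info(info: str) -> dict[str, str]:
    """Split a VCF INFO string into a key -> value mapping (flags map to '')."""
    result: dict[str, str] = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, _, value = item.partition("=")
        result[key] = value
    return result


# Illustrative INFO column; the values are made up.
info = (
    "DP=35;hg38_chr=chr1;hg38_start=1000;hg38_end=1000;"
    "hg38_coord=chr1:1000-1000;hg38_map=UNIQUE"
)
fields = parse_info(info)
# Treat both UNIQUE and REGION as successfully lifted.
is_lifted = fields.get("hg38_map") in {"UNIQUE", "REGION"}
```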
# Working with Structural Variants
The tool automatically detects structural variants by looking for:
- The `END` tag in the INFO field
- The `SVLEN` tag in the INFO field
For variants where these are not available, the tool uses the reference allele length to determine the end position.
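
The end-position rule described above can be sketched as a small function. The precedence (END first, then SVLEN, then the reference allele length) and the use of the absolute value of SVLEN are assumptions; the tool's actual logic may differ, for example in how it treats insertions.

```python
def variant_end(pos: int, ref: str, info: dict[str, str]) -> int:
    """Return the 1-based inclusive end position of a variant.

    Assumed precedence: END, then SVLEN, then the reference allele length.
    """
    if "END" in info:
        return int(info["END"])
    if "SVLEN" in info:
        # SVLEN can be negative (deletions) and can be a comma-separated list;
        # this sketch takes the absolute value of the first entry.
        svlen = abs(int(info["SVLEN"].split(",")[0]))
        return pos + svlen
    # Plain variant: the reference allele spans pos .. pos + len(ref) - 1.
    return pos + len(ref) - 1


end_from_tag = variant_end(100, "A", {"END": "150"})
end_from_svlen = variant_end(100, "A", {"SVLEN": "-50"})
end_from_ref = variant_end(100, "ACGT", {})
```

For a BED interval, the corresponding 0-based start would be `pos - 1`, with the end value used unchanged as the half-open end.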
# Troubleshooting
If you encounter issues:
- Enable debug mode with the `--debug` flag for detailed logging
- Check that your input VCF is properly formatted
- Verify that the chain file is appropriate for your genome conversion
- Ensure all required dependencies are installed and in your PATH
# Chromosome naming conventions
The tool automatically handles chromosome naming differences between reference genomes (e.g., "1" vs "chr1"). The output coordinates will be formatted according to the hg38 convention.
# VCF Liftover Tool - technical specification
# Architecture overview
The VCF Liftover Tool implements a pipeline that converts genomic coordinates between reference genomes and annotates variant calls. The design is modular: each class handles one aspect of the workflow.
# Class structure
classDiagram
Pipeline --> BEDGenerator
Pipeline --> BEDSorter
Pipeline --> LiftOverHandler
Pipeline --> CoordinateProcessor
Pipeline --> VCFAnnotator
Pipeline --> VCFHeaderProcessor
BEDGenerator --> GenomeConfig
LiftOverHandler --> CommandRunner
CoordinateProcessor --> GenomeConfig
VCFAnnotator --> GenomeConfig
BEDSorter --> CommandRunner
FileHandler <-- Pipeline
FileHandler <-- LiftOverHandler
class Pipeline {
+__init__(input_vcf, chain_file, outfile, debug, output_bed)
+run()
-_validate_inputs()
}
class GenomeConfig {
+use_chr_prefix: bool
+default_chr_prefix: str
+format_chrom(chrom)
+normalize_chrom(chrom)
}
class BEDGenerator {
+vcf_to_bed(input_vcf)
}
class LiftOverHandler {
+lift_coordinates(bed_path)
+run_region_mapping(multi_path)
}
class CoordinateProcessor {
+process_lifted_coordinates(lifted_path, unlifted_path, original_bed_path, cmd_runner, chain_file)
}
class CommandRunner {
+run(cmd, check, capture_output)
}
# Key components
# 1. GenomeConfig
This class handles chromosome naming conventions between different reference genomes:
- `use_chr_prefix`: Boolean flag for chromosome prefix usage
- `format_chrom()`: Formats chromosome names according to configuration
- `normalize_chrom()`: Normalizes chromosome names for consistent comparison
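
One way such a class could behave is sketched below. The attribute and method names come from the class diagram, but the default values and method bodies are assumptions; in particular, this sketch does not handle special cases such as mitochondrial naming ("MT" vs "chrM").

```python
from dataclasses import dataclass


@dataclass
class GenomeConfig:
    use_chr_prefix: bool = True      # assumed default
    default_chr_prefix: str = "chr"  # assumed default

    def normalize_chrom(self, chrom: str) -> str:
        """Strip the prefix so that '1' and 'chr1' compare equal."""
        if chrom.startswith(self.default_chr_prefix):
            return chrom[len(self.default_chr_prefix):]
        return chrom

    def format_chrom(self, chrom: str) -> str:
        """Render a chromosome name with or without the prefix."""
        bare = self.normalize_chrom(chrom)
        if self.use_chr_prefix:
            return f"{self.default_chr_prefix}{bare}"
        return bare


hg38_style = GenomeConfig(use_chr_prefix=True)
ensembl_style = GenomeConfig(use_chr_prefix=False)
```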
# 2. BEDGenerator
Converts VCF records to BED format for coordinate liftover:
- Properly handles structural variants using END and SVLEN fields
- Maintains variant identity for tracking through the liftover process
- Avoids duplicate entries with a tracking mechanism
# 3. LiftOverHandler
Manages the coordinate liftover process using CrossMap:
- Handles both regular coordinate lifting and region mapping
- Processes multi-mapped regions with specialized logic
- Maintains unlifted coordinates for complete variant tracking
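
The two kinds of CrossMap call mentioned above might be assembled as follows. This is only a sketch: `crossmap_command` is a hypothetical helper, and the positional form `CrossMap <mode> <chain> <input.bed> <output.bed>` is assumed from recent CrossMap releases (older releases ship the executable as `CrossMap.py`), so check the installed version before relying on it.

```python
from pathlib import Path


def crossmap_command(
    mode: str, chain_file: Path, bed_in: Path, bed_out: Path
) -> list[str]:
    """Build a CrossMap invocation for 'bed' (per-interval) or 'region' mapping."""
    if mode not in {"bed", "region"}:
        raise ValueError(f"unsupported CrossMap mode: {mode}")
    # Assumed argument order: chain file, input BED, output BED.
    return ["CrossMap", mode, str(chain_file), str(bed_in), str(bed_out)]


lift_cmd = crossmap_command(
    "bed", Path("hg19ToHg38.over.chain"), Path("in.bed"), Path("lifted.bed")
)
region_cmd = crossmap_command(
    "region", Path("hg19ToHg38.over.chain"), Path("multi.bed"), Path("region.bed")
)
```

Building the command as a list (rather than a shell string) lets it be passed directly to `subprocess.run` without shell quoting concerns.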
# 4. CoordinateProcessor
Processes lifted coordinates and generates annotated BED lines:
- Handles multi-mapped coordinates using region mapping
- Tracks mapping status (UNIQUE, REGION, FAILED)
- Provides statistics on mapping results
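
Mapping statistics of this kind can be reproduced from the annotated BED output by counting the status column. The sketch below assumes tab-separated lines with the status in the last column, as in the annotated BED layout documented on this page; the example rows, including how a failed row is represented, are invented.

```python
from collections import Counter
from typing import Iterable


def mapping_stats(annotated_bed_lines: Iterable[str]) -> Counter:
    """Count annotated BED lines per mapping status (last column)."""
    counts: Counter = Counter()
    for line in annotated_bed_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        counts[line.split("\t")[-1]] += 1
    return counts


# Illustrative rows only; the coordinates are made up.
lines = [
    "chr1\t99\t100\tchr1:1000-1000\tchr1\t999\t1000\tUNIQUE\n",
    "chr2\t199\t250\tchr2:5001-5051\tchr2\t5000\t5051\tREGION\n",
    "chr3\t10\t11\t.\t.\t.\t.\tFAILED\n",
    "chr4\t20\t21\tchr4:31-31\tchr4\t30\t31\tUNIQUE\n",
]
stats = mapping_stats(lines)
```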
# 5. VCFAnnotator & VCFHeaderProcessor
Annotate the VCF with lifted coordinates:
- Adds new INFO fields for hg38 coordinates
- Modifies VCF header to include field definitions
- Maps lifted coordinates back to original VCF entries
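
As an illustration of the annotation step, the sketch below appends the hg38 fields to an existing INFO string. `annotate_info` is a hypothetical helper, and writing only `hg38_map=FAILED` for unmapped variants is an assumption; the tool may emit placeholder values for the other fields instead.

```python
from typing import Optional


def annotate_info(
    info: str,
    status: str,
    chrom: Optional[str] = None,
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> str:
    """Return the INFO string with hg38_* entries appended."""
    if status == "FAILED":
        # Assumption: failed mappings carry only the status field.
        extra = ["hg38_map=FAILED"]
    else:
        extra = [
            f"hg38_chr={chrom}",
            f"hg38_start={start}",
            f"hg38_end={end}",
            f"hg38_coord={chrom}:{start}-{end}",
            f"hg38_map={status}",
        ]
    # An INFO column of "." means "no annotations", so it is replaced outright.
    existing = [] if info in ("", ".") else [info]
    return ";".join(existing + extra)


annotated = annotate_info("DP=35", "UNIQUE", "chr1", 1000, 1000)
failed = annotate_info(".", "FAILED")
```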
# 6. CommandRunner & FileHandler
Utility classes for execution control and file operations:
- Implements retry logic with exponential backoff
- Provides atomic file operations for robustness
- Handles compression and indexing of output files
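
Retry with exponential backoff, as mentioned above, generally follows the pattern below. This is a generic sketch rather than the tool's `CommandRunner`: the `run_with_retry` name, the retry count, and the base delay are assumptions.

```python
import subprocess
import sys
import time


def run_with_retry(cmd: list[str], retries: int = 3, base_delay: float = 0.5):
    """Run cmd; on a non-zero exit, wait base_delay * 2**attempt and retry."""
    for attempt in range(retries):
        try:
            return subprocess.run(cmd, check=True, capture_output=True, text=True)
        except subprocess.CalledProcessError:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last failure
            time.sleep(base_delay * 2 ** attempt)


# Use the current Python interpreter as a stand-in for an external tool.
result = run_with_retry([sys.executable, "-c", "print('ok')"])
```

Doubling the delay between attempts gives transient problems (for example a busy file system) progressively more time to clear without retrying in a tight loop.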
# Data types
# BedLine
chrom: str # Chromosome name
start: int # 0-based start position
end: int # End position
name: str # Identifier (original coordinates)
# LiftedBedLine
orig_chrom: str # Original chromosome
orig_start: int # Original start position (0-based)
orig_end: int # Original end position
hg38_chrom: str # hg38 chromosome
hg38_start: int|str # hg38 start position (0-based) or "." for failed mapping
hg38_end: int|str # hg38 end position or "." for failed mapping
status: str # Mapping status (UNIQUE, REGION, FAILED)
# VcfAnnotation
fields: List[str] # Original VCF fields
hg38_annotations: Dict[str, str] # hg38 coordinate annotations
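
The first two record types above map naturally onto Python dataclasses. The field names and types below follow the listings; the `to_line` method and the `chrom:start-end` format used for `name` are hypothetical additions for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class BedLine:
    chrom: str
    start: int  # 0-based start position
    end: int    # end position (half-open)
    name: str   # identifier (original coordinates)

    def to_line(self) -> str:
        """Serialize as a tab-separated BED line (hypothetical helper)."""
        return f"{self.chrom}\t{self.start}\t{self.end}\t{self.name}"


@dataclass
class LiftedBedLine:
    orig_chrom: str
    orig_start: int
    orig_end: int
    hg38_chrom: str
    hg38_start: Union[int, str]  # "." when the mapping failed
    hg38_end: Union[int, str]    # "." when the mapping failed
    status: str                  # UNIQUE, REGION, or FAILED


# A single-base variant at 1-based position 100 becomes the interval [99, 100).
bed = BedLine("chr1", 99, 100, "chr1:100-100")
failed = LiftedBedLine("chr1", 99, 100, ".", ".", ".", "FAILED")
```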
# Workflow
1. VCF to BED conversion:
   - Extract variant positions from VCF
   - Handle END and SVLEN for structural variants
   - Generate unique identifiers for each variant
2. BED sorting and preparation:
   - Sort BED file by chromosome and position
   - Prepare for CrossMap processing
3. Coordinate liftover:
   - Run CrossMap to lift coordinates
   - Track unlifted coordinates
   - Process multi-mapped regions
4. Annotation integration:
   - Modify VCF header with new INFO fields
   - Annotate VCF entries with lifted coordinates
   - Handle different mapping status types
5. Output generation:
   - Generate annotated VCF output
   - Optionally create BED output with tabix index
# Mapping resolution
The tool resolves multi-mapped coordinates in two passes and flags anything that remains unmapped:
- First pass identifies uniquely mapped coordinates
- A second pass using CrossMap's region mapping resolves multi-mapped coordinates
- Coordinates that fail to map are annotated with a FAILED status
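
The status assignment described above can be sketched as a pure function over the results of the two passes. The input shapes and the `classify` name are assumptions, as is the choice to mark a multi-mapped variant as FAILED when the region pass yields nothing for it.

```python
from typing import Optional

Interval = tuple[str, int, int]  # (chrom, start, end)


def classify(
    first_pass: dict[str, list[Interval]],
    region_pass: dict[str, Interval],
) -> dict[str, tuple[Optional[Interval], str]]:
    """Assign UNIQUE / REGION / FAILED to each variant identifier."""
    results: dict[str, tuple[Optional[Interval], str]] = {}
    for name, hits in first_pass.items():
        if len(hits) == 1:
            results[name] = (hits[0], "UNIQUE")
        elif len(hits) > 1 and name in region_pass:
            # Multi-mapped: take the single interval from the region pass.
            results[name] = (region_pass[name], "REGION")
        else:
            # No hit at all, or multi-mapped without a region result (assumption).
            results[name] = (None, "FAILED")
    return results


# Invented example data keyed by hypothetical variant identifiers.
first_pass = {
    "v1": [("chr1", 999, 1000)],
    "v2": [("chr2", 5000, 5020), ("chr2", 5030, 5051)],
    "v3": [],
}
region_pass = {"v2": ("chr2", 5000, 5051)}
statuses = classify(first_pass, region_pass)
```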
# Error handling
- Command execution includes retry logic with exponential backoff
- Atomic file operations protect against partial writes
- Validation of input and intermediate files
- Detailed logging with configurable verbosity
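
Atomic file operations are commonly implemented by writing to a temporary file in the destination directory and renaming it into place. The sketch below shows that general technique; `atomic_write_text` is a hypothetical helper and not necessarily how the tool's `FileHandler` works.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write text to a temp file next to path, then rename it over path."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        # os.replace is atomic when source and target share a file system.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the partial temp file so no debris is left behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "out.vcf"
    atomic_write_text(target, "##fileformat=VCFv4.2\n")
    written = target.read_text()
    leftovers = sorted(p.name for p in Path(workdir).iterdir())
```

Because readers only ever see the old file or the complete new one, an interrupted run cannot leave a truncated output at the final path.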
# File format specifications
# Annotated VCF
The tool adds the following INFO fields to the VCF:
Field | Type | Description |
---|---|---|
hg38_chr | String | Chromosome in hg38 |
hg38_start | Integer | Start position in hg38 (1-based) |
hg38_end | Integer | End position in hg38 |
hg38_coord | String | Coordinates in hg38 (format: chrom:start-end) |
hg38_map | String | Mapping status (UNIQUE, REGION, FAILED) |
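
For reference, the table above would translate into VCF header lines roughly as generated below. The IDs, types, and descriptions are taken from the table; `Number=1` is an assumption, and the exact header text written by the tool may differ.

```python
# (ID, Type, Description) triples copied from the table above.
INFO_FIELDS = [
    ("hg38_chr", "String", "Chromosome in hg38"),
    ("hg38_start", "Integer", "Start position in hg38 (1-based)"),
    ("hg38_end", "Integer", "End position in hg38"),
    ("hg38_coord", "String", "Coordinates in hg38 (format: chrom:start-end)"),
    ("hg38_map", "String", "Mapping status (UNIQUE, REGION, FAILED)"),
]


def info_header_lines() -> list[str]:
    """Render one ##INFO header line per field (Number=1 is assumed)."""
    return [
        f'##INFO=<ID={field_id},Number=1,Type={field_type},Description="{desc}">'
        for field_id, field_type, desc in INFO_FIELDS
    ]


header_lines = info_header_lines()
```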
# Annotated BED
The annotated BED file contains:
orig_chrom orig_start orig_end hg38_coord hg38_chrom hg38_start hg38_end status
# Performance considerations
- Streaming file processing for memory efficiency
- Temporary file management to prevent disk usage issues
- Command retries with exponential backoff for resilience
- Tabix indexing for efficient coordinate lookup
- Subprocess-based execution of external commands
# Environmental requirements
- Python 3.12+
- CrossMap external dependency
- Standard Unix utilities (sort, bgzip, tabix)
- Sufficient disk space for temporary files