# Liftover

NOTE

This documentation is a work in progress and is incomplete.

Please contact developers for more details.

This page describes the utility for genomic coordinate liftover and annotation.

# VCF Liftover Tool - user guide

# Overview

The VCF Liftover Tool annotates variant call format (VCF) files with coordinates from a different reference genome (the target is currently hard-coded to GRCh38/hg38). This is useful when variants need to be analyzed in the context of a reference genome other than the one used for variant calling.

# Key features

  • Coordinate conversion: Converts genomic coordinates from one reference genome to another using CrossMap
  • Annotation: Adds hg38 coordinates to the INFO field of VCF entries
  • Structural Variant support: Correctly handles structural variants with END and SVLEN annotations
  • Multiple mapping resolution: Resolves coordinates that map to multiple locations with a second pass using CrossMap's region mapping
  • BED output option: Optionally outputs the lifted coordinates in BED format with tabix indexing

# Prerequisites

Before using the tool, ensure you have:

  1. Python 3.12+
  2. The pysam package (third-party, installed separately) for handling VCF/BCF files and tabix indexing. The tool also uses the following standard-library modules, which ship with Python and need no installation:
    • argparse: Command-line argument parsing
    • subprocess: Executing external commands like CrossMap
    • tempfile: Creating temporary directories and files
    • pathlib (Path): Platform-independent path handling
    • logging: Structured logging
    • shutil: File operations like copying
    • re: Regular expression support
    • time: Handling delays in retry logic
    • typing: Type annotation support
    • os: File system operations
    • dataclasses: Data class functionality
  3. External tools:
    • CrossMap
    • bgzip
    • tabix
    • sort (standard Unix utility)
  4. A chain file for the reference genome conversion (e.g., hg19ToHg38.over.chain)
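The external tools can be checked before a run. The following is a minimal sketch, not part of the tool itself: the helper name `find_missing_tools` is hypothetical, and the executable names are taken from the list above (depending on how CrossMap was installed, its executable may be named differently).

```python
import shutil

# Tool names taken from the prerequisites list above.
REQUIRED_TOOLS = ("CrossMap", "bgzip", "tabix", "sort")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH (hypothetical helper)."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing external tools:", ", ".join(missing))
    else:
        print("All external tools found.")
```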

# Basic usage

python annotate_with_liftover.py input.vcf hg19ToHg38.over.chain output.vcf

# Command line arguments

| Argument | Description |
| --- | --- |
| input_vcf | Path to the input VCF file to be annotated |
| chain_file | Path to the chain file used for coordinate liftover |
| outfile | Path where the annotated VCF file will be saved |
| --debug | Enable debug mode for verbose logging |
| --output-bed | Also output the annotated coordinates as a BED file |
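For readers who want to wrap or test the command line, the arguments above could be declared with argparse roughly as follows. This is a sketch that mirrors the documented arguments; the function name `build_parser` and the exact option settings are assumptions, not the tool's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Annotate a VCF with lifted-over coordinates."
    )
    parser.add_argument("input_vcf", help="Path to the input VCF file to be annotated")
    parser.add_argument("chain_file", help="Path to the chain file used for coordinate liftover")
    parser.add_argument("outfile", help="Path where the annotated VCF file will be saved")
    # Both options are assumed to be simple on/off flags.
    parser.add_argument("--debug", action="store_true", help="Enable debug mode for verbose logging")
    parser.add_argument("--output-bed", action="store_true", help="Also output the annotated coordinates as a BED file")
    return parser
```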

# Examples

# Basic annotation

python annotate_with_liftover.py sample.vcf hg19ToHg38.over.chain sample.hg38.vcf

# Annotation with BED output

python annotate_with_liftover.py sample.vcf hg19ToHg38.over.chain sample.hg38.vcf --output-bed

This will create both sample.hg38.vcf and sample.hg38.vcf.bed.gz (with a tabix index).

# Debug mode

python annotate_with_liftover.py sample.vcf hg19ToHg38.over.chain sample.hg38.vcf --debug

# Understanding the output

The annotated VCF will include the following new INFO fields:

| Field | Description |
| --- | --- |
| hg38_chr | Chromosome in hg38 |
| hg38_start | Start position in hg38 (1-based) |
| hg38_end | End position in hg38 |
| hg38_coord | Coordinates in hg38 (chromosome:start-end format) |
| hg38_map | Mapping status (UNIQUE, REGION, or FAILED) |

# Mapping status values

  • UNIQUE: The original coordinates mapped uniquely to a single location in hg38
  • REGION: The original coordinates mapped to multiple locations, but a primary mapping was determined
  • FAILED: The original coordinates could not be mapped to hg38

# Working with Structural Variants

The tool automatically detects structural variants by looking for:

  1. The END tag in the INFO field
  2. The SVLEN tag in the INFO field

For variants where these are not available, the tool uses the reference allele length to determine the end position.
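The end-position rule above can be sketched as a small function. The precedence (END, then SVLEN, then reference allele length) and the use of the absolute SVLEN value are assumptions made for illustration; the tool's handling of specific variant types, such as insertions, may differ.

```python
def variant_end(pos: int, ref: str, info: dict[str, str]) -> int:
    """Return the 1-based inclusive end position of a variant.

    Assumed precedence: END, then SVLEN, then reference allele length.
    """
    if "END" in info:
        return int(info["END"])
    if "SVLEN" in info:
        # SVLEN is negative for deletions, so its magnitude is used.
        # Only the first value is read if SVLEN is a comma-separated list.
        return pos + abs(int(info["SVLEN"].split(",")[0]))
    # Plain variants span the reference allele.
    return pos + len(ref) - 1
```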

# Troubleshooting

If you encounter issues:

  1. Enable debug mode with the --debug flag for detailed logging
  2. Check that your input VCF is properly formatted
  3. Verify that the chain file is appropriate for your genome conversion
  4. Ensure all required dependencies are installed and in your PATH

# Chromosome naming conventions

The tool automatically handles chromosome naming differences between reference genomes (e.g., "1" vs "chr1"). The output coordinates will be formatted according to the hg38 convention.

# VCF Liftover Tool - technical specification

# Architecture overview

The VCF Liftover Tool implements a pipeline for converting genomic coordinates between reference genomes and annotating variant calls. The tool uses a modular design pattern with specialized classes that handle specific aspects of the workflow.

# Class structure

classDiagram
    Pipeline --> BEDGenerator
    Pipeline --> BEDSorter
    Pipeline --> LiftOverHandler
    Pipeline --> CoordinateProcessor
    Pipeline --> VCFAnnotator
    Pipeline --> VCFHeaderProcessor
    
    BEDGenerator --> GenomeConfig
    LiftOverHandler --> CommandRunner
    CoordinateProcessor --> GenomeConfig
    VCFAnnotator --> GenomeConfig
    
    BEDSorter --> CommandRunner
    FileHandler <-- Pipeline
    FileHandler <-- LiftOverHandler
    
    class Pipeline {
        +__init__(input_vcf, chain_file, outfile, debug, output_bed)
        +run()
        -_validate_inputs()
    }
    
    class GenomeConfig {
        +use_chr_prefix: bool
        +default_chr_prefix: str
        +format_chrom(chrom)
        +normalize_chrom(chrom)
    }
    
    class BEDGenerator {
        +vcf_to_bed(input_vcf)
    }
    
    class LiftOverHandler {
        +lift_coordinates(bed_path)
        +run_region_mapping(multi_path)
    }
    
    class CoordinateProcessor {
        +process_lifted_coordinates(lifted_path, unlifted_path, original_bed_path, cmd_runner, chain_file)
    }
    
    class CommandRunner {
        +run(cmd, check, capture_output)
    }

# Key components

# 1. GenomeConfig

This class handles chromosome naming conventions between different reference genomes:

  • use_chr_prefix: Boolean flag for chromosome prefix usage
  • format_chrom(): Formats chromosome names according to configuration
  • normalize_chrom(): Normalizes chromosome names for consistent comparison
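A minimal sketch of how such a class might behave is shown below. The attribute and method names follow the class diagram, but the bodies and default values are assumptions. Special cases such as mitochondrial naming ("MT" vs "chrM") are not covered here.

```python
from dataclasses import dataclass


@dataclass
class GenomeConfig:
    use_chr_prefix: bool = True        # assumed default
    default_chr_prefix: str = "chr"    # assumed default

    def normalize_chrom(self, chrom: str) -> str:
        """Strip the prefix so that '1' and 'chr1' compare equal."""
        if chrom.startswith(self.default_chr_prefix):
            return chrom[len(self.default_chr_prefix):]
        return chrom

    def format_chrom(self, chrom: str) -> str:
        """Render a chromosome name with or without the prefix."""
        bare = self.normalize_chrom(chrom)
        return f"{self.default_chr_prefix}{bare}" if self.use_chr_prefix else bare
```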

# 2. BEDGenerator

Converts VCF records to BED format for coordinate liftover:

  • Properly handles structural variants using END and SVLEN fields
  • Maintains variant identity for tracking through the liftover process
  • Avoids duplicate entries with a tracking mechanism
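The conversion and duplicate tracking could look roughly like the sketch below. It assumes the inputs are already reduced to (chromosome, 1-based start, 1-based inclusive end) triples; the function name and the `chrom:start-end` identifier format are hypothetical.

```python
def vcf_positions_to_bed(records):
    """Yield (chrom, start, end, name) BED tuples from (chrom, pos, end) triples.

    `pos` and `end` are 1-based inclusive VCF coordinates; BED start is 0-based
    and BED end is exclusive, so only the start needs shifting.
    """
    seen = set()
    for chrom, pos, end in records:
        name = f"{chrom}:{pos}-{end}"  # hypothetical identifier format
        if name in seen:
            continue  # skip intervals that were already emitted
        seen.add(name)
        yield (chrom, pos - 1, end, name)
```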

# 3. LiftOverHandler

Manages the coordinate liftover process using CrossMap:

  • Handles both regular coordinate lifting and region mapping
  • Processes multi-mapped regions with specialized logic
  • Maintains unlifted coordinates for complete variant tracking
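As an illustration, the two CrossMap invocations could be assembled as argument lists for subprocess. This sketch assumes the `CrossMap bed` and `CrossMap region` subcommands with positional chain, input, and output arguments; the helper name `crossmap_command` is hypothetical, and any additional options the tool passes are not shown.

```python
from pathlib import Path


def crossmap_command(mode: str, chain_file: Path, in_bed: Path, out_bed: Path) -> list[str]:
    """Build a CrossMap argument list for 'bed' or 'region' mode."""
    if mode not in {"bed", "region"}:
        raise ValueError(f"unsupported CrossMap mode: {mode}")
    return ["CrossMap", mode, str(chain_file), str(in_bed), str(out_bed)]
```

Passing the command as a list avoids shell quoting issues with file paths.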

# 4. CoordinateProcessor

Processes lifted coordinates and generates annotated BED lines:

  • Handles multi-mapped coordinates using region mapping
  • Tracks mapping status (UNIQUE, REGION, FAILED)
  • Provides statistics on mapping results

# 5. VCFAnnotator & VCFHeaderProcessor

Annotate the VCF with lifted coordinates:

  • Adds new INFO fields for hg38 coordinates
  • Modifies VCF header to include field definitions
  • Maps lifted coordinates back to original VCF entries

# 6. CommandRunner & FileHandler

Utility classes for execution control and file operations:

  • Implements retry logic with exponential backoff
  • Provides atomic file operations for robustness
  • Handles compression and indexing of output files
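Retry with exponential backoff can be sketched as follows. The function name, the default retry count, and the base delay are assumptions; the `sleep` and `runner` parameters exist only to make the sketch easy to test.

```python
import subprocess
import time


def run_with_retry(cmd, retries=3, base_delay=1.0, sleep=time.sleep, runner=subprocess.run):
    """Run `cmd`, retrying on failure with delays of base_delay * 2**attempt.

    `retries` is the total number of attempts and is assumed to be >= 1.
    """
    for attempt in range(retries):
        try:
            return runner(cmd, check=True, capture_output=True, text=True)
        except subprocess.CalledProcessError:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last failure
            sleep(base_delay * 2 ** attempt)
```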

# Data types

# BedLine

chrom: str         # Chromosome name
start: int         # 0-based start position
end: int           # End position
name: str          # Identifier (original coordinates)

# LiftedBedLine

orig_chrom: str    # Original chromosome
orig_start: int    # Original start position (0-based)
orig_end: int      # Original end position
hg38_chrom: str    # hg38 chromosome
hg38_start: int|str # hg38 start position (0-based) or "." for failed mapping
hg38_end: int|str  # hg38 end position or "." for failed mapping
status: str        # Mapping status (UNIQUE, REGION, FAILED)

# VcfAnnotation

fields: List[str]             # Original VCF fields
hg38_annotations: Dict[str, str] # hg38 coordinate annotations

# Workflow

  1. VCF to BED conversion:

    • Extract variant positions from VCF
    • Handle END and SVLEN for structural variants
    • Generate unique identifiers for each variant
  2. BED sorting and preparation:

    • Sort BED file by chromosome and position
    • Prepare for CrossMap processing
  3. Coordinate liftover:

    • Run CrossMap to lift coordinates
    • Track unlifted coordinates
    • Process multi-mapped regions
  4. Annotation integration:

    • Modify VCF header with new INFO fields
    • Annotate VCF entries with lifted coordinates
    • Handle different mapping status types
  5. Output generation:

    • Generate annotated VCF output
    • Optionally create BED output with tabix index

# Mapping resolution

The tool handles multi-mapped coordinates in the following steps:

  1. First pass identifies uniquely mapped coordinates
  2. A second pass using CrossMap's region mapping resolves multi-mapped coordinates
  3. Coordinates that fail to map are annotated with a FAILED status
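The decision logic above can be sketched as a single function. The data shapes are assumptions: `first_pass` maps each variant identifier to the list of intervals from the first CrossMap run, and `region_pass` maps identifiers to the single interval chosen by region mapping. Treating a multi-mapped coordinate that region mapping cannot resolve as FAILED is also an assumption.

```python
def resolve_status(name, first_pass, region_pass):
    """Return (status, interval) for one variant identifier."""
    hits = first_pass.get(name, [])
    if len(hits) == 1:
        return "UNIQUE", hits[0]
    if len(hits) > 1 and name in region_pass:
        return "REGION", region_pass[name]
    # No mapping at all, or multi-mapped and unresolved (assumed to fail).
    return "FAILED", None
```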

# Error handling

  • Command execution includes retry logic with exponential backoff
  • Atomic file operations protect against partial writes
  • Validation of input and intermediate files
  • Detailed logging with configurable verbosity
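One common way to protect against partial writes is to write to a temporary file in the destination directory and then rename it into place. The sketch below shows that general technique; it is not necessarily how the tool's FileHandler is implemented, and the function name is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write `text` to `path` so readers never see a half-written file."""
    path = Path(path)
    # The temporary file lives in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        os.replace(tmp_name, path)  # replaces any existing file in one step
    except Exception:
        os.unlink(tmp_name)  # clean up the temporary file on failure
        raise
```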

# File format specifications

# Annotated VCF

The tool adds the following INFO fields to the VCF:

| Field | Type | Description |
| --- | --- | --- |
| hg38_chr | String | Chromosome in hg38 |
| hg38_start | Integer | Start position in hg38 (1-based) |
| hg38_end | Integer | End position in hg38 |
| hg38_coord | String | Coordinates in hg38 (format: chrom:start-end) |
| hg38_map | String | Mapping status (UNIQUE, REGION, FAILED) |
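The corresponding VCF header definitions could be generated as shown below. The IDs, types, and descriptions come from the table; `Number=1` is an assumption, since the source does not state the field cardinality.

```python
# (ID, Type, Description) triples taken from the table above.
INFO_FIELDS = [
    ("hg38_chr", "String", "Chromosome in hg38"),
    ("hg38_start", "Integer", "Start position in hg38 (1-based)"),
    ("hg38_end", "Integer", "End position in hg38"),
    ("hg38_coord", "String", "Coordinates in hg38 (format: chrom:start-end)"),
    ("hg38_map", "String", "Mapping status (UNIQUE, REGION, FAILED)"),
]


def info_header_lines(fields=INFO_FIELDS):
    """Return one ##INFO header line per field (Number=1 is assumed)."""
    return [
        f'##INFO=<ID={field_id},Number=1,Type={field_type},Description="{description}">'
        for field_id, field_type, description in fields
    ]
```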

# Annotated BED

The annotated BED file contains:

orig_chrom  orig_start  orig_end  hg38_coord  hg38_chrom  hg38_start  hg38_end  status
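A line in this layout could be turned into the VCF INFO annotations roughly as follows. The sketch assumes tab-separated columns in the order shown, that hg38_coord is passed through unchanged, and that a FAILED line contributes only the status; none of these details are confirmed by the source beyond the column order and the 0-based BED start.

```python
def bed_line_to_info(line: str) -> dict[str, str]:
    """Convert one annotated BED line into hg38_* INFO key/value pairs."""
    cols = line.rstrip("\n").split("\t")
    _orig_chrom, _orig_start, _orig_end, coord, chrom, start, end, status = cols
    if status == "FAILED":
        # Assumed: failed mappings carry "." placeholders and only a status.
        return {"hg38_map": "FAILED"}
    return {
        "hg38_chr": chrom,
        "hg38_start": str(int(start) + 1),  # BED 0-based start to 1-based
        "hg38_end": end,
        "hg38_coord": coord,
        "hg38_map": status,
    }
```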

# Performance considerations

  • Streaming file processing for memory efficiency
  • Temporary file management to prevent disk usage issues
  • Command retries with exponential backoff for resilience
  • Tabix indexing for efficient coordinate lookup
  • Subprocess-based execution of external commands

# Environmental requirements

  • Python 3.12+
  • CrossMap external dependency
  • Standard Unix utilities (sort, bgzip, tabix)
  • Sufficient disk space for temporary files
Last Updated: 4/28/2025, 11:05:40 AM