Variant-Centered Sequence Windows

Generate sequence windows around variants for model predictions.

Overview

Extract sequence windows centered on variants for genomic model predictions, variant effect analysis, and PAM site disruption analysis.

Main Functions

get_alt_ref_sequences()

Generate reference and alternate sequence pairs centered on each variant.

# Note: get_alt_ref_sequences is a generator that yields chunks
results = list(sl.get_alt_ref_sequences(
    reference_fn=reference,
    variants_fn=variants,
    seq_len=1000,  # 1000 bp windows
    encode=True     # One-hot encoded
))
# Unpack from the first chunk
alt_seqs, ref_seqs, metadata = results[0]

Parameters:

get_alt_ref_sequences(
    reference_fn,           # Reference genome (path, Fasta, or dict)
    variants_fn,            # Variants (path or DataFrame)
    seq_len,                # int: Total sequence length
    encode=True,            # bool: Return encoded sequences
    chunk_size=1,           # int: Variants per chunk
    encoder=None            # Optional custom encoder
) -> tuple[array/list, array/list, list]

Returns: (ref_seqs, alt_seqs, metadata)

  • Variants centered at position seq_len // 2

  • encode=True: shape (n_variants, 4, seq_len)

  • encode=False: list of strings

get_pam_disrupting_alt_sequences()

Identify variants that disrupt PAM sites (e.g., for CRISPR analysis).

results = sl.get_pam_disrupting_alt_sequences(
    reference_fn=reference,
    variants_fn=variants,
    seq_len=1000,
    max_pam_distance=50,  # Search within 50 bp of variant
    pam_sequence="NGG"     # SpCas9 PAM
)

Parameters:

get_pam_disrupting_alt_sequences(
    reference_fn,           # Reference genome
    variants_fn,            # Variants
    seq_len,                # int: Sequence window length
    max_pam_distance,       # int: Max distance from variant to PAM
    pam_sequence="NGG",     # str: PAM sequence (supports IUPAC codes)
    encode=True,            # bool: Return encoded sequences
    n_chunks=1,             # int: Number of chunks for processing
    encoder=None            # Optional custom encoder
) -> dict

Returns dict: {'variants': DataFrame, 'pam_intact': sequences, 'pam_disrupted': sequences}

Supported PAM sequences: Any PAM sequence specified using (IUPAC degenerate bases)[https://en.wikipedia.org/wiki/Nucleic_acid_notation].

Metadata Structure

Each variant returns metadata with essential information:

{
    'chrom': str,              # Chromosome name
    'window_start': int,       # Window start (0-based)
    'window_end': int,         # Window end (0-based exclusive)
    'variant_pos0': int,       # Variant position (0-based)
    'variant_pos1': int,       # Variant position (1-based VCF)
    'ref': str,                # Reference allele
    'alt': str,                # Alternate allele
    'variant_type': str,       # SNV, INS, DEL, SV_INV, etc.

    # Optional fields for structural variants:
    'sym_variant_end': int,    # END position for <INV>, <DUP>
    'mate_chrom': str,         # Mate chromosome (BND only)
    'mate_pos': int,           # Mate position (BND only)
    'orientation_1': str,      # Orientation (BND only)
    'orientation_2': str,      # Orientation (BND only)
    'fusion_name': str         # Fusion name (BND only)
}

:::{tip} The variant_type field is automatically determined using the classification logic shown in the Variant Classification Flow Chart. :::

Output Formats

Encoded (encode=True): numpy arrays or PyTorch tensors, shape (n_variants, 4, seq_len)

Raw (encode=False): list of strings

Examples

Basic Sequence Generation

import supremo_lite as sl
from pyfaidx import Fasta

# File locations
reference = 'reference.fa'
variants = 'variants.vcf'

# Generate 500bp windows
results = list(sl.get_alt_ref_sequences(
    reference_fn=reference,
    variants_fn=variants,
    seq_len=500,
    encode=True
))
alt_seqs, ref_seqs, metadata = results[0]

print(f"Generated {len(metadata)} sequence pairs")
print(f"Reference shape: {ref_seqs.shape}")
print(f"Alternate shape: {alt_seqs.shape}")

PAM Disruption Analysis

pam_results = sl.get_pam_disrupting_alt_sequences(
    reference_fn=reference,
    variants_fn=variants,
    seq_len=1000,
    max_pam_distance=50,
    pam_sequence="NGG"
)

See Also