Variant-Centered Sequence Windows
Generate sequence windows around variants for model predictions.
Overview
Extract sequence windows centered on variants for genomic model predictions, variant effect analysis, and PAM site disruption analysis.
Main Functions
get_alt_ref_sequences()
Generate reference and alternate sequence pairs centered on each variant.
# Note: get_alt_ref_sequences is a generator that yields chunks
results = list(sl.get_alt_ref_sequences(
reference_fn=reference,
variants_fn=variants,
seq_len=1000, # 1000 bp windows
encode=True # One-hot encoded
))
# Unpack from the first chunk
alt_seqs, ref_seqs, metadata = results[0]
Parameters:
get_alt_ref_sequences(
reference_fn, # Reference genome (path, Fasta, or dict)
variants_fn, # Variants (path or DataFrame)
seq_len, # int: Total sequence length
encode=True, # bool: Return encoded sequences
chunk_size=1, # int: Variants per chunk
encoder=None # Optional custom encoder
) -> tuple[array/list, array/list, list]
Returns: (ref_seqs, alt_seqs, metadata)
Variants centered at position
seq_len // 2encode=True: shape(n_variants, 4, seq_len)encode=False: list of strings
get_pam_disrupting_alt_sequences()
Identify variants that disrupt PAM sites (e.g., for CRISPR analysis).
results = sl.get_pam_disrupting_alt_sequences(
reference_fn=reference,
variants_fn=variants,
seq_len=1000,
max_pam_distance=50, # Search within 50 bp of variant
pam_sequence="NGG" # SpCas9 PAM
)
Parameters:
get_pam_disrupting_alt_sequences(
reference_fn, # Reference genome
variants_fn, # Variants
seq_len, # int: Sequence window length
max_pam_distance, # int: Max distance from variant to PAM
pam_sequence="NGG", # str: PAM sequence (supports IUPAC codes)
encode=True, # bool: Return encoded sequences
n_chunks=1, # int: Number of chunks for processing
encoder=None # Optional custom encoder
) -> dict
Returns dict: {'variants': DataFrame, 'pam_intact': sequences, 'pam_disrupted': sequences}
Supported PAM sequences: Any PAM sequence specified using (IUPAC degenerate bases)[https://en.wikipedia.org/wiki/Nucleic_acid_notation].
Metadata Structure
Each variant returns metadata with essential information:
{
'chrom': str, # Chromosome name
'window_start': int, # Window start (0-based)
'window_end': int, # Window end (0-based exclusive)
'variant_pos0': int, # Variant position (0-based)
'variant_pos1': int, # Variant position (1-based VCF)
'ref': str, # Reference allele
'alt': str, # Alternate allele
'variant_type': str, # SNV, INS, DEL, SV_INV, etc.
# Optional fields for structural variants:
'sym_variant_end': int, # END position for <INV>, <DUP>
'mate_chrom': str, # Mate chromosome (BND only)
'mate_pos': int, # Mate position (BND only)
'orientation_1': str, # Orientation (BND only)
'orientation_2': str, # Orientation (BND only)
'fusion_name': str # Fusion name (BND only)
}
:::{tip}
The variant_type field is automatically determined using the classification logic shown in the Variant Classification Flow Chart.
:::
Output Formats
Encoded (encode=True): numpy arrays or PyTorch tensors, shape (n_variants, 4, seq_len)
Raw (encode=False): list of strings
Examples
Basic Sequence Generation
import supremo_lite as sl
from pyfaidx import Fasta
# File locations
reference = 'reference.fa'
variants = 'variants.vcf'
# Generate 500bp windows
results = list(sl.get_alt_ref_sequences(
reference_fn=reference,
variants_fn=variants,
seq_len=500,
encode=True
))
alt_seqs, ref_seqs, metadata = results[0]
print(f"Generated {len(metadata)} sequence pairs")
print(f"Reference shape: {ref_seqs.shape}")
print(f"Alternate shape: {alt_seqs.shape}")
PAM Disruption Analysis
pam_results = sl.get_pam_disrupting_alt_sequences(
reference_fn=reference,
variants_fn=variants,
seq_len=1000,
max_pam_distance=50,
pam_sequence="NGG"
)
See Also
PAM Disruption Analysis - Detailed guide for CRISPR PAM site analysis