Getting Started with supremo_lite

Introduction to core functionality for generating personalized genome sequences and variant-centered windows.

What is supremo_lite?

supremo_lite is a lightweight Python package for:

  • Generating personalized genome sequences from reference genomes and variant files

  • Creating variant-centered sequence windows for model predictions

  • Performing in-silico saturation mutagenesis

  • Aligning predictions across reference and variant sequences

Designed to be memory-efficient and model-agnostic, supporting both PyTorch tensors and NumPy arrays.

Installation

If you haven’t already installed supremo_lite:

pip install supremo-lite

Basic Imports

import supremo_lite as sl
import numpy as np
import matplotlib.pyplot as plt
from pyfaidx import Fasta
import pandas as pd

print(f"supremo_lite version: {sl.__version__}")

## Optional packages:

# Check PyTorch availability
print(f"PyTorch available: {sl.TORCH_AVAILABLE}")

# Check brisket availability (faster 1h encoding)
print(f"brisket available: {sl.BRISKET_AVAILABLE}")
supremo_lite version: 1.0.0
PyTorch available: True
brisket available: True

DNA Sequence Encoding

supremo_lite uses one-hot encoding: A=[1,0,0,0], C=[0,1,0,0], G=[0,0,1,0], T=[0,0,0,1].

Functions automatically return PyTorch tensors when available, otherwise NumPy arrays.

Loading Test Data

For this tutorial, we’ll use small test data included with the package. This allows for quick demonstrations without requiring large genomic datasets.

import os

# Find the test data directory
# Assuming we're in the package directory structure
test_data_dir = "../../tests/data"

# Load reference genome
reference_path = os.path.join(test_data_dir, "test_genome.fa")
reference = Fasta(reference_path)

print("Reference genome chromosomes:")
for chrom in reference.keys():
    seq_len = len(reference[chrom])
    print(f"  {chrom}: {seq_len} bp")

# Show first 80 bp of chr1
print(f"\nchr1 sequence (first 80 bp):")
print(reference["chr1"][:80].seq)
Reference genome chromosomes:
  chr1: 80 bp
  chr2: 80 bp
  chr3: 80 bp
  chr4: 80 bp
  chr5: 80 bp
  chr6: 80 bp

chr1 sequence (first 80 bp):
ATGAATATAATATTTTCGAGAATTACTCCTTTTGGAAATGGAACATTATGCGTTTTAAGAGTTTCTGGTAACAATATATT

Reading VCF Files

supremo_lite provides utilities to read VCF (Variant Call Format) files:

# Load SNP variants
snp_vcf_path = os.path.join(test_data_dir, "snp", "snp.vcf")
snp_variants = sl.read_vcf(snp_vcf_path)

print("SNP variants:")
print(snp_variants)
print(f"\nLoaded {len(snp_variants)} SNP variant(s)")
SNP variants:
  chrom  pos1 id ref alt    info  vcf_line variant_type
0  chr1     2  .   T   G  DP=100         8          SNV
1  chr1    31  .   T   A  DP=100         9          SNV
2  chr2    19  .   A   G  DP=100        10          SNV
3  chr2    57  .   C   G  DP=100        11          SNV

Loaded 4 SNP variant(s)

Basic Usage: Creating a Personalized Genome

The most fundamental operation is applying variants to a reference genome to create a personalized genome:

# Create personalized genome (returns encoded sequences by default)
personal_genome = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn=snp_variants,
    encode=True,
    verbose=True,  # Show progress
)

# Check the result
print(f"\nPersonalized genome chromosomes: {list(personal_genome.keys())}")
print(f"chr1 encoded shape: {personal_genome['chr1'].shape}")
print(f"Data type: {type(personal_genome['chr1'])}")
🧬 Processing 4 variants across 2 chromosomes
   Phase 1: 4 standard variants (SNV, MNV, INS, DEL, SV_DUP, SV_INV)
   Phase 2: 0 BND variants for semantic classification
🔄 Processing chromosome chr1: 2 variants (2 SNV)
  ✅ Applied 2/2 variants (0 skipped)
🔄 Processing chromosome chr2: 2 variants (2 SNV)
  ✅ Applied 2/2 variants (0 skipped)
🧬 Completed: 4/4 variants processed → 6 sequences

Personalized genome chromosomes: ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6']
chr1 encoded shape: torch.Size([4, 80])
Data type: <class 'torch.Tensor'>

Getting Raw Sequence Strings

You can also get the sequences as strings instead of encoded arrays:

# Get raw sequences
personal_genome_raw = sl.get_personal_genome(
    reference_fn=reference, variants_fn=snp_variants, encode=False
)

# Compare reference and personalized sequences
ref_seq = reference["chr1"][:80].seq
pers_seq = personal_genome_raw["chr1"][:80]

print("Reference chr1 (first 80 bp):")
print(ref_seq)
print("\nPersonalized chr1 (first 80 bp):")
print(pers_seq)

# Highlight the difference
print("\nDifferences (position: ref → alt):")
for i, (r, p) in enumerate(zip(ref_seq, pers_seq)):
    if r != p:
        print(f"  Position {i}: {r}{p}")
Reference chr1 (first 80 bp):
ATGAATATAATATTTTCGAGAATTACTCCTTTTGGAAATGGAACATTATGCGTTTTAAGAGTTTCTGGTAACAATATATT

Personalized chr1 (first 80 bp):
AGGAATATAATATTTTCGAGAATTACTCCTATTGGAAATGGAACATTATGCGTTTTAAGAGTTTCTGGTAACAATATATT

Differences (position: ref → alt):
  Position 1: T → G
  Position 30: T → A

Generating Variant-Centered Windows

Often you want to extract sequence windows centered on each variant, which is useful for model predictions:

# Generate windows around variants
# Note: get_alt_ref_sequences is a generator that yields chunks
results = list(
    sl.get_alt_ref_sequences(
        reference_fn=reference,
        variants_fn=snp_variants,
        seq_len=40,  # 40 bp windows
        encode=False,  # Get strings for visualization
    )
)

# Unpack from the first chunk
alt_seqs, ref_seqs, metadata = results[0]

print(f"Generated {len(metadata)} sequence pair(s)")
print("\nMetadata:")
print(metadata.head())  # Show first few variants

print("\nReference sequence (2nd variant):")
print(ref_seqs[1])
print("\nAlternate sequence (2nd variant):")
print(alt_seqs[1])

# Find the variant position in the window
variant_offset = len(ref_seqs[1][3]) // 2
print(f"\nVariant is centered at position {variant_offset} in the window")
print(f"Reference allele: {ref_seqs[1][3][variant_offset]}")
print(f"Alternate allele: {alt_seqs[1][3][variant_offset]}")
Generated 4 sequence pair(s)

Metadata:
  chrom  window_start  window_end  variant_pos0  variant_pos1 ref alt  \
0  chr1             0          40             1             2   T   G   
1  chr1            10          50            30            31   T   A   
2  chr2             0          40            18            19   A   G   
3  chr2            36          76            56            57   C   G   

  variant_type  
0          SNV  
1          SNV  
2          SNV  
3          SNV  

Reference sequence (2nd variant):
('chr1', np.int64(10), np.int64(50), 'TATTTTCGAGAATTACTCCTTTTGGAAATGGAACATTATG')

Alternate sequence (2nd variant):
('chr1', np.int64(10), np.int64(50), 'TATTTTCGAGAATTACTCCTATTGGAAATGGAACATTATG')

Variant is centered at position 20 in the window
Reference allele: T
Alternate allele: A

Sequence Utilities

supremo_lite provides useful utilities for working with DNA sequences:

# Reverse complement
sequence = "ATCGATCG"
rc_sequence = sl.rc_str(sequence)

print(f"Original:          {sequence}")
print(f"Reverse complement: {rc_sequence}")

# Encode and decode
encoded = sl.encode_seq(sequence)  # Automatically uses brisket if installed
decoded = sl.decode_seq(encoded)

print(f"\nDecoded matches original: {decoded == sequence}")

# Reverse complement of encoded sequence
rc_encoded = sl.rc(encoded)
rc_decoded = sl.decode_seq(rc_encoded)

print(f"RC from encoded:   {rc_decoded}")
print(f"RC from string:    {rc_sequence}")
print(f"Match: {rc_decoded == rc_sequence}")
Original:          ATCGATCG
Reverse complement: CGATCGAT

Decoded matches original: True
RC from encoded:   CGATCGAT
RC from string:    CGATCGAT
Match: True

Working with Encoded Sequences

Let’s see how to work with encoded sequences for downstream analysis:

# Generate encoded windows
results = list(
    sl.get_alt_ref_sequences(
        reference_fn=reference,
        variants_fn=snp_variants,
        seq_len=40,
        encode=True,  # Get encoded arrays/tensors
    )
)

# Unpack from the first chunk
alt_seqs_enc, ref_seqs_enc, metadata = results[0]

print(f"Reference sequences shape: {ref_seqs_enc.shape}")
print(f"Alternate sequences shape: {alt_seqs_enc.shape}")
print(f"Data type: {type(ref_seqs_enc)}")

# Visualize the first sequence pair
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 6))

# Convert to numpy if needed for visualization
ref_viz = (
    ref_seqs_enc[2].numpy() if hasattr(ref_seqs_enc[2], "numpy") else ref_seqs_enc[2]
)
alt_viz = (
    alt_seqs_enc[2].numpy() if hasattr(alt_seqs_enc[2], "numpy") else alt_seqs_enc[2]
)

# Reference sequence - data is already in (4, L) format
im1 = ax1.imshow(ref_viz, cmap="Blues", aspect="auto")
ax1.set_yticks([0, 1, 2, 3])
ax1.set_yticklabels(["A", "C", "G", "T"])
ax1.set_title("Reference Sequence (Encoded) - Shape: (4, 40)")
ax1.set_ylabel("Nucleotide Channel")
ax1.axvline(x=20, color="red", linestyle="--", alpha=0.5, label="Variant position")
ax1.legend()

# Alternate sequence - data is already in (4, L) format
im2 = ax2.imshow(alt_viz, cmap="Oranges", aspect="auto")
ax2.set_yticks([0, 1, 2, 3])
ax2.set_yticklabels(["A", "C", "G", "T"])
ax2.set_title("Alternate Sequence (Encoded) - Shape: (4, 40)")
ax2.set_xlabel("Position in window")
ax2.set_ylabel("Nucleotide Channel")
ax2.axvline(x=20, color="red", linestyle="--", alpha=0.5, label="Variant position")
ax2.legend()

plt.tight_layout()
plt.show()

print("\nThe variant at position 20 shows different encoding between ref and alt")
print(
    "Note: Sequences are encoded in channel-first format (4, L) for compatibility with PyTorch models"
)
Reference sequences shape: torch.Size([4, 4, 40])
Alternate sequences shape: torch.Size([4, 4, 40])
Data type: <class 'torch.Tensor'>
../_images/758110acddbb8d57cef334f7a00a012f34994ec40e54b26f5a163856d578ca90.png
The variant at position 20 shows different encoding between ref and alt
Note: Sequences are encoded in channel-first format (4, L) for compatibility with PyTorch models

Next Steps

Continue with the other notebooks: