Getting Started with supremo_lite

Introduction to core functionality for generating personalized genome sequences and variant-centered windows.

What is supremo_lite?

supremo_lite is a lightweight Python package for:

Generating personalized genome sequences from reference genomes and variant files
Creating variant-centered sequence windows for model predictions
Performing in-silico saturation mutagenesis
Aligning predictions across reference and variant sequences

Designed to be memory-efficient and model-agnostic, supporting both PyTorch tensors and NumPy arrays.

Installation

If you haven’t already installed supremo_lite:

pip install supremo-lite

Basic Imports

import supremo_lite as sl
import numpy as np
import matplotlib.pyplot as plt
from pyfaidx import Fasta
import pandas as pd

print(f"supremo_lite version: {sl.__version__}")

## Optional packages:

# Check PyTorch availability
print(f"PyTorch available: {sl.TORCH_AVAILABLE}")

# Check brisket availability (faster 1h encoding)
print(f"brisket available: {sl.BRISKET_AVAILABLE}")

supremo_lite version: 1.0.0
PyTorch available: True
brisket available: True

DNA Sequence Encoding

supremo_lite uses one-hot encoding: A=[1,0,0,0], C=[0,1,0,0], G=[0,0,1,0], T=[0,0,0,1].

Functions automatically return PyTorch tensors when available, otherwise NumPy arrays.

Loading Test Data

For this tutorial, we’ll use small test data included with the package. This allows for quick demonstrations without requiring large genomic datasets.

import os

# Find the test data directory
# Assuming we're in the package directory structure
test_data_dir = "../../tests/data"

# Load reference genome
reference_path = os.path.join(test_data_dir, "test_genome.fa")
reference = Fasta(reference_path)

print("Reference genome chromosomes:")
for chrom in reference.keys():
    seq_len = len(reference[chrom])
    print(f"  {chrom}: {seq_len} bp")

# Show first 80 bp of chr1
print(f"\nchr1 sequence (first 80 bp):")
print(reference["chr1"][:80].seq)

Reference genome chromosomes:
  chr1: 80 bp
  chr2: 80 bp
  chr3: 80 bp
  chr4: 80 bp
  chr5: 80 bp
  chr6: 80 bp

chr1 sequence (first 80 bp):
ATGAATATAATATTTTCGAGAATTACTCCTTTTGGAAATGGAACATTATGCGTTTTAAGAGTTTCTGGTAACAATATATT

Reading VCF Files

supremo_lite provides utilities to read VCF (Variant Call Format) files:

# Load SNP variants
snp_vcf_path = os.path.join(test_data_dir, "snp", "snp.vcf")
snp_variants = sl.read_vcf(snp_vcf_path)

print("SNP variants:")
print(snp_variants)
print(f"\nLoaded {len(snp_variants)} SNP variant(s)")

SNP variants:
  chrom  pos1 id ref alt    info  vcf_line variant_type
0  chr1     2  .   T   G  DP=100         8          SNV
1  chr1    31  .   T   A  DP=100         9          SNV
2  chr2    19  .   A   G  DP=100        10          SNV
3  chr2    57  .   C   G  DP=100        11          SNV

Loaded 4 SNP variant(s)

Basic Usage: Creating a Personalized Genome

The most fundamental operation is applying variants to a reference genome to create a personalized genome:

# Create personalized genome (returns encoded sequences by default)
personal_genome = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn=snp_variants,
    encode=True,
    verbose=True,  # Show progress
)

# Check the result
print(f"\nPersonalized genome chromosomes: {list(personal_genome.keys())}")
print(f"chr1 encoded shape: {personal_genome['chr1'].shape}")
print(f"Data type: {type(personal_genome['chr1'])}")

🧬 Processing 4 variants across 2 chromosomes
   Phase 1: 4 standard variants (SNV, MNV, INS, DEL, SV_DUP, SV_INV)
   Phase 2: 0 BND variants for semantic classification
🔄 Processing chromosome chr1: 2 variants (2 SNV)
  ✅ Applied 2/2 variants (0 skipped)
🔄 Processing chromosome chr2: 2 variants (2 SNV)
  ✅ Applied 2/2 variants (0 skipped)
🧬 Completed: 4/4 variants processed → 6 sequences

Personalized genome chromosomes: ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6']
chr1 encoded shape: torch.Size([4, 80])
Data type: <class 'torch.Tensor'>

Getting Raw Sequence Strings

You can also get the sequences as strings instead of encoded arrays:

# Get raw sequences
personal_genome_raw = sl.get_personal_genome(
    reference_fn=reference, variants_fn=snp_variants, encode=False
)

# Compare reference and personalized sequences
ref_seq = reference["chr1"][:80].seq
pers_seq = personal_genome_raw["chr1"][:80]

print("Reference chr1 (first 80 bp):")
print(ref_seq)
print("\nPersonalized chr1 (first 80 bp):")
print(pers_seq)

# Highlight the difference
print("\nDifferences (position: ref → alt):")
for i, (r, p) in enumerate(zip(ref_seq, pers_seq)):
    if r != p:
        print(f"  Position {i}: {r} → {p}")

Reference chr1 (first 80 bp):
ATGAATATAATATTTTCGAGAATTACTCCTTTTGGAAATGGAACATTATGCGTTTTAAGAGTTTCTGGTAACAATATATT

Personalized chr1 (first 80 bp):
AGGAATATAATATTTTCGAGAATTACTCCTATTGGAAATGGAACATTATGCGTTTTAAGAGTTTCTGGTAACAATATATT

Differences (position: ref → alt):
  Position 1: T → G
  Position 30: T → A

Generating Variant-Centered Windows

Often you want to extract sequence windows centered on each variant, which is useful for model predictions:

# Generate windows around variants
# Note: get_alt_ref_sequences is a generator that yields chunks
results = list(
    sl.get_alt_ref_sequences(
        reference_fn=reference,
        variants_fn=snp_variants,
        seq_len=40,  # 40 bp windows
        encode=False,  # Get strings for visualization
    )
)

# Unpack from the first chunk
alt_seqs, ref_seqs, metadata = results[0]

print(f"Generated {len(metadata)} sequence pair(s)")
print("\nMetadata:")
print(metadata.head())  # Show first few variants

print("\nReference sequence (2nd variant):")
print(ref_seqs[1])
print("\nAlternate sequence (2nd variant):")
print(alt_seqs[1])

# Find the variant position in the window
variant_offset = len(ref_seqs[1][3]) // 2
print(f"\nVariant is centered at position {variant_offset} in the window")
print(f"Reference allele: {ref_seqs[1][3][variant_offset]}")
print(f"Alternate allele: {alt_seqs[1][3][variant_offset]}")

Generated 4 sequence pair(s)

Metadata:
  chrom  window_start  window_end  variant_pos0  variant_pos1 ref alt  \
0  chr1             0          40             1             2   T   G   
1  chr1            10          50            30            31   T   A   
2  chr2             0          40            18            19   A   G   
3  chr2            36          76            56            57   C   G   

  variant_type  
0          SNV  
1          SNV  
2          SNV  
3          SNV  

Reference sequence (2nd variant):
('chr1', np.int64(10), np.int64(50), 'TATTTTCGAGAATTACTCCTTTTGGAAATGGAACATTATG')

Alternate sequence (2nd variant):
('chr1', np.int64(10), np.int64(50), 'TATTTTCGAGAATTACTCCTATTGGAAATGGAACATTATG')

Variant is centered at position 20 in the window
Reference allele: T
Alternate allele: A

Sequence Utilities

supremo_lite provides useful utilities for working with DNA sequences:

# Reverse complement
sequence = "ATCGATCG"
rc_sequence = sl.rc_str(sequence)

print(f"Original:          {sequence}")
print(f"Reverse complement: {rc_sequence}")

# Encode and decode
encoded = sl.encode_seq(sequence)  # Automatically uses brisket if installed
decoded = sl.decode_seq(encoded)

print(f"\nDecoded matches original: {decoded == sequence}")

# Reverse complement of encoded sequence
rc_encoded = sl.rc(encoded)
rc_decoded = sl.decode_seq(rc_encoded)

print(f"RC from encoded:   {rc_decoded}")
print(f"RC from string:    {rc_sequence}")
print(f"Match: {rc_decoded == rc_sequence}")

Original:          ATCGATCG
Reverse complement: CGATCGAT

Decoded matches original: True
RC from encoded:   CGATCGAT
RC from string:    CGATCGAT
Match: True

Working with Encoded Sequences

Let’s see how to work with encoded sequences for downstream analysis:

# Generate encoded windows
results = list(
    sl.get_alt_ref_sequences(
        reference_fn=reference,
        variants_fn=snp_variants,
        seq_len=40,
        encode=True,  # Get encoded arrays/tensors
    )
)

# Unpack from the first chunk
alt_seqs_enc, ref_seqs_enc, metadata = results[0]

print(f"Reference sequences shape: {ref_seqs_enc.shape}")
print(f"Alternate sequences shape: {alt_seqs_enc.shape}")
print(f"Data type: {type(ref_seqs_enc)}")

# Visualize the first sequence pair
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 6))

# Convert to numpy if needed for visualization
ref_viz = (
    ref_seqs_enc[2].numpy() if hasattr(ref_seqs_enc[2], "numpy") else ref_seqs_enc[2]
)
alt_viz = (
    alt_seqs_enc[2].numpy() if hasattr(alt_seqs_enc[2], "numpy") else alt_seqs_enc[2]
)

# Reference sequence - data is already in (4, L) format
im1 = ax1.imshow(ref_viz, cmap="Blues", aspect="auto")
ax1.set_yticks([0, 1, 2, 3])
ax1.set_yticklabels(["A", "C", "G", "T"])
ax1.set_title("Reference Sequence (Encoded) - Shape: (4, 40)")
ax1.set_ylabel("Nucleotide Channel")
ax1.axvline(x=20, color="red", linestyle="--", alpha=0.5, label="Variant position")
ax1.legend()

# Alternate sequence - data is already in (4, L) format
im2 = ax2.imshow(alt_viz, cmap="Oranges", aspect="auto")
ax2.set_yticks([0, 1, 2, 3])
ax2.set_yticklabels(["A", "C", "G", "T"])
ax2.set_title("Alternate Sequence (Encoded) - Shape: (4, 40)")
ax2.set_xlabel("Position in window")
ax2.set_ylabel("Nucleotide Channel")
ax2.axvline(x=20, color="red", linestyle="--", alpha=0.5, label="Variant position")
ax2.legend()

plt.tight_layout()
plt.show()

print("\nThe variant at position 20 shows different encoding between ref and alt")
print(
    "Note: Sequences are encoded in channel-first format (4, L) for compatibility with PyTorch models"
)

Reference sequences shape: torch.Size([4, 4, 40])
Alternate sequences shape: torch.Size([4, 4, 40])
Data type: <class 'torch.Tensor'>

../_images/758110acddbb8d57cef334f7a00a012f34994ec40e54b26f5a163856d578ca90.png

The variant at position 20 shows different encoding between ref and alt
Note: Sequences are encoded in channel-first format (4, L) for compatibility with PyTorch models

Next Steps

Continue with the other notebooks:

02_personalized_genomes.ipynb - Genome personalization workflows
03_prediction_alignment.ipynb - Model predictions and alignment