Personalized Genome Generation
Comprehensive guide to creating personalized genomes using supremo_lite.
Learning Objectives
By the end of this notebook, you will understand:
How different variant types (SNV, INS, DEL, MNV) are applied
Chromosome order preservation in outputs
Verbose mode for debugging variant application
Skipped variant reporting
Chunked processing for large VCF files
Best practices for genome personalization
Setup
import supremo_lite as sl
import numpy as np
import matplotlib.pyplot as plt
from pyfaidx import Fasta
import pandas as pd
import os
# Set up paths to test data
test_data_dir = "../../tests/data"
reference_path = os.path.join(test_data_dir, "test_genome.fa")
multi_vcf_path = os.path.join(test_data_dir, "multi", "multi.vcf")
# Load reference genome
reference = Fasta(reference_path)
print(f"supremo_lite version: {sl.__version__}")
print(f"Reference genome loaded: {list(reference.keys())}")
supremo_lite version: 1.0.0
Reference genome loaded: ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6']
Understanding Variant Types
supremo_lite automatically classifies variants, for more information on how this classification is done and which VCF representations are supported, see the variant classification flowchart.
# Load variants
variants = sl.read_vcf(multi_vcf_path, classify_variants=True)
print("Variants in VCF:")
print(variants)
Variants in VCF:
chrom pos1 id ref alt info vcf_line variant_type
0 chr1 10 . A T DP=100 9 SNV
1 chr1 11 . T TATATAG DP=100 10 INS
2 chr1 11 . TATTTTCGAGAAT T DP=100 11 DEL
3 chr1 11 . T TGGGGG DP=100 12 INS
4 chr1 25 . A AATATAGG DP=100 13 INS
5 chr1 50 . G GGGGGG DP=100 14 INS
6 chr1 52 . GTTTTA G DP=100 15 DEL
7 chr1 76 . ATATT AGATT DP=100 16 MNV
Basic Genome Personalization
Let’s create a personalized genome and examine the results:
# Create personalized genome with verbose output
personal_genome = sl.get_personal_genome(
reference_fn=reference,
variants_fn=variants,
encode=False, # Get string sequences for easy comparison
)
print("\nGenerated personalized genome:")
for chrom in personal_genome.keys():
print(f"{chrom}: {len(personal_genome[chrom])} bp")
Generated personalized genome:
chr1: 93 bp
chr2: 80 bp
chr3: 80 bp
chr4: 80 bp
chr5: 80 bp
chr6: 80 bp
Examining Variant Application
Let’s look at specific regions to see how each variant type was applied:
# Function to show variant context
def show_variant_context(chrom, pos1, ref, alt, reference, personal_genome, window=20):
"""Display reference and personalized sequence around a variant."""
# VCF uses 1-based, Python uses 0-based
pos0 = pos1 - 1
# Get reference context
ref_start = max(0, pos0 - window)
ref_end = min(len(reference[chrom]), pos0 + len(ref) + window)
ref_context = reference[chrom][ref_start:ref_end].seq
# Calculate where variant appears in context
var_offset = pos0 - ref_start
# Get personalized context (may have different length due to indel)
# For personalized, we need to account for the length change
len_diff = len(alt) - len(ref)
pers_end = ref_end + len_diff
pers_context = personal_genome[chrom][ref_start:pers_end]
print(f"{chrom}:{pos1} {ref} → {alt}")
print(f"\nReference: {ref_context}")
print(f" {' ' * var_offset}{'^' * len(ref)}")
print(f"Personalized: {pers_context}")
print(f" {' ' * var_offset}{'^' * len(alt)}")
print()
# Show examples of each variant type
print("=" * 60)
print("VARIANT APPLICATION EXAMPLES")
print("=" * 60)
for _, row in variants.head(4).iterrows(): # Show first 4 variants
print(f"\n{row['variant_type']} variant:")
print("-" * 60)
show_variant_context(
row["chrom"], row["pos1"], row["ref"], row["alt"], reference, personal_genome
)
============================================================
VARIANT APPLICATION EXAMPLES
============================================================
SNV variant:
------------------------------------------------------------
chr1:10 A → T
Reference: ATGAATATAATATTTTCGAGAATTACTCCT
^
Personalized: ATGAATATATTATATAGATTTTCGAGAATT
^
INS variant:
------------------------------------------------------------
chr1:11 T → TATATAG
Reference: ATGAATATAATATTTTCGAGAATTACTCCTT
^
Personalized: ATGAATATATTATATAGATTTTCGAGAATTAATATAG
^^^^^^^
DEL variant:
------------------------------------------------------------
chr1:11 TATTTTCGAGAAT → T
Reference: ATGAATATAATATTTTCGAGAATTACTCCTTTTGGAAATGGAA
^^^^^^^^^^^^^
Personalized: ATGAATATATTATATAGATTTTCGAGAATTA
^
INS variant:
------------------------------------------------------------
chr1:11 T → TGGGGG
Reference: ATGAATATAATATTTTCGAGAATTACTCCTT
^
Personalized: ATGAATATATTATATAGATTTTCGAGAATTAATATA
^^^^^^
Skipped Variants and Overlap Detection
When variants overlap or can’t be applied, supremo_lite tracks and reports them in verbose mode. Let’s see this in action:
# The multi.vcf file contains overlapping variants
# Create personalized genome with verbose output
personal_genome = sl.get_personal_genome(
reference_fn=reference,
variants_fn=variants,
encode=False, # Get string sequences for easy comparison
verbose=True, # Show detailed processing information
)
🧬 Processing 8 variants across 1 chromosomes
Phase 1: 8 standard variants (SNV, MNV, INS, DEL, SV_DUP, SV_INV)
Phase 2: 0 BND variants for semantic classification
🔄 Processing chromosome chr1: 8 variants (4 INS, 2 DEL, 1 SNV, 1 MNV)
⚠️ Skipped 2 variant(s):
• overlap with previous variant: VCF line(s) 11, 12 at chr1:11
✅ Applied 6/8 variants (2 skipped)
🧬 Completed: 8/8 variants processed → 6 sequences
Sequence Length Changes from Indels
Insertions and deletions change the length of chromosomes:
# Calculate expected length changes
print("Chromosome length analysis:")
print("=" * 60)
for chrom in reference.keys():
ref_len = len(reference[chrom])
pers_len = len(personal_genome[chrom])
# Calculate total length change from variants
chrom_variants = variants[variants["chrom"] == chrom]
total_change = sum(
len(row["alt"]) - len(row["ref"]) for _, row in chrom_variants.iterrows()
)
# Note: total_change may differ from actual due to overlapping variants
actual_change = pers_len - ref_len
print(f"\n{chrom}:")
print(f" Reference length: {ref_len:,} bp")
print(f" Personalized length: {pers_len:,} bp")
print(f" Actual change: {actual_change:+,} bp")
print(f" Expected change*: {total_change:+,} bp")
print(f" Variants on chrom: {len(chrom_variants)}")
if total_change != actual_change:
print(f" ⚠️ Difference due to overlapping/skipped variants")
Chromosome length analysis:
============================================================
chr1:
Reference length: 80 bp
Personalized length: 93 bp
Actual change: +13 bp
Expected change*: +6 bp
Variants on chrom: 8
⚠️ Difference due to overlapping/skipped variants
chr2:
Reference length: 80 bp
Personalized length: 80 bp
Actual change: +0 bp
Expected change*: +0 bp
Variants on chrom: 0
chr3:
Reference length: 80 bp
Personalized length: 80 bp
Actual change: +0 bp
Expected change*: +0 bp
Variants on chrom: 0
chr4:
Reference length: 80 bp
Personalized length: 80 bp
Actual change: +0 bp
Expected change*: +0 bp
Variants on chrom: 0
chr5:
Reference length: 80 bp
Personalized length: 80 bp
Actual change: +0 bp
Expected change*: +0 bp
Variants on chrom: 0
chr6:
Reference length: 80 bp
Personalized length: 80 bp
Actual change: +0 bp
Expected change*: +0 bp
Variants on chrom: 0
Best Practices
1. Always use verbose mode during development
personal_genome = sl.get_personal_genome(
reference_fn=reference,
variants_fn=variants,
verbose=True # See what's happening!
)
2. Use chunked processing for large VCF files
personal_genome = sl.get_personal_genome(
reference_fn=reference,
variants_fn='large_variants.vcf',
chunk_size=10000 # Process vcf in 10k chunks (default is 1)
)
Next Steps
03_prediction_alignment.ipynb - Generate variant centered window sequences and align model predictions