Personalized Genome Generation

Comprehensive guide to creating personalized genomes using supremo_lite.

Learning Objectives

By the end of this notebook, you will understand:

How different variant types (SNV, INS, DEL, MNV) are applied
Chromosome order preservation in outputs
Verbose mode for debugging variant application
Skipped variant reporting
Chunked processing for large VCF files
Best practices for genome personalization

Setup

import supremo_lite as sl
import numpy as np
import matplotlib.pyplot as plt
from pyfaidx import Fasta
import pandas as pd
import os

# Set up paths to test data
test_data_dir = "../../tests/data"
reference_path = os.path.join(test_data_dir, "test_genome.fa")
multi_vcf_path = os.path.join(test_data_dir, "multi", "multi.vcf")

# Load reference genome
reference = Fasta(reference_path)

print(f"supremo_lite version: {sl.__version__}")
print(f"Reference genome loaded: {list(reference.keys())}")

supremo_lite version: 1.0.0
Reference genome loaded: ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6']

Understanding Variant Types

supremo_lite automatically classifies variants, for more information on how this classification is done and which VCF representations are supported, see the variant classification flowchart.

# Load variants
variants = sl.read_vcf(multi_vcf_path, classify_variants=True)

print("Variants in VCF:")
print(variants)

Variants in VCF:
  chrom  pos1 id            ref       alt    info  vcf_line variant_type
chr1    10  .              A         T  DP=100         9          SNV
chr1    11  .              T   TATATAG  DP=100        10          INS
chr1    11  .  TATTTTCGAGAAT         T  DP=100        11          DEL
chr1    11  .              T    TGGGGG  DP=100        12          INS
chr1    25  .              A  AATATAGG  DP=100        13          INS
chr1    50  .              G    GGGGGG  DP=100        14          INS
chr1    52  .         GTTTTA         G  DP=100        15          DEL
chr1    76  .          ATATT     AGATT  DP=100        16          MNV

Basic Genome Personalization

Let’s create a personalized genome and examine the results:

# Create personalized genome with verbose output
personal_genome = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn=variants,
    encode=False,  # Get string sequences for easy comparison
)

print("\nGenerated personalized genome:")
for chrom in personal_genome.keys():
    print(f"{chrom}: {len(personal_genome[chrom])} bp")

Generated personalized genome:
chr1: 93 bp
chr2: 80 bp
chr3: 80 bp
chr4: 80 bp
chr5: 80 bp
chr6: 80 bp

Examining Variant Application

Let’s look at specific regions to see how each variant type was applied:

# Function to show variant context
def show_variant_context(chrom, pos1, ref, alt, reference, personal_genome, window=20):
    """Display reference and personalized sequence around a variant."""
    # VCF uses 1-based, Python uses 0-based
    pos0 = pos1 - 1

    # Get reference context
    ref_start = max(0, pos0 - window)
    ref_end = min(len(reference[chrom]), pos0 + len(ref) + window)
    ref_context = reference[chrom][ref_start:ref_end].seq

    # Calculate where variant appears in context
    var_offset = pos0 - ref_start

    # Get personalized context (may have different length due to indel)
    # For personalized, we need to account for the length change
    len_diff = len(alt) - len(ref)
    pers_end = ref_end + len_diff
    pers_context = personal_genome[chrom][ref_start:pers_end]

    print(f"{chrom}:{pos1} {ref} → {alt}")
    print(f"\nReference:    {ref_context}")
    print(f"              {' ' * var_offset}{'^' * len(ref)}")
    print(f"Personalized: {pers_context}")
    print(f"              {' ' * var_offset}{'^' * len(alt)}")
    print()


# Show examples of each variant type
print("=" * 60)
print("VARIANT APPLICATION EXAMPLES")
print("=" * 60)

for _, row in variants.head(4).iterrows():  # Show first 4 variants
    print(f"\n{row['variant_type']} variant:")
    print("-" * 60)
    show_variant_context(
        row["chrom"], row["pos1"], row["ref"], row["alt"], reference, personal_genome
    )

============================================================
VARIANT APPLICATION EXAMPLES
============================================================

SNV variant:
------------------------------------------------------------
chr1:10 A → T

Reference:    ATGAATATAATATTTTCGAGAATTACTCCT
                       ^
Personalized: ATGAATATATTATATAGATTTTCGAGAATT
                       ^


INS variant:
------------------------------------------------------------
chr1:11 T → TATATAG

Reference:    ATGAATATAATATTTTCGAGAATTACTCCTT
                        ^
Personalized: ATGAATATATTATATAGATTTTCGAGAATTAATATAG
                        ^^^^^^^


DEL variant:
------------------------------------------------------------
chr1:11 TATTTTCGAGAAT → T

Reference:    ATGAATATAATATTTTCGAGAATTACTCCTTTTGGAAATGGAA
                        ^^^^^^^^^^^^^
Personalized: ATGAATATATTATATAGATTTTCGAGAATTA
                        ^


INS variant:
------------------------------------------------------------
chr1:11 T → TGGGGG

Reference:    ATGAATATAATATTTTCGAGAATTACTCCTT
                        ^
Personalized: ATGAATATATTATATAGATTTTCGAGAATTAATATA
                        ^^^^^^

Skipped Variants and Overlap Detection

When variants overlap or can’t be applied, supremo_lite tracks and reports them in verbose mode. Let’s see this in action:

# The multi.vcf file contains overlapping variants

# Create personalized genome with verbose output
personal_genome = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn=variants,
    encode=False,  # Get string sequences for easy comparison
    verbose=True,  # Show detailed processing information
)

🧬 Processing 8 variants across 1 chromosomes
   Phase 1: 8 standard variants (SNV, MNV, INS, DEL, SV_DUP, SV_INV)
   Phase 2: 0 BND variants for semantic classification
🔄 Processing chromosome chr1: 8 variants (4 INS, 2 DEL, 1 SNV, 1 MNV)
  ⚠️  Skipped 2 variant(s):
     • overlap with previous variant: VCF line(s) 11, 12 at chr1:11
  ✅ Applied 6/8 variants (2 skipped)
🧬 Completed: 8/8 variants processed → 6 sequences

Sequence Length Changes from Indels

Insertions and deletions change the length of chromosomes:

# Calculate expected length changes
print("Chromosome length analysis:")
print("=" * 60)

for chrom in reference.keys():
    ref_len = len(reference[chrom])
    pers_len = len(personal_genome[chrom])

    # Calculate total length change from variants
    chrom_variants = variants[variants["chrom"] == chrom]
    total_change = sum(
        len(row["alt"]) - len(row["ref"]) for _, row in chrom_variants.iterrows()
    )

    # Note: total_change may differ from actual due to overlapping variants
    actual_change = pers_len - ref_len

    print(f"\n{chrom}:")
    print(f"  Reference length:    {ref_len:,} bp")
    print(f"  Personalized length: {pers_len:,} bp")
    print(f"  Actual change:       {actual_change:+,} bp")
    print(f"  Expected change*:    {total_change:+,} bp")
    print(f"  Variants on chrom:   {len(chrom_variants)}")

    if total_change != actual_change:
        print(f"  ⚠️  Difference due to overlapping/skipped variants")

Chromosome length analysis:
============================================================

chr1:
  Reference length:    80 bp
  Personalized length: 93 bp
  Actual change:       +13 bp
  Expected change*:    +6 bp
  Variants on chrom:   8
  ⚠️  Difference due to overlapping/skipped variants

chr2:
  Reference length:    80 bp
  Personalized length: 80 bp
  Actual change:       +0 bp
  Expected change*:    +0 bp
  Variants on chrom:   0

chr3:
  Reference length:    80 bp
  Personalized length: 80 bp
  Actual change:       +0 bp
  Expected change*:    +0 bp
  Variants on chrom:   0

chr4:
  Reference length:    80 bp
  Personalized length: 80 bp
  Actual change:       +0 bp
  Expected change*:    +0 bp
  Variants on chrom:   0

chr5:
  Reference length:    80 bp
  Personalized length: 80 bp
  Actual change:       +0 bp
  Expected change*:    +0 bp
  Variants on chrom:   0

chr6:
  Reference length:    80 bp
  Personalized length: 80 bp
  Actual change:       +0 bp
  Expected change*:    +0 bp
  Variants on chrom:   0

Best Practices

1. Always use verbose mode during development

personal_genome = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn=variants,
    verbose=True  # See what's happening!
)

2. Use chunked processing for large VCF files

personal_genome = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn='large_variants.vcf',
    chunk_size=10000  # Process vcf in 10k chunks (default is 1)
)

Next Steps

03_prediction_alignment.ipynb - Generate variant centered window sequences and align model predictions