Personalized Genome Generation

Comprehensive guide to creating personalized genomes using supremo_lite.

Learning Objectives

By the end of this notebook, you will understand:

  • How different variant types (SNV, INS, DEL, MNV) are applied

  • Chromosome order preservation in outputs

  • Verbose mode for debugging variant application

  • Skipped variant reporting

  • Chunked processing for large VCF files

  • Best practices for genome personalization

Setup

import supremo_lite as sl
import numpy as np
import matplotlib.pyplot as plt
from pyfaidx import Fasta
import pandas as pd
import os

# Set up paths to test data
test_data_dir = "../../tests/data"
reference_path = os.path.join(test_data_dir, "test_genome.fa")
multi_vcf_path = os.path.join(test_data_dir, "multi", "multi.vcf")

# Load reference genome
reference = Fasta(reference_path)

print(f"supremo_lite version: {sl.__version__}")
print(f"Reference genome loaded: {list(reference.keys())}")
supremo_lite version: 1.0.0
Reference genome loaded: ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6']

Understanding Variant Types

supremo_lite automatically classifies variants, for more information on how this classification is done and which VCF representations are supported, see the variant classification flowchart.

# Load variants
variants = sl.read_vcf(multi_vcf_path, classify_variants=True)

print("Variants in VCF:")
print(variants)
Variants in VCF:
  chrom  pos1 id            ref       alt    info  vcf_line variant_type
0  chr1    10  .              A         T  DP=100         9          SNV
1  chr1    11  .              T   TATATAG  DP=100        10          INS
2  chr1    11  .  TATTTTCGAGAAT         T  DP=100        11          DEL
3  chr1    11  .              T    TGGGGG  DP=100        12          INS
4  chr1    25  .              A  AATATAGG  DP=100        13          INS
5  chr1    50  .              G    GGGGGG  DP=100        14          INS
6  chr1    52  .         GTTTTA         G  DP=100        15          DEL
7  chr1    76  .          ATATT     AGATT  DP=100        16          MNV

Basic Genome Personalization

Let’s create a personalized genome and examine the results:

# Create personalized genome with verbose output
personal_genome = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn=variants,
    encode=False,  # Get string sequences for easy comparison
)

print("\nGenerated personalized genome:")
for chrom in personal_genome.keys():
    print(f"{chrom}: {len(personal_genome[chrom])} bp")
Generated personalized genome:
chr1: 93 bp
chr2: 80 bp
chr3: 80 bp
chr4: 80 bp
chr5: 80 bp
chr6: 80 bp

Examining Variant Application

Let’s look at specific regions to see how each variant type was applied:

# Function to show variant context
def show_variant_context(chrom, pos1, ref, alt, reference, personal_genome, window=20):
    """Display reference and personalized sequence around a variant."""
    # VCF uses 1-based, Python uses 0-based
    pos0 = pos1 - 1

    # Get reference context
    ref_start = max(0, pos0 - window)
    ref_end = min(len(reference[chrom]), pos0 + len(ref) + window)
    ref_context = reference[chrom][ref_start:ref_end].seq

    # Calculate where variant appears in context
    var_offset = pos0 - ref_start

    # Get personalized context (may have different length due to indel)
    # For personalized, we need to account for the length change
    len_diff = len(alt) - len(ref)
    pers_end = ref_end + len_diff
    pers_context = personal_genome[chrom][ref_start:pers_end]

    print(f"{chrom}:{pos1} {ref}{alt}")
    print(f"\nReference:    {ref_context}")
    print(f"              {' ' * var_offset}{'^' * len(ref)}")
    print(f"Personalized: {pers_context}")
    print(f"              {' ' * var_offset}{'^' * len(alt)}")
    print()


# Show examples of each variant type
print("=" * 60)
print("VARIANT APPLICATION EXAMPLES")
print("=" * 60)

for _, row in variants.head(4).iterrows():  # Show first 4 variants
    print(f"\n{row['variant_type']} variant:")
    print("-" * 60)
    show_variant_context(
        row["chrom"], row["pos1"], row["ref"], row["alt"], reference, personal_genome
    )
============================================================
VARIANT APPLICATION EXAMPLES
============================================================

SNV variant:
------------------------------------------------------------
chr1:10 A → T

Reference:    ATGAATATAATATTTTCGAGAATTACTCCT
                       ^
Personalized: ATGAATATATTATATAGATTTTCGAGAATT
                       ^


INS variant:
------------------------------------------------------------
chr1:11 T → TATATAG

Reference:    ATGAATATAATATTTTCGAGAATTACTCCTT
                        ^
Personalized: ATGAATATATTATATAGATTTTCGAGAATTAATATAG
                        ^^^^^^^


DEL variant:
------------------------------------------------------------
chr1:11 TATTTTCGAGAAT → T

Reference:    ATGAATATAATATTTTCGAGAATTACTCCTTTTGGAAATGGAA
                        ^^^^^^^^^^^^^
Personalized: ATGAATATATTATATAGATTTTCGAGAATTA
                        ^


INS variant:
------------------------------------------------------------
chr1:11 T → TGGGGG

Reference:    ATGAATATAATATTTTCGAGAATTACTCCTT
                        ^
Personalized: ATGAATATATTATATAGATTTTCGAGAATTAATATA
                        ^^^^^^

Skipped Variants and Overlap Detection

When variants overlap or can’t be applied, supremo_lite tracks and reports them in verbose mode. Let’s see this in action:

# The multi.vcf file contains overlapping variants

# Create personalized genome with verbose output
personal_genome = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn=variants,
    encode=False,  # Get string sequences for easy comparison
    verbose=True,  # Show detailed processing information
)
🧬 Processing 8 variants across 1 chromosomes
   Phase 1: 8 standard variants (SNV, MNV, INS, DEL, SV_DUP, SV_INV)
   Phase 2: 0 BND variants for semantic classification
🔄 Processing chromosome chr1: 8 variants (4 INS, 2 DEL, 1 SNV, 1 MNV)
  ⚠️  Skipped 2 variant(s):
     • overlap with previous variant: VCF line(s) 11, 12 at chr1:11
  ✅ Applied 6/8 variants (2 skipped)
🧬 Completed: 8/8 variants processed → 6 sequences

Sequence Length Changes from Indels

Insertions and deletions change the length of chromosomes:

# Calculate expected length changes
print("Chromosome length analysis:")
print("=" * 60)

for chrom in reference.keys():
    ref_len = len(reference[chrom])
    pers_len = len(personal_genome[chrom])

    # Calculate total length change from variants
    chrom_variants = variants[variants["chrom"] == chrom]
    total_change = sum(
        len(row["alt"]) - len(row["ref"]) for _, row in chrom_variants.iterrows()
    )

    # Note: total_change may differ from actual due to overlapping variants
    actual_change = pers_len - ref_len

    print(f"\n{chrom}:")
    print(f"  Reference length:    {ref_len:,} bp")
    print(f"  Personalized length: {pers_len:,} bp")
    print(f"  Actual change:       {actual_change:+,} bp")
    print(f"  Expected change*:    {total_change:+,} bp")
    print(f"  Variants on chrom:   {len(chrom_variants)}")

    if total_change != actual_change:
        print(f"  ⚠️  Difference due to overlapping/skipped variants")
Chromosome length analysis:
============================================================

chr1:
  Reference length:    80 bp
  Personalized length: 93 bp
  Actual change:       +13 bp
  Expected change*:    +6 bp
  Variants on chrom:   8
  ⚠️  Difference due to overlapping/skipped variants

chr2:
  Reference length:    80 bp
  Personalized length: 80 bp
  Actual change:       +0 bp
  Expected change*:    +0 bp
  Variants on chrom:   0

chr3:
  Reference length:    80 bp
  Personalized length: 80 bp
  Actual change:       +0 bp
  Expected change*:    +0 bp
  Variants on chrom:   0

chr4:
  Reference length:    80 bp
  Personalized length: 80 bp
  Actual change:       +0 bp
  Expected change*:    +0 bp
  Variants on chrom:   0

chr5:
  Reference length:    80 bp
  Personalized length: 80 bp
  Actual change:       +0 bp
  Expected change*:    +0 bp
  Variants on chrom:   0

chr6:
  Reference length:    80 bp
  Personalized length: 80 bp
  Actual change:       +0 bp
  Expected change*:    +0 bp
  Variants on chrom:   0

Best Practices

1. Always use verbose mode during development

personal_genome = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn=variants,
    verbose=True  # See what's happening!
)

2. Use chunked processing for large VCF files

personal_genome = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn='large_variants.vcf',
    chunk_size=10000  # Process vcf in 10k chunks (default is 1)
)

Next Steps