PAM Disruption Analysis: Identifying CRISPR-Resistant Variants
Learn how to use supremo_lite to identify genetic variants that disrupt Protospacer Adjacent Motif (PAM) sites, making genomic loci resistant to repeated CRISPR editing.
Example PAM sequences:
SpCas9: NGG (e.g., AGG, TGG, CGG, GGG)
SaCas9: NNGRRT (more complex pattern)
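PAM patterns such as NGG and NNGRRT use IUPAC ambiguity codes (N = any base, R = A or G). As a minimal sketch, independent of supremo_lite, such a pattern can be translated into a regular expression and scanned along the forward strand. The helper names `pam_to_regex` and `find_pam_sites` are hypothetical and not part of the library's API, and reverse-strand PAMs are not handled here.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def pam_to_regex(pam):
    """Translate an IUPAC PAM pattern (hypothetical helper) into a regex string."""
    return "".join(IUPAC[base] for base in pam.upper())


def find_pam_sites(seq, pam):
    """Return (0-based start, matched bases) for each forward-strand PAM match."""
    # A lookahead lets overlapping sites (e.g. in "GGGG") all be reported.
    pattern = re.compile(f"(?=({pam_to_regex(pam)}))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq.upper())]


print(find_pam_sites("TTAGGCC", "NGG"))      # [(2, 'AGG')]
print(find_pam_sites("GGGG", "NGG"))         # [(0, 'GGG'), (1, 'GGG')]
print(find_pam_sites("CTGAATCC", "NNGRRT"))  # [(0, 'CTGAAT')]
```

The lookahead form is used because adjacent PAM sites can overlap; a plain pattern would skip the second site in a run of G bases.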
Setup
import supremo_lite as sl
import numpy as np
from pyfaidx import Fasta
import pandas as pd
import os
import re
print(f"supremo_lite version: {sl.__version__}")
# Load test data
test_data_dir = "../../tests/data"
reference = Fasta(os.path.join(test_data_dir, "test_genome.fa"))
print(f"\nReference genome loaded: {list(reference.keys())}")
# Show chr6 sequence
chr6_seq = reference["chr6"][:80].seq
print(f"\nchr6 sequence:")
print(chr6_seq)
print(" " * 32 + "^^^")  # caret under the TGG PAM at 1-based positions 33-35
print("PAM sites for SpCas9 (NGG)")
supremo_lite version: 1.0.0
Reference genome loaded: ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6']
chr6 sequence:
AAAGCAAATTCAAATCATCCAAGAATGCCACTTGGAATTTGCGATATTTTTGTTTTTTTTTTTTTAATATTTTACAAAAT
                                ^^^
PAM sites for SpCas9 (NGG)
Basic PAM Disruption: SNV Example
# Create a G>T SNV at chr6:34 (1-based); it turns the TGG PAM at positions 33-35 into TTG
variant_snv = pd.DataFrame(
    [{"chrom": "chr6", "pos1": 34, "id": ".", "ref": "G", "alt": "T"}]
)
Running PAM Disruption Analysis
Now let’s use get_pam_disrupting_alt_sequences() to identify this PAM-disrupting variant:
# Run PAM disruption analysis
gen_snv = sl.get_pam_disrupting_alt_sequences(
    reference_fn=reference,
    variants_fn=variant_snv,
    seq_len=20,  # 20bp window
    max_pam_distance=10,  # Search within 10bp of variant
    pam_sequence="NGG",  # SpCas9 PAM
    encode=False,  # Get raw strings for visualization
    n_chunks=1,
)
# Unpack the generator to get results
alt_seqs, ref_seqs, metadata = next(gen_snv)
print("PAM Disruption Analysis Results:")
print("=" * 50)
print(f"Number of PAM-disrupting variants: {len(metadata)}")
print(f"\nMetadata (first few columns):")
print(
    metadata[
        [
            "chrom",
            "variant_pos1",
            "ref",
            "alt",
            "pam_ref_sequence",
            "pam_alt_sequence",
            "pam_distance",
        ]
    ].to_string()
)
print(f"\n\nReference sequence (no variant):")
for i, (chrom, start, end, seq) in enumerate(ref_seqs):
    print(f" {i}: {chrom}:{start}-{end}")
    print(f" {seq}")
print(f"\nAlternate sequence (variant applied):")
for i, (chrom, start, end, seq) in enumerate(alt_seqs):
    print(f" {i}: {chrom}:{start}-{end}")
    print(f" {seq}")
print(f"\n✓ The TGG PAM in reference becomes TTG in alternate (no longer matches NGG)")
PAM Disruption Analysis Results:
==================================================
Number of PAM-disrupting variants: 1
Metadata (first few columns):
chrom variant_pos1 ref alt pam_ref_sequence pam_alt_sequence pam_distance
0 chr6 34 G T TGG TTG 1
Reference sequence (no variant):
0: chr6:23-43
AATGCCACTTGGAATTTGCG
Alternate sequence (variant applied):
0: chr6:23-43
AATGCCACTTTGAATTTGCG
✓ The TGG PAM in reference becomes TTG in alternate (no longer matches NGG)
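The result above can be checked by hand. The following sketch, which does not call supremo_lite, applies the SNV to the reference window by string slicing and scans both versions for forward-strand NGG sites. It treats the window chr6:23-43 as 0-based and half-open, an assumption that is consistent with its 20 bp length.

```python
import re

ref_window = "AATGCCACTTGGAATTTGCG"  # chr6:23-43, copied from the output above
window_start = 23                    # assumed 0-based, half-open window start
variant_pos1, ref, alt = 34, "G", "T"

# Index of the variant inside the window (1-based genome position -> 0-based offset)
idx = (variant_pos1 - 1) - window_start
assert ref_window[idx] == ref, "ref allele does not match the reference window"

# Apply the SNV by string slicing
alt_window = ref_window[:idx] + alt + ref_window[idx + 1:]

# Forward-strand NGG scan; the lookahead allows overlapping matches
ngg = re.compile(r"(?=([ACGT]GG))")
ref_pams = [(m.start(), m.group(1)) for m in ngg.finditer(ref_window)]
alt_pams = [(m.start(), m.group(1)) for m in ngg.finditer(alt_window)]

print(alt_window)  # AATGCCACTTTGAATTTGCG
print(ref_pams)    # [(9, 'TGG')]
print(alt_pams)    # []
```

The reference window contains one NGG site and the alternate window contains none, which is what marks the variant as PAM-disrupting.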
Understanding the Output Structure
The function returns a generator that yields tuples for memory-efficient processing:
for alt_seqs, ref_seqs, metadata in sl.get_pam_disrupting_alt_sequences(...):
    # Process each chunk
    pass
# Or get all results at once with n_chunks=1
alt_seqs, ref_seqs, metadata = next(sl.get_pam_disrupting_alt_sequences(..., n_chunks=1))
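To illustrate the chunked pattern without needing the library or a genome file, the sketch below uses a hypothetical stand-in generator, `fake_chunks`, that yields the same 3-tuple shape. The records are made-up toy values and the metadata is a plain list of dicts; with the real generator the per-chunk metadata is a DataFrame, so the chunks could presumably be combined with `pandas.concat` instead of `list.extend`.

```python
def fake_chunks(records, n_chunks):
    """Hypothetical stand-in for the library generator (not supremo_lite code)."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division, at least 1
    for i in range(0, len(records), size):
        chunk = records[i:i + size]
        alt_seqs = [r["alt"] for r in chunk]
        ref_seqs = [r["ref"] for r in chunk]
        metadata = [{"id": r["id"]} for r in chunk]  # the real metadata is a DataFrame
        yield alt_seqs, ref_seqs, metadata


# Toy records, invented for illustration only
records = [
    {"id": "v1", "ref": "TGG", "alt": "TTG"},
    {"id": "v2", "ref": "AGG", "alt": "AGA"},
    {"id": "v3", "ref": "CGG", "alt": "CAG"},
]

all_alt, all_ref, all_meta = [], [], []
n_seen = 0
for alt_seqs, ref_seqs, metadata in fake_chunks(records, n_chunks=2):
    n_seen += 1  # only one chunk is held in memory at a time
    all_alt.extend(alt_seqs)
    all_ref.extend(ref_seqs)
    all_meta.extend(metadata)

print(n_seen, len(all_meta))  # 2 3
```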
Return Values
Each yield produces a 3-tuple:
1. alt_seqs: Sequences with variant applied
   Format when encode=False: List of (chrom, start, end, sequence) tuples
   Format when encode=True: Stacked array/tensor of shape (n_variants, 4, seq_len)
2. ref_seqs: Reference sequences (no variant)
   Same format as alt_seqs
   Use these as baseline for comparison with variant sequences
3. metadata: DataFrame with comprehensive variant and PAM information
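The shape (n_variants, 4, seq_len) corresponds to one-hot encoding with one channel per nucleotide. The sketch below shows where that shape comes from using plain nested lists. The channel order A, C, G, T and the all-zero column for other characters are assumptions made for illustration; the library's actual encoding conventions may differ.

```python
BASES = "ACGT"  # assumed channel order; the library's actual order may differ


def one_hot(seq):
    """Return a 4 x len(seq) nested list; characters outside ACGT give an all-zero column."""
    seq = seq.upper()
    return [[1 if base == channel else 0 for base in seq] for channel in BASES]


def stack(seqs):
    """Stack equal-length sequences into an n_variants x 4 x seq_len nested list."""
    return [one_hot(s) for s in seqs]


# Reference and alternate windows from the SNV example
batch = stack(["AATGCCACTTGGAATTTGCG", "AATGCCACTTTGAATTTGCG"])
shape = (len(batch), len(batch[0]), len(batch[0][0]))
print(shape)  # (2, 4, 20)
```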
Metadata Columns
Standard variant information:
chrom: Chromosome name
window_start, window_end: Window boundaries (0-based)
variant_pos0, variant_pos1: Variant position (0-based and 1-based)
ref, alt: Reference and alternate alleles
variant_type: Variant classification (SNV, INS, DEL, etc.)
PAM-specific information:
pam_site_pos: Position of PAM site within the window (0-based)
pam_ref_sequence: PAM sequence in reference (e.g., "TGG")
pam_alt_sequence: PAM sequence after variant (e.g., "TTG")
pam_distance: Distance from variant to PAM start
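The coordinate columns relate to each other by simple arithmetic. The sketch below works through the SNV example, assuming that pam_distance is the absolute offset between variant_pos0 and the genomic PAM start. That assumption reproduces the value 1 shown in the example output, but the library's exact definition (for instance for insertions and deletions) may differ. The pam_site_pos value of 9 is derived by hand from the window sequence, not taken from library output.

```python
window_start = 23   # from the window chr6:23-43
variant_pos1 = 34   # from the metadata row
pam_site_pos = 9    # hand-derived: TGG starts at index 9 of AATGCCACTTGGAATTTGCG

variant_pos0 = variant_pos1 - 1                           # 1-based -> 0-based
pam_genomic_start = window_start + pam_site_pos           # window-relative -> genomic
variant_index_in_window = variant_pos0 - window_start     # genomic -> window-relative
assumed_distance = abs(variant_pos0 - pam_genomic_start)  # assumed definition of pam_distance

print(variant_pos0, pam_genomic_start, variant_index_in_window, assumed_distance)  # 33 32 10 1
```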
Next Steps
01_getting_started.ipynb - Basic supremo_lite functionality
02_personalized_genomes.ipynb - Genome personalization workflows
03_prediction_alignment.ipynb - Align model predictions across variants
PAM Disruption Guide - Detailed documentation and API reference