
Clustering

Ash O'Farrell edited this page Dec 3, 2025 · 4 revisions

Definition of a cluster

This is not the same as our definition of a cluster with regard to covid, nor does it match the definition of a cluster that some epis use.

A cluster is a set of samples in which every sample is within a genetic distance of X (the "nominal maximum genetic distance") of at least one other sample in that cluster. The distances are derived by parsing branch lengths on the phylogenetic tree on which UShER places the samples. For tuberculosis, epidemiologists tend to use a value of X=12 SNPs or X=10 SNPs. Strictly speaking, branch length is not necessarily the exact same value as the number of SNPs between two samples, but at this scale these values are very close, so I will usually refer to clusters as "20 SNP clusters" etc. when I actually mean "clusters for which all samples are within a genetic distance of 20 to at least one other sample within that cluster."

Geographic location, date of sample isolation, or shared household status are epidemiologically important details, but we currently do not have those details. As such, they have no influence on our definition of a cluster. However, if that metadata is added in the future, it could be displayed in Microreact.

Chaining

Consider samples A, B, and C. A:B = 9 SNPs apart, B:C = 4 SNPs apart, A:C = 12 SNPs apart. If we were only considering A and C, they would not be within a 10 SNP cluster. However, since we're considering B, all three of them are considered part of a 10 SNP cluster. B is "chaining" A and C together.
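The chaining example above can be sketched in a few lines of Python. This is only an illustration: the real pipeline reads distances off the UShER tree, whereas this sketch assumes a hypothetical dictionary of pairwise distances, a hypothetical function name (`single_linkage_clusters`), and that "within X" means less than or equal to X.

```python
def single_linkage_clusters(distances, threshold):
    """Group samples so that each one is within `threshold` of at least
    one other member of its group (this is what allows chaining)."""
    samples = sorted({s for pair in distances for s in pair})
    parent = {s: s for s in samples}  # union-find forest

    def find(s):
        # Walk up to the root, shortening the path as we go
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Join any two samples that are close enough to each other
    for (a, b), dist in distances.items():
        if dist <= threshold:  # assumption: the threshold is inclusive
            parent[find(a)] = find(b)

    groups = {}
    for s in samples:
        groups.setdefault(find(s), set()).add(s)
    # A sample that joins nothing is not a cluster
    return [sorted(g) for g in groups.values() if len(g) > 1]


distances = {("A", "B"): 9, ("B", "C"): 4, ("A", "C"): 12}
chained = single_linkage_clusters(distances, threshold=10)
without_b = single_linkage_clusters({("A", "C"): 12}, threshold=10)
print(chained)    # A, B, and C end up together because B links A to C
print(without_b)  # no cluster: A and C alone are 12 apart
```

With B present, all three samples land in one 10 SNP cluster even though A and C are 12 apart. Without B, no cluster forms.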

Applying this definition to CalTBNet

For CalTBNet, we have decided to use X=20 as our nominal maximum genetic distance. Within these 20 SNP clusters, we additionally search for 10 SNP subclusters, and within those, 5 SNP subsubclusters. Currently, CDPH is mostly interested in 10 SNP clusters, as the current science indicates that samples that do not cluster within 10 SNPs are less likely to be indicative of direct transmission.

Every cluster (whether 20 SNP, 10 SNP, or 5 SNP) is given a unique persistent ID. To properly keep track of clusters across runs, we maintain a list of persistent IDs for each cluster and the sample IDs of samples known to be in that cluster. In this sense, our persistent clusters are defined by the sample IDs of samples within them. For this reason, we must assume that the genetic information a sample ID represents is immutable (in other words: we should never overwrite nor reuse a sample ID).

However, UShER is non-deterministic. The addition of new samples, or even just random chance, can influence placement of old samples on the tree. Ironically, this means the genetic distance between two immutable samples is mutable. For this reason, samples sometimes move in and out of clusters.

Consequences/Implementation details

  • All 5 SNP clusters have a 10 SNP cluster parent
  • All 10 SNP clusters have a 20 SNP cluster parent
  • A parent cluster can have multiple children, but a child cluster can only have one (direct) parent
  • It is possible for a given sample ID to be in 0, 1, 2, or 3 clusters
    • 0: The sample doesn't cluster with anything
    • 1: The sample is in a 20 cluster
    • 2: The sample is in a 20 cluster, and a 10 cluster within that 20 cluster
    • 3: The sample is in a 20 cluster, and a 10 cluster within that 20 cluster, and a 5 cluster within that 10 cluster
    • It is not possible for a sample to be part of two different clusters of the same genetic distance at the same time (i.e., a sample cannot be in two different 20 SNP clusters at once). However, it could theoretically move between two different clusters of the same distance over time.
  • The actual maximum genetic distance between samples in a 20/10/5 cluster may be greater than 20, 10, or 5 due to the effects of chaining
    • In some circumstances, chaining leads to rather large 5 SNP clusters which are almost identical to their 10 SNP parent. This isn't really an issue in and of itself, but these clusters are more prone to splitting.
  • A cluster's UUID can be any string, but currently, new cluster IDs are automatically assigned as a zero-padded ("zfilled") six-digit number
    • A 20 SNP cluster's numeric ID will usually be a smaller number than that of its subclusters (if it has any), but this is not a strict rule and should not be relied upon
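The rules above can be expressed as a small sanity check. The sketch below is not taken from the actual scripts: the dictionary layout, `check_hierarchy`, and `next_cluster_id` are hypothetical names chosen for illustration, and the assumption that new IDs count up from the largest existing numeric ID is mine.

```python
EXPECTED_PARENT_DISTANCE = {10: 20, 5: 10}  # child distance -> parent distance


def next_cluster_id(existing_ids):
    """Return a new zero-padded six-digit ID (assumed numbering scheme)."""
    numeric = [int(i) for i in existing_ids if i.isdigit()]
    return str(max(numeric, default=0) + 1).zfill(6)


def check_hierarchy(clusters):
    """Return a list of rule violations (an empty list means all is well).

    `clusters` maps cluster ID -> {"distance": 20 | 10 | 5,
                                   "parent": parent cluster ID or None,
                                   "samples": set of sample IDs}
    """
    problems = []
    seen = {}  # (distance, sample) -> cluster ID that already holds the sample
    for cid, cluster in clusters.items():
        distance = cluster["distance"]
        # A sample may be in only one cluster per distance level
        for sample in sorted(cluster["samples"]):
            key = (distance, sample)
            if key in seen:
                problems.append(
                    f"{sample} is in two {distance} SNP clusters: {seen[key]} and {cid}"
                )
            else:
                seen[key] = cid
        if distance == 20:
            continue  # 20 SNP clusters sit at the top and have no parent
        # 10 SNP clusters need a 20 SNP parent; 5 SNP clusters need a 10 SNP parent
        expected = EXPECTED_PARENT_DISTANCE[distance]
        parent = clusters.get(cluster["parent"])
        if parent is None or parent["distance"] != expected:
            problems.append(f"{cid} lacks a {expected} SNP parent")
        elif not cluster["samples"] <= parent["samples"]:
            problems.append(f"{cid} has samples outside its parent {cluster['parent']}")
    return problems


clusters = {
    "000001": {"distance": 20, "parent": None, "samples": {"s1", "s2", "s3"}},
    "000002": {"distance": 10, "parent": "000001", "samples": {"s1", "s2"}},
    "000003": {"distance": 5, "parent": "000002", "samples": {"s1", "s2"}},
}
print(check_hierarchy(clusters))
print(next_cluster_id(clusters))
```

Here sample s3 is in one cluster, while s1 and s2 are each in three (a 20, a 10 within it, and a 5 within that).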

Tracking changes to clusters

This part is similar in concept to how it works for covid, although the wrapper script is a little different due to tuberculosis having a concept of subclusters.

  1. [usher] We place all CalTBNet samples on a phylogenetic tree
  2. [find_clusters.py] Using the phylogenetic tree from THIS run, we identify clusters and assign them a temporary non-persistent cluster ID (workdir_cluster_id within my scripts). This is done blindly, with no knowledge of persistent cluster IDs from previous runs. This generates a TSV of sample IDs and their associated temporary cluster IDs from THIS run.
  3. [process_clusters.py] A python wrapper calls a perl script which compares the TSV of sample IDs and their temporary cluster IDs from THIS run with a TSV of sample IDs and their associated persistent cluster IDs from the PREVIOUS run. The perl script replaces the temporary cluster IDs from THIS run with the correct persistent cluster IDs.
    • The perl script does its best to handle samples dropping in and out of clusters
    • The perl script also attempts to handle cases where clusters merge, split, or split-and-merge (the split case is rare in tuberculosis)
    • Brand new clusters are given brand new UUIDs by the python wrapper
    • A cluster can lose all of its samples in some circumstances, which may have implications for tracking clusters over time
  4. [process_clusters.py] We now know which samples are part of which clusters, which clusters have been updated, and what those updates are (gaining new samples, losing old samples, etc). See more on Microreact here. This step is unique to tuberculosis, but covid has a rough equivalent where trees are uploaded to taxonium.
  5. [summarize_changes.py] A human-readable summary is generated
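To make the ID-matching step more concrete, here is a deliberately simplified sketch. It is not the perl script's logic, which I have not reproduced and which handles merges and splits more carefully. It assumes a greedy "largest shared-sample overlap wins" rule, and the function name `assign_persistent_ids`, the dictionary inputs, and the numbering of new IDs are all hypothetical.

```python
from collections import Counter, defaultdict


def assign_persistent_ids(this_run, previous_run):
    """Map temporary cluster IDs from this run to persistent IDs.

    this_run:     sample ID -> temporary cluster ID (from THIS run)
    previous_run: sample ID -> persistent cluster ID (from the PREVIOUS run)
    """
    members = defaultdict(set)
    for sample, temp_id in this_run.items():
        members[temp_id].add(sample)

    # Count how many samples each (temporary, persistent) pair shares
    overlaps = Counter()
    for temp_id, samples in members.items():
        for sample in samples:
            if sample in previous_run:
                overlaps[(temp_id, previous_run[sample])] += 1

    # Greedy assumption: the pair sharing the most samples is matched first;
    # ties are broken by ID so the result is reproducible
    mapping, used = {}, set()
    for (temp_id, pid), _count in sorted(
        overlaps.items(), key=lambda kv: (-kv[1], kv[0])
    ):
        if temp_id not in mapping and pid not in used:
            mapping[temp_id] = pid
            used.add(pid)

    # Anything left unmatched is treated as a brand new cluster
    numeric = [int(p) for p in previous_run.values() if p.isdigit()]
    next_num = max(numeric, default=0) + 1
    for temp_id in sorted(members):
        if temp_id not in mapping:
            mapping[temp_id] = str(next_num).zfill(6)
            next_num += 1
    return mapping


previous_run = {"s1": "000001", "s2": "000001", "s3": "000002", "s4": "000002"}
this_run = {
    "s1": "tmpA", "s2": "tmpA", "s5": "tmpA",  # old cluster gains s5
    "s3": "tmpB", "s4": "tmpB",                # unchanged cluster
    "s6": "tmpC", "s7": "tmpC",                # brand new cluster
}
mapping = assign_persistent_ids(this_run, previous_run)
print(mapping)
```

In this toy input, tmpA and tmpB inherit the persistent IDs of the clusters they overlap with, and tmpC receives a new ID.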

Important

For tuberculosis, we do step 3 three times:
a) First of all, we only consider cluster IDs associated with 20 SNP clusters
b) Then we only consider cluster IDs associated with 10 SNP clusters
c) Finally, we only consider cluster IDs associated with 5 SNP clusters
Due to how subclusters work (see above), the parent-child relationships between clusters are evident from the sample IDs within them. Nevertheless, these relationships are also tracked in a final output JSON file generated by process_clusters.py.
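As a rough illustration of why the relationship can be read off the sample IDs, the sketch below infers each cluster's parent by looking for the one cluster at the next distance level up that contains all of its samples. The input layout, the function name `infer_parents`, and the printed JSON shape are hypothetical, not the actual format written by process_clusters.py.

```python
import json

PARENT_LEVEL = {5: 10, 10: 20}  # child distance -> parent distance


def infer_parents(clusters_by_distance):
    """Infer parents from sample membership alone.

    clusters_by_distance: distance (20, 10, 5) -> {cluster ID: set of sample IDs}
    Returns child cluster ID -> parent cluster ID, or None if ambiguous or missing.
    """
    parents = {}
    for distance, parent_distance in PARENT_LEVEL.items():
        candidates = clusters_by_distance.get(parent_distance, {})
        for child_id, child_samples in clusters_by_distance.get(distance, {}).items():
            # The parent is the cluster one level up holding every child sample
            matches = [
                pid for pid, psamples in candidates.items()
                if child_samples <= psamples
            ]
            parents[child_id] = matches[0] if len(matches) == 1 else None
    return parents


clusters_by_distance = {
    20: {"000001": {"s1", "s2", "s3", "s4"}},
    10: {"000002": {"s1", "s2"}, "000003": {"s3", "s4"}},
    5: {"000004": {"s1", "s2"}},
}
parents = infer_parents(clusters_by_distance)
print(json.dumps(parents, indent=2, sort_keys=True))
```

Both 10 SNP clusters resolve to the single 20 SNP cluster, and the 5 SNP cluster resolves to the 10 SNP cluster that shares its samples.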
