
Clustering

Ash O'Farrell edited this page Dec 30, 2025 · 4 revisions

Definition of a cluster

This definition is not the same as our definition of a cluster for covid, nor does it match the definition of a cluster that some epidemiologists use.

A cluster is a set of samples in which every sample is within a genetic distance of X (the "nominal maximum genetic distance") of at least one other sample in that cluster. Genetic distance is derived from branch lengths on the phylogenetic tree onto which UShER places the samples. For tuberculosis, epidemiologists tend to use a value of X=12 SNPs or X=10 SNPs.

What is branch length?

For tuberculosis, UShER uses MAPLE-formatted diff files for its samples. These files are like VCFs but much smaller and simpler. Variation is represented as either SNP or mask (for the most part, indels are masked), and anything not mentioned in the file is considered reference.

Strictly speaking, branch length is not necessarily the exact same value as the number of SNPs between two samples. For example, if sample A calls SNP T at some position n, but sample B lacks sufficient coverage at position n, we can't be sure if B "matches" sample A. It could be reference, it could share A's SNP, in rare cases it could even be an entirely different SNP (we've seen it happen)! As such, the SNP count is a little ambiguous, and we instead rely upon branch length determined by UShER.
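The ambiguity described above can be illustrated with a minimal sketch. It does not parse real MAPLE diff files and is not how UShER computes branch length. It assumes a hypothetical in-memory representation (a dict of called SNPs and a set of masked positions per sample) and only counts positions where at least one of the two samples has a called SNP.

```python
def count_differences(calls_a, calls_b, masked_a, masked_b):
    """Count definite and ambiguous SNP differences between two samples.

    calls_*  : dict of position -> called base (hypothetical representation);
               any position not listed is treated as reference
    masked_* : set of positions with insufficient coverage
    Returns (definite, ambiguous).
    """
    definite = 0
    ambiguous = 0
    for pos in set(calls_a) | set(calls_b):
        if pos in masked_a or pos in masked_b:
            # One sample has a call but the other is masked here, so we
            # cannot tell whether the two samples match at this position.
            ambiguous += 1
        elif calls_a.get(pos, "ref") != calls_b.get(pos, "ref"):
            # Both samples have coverage and they disagree.
            definite += 1
    return definite, ambiguous


# Sample A calls T at position 100, where sample B is masked.
sample_a = {100: "T", 200: "G"}
sample_b = {200: "G", 300: "C"}
print(count_differences(sample_a, sample_b, set(), {100}))
```

Here position 300 is a definite difference, position 200 is shared, and position 100 cannot be resolved, which is why a single "SNP count" between A and B is not well defined.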

Nevertheless, sometimes we casually refer to clusters as "20 SNP clusters" etc. when what is actually meant is "clusters for which all samples are within a genetic distance of 20 to at least one other sample within that cluster." Going forward I will try to use more precise terminology like "20-distance cluster" or "20-dis cluster" or simply "20 cluster."

Important

It is possible, and valid, for the branch length between near-identical samples to be zero. However, as of Dec 2025, there is a known bug in Microreact preventing it from rendering nwks correctly if the nwk's branch lengths are all zero. The latest version of Tree Nine attempts to detect when this is happening and show a warning to the user.

What is NOT considered in the definition of a cluster?

Geographic location, date of sample isolation, and shared household status are epidemiologically important details, but we currently do not have them. As such, they have no influence on our definition of a cluster. However, if that metadata is added in the future, it could be displayed in Microreact.

Chaining

Consider samples A, B, and C. A:B = 9 SNPs apart, B:C = 4 SNPs apart, A:C = 12 SNPs apart. If we were only considering A and C, they would not be within a 10 SNP cluster. However, since we're considering B, all three of them are considered part of a 10 SNP cluster. B is "chaining" A and C together.
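Chaining falls out naturally if clusters are treated as connected groups of samples linked by any pairwise distance at or below the threshold. The sketch below is only an illustration of that idea using the A/B/C numbers above. It is not the code in find_clusters.py, the function name is hypothetical, and treating "within X" as "less than or equal to X" is an assumption.

```python
def clusters_at(distances, threshold):
    """Group samples linked by any pairwise distance <= threshold.

    distances: dict of (sample, sample) -> genetic distance
    Returns a list of sets; samples that link to nothing are left out.
    """
    samples = sorted({s for pair in distances for s in pair})
    parent = {s: s for s in samples}

    def find(s):
        # Follow parent pointers to the representative of s's group.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for (a, b), dist in distances.items():
        if dist <= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for s in samples:
        groups.setdefault(find(s), set()).add(s)
    return [g for g in groups.values() if len(g) > 1]


with_b = {("A", "B"): 9, ("B", "C"): 4, ("A", "C"): 12}
without_b = {("A", "C"): 12}
print(clusters_at(with_b, 10))     # A, B and C end up in one cluster
print(clusters_at(without_b, 10))  # no cluster: A and C are 12 apart
```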

Applying this definition to CalTBNet

For CalTBNet, we have decided to use X=20 as our nominal maximum genetic distance. Within these 20-dis clusters, we additionally search for 10-dis subclusters, and within those, 5-dis subsubclusters. Currently, CDPH is mostly interested in 10-dis clusters, as the current science indicates that samples that do not cluster within 10 SNPs are less likely to be indicative of direct transmission.

Every cluster (be it 20, 10, or 5) is given a unique persistent ID. To properly keep track of clusters across runs, we maintain a list of persistent IDs for each cluster and the sample IDs of samples known to be in that cluster. In this sense, our persistent clusters are defined by the sample IDs of samples within them. For this reason, we must assume that the genetic information a sample ID represents is immutable (in other words: we should never overwrite nor reuse a sample ID).

However, UShER is non-deterministic. The addition of new samples, or even just random chance, can influence placement of old samples on the tree. Ironically, this means the genetic distance between two immutable samples is mutable. For this reason, samples sometimes move in and out of clusters.

Consequences/Implementation details

  • All 5-dis clusters have a 10-dis cluster parent
  • All 10-dis clusters have a 20-dis cluster parent
  • A parent cluster can have multiple children, but a child cluster can only have one (direct) parent
  • It is possible for a given sample ID to be in 0, 1, 2, or 3 clusters
    • 0: The sample doesn't cluster with anything
    • 1: The sample is in a 20 cluster
    • 2: The sample is in a 20 cluster, and a 10 cluster within that 20 cluster
    • 3: The sample is in a 20 cluster, and a 10 cluster within that 20 cluster, and a 5 cluster within that 10 cluster
    • It is not possible for a sample to be part of two different clusters of the same genetic distance; i.e., a sample cannot be in two different 20 SNP clusters at the same time. However, it could theoretically move between two different clusters of the same distance over time.
  • It is possible for a child cluster to have the exact same samples as its parent cluster
    • Example: If A and B have a branch length of two between them, but nothing else is anywhere near them, they will form a 20 cluster, a 10 cluster, and a 5 cluster. All three clusters will be given unique IDs even though they are essentially identical, containing the same two samples and by extension the same nwks and distance matrices... because in the future, more samples might get added that shake things up (perhaps new sample C will be within 20 of A and B but not within 10 of either).
  • The largest pairwise genetic distance within a 20/10/5 cluster may not actually be 20, 10, or 5; chaining can push it above the nominal value
    • We refer to the largest value in a cluster's distance matrix as its "matrix max"
      • It is possible for matrix max to be 0
    • In some circumstances, chaining leads to rather large 5 SNP clusters which are almost identical to their 10 SNP parent. This isn't really an issue in and of itself, but these clusters are more prone to splitting.
  • A cluster's UUID can be any string, but currently, new cluster IDs are automatically assigned as a zero-padded (zfilled) six-digit number
    • A 20 SNP cluster's numeric ID will usually be a smaller number than that of its subclusters (if it has any), but this is not a strict rule and should not be relied upon
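Two of the points above, matrix max and zero-padded IDs, are easy to show in a few lines. This is a sketch only: both function names are hypothetical, the distance matrix is assumed to be a plain list of lists, and picking "highest existing number plus one" as the next ID is an assumption rather than a description of what process_clusters.py does.

```python
def matrix_max(matrix):
    """Largest pairwise distance in a square distance matrix."""
    return max(max(row) for row in matrix)


def next_cluster_id(existing_ids):
    """Propose a new six-digit, zero-padded ID (assumed numbering scheme)."""
    # Cluster IDs can be any string, so ignore non-numeric ones here.
    numeric = [int(i) for i in existing_ids if i.isdigit()]
    return str(max(numeric, default=0) + 1).zfill(6)


# Distance matrix for samples A, B, C where A:B=9, B:C=4, A:C=12.
# With a nominal distance of 10 these chain into one cluster,
# yet the matrix max is 12.
chained = [
    [0, 9, 12],
    [9, 0, 4],
    [12, 4, 0],
]
print(matrix_max(chained))

# Two identical samples: a matrix max of 0 is valid.
print(matrix_max([[0, 0], [0, 0]]))

print(next_cluster_id(["000001", "000002", "legacy-x"]))
```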

Tracking changes to clusters

This part is similar in concept to how it works for covid, although the wrapper script is a little different due to tuberculosis having a concept of subclusters.

  1. [usher] We place all CalTBNet samples on a phylogenetic tree
  2. [find_clusters.py] Using the phylogenetic tree from THIS run, we identify clusters and assign them a temporary non-persistent cluster ID (workdir_cluster_id within my scripts). This is done blindly, with no knowledge of persistent cluster IDs from previous runs. This generates a TSV of sample IDs and their associated temporary cluster IDs from THIS run.
  3. [process_clusters.py] A python wrapper calls a perl script which compares the TSV of sample IDs and their temporary cluster IDs from THIS run with a TSV of sample IDs and their associated persistent cluster IDs from the PREVIOUS run. The perl script replaces the temporary cluster IDs from THIS run with the correct persistent cluster IDs.
    • The perl script does its best to handle samples dropping in and out of clusters
    • The perl script also attempts to handle cases where clusters merge, split, or split-and-merge (the split case is rare in tuberculosis)
    • Brand new clusters are given brand new UUIDs by the python wrapper
    • A cluster can lose all of its samples in some circumstances, which may have implications for tracking clusters over time
  4. [process_clusters.py] We now know which samples are part of which clusters, which clusters have been updated, and what those updates are (gaining new samples, losing old samples, etc). See more on Microreact here. This step is unique to tuberculosis, but covid has a rough equivalent where trees are uploaded to taxonium.
  5. [summarize_changes.py] A human-readable summary is generated
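The core idea of step 3, carrying persistent IDs forward by comparing sample membership between runs, can be sketched as follows. This is a deliberately simplified stand-in, not the perl script's actual logic: it uses a greedy "largest shared-sample overlap wins" rule, ignores the merge and split cases entirely, and every name in it is hypothetical.

```python
def assign_persistent_ids(current, previous, next_number):
    """Map this run's temporary cluster IDs to persistent IDs.

    current     : dict of temporary ID -> set of sample IDs (THIS run)
    previous    : dict of persistent ID -> set of sample IDs (PREVIOUS run)
    next_number : first unused number for brand new clusters
    Returns (dict of temporary ID -> persistent ID, next unused number).
    """
    # Every (overlap size, temporary ID, persistent ID) with shared samples,
    # largest overlap first; ties are broken alphabetically for determinism.
    candidates = sorted(
        (
            (len(cur & prev), temp_id, pid)
            for temp_id, cur in current.items()
            for pid, prev in previous.items()
            if cur & prev
        ),
        key=lambda c: (-c[0], c[1], c[2]),
    )
    assigned, used = {}, set()
    for _, temp_id, pid in candidates:
        if temp_id not in assigned and pid not in used:
            assigned[temp_id] = pid
            used.add(pid)
    # Anything left shares no samples with an unclaimed old cluster: new ID.
    for temp_id in sorted(current):
        if temp_id not in assigned:
            assigned[temp_id] = str(next_number).zfill(6)
            next_number += 1
    return assigned, next_number


previous = {"000001": {"s1", "s2", "s3"}, "000002": {"s4", "s5"}}
current = {
    "t1": {"s1", "s2", "s3", "s6"},  # 000001 gained a sample
    "t2": {"s4", "s5"},              # unchanged
    "t3": {"s7", "s8"},              # brand new cluster
}
mapping, next_number = assign_persistent_ids(current, previous, 3)
print(mapping)
```

In this toy run, t1 and t2 keep the persistent IDs of the clusters they overlap with, and t3 receives a fresh ID.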

Important

For tuberculosis, we do step 3 three times:
a) First of all, we only consider cluster IDs associated with 20 clusters
b) Then we only consider cluster IDs associated with 10 clusters
c) Finally, we only consider cluster IDs associated with 5 clusters
Due to how subclusters work (see above), the actual parent-child relationship of clusters is self-evident by the sample IDs within them. Nevertheless, parent-child relationship is also tracked on a final output JSON file generated by process_clusters.py
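As a small illustration of why the parent-child relationship is self-evident from sample IDs: a child cluster's samples are always a subset of its parent's, and a sample cannot be in two clusters of the same distance, so at most one candidate parent can match. The helper below is hypothetical and is not taken from process_clusters.py.

```python
def find_parent(child_samples, candidate_parents):
    """Return the ID of the cluster that contains every child sample.

    child_samples     : set of sample IDs in the child cluster
    candidate_parents : dict of cluster ID -> set of sample IDs, all at the
                        next-larger distance (e.g. 20 clusters for a 10 cluster)
    Returns None if no candidate contains the child.
    """
    for cluster_id, samples in candidate_parents.items():
        if child_samples <= samples:  # subset test
            return cluster_id
    return None


twenty_clusters = {"000001": {"A", "B", "C"}, "000004": {"D", "E"}}
print(find_parent({"A", "B"}, twenty_clusters))
```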
