Generating Mutational Spectra
Yo Akiyama edited this page Jun 27, 2023
Generating mutational spectra using signatureanalyzer. For a comprehensive description of mutational signatures, their relevance, and references, please see the Catalogue of Somatic Mutations in Cancer (COSMIC), as well as the paper "The repertoire of mutational signatures in human cancer". This document is a quick reference for creating spectra from .maf files.
For more information about the .maf format, see the MAF format documentation. The following examples generate mutational spectra from an hg19 .maf using signatureanalyzer. First, load your .maf as a pandas DataFrame.
The following columns are required:
- Tumor_Sample_Barcode
- Chromosome
- Start_Position
- Reference_Allele
- Tumor_Seq_Allele2
- Variant_Type
The following columns are optional:
- ref_context: the sequence context of each variant; it is mapped from the genome build if not provided
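Before building spectra, it can help to confirm that the required columns listed above are present. The following is a minimal sketch in plain Python; `REQUIRED_MAF_COLUMNS` and `missing_maf_columns` are hypothetical helper names for illustration, not part of signatureanalyzer.

```python
# Required columns, copied from the list above.
REQUIRED_MAF_COLUMNS = [
    "Tumor_Sample_Barcode",
    "Chromosome",
    "Start_Position",
    "Reference_Allele",
    "Tumor_Seq_Allele2",
    "Variant_Type",
]


def missing_maf_columns(columns):
    """Return the required MAF columns absent from `columns`, in listed order."""
    present = set(columns)
    return [name for name in REQUIRED_MAF_COLUMNS if name not in present]


# Example header that lacks Variant_Type.
# With pandas, pass maf_df.columns instead of a plain list.
header = [
    "Hugo_Symbol",
    "Tumor_Sample_Barcode",
    "Chromosome",
    "Start_Position",
    "Reference_Allele",
    "Tumor_Seq_Allele2",
]
print(missing_maf_columns(header))
```

An empty list means every required column was found; the optional ref_context column is deliberately not checked.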
import pandas as pd
import signatureanalyzer as sa

maf_df = pd.read_csv("<PATH_TO_MAF>", sep='\t').loc[:, [  # replace the placeholder with the path to your .maf
    'Hugo_Symbol',
    'Tumor_Sample_Barcode',
    'Chromosome',
    'Start_Position',
    'Reference_Allele',
    'Tumor_Seq_Allele2',
    'Variant_Type'
]]
print(maf_df.head())
| | Hugo_Symbol | Tumor_Sample_Barcode | Chromosome | Start_Position | Reference_Allele | Tumor_Seq_Allele2 | Variant_Type |
|---|---|---|---|---|---|---|---|
| 0 | URGCP | sample_192 | 7 | 43916856 | C | A | SNP |
| 1 | CLCN1 | sample_127 | 7 | 143048832 | C | A | SNP |
| 2 | NAV2 | sample_354 | 11 | 20113762 | A | G | SNP |
| 3 | TUBB8P7 | sample_32 | 16 | 90162224 | G | T | SNP |
| ... | CHFR | sample_35 | 12 | 133438086 | C | A | SNP |
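The Variant_Type column determines which kind of spectrum a row can contribute to. As a rough sketch, the tally below groups variants into substitution, doublet, and indel classes. The mapping from MAF Variant_Type values to classes is an assumption made for illustration and is not taken from signatureanalyzer's source; `VARIANT_CLASS` and `tally_variant_classes` are hypothetical names.

```python
from collections import Counter

# Assumed mapping from MAF Variant_Type values to mutation classes.
# This is an illustration, not signatureanalyzer's own logic.
VARIANT_CLASS = {"SNP": "SBS", "DNP": "DBS", "INS": "ID", "DEL": "ID"}


def tally_variant_classes(variant_types):
    """Count variants per mutation class; unmapped types are counted as 'other'."""
    return Counter(VARIANT_CLASS.get(v, "other") for v in variant_types)


# With pandas, pass maf_df["Variant_Type"]; a plain list works the same way.
counts = tally_variant_classes(["SNP", "SNP", "DEL", "DNP", "INS", "TNP"])
print(dict(counts))
```

A .maf with no rows in a given class would be expected to yield an all-zero spectrum for that class, as in the doublet-base example table on this page.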
- This encodes the 96-channel single-base substitution (SBS) context: each substitution with one flanking base on either side
- note: two forms of this exist; either input should work
  - word: ACAG --> (REF)(MUT)(LEFT NT)(RIGHT NT)
  - arrow: A[A>C]G --> (LEFT NT)[(REF)>(MUT)](RIGHT NT)
- REQUIRES a 2-bit human genome build
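The word and arrow forms carry the same information, so one can be derived from the other. A minimal sketch of the conversion, following the layout given above (word = REF, MUT, LEFT NT, RIGHT NT); `word_to_arrow` and `arrow_to_word` are hypothetical helpers, not signatureanalyzer functions.

```python
def word_to_arrow(word):
    """Convert a 4-letter word (REF, MUT, LEFT, RIGHT) to arrow form."""
    if len(word) != 4:
        raise ValueError(f"expected a 4-letter word, got {word!r}")
    ref, mut, left, right = word
    return f"{left}[{ref}>{mut}]{right}"


def arrow_to_word(arrow):
    """Convert arrow form LEFT[REF>MUT]RIGHT back to the 4-letter word."""
    well_formed = (
        len(arrow) == 7
        and arrow[1] == "["
        and arrow[3] == ">"
        and arrow[5] == "]"
    )
    if not well_formed:
        raise ValueError(f"expected the form X[Y>Z]W, got {arrow!r}")
    left, ref, mut, right = arrow[0], arrow[2], arrow[4], arrow[6]
    return ref + mut + left + right


print(word_to_arrow("ACAG"))     # A[A>C]G
print(arrow_to_word("A[A>C]G"))  # ACAG
```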
_,spectra_sbs = sa.spectra.get_spectra_from_maf(maf_df, reference='cosmic3_exome', hgfile='hg19.2bit')
print(spectra_sbs.head().iloc[:,:5])
Arrow
| | sample_551 | sample_135 | sample_118 | sample_191 | sample_124 |
|---|---|---|---|---|---|
| A[A>C]A | 1 | 2 | 0 | 1 | 2 |
| A[A>C]C | 0 | 2 | 0 | 0 | 0 |
| A[A>C]G | 0 | 6 | 2 | 2 | 0 |
| A[A>C]T | 0 | 4 | 0 | 0 | 2 |
| ... | 1 | 2 | 0 | 1 | 0 |
or
Word
| context96.word | sample_551 | sample_135 | sample_118 | sample_191 | sample_124 |
|---|---|---|---|---|---|
| ACAA | 1 | 2 | 0 | 1 | 2 |
| ACAC | 0 | 2 | 0 | 0 | 0 |
| ACAG | 0 | 6 | 2 | 2 | 0 |
| ACAT | 0 | 4 | 0 | 0 | 2 |
| ... | 1 | 2 | 0 | 1 | 0 |
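The figure of 96 comes from 6 substitution classes times 4 possible left bases times 4 possible right bases. The sketch below enumerates the arrow-form labels under the assumption that substitutions are written with a pyrimidine reference base (C or T), as in COSMIC; the labels and row order that signatureanalyzer actually returns may differ. `enumerate_contexts` is a hypothetical helper.

```python
from itertools import product

BASES = "ACGT"
# Assumption: pyrimidine-reference substitution classes, as used by COSMIC.
SUBSTITUTIONS = [
    ("C", "A"), ("C", "G"), ("C", "T"),
    ("T", "A"), ("T", "C"), ("T", "G"),
]


def enumerate_contexts(flank=1):
    """List arrow-form labels with `flank` bases on each side of the substitution."""
    labels = []
    for ref, mut in SUBSTITUTIONS:
        for left in product(BASES, repeat=flank):
            for right in product(BASES, repeat=flank):
                labels.append(f"{''.join(left)}[{ref}>{mut}]{''.join(right)}")
    return labels


print(len(enumerate_contexts(flank=1)))  # 96
print(len(enumerate_contexts(flank=2)))  # 1536
```

With two flanking bases per side, the same count gives 6 x 16 x 16 = 1536 contexts.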
- This encodes the 78-channel doublet-base substitution (DBS) context
_,spectra_dbs = sa.spectra.get_spectra_from_maf(maf_df, reference='cosmic3_DBS')
print(spectra_dbs.head())
| context78.word | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 |
|---|---|---|---|---|---|
| AC>CA | 0 | 0 | 0 | 0 | 0 |
| AC>CG | 0 | 0 | 0 | 0 | 0 |
| AC>CT | 0 | 0 | 0 | 0 | 0 |
| AC>GA | 0 | 0 | 0 | 0 | 0 |
| ... | 0 | 0 | 0 | 0 | 0 |
- This encodes the 83-channel small insertion and deletion (ID) context
- REQUIRES a 2-bit human genome build
_,spectra_id = sa.spectra.get_spectra_from_maf(maf_df, reference='cosmic3_ID', hgfile='hg19.2bit')
print(spectra_id.head().iloc[:,:5])
| context83.word | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 |
|---|---|---|---|---|---|
| Cdel1 | 0 | 1 | 3 | 1 | 0 |
| Cdel2 | 0 | 4 | 3 | 6 | 0 |
| Cdel3 | 2 | 2 | 2 | 9 | 0 |
| Cdel4 | 0 | 0 | 0 | 3 | 0 |
| ... | 0 | 0 | 1 | 3 | 0 |
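Several of the calls on this page need a 2-bit genome build passed as hgfile. A quick sanity check before a long run is to confirm that the file starts with the .2bit signature, which UCSC documents as the 32-bit value 0x1A412743 stored in either byte order. This is a minimal sketch; `looks_like_2bit` is a hypothetical helper, and it checks only the signature, not the genome build.

```python
import os
import struct

TWOBIT_SIGNATURE = 0x1A412743  # magic number at the start of a UCSC .2bit file


def looks_like_2bit(path):
    """Return True if the first four bytes match the .2bit signature in either byte order."""
    with open(path, "rb") as handle:
        header = handle.read(4)
    if len(header) < 4:
        return False
    little = struct.unpack("<I", header)[0]
    big = struct.unpack(">I", header)[0]
    return TWOBIT_SIGNATURE in (little, big)


# Only runs the check if the file used in the examples is present.
if os.path.exists("hg19.2bit"):
    print(looks_like_2bit("hg19.2bit"))
```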
- This encodes the 1536-channel SBS context: each substitution with two flanking bases on either side
- note: two forms of this exist; either input should work
  - word: ACAGAT --> (REF)(MUT)(L-2)(L-1)(R+1)(R+2)
  - arrow: AG[A>C]AT --> (L-2)(L-1)[(REF)>(MUT)](R+1)(R+2)
- REQUIRES a 2-bit human genome build
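The two pentanucleotide forms can be converted into one another in the same way as the trinucleotide ones. The sketch below takes the word layout (REF)(MUT)(L-2)(L-1)(R+1)(R+2) at face value and uses the pair TAAAAC / AA[T>A]AC from the example tables on this page; `word6_to_arrow` and `arrow_to_word6` are hypothetical helpers, not signatureanalyzer functions.

```python
def word6_to_arrow(word):
    """Convert a 6-letter word (REF, MUT, L-2, L-1, R+1, R+2) to arrow form."""
    if len(word) != 6:
        raise ValueError(f"expected a 6-letter word, got {word!r}")
    ref, mut, l2, l1, r1, r2 = word
    return f"{l2}{l1}[{ref}>{mut}]{r1}{r2}"


def arrow_to_word6(arrow):
    """Convert (L-2)(L-1)[REF>MUT](R+1)(R+2) back to the 6-letter word."""
    left, rest = arrow.split("[")    # flanking bases before the bracket
    change, right = rest.split("]")  # "REF>MUT" and the bases after the bracket
    ref, mut = change.split(">")
    if not (len(left) == len(right) == 2 and len(ref) == len(mut) == 1):
        raise ValueError(f"expected the form XX[Y>Z]WW, got {arrow!r}")
    return ref + mut + left + right


print(word6_to_arrow("TAAAAC"))     # AA[T>A]AC
print(arrow_to_word6("AA[T>A]AC"))  # TAAAAC
```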
_,spectra_sbs = sa.spectra.get_spectra_from_maf(maf_df, reference='pcawg_SBS', hgfile='hg19.2bit')
print(spectra_sbs.head().iloc[:,:5])
Arrow
| | sample_551 | sample_135 | sample_118 | sample_191 | sample_124 |
|---|---|---|---|---|---|
| AA[T>A]AA | 1 | 2 | 0 | 1 | 2 |
| AA[T>A]AC | 0 | 2 | 0 | 0 | 0 |
| AA[T>A]AG | 0 | 6 | 2 | 2 | 0 |
| AA[T>A]AT | 0 | 4 | 0 | 0 | 2 |
| ... | 1 | 2 | 0 | 1 | 0 |
or
Word
| context1536.word | sample_551 | sample_135 | sample_118 | sample_191 | sample_124 |
|---|---|---|---|---|---|
| TAAAAA | 1 | 2 | 0 | 1 | 2 |
| TAAAAC | 0 | 2 | 0 | 0 | 0 |
| TAAAAG | 0 | 6 | 2 | 2 | 0 |
| TAAAAT | 0 | 4 | 0 | 0 | 2 |
| ... | 1 | 2 | 0 | 1 | 0 |
- This encodes the 1536-channel SBS + 78-channel DBS + 83-channel ID contexts in a single table
- REQUIRES a 2-bit human genome build
- Either SBS form should work (Arrow vs Word)
_,spectra_id = sa.spectra.get_spectra_from_maf(maf_df, reference='pcawg_COMPOSITE', hgfile='hg19.2bit')
print(spectra_id.head().iloc[:,:5])
| context.pcawg | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 |
|---|---|---|---|---|---|
| AA[T>A]AA | 0 | 1 | 3 | 1 | 0 |
| AA[T>A]AC | 0 | 4 | 3 | 6 | 0 |
| AA[T>A]AG | 2 | 2 | 2 | 9 | 0 |
| AA[T>A]AT | 0 | 0 | 0 | 3 | 0 |
| ... | 0 | 0 | 0 | 0 | 0 |
| AC>CA | 0 | 0 | 0 | 3 | 0 |
| AC>CG | 0 | 0 | 0 | 3 | 0 |
| ... | 0 | 0 | 1 | 3 | 0 |
| Cdel1 | 0 | 1 | 3 | 1 | 0 |
| Cdel2 | 0 | 4 | 3 | 6 | 0 |
| Cdel3 | 2 | 2 | 2 | 9 | 0 |
| Cdel4 | 0 | 0 | 0 | 3 | 0 |
| ... | 0 | 0 | 1 | 3 | 0 |
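Because a composite table stacks three kinds of rows, it is sometimes useful to split the index back into its parts. The sketch below classifies labels purely by their shape, based on the example rows shown above (bracketed arrow labels, `XX>YY` doublets, everything else treated as an indel channel). That last fallback is an assumption, and `classify_label` and `group_labels` are hypothetical helpers.

```python
import re

# One or two flanking bases on each side of a bracketed substitution.
SBS_PATTERN = re.compile(r"^[ACGT]{1,2}\[[ACGT]>[ACGT]\][ACGT]{1,2}$")
# Two reference bases, an arrow, two mutated bases.
DBS_PATTERN = re.compile(r"^[ACGT]{2}>[ACGT]{2}$")


def classify_label(label):
    """Return 'SBS', 'DBS', or 'ID' for a composite row label."""
    if SBS_PATTERN.match(label):
        return "SBS"
    if DBS_PATTERN.match(label):
        return "DBS"
    return "ID"  # assumption: anything else is an indel channel


def group_labels(labels):
    """Group row labels by mutation class, preserving their order."""
    groups = {"SBS": [], "DBS": [], "ID": []}
    for label in labels:
        groups[classify_label(label)].append(label)
    return groups


# With pandas, pass the index of the returned spectra DataFrame.
print(group_labels(["AA[T>A]AA", "AC>CA", "Cdel1", "A[C>A]T"]))
```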
- This encodes the 96-channel SBS + 78-channel DBS + 83-channel ID contexts in a single table
- REQUIRES a 2-bit human genome build
- Either SBS form should work (Arrow vs Word)
_,spectra_id = sa.spectra.get_spectra_from_maf(maf_df, reference='pcawg_COMPOSITE96', hgfile='hg19.2bit')
print(spectra_id.head().iloc[:,:5])
| context.pcawg | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 |
|---|---|---|---|---|---|
| A[C>A]A | 0 | 1 | 3 | 1 | 0 |
| A[C>A]C | 0 | 4 | 3 | 6 | 0 |
| A[C>A]G | 2 | 2 | 2 | 9 | 0 |
| A[C>A]T | 0 | 0 | 0 | 3 | 0 |
| ... | 0 | 0 | 0 | 0 | 0 |
| AC>CA | 0 | 0 | 0 | 3 | 0 |
| AC>CG | 0 | 0 | 0 | 3 | 0 |
| ... | 0 | 0 | 1 | 3 | 0 |
| Cdel1 | 0 | 1 | 3 | 1 | 0 |
| Cdel2 | 0 | 4 | 3 | 6 | 0 |
| Cdel3 | 2 | 2 | 2 | 9 | 0 |
| Cdel4 | 0 | 0 | 0 | 3 | 0 |
| ... | 0 | 0 | 1 | 3 | 0 |
- This encodes the 1536-channel SBS + 83-channel ID contexts in a single table
- REQUIRES a 2-bit human genome build
- Either SBS form should work (Arrow vs Word)
_,spectra_id = sa.spectra.get_spectra_from_maf(maf_df, reference='pcawg_SBS_ID', hgfile='hg19.2bit')
print(spectra_id.head().iloc[:,:5])
| context.pcawg | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 |
|---|---|---|---|---|---|
| AA[T>A]AA | 0 | 1 | 3 | 1 | 0 |
| AA[T>A]AC | 0 | 4 | 3 | 6 | 0 |
| AA[T>A]AG | 2 | 2 | 2 | 9 | 0 |
| AA[T>A]AT | 0 | 0 | 0 | 3 | 0 |
| ... | 0 | 0 | 1 | 3 | 0 |
| Cdel1 | 0 | 1 | 3 | 1 | 0 |
| Cdel2 | 0 | 4 | 3 | 6 | 0 |
| Cdel3 | 2 | 2 | 2 | 9 | 0 |
| Cdel4 | 0 | 0 | 0 | 3 | 0 |
| ... | 0 | 0 | 1 | 3 | 0 |
- This encodes the 96-channel SBS + 83-channel ID contexts in a single table
- REQUIRES a 2-bit human genome build
- Either SBS form should work (Arrow vs Word)
_,spectra_id = sa.spectra.get_spectra_from_maf(maf_df, reference='pcawg_SBS96_ID', hgfile='hg19.2bit')
print(spectra_id.head().iloc[:,:5])
| context.pcawg | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 |
|---|---|---|---|---|---|
| A[C>A]A | 0 | 1 | 3 | 1 | 0 |
| A[C>A]C | 0 | 4 | 3 | 6 | 0 |
| A[C>A]G | 2 | 2 | 2 | 9 | 0 |
| A[C>A]T | 0 | 0 | 0 | 3 | 0 |
| ... | 0 | 0 | 0 | 0 | 0 |
| Cdel1 | 0 | 1 | 3 | 1 | 0 |
| Cdel2 | 0 | 4 | 3 | 6 | 0 |
| Cdel3 | 2 | 2 | 2 | 9 | 0 |
| Cdel4 | 0 | 0 | 0 | 3 | 0 |
| ... | 0 | 0 | 1 | 3 | 0 |
- This encodes the 1536-channel SBS context plus 8 insertion/deletion (INS/DEL) channels
- REQUIRES a 2-bit human genome build
_,spectra_id = sa.spectra.get_spectra_from_maf(maf_df, reference='polymerase_msi', hgfile='hg19.2bit')
print(spectra_id.head().iloc[:,:5])
| context.pcawg | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 |
|---|---|---|---|---|---|
| AA[T>A]AA | 0 | 1 | 3 | 1 | 0 |
| AA[T>A]AC | 0 | 4 | 3 | 6 | 0 |
| AA[T>A]AG | 2 | 2 | 2 | 9 | 0 |
| AA[T>A]AT | 0 | 0 | 0 | 3 | 0 |
| ... | 0 | 0 | 1 | 3 | 0 |
| INS4 | 0 | 1 | 3 | 1 | 0 |
| DEL1 | 0 | 4 | 3 | 6 | 0 |
| DEL3 | 2 | 2 | 2 | 9 | 0 |
| DEL4 | 0 | 0 | 0 | 3 | 0 |
- This encodes the 96-channel SBS context plus 8 insertion/deletion (INS/DEL) channels
- REQUIRES a 2-bit human genome build
_,spectra_id = sa.spectra.get_spectra_from_maf(maf_df, reference='polymerase_msi96', hgfile='hg19.2bit')
print(spectra_id.head().iloc[:,:5])
| context.pcawg | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 |
|---|---|---|---|---|---|
| A[C>A]A | 0 | 1 | 3 | 1 | 0 |
| A[C>A]C | 0 | 4 | 3 | 6 | 0 |
| A[C>A]G | 2 | 2 | 2 | 9 | 0 |
| A[C>A]T | 0 | 0 | 0 | 3 | 0 |
| ... | 0 | 0 | 1 | 3 | 0 |
| INS4 | 0 | 1 | 3 | 1 | 0 |
| DEL1 | 0 | 4 | 3 | 6 | 0 |
| DEL3 | 2 | 2 | 2 | 9 | 0 |
| DEL4 | 0 | 0 | 0 | 3 | 0 |
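As a summary, the context sizes quoted for each reference on this page can be added up to give the number of rows one might expect in the resulting spectra table. The sketch below only does that arithmetic. Whether signatureanalyzer returns exactly this many rows is an assumption, and polymerase_msi96 is taken as 96 + 8, matching the trinucleotide rows in its example table.

```python
# Context sizes as quoted on this page.
SBS96, SBS1536, DBS, ID, MSI = 96, 1536, 78, 83, 8

# Sum of the context sizes stated for each `reference` value.
EXPECTED_CHANNELS = {
    "cosmic3_exome": SBS96,
    "cosmic3_DBS": DBS,
    "cosmic3_ID": ID,
    "pcawg_SBS": SBS1536,
    "pcawg_COMPOSITE": SBS1536 + DBS + ID,
    "pcawg_COMPOSITE96": SBS96 + DBS + ID,
    "pcawg_SBS_ID": SBS1536 + ID,
    "pcawg_SBS96_ID": SBS96 + ID,
    "polymerase_msi": SBS1536 + MSI,
    "polymerase_msi96": SBS96 + MSI,  # assumption: 96 + 8, per its example rows
}

for reference, n_channels in EXPECTED_CHANNELS.items():
    print(f"{reference}: {n_channels}")
```

Comparing `len(spectra)` against the matching entry is a cheap way to notice when the wrong reference was passed.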