This document summarizes the two main components of the bioinformatics analysis used to generate and parse data for the paper "Deep diversification of an AAV capsid protein by machine learning". For machine learning models see this.
A processed version of the data is available in the data folder (this should look similar to what the processing pipeline outputs). For additional annotation (e.g. model scores) and training data, browse through these datasets. For raw sequencing data see NCBI. Additional metadata and artifacts to reproduce the results can be found in this Dropbox link (too big to host on GitHub; NCBI did not support this directory structure).
- Synthesis pipeline
- Step 1: Assembles the nucleotide sequence for the corresponding protein sequence variants such that it can be generated and processed with the desired cloning strategy.
- Step 2: Tests the dataframe produced by Step 1 to ensure that the library has the intended composition, while additionally testing that the correct RE sites are in each sequence. This step produces the files that are sent to Agilent for synthesis.
- Step 3: Simulates the cloning process in silico, to ensure that the library can be successfully produced with the set of primers, plasmid backbone, and other molecular parameters.
- Parsing pipeline
- Step 1: Merge fastq files using PEAR.
- Step 2: Count the number of variants across sequencing files.
- Step 3: Compute selection scores based on the raw count files.
Details below.
Takes the AA sequences designed by ML and produces nucleotide sequences to be printed for synthesis such that they are compatible with our cloning strategy.
Pandas
Numpy
Biopython
PyDNA
editdistance
Assembles the nucleotide sequence for the corresponding protein sequence variants such that it can be generated and processed with the desired cloning strategy.
Note: We included barcodes in our original design but never used them as identifiers for variants.
Barcode designs:
- barcodes16-1.txt: from John A. Hawkins et al., PNAS 2018, https://www.pnas.org/content/115/27/E6217 (not used for analysis)

or, if barcodes are already chosen:
- c1barcodes16-1_app_BsrBI.txt: a selected group of barcodes compatible with our cloning strategy.
Designed Variants:
- chip1_GAS_nredundant.csv: the ML-designed variants
- backfill_random_doubles.csv: random doubles to backfill the chip if there is room
- singles.csv: set of all single mutations to the WT
Primer files:
- skpp15-forward.fasta: forward primers
- skpp15-reverse.fasta: reverse primers
Outputs:
- chip_df.csv: contains the library sequences
- [Optional] c1barcodes16-1_app_BsrBI.txt: the selected barcodes
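The assembly idea in this step can be pictured roughly as follows. This is a minimal sketch, not the repository's implementation: the codon table, the primer strings, and the function names `back_translate` and `assemble_oligo` are hypothetical, and a real design also has to account for barcodes and restriction sites.

```python
# Hypothetical one-codon-per-residue table (illustrative only; not the
# codon choice used for the actual library).
CODONS = {
    "A": "GCC", "C": "TGC", "D": "GAC", "E": "GAG", "F": "TTC",
    "G": "GGC", "H": "CAC", "I": "ATC", "K": "AAG", "L": "CTG",
    "M": "ATG", "N": "AAC", "P": "CCC", "Q": "CAG", "R": "AGA",
    "S": "AGC", "T": "ACC", "V": "GTG", "W": "TGG", "Y": "TAC",
}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def back_translate(protein):
    """Map each amino acid to its single codon from CODONS."""
    return "".join(CODONS[aa] for aa in protein)


def assemble_oligo(protein, fwd_primer, rev_primer):
    """Flank the coding sequence with the forward primer and the
    reverse complement of the reverse primer."""
    return fwd_primer + back_translate(protein) + reverse_complement(rev_primer)


# Placeholder primers; the real ones come from the skpp15 FASTA files.
oligo = assemble_oligo("MKV", "ACGTAC", "AACCGG")
print(oligo)
```

Using one fixed codon per residue keeps the example short; an actual design may vary codons to avoid unwanted restriction sites.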
Tests the dataframe produced by Step 1 to ensure that the library has the intended composition, while additionally testing that the correct RE sites are in each sequence. This step produces the files that are sent to Agilent for synthesis.
Input:
- chip_df.csv: contains the library sequences

Output:
- chip_for_agilent.txt: this is what is sent to Agilent
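The kind of checks described for this step might look like the sketch below. It is an assumption-laden illustration using only the standard library: the function names, the expected site count, and the recognition-site string passed in the example are placeholders, and the actual tests in the repository may differ.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_site(seq, site):
    """Count occurrences of a recognition site on either strand
    (overlapping matches included)."""
    targets = {site, reverse_complement(site)}
    return sum(seq.startswith(t, i) for i in range(len(seq)) for t in targets)


def check_library(seqs, site, expected_sites=1):
    """Return (index, problem) tuples; an empty list means every check passed."""
    problems = []
    seen = set()
    for i, seq in enumerate(seqs):
        if set(seq) - set("ACGT"):
            problems.append((i, "non-ACGT character"))
        if count_site(seq, site) != expected_sites:
            problems.append((i, "unexpected number of RE sites"))
        if seq in seen:
            problems.append((i, "duplicate sequence"))
        seen.add(seq)
    return problems


# Placeholder site and toy sequences, for illustration only.
library = ["AACCGCTCTT", "AACCGCTCTT", "AAGAGCGGTTCCGCTC", "AAAA"]
problems = check_library(library, site="CCGCTC")
print(problems)
```

Checking both strands matters for non-palindromic sites, since a site on the reverse strand would still be cut.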
Simulates the cloning process in silico, to ensure that the library can be successfully produced with the set of primers, plasmid backbone, and other molecular parameters.
Primer files:
- skpp15-forward.fasta: forward primer
- skpp15-reverse.fasta: reverse primer

Library file:
- chip_df.csv: contains the library sequences
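As a rough picture of what an in silico amplification check involves, here is a simplified sketch. It uses plain exact string matching rather than PyDNA, so it ignores mismatches, melting temperatures, and the later cloning steps; `simulate_pcr` and the sequences are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def simulate_pcr(template, fwd_primer, rev_primer):
    """Return the amplicon a primer pair would give on the template's top
    strand, or None if either primer has no exact binding site."""
    start = template.find(fwd_primer)
    if start == -1:
        return None
    # The reverse primer anneals to the top strand as its reverse complement.
    rev_site = reverse_complement(rev_primer)
    end = template.find(rev_site, start + len(fwd_primer))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]


# Toy template with placeholder primers.
template = "TTT" + "ACGTAC" + "ATGAAGGTG" + "CCGGTT" + "GGG"
amplicon = simulate_pcr(template, "ACGTAC", "AACCGG")
print(amplicon)
```

Running such a check over every row of the library would flag any sequence that the chosen primer pair cannot amplify.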
Takes the fastq nucleotide sequences from experimental sequencing runs, maps them back to the original AA sequences, and computes selection scores. We performed two sequencing runs, so Steps 1 and 2 should be run on both sets before combining them in Step 3.
PEAR
Pandas
Biopython
Merge fastq files using PEAR.
Inputs:
- fastq files in experimental run folder: contains all the fastq files
- manifest file for samples: contains the mapping between file names and the relevant samples

Outputs:
- merged files in Parsed_data/merged: merged fastq files
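One way to drive PEAR from Python is sketched below. The helper names and the sample file names are hypothetical; the sketch assumes `pear` is on the PATH and uses its `-f`, `-r`, `-o`, and `-j` options (forward reads, reverse reads, output prefix, threads). The repository may call PEAR differently.

```python
import subprocess
from pathlib import Path


def pear_command(forward_fastq, reverse_fastq, out_dir, sample, threads=1):
    """Build the PEAR command line for one paired-end sample."""
    out_prefix = Path(out_dir) / sample
    return [
        "pear",
        "-f", str(forward_fastq),
        "-r", str(reverse_fastq),
        "-o", str(out_prefix),
        "-j", str(threads),
    ]


def merge_pair(forward_fastq, reverse_fastq, out_dir, sample, threads=1):
    """Run PEAR; raises CalledProcessError if PEAR exits with an error."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cmd = pear_command(forward_fastq, reverse_fastq, out_dir, sample, threads)
    subprocess.run(cmd, check=True)


# Hypothetical file names; in practice they would come from the manifest file.
cmd = pear_command("s1_R1.fastq", "s1_R2.fastq", "Parsed_data/merged", "s1")
print(" ".join(cmd))
```

PEAR writes the merged reads to a file named after the output prefix with an `.assembled.fastq` suffix, which is what the counting step would then read.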
Count the number of variants across sequencing files.
Inputs:
- merged files in Parsed_data/merged: merged fastq files
- designed_variants.csv: set of designed AAs and corresponding coding nucleotides

Outputs:
- files in Parsed_data/library: merged fastq files
- raw_counts_raw_counts_NextSeq_run<run_num>.csv: raw counts
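Counting can be pictured as a lookup of each merged read against the designed nucleotide sequences. The sketch below assumes simple 4-line FASTQ records and exact matches between a read and a designed sequence; the function names and toy data are hypothetical, and the real pipeline may trim or filter reads first.

```python
import io
from collections import Counter


def read_fastq_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def count_variants(handle, designed):
    """Count reads that exactly match a designed nucleotide sequence.

    `designed` maps nucleotide sequence -> designed AA sequence.
    Variants that are never observed are reported with a count of 0.
    """
    counts = Counter({aa: 0 for aa in designed.values()})
    for seq in read_fastq_sequences(handle):
        aa = designed.get(seq)
        if aa is not None:
            counts[aa] += 1
    return counts


# Toy data standing in for a merged fastq file and designed_variants.csv.
designed = {"ATGAAG": "MK", "ATGAGA": "MR"}
fastq = io.StringIO(
    "@r1\nATGAAG\n+\nIIIIII\n"
    "@r2\nATGAAG\n+\nIIIIII\n"
    "@r3\nCCCCCC\n+\nIIIIII\n"
)
counts = count_variants(fastq, designed)
print(dict(counts))
```

Keeping zero-count variants in the result makes it easier to line up counts from different sequencing files later.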
Compute selection scores based on the count files.
Inputs:
- raw_counts_raw_counts_NextSeq_run1.csv: raw counts from run1 sequencing
- raw_counts_raw_counts_NextSeq_run2.csv: raw counts from run2 sequencing (3x)
- chip_df.csv: [this is the output of the synthesis pipeline] set of designed AAs and corresponding coding nucleotides

Outputs:
- library_w_selection_scores.csv: computed selection scores for the libraries together.
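This document does not spell out the scoring formula, so the following is only an assumed illustration: a log2 ratio of each variant's frequency after selection to its frequency before selection, with a pseudocount to avoid division by zero. The function names, the pseudocount, and the toy counts are all hypothetical.

```python
import math


def frequencies(counts, pseudocount=1.0):
    """Convert raw counts to frequencies after adding a pseudocount."""
    total = sum(counts.values()) + pseudocount * len(counts)
    return {v: (c + pseudocount) / total for v, c in counts.items()}


def selection_scores(pre_counts, post_counts, pseudocount=1.0):
    """Assumed score: log2(post frequency / pre frequency) per variant.

    Variants missing from post_counts are treated as having 0 reads.
    """
    post = {v: post_counts.get(v, 0) for v in pre_counts}
    f_pre = frequencies(pre_counts, pseudocount)
    f_post = frequencies(post, pseudocount)
    return {v: math.log2(f_post[v] / f_pre[v]) for v in pre_counts}


# Toy counts before and after selection.
pre = {"variant_A": 99, "variant_B": 99}
post = {"variant_A": 199, "variant_B": 49}
scores = selection_scores(pre, post)
print(scores)
```

Under this convention a positive score means a variant was enriched by selection and a negative score means it was depleted.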