
Chapter 3: Initiating a bioinformatics project

tyhsu389 edited this page Dec 11, 2018 · 4 revisions

4 Steps to Defining a Bioinformatics Project

Sequencing and bioinformatics combined can be seen as a laboratory-developed test (LDT). Overall, developing the test should follow these steps:

  1. Defining the project scope, and identifying the stakeholders and funding sources
  2. Building wet and dry laboratory capacity
  3. Piloting the project
  4. Building production level workflows

The inclusion of bioinformatics will add substantial tasks to each step (Figure 4). For the first step, bioinformaticians should be added as stakeholders. They can assist with assay development and help determine sample size (needed for statistical significance), reporting standards, and pipeline turnaround time. The second step will likely be the most difficult and the least familiar: it involves obtaining the sequencers and reagents, as well as finding the compute infrastructure to support bioinformatics analyses.

Figure 4. Implementing a bioinformatics project. Sequencing and bioinformatics combined can be viewed as a laboratory-developed test (LDT). However, the addition of bioinformatics adds several hurdles to each of the tasks above. First, project planning will require input from bioinformaticians, who can assist laboratorians with assay design and epidemiologists with report types. Second, building infrastructure for bioinformatics analyses will involve purchasing or renting compute machines and associated equipment, as well as developing a data policy. Third, the pilot project will require troubleshooting between laboratory staff, bioinformaticians, and epidemiologists in order to ensure accurate and appropriate reporting. Lastly, creating a production-level pipeline will involve quality management and CLIA standards not only in the laboratory, but also in the bioinformatics analyses.

The third step, piloting the project, will require constant communication between the laboratory and the bioinformatics team for troubleshooting, and between the bioinformatics team and epidemiologists for reporting. Take this scenario for example: the laboratory sequences several mumps samples, but the bioinformatician cannot find mumps within the sequencing reads during their analyses. This could result from low sample input, problems during library preparation, incorrect assumptions about specific bioinformatics tools, mistakes within the bioinformatics pipeline, or problems with sequence databases. Even after these issues are corrected, the bioinformatician will still need to work with epidemiologists to create useful deliverables.

Step 1: Defining the project

Bioinformatics is a field that attempts to make sense of large biological datasets, which may include DNA, RNA, proteins, and metabolites. For a given sample, bioinformatics tools may i) reconstruct the biological state (e.g., assembling the genome, calling genes, determining which genes were expressed), and then ii) use these states to find relationships between samples (e.g., case vs. control, or sample vs. reference), or between samples and metadata (e.g., location, environmental factors). One misconception about bioinformatics is that simply having a large dataset will generate hypotheses and actionable conclusions. Although this is possible, the best bioinformatics analyses are done when the objective of the project is clear, because the objective influences how samples are processed and how metadata are collected.

For example, the MA State Laboratory works on the Undetected Respiratory Disease Outbreak (URDO) project in collaboration with the CDC and several other states. The objective of this project is to rapidly identify pathogens that cause respiratory illnesses. This could be accomplished through multiple laboratory assays, including:

| Laboratory Assays | Pros/Cons |
| --- | --- |
| 16S rRNA sequencing | Can identify unknown/non-culturable pathogens in low biomass samples, but usually limited to genus level resolution |
| Metagenomics shotgun sequencing | Can identify unknown/non-culturable pathogens potentially at strain level resolution (depending on community composition). Also provides gene information. Needs higher biomass. |
| Targeted amplification (PCR) | Must know organism in advance in order to create custom primers; can have low biomass samples. Can also provide gene or marker information. |
| DNA selection (SHERLOCK) (Gootenberg et al., 2018) | Same as “Targeted amplification” above. |

The chosen laboratory assay then determines which bioinformatics workflow will be implemented. Commercial pipelines exist for 16S rRNA and metagenomics, but may not exist for custom pathogen panels. If using open source software, the bioinformatics tools may need to be optimized or even customized to achieve the objective. Furthermore, each of these pipelines may require different amounts of compute power and run time. If epidemiologists and clinicians need specific turnaround times, this will further influence which laboratory assay and bioinformatics approaches are chosen.
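To make the trade-offs in the assay table concrete, the sketch below encodes the table as a small Python data structure and filters it by two project constraints: whether the target organism is known in advance, and whether samples are low biomass. The names `Assay`, `ASSAYS`, and `candidate_assays` are hypothetical and chosen for illustration only. The true/false values are a simplified reading of the Pros/Cons column, not a validated decision rule, and a real project would also weigh turnaround time, cost, and compute needs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Assay:
    name: str
    requires_known_organism: bool  # must the pathogen be known in advance?
    tolerates_low_biomass: bool    # can it be run on low biomass samples?
    notes: str                     # resolution/information, paraphrased from the table


# Simplified encoding of the assay table (an assumption, not an authoritative rule set).
ASSAYS = [
    Assay("16S rRNA sequencing", False, True,
          "usually limited to genus level resolution"),
    Assay("Metagenomics shotgun sequencing", False, False,
          "potentially strain level resolution; also provides gene information"),
    Assay("Targeted amplification (PCR)", True, True,
          "can also provide gene or marker information"),
    Assay("DNA selection (SHERLOCK)", True, True,
          "same as targeted amplification"),
]


def candidate_assays(organism_known: bool, low_biomass: bool) -> List[str]:
    """Return names of assays compatible with the two constraints."""
    return [
        assay.name
        for assay in ASSAYS
        # Assays that need a known organism are excluded when it is unknown.
        if (organism_known or not assay.requires_known_organism)
        # Assays that need higher biomass are excluded for low biomass samples.
        and (assay.tolerates_low_biomass or not low_biomass)
    ]


if __name__ == "__main__":
    # Unknown pathogen in a low biomass sample.
    print(candidate_assays(organism_known=False, low_biomass=True))
```

Under this encoding, an unknown pathogen in a low biomass sample leaves only 16S rRNA sequencing as a candidate, which matches the reading of the table: the targeted methods need a known organism, and shotgun metagenomics needs higher biomass.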

Overall, planning is estimated to take 3-6 months (Figure 4). Outlining the objective will help clarify:

  • How the LDT (sequencing + bioinformatics) will improve upon current tests
  • What should be reported, and its turnaround time
  • Who the stakeholders are, and which personnel within laboratory, bioinformatics, epidemiology, and information technology (IT) are responsible for carrying out the project
  • How the project will be funded
  • What equipment needs to be purchased
