Instructions for pipeline testing #27

@claraqin

Migrating this from an email thread for easier viewing.

Scope of Testing

For now, we will be testing the steps in the figure below between "NEON Data Products" and "ASV tables with taxonomy," except that we will not be generating the taxonomy tables because this takes too much processing time.

[Figure: pipeline overview diagram (Screen Shot 2020-10-26 at 12 42 36 PM)]

Our Technical Working Group has suggested that this testing should occur in two phases. In Phase 1, we test that the pipeline can run from start to finish on a variety of operating systems. In Phase 2, we will ask volunteers to read through the docs and suggest ways to make the package more flexible and user-friendly. For now, we are only asking you to conduct Phase 1 testing.

Instructions for Phase 1 Testing

Start by cloning the codebase from https://github.com/claraqin/NEON_soil_microbe_processing.

Install cutadapt if you have not previously done so.

  • Installation instructions can be found here: https://cutadapt.readthedocs.io/en/stable/installation.html
  • This is where many people run into issues because of Python dependencies. If you cannot install cutadapt, then ignore the ITS pipeline and test the 16S processing pipeline only. (The 16S pipeline does not require cutadapt.)
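
To confirm from within R that the cutadapt installation worked, a quick check along these lines may help. This is a minimal sketch, not part of the repository: `find_cutadapt()` is a hypothetical helper name, and it assumes cutadapt was installed somewhere on your PATH.

```r
# Hypothetical helper: return the full path to cutadapt, or NA if it is not
# on the PATH. Sys.which() returns "" when the executable cannot be found.
find_cutadapt <- function(cmd = "cutadapt") {
  path <- unname(Sys.which(cmd))
  if (!nzchar(path)) {
    return(NA_character_)
  }
  path
}

cutadapt_path <- find_cutadapt()
if (is.na(cutadapt_path)) {
  message("cutadapt not found on PATH; test the 16S pipeline only.")
} else {
  # Ask the executable for its version to confirm that it actually runs.
  version <- system2(cutadapt_path, "--version", stdout = TRUE)
  message("cutadapt ", version, " found at ", cutadapt_path)
}
```

If a path is printed, that is the value to use for CUTADAPT_PATH when testing the ITS pipeline.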

Update the parameters in the "params.R" file, which can be found in the "code" subdirectory.

  • Most of the parameters will not need to be updated because they are either adaptable or will not be referenced in this scope of testing.
  • However, you may need to update the CUTADAPT_PATH parameter if you are testing the ITS pipeline.
  • If you are using a Mac, you may also wish to update the MULTITHREAD parameter. By default, multithreading is turned off for Windows computers.
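
For illustration, the two parameters mentioned above might be set roughly as follows. Only the names CUTADAPT_PATH and MULTITHREAD come from the instructions; the values, the fallback path, and the way the operating system is detected are assumptions and may not match what is actually in "params.R".

```r
# Hypothetical excerpt of code/params.R. Only the two parameter names are taken
# from the instructions; the values shown here are assumptions.

# Absolute path to the cutadapt executable (needed for the ITS pipeline only).
# Sys.which() returns "" if cutadapt is not on the PATH, so fall back to a
# manually entered location in that case (the fallback is a placeholder).
CUTADAPT_PATH <- unname(Sys.which("cutadapt"))
if (!nzchar(CUTADAPT_PATH)) {
  CUTADAPT_PATH <- "/path/to/your/cutadapt"
}

# Enable multithreading everywhere except Windows.
MULTITHREAD <- .Platform$OS.type != "windows"
```

Mac and Linux testers who prefer a single thread can simply set MULTITHREAD to FALSE instead.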

Download the sequence metadata for testing at this Google Drive link, decompress the zipfile, and drop the contents (two files) into the project directory (the base directory of the repository that you just cloned).

  • In the future, this step will be replaced by a function made specifically for downloading sequence metadata from NEON. But for now, we need to use a workaround because of compatibility issues on NEON's end which will be resolved later this year.

The code for testing can be found in the "testing" subdirectory. This subdirectory contains temporary versions of our vignettes that I made for testing only. Start with the download-neon-data-metadataworkaround.Rmd vignette.

  • You will probably have to update the "root.dir" RMarkdown parameter at the top of the script. It should refer to the absolute filepath of the project root directory (e.g. .../neonSoilMicrobeProcessing).
  • Note that the R package dependencies, specified in lines 32-36, must be installed before this vignette will run properly.
  • In lines 81-82, you will have the option to download either the metadata for ITS sequences or the metadata for 16S sequences (or both). Please respond to this Issue thread to let the other testers know which target gene(s) you will test.
  • In lines 89-101, different options of subsetting parameters are provided. You could attempt to download and process the entire dataset if you'd like, but I do not even have an estimate of the full download size because these metadata tables include both published and pre-published NEON data. If you do subset the data, please respond to this Issue thread to let the other testers know which subset(s) you will test.
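
Subsetting by sequencing run amounts to filtering the metadata table on a vector of run IDs. The sketch below uses a toy data frame to show the idea; the column names `sequencerRunID` and `rawDataFileName` and the file names are assumptions for illustration, so check the vignette for the names it actually uses.

```r
# Toy stand-in for the sequence metadata table. Column names and file names
# are assumptions; the run IDs are examples of the kind listed in this thread.
meta <- data.frame(
  sequencerRunID  = c("B69PP", "B69RF", "BDNB6", "B69PP"),
  rawDataFileName = c("a_R1.fastq", "b_R1.fastq", "c_R1.fastq", "d_R1.fastq"),
  stringsAsFactors = FALSE
)

# The sequencing runs you have chosen to test.
runs_to_test <- c("B69PP", "B69RF")

# Keep only the rows that belong to the chosen runs.
meta_subset <- meta[meta$sequencerRunID %in% runs_to_test, , drop = FALSE]

# Count the files retained per run as a sanity check.
table(meta_subset$sequencerRunID)
```

Posting the `runs_to_test` vector in this thread is enough for other testers to see which subset you covered.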

Then move to either the process-its-sequence-to-seqtabs.Rmd or the process-16s-sequence-to-seqtabs.Rmd vignette, depending on which target gene you selected.

  • You will probably have to update the "root.dir" RMarkdown parameter at the top of the script. It should refer to the absolute filepath of the project root directory (e.g. .../neonSoilMicrobeProcessing).
  • Note that the R package dependencies, specified in lines 30-34, must be installed before this vignette will run properly.
  • Both vignettes contain a header which says "All code below is NOT run in this version of the vignette." Please run only the code above this header.
  • Note that each sequencing run (the unit by which we are subsetting) takes anywhere between 1 and 4 hours to process, depending on the size of the run and the speed of your processor. I've found that 8 GB of RAM is usually sufficient for running this pipeline, but occasionally more RAM is needed.
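
Before running either vignette, it can save time to check that its R package dependencies are installed. The helper below is a sketch under stated assumptions: `missing_packages()` is a hypothetical name, and the package names passed to it are placeholders, so substitute the packages listed at the top of the vignette you are running.

```r
# Hypothetical helper: return the names of packages that are not installed.
# requireNamespace() returns FALSE (without raising an error) when a package
# cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Placeholder package names; replace with the vignette's actual dependencies.
needed <- c("stats", "utils")
not_installed <- missing_packages(needed)

if (length(not_installed) > 0) {
  message("Install before running: ", paste(not_installed, collapse = ", "))
} else {
  message("All listed packages are installed.")
}
```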

Reporting Back

If any issues or fatal errors arise, please let me know by replying to me individually, unless it seems obvious that the issue would affect all testers.

Whether you run into a fatal error or are able to complete the pipeline error-free, please report back on this thread and include in your post the output of devtools::session_info().
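
One way to get the session information into a form that is easy to paste into a reply is to capture it to a text file. This is a sketch, not a required step: `capture_session_info()` and the default file name are hypothetical, and the fallback to `utils::sessionInfo()` is only there in case devtools is not installed.

```r
# Hypothetical helper: write the session information to a text file and
# return the captured lines invisibly.
capture_session_info <- function(file = "session_info.txt") {
  info <- if (requireNamespace("devtools", quietly = TRUE)) {
    capture.output(devtools::session_info())
  } else {
    # Fallback when devtools is unavailable; less detailed but always present.
    capture.output(utils::sessionInfo())
  }
  writeLines(info, file)
  invisible(info)
}

# Write to a temporary file here; pass a path of your choice in practice.
out_file <- tempfile(fileext = ".txt")
session_lines <- capture_session_info(out_file)
```

The contents of the resulting file can then be pasted into a post on this thread.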

Current Volunteer Assignments

  • Kabir has tested the ITS pipeline on a Mac for the following subset of data: c("B69PP", "B69RF", "B69RN", "B9994", "BDR3T", "BF8M2", "BF8W6", "BFDG8", "BMCBD", "BMCC4", "BNBWL").
  • Dan is currently testing the 16S pipeline on a Mac for the following subset of data: c("B69PP", "B69RF", "B69RN", "B9994", "BDNB6", "BF462", "BF8M2", "BFDG8", "BJ8RK", "BMC64", "BMCBJ").
  • Kai is currently testing the 16S pipeline on a Windows VM and printing the results in this Issue thread: Pipeline testing on Windows #26
