Releases: lilab-bcb/cumulus_feature_barcoding

2.0.0

20 May 07:39
f96ee53

Acknowledgment

The work of adding the UMI correction feature was initiated by Jack Kamm, who also demonstrated the usefulness of UMI correction plus PCR chimeric filtering for cleaning up noise in deeply sequenced CRISPR data. We would like to give our huge thanks to Jack for his inspiration and help in improving this software!

New Features

  • Add UMI correction with methods introduced in [Smith, et al. 2017]:
    • Use the directional method by default. Other available methods: cluster and adjacency. Specify a non-default method via the --umi-correct-method option.
    • The report txt file has a new structure that includes statistics after UMI correction.
  • For crispr feature type data, additionally perform PCR chimeric filtering:
    • No UMI count cutoff by default. Users can specify a non-zero cutoff via the --umi-count-cutoff option.
    • By default, chimeric filtering uses a ratio threshold of 0.5 per Barcode+UMI combination. Users can specify a non-default cutoff via the --read-ratio-cutoff option.
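
The two steps above can be illustrated with a minimal Python sketch. It is not the tool's actual implementation: `directional_correct` follows the directional rule as described in Smith et al. 2017 (a UMI absorbs a neighbor at Hamming distance 1 when its count is at least twice the neighbor's count minus one), and `chimeric_filter` assumes that a feature survives within a Barcode+UMI combination only when its share of reads is strictly greater than the cutoff. The function names, input layouts, and the strict comparison are all assumptions made for illustration.

```python
from collections import deque


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def directional_correct(umi_counts):
    """Collapse the UMIs of one (cell barcode, feature) group.

    umi_counts maps UMI sequence -> read count. Returns a dict mapping each
    representative UMI to the summed read count of its network.
    """
    # Visit UMIs from most to least abundant so each network is rooted
    # at its highest-count member.
    umis = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    visited = set()
    corrected = {}
    for root in umis:
        if root in visited:
            continue
        visited.add(root)
        total = 0
        queue = deque([root])
        while queue:
            parent = queue.popleft()
            total += umi_counts[parent]
            for child in umis:
                if child in visited or len(child) != len(parent):
                    continue
                # Directional rule: one mismatch and count(parent) >= 2*count(child) - 1.
                if (hamming(parent, child) == 1
                        and umi_counts[parent] >= 2 * umi_counts[child] - 1):
                    visited.add(child)
                    queue.append(child)
        corrected[root] = total
    return corrected


def chimeric_filter(molecule_reads, read_ratio_cutoff=0.5):
    """Keep, per (barcode, UMI), only features above the read-ratio cutoff.

    molecule_reads maps (barcode, umi) -> {feature: read count}.
    ASSUMPTION: the comparison is strict (ratio > cutoff).
    """
    kept = {}
    for key, per_feature in molecule_reads.items():
        total = sum(per_feature.values())
        survivors = {f: n for f, n in per_feature.items()
                     if total > 0 and n / total > read_ratio_cutoff}
        if survivors:
            kept[key] = survivors
    return kept
```

For example, `directional_correct({"AAAA": 100, "AAAT": 3, "CCCC": 5})` merges `AAAT` into `AAAA` and leaves `CCCC` on its own, while a Barcode+UMI combination whose reads are split evenly between two features is dropped entirely by `chimeric_filter` under the strict-comparison assumption.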

Other Important Changes

  • Ignore UMIs containing N's when processing reads.
  • If --max-mismatch-feature is non-zero, add mutated indexes in breadth-first (BFS) order (previously depth-first, DFS).
    • Because of the BFS order, if the specified --max-mismatch-feature is too high, it is reset to a lower mismatch (i.e. the smallest mismatch that encounters ambiguous mutated feature sequences) instead of causing a failure.
  • Remove --max-mismatch-cell and --umi-length; both are now determined by the chemistry type.
  • Remove --feature, and make feature type a required input. Available options: hashing, citeseq, cmo, crispr, and adt (when both hashing and citeseq features are in the same sample).
  • Add --genome option to write the genome reference name to the output count matrices.
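
The BFS behavior described above can be sketched as follows. This is one possible reading of the note, not the tool's code: all feature sequences are expanded level by level through single-base substitutions, a mutated sequence reached by two different features at the same level is treated as ambiguous, and expansion stops at the first level where that happens. The function name `build_mismatch_index`, the `ACGT` alphabet, and the choice to drop ambiguous sequences from the index are assumptions.

```python
def build_mismatch_index(features, max_mismatch, alphabet="ACGT"):
    """Map mutated sequences to feature names, level by level (BFS).

    features maps feature name -> barcode sequence (assumed distinct).
    Returns (index, effective_mismatch), where effective_mismatch may be
    lower than max_mismatch if ambiguity is encountered first.
    """
    index = {seq: name for name, seq in features.items()}  # level 0: exact matches
    frontier = dict(index)
    effective = 0
    for level in range(1, max_mismatch + 1):
        candidates = {}
        ambiguous = set()
        for seq, name in frontier.items():
            for pos, base in enumerate(seq):
                for alt in alphabet:
                    if alt == base:
                        continue
                    mutated = seq[:pos] + alt + seq[pos + 1:]
                    if mutated in index:
                        continue  # already claimed at a smaller mismatch
                    owner = candidates.setdefault(mutated, name)
                    if owner != name:
                        ambiguous.add(mutated)  # two features reach it at this level
        # ASSUMPTION: ambiguous sequences are simply left out of the index.
        frontier = {s: n for s, n in candidates.items() if s not in ambiguous}
        index.update(frontier)
        effective = level
        if ambiguous:
            break  # stop at the smallest mismatch that shows ambiguity
    return index, effective
```

With features `AAAA` and `AATT` and a requested mismatch of 3, the sketch stops at level 1, because `AAAT` is one substitution away from both sequences.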

Output Format Changes

  • Count matrices are stored in sparse format as 10x hdf5 files.
  • UMI tables are in a simplified 10x hdf5 format (.molecule_info.h5), instead of .stat.csv.gz:
    • Datasets /barcode_idx and /barcodes: /barcode_idx stores each molecule's cell barcode index; the barcode name is found in /barcodes via this index.
    • Datasets /feature_idx and /features: /feature_idx stores each molecule's feature index; the feature name is found in /features via this index.
    • Dataset /umi: Each molecule's UMI sequence as a string.
    • Dataset /count: Each molecule's read count as an integer.
  • For crispr samples, 3 count matrices are generated: .raw.h5 for the raw count matrix, .umi_correct.h5 for the count matrix after UMI correction, and .chimeric_filtered.h5 for the count matrix after UMI correction + PCR chimeric filtering.
  • For samples of the other antibody types, 2 count matrices are generated: .raw.h5 for the raw count matrix and .umi_correct.h5 for the count matrix after UMI correction.
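
To make the index-based layout of .molecule_info.h5 concrete, the sketch below resolves each molecule's barcode and feature names from plain Python lists that stand in for the datasets listed above. Reading the actual file (for example with an HDF5 library) is left out, and the helper names `decode_molecules` and `umi_counts_per_pair` are hypothetical; only the dataset names come from the notes.

```python
from collections import Counter


def decode_molecules(barcodes, features, barcode_idx, feature_idx, umi, count):
    """Resolve per-molecule indexes into names.

    barcodes / features play the role of /barcodes and /features;
    barcode_idx, feature_idx, umi and count hold one entry per molecule.
    Returns a list of (barcode, feature, umi, read_count) tuples.
    """
    return [
        (barcodes[b], features[f], u, c)
        for b, f, u, c in zip(barcode_idx, feature_idx, umi, count)
    ]


def umi_counts_per_pair(barcode_idx, feature_idx):
    """Count molecules (UMIs) per (feature index, barcode index) pair."""
    return Counter(zip(feature_idx, barcode_idx))


# Toy data: 2 barcodes, 2 features, 3 molecules.
barcodes = ["AAACCC", "GGGTTT"]
features = ["hashtag_1", "hashtag_2"]
barcode_idx = [0, 0, 1]
feature_idx = [1, 1, 0]
umi = ["ACGTACGTACGT", "TTTTACGTACGT", "CCCCACGTACGT"]
count = [7, 2, 4]

molecules = decode_molecules(barcodes, features, barcode_idx, feature_idx, umi, count)
pair_counts = umi_counts_per_pair(barcode_idx, feature_idx)
```

Storing indexes rather than repeated names keeps the per-molecule datasets compact, since each barcode and feature name is written once.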

Bug Fix

  • Fix an index -1 bug when processing 3-column feature barcode files (i.e. files that contain a modality column).
  • Fix a bug in chemistry auto-detection that, in rare cases, caused the software to fail to detect low-quality samples in which the top 2 chemistries have similar numbers of reads matching cell barcodes.

1.0.0

05 Mar 00:20
af148b2

  • Chemistry auto-detection by testing the first 10,000 R1 reads against all possible cell barcode inclusion lists based on --chemistry:
    • All 10x cell barcode files must be placed in one folder, which is specified via the required command argument cell_barcode_dir.
    • Use the new lists for SC3Pv3 and SC3Pv4 chemistries since Cell Ranger v9.0.
  • Automatically determine totalseq_type (for antibody assays), umi_len, barcode_pos and max_mismatch_cell from the detected chemistry.
  • Remove the --convert-cell-barcode option, since this is now detected automatically.
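
The auto-detection idea above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the cell barcode is taken to be the first `barcode_len` bases of R1, matching is exact, and the chemistry with the most hits wins. The function name `detect_chemistry`, its parameters, and the toy chemistry names in the usage note are hypothetical.

```python
from itertools import islice


def detect_chemistry(r1_reads, inclusion_lists, barcode_len=16, n_reads=10_000):
    """Pick the chemistry whose inclusion list matches the most R1 reads.

    r1_reads: iterable of R1 sequences (strings).
    inclusion_lists: dict chemistry name -> set of valid cell barcodes.
    Returns (best_chemistry, hits_per_chemistry).
    """
    hits = {name: 0 for name in inclusion_lists}
    # Only the first n_reads reads are tested.
    for read in islice(r1_reads, n_reads):
        barcode = read[:barcode_len]  # ASSUMPTION: barcode starts at position 0
        for name, valid in inclusion_lists.items():
            if barcode in valid:
                hits[name] += 1
    best = max(hits, key=hits.get)
    return best, hits
```

For instance, `detect_chemistry(["AAAATTTT", "AAAAGGGG", "CCCCGGGG"], {"chemA": {"AAAA"}, "chemB": {"CCCC"}}, barcode_len=4)` returns `"chemA"` with two hits against one for `"chemB"`. A real detector would also need to handle the case where the top candidates have nearly equal hit counts.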

0.11.4

05 Feb 20:37
0eef6d8

  • Support UTF-encoded cell and feature barcode files as input (PR #28 by @yihming)
  • Stop early, with a user-friendly error message, if no FASTQ file is found in the input directory (PR #28 by @yihming)

0.11.3

11 Mar 19:46
f249353

Fix an issue in parsing feature barcode files with multiple modalities (PR #27 by @yihming)

0.11.2

16 Oct 18:50
a7481b0

Fix whitespace issue with Windows (PR #26 by @yihming)

0.11.1

16 Nov 06:02

Fix a bug in ingesting reads (PR #25 by @bli25)

0.11.0

18 Aug 18:28
b483047

This release contains the following changes (PR #24 by @bli25):

  • Add support for writing in BGZF format.
  • Fix a bug in izlib.h and improve the error message.

0.10.0

22 Jun 19:53
09cecb1

FASTQ file parser:

  • Achieve faster multi-threaded FASTQ file reading through a simplified reimplementation of FQFeeder (PR #23 by @bli25)

0.9.0

12 Jun 09:48
861bbfb

  • Decompressing:
    • Use isa-l to replace zlib for faster decompression
    • Use slw287r's izlib.h as the interface to interact with kseq.h. (PR #20, PR #21 by @bli25)
  • Compressing:
    • Use libdeflate for faster compression
    • Add compress.hpp, which enables single-threaded and multi-threaded compression (PR #22 by @bli25)
  • In input arguments:
    • Accept gzipped cell barcode file again
    • Add -p option for multi-threaded compression
  • In output:
    • The sufficient statistics file is gzipped again, i.e. output_name.stat.csv.gz.

0.8.0

30 Apr 23:56

  • On processing gzipped FASTQ files:
    • Remove boost library dependency.
    • Instead, use zlib and Heng Li's kseq library for fast I/O processing. (PR #17 by @tony-kuo; PR #18 by @bli25)
  • In input arguments:
    • No longer accept gzipped cell barcode file, i.e. only .txt format is accepted.
  • In output:
    • The sufficient statistics file output_name.stat.csv is no longer gzipped and is written in plain .csv format.
    • Add output_name.report.txt to report statistics related to number of reads.