2.0.0

@yihming yihming released this 20 May 07:39
· 1 commit to main since this release
f96ee53

Acknowledgment

The work of adding the UMI correction feature was initiated by Jack Kamm. Jack further demonstrated that UMI correction combined with PCR chimeric filtering is useful for cleaning up noise in deeply sequenced CRISPR data. We would like to give our huge thanks to Jack for his inspiration and help in improving this software!

New Features

  • Add UMI correction with methods introduced in [Smith, et al. 2017]:
    • Use the directional method by default. Other available methods: cluster and adjacency. Specify a non-default method via the --umi-correct-method option.
    • The report txt file has a new structure that includes statistics after UMI correction.
  • For crispr feature type data, further perform PCR chimeric filtering:
    • No UMI count cutoff by default. Users can specify a non-zero cutoff via --umi-count-cutoff option.
    • By default, chimeric filtering uses a ratio threshold of 0.5 per Barcode+UMI combination. Users can specify a non-default cutoff via the --read-ratio-cutoff option.
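
The two steps above can be illustrated with a minimal Python sketch. It is not the tool's implementation. The function names and input layouts are hypothetical. The directional rule used here (merge UMI b into UMI a when they differ by one base and count(a) >= 2 * count(b) - 1) follows the method as commonly described for Smith et al. 2017. The chimeric filter assumes a strict comparison against the ratio cutoff, which these notes do not specify.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def directional_correct(umi_counts):
    """Collapse UMIs with a directional rule (illustrative sketch).

    umi_counts maps a UMI sequence to its read count. Returns a dict that
    maps each surviving UMI to the summed read count of its cluster.
    """
    # Visit UMIs from most to least abundant so cluster roots are seen first.
    order = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    assigned = {}  # UMI -> root UMI of its cluster
    for seed in order:
        if seed in assigned:
            continue
        assigned[seed] = seed
        stack = [seed]
        while stack:
            parent = stack.pop()
            for child in order:
                if child in assigned:
                    continue
                if (hamming(parent, child) == 1
                        and umi_counts[parent] >= 2 * umi_counts[child] - 1):
                    assigned[child] = seed
                    stack.append(child)
    corrected = defaultdict(int)
    for umi, root in assigned.items():
        corrected[root] += umi_counts[umi]
    return dict(corrected)


def filter_chimeras(reads, read_ratio_cutoff=0.5):
    """Keep a feature only if it dominates its Barcode+UMI combination.

    reads maps (barcode, umi, feature) to a read count. An entry is kept
    when its share of the reads for that (barcode, umi) pair is strictly
    above the cutoff (the strictness is an assumption of this sketch).
    """
    totals = defaultdict(int)
    for (barcode, umi, _feature), n in reads.items():
        totals[(barcode, umi)] += n
    return {
        key: n for key, n in reads.items()
        if n / totals[(key[0], key[1])] > read_ratio_cutoff
    }
```

For example, directional_correct({"AAAA": 100, "AAAT": 10, "AATT": 2, "CCCC": 5}) returns {"AAAA": 112, "CCCC": 5}: AAAT merges into AAAA, AATT merges through AAAT, and CCCC stays on its own.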

Other Important Changes

  • Ignore UMIs containing N's when processing reads.
  • If --max-mismatch-feature is non-zero, add mutated indexes in breadth-first (BFS) order (previously depth-first, DFS).
    • Because of the BFS order, if the specified --max-mismatch-feature is too high, it is reset to a lower mismatch (i.e. the smallest mismatch that encounters ambiguous mutated feature sequences) instead of failing.
  • Remove --max-mismatch-cell and --umi-length; both are now determined by the chemistry type.
  • Remove --feature, and make feature type a required input. Available options: hashing, citeseq, cmo, crispr, and adt (when both hashing and citeseq features are in the same sample).
  • Add the --genome option to allow writing the genome reference name to the output count matrices.
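
The breadth-first construction of the mismatch index can be sketched roughly as follows. This is an illustration under assumptions, not the tool's code: the function name, the substitution-only mutation model, and the return value are all hypothetical. How the tool treats sequences at the ambiguous level itself is not stated in these notes, so this sketch simply leaves that level out of the index.

```python
def build_mismatch_index(features, max_mismatch, alphabet="ACGT"):
    """Index every sequence within max_mismatch substitutions of a feature.

    Expansion is level by level: all sequences at distance d are generated
    before any at distance d + 1. Expansion stops at the first level where a
    mutated sequence could belong to two different features.

    Returns (index, first_ambiguous_level), where index maps a sequence to
    the position of its feature in `features`, and first_ambiguous_level is
    None when no collision occurs up to max_mismatch.
    Assumes the feature sequences are distinct and of equal length.
    """
    index = {seq: i for i, seq in enumerate(features)}
    frontier = dict(index)
    for level in range(1, max_mismatch + 1):
        next_frontier = {}
        ambiguous = False
        for seq, fid in frontier.items():
            for pos in range(len(seq)):
                for base in alphabet:
                    if base == seq[pos]:
                        continue
                    mutated = seq[:pos] + base + seq[pos + 1:]
                    owner = index.get(mutated, next_frontier.get(mutated))
                    if owner is None:
                        next_frontier[mutated] = fid
                    elif owner != fid:
                        # Reachable from two different features.
                        ambiguous = True
        if ambiguous:
            return index, level
        index.update(next_frontier)
        frontier = next_frontier
    return index, None
```

With features ["AAAA", "TTTT"] and max_mismatch=3, the first collision appears at level 2 (for instance AATT is two substitutions from both features), so the returned index covers only exact and one-mismatch sequences. A depth-first expansion would not find the smallest ambiguous level as directly, which is one plausible reason for the level-by-level order.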

Output Format Changes

  • Count matrices are in sparse format and in 10x hdf5 format.
  • UMI tables are in a simplified 10x hdf5 format (.molecule_info.h5), instead of .stat..csv.gz:
    • Datasets /barcode_idx and /barcodes: /barcode_idx stores each molecule's cell barcode index; the barcode name is looked up in /barcodes via this index.
    • Datasets /feature_idx and /features: /feature_idx stores each molecule's feature index; the feature name is looked up in /features via this index.
    • Dataset /umi: Each molecule's UMI sequence, as a string.
    • Dataset /count: Each molecule's read count, as an integer.
  • For crispr samples, 3 count matrices are generated: .raw.h5 for raw count matrix, .umi_correct.h5 for count matrix after UMI correction, .chimeric_filtered.h5 for count matrix after UMI correction + PCR chimeric filtering.
  • For other antibody type samples, 2 count matrices are generated: .raw.h5 for raw count matrix, .umi_correct.h5 for count matrix after UMI correction.
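
As a rough illustration of the molecule info layout described above, the following Python sketch resolves the index datasets into names and tallies UMIs per feature and barcode. The function names are hypothetical. load_molecule_info relies on the third-party h5py package and assumes that /barcodes, /features and /umi are flat string datasets, which should be checked against an actual output file.

```python
from collections import Counter


def resolve_molecules(barcode_idx, barcodes, feature_idx, features, umi, count):
    """Return one (barcode, feature, umi, read_count) tuple per molecule.

    barcode_idx and feature_idx are per-molecule indexes into the
    barcodes and features name arrays, as in the datasets listed above.
    """
    return [
        (barcodes[b], features[f], u, int(c))
        for b, f, u, c in zip(barcode_idx, feature_idx, umi, count)
    ]


def umi_count_matrix(molecules):
    """Count molecules (UMIs) per (feature, barcode) pair as a sparse dict."""
    return Counter(
        (feature, barcode) for barcode, feature, _umi, _count in molecules
    )


def load_molecule_info(path):
    """Read the six datasets from a .molecule_info.h5 file (needs h5py)."""
    import h5py  # third-party; imported lazily so the helpers above work without it

    def text(values):
        # h5py returns string datasets as bytes; decode them for convenience.
        return [v.decode() if isinstance(v, bytes) else str(v) for v in values]

    with h5py.File(path, "r") as f:
        return resolve_molecules(
            f["barcode_idx"][:], text(f["barcodes"][:]),
            f["feature_idx"][:], text(f["features"][:]),
            text(f["umi"][:]), f["count"][:],
        )
```

The helpers also work on plain Python lists, which makes the index-to-name lookup easy to try without an HDF5 file.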

Bug Fix

  • Fix an indexing bug (index -1) when processing feature barcode files with 3 columns (i.e. files that contain a modality column).
  • Fix a bug in chemistry auto-detection that, in rare cases, caused detection to fail for low-quality samples in which the top 2 chemistries have similar numbers of matched reads in cell barcodes.