Acknowledgment
The work on the UMI correction feature was initiated by Jack Kamm, who also demonstrated the usefulness of UMI correction combined with PCR chimeric filtering for cleaning up noise in deeply sequenced CRISPR data. We would like to give our huge thanks to Jack for his inspiration and help in improving this software!
New Features
- Add UMI correction with methods introduced in [Smith, et al. 2017]:
  - Use `directional` method by default. Other methods available: `cluster`, `adjacency`. Specify non-default method via `--umi-correct-method` option.
  - New structure of report txt file to include stats after UMI correction.
- For `crispr` feature type data, further perform PCR chimeric filtering:
  - No UMI count cutoff by default. Users can specify a non-zero cutoff via `--umi-count-cutoff` option.
  - Chimeric filtering by ratio threshold `0.5` per Barcode+UMI combination by default. Users can specify a non-default cutoff via `--read-ratio-cutoff`.
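To make the two steps above more concrete, here is a minimal Python sketch, not the software's actual implementation. `directional_correct` follows the directional rule from Smith et al. 2017 (a UMI absorbs a neighbor at Hamming distance 1 when its count is at least twice the neighbor's count minus one). `chimeric_filter` is one plausible reading of the read ratio threshold. The function names, the input dictionaries, and the choice of a strict `>` comparison are all assumptions made for illustration.

```python
from collections import defaultdict, deque

def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))

def directional_correct(umi_counts):
    """Map each UMI to the representative UMI of its directional cluster."""
    # Most abundant first, so every cluster is rooted at its top UMI.
    order = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    parent = {}
    for root in order:
        if root in parent:
            continue
        parent[root] = root
        queue = deque([root])
        while queue:
            a = queue.popleft()
            for b in order:
                if b in parent or hamming(a, b) != 1:
                    continue
                # Directional edge a -> b: count(a) >= 2 * count(b) - 1.
                if umi_counts[a] >= 2 * umi_counts[b] - 1:
                    parent[b] = root
                    queue.append(b)
    return parent

def chimeric_filter(molecules, read_ratio_cutoff=0.5):
    """Keep (barcode, umi, feature) entries holding a large enough share of reads.

    molecules: {(barcode, umi, feature): read_count}
    ASSUMPTION: the ratio is reads / total reads of the same barcode+UMI,
    and an entry is kept only when the ratio is strictly above the cutoff.
    """
    totals = defaultdict(int)
    for (barcode, umi, _feature), reads in molecules.items():
        totals[(barcode, umi)] += reads
    return {
        key: reads
        for key, reads in molecules.items()
        if reads / totals[key[:2]] > read_ratio_cutoff
    }
```

For example, with counts `{"ACGT": 10, "ACGA": 4, "ACCA": 1, "TTTT": 3}`, the first three UMIs collapse onto `ACGT` while `TTTT` stays separate. In the filter, a Barcode+UMI combination whose reads are split 9 to 1 between two features keeps only the feature with 9 reads.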
Other Important Changes
- Ignore UMIs containing `N`'s when processing reads.
- If `--max-mismatch-feature` is non-zero, add mutated indexes in BFS order (previously DFS).
  - Due to the BFS order, if the specified `--max-mismatch-feature` is too high, reset it to a lower mismatch (i.e. the smallest mismatch that encounters ambiguous mutated feature sequences), instead of failing.
- Remove `--max-mismatch-cell` and `--umi-length`, and make them decided by the chemistry type.
- Remove `--feature`, and make feature type a required input. Available options: `hashing`, `citeseq`, `cmo`, `crispr`, and `adt` (when both `hashing` and `citeseq` features are in the same sample).
- Add `--genome` option to allow writing the genome reference name to the output count matrices.
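The BFS change can be illustrated with a small Python sketch. It expands feature barcodes one mismatch level at a time and stops at the first level where a mutated sequence is reachable from two different features. This is only a reading of the note above, not the software's code: `build_mismatch_index` is a hypothetical name, and dropping the ambiguous sequences while keeping the unambiguous ones at the stopping level is an assumption.

```python
def build_mismatch_index(features, max_mismatch, alphabet="ACGT"):
    """Return (index, effective_mismatch).

    index maps each sequence within the effective mismatch of a feature
    barcode to that barcode. Expansion is breadth-first: all sequences at
    mismatch level d are added before any at level d + 1.
    """
    index = {seq: seq for seq in features}  # level 0: exact matches
    frontier = dict(index)                  # sequences first reached at the previous level
    for level in range(1, max_mismatch + 1):
        owners = {}                         # mutated sequence -> features that reach it
        for seq, feat in frontier.items():
            for i, base in enumerate(seq):
                for sub in alphabet:
                    if sub != base:
                        mutated = seq[:i] + sub + seq[i + 1:]
                        if mutated not in index:
                            owners.setdefault(mutated, set()).add(feat)
        # ASSUMPTION: ambiguous sequences (more than one owner) are dropped.
        frontier = {m: next(iter(f)) for m, f in owners.items() if len(f) == 1}
        index.update(frontier)
        if len(frontier) < len(owners):
            # Ambiguity first appears here, so stop instead of failing.
            return index, level
    return index, max_mismatch
```

With features `AAAA` and `TTTT` and a requested maximum of 3 mismatches, the sketch stops at level 2, because sequences such as `AATT` are two mismatches away from both features. A depth-first expansion would not visit sequences in order of mismatch count, so it could not find this stopping level as directly.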
Output Format Changes
- Count matrices are in sparse format and in 10x hdf5 format.
- UMI tables are in a simplified 10x hdf5 format (`.molecule_info.h5`), instead of `.stat..csv.gz`:
  - Datasets `/barcode_idx` and `/barcodes`: `/barcode_idx` stores each molecule's cell barcode index, with name found in `/barcodes` via this index.
  - Datasets `/feature_idx` and `/features`: `/feature_idx` stores each molecule's feature index, with name found in `/features` via this index.
  - Dataset `/umi`: Each molecule's UMI sequence as a string.
  - Dataset `/count`: Each molecule's read count as an integer.
- For `crispr` samples, 3 count matrices are generated: `.raw.h5` for raw count matrix, `.umi_correct.h5` for count matrix after UMI correction, `.chimeric_filtered.h5` for count matrix after UMI correction + PCR chimeric filtering.
- For other antibody type samples, 2 count matrices are generated: `.raw.h5` for raw count matrix, `.umi_correct.h5` for count matrix after UMI correction.
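The index-based layout of the UMI table can be sketched in plain Python. The dictionary below stands in for the HDF5 datasets named above, so the example runs without an HDF5 library. All barcode, feature, and UMI values are made up, and counting one UMI per molecule row is an assumption about how the table relates to the count matrices.

```python
from collections import Counter

# Toy stand-ins for the datasets; the values are hypothetical.
molecule_info = {
    "barcodes": ["AAACCTGA", "TTTGGTCA"],
    "features": ["sgRNA_1", "sgRNA_2"],
    "barcode_idx": [0, 0, 1],
    "feature_idx": [0, 1, 1],
    "umi": ["ACGTACGT", "TTGCAAGC", "GGATCCAA"],
    "count": [12, 3, 7],
}

def iter_molecules(info):
    """Yield (barcode, feature, umi, read_count), resolving indices to names."""
    rows = zip(info["barcode_idx"], info["feature_idx"], info["umi"], info["count"])
    for b, f, umi, reads in rows:
        yield info["barcodes"][b], info["features"][f], umi, reads

def umi_counts(info):
    """Sparse feature-by-barcode UMI counts: one count per molecule row."""
    return Counter((feature, barcode) for barcode, feature, _, _ in iter_molecules(info))
```

Storing indices rather than repeated strings keeps the per-molecule datasets compact, since each barcode and feature name is written only once.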
Bug Fix
- Fix a bug of indexing `-1` when processing feature barcode files of 3 columns (i.e. containing a modality column).
- Fix a bug in chemistry auto-detection that, in rare cases, caused the software to fail to detect low-quality samples whose top 2 chemistries have similar numbers of matched reads in cell barcodes.