Open
Description
Benchmarking against commonly used libraries using public data. The details listed here are a starting point and may change.
Data:
Start with the dataset from Cleveland, Hickey and Forni (2012), a widely used pig dataset with 6473 pedigreed individuals, of which 3534 have been genotyped. The genotype data include 52843 biallelic SNVs with missing values already imputed.
Scope:
Commonly used estimators applied to diploid data.
- genomic_relationship(estimator='VanRaden')
- pedigree_kinship(method='diploid')
- hybrid_relationship()
- pc_relate() (?)
- pedigree_inbreeding() (? not a matrix)
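For orientation on the two pedigree-based entries, here is a minimal pure-Python sketch of diploid kinship computed with the standard tabular method. It is not sgkit's implementation. The function name `kinship_tabular`, the `-1` code for an unknown parent, and the toy pedigree are all assumptions made for illustration. It also shows why inbreeding is a vector rather than a matrix: it can be read off the diagonal of the kinship matrix.

```python
def kinship_tabular(parents):
    """Diploid kinship via the tabular method (illustrative sketch only).

    parents[i] is a (sire, dam) pair of row indices, with -1 marking an
    unknown parent (a convention assumed for this sketch). Individuals must
    be ordered so that parents appear before their offspring.
    """
    n = len(parents)
    K = [[0.0] * n for _ in range(n)]

    def k(a, b):
        # Unknown parents are treated as unrelated, non-inbred founders.
        return 0.0 if a < 0 or b < 0 else K[a][b]

    for i, (s, d) in enumerate(parents):
        for j in range(i):
            # Kinship with an earlier individual is the mean of that
            # individual's kinship with each parent of i.
            K[i][j] = K[j][i] = 0.5 * (k(j, s) + k(j, d))
        # Self-kinship depends on the kinship between the two parents.
        K[i][i] = 0.5 * (1.0 + k(s, d))
    return K


# Toy pedigree: two founders, two full sibs, and one offspring of the sibs.
parents = [(-1, -1), (-1, -1), (0, 1), (0, 1), (2, 3)]
K = kinship_tabular(parents)

# Inbreeding is one value per individual, derived from the diagonal.
inbreeding = [2.0 * K[i][i] - 1.0 for i in range(len(parents))]
print(inbreeding)
```

A benchmark of the real functions would replace this with calls into each library; the sketch only pins down what quantity is being compared.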
Comparisons:
- R:
  - AGHmatrix
  - sommer
  - ASRGenomics (calls AGHmatrix for some functions)
Notes/Concerns:
- VanRaden estimator:
  - Performance will largely be measuring a single matrix multiplication and hence this will primarily be a test of underlying linalg libs. These are highly parallelized so dask won't improve performance on a single machine.
  - Some of the commonly used implementations (especially AGHmatrix) do a lot of additional validation and computation such as mean imputation and filtering by minor allele frequency. This requires some thought about what constitutes a reasonable comparison. One option would be to time the equivalent operations in sgkit. However, these operations being optional in sgkit is a strength that we should emphasize (they are often unnecessary).
  - Potential to emphasize the advantage of dask here if we have a suitably large dataset that benefits from being spread across multiple nodes.
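To make the comparison question concrete, here is a rough NumPy sketch that times the optional preprocessing steps separately from the core VanRaden computation. It is not the sgkit or AGHmatrix code. The helper names, the 0.05 MAF threshold, the simulated dosages, and the use of sample allele frequencies in place of ancestral frequencies are all assumptions made for illustration.

```python
import time

import numpy as np


def mean_impute(dosage):
    """Optional step: replace missing (NaN) dosages with the per-variant mean."""
    means = np.nanmean(dosage, axis=0)
    return np.where(np.isnan(dosage), means, dosage)


def maf_filter(dosage, min_maf=0.05):
    """Optional step: keep variants with minor allele frequency >= min_maf."""
    p = dosage.mean(axis=0) / 2
    maf = np.minimum(p, 1 - p)
    return dosage[:, maf >= min_maf]


def vanraden(dosage):
    """Core step: G = Z Z^T / (2 * sum(p * (1 - p))) for diploid dosages.

    Sample allele frequencies stand in for ancestral frequencies here,
    which is an assumption of this sketch.
    """
    p = dosage.mean(axis=0) / 2
    z = dosage - 2 * p  # centre each variant
    return (z @ z.T) / (2 * np.sum(p * (1 - p)))  # the single large matmul


def timed(func, *args):
    """Return the result of func(*args) and the elapsed wall time in seconds."""
    start = time.perf_counter()
    out = func(*args)
    return out, time.perf_counter() - start


# Simulated data (samples x variants), much smaller than the pig dataset.
rng = np.random.default_rng(0)
n_samples, n_variants = 200, 2000
freqs = rng.uniform(0.05, 0.95, size=n_variants)
dosage = rng.binomial(2, freqs, size=(n_samples, n_variants)).astype(float)
dosage[rng.random(dosage.shape) < 0.01] = np.nan  # sprinkle missing calls

imputed, t_impute = timed(mean_impute, dosage)
filtered, t_filter = timed(maf_filter, imputed)
G, t_core = timed(vanraden, filtered)

print(f"impute: {t_impute:.4f}s  filter: {t_filter:.4f}s  core: {t_core:.4f}s")
```

Reporting the three timings separately would let a benchmark compare the core estimator like for like while still showing what the extra steps cost in libraries that always run them.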