Benchmark relationship matrices

Benchmark the relationship matrix estimators against commonly used libraries using public data. The details listed here are a starting point and may change.


**Data:**
Start with [Cleveland, Hickey and Forni (2012)](https://pubmed.ncbi.nlm.nih.gov/22540034/), a widely used pig dataset with 6473 pedigreed individuals, of which 3534 have been genotyped. The genotype data include 52843 biallelic SNVs with missing values already imputed.


**Scope:**
Commonly used estimators applied to diploid data.

- `genomic_relationship(estimator='VanRaden')`
- `pedigree_kinship(method='diploid')`
- `hybrid_relationship()`
- `pc_relate()` (?)
- `pedigree_inbreeding()` (? not a matrix)


**Comparisons:**
- R:
  - AGHmatrix
  - sommer
  - ASRGenomics (calls AGHmatrix for some functions)


**Notes/Concerns:**

- VanRaden estimator:
  - Performance will largely reflect a single matrix multiplication, so this will primarily be a test of the underlying linear algebra libraries. These are already highly parallelized, so dask is unlikely to improve performance on a single machine.
  - Some of the commonly used implementations (especially AGHmatrix) do a lot of additional validation and computation such as mean imputation and filtering by minor allele frequency. This requires some thought about what constitutes a reasonable comparison. One option would be to time the equivalent operations in sgkit. However, these operations being optional in sgkit is a strength that we should emphasize (they are often unnecessary).
  - Potential to emphasize the advantage of dask here if we have a suitably large dataset that benefits from being spread across multiple nodes.
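
To make the notes above concrete, here is a minimal NumPy sketch of the first VanRaden (2008) estimator, with mean imputation and MAF filtering written as separate optional steps so each can be timed on its own. This is not sgkit's or AGHmatrix's implementation. The function names (`mean_impute`, `maf_filter`, `vanraden_grm`), the samples-by-variants layout, the `min_maf` default, and estimating allele frequencies from the sample itself are all assumptions made for illustration.

```python
import numpy as np


def mean_impute(dosage):
    """Replace missing calls (NaN) with the per-variant mean dosage."""
    dosage = np.asarray(dosage, dtype=float)
    variant_means = np.nanmean(dosage, axis=0)
    return np.where(np.isnan(dosage), variant_means, dosage)


def maf_filter(dosage, min_maf=0.05):
    """Keep variants whose minor allele frequency is at least min_maf."""
    dosage = np.asarray(dosage, dtype=float)
    freq = dosage.mean(axis=0) / 2
    maf = np.minimum(freq, 1 - freq)
    return dosage[:, maf >= min_maf]


def vanraden_grm(dosage):
    """VanRaden method 1: G = Z Z^T / (2 * sum(p * (1 - p))).

    dosage is a (samples, variants) array of alternate allele counts
    (0, 1 or 2 for diploids). Frequencies p are estimated from the
    sample here, which is an assumption of this sketch.
    """
    dosage = np.asarray(dosage, dtype=float)
    p = dosage.mean(axis=0) / 2
    z = dosage - 2 * p  # center each variant at its expected dosage
    # The matrix product below dominates the runtime for large inputs.
    return (z @ z.T) / (2 * np.sum(p * (1 - p)))


# Small simulated example (hypothetical data, not the pig dataset).
rng = np.random.default_rng(0)
dosage = rng.binomial(2, 0.3, size=(50, 200)).astype(float)
grm = vanraden_grm(maf_filter(mean_impute(dosage)))
```

Keeping the preprocessing steps as separate functions means a benchmark can report the matrix product alone as well as the full pipeline that some libraries run by default.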

