Benchmark relationship matrices #42

@timothymillar

Benchmarking against commonly used libraries using public data. The details listed here are a starting point and may change.

Data:
Start with Cleveland, Hickey and Forni (2012), a widely used pig dataset with 6473 pedigreed individuals, of which 3534 have been genotyped. The genotype data includes 52843 biallelic SNVs with missing values already imputed.

Scope:
Commonly used estimators applied to diploid data.

  • genomic_relationship(estimator='VanRaden')
  • pedigree_kinship(method='diploid')
  • hybrid_relationship()
  • pc_relate() (?)
  • pedigree_inbreeding() (? not a matrix)

Comparisons:

  • R:
    • AGHmatrix
    • sommer
    • ASRGenomics (calls AGHmatrix for some functions)

Notes/Concerns:

  • VanRaden estimator:
    • The benchmark will largely be timing a single matrix multiplication, so it will primarily test the underlying linear algebra libraries. These are already highly parallelized, so dask is unlikely to improve performance on a single machine.
    • Some of the commonly used implementations (especially AGHmatrix) do a lot of additional validation and computation, such as mean imputation and filtering by minor allele frequency. This requires some thought about what constitutes a reasonable comparison. One option would be to time the equivalent operations in sgkit. However, the fact that these operations are optional in sgkit is a strength that we should emphasize, since they are often unnecessary.
    • Potential to emphasize the advantage of dask here if we have a suitably large dataset that benefits from being spread across multiple nodes.
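
To make the comparison question above concrete, here is a minimal NumPy sketch (not sgkit's or AGHmatrix's implementation) that times the optional preprocessing steps separately from the core VanRaden computation. It assumes diploid alternate-allele dosages in a samples-by-variants array and uses the first VanRaden (2008) estimator with sample allele frequencies. The helper names (`maf_filter`, `mean_impute`, `vanraden_grm`, `timed`), the 0.05 MAF threshold, and the synthetic data are all hypothetical and stand in for the real dataset.

```python
import time

import numpy as np


def maf_filter(dosage, min_maf=0.05):
    """Drop variants whose minor allele frequency is below min_maf (assumed threshold)."""
    freq = np.nanmean(dosage, axis=0) / 2  # alternate-allele frequency per variant
    maf = np.minimum(freq, 1 - freq)
    return dosage[:, maf >= min_maf]


def mean_impute(dosage):
    """Replace missing (NaN) dosages with the per-variant mean dosage."""
    col_mean = np.nanmean(dosage, axis=0)
    return np.where(np.isnan(dosage), col_mean, dosage)


def vanraden_grm(dosage):
    """First VanRaden (2008) estimator: G = Z Z' / (2 * sum(p * (1 - p)))."""
    dosage = np.asarray(dosage, dtype=np.float64)
    freq = dosage.mean(axis=0) / 2          # sample allele frequencies
    centered = dosage - 2 * freq            # Z: dosages centered per variant
    denom = 2 * np.sum(freq * (1 - freq))
    return centered @ centered.T / denom    # the single large matrix multiplication


def timed(func, *args, **kwargs):
    """Return (result, elapsed seconds) for one call of func."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Synthetic stand-in data; the real benchmark would load the pig genotypes instead.
rng = np.random.default_rng(0)
n_samples, n_variants = 200, 2000
freq_true = rng.uniform(0.05, 0.95, size=n_variants)
dosage = rng.binomial(2, freq_true, size=(n_samples, n_variants)).astype(np.float64)

filtered, t_filter = timed(maf_filter, dosage)
imputed, t_impute = timed(mean_impute, filtered)
grm, t_grm = timed(vanraden_grm, imputed)

print(f"MAF filter:  {t_filter:.4f} s")
print(f"imputation:  {t_impute:.4f} s")
print(f"VanRaden G:  {t_grm:.4f} s")
```

Reporting the three timings separately would let the core estimator be compared like for like, while still showing what the extra validation steps cost in implementations that always run them.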
