Data Quality Interest Group Annual Report for 2022
Lee Belbin, Arthur Chapman, Paul Morris, John Wieczorek
Task Group 2 has been active since January 2017, about four years longer than its four main members would have anticipated. We all thought "How hard could it be?" The answer was "Harder than we thought!" We have invested well over two years of full-time effort in this project. There were multiple times over the past five years when we thought we were 95% done, but we were wrong. Were we dumb? I doubt it. The authors (other than the lead author) are highly experienced in biodiversity data quality, Darwin Core and data testing. Nor were we lazy.
Why has it gone so slowly? Mostly because of the complexity of the task and our inability to meet face-to-face. Zoom just doesn't cut it for this type of work. We achieved the most at our one face-to-face meeting in Gainesville in 2018. There are, we hope, useful lessons in this for similar projects.
We now have a solid base that makes future evolution, such as tests for specific environments, relatively easy. The major components of this project are the 99 tests themselves (https://github.com/tdwg/bdq/issues?q=is%3Aissue+is%3Aopen+label%3ATest), the parameters for these tests (see https://github.com/tdwg/bdq/issues/122), a vocabulary of the terms used in the framework (https://github.com/tdwg/bdq/issues/152) and the test data (https://github.com/tdwg/bdq/tree/master/tg2/core/testdata).
We remain focused on what we call core tests: those that provide power in evaluating 'data quality'/'fitness for use', are widely applicable, and are relatively easy to implement. Each test specification we have settled on consists of: a label (split into a test class, a target Darwin Core term and an 'action'); a GUID; a simple English description; a test class (validation, amendment, measure or issue); the applicable Darwin Core class; Information Elements (the Darwin Core terms required for the test); an Expected Response (an explanation of how the test works from an implementation perspective); a data quality dimension (from the Fitness for Use Framework); a warning type; parameters (implementation-dependent options); a source authority (external references required); an example data scenario; a source (the origin of the test); references; a link to reference implementations; a link to source code; and notes (explanations of subtle or not-so-subtle aspects of the test).
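The specification components above can be sketched as a simple record. This is an illustrative example only: the field names and the sample values (including the placeholder GUID) are our own choices for this sketch, not the normative TG2 record for any test.

```python
# A sketch of one test specification, following the components listed above.
# Values are illustrative placeholders, not the normative TG2 record.
test_spec = {
    "label": "VALIDATION_COUNTRYCODE_NOTEMPTY",  # testclass_TERM_action
    "guid": "00000000-0000-0000-0000-000000000000",  # placeholder, not a real GUID
    "description": "Is there a value in dwc:countryCode?",
    "test_class": "Validation",  # validation, amendment, measure or issue
    "darwin_core_class": "dwc:Location",
    "information_elements": ["dwc:countryCode"],  # Darwin Core terms required
    "expected_response": (
        "INTERNAL_PREREQUISITES_NOT_MET if dwc:countryCode is not present; "
        "COMPLIANT if dwc:countryCode is not EMPTY; otherwise NOT_COMPLIANT"
    ),
    "dimension": "Completeness",  # from the Fitness for Use Framework
    "warning_type": "Missing",
    "parameters": [],        # implementation-dependent options
    "source_authority": None  # external references, if any
}

# The label encodes the test class, the target term and the action:
test_class, term, action = test_spec["label"].split("_", 2)
print(test_class, term, action)  # VALIDATION COUNTRYCODE NOTEMPTY
```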
Our 99 tests have been stable for over a year. We have generated most of the test data using an agreed template that contains, for each test data scenario: the fields that identify the test the data applies to; an identifier; a within-test identifier (e.g., test data scenario 3 for test #25); the input data; the expected output data; the response status (e.g., "internal prerequisites not met"); the response result (e.g., "not compliant"); and a comment.
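One row of such a template, and the kind of check an implementer might run against it, can be sketched as below. The field names, status and result strings, and the `check` helper are hypothetical illustrations of the template fields described above, not the actual TG2 test-data format.

```python
# A sketch of one test-data scenario row, following the template fields
# described above. Field names and values are illustrative, not normative.
scenario = {
    "test_label": "VALIDATION_COUNTRYCODE_NOTEMPTY",  # identifies the test
    "scenario_id": "999",               # identifier (placeholder)
    "within_test_id": 3,                # e.g., scenario 3 for this test
    "input": {"dwc:countryCode": ""},   # the input data
    "expected_output": {},              # expected output data (none here)
    "response_status": "RUN_HAS_RESULT",    # the response status
    "response_result": "NOT_COMPLIANT",     # the response result
    "comment": "dwc:countryCode is present but EMPTY",
}

def check(actual_status: str, actual_result: str, scenario: dict) -> bool:
    """Compare an implementation's response to the expected scenario outcome."""
    return (actual_status == scenario["response_status"]
            and actual_result == scenario["response_result"])

# An implementation returning the expected status and result passes:
print(check("RUN_HAS_RESULT", "NOT_COMPLIANT", scenario))  # True
```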
What remains to be done? We need to complete the test data, gather non-normative documentation, and transform our work into a BIS Standard. TG2 is over 95% complete, so we would still welcome anyone who is interested in learning about biodiversity data quality and contributing.