Annual Report TDWG Data Quality Interest Group for 2020
The TDWG-GBIF Data Quality Interest Group (DQIG) was formally approved by the TDWG Executive in October 2014. Arthur Chapman and Antonio Saraiva are elected co-coordinators of the group.
Seminars and Interest Group meetings were held at Biodiversity Next (TDWG2019) in Leiden and at the Virtual TDWG2020 Working Sessions.
The Interest Group manages four Task Groups. Three Task Groups were established in 2014, and a fourth in 2018, viz
- TG1 – A Framework for Data Quality – leader: Allan Koch Veiga
- TG2 – Data Quality Tests and Assertions – leader: Lee Belbin
- TG3 – Use Case Library – leader: Miles Nicholls
- TG4 – Vocabularies of Values – leader: Paula Zermoglio
Discussions with the broad membership are held on the Data Quality GitHub (bdq) repository, which has been established on the TDWG GitHub site. The repository has been used extensively over the past twelve months, especially by Task Group 2 (see Issues). In total, 187 Issues have been raised: 130 are currently open (99 of which are the current Tests) and a further 57 have been closed. All documents related to the Interest Group are linked in the Wiki. A TDWG Data Quality Interest Group Listserve (tdwg-bdq) is also used to provide another form of communication on activities.
In 2020, the Data Quality Interest Group published a paper in the TDWG Journal Biodiversity Information Science and Standards: Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data (https://doi.org/10.3897/biss.4.50889).
Significant highlights for the year were
- Publication of the paper Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data
- This has been a busy year of work on Task Group 2 (see below).
- Plans to wind up Task Groups 1 and 3.
- A lengthy meeting of the Interest Group was held in a café in Leiden in October 2019.
- Significant participation at the BiodiversityNext conference in Leiden (October 2019).
- Significant participation at the TDWG Virtual Working Sessions in September 2020.
In the past twelve months, the DQIG has held numerous online meetings using Zoom, although time-zone differences do not always allow these to be as efficient as they might otherwise be.
One physical meeting was held during the year.
- A planned meeting in Bariloche, Argentina in March 2020 had to be cancelled at the last minute due to the COVID-19 pandemic, which meant that a lot more work had to be carried out with virtual meetings and discussions on Zoom.
- A productive (6hr) meeting was held in Leiden during the Biodiversity Next conference. Tim Robertson and John Waller from GBIF were able to attend parts of the meeting, which provided significant interchange with GBIF. A report of the meetings held in conjunction with Biodiversity Next can be seen on the BDQ Wiki at https://github.com/tdwg/bdq/wiki/Summary-of-meetings-held-during-Biodiversity-Next-in-Leiden,-October-2019
- A Pre-Conference Workshop (Best Practices for Development of Vocabularies of Value) was held at the Biodiversity Next Conference in Leiden in October 2019; see the report at https://github.com/tdwg/bdq/wiki/Summary-of-meetings-held-during-Biodiversity-Next-in-Leiden,-October-2019 .
- The Vocabularies Task Group (TG4) held a Workshop at the TDWG Working Sessions on 24 September 2020.
- Interest Group members participated in many of the sessions of other Task Groups and Interest Groups at TDWG Working Sessions on 25 September 2020. The Annotations IG is being restarted and this will be important for the BDQ IG – especially with respect to the Annotations from TG2.
- Due to COVID-19, there have been few interactions with other groups and most of the work of the Interest Group has been carried out virtually.
Task Group 1 is led by Allan Koch Veiga from the University of São Paulo, Brazil.
Develop a conceptual framework that serves as a common ground for a collaborative mapping of DQ needs and DQ methods, tools and DQ reports for DQ Assessment and Management based on data fitness for use.
- The main goal of the task group has been achieved and work is currently underway to finalise the Task Group and get ready for closing it.
- A contribution was made on the Framework (Section 3) in Chapman et al. (2020). This, along with the previous paper (Veiga et al. 2017), will form the basis of the final report for the Task Group.
- A final recommendation from the Task Group will be for a new Task Group to develop a representation of the framework as a TDWG Technical Specification (see Task Group 5, below).
It is proposed that the next steps be:
- Consolidate and document all the outcomes from Task Group 1 in a comprehensive public final report.
- It is planned to wind up TG1 in the next six months.
- A proposed new Task Group (see Task Group 5, below) to develop a representation of the framework as a TDWG Technical Specification.
Task Group 2 is led by Lee Belbin who is the Science Advisor to The Atlas of Living Australia and who works from Tasmania, Australia.
To report evaluations of ‘data quality’ of occurrence records by developing:
- A standard core (fundamental) set of tests and associated assertions based on Darwin Core terms
- A standard suite of fields to describe each test
- A broad deployment of the tests, from collector to aggregator
- A set of principles resulting from developing the tests
- Software that provides an example implementation of each test
- Test data that can be used to validate a test installation
- A publication that captures the knowledge that has been established during the creation of the tests/assertions
The core tests have proven to be far more complex than any of the team had anticipated. Several times over the past three years we believed we had finalized the tests, only to find new issues that required a fresh understanding and subsequent edits. One example is the recent dropping of the two tests related to dwc:identificationQualifier:
- TG2-VALIDATION_IDENTIFICATIONQUALIFIER_DETECTED (https://github.com/tdwg/bdq/issues/97) and
- TG2-AMENDMENT_IDENTIFICATIONQUALIFIER_FROM_TAXON (https://github.com/tdwg/bdq/issues/106).
This decision resulted from a review of dwc:identificationQualifier values in GBIF records and an evaluation, by Paula Zermoglio, of expected values based on the Darwin Core definition of the term. Aside from there being many values, the term expects the qualifier to be given in relation to a particular taxonomic name, and the rules of open nomenclature are too unevenly adopted across data records for dwc:identificationQualifier to be parsed and detected reliably enough for these tests to be effective.
Similarly, a more recent review of the NAME-related tests initially detected an absence of a test for SCIENTIFICNAME_EMPTY to match those for higher taxonomic units. We had used the term ‘polynomial’ to mean the non-authorship part of dwc:scientificName.
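To make the shape of such a test concrete, the following is a minimal sketch of an emptiness check on dwc:scientificName. It is not the Task Group's implementation: the function name, the dictionary-based record and the result labels (`COMPLIANT`, `NOT_COMPLIANT`) are assumptions made for illustration only.

```python
from typing import Mapping, Optional

# Hypothetical result labels; the report does not specify the response vocabulary.
COMPLIANT = "COMPLIANT"
NOT_COMPLIANT = "NOT_COMPLIANT"


def scientific_name_not_empty(record: Mapping[str, Optional[str]]) -> dict:
    """Report whether dwc:scientificName holds a non-blank value."""
    value = record.get("dwc:scientificName")
    # A missing key, None, or a whitespace-only string all count as empty.
    is_empty = value is None or value.strip() == ""
    return {
        "test": "SCIENTIFICNAME_EMPTY",  # label taken from the text above
        "term": "dwc:scientificName",
        "result": NOT_COMPLIANT if is_empty else COMPLIANT,
    }


if __name__ == "__main__":
    print(scientific_name_not_empty({"dwc:scientificName": "Puma concolor"}))
    print(scientific_name_not_empty({"dwc:scientificName": "   "}))
```

Returning a small structured result rather than a bare boolean keeps the test label and the term attached to the outcome, in the spirit of reporting assertions alongside the record.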
What has occurred during the past year:
- Months of work on discussions and edits to the GitHub issues (=tests), mainly via Zoom and email.
- We had hoped to have a face-to-face meeting in Bariloche early in 2020, but the coronavirus pandemic prevented that. This was unfortunate, as we needed this meeting to discuss the remaining complex issues noted above. Attempting to address such issues by Zoom has been far less efficient.
- We are occasionally revisiting decisions made earlier, an indication that we have been doing this work for many years.
- We have now standardized all the test parameters for the 99 CORE tests: https://github.com/tdwg/bdq/labels/Test. Much work has gone into standardizing the phrasing and terminology within the Expected response field of the tests.
- Two of the test fields that have taken most of our time to resolve have been ‘Parameters’ and what we now call ‘bdq:sourceAuthority’. These are now complete. The work on ‘Parameters’ has fed into TG4 on Vocabularies of Value (see https://github.com/tdwg/bdq/wiki/Vocabularies-needed-for-Darwin-Core-terms prepared by TG4).
- We have published the work from the Task Group (Section 5 in Chapman et al. 2020), and detailed the tests in a spreadsheet as Supplementary Material 4 to that paper.
- The vocabulary of terms has been extended (https://github.com/tdwg/bdq/issues/152) and published as Supplementary Material 1 in Chapman et al. 2020.
- Development of test datasets for the tests is continuing, with 30 tests covered as of 1 July 2020.
- Two Case Studies were conducted on a subset of the tests (using tests associated with the Date fields) – one using Kurator and one using Java. These were documented as Section 6 in Chapman et al. (2020) along with Supplementary Material 3 in that paper.
- We recognize the dependence on the work of the Annotations Interest Group for the results from the tests to have maximal impact. It is important that test results stay with the records. We are pleased to see that this Interest Group is likely to be reconstituted shortly.
- Some minor issues are still outstanding, but these should be addressed in coming months.
- The building of test datasets that can be used to validate the installation of the test code is ongoing. Discussions on building test data for each of the tests are in progress, with some issues still to be worked out on the structure of the test data.
- Code has been written to extract the parameters of each of the tests to RDF and we believe that this will form the basis of the proposed TDWG standard for the Tests and Assertions.
- Some work is still required on standardising responses arising from the tests/assertions. Results of discussions can be seen in https://github.com/tdwg/bdq/issues/142. This issue is linked with the work of the Annotations Interest Group (http://www.tdwg.org/activities/annotations/). This work had to be postponed because the Annotations IG has not had a leader over the past 18 months. We believe that the assertions arising from the tests should conform with the standard that will be proposed by that Interest Group. This will, we hope, mean that assertions are always retained with (and travel with) the associated data.
- Develop a technical specification (see Task Group 5, below)
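The extraction of test parameters to RDF mentioned above can be illustrated with a small sketch. It is not the code written by the Task Group: the `QualityTestDescriptor` class, the `https://example.org/bdq-sketch#` namespace and the property names are hypothetical, chosen only to mirror the fields named in this report (‘Parameters’, ‘bdq:sourceAuthority’ and the Expected response). The output is plain N-Triples built with the standard library.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical namespace used only for this sketch; not a TDWG-issued IRI.
EX = "https://example.org/bdq-sketch#"


@dataclass
class QualityTestDescriptor:
    """Illustrative subset of the fields that describe one test."""
    label: str  # assumed to contain only IRI-safe characters
    expected_response: str
    parameters: List[str] = field(default_factory=list)
    source_authority: str = ""


def _literal(text: str) -> str:
    """Escape a string as an N-Triples literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_ntriples(descriptor: QualityTestDescriptor) -> List[str]:
    """Serialise one descriptor as a list of N-Triples statements."""
    subject = f"<{EX}{descriptor.label}>"
    triples = [
        f"{subject} <{EX}expectedResponse> {_literal(descriptor.expected_response)} ."
    ]
    for parameter in descriptor.parameters:
        triples.append(f"{subject} <{EX}parameter> {_literal(parameter)} .")
    if descriptor.source_authority:
        triples.append(
            f"{subject} <{EX}sourceAuthority> {_literal(descriptor.source_authority)} ."
        )
    return triples


if __name__ == "__main__":
    example = QualityTestDescriptor(
        label="EXAMPLE_TEST",
        expected_response="placeholder response text",
        parameters=["exampleParameter"],
    )
    print("\n".join(to_ntriples(example)))
```

A real specification would use the vocabulary IRIs agreed by the group; the point here is only that each descriptor field maps to one statement about the test.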
Task Group 3 is led by Miles Nicholls from The Atlas of Living Australia, Canberra, Australia.
To document use cases in a structured format based on Toward A Conceptual Framework for the Assessment and Management of the Fitness for Use of Biodiversity Data (Veiga et al. 2017). The use case template will be placed in a collaborative editing environment for completion and discussion.
- The collection of data quality use cases was conducted through the analysis of a number of published papers and a series of interviews and surveys. The information was collected using a use case template and collated into a Use Case library. The details of this process are described in Section 4 (Collecting User Stories that lead to Data Quality Profiles) of Chapman et al. (2020). Details are documented in Supplementary Material 2 to Chapman et al. 2020.
- The concept of Data Quality use cases is currently being implemented in the Atlas of Living Australia Data Quality project as “Data Quality Profiles”: sets of pre-defined data filters that can be applied to search results. More information on the project and test environment can be found here: https://www.ala.org.au/data-quality-project/
- As per the goals described in the task group charter, a set of use cases in use by agencies and user communities has been assembled, and so this Task Group is being wound up. Potential future activities in this space would be to generate and validate a small set of standard data quality profiles for common use cases, as well as a default profile that would be broadly applicable.
- It is planned to wind up TG3 in the next few months.
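The idea of a data quality profile as a set of pre-defined filters applied to search results can be sketched as follows. This is an illustration only, not the Atlas of Living Australia implementation: the profile contents, filter names and `apply_profile` helper are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Mapping

Record = Mapping[str, object]
Filter = Callable[[Record], bool]

# Hypothetical profile: filter names and rules are invented for illustration.
EXAMPLE_PROFILE: Dict[str, Filter] = {
    "has_coordinates": lambda r: r.get("dwc:decimalLatitude") is not None
    and r.get("dwc:decimalLongitude") is not None,
    "has_event_date": lambda r: bool(r.get("dwc:eventDate")),
}


def apply_profile(records: Iterable[Record], profile: Dict[str, Filter]) -> List[Record]:
    """Keep only the records that pass every filter in the profile."""
    return [r for r in records if all(check(r) for check in profile.values())]


if __name__ == "__main__":
    results = [
        {"dwc:decimalLatitude": -42.9, "dwc:decimalLongitude": 147.3,
         "dwc:eventDate": "2020-07-01"},
        {"dwc:decimalLatitude": None, "dwc:decimalLongitude": 147.3,
         "dwc:eventDate": "2020-07-01"},
    ]
    print(len(apply_profile(results, EXAMPLE_PROFILE)))
```

Keeping each filter as a named entry makes it straightforward to define several profiles for different use cases, or a default profile, by assembling different subsets of the same filters.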
Task Group 4 is led by Paula Zermoglio from VertNet, Argentina.
To create a framework within which to build biodiversity data vocabularies, particularly by developing a standard format (best current practice) for building TDWG vocabularies concerning the values used under the Darwin Core terms.
The activities of this task group have been greatly delayed due to personal affairs of the convener. The main outcome expected for this year, namely the development of a best current practice for building of TDWG vocabularies, has not yet been achieved. A revised timeline for accomplishing this task is proposed below.
With respect to the activities proposed in the Charter, during the reporting period the group has:
- Refined the previously made list of vocabularies needed for terms of the Darwin Core standard, based on the standard itself and on the needs identified and revised by Task Group 2, Tests and Assertions. https://github.com/tdwg/bdq/wiki/Vocabularies-needed-for-Darwin-Core-terms
- Continued to collect and assess already existing vocabularies across the community. These are gathered in a Google Spreadsheet, linked on the GitHub Repository. https://docs.google.com/spreadsheets/d/1SDbtZxEzg0t10OSNDPJN0XSye6mMOTTCIBH3xh-HUYA/edit?usp=sharing
- Updated the state of data shared through aggregators in relation to the use of controlled values, in particular those shared through the GBIF network. https://github.com/tdwg/dwc-qa/tree/master/data
- Organized a session during TDWG 2020 Working Sessions (ITG04). There were around 88 participants.
The workplan for the next twelve months includes the activities proposed in the charter:
- Development of a best current practice for building of TDWG vocabularies and an exemplary vocabulary (in progress).
- Collection and assessment of already existing vocabularies across the community (ongoing).
- Identification of domain-specific groups that may be involved in the preparation of vocabularies (in progress).
The activities of the Task Group have been greatly delayed. We here propose a new timeline for completing the above-mentioned activities:
| Activity | Oct 2020 | Nov 2020 | Dec 2020 | Jan 2021 | Feb 2021 | Mar 2021 | Apr 2021 | May 2021 |
|---|---|---|---|---|---|---|---|---|
| Development of a Draft Best Current Practice | X | X | X | X | ||||
| Submission for evaluation by the Executive Committee | X | |||||||
| Adjustments of the Draft based on the EC comments | X | X | ||||||
| Public review * | X | |||||||
| Development of exemplary vocabulary(ies) | X | X | X | X | X | |||
| Collection and assessment of already existing vocabularies | X | X | X | X | X | X | X | X |
| Identification of domain-specific groups that may be involved in the preparation of vocabularies | X | X | X | X | X | X | X | X |
* if no further adjustments are requested
- The Task Group does not plan to propose a new data standard or any modification to existing ones but intends to provide a best current practice for building TDWG vocabularies of values.
The goal of the proposed Task Group 5 is to develop Technical Specifications from the outputs of Task Groups 1 and 2.
- The first step will be to establish a Charter and have it approved by the TDWG Executive Committee.
- Provide a version freeze of the existing test descriptors, such that test implementors can work from a fixed common target.
- Develop, guided by the existing OWL representation of the framework, and the RDF representation of the test descriptions, a draft technical specification for the biodiversity data quality framework vocabulary and relations.
- Develop a draft specification on how to identify example, synthetic, and semi-synthetic biodiversity data.
It is planned that the next twelve months will be a major implementation phase, with continual liaison with key data aggregators (ALA, GBIF, iDigBio, SIBBR, OBIS), data custodians and data users.
- Finalize and get sign off on the CORE tests
- Finalise Code and the Test Datasets for the Tests and Assertions
- Submit Tests and Assertions as a TDWG Standard
- Submit Best Current Practice for building vocabularies of values as a TDWG BCP.
- Liaise with Annotations Interest Group on standardising Assertions as Annotations
- Continue liaison with ALA, GBIF, iDigBio and others on harmonizing/aligning Data Quality procedures.
- Encourage uptake of standard tests and assertions
- Outreach and dissemination of information
- Finalize, Document and Close Task Groups 1 and 3
- Initiate a new Task Group to develop a representation of the framework as a TDWG Technical Specification (details still to be worked out).
Chapman AD, Belbin L, Zermoglio PF, Wieczorek J, Morris PJ, Nicholls M, Rees ER, Veiga AK, Thompson A, Saraiva AM, James SA, Gendreau C, Benson A, Schigel D (2020). Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data. Biodiversity Information Science and Standards 4: e50889. https://doi.org/10.3897/biss.4.50889
- Chapman AD, Wieczorek JR, Belbin L, Morris PJ, Koch AK (2020b). Suppl. material 1: Vocabulary of Terms used for the TDWG Task Group on Data Quality Tests and Assertions. https://biss.pensoft.net/article/download/suppl/5449427/.
- Rees ER, Nicholls M (2020). Suppl. material 2: Data Quality Use Case Study Result. https://biss.pensoft.net/article/download/suppl/5255738/.
- Wieczorek JR (2020). Suppl. material 3: Counts of occurrence records in 2019-04-15 snapshot of GBIF for date-related validation tests. https://biss.pensoft.net/article/download/suppl/5255739/.
- Belbin L, Chapman AD, Wieczorek JR, Zermoglio PF, Morris PJ (2020). Suppl. material 4: TG2 Test Descriptions. https://biss.pensoft.net/article/download/suppl/5411743/.
Veiga AK, Saraiva AM, Chapman AD, Morris PJ, Gendreau C, Schigel D, Robertson TJ (2017). A conceptual framework for quality assessment and management of biodiversity data. PLOS ONE 12 (6): e0178731. https://doi.org/10.1371/journal.pone.0178731