Annual Report TDWG Data Quality Interest Group for 2019
The TDWG-GBIF Data Quality Interest Group (DQIG) was formally approved by the TDWG Executive in October 2014. Arthur Chapman and Antonio Saraiva are the elected co-coordinators of the group. Seminars and Interest Group meetings were held at TDWG2018 and are planned for TDWG2019.
The Interest Group manages four Task Groups. Three were established in 2014 and a fourth in 2018:
- TG1 – A Framework for Data Quality – leader: Allan Koch Veiga
- TG2 – Data Quality Tests and Assertions – leader: Lee Belbin
- TG3 – Use Case Library – leader: Miles Nicholls
- TG4 – Vocabularies of Values – leader: Paula Zermoglio
Discussions with the broad membership are held on the Data Quality GitHub repository (bdq), which has been established on the TDWG GitHub site. The repository has been used extensively over the past twelve months, especially by Task Group 2 (Issues). In total 181 Issues have been raised with 132 currently open (101 of which are the current Tests). A further 32 issues have been closed. All documents related to the Interest Group are linked in the Wiki. A TDWG Data Quality Interest Group Listserve (tdwg-bdq) is also used to provide another form of communication on activities.
The Interest Group is currently writing a paper for the TDWG journal Biodiversity Information Science and Standards, entitled Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data. It is hoped to have the paper completed and submitted shortly after the 2019 TDWG meeting. The paper has been held up by the need to finalize a Case Study, which in turn has led to some of the tests being refined.
Significant highlights for the year were:
- This has been a busy year of work on Task Group 2 (see below).
- A highlight was the joint TDWG/SPNHC meeting held in Dunedin in 2018. This was a good opportunity to hold discussions and to swap ideas with members of one of our larger user groups, the Natural History community. Some joint meetings were held on Data Quality and these were extremely informative and productive.
In the past twelve months, the DQIG has held many online meetings using Zoom, although time zones don’t always allow these to be as efficient as they might otherwise be. One physical meeting was held during the year.
- No meetings were held between TDWG Symposia this year, largely due to the cost. Arthur Chapman did travel to Bariloche, Argentina to work with John Wieczorek and Paula Zermoglio on a separate project, and this did afford time to also discuss some aspects of the work of the Interest Group.
- Since the last report, a full day meeting was held on the Sunday preceding TDWG2018 in Dunedin, New Zealand. Joint discussions were held with SPNHC members on data quality at the Dunedin meetings. A report on those and other meetings held at TDWG2018 can be found on the GitHub at https://github.com/tdwg/bdq/wiki/Summary-of-DQIG-meetings-held-during-TDWG2018,-Dunedin,-New-Zealand-(Aug-2018).
- Throughout the year there has been extensive interchange and liaison with key data aggregators and others, including GBIF, ALA, iDigBio, VertNet, OBIS, SANBI and Kurator.
- An informal meeting is planned for the Sunday before TDWG2019 in Leiden, The Netherlands, and a TG4 workshop on Vocabularies of Values will be held on the Monday.
- Several symposia were presented at TDWG2018 in Dunedin.
- In November 2018, Arthur Chapman ran a Data Fitness for Use training course for the South African National Biodiversity Institute (SANBI) in Kirstenbosch, Cape Town, South Africa. A presentation was made on the work of the Interest Group.
Task Group 1 is led by Allan Koch Veiga from the University of São Paulo, Brazil.
Develop a conceptual framework that serves as a common ground for a collaborative mapping of DQ needs and DQ methods, tools and DQ reports for DQ Assessment and Management based on data fitness for use.
Achievements 2018-2019:
- The main goal of the task group has been achieved and has been consolidated by the engagement of the other task groups on using the Framework as a common backbone, and by putting the framework in practice in initiatives such as the Kurator Project and the Online Pollen Catalogs Network (RCPol).
- A contribution is being made to a new paper, entitled Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data. This will spread and consolidate the principles and concepts in a more practical way.
- For the same purpose, in August 2018, the conceptual framework was presented in a webinar of the Biodiversity Informatics Training Curriculum (https://www.youtube.com/watch?v=FJ7HLjl5_fg).
- The framework was tested in a use case in the Museum of Comparative Zoology of Harvard University and by implementation in at least two projects (Kurator and RCPol). It was documented in a PhD thesis and in a full paper in PLOS ONE, and has been the subject of much discussion around DQ in the context of Biodiversity Informatics.
It is proposed that the next steps be:
- Consolidate and document all the outcomes around Task Group 1 in a comprehensive public final report
- Wind up the Task Group and plan new efforts on the application of the framework to different contexts and projects.
Recommendations/plans on possible wind up of the Task Group:
- It is planned to wind up TG1 in the next twelve months.
- A recommendation is proposed for a new Task Group to work on the development of a Standard. This is yet to be defined.
Task Group 2 is led by Lee Belbin who is the Science Advisor to The Atlas of Living Australia and who works from Tasmania, Australia.
To provide a report of the practical tests, rules, assertions, software and workflows associated with data quality of biodiversity-related records. This is to be achieved by developing a Core set of tests and assertions for documenting and improving the quality of Biodiversity Data.
- Many hundreds of hours of discussion and work have occupied key members of the group throughout the year, on Zoom, via the GitHub, by phone and by email.
- Unfortunately, costs and the pressure of work prevented a physical meeting in the past twelve months.
- We have standardised the ‘Expected Responses’ for all 101 tests in relation to structure and terms such as EMPTY and ‘specified source authority’. The current set of tests can be seen at https://github.com/tdwg/bdq/labels/Test
- Over the past three months, all tests have been reviewed in light of discussions regarding the concept of ‘parameterisation’. We needed to identify the criteria around which 39 of the tests require some form of authoritative or default resource such as a taxonomic names authority or a vocabulary. This led to the acceptance of the need for thesauri to facilitate the AMENDMENT subset of tests. While some thesauri exist, others will have to be established from scans of Darwin Core records. This will feed into the TG4 work on Vocabularies of Values (see https://github.com/tdwg/bdq/wiki/Vocabularies-needed-for-Darwin-Core-terms prepared by TG4).
- Code for all Date/Time-related tests has been written and applied to large subsets of GBIF records. Code is currently being written for the other tests. We have maintained the process of applying the VALIDATIONs, then attempting relevant AMENDMENTs, then re-checking all the VALIDATIONs. This enabled us to review the effectiveness of the tests, and in some cases tests were dropped because they would only rarely flag an issue.
- The above test results are currently being written up as a Case Study for incorporation into the paper Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data.
- A vocabulary of terms has been developed (https://github.com/tdwg/bdq/issues/152). This will be published as a Supplementary file in the paper on Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data.
- Completed the TG2 components of the manuscript on Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data.
- It is hoped that at least a draft suite of validation data and generic code for each test will be completed by late 2019 or early 2020.
Plans for 2019-2020:
- Our final discussion on Parameterisation and the use of vocabularies, thesauri and references will require a review of the Expected Responses of the relevant tests.
- The building of test datasets that can be used to validate the installation of the test code. Code has been written to extract the parameters of each of the tests to RDF and we believe that this will form the basis of the proposed TDWG standard for the Tests and Assertions.
- Some work is still required on standardising responses arising from the tests/assertions. Results of discussions can be seen in https://github.com/tdwg/bdq/issues/142. This issue is linked with the work of the Annotations Interest Group (http://www.tdwg.org/activities/annotations/). This work had to be postponed because the Annotations IG has not had a leader over the past 18 months. We believe that the assertions arising from the tests should conform with the standard that will be proposed by the Annotations Interest Group. This will, we hope, mean that assertions are always retained with (and travel with) the associated data.
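The cycle described above for Task Group 2 (apply the VALIDATIONs, attempt the relevant AMENDMENTs, then re-check the VALIDATIONs) can be illustrated with a minimal Python sketch. It is not the group's test code: the function names, the response labels and the small thesaurus are assumptions made for illustration only, and the actual Expected Responses are those defined in the Test issues on GitHub. The thesaurus is passed in as a parameter to mirror the idea of a test that depends on a vocabulary or default resource.

```python
from typing import Dict, Optional

Record = Dict[str, str]

# Hypothetical thesaurus mapping variant spellings to a preferred value.
# A real one would come from a source authority or from the TG4 vocabulary work.
BASIS_OF_RECORD_THESAURUS: Dict[str, str] = {
    "preservedspecimen": "PreservedSpecimen",
    "preserved specimen": "PreservedSpecimen",
    "humanobservation": "HumanObservation",
    "human observation": "HumanObservation",
}


def is_empty(value: Optional[str]) -> bool:
    """Treat a missing or whitespace-only value as EMPTY (illustrative definition)."""
    return value is None or value.strip() == ""


def validate_basis_of_record(record: Record, thesaurus: Dict[str, str]) -> str:
    """VALIDATION: is dwc:basisOfRecord one of the preferred values?"""
    value = record.get("dwc:basisOfRecord")
    if is_empty(value):
        return "PREREQUISITES_NOT_MET"  # illustrative label, not the official response
    return "COMPLIANT" if value in set(thesaurus.values()) else "NOT_COMPLIANT"


def amend_basis_of_record(record: Record, thesaurus: Dict[str, str]) -> Optional[Record]:
    """AMENDMENT: propose a standardised value, or None if nothing can be proposed."""
    value = record.get("dwc:basisOfRecord")
    if is_empty(value):
        return None
    preferred = thesaurus.get(value.strip().lower())
    if preferred is None or preferred == value:
        return None
    # Return a copy so the original record is left untouched.
    return {**record, "dwc:basisOfRecord": preferred}


def run_cycle(record: Record, thesaurus: Dict[str, str] = BASIS_OF_RECORD_THESAURUS) -> dict:
    """Validate, attempt the amendment, then validate again."""
    before = validate_basis_of_record(record, thesaurus)
    amended = amend_basis_of_record(record, thesaurus)
    after = validate_basis_of_record(amended if amended is not None else record, thesaurus)
    return {"before": before, "amended": amended is not None, "after": after}


if __name__ == "__main__":
    print(run_cycle({"dwc:basisOfRecord": "preserved specimen "}))
    print(run_cycle({"dwc:basisOfRecord": "fossil"}))
```

Comparing the "before" and "after" results across many records gives a simple measure of how often a test flags an issue and how often its amendment resolves it, which is the kind of evidence the group describes using when reviewing the effectiveness of the tests.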
Task Group 3 is led by Miles Nicholls from The Atlas of Living Australia, Canberra, Australia.
To document use cases in a structured format based on Toward A Conceptual Framework for the Assessment and Management of the Fitness for Use of Biodiversity Data (Veiga et al. 2017). The use case template will be placed in a collaborative editing environment for completion and discussion.
There has been minimal activity in TG3:
- No additional use cases have been submitted in the available form or spreadsheet since October 2017. Collecting Data Quality use cases requires active follow up.
Plans for 2019-2020:
- Using the methods developed by the task group: collect, document and implement key use cases in the Atlas of Living Australia as filters or views.
- Prepare a comprehensive report as part of the paper to be published in the Biodiversity Information Science and Standards Journal later in 2019.
Recommendations/plans on possible wind up of the Task Group:
- It is planned to wind up TG3 in the next twelve months.
Task Group 4 is led by Paula Zermoglio from VertNet, Argentina.
To create a framework within which to build biodiversity data vocabularies, particularly by developing a standard format for building TDWG vocabularies concerning the values used under the Darwin Core terms.
The activities of this task group have been greatly delayed due to personal affairs of the convener. The main outcome expected for this year, namely the development of a current best practice for building of TDWG vocabularies, has not yet been delivered. With respect to the goals proposed in the Charter, the group has:
- Developed a scoping document which is available through the BDQ GitHub Repository. https://github.com/tdwg/bdq/blob/master/tg4/README.md
- Prepared a list of vocabularies needed for terms of the Darwin Core standard, based on the standard itself and on the needs identified by Task Group 2, Tests and Assertions. The list is available through the BDQ GitHub Repository (https://github.com/tdwg/bdq/wiki/Vocabularies-needed-for-Darwin-Core-terms).
- Collected and assessed already existing vocabularies across the community. These are gathered in a Google Spreadsheet, linked on the GitHub Repository. This work is ongoing as more resources become available.
- Evaluated the state of data shared through aggregators in relation to the use of controlled values. Analyses were performed and are available through a wiki page in the BDQ GitHub Repository.
- Started to develop a current best practice for building of TDWG vocabularies. Documents are not available for public review yet. We will take the opportunity given by the vocabularies workshop during TDWG 2019 to discuss some of the relevant topics.
- Organized a pre-conference workshop for TDWG 2019.
The workplan for the next twelve months includes the activities proposed in the charter:
- Development of a common repository for TDWG vocabularies-of-values (Currently using GitHub): the relevance of this topic is being re-evaluated, considering possibilities such as TDWG or GBIF as the main repository.
- Development of a current best practice for building of TDWG vocabularies and an exemplary vocabulary (in progress).
- Collection and assessment of already existing vocabularies across the community. (ongoing progress).
- Identification of domain-specific groups that may be involved in the preparation of vocabularies.
Plans for Standards, etc.:
- We do not plan to propose a new standard or any modification to existing ones but intend to provide a current best practice for building TDWG vocabularies.
Meetings at TDWG2019 (Biodiversity Next) will set the workplan for the next twelve months. We hope to use the Symposium and discussions at TDWG to identify and recruit new people to help progress the work of the group over the next twelve months. It is planned that the next twelve months will be a major implementation phase, with continual liaison with key data aggregators (ALA, GBIF, iDigBio, SIBBR, OBIS), data custodians and data users. Tentative plans that we will take to the meeting at TDWG2019 include:
- Complete and submit paper to Biodiversity Information Science and Standards
- Finalize and get sign off on the CORE tests
- Finalise Code and the Test Datasets for the Tests and Assertions
- Submit Tests and Assertions as a TDWG Standard
- Liaise with Annotations Interest Group on standardising Assertions as Annotations
- Hold working meeting for TG2 in first half of 2020 (funds and locations permitting)
- Continue liaison with ALA, GBIF, iDigBio and others on harmonizing/aligning Data Quality procedures.
- Encourage uptake of standard tests and assertions
- Outreach and dissemination of information
- Finalize, document and close Task Groups 1 and 3.