Annual Report TDWG Data Quality Interest Group for 2018
The TDWG-GBIF Data Quality Interest Group (DQIG) was formally approved by the TDWG Executive in October 2014. Arthur Chapman and Antonio Saraiva are the elected co-coordinators of the group.
Seminars and Interest Group meetings were held at TDWG2017 and are planned for TDWG2018, with two between-TDWG meetings held during the year. A major contribution to the SPNHC Data Quality Symposium is also planned at the joint SPNHC/TDWG meeting in Dunedin, New Zealand in August 2018.
The Interest Group manages four Task Groups. Three Task Groups were established in 2014, viz
- TG1 – A Framework for Data Quality – leader: Allan Koch Veiga
- TG2 – Data Quality Tests and Assertions – leader: Lee Belbin
- TG3 – Use Case Library – leader: Miles Nicholls
One new Task Group has been established and approved by the Executive since TDWG2017:
- TG4 – Vocabularies – leader: Paula Zermoglio
It was decided that the proposed Task Group on Invasive Organism Information was better placed elsewhere in TDWG and was not a good fit with the Data Quality Interest Group, although it was agreed that continued liaison with that Group was essential.
Discussions with the broad membership are held on the Data Quality GitHub repository (bdq), which has been established on the TDWG GitHub site. The repository has been used extensively over the past twelve months, especially by Task Group 2 (Issues). All documents related to the Interest Group are linked in the Wiki. A TDWG Data Quality Interest Group Listserve (tdgw-bdq) was also established to provide another form of communication on activities.
The Interest Group is currently writing a paper for the TDWG Journal Biodiversity Information Science and Standards: Improving Biodiversity Data Quality through a Fitness for Use Framework. It is hoped to have the paper completed and submitted early in 2019. There has been some hold-up with this paper as we have needed to finalize the tests and run some of the tests to provide a Case Study.
Significant highlights for the year were:
- The proposed implementations of the core suite of tests-assertions by the ALA (and its international deployments), GBIF and iDigBio (mentioned last year) are ongoing, and all remain fully committed.
- Task Group 2 has finalized the list of core Tests and Assertions and these are now being coded and prepared for publication and eventual development of a standard.
- Task Group 4 was established and the Charter approved by the Executive, and will begin liaising with possible contributors to and developers of Vocabularies of Value at TDWG2018. A preliminary draft scoping document is being prepared for discussion.
In the past twelve months, the DQIG has held many on-line meetings using Zoom, although time-zone differences do not always allow these to be as efficient as they might otherwise be.
Two physical meetings were held during the year.
- In January 2018, a highly productive meeting was held in Gainesville, Florida to finalize the Tests and Assertions. Funding was provided by the Atlas of Living Australia (ALA), iDigBio and the Kurator project. In attendance were Lee Belbin (Chair), Arthur Chapman, John Wieczorek, Paul Morris, Paula Zermoglio and Alex Thompson.
- Since the last report, a full day meeting was held on the Sunday preceding TDWG2017 in Ottawa, Canada. A report on that and other meetings held at TDWG2017 can be found on the GitHub at https://github.com/tdwg/bdq/wiki/Summary-of-DQIG-meetings-during-TDWG2017,-Ottawa-(Oct-2017).
- Throughout the year there has been extensive interchange and liaison with key data aggregators and others, including GBIF, ALA, iDigBio, VertNet, OBIS and Kurator.
- A second meeting is planned for the Sunday before TDWG2018 in Dunedin, New Zealand.
- An on-line Symposium was given by Allan Koch on the Conceptual Framework for Data Quality to the Biodiversity Informatics Training Curriculum in August 2018.
Task Group 1 is led by Allan Koch Veiga from the University of São Paulo, Brazil.
Develop a conceptual framework that serves as a common ground for a collaborative mapping of DQ needs and DQ methods, tools and DQ reports for DQ Assessment and Management based on data fitness for use.
- The main goal of the task group has been achieved and has been consolidated by the engagement of the other task groups in using the Framework as a common backbone, and by putting the framework into practice in initiatives such as the Kurator Project and the Online Pollen Catalogs Network (RCPol).
- A contribution is being made to a new paper, entitled "Improving Biodiversity Data Quality through a Fitness for Use Framework". This will spread and consolidate the principles and concepts in a more practical way.
- For the same purpose, in August 2018, the conceptual framework was presented in a webinar of the Biodiversity Informatics Training Curriculum (https://www.youtube.com/watch?v=FJ7HLjl5_fg).
- The framework was tested in a use case in the Museum of Comparative Zoology of Harvard University and by implementation in at least two projects (Kurator and RCPol). It was documented in a PhD thesis and in a full paper in PLOS ONE, and has been the subject of much discussion around DQ in the context of Biodiversity Informatics.
It is proposed that the next steps be:
- Consolidate and document all the outcomes around Task Group 1 in a comprehensive public final report.
- Wind up the Task Group, and plan new efforts on the application of the framework to different contexts and projects.
- It is planned to wind up TG1 in the next twelve months.
Task Group 2 is led by Lee Belbin who is the Science Advisor to The Atlas of Living Australia and who works from Tasmania, Australia.
To provide a report of the practical tests, rules, assertions, software and workflows associated with data quality of biodiversity-related records. To achieve this by developing a Core set of tests and assertions for documenting and improving the quality of Biodiversity Data.
- A workshop was held at iDigBio in Gainesville, Florida on January 16-19, 2018 to finalize the core tests and assertions. Attending were Lee Belbin and Arthur Chapman (sponsored by the ALA), John Wieczorek and Paul Morris (sponsored by Kurator), and Paula Zermoglio and Alex Thompson (sponsored by iDigBio).
- The TG2 components of the manuscript on Improving Biodiversity Data Quality through a Fitness for Use Framework have been completed, but further editing will be required after the discussions at the Dunedin meeting.
- An implementation of the core tests/assertions has been achieved by Paul Morris for Kurator: https://github.com/kurator-org/ffdq-api/tree/master/src/main/java/org/datakurator/ffdq/api
- It is hoped that at least a draft suite of validation data and generic code for each test will be completed by the end of 2018.
- A round-up meeting of the Interest Group on the Sunday prior to TDWG2018 in Dunedin will identify outstanding issues. During that meeting and the Conference, it is hoped that a strategy for resolving them will be developed.
- Standardization of the responses arising from the tests/assertions has been discussed; the results can be seen in https://github.com/tdwg/bdq/issues/142. This issue is linked with the work of the Annotations Interest Group: http://www.tdwg.org/activities/annotations/. We believe that the assertions arising from the tests should conform with the standard that will be proposed by that Interest Group. We hope this will mean that assertions are always retained with (and travel with) the associated data.
Task Group 3 is led by Miles Nicholls from The Atlas of Living Australia, Canberra, Australia.
To document use cases in a structured format based on Toward A Conceptual Framework for the Assessment and Management of the Fitness for Use of Biodiversity Data (Veiga et al. 2017). The use case template will be placed in a collaborative editing environment for completion and discussion.
There has been minimal activity in TG3.
- No additional use cases have been submitted in the available form or spreadsheet since October 2017. Collecting Data Quality use cases requires active follow-up.
- Using the methods developed by the task group: collect, document and implement key use cases in the Atlas of Living Australia as filters or views.
- It is planned to wind up TG3 in the next twelve months.
Task Group 4 is led by Paula Zermoglio from the University of Buenos Aires, Buenos Aires, Argentina.
To create a framework within which to build biodiversity data vocabularies, particularly by developing a standard format for building TDWG vocabularies concerning the values used under the Darwin Core terms.
- The Task Group was established and the Charter approved by the TDWG Executive in February 2018.
- It was expected that a Scoping Document and a current best practice for building vocabularies would be ready by the time of this report. However, there were delays, and these documents are not yet ready.
With respect to the goals proposed in the Charter, the group has:
- Developed a preliminary version of the scoping document, which is now open for input from all members of the Task Group and from the community at large through a link in the BDQ GitHub Repository: https://github.com/tdwg/bdq/blob/master/Vocabularies/README.md
- Collected, and is now assessing, existing vocabularies across the community. These are being gathered in a Google Spreadsheet, linked in the BDQ GitHub Repository.
- Evaluated the current state of data shared through aggregators in relation to the use of controlled values. Data is being gathered and some preliminary analyses are being performed. This is not yet open to the community.
The workplan for the next twelve months includes the activities proposed in the charter:
- Finalizing Scoping Document.
- Development of a common repository for TDWG vocabularies-of-values. (Currently using GitHub)
- Development of a current best practice for building of TDWG vocabularies.
- Building of at least one exemplary vocabulary.
- Collection and assessment of already existing vocabularies across the community. (in progress)
- Identification of domain-specific groups that may be involved in the preparation of vocabularies.
- Finalizing in-depth evaluation of the current state of data shared through aggregators in relation to the use of controlled values.
- Preparation of a list of vocabularies needed for terms of the Darwin Core standard.
- We do not plan to propose a new standard or any modification to existing ones, but intend to provide a current best practice for building TDWG vocabularies.
Meetings at TDWG2018 will set the workplan for the next twelve months. We hope to use the Symposium and working meeting at TDWG to identify and recruit new people to help progress the work of the group over the next twelve months. It is planned that the next twelve months will be a major implementation phase, with continual liaison with key data aggregators (ALA, GBIF, iDigBio, SIBBR, OBIS), data custodians and data users. Tentative plans that we will take to the meeting at TDWG2018 include:
- Complete and submit paper to Biodiversity Information Science and Standards
- Finalize and get sign off on the CORE tests
- Finalize the code and the test datasets for the Tests and Assertions
- Submit Tests and Assertions as a TDWG Standard or equivalent
- Liaise with the Annotations Interest Group on standardizing Assertions as Annotations
- Hold working meeting in first half of 2019
- Continue liaison with ALA, GBIF, iDigBio and others on harmonizing/aligning Data Quality procedures.
- Encourage uptake of standard tests and assertions
- Outreach and dissemination of information
- Finalize, Document and Close Task Groups 1 and 3.