Summary of DQIG meetings during TDWG2017, Ottawa (Oct 2017)

The Data Quality Interest Group had a very busy and successful series of meetings and symposia at TDWG2017 in Ottawa.

Sunday Meeting

TG1 – Framework for Data Quality

Report:

Framework published: Veiga AK, Saraiva AM, Chapman AD, Morris PJ, Gendreau C, Schigel D, Robertson TJ (2017) A conceptual framework for quality assessment and management of biodiversity data. PLOS ONE 12(6): e0178731. https://doi.org/10.1371/journal.pone.0178731
Framework light describes what is and what is not quality. A fundamental concept generates several derived concepts. End product: a common metadata schema in RDF.
FFUB = Fitness for Use Backbone. Next steps include engagement and collaboration.
Framework light helps people use the conceptual framework to determine data quality and whether data are fit for use. It is a link between the model and the process.

Allan also showed a modification of the Framework to cater for the “negative” tests.

Identification of Issues / Q&A

Get people on Task Group more involved
Revisit Charter
Engage people in a group.
It was suggested to close this Task Group and start a new Task Group (largely based around implementation)
- Teaching curriculum
- Publishing (instructions for authors AND reviewers and editors)
  - Start with Pensoft
- Hold workshops

TG2 – Tools, Services and Workflows

Report: There have been significant comments and suggestions on making the tests more compliant with the TG1 framework.

Discussion on using negative tests (i.e. looking for errors in data) versus positive tests (looking for compliance). It was suggested that if looking for errors at the aggregator level, there is great benefit in using a “negative” test. One negative test may identify a problem whereas it may take a number of positive tests to say that the data is “good or compliant”. It was also pointed out that it is much easier to automate the negative tests – this statement was supported by all the aggregator implementers present. Negatives should be naturally used as filters.
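The difference between the two kinds of test can be sketched as follows. This is a minimal illustration under stated assumptions, not one of the TG2 tests: the function names and the sample records are hypothetical, and only the Darwin Core term name decimalLatitude is taken from the standard.

```python
def _to_float(value):
    """Parse a value as a float, returning None when it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def latitude_out_of_range(record):
    """Negative test (hypothetical): True only when an error is positively found."""
    lat = _to_float(record.get("decimalLatitude"))
    return lat is not None and not (-90 <= lat <= 90)


def latitude_compliant(record):
    """Positive test (hypothetical): True only when the value is present and valid."""
    lat = _to_float(record.get("decimalLatitude"))
    return lat is not None and -90 <= lat <= 90


def filter_flagged(records, negative_tests):
    """Use negative tests as a filter: drop any record that at least one test flags."""
    return [r for r in records if not any(test(r) for test in negative_tests)]


# Made-up records for illustration only.
records = [
    {"decimalLatitude": "45.4"},
    {"decimalLatitude": "123.0"},
    {"decimalLatitude": None},
]
kept = filter_flagged(records, [latitude_out_of_range])
```

In this sketch the record with a missing latitude passes the negative filter but does not pass the positive test, which illustrates why a single negative test can reject a record while several positive tests are needed before a record can be called compliant.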

A lot of work has been going on with the spreadsheet of Tests and Assertions recently. Lee was waiting on people to complete scoring the tests (with 1's for keep as core, 0's to move to supplemental). As soon as this is done, he will finalise a list of core tests and restructure. They will then be added to GitHub, with the comments added as issues.

Reports from Alex Thompson and Paul Morris on the Test Dataset and Generic Code. Alex stated that he has completed a few modules and that once the tests have settled down, he will get back to it. The new compliance tests will only take one line of code each and are thus easy to do.

Identification of Issues / Q&A

Workshop to be held in the next few months, if funding is available, to have people from the Task Group as well as implementers from the three key aggregators (GBIF, ALA, iDigBio) come together and work through each test in detail. Budgeted proposal to be prepared for presentation to GBIF, ALA and iDigBio.

Tests can have dependencies, making the order that the tests are run important (don’t check whether country is valid before you review +/- of lat/lon values).
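One way to make such an ordering explicit is to record each test's prerequisites and sort them topologically. The sketch below is an assumption about how this could be done, not how the aggregators do it; the test names are hypothetical and it relies on Python's standard graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical test names. Each key maps to the set of tests that must run first.
TEST_DEPENDENCIES = {
    "coordinate_sign_reviewed": set(),
    "country_valid": {"coordinate_sign_reviewed"},
}


def run_order(dependencies):
    """Return the tests in an order where every prerequisite comes first."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(TEST_DEPENDENCIES)
```

TopologicalSorter raises graphlib.CycleError if two tests depend on each other, which would surface an inconsistent set of dependencies before any test is run.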

Dmitry suggested that, as an outcome of the between-TDWG workshop, the group complete the revision of the available test and assertion collections and produce a sketch clustering the tests and the major DQ issues for future visualisation and infographics, using sustainable goals like the GEOSS wheel as inspiration.

TG3 – Use Case libraries

Report: 28 use cases assembled. Key DQ fields identified. Emily’s publication pending as part of the Biodiversity Information Science and Standards (BISS) article. TG near to completion.

Identification of Issues / Q&A

GBIF has a new program officer, Andrew Rodrigues, who is charged with communicating how to better analyze your data. TDWG has a curriculum IG that is currently inactive. Town Peterson has a Facebook group on Biodiversity Informatics Training with thousands of followers; should we associate with them? Also, the Data Carpentry workshop is a potential use case for applying our data quality products. There is a journal set up to publish curricula.

Task: find out what Town and Data Carpentry are currently teaching on data quality. Are we ready to push our products?

Suggested that once BISS paper is out, consider closing TG3 and maybe restart collection of use cases once framework / FW Light comes into active use.

GBIF/ALA/iDigBio collaboration

Report:

Alignment of the different languages between the three systems needs work. Noted that it is more than just Data Quality but has definite links to Data Quality and the work of TG2.
Potential for shared harvesting infrastructure; run the harvest once, then distribute to the partners based on the different schemas. This idea is already being worked on by Tim Robertson.
OBIS have indicated that they will also implement the core tests, via generic code and SQL. SiBBr (Brazil) is going through a difficult time, and may not be able to join in. There are several ALA clones, as well, such as Canadensys.

Proposed TG4 – Vocabularies

Status: Propose a new task group to deal with the issue and values of controlled vocabularies within DwC terms. The aim is to build a framework for building a controlled vocabulary for DwC terms, providing a standard format for doing so.

Task: Proposed charter is now available (https://tinyurl.com/ydc6l7h3). Comments called for by 15 October (do not alter – only comments and suggestions please) and then will be submitted to the TDWG Executive.

Next steps: submit charter and begin work. Liaise with what Quentin is doing on invasive species terms. Hope to develop a repository by Feb-2018. Hope to identify domain-specific groups by Apr-2019. Hope to create a list of needed vocabularies by Jul-2019.

Examine the ControlledVocabs Resources Google Doc (see the Resources section of the charter).

Proposed TG5 Invasives

Discussion on where to host it:

Agreed that it is a better fit with the Species Interactions IG. Should also have extensive conversation with the Vocabulary TG and potentially be the first instance of the framework that the VTG generates.

DQIG meeting between TDWGs

Funding: preparing a proposal to GBIF/iDigBio/ALA for the Tests and Assertions workshop

Between-meeting of the Interest Group: Suggested location: Colombia (Bogotá, Inst. Humboldt) or possibly São Paulo.

May be possible to link to the GBIF Capacity Building (calls late 2017/early 2018). Would need to include a training module. Proposals have to come from a GBIF member country. It was thought that Colombia would be a strong candidate, with training linked to the GBIF Data Validator, possibly Data Quality oriented training for students. John Wieczorek and Antonio Saraiva to follow up and develop a proposal.

Agenda: at the latest, March. Combine with TG2 work? Would need an extra day. Defining the agenda will help justify an application for funding assistance. Include testing the data cleaning tools.

Article in Biodiversity Data Journal / Biodiversity Information Science and Standards – Pensoft

Progress on the paper, “Improving Biodiversity Data Quality through a Fitness for Use Framework”, is a bit behind schedule. Allan and Paula to write up their sections (November 15 deadline).

Discussion during the week suggested that this paper may be better placed in the new TDWG journal, Biodiversity Information Science and Standards (BISS). This will now be done following discussion with TDWG Executive members and the BISS Editor.

Case study: GBIF has a public API that receives lat/lon as input and returns all countries within 5 km of the point, plus whether the point is on land or on water.

Identifying to-dos: volunteers to write the Case Study

Need simple example
Stay away from taxonomy
Suggested something along the lines of “what simple changes can I make to my dataset to improve the number of records I can use?”
Use georeferences and possibly some date fields.
A quick analysis showed 5.8 million records with verbatimCoordinates, 1 million of those did not have lat/long – this could be an easy example.
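The kind of check behind that quick analysis could look like the sketch below. It is illustrative only: the function name and the sample records are made up, and it assumes records are available as dictionaries keyed by the Darwin Core terms verbatimCoordinates, decimalLatitude and decimalLongitude.

```python
def _is_blank(value):
    """Treat None, empty strings and whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def missing_decimal_coordinates(records):
    """Return records that have verbatimCoordinates but lack a decimal lat/long pair."""
    return [
        r for r in records
        if not _is_blank(r.get("verbatimCoordinates"))
        and (_is_blank(r.get("decimalLatitude")) or _is_blank(r.get("decimalLongitude")))
    ]


# Made-up records for illustration only.
sample = [
    {"verbatimCoordinates": "45 25 N, 75 41 W", "decimalLatitude": "", "decimalLongitude": ""},
    {"verbatimCoordinates": "45 25 N, 75 41 W", "decimalLatitude": "45.42", "decimalLongitude": "-75.68"},
    {"verbatimCoordinates": "", "decimalLatitude": "", "decimalLongitude": ""},
]
candidates = missing_decimal_coordinates(sample)
```

Records returned by such a check are the ones a data owner could make usable by converting the verbatim coordinates into decimal values.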

Alex, John and Antonio to coordinate

Annotations

Report on Denver meeting

Awaiting report of meeting from Paul Morris (to be available shortly after end of TDWG2017).
Settled on W3C oa:Annotation. It allows annotations on annotations, and an annotation on a data record can be treated as an issue. According to iDigBio it may not be robust enough; they have a complex method that includes a JSON element.
A similar application may be useful for the VocabTG.
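As a rough illustration of annotations on records and annotations on annotations, the sketch below builds two annotations in the JSON-LD shape of the W3C Web Annotation Data Model, in which the type "Annotation" corresponds to oa:Annotation. It is an assumption about how the idea could be expressed, not the format agreed at the Denver meeting; the helper name, the example.org IRIs and the comment text are hypothetical.

```python
import json


def make_annotation(annotation_id, target_iri, comment, motivation):
    """Build a minimal Web Annotation as a plain dictionary."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": annotation_id,
        "type": "Annotation",
        "motivation": motivation,
        "body": {"type": "TextualBody", "value": comment, "format": "text/plain"},
        "target": target_iri,
    }


# An annotation on a data record, which can be treated as an issue.
issue = make_annotation(
    "https://example.org/annotations/1",
    "https://example.org/records/abc123",
    "Latitude and longitude appear to be transposed.",
    "assessing",
)

# An annotation on that annotation: its target is the first annotation's IRI.
reply = make_annotation(
    "https://example.org/annotations/2",
    issue["id"],
    "Confirmed against the specimen label.",
    "replying",
)

serialised = json.dumps([issue, reply], indent=2)
```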

Where to from here

Determine whether the services we are offering are for the data owners, or whether the data aggregators should simply ‘take care of it’ without involving the owners. No! The data owners should be responsible and demonstrate pride in the quality of their data.

Data Quality Listserve

Established at BDQ Listserve

Thursday Workshop

A draft summary (dot points) can be found at http://bit.ly/TDWG17-W06. This will be edited and uploaded in the near future (thanks to Abby Benson, Paul Morris and others who have worked on these notes).
