
Report of TDWG DQIG Meeting Canberra, Australia (May 2017)

Arthur Chapman edited this page Oct 15, 2017 · 1 revision

NB This is a Summary report.

A detailed report of this meeting with full discussion can be found at:

Attendees:

  • Arthur Chapman (DQIG Co-Convenor)
  • Lee Belbin (TG2 Leader)
  • Miles Nicholls (TG3 Leader)
  • Paula Zermoglio (Vocabularies - Leader)
  • Donald Hobern (Director, GBIF)
  • Alex Thompson (iDigBio)
  • Dave Watts (OBIS)
  • John Wieczorek (Darwin Core)
  • Renato de Giovanni (CRIA/Brazil)
  • Shelley James (NSW Botanic Gardens – ex iDigBio)
  • Tania Laity (Australian Department of Environment and Energy - ERIN)
  • Peggy Newman (Museum of Victoria – part of time)
  • Emily Rees (ALA – part of time)

Remote: (Part of time)

  • Antonio Saraiva (DQIG Co-convenor)
  • Allan Koch Veiga (TG1 Leader)
  • Arturo Arino (Spain)
  • Dmitry Schigel (GBIF)
  • Tim Robertson (GBIF)

Acknowledgments

We would like to acknowledge the Atlas of Living Australia (ALA) for hosting the meeting, and organisations that provided funding for the International attendees (ALA, TDWG-CSF, GBIF, iDigBio, Kurator project) as well as local and overseas organisations that provided time and funds for attendance at the meeting. Thanks to Tania Laity for taking the notes.

Fitness for Use Framework (TG1 – Allan)

  • Allan reported on progress to date, which can be seen in the following PowerPoint presentation: https://drive.google.com/drive/folders/0B5o4z55hvAxxa1p1NkRwMTZhX1U
  • Allan reported that the Framework paper had been accepted for publication in PLOS ONE
  • A workable Conceptual Framework (Framework Lite) is being developed and progress can be seen on the TDWG DQIG GitHub. This is still a work in progress and many of the parts are still being worked on.
    • Metadata schema – not started on this yet but have some thoughts on how to start it
    • Fitness for Use Backbone (FFuB) – partnership has been established between USP-BIOCOMP and SIBBR-MCTI - providing funding to develop Biodiversity Quality Toolkit
    • Biodiversity Data Quality Profiling - A Practical Guideline – short paper submitted and accepted

Allan then went through the process of using the conceptual framework and demonstrated the process with scenarios for four groups with different roles:

  • DQ profiler
  • DQ developer
  • Data user
  • Data holder

Discussion:

Discussion centered around how the Tests and Assertions developed in TG2 would fit into this aspect of the Conceptual Framework. There was also discussion on positive versus negative approaches – i.e. users of the profiles want data that fit certain criteria, whereas the tests and assertions identify problems with the data. It was noted that as most of the tests are binary in nature, this should not be a problem, and it was pointed out that the Tests and Assertions spreadsheet already includes a column giving the positive equivalent (although these need to be checked to ensure that they are indeed direct opposites).
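
The positive/negative equivalence discussed above can be illustrated with a small sketch (all function and field names here are hypothetical, not from the TG2 spreadsheet): a negatively framed validation and its positive equivalent are logical complements, which is why binary tests translate cleanly between the two framings.

```python
# Hypothetical sketch of a binary TG2-style test and its positive equivalent.

def validation_coordinates_out_of_range(record):
    """Negative framing: flag records whose coordinates fall outside valid bounds."""
    lat = float(record["decimalLatitude"])
    lon = float(record["decimalLongitude"])
    return not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0)

def measure_coordinates_in_range(record):
    """Positive framing, as a DQ profile would want it: the direct opposite."""
    return not validation_coordinates_out_of_range(record)

record = {"decimalLatitude": "95.0", "decimalLongitude": "10.0"}
assert validation_coordinates_out_of_range(record) is True
assert measure_coordinates_in_range(record) is False
```

Checking that each spreadsheet pair is a true complement like this is exactly the verification the meeting flagged as still outstanding.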

There was discussion on provenance, i.e. how we describe data that has changed during the workflow, and the status of quality over time. Allan reported that the DQ report is only a snapshot in time – this is not something the framework will cover. It was agreed, however, that this was an important issue, and it was recommended that it be discussed further in the Annotations context.

Framework Lite

Allan went through the Framework Lite in detail and expanded on the five components.

  • Quick start – where we will introduce Framework Lite. We plan to have all the information needed to use the conceptual framework at a high level – there is a bit of content but still a lot of work to do. There will be sections for each type of stakeholder, summaries of each stakeholder role and how they interact with the framework, plus an interactive tool to help define a DQ profile – the fields are not controlled.
  • Resources – presentations, publications and research go here to help use and understand the framework and DQ
  • Glossary and controlled vocabularies – the glossary is the same as supplementary information for paper, controlled vocabulary for data dimensions, data enhancement types and data resource types – we would like to include references for vocabulary terms
  • FAQs – we need to start to populate this page -maybe start with some more general questions
  • Tools – we intend to put tools here to help people define DQ profiles and reporting based on the framework – these will be integrated with the FFuB

Discussion

It was suggested that the development of profiles could be improved by removing the free-form nature of the answers and perhaps including drop-downs / checkboxes. It would be good to iterate on the design of the form and add more structure, but not change the underlying structure – just the way users enter data.

Another suggestion was to allow changes to be saved automatically along the way rather than waiting until the end. There would also be advantages in saving to a database so that users could come back to it – or use an already saved profile and modify it according to their needs rather than having to start from scratch each time.

It was also suggested that it might be useful to include families of tests, e.g. the coordinate-based tests that you want to apply – bundles of checks that might be applied to specific Information Elements (IEs), etc.

Tests and Assertions (TG2 – Lee)

Lee reported that the Tests and Assertions spreadsheet had been circulated for comment, and he had received several responses. These had been examined and are being incorporated into the document.

Lee then went through the spreadsheet and explained each of the columns and worksheets.

  • Lee would like to put the spreadsheet out more formally and more broadly via the TDWG list soon

  • He is preparing a paper covering all decisions etc. made in creating the spreadsheet

  • This is only meant to be the core tests (not a comprehensive list) – it is very generic– different domains may want to add specific tests relevant to their data e.g. depth data for marine domain

  • One of the things not mentioned is the need for some mechanism to capture domain-specific tests

  • The Resolution is for single- or multi-record tests, and there are single- and multi-term tests. Every test links to one or more DwC terms

  • Data dependency – either internal or external, i.e. available within the aggregator or outside it. One of the problems regarding DQ in the case of DwC is its uncontrolled nature. We have to do something about this, i.e. we have to make an attempt at improving it, as it needs more internal controls

  • Output type – this is a linkage to the framework summary, e.g. the number of DwC fields filled in in the raw data. For the tests which currently have an output type of both validation and amendment, we need to go through these and split them into either validation or amendment.

  • Outstanding issues included

    • Unique ids for the tests needs to be addressed [Alex has ID generator ready to go and while Lee was talking, prepared and added a GUID into the working spreadsheet.]
    • Preparation of a test dataset [ACTION: Paul Morris had agreed to prepare an example dataset]. John reported that this is in progress
    • Development of generic code for each of the tests.
    • We need a standardised set of variable names which makes more formal specifications easy to render

Discussion

There was some discussion on how the Assertions may be applied in practice. Lee suggested that the outcome for each record in a spreadsheet could become an extra column against the record, i.e. 104(ish) columns added to DwC – in most cases the value will be true or false, as long as it's explicit what each column means. During the discussion it was suggested that there would be an enormous benefit if we agree on a standard way to implement this. It was agreed that GBIF, ALA and iDigBio will look into this as part of their joint implementation strategy.
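
Lee's one-column-per-assertion idea can be sketched with the standard library (the assertion names and the tiny country-code set below are hypothetical placeholders, not TG2 tests): each test contributes one true/false column appended to the DwC-style record.

```python
import csv, io

# Hypothetical assertions keyed by the extra column name they would add.
ASSERTIONS = {
    "yearNotEmpty": lambda r: bool(r.get("year", "").strip()),
    "countryCodeKnown": lambda r: r.get("countryCode", "") in {"AU", "BR", "US"},
}

source = io.StringIO("year,countryCode\n1999,AU\n,XX\n")
reader = csv.DictReader(source)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames + list(ASSERTIONS))
writer.writeheader()
for row in reader:
    # Append one explicit true/false outcome per assertion to the record.
    row.update({name: test(row) for name, test in ASSERTIONS.items()})
    writer.writerow(row)

print(out.getvalue())
```

With 100+ assertions the loop is unchanged; only the `ASSERTIONS` table grows, which is why agreeing on a shared set of column names matters.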

  • Specification (Technical description) – Lee has started this but it could be improved [ACTION: hand over to Allan to fill in some examples]
  • DwC terms – this is probably not as comprehensive as it could be – needs checking before publication.
  • DQ dimension – standard terms were used for this column. It was sometimes difficult to select which category each test fits into.
  • Severity – we might need a bit more discussion regarding this and the terms used in this column.

Discussion

There was some discussion about dependencies and severity, e.g. if year is missing then data is missing; and if you 'correct' latitude and longitude by swapping them, is this an error or a warning? Lee suggested that we probably need recognition of the order of dependency and of the implications of switching from positive to negative. [ACTION: Arthur to check definitions re DQ dimensions and see how they align with severity]

  • Source – the tests come from two types of sources, i.e. systems and individuals – this is not exhaustive
  • References – this has not been done – probably mostly Chapman et al., or we could get more specific. This would be good to fill out if we can find the reference. Things in the reference list (tab) are those considered fundamental to the work of TG2. It's best to leave the field blank if we don't have one [ACTION: One or two people to take responsibility for filling this in – Lee to delegate and then go through this with them]
  • Example implementations – this will be based on what comes out of the GBIF/ALA/iDigBio alignment – Lee is happy to capture some of this information here if it's relevant.

Discussion

There may be different code implementations in different agencies and this will need to be recorded, especially where different algorithms may have been used – for example with outlier tests using the reverse jackknife, or an expert-defined range, etc. We need to look at how we record more detailed documentation / technical explanation for more descriptive / complicated examples. There may be algorithms we want to publish, but we may want to apply different tests for different applications. Alex suggested that there are website compliance tests (W3C) – perhaps we could develop something like an implementation matrix rather than filling out a column recording which tests are currently compliant. [ACTION: Alex to follow up how this could be implemented]

  • Link to specification source code – will there be a link here to code / a reference implementation of the test?

Discussion

It was suggested that recording this in the spreadsheet would be very complicated, as some of the code/APIs etc. could require several spreadsheets of their own. It was suggested that maybe these could be true / false columns, i.e. to indicate where we have developed something for each of the assertions.

  • Feedback sheet – this is included for general comments on the spreadsheet. There are only a few comments so far, but it has only been put out to the DwC people.

GEOSS Project - Donald

Donald reported that data management principles have been published and include 10 key aspects that they are encouraging people to adopt, e.g. discoverability, quality of metadata, etc. – www.geolabel.info/DMP_generation.htm. It would be interesting to apply a profile for datasets that would pass a certain quality standard for TDWG, e.g. maybe two thresholds, i.e. minimum quality or gold star. High-level uses of the framework could be to set some standard expectations for data publishers and a mechanism to report on this, enabling reporting on long-term management of data. The meeting agreed that this was worthwhile and that we should not attempt to reinvent the wheel. It was suggested that some of the tests in the Tests and Assertions could feed into the overall evaluation of a dataset quality metric.

Development of a Reference Set (Test Dataset) for Tests and Assertions - Alex

Alex reported that there have been no decisions yet – it was only agreed that a reference set was needed, and Paul Morris wanted to be able to augment reality-based datasets with things you haven't found yet. So we were looking at a mix of synthetic and real data, and a way to identify which values are synthetic and which are real. Alex has handed this over to John Wieczorek.

  • For the single-term tests – these are pretty easy, but multi-term tests are pretty complicated, e.g. dates – around 1,000 rows. Most controlled vocabularies exist in some form – GBIF has some for most of the larger tests – but these are not standards yet and are derived from code, e.g. the country controlled vocabulary in GBIF – we could start with these.
  • Verbatim fields could be perfectly valid or rubbish
  • There are differences between human-readable and machine readable formats
  • We are happy with using a test dataset as long as records are flagged as verbatim or modified. We can't do all permutations but need enough data to exercise all aspects of the tests (e.g. to pick up all the edges, i.e. we need dates that represent all formats, or a cut-down set) – iDigBio has more than 20,000 date formats, but only about 20-50 comprise the majority of the data, so we could focus on those.
  • Do we need one test dataset to start with or one dataset per test or a combination e.g. for each profile for TG3 have a data set?
  • What proportion of records will have dataset level metadata?
  • Need some process to add additional tests as more come up – and some governance / moderation process set up for this
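
A minimal sketch of what such a reference set might look like (the rows, the real/synthetic flags and the cut-down format list are invented for illustration): each record is flagged as real or synthetic, and the verbatim dates exercise a handful of the common formats rather than all 20,000+.

```python
from datetime import datetime

# Hypothetical reference-set rows, each flagged real or synthetic, with
# verbatim dates covering a cut-down set of common formats.
reference_set = [
    {"eventDate": "1999-03-07", "synthetic": False},
    {"eventDate": "07/03/1999", "synthetic": True},
    {"eventDate": "7 March 1999", "synthetic": True},
    {"eventDate": "rubbish", "synthetic": True},  # should fail every format
]

FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d %B %Y"]

def parse_event_date(verbatim):
    """Try each known format in turn; None means a date test should flag the record."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(verbatim, fmt)
        except ValueError:
            continue
    return None

results = [(r["eventDate"], parse_event_date(r["eventDate"]) is not None)
           for r in reference_set]
```

The synthetic flag lets an implementer tell which failures were deliberately seeded, which is the property Paul Morris was after.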

Discussion

As part of the discussion it was suggested that we need individual small datasets for individual tests and one large real dataset (or more) to test the overall capability. There was extensive discussion including on technical aspects without any final consensus but general agreement that [ACTION: further discussion involving Lee, Alex, Paul and John would be needed as a follow-up to this meeting]. Paul has begun working on this and [ACTION: Paul to inform relevant players as to what he requires to complete this task].

Development of Code - Alex

Alex reported that they are still working on the Python implementation and expect to make significant progress over the next couple of months. Lee and Donald stated that as soon as we have something, ALA and GBIF are happy to run with it. Whatever ends up being the standard implementation for GBIF and ALA will be a reference implementation.

  • We will try CSV, and if it is too unwieldy we might have to produce a huge Jupyter notebook for cell output, or something similar
  • A separate GitHub repository will be used because the current one already has a lot of things going on in it. We could use the current BDQ repository for a reference dataset, but pushing code into it might overload it. All changes go into the same change log, so it's hard to pull out what's relevant when there are several processes running concurrently. A link will be made from the BDQ GitHub to the other repository. [ACTION: Alex to work with Peter Desmet to get a new repository]

The Way Forward

  • We need timeframes on action items
  • Implementation will throw up some stuff we need to modify before we release a standard
  • Need to ask TDWG what they want us to produce – this will be different to DwC as it’s less about terms and conditions in many respects [ACTION: – Lee to talk to TDWG Executive about this]
  • Options are for one overall standard (from the Framework down) or several interrelated standards – including a Vocabulary Maintenance Standard.
  • It was suggested that one standard may be unwieldy and harder to maintain; but if there is more than one standard, then a Framework Standard should come first and the others follow with linkages to it. Let us not have standards for TG2, TG3 and TG4 prior to the Framework.
  • Suggested that not everything need be within the standard but include documents that reference the standards.
  • [ACTION: Arthur – to add to the agenda for the Sunday meeting prior to TDWG 2017]. This will allow us to be a bit more mature by then
  • Noted that many of the tests rely on Controlled Vocabularies so implementation will be difficult without these available. Suggested that a priority list be generated.

Use Case Library (TG3 – Miles)

Emily Rees gave a PowerPoint presentation of her work on preparing Use Cases from literature and through interviews with users of ALA data.

  • We were tasked to develop a library of use cases to use as a resource for research – this would allow users to run statistics across the library to work out which are the most common elements used
  • We can then automate data selection from profiles
  • Therefore, we need to collect the use cases data. We pulled the framework apart and put it into related spreadsheets with elements from the framework
  • Allan built a tool in which you can select a use case and it will build a profile from it – to demonstrate that you can do something useful with this information
  • We are trying to get people to enter information – we put information up on the TDWG notice board but got no respondents; we then developed an offline spreadsheet, a simpler version, but still got no respondents; Dmitry then built a Google form with 11 questions that feeds into a spreadsheet – and we got only two respondents
  • It takes a lot of work to transpose into the spreadsheet – Emily was employed to work on this for 3 months through a CSIRO summer scholarship – she reviewed research papers, and emailed and talked to people to gather information for use cases – there are 28 use cases in the spreadsheet resulting from this work.
  • How can we continue to populate the library? How can we use it? Is it worthwhile continuing with this?
  • Most examples were from papers; authors were then emailed where there was not enough information to populate the spreadsheet.
  • Of the types of use cases – Species distribution models was the largest number of use cases (so this will skew the results), then database entry, species lists, audit, climate envelope

Discussion:

Dmitry mentioned that the GBIF Mendeley library could be used for data mining if it could be partially automated. If Emily has documented the routines by which she extracted the important bits, we could potentially automate this.

It was also suggested that once the tests are clearer it would be easier to track which filter options people use when downloading data. In this way we may be able to save filters as profiles and develop a library that way. It is near impossible to extract the required details from published papers, as most methodologies are not that precise.

Once there is a sufficient number of Use Cases, then themed profiles could be developed (SDM, etc.) and then these could be modified by individual users.

It was suggested that we ask Pensoft to require authors to develop a profile prior to publication.

Next Steps

  • We shouldn't discontinue building the library – but perhaps putting effort into manual building is not worth it – exploring opportunities to engage more people would be good
  • An option is developing better controls on the Google form – more directed and controlled inputs, e.g. rather than free text, use check boxes and drop-down lists based on the Tests / DwC terms, etc. We would still need a verbatim field for tests that may not be included.
  • Could get ALA to add an extra prompt with checklist to ask users during download for the use [ACTION: Lee/Miles to explore]
  • Alex suggested that there might be benefit in bringing Sherry (iDigBio evaluator) in to help design a new form [ACTION: Alex to pursue]
  • Need to put a piece in a paper for the Biodiversity Data Journal about Emily's work – it's up to Emily and Miles to take this forward

NB. The meeting suggested to Emily and Miles that this work should be published – possibly as a Pensoft publication as there were some important implications.

GLOBIS project - Lee

Lee reported that this is an EU-funded project which came out of a similar project on interoperability – it looked at positioning biodiversity research infrastructures, aligned to state-of-the-environment reporting and Essential Biodiversity Variables (EBVs) – we looked at a couple of invasive species from GBIF and ALA and then looked at criteria to extract data

  • There were shortcomings of ALA and GBIF on this very specific project – there were lots of lessons to be learnt [ACTION: Lee to send notes of issues identified to Alex]
  • It would be great if people could filter more for users’ purposes rather than having to do it post extraction – it came down to additional support that agencies could provide in the filtering process
  • In the spatial portal of ALA we can interrogate the points and see changes the filtering made
  • We need to improve tools for people to filter data prior to extraction
  • Domain experts flagging missing datasets
  • Differences in data between ALA and GBIF
  • The concept behind EBVs is to construct species occupancy matrices for portions of the earth's surface with associated abundances – we want to identify which species within a range of groups occur in an area and how this compares with other areas / time periods, so that they can be used to model patterns of change in time and space
  • Looking at major aggregation databases as conduits to these models – so initially focussing on a few species – the thinking was that we need to optimise data processing so it’s possible to get the best, most likely data out of the system without much post-processing
  • We should be pushing various filters down to the APIs to select the data e.g. bins to decades for birds – this means that you want to be able to exclude records that don’t meet criteria
  • We need to revisit how DQ processing happens in ALA, GBIF etc. – how can this be done in more consistent way and that the effects of what’s been done to the data output and the APIs can offer exports in the same way
  • There are 3 tiers in which we're trying to deliver real standardisation and interoperability across infrastructures – how we document checks and implement them; APIs and exports could be more aligned across infrastructures (e.g. an external tool could access data more consistently across aggregators); and clarity, guidelines and web services to push into the same workflows that ALA and GBIF use. Can we move all data indexing and harvesting into a shared cloud, so that they all use one pool of data, but with different views for different jurisdictions, etc., while performing the same checks on the data across the whole dataset? There would be a lot of alignment of data

Discussion

  • Tim – we need to think about putting data into more than DwC or DwC archives – need to identify classes of data that we need to validate and treat them partially in isolation. It doesn’t matter where it is – the three things can be run anywhere – don’t let the pool of data in the cloud stop us moving forward.
  • Dave -OBIS includes extended DwC data model which might be useful - still star schema
  • Alex -a lot of this stuff doesn’t fit into a star schema but we’re handling it in iDigBio

GitHub – Arthur

  • How best can we operate using GitHub – should we be using google docs instead for documents / spreadsheets etc.
  • GitHub is not as intuitive for non-programmers
  • If other non-programmers are going to use it there's a lot of work in setting up a spreadsheet etc.

Discussion

The meeting noted that Allan's use of GitHub for the Framework was impressive; however, there is quite an overhead and investment in doing this and it won't suit all activities.

Need to keep it simple. Alex and John are prepared to help if needed. It was noted that there are many tutorials and tools within GitHub itself.

It was suggested that the profile builder may be split off from the GitHub, i.e. a separate tool to generate the profiles – there is going to need to be a full application to enable us to do something useful.

Antonio – we want to have a DQ solution for the system, but we want to embed many things in that – we want to develop some tools which will serve other purposes. Although these may not be needed by the funding body, we want to have this done.

It was suggested that there are a lot of things we're doing which don't fit neatly into GitHub – we shouldn't allow the complexities of GitHub to discourage people from contributing – links from GitHub to Google Docs etc. are all we need in some cases. There are benefits to broadcasting our progress to the wider community, and we can upload arbitrary files to GitHub to do this. Final products should be in GitHub.

Vocabularies (Proposed TG4 – Paula)

Paula presented her draft report on Building Community-vetted Controlled Vocabularies for Biodiversity Data based on the need for Vocabularies for both Darwin Core and for the Tests and Assertions arising out of this Interest Group. It was estimated that there are many (up to 48) vocabularies required as a priority for use in the Tests and Assertions.

  • The data is messy – there are lots of distinct values for each attribute, e.g. 80,000+ values in the country attribute in GBIF where we would expect 250 distinct values – this makes it very hard to use the data
  • Do we really need controlled vocabularies?
    • Data producers – capture data using vocabularies
    • Data custodians
    • Data aggregators
    • Data users
  • There have been discussions in the DQIG and a summary report produced regarding how best to handle them etc.
  • Where does the subject fit – which interest group?
  • Darwin Core Hour April 2017 – there was a discussion covering the data situation and the landscape of already-built controlled vocabularies – Paula put together a list of resources
  • There is also a brainstorming document – to set the stage – which includes a preliminary plan to go forward with controlled vocabularies –there has been lots of interest / comments
  • The idea was to create within the TG4, discipline-specific groups – to be defined – still open for discussion – need to get the community involved – need to reach out as broadly as possible – e.g. herpetology etc.
  • What is the environment for capturing vocabularies? If we are going to build them, how are we going to write them down? GitHub? Google spreadsheet etc.
  • What are the formats & requirements?
  • What is the action plan / working strategy?
  • It was also reported that the GBIF Task Group on Data fitness for use in research on alien and invasive species has an imminent need for several vocabularies for use with invasive species, including for dwc:establishmentMeans and dwc:occurrenceStatus. They also suggest two new DwC terms (degreeOfEstablishment and origin). They are beginning to work on these, and had a preference to do this under a specific Task Group under the DQIG.
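
The kind of clean-up Paula described for the country attribute can be sketched in a few lines (the controlled list and variant mapping below are tiny invented examples, not an actual GBIF vocabulary): messy verbatim values either resolve to a canonical term or fall through for later curation.

```python
# Hypothetical sketch: collapsing messy verbatim country strings onto a
# small controlled vocabulary.
CONTROLLED = {"Australia", "Brazil", "United States"}

# Known variants mapped to canonical terms; a real service would be far larger.
VARIANTS = {
    "australia": "Australia",
    "aus": "Australia",
    "brasil": "Brazil",
    "u.s.a.": "United States",
    "usa": "United States",
}

def normalise_country(verbatim):
    """Return the canonical term, or None so the verbatim value can be curated later."""
    cleaned = verbatim.strip().lower()
    if cleaned.title() in CONTROLLED:
        return cleaned.title()
    return VARIANTS.get(cleaned)

assert normalise_country(" AUS ") == "Australia"
assert normalise_country("Brasil") == "Brazil"
assert normalise_country("Oz") is None
```

The values that return None are the "background noise" Donald refers to below: they need a curation process, not just a lookup table.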

Discussion

  • how do we structure the groups that develop the controlled vocabularies – maybe set up standards and practices for generating them and then develop a few test cases?
  • Donald - we need long-term specialist group to manage these things to deal with noise coming out of the data. We need to allow people to use vocabularies where they exist but if they don't exist users can enter whatever is in their data. I see controlled vocabularies as a type of classed instances – we want to create nodes to attach things to – i.e. take one of these and pull all the records back for that value. The idea of a fixed class and the ability to have soft terms that reference them – we need to have some way of curating all the “background noise”.
  • There are two key areas – the data in GBIF and transferring that, and the area where we are trying to implement the tests and assertions. This is the reason for suggesting a Task Group with domain-specific groups: a Task Group that defines requirements and formats, and then domain-specific groups under it. By having a Task Group rather than a separate Interest Group, it's easier to maintain vocabularies in the longer term
  • TDWG has the ability to have maintenance groups now so that when a TG finishes, maintenance can be ongoing.
  • There is an ISO standard for developing vocabularies – it covers both monolingual and multilingual vocabularies
  • Sometimes there are things that we don’t think of as vocabularies e.g. scientific name, institution names – but it’s a similar problem conceptually so we could have services that would help us fix up a lot of data cleaning
  • Lee – regarding whether this is an Interest Group or Task Group issue – it would be good to have it under the DQIG, but there are a lot of high-level concepts across many Interest Groups, so it would be good to set up a new Interest Group. Then the recognition of the people doing the coordination is more apparent. There are common issues across all vocabularies, so perhaps an Interest Group with domain-specific Task Groups under it would be the way to go.
  • It would be good to have a mechanism for deciding which elements are priorities for normalising and turning into vocabularies. In cases where the only conceivable use case is a human user looking at a record, it is probably not worth the effort. Separate domain-specific groups may be too heavyweight – some could be very small, with just a narrow list of terms, while others might go into more depth. There are domain-specific groups out there without a formalised TDWG Task Group; under the TDWG structure this may be more difficult.
  • Donald – my biggest concern with TDWG is that we don't have a workable high-level framework of basic classes of things that we care about – we've been trying to do this by structuring the semantics of our world with XML schemas. We need domain-specific knowledge to map data across schemas (this is not always possible). We don't need a full ontology, but a very basic level to ensure that all relevant data refers to something that we agree upon, i.e. tie records to instances of concepts and entities in the world (some open-ended and some controlled). There are questions of hierarchy to resolve, but we want to map high-level concepts. We can start bottom-up with the DwC terms, but if we take it from a high level – from where we want to get to with the biodiversity information graph – then we can talk about the important ones. For the vocabularies which are open-ended thesauruses, we need a process to develop them
  • Arthur – we could set it up as a Task Group under the DQIG and then move towards an Interest Group later if necessary – just to get the first stage done at least, as Paula has suggested – there are a lot of arguments set out in her document on the Google Drive
  • Paula – tasks would include building a document / common way to build and capture vocabularies, and understanding the extent to which each vocabulary could have flexibility or not – setting out the minimum requirements a controlled vocabulary must provide. The second task is to better identify / define / centralise groups who are developing or could develop controlled vocabularies for specific domains, and have them build vocabularies. We could build a common area where those vocabularies can be stored
  • We need to establish a list of priority vocabularies – e.g. a list of ones which are already available, ones that are either priority or near complete, or longer-term ones
  • Lee – we could look at the tests and assertions and the DwC terms which most come up and use those as a starting point
  • Donald – vocabularies may be flat or hierarchical, with historical coverage, e.g. taxa, countries, preferred terminology. It would be good to get a view of the challenges we'll face and the choices we'll make with respect to these – i.e. do we go for high-level or domain-specific terms? How do we build the scoping? At what point do we start looking at ontologies? This would be good to capture in a discussion / white paper for TDWG to consider. GBIF etc. could undertake a wide consultation process to get an understanding of the preferred approach
  • Paula – it is important to define the scope and what we're targeting. I like the idea of going with terms that are already being used – particularly within this domain – it's important to understand which cases we're basing it on.
  • Arthur – scoping could be one of the tasks of the Task Group. We could set up the charter with deliverables and then wind up by recommending a Task Group or Interest Group to set up / manage the remaining processes etc.
  • [ACTION: Paula to set up a charter and TG – Antonio will help with this.]
  • Paula – it’s a big challenge to engage people to help – we’re going to need a lot of outreach – ideas to engage people?
  • [ACTION: Arthur – the DQ Interest Group is giving a talk at SPNHC, so send some slides to Paul Morris to include in his talk – I will email Paul]
  • Environment for capturing vocabularies – things we need to keep in mind: simplicity, minimum requirements, easy to manage, easy to propagate, easy to use down the pipeline – Google Docs / spreadsheets / GitHub / a special web interface
  • [ACTION: Lee – The French version of ACEAs has a user interface – will provide links / contacts to Paula]
  • [ACTION: Miles – Australian National Data Service (ANDS) has research vocabulary service which might be useful – to send link to Paula]
  • [Action plan / strategy – charter, action plan (part of charter?), agenda for 2017 meeting, involvement and outreach, grant applications (to get people together, to develop vocabularies, to develop interface).]
  • Paula – where do we go for grant – RCM?
  • Donald – this depends on the timing – this is something that GBIF has in its current implementation plan for next 5 years. If we have a plan we could look to providing funding one way or another next year.
  • Miles – it could be worth asking John (le Salle) for co-investment from ALA as well
  • Alex – with a supplement from iDigBio? [ACTION: Paula and DQIG Convenors to pursue these funding options]
  • Task Group to be "TG4 – vocabularies" but check the ISO Standard for appropriate terminology.
  • Need to make sure that we have time to discuss at TDWG2017. May require some evening beer meetings.

Summary

After the meeting there was further discussion in small groups. I have attempted to summarise below the discussion both during the meeting and shortly afterwards.

  • Stage 1 would be the establishment of a Task Group on Vocabularies under the DQIG.
  • Paula Zermoglio agreed to lead this Task Group.
  • The first step will be to write a Charter for the Task Group.

The meeting was unsure of the limits of the Task Group. It was not clear whether such a Task Group should manage the development of all required vocabularies using Domain Specific Groups (see Paula’s report), whether some could be done this way with other vocabularies (for example, Invasive Species) established under separate Task Groups, or whether a TDWG Vocabulary Interest Group should be established with Task Groups under it for each required vocabulary. In the discussion, it was suggested that having lots of Task Groups (20–40) would be very difficult to manage, and that the formalities around Task Group establishment and management would make the process long and unwieldy. It was suggested that perhaps the best way to achieve the desired result may be a mix – i.e. some vocabularies prepared under this proposed Task Group and some under separate Task Groups (where possible using domain-specific groups) – either under the DQIG or under several relevant Interest Groups. It was agreed, however, that some form of standardisation in format was desirable.

It was agreed that more discussion on the development of a Charter would need to be held remotely. Following discussion at the meeting and in several follow-up side meetings it was decided that key tasks of the Task Group may include:

  • Preparation of a Scoping Document
    • Issues for scoping include where the various Task Groups would fit if it is decided that that is the way to go – whether they should all fall under the DQIG, be spread across several different IGs, or whether a separate Vocabulary IG should be established.
    • Determining what types of vocabularies TDWG requires – consistent with the various ISO Standards on vocabularies (i.e. do we require vocabularies, dictionaries of terms, controlled vocabularies, thesauri, ontologies, etc.) – and whether vocabularies should be monolingual, multilingual, or various mixes of the above. The relevant ISO Standard is ISO 25964
  • Listing TDWG-required vocabularies, as well as examining vocabularies that already exist and can be used as-is or modified for TDWG purposes (see the document prepared by Paula on some existing vocabularies). Prioritisation of vocabularies may also be of value.
  • Identifying domain-specific groups that may be involved in the preparation of vocabularies
  • Developing a standard format for the building of TDWG vocabularies
  • Development of a common repository for TDWG vocabularies
  • Using one or more vocabularies as examples (e.g. the Invasive Species vocabularies may be a good starting point).

The meeting suggested that the final format (e.g. SKOS, RDF, RDFS, etc.) should be determined further down the line and that the proposed Task Group should not get too bogged down on that aspect at this stage.

The meeting agreed that working all this out and making recommendations to TDWG on the best way(s) to proceed will be a large enough job for the Task Group without also having the job of preparing the 40 or so required vocabularies through managing 40 or more DSGs. The scoping document may, however, be able to identify what domains would need to be approached to establish a DSG for each required vocabulary.

The meeting hoped that the Task Group could be established, and at least a draft Scoping Document prepared, by TDWG 2017.

Paula will be looking for members for her TG and for assistance in writing the Charter.

OBIS – data flows and quality (See Powerpoint)– Dave

Dave reported that he has taken over the role of node manager for Australia – there has been a change in how OBIS manages data

  • There are thematic and country nodes – OBIS expects nodes to deliver quality – OBIS expects to be able to talk to a node and get data
  • All data providers have to come through a Tier-2 node (e.g. Australia) to the Tier-1 node (OBIS), so there is an opportunity to do data-quality checking at the Tier-2 level.
  • All names have to have a WoRMS LSID
  • OBIS-env-data project – adds environmental and other context data to DwC
  • Designed to deal with CTD casts, trawl events, related catch composition, and extending existing species occurrence records with environmental measurements, etc.
  • EventCore structure, Occurrence extension, eMoF extension. Achieves a star schema
  • Simple tests are already run at the node level – dates, spatial, checking that ranks are filled – a good opportunity to have rules formalised and put into the data stream
  • We provide metrics for all datasets that come in – we keep building more and more tests against the data – so we’re very keen to make sure the data are in the best format possible
  • We would like to identify tests relevant to a Tier-2 node, with more difficult ones performed at iOBIS
  • We identify OBIS-specific tests (e.g. benthic taxa recorded 1000 m above the sea floor) and add those tests and test values where they fit in DwC
  • All tests should have GUIDs
  • We have heaps of data already and are keen to use this as a test dataset

BDJ paper – Arthur

Arthur reported on the paper being written for the Biodiversity Data Journal: “Improving Biodiversity Data Quality through a Fitness for Use Framework”

  • Dmitry had suggested a more light-hearted style of paper – but that is probably not the type of paper we want in this instance. There may be another paper elsewhere in Pensoft at another time. I have responded to Dmitry that it would be better in another article – maybe with Emily’s and TG2 data
  • Arthur to be overall editor and write introductory parts
  • I have suggested a structure and circulated this – no comments to date – it may be too ambitious – we need people to commit to writing parts of this
  • Lee has begun writing up the principles out of the Tests and Assertions
  • Pensoft is pushing to get this done before “too long” – no timeframe has been set – possibly by TDWG 2017?
  • Intended to be just an overview / introductory information, but it may be hard to condense
  • Alex – we don’t need to go into the technical elements of FFU as this is described in more detail elsewhere – we could focus on the application of the tests and assertions with respect to the DwC framework and use cases, e.g. species distribution models. Guiding principles – many of these are more relevant to the application of DwC than others
  • We need a case study – what? who? how detailed? Taking a user story, taking a profile, generating reports, and using that to extract data
  • Miles – Kristen Williams’ use case example would be a good one to use for this paper
  • Arthur – it would be good to do something other than a species distribution model; we could bring in Kristen as an author. Put up a process of how this works, i.e. where the data cleaning comes in, but this needs further discussion at a later date (with Allan)
  • Arthur – we need to annotate the document to nominate sections to write. The ARPHA tool is unable to track who made changes, only when changes were made. I will be chasing people up more and more over the next few months to ensure it gets written
  • Arthur – we can use the list of authors to email everybody.

Other Business

  • Meeting on the Sunday before TDWG – a room is booked for the day to talk DQ. We also have 2 symposium sessions and an open workshop session.

  • Antonio thinks he can come up with some money for us to hold another one of these meetings in Brazil next year – good opportunity to push the agenda
  • Joint DQIG and Annotations IG meeting in Denver on Sunday before SPNHC meeting. Only about 8 people registered.
  • Need to discuss feedback mechanisms – i.e. feeding annotations back to data providers and how we deal with that – our part only deals with the DQ side, but we need to address integration and feedback
  • [ACTION: Donald to send email summarising GBIF work on Annotations] (This has been received)
  • Miles – ALA has annotations for any issues with records (types of issues, e.g. geospatial, taxonomic, etc.) and with images (upvote / downvote, then used to rank images in the gallery). The theory is that services can be used to link into annotation services from collections to the ALA and upload new annotations.
  • Greg – what do you do with the output of the data quality tests and assertions? They are annotations and would be extremely useful to feed back to the data providers.
  • Alex – having them actually be annotations is an issue, as there will be 5 per record in GBIF plus user annotations. Paul said we can’t do this using FilterPush – if you have a system moving annotations back to the curator and you start feeding automated annotations into it, there is no way to control the velocity.
  • Donald – we can provide a way for users to verify whether the manual annotations were reasonable or not.
  • Alex – having data quality services that specimen databases run against themselves relieves the need for automated annotations at the aggregator level.

Unfinished business

Paula – we don’t have an email address linked to all of us – [ACTION: Arthur to set this up in GitHub or other options] – Slack, Google Groups, a list on the TDWG mailing list if possible? Arthur – if people are registered on GitHub as watching, then we can send to all of them, but not everyone is.
