Content of review 1, reviewed on January 04, 2022
The authors introduce and describe approaches to the process of data rescue, which seeks to preserve existing scientific data that is at risk of being lost through any of a wide range of well-documented processes that render data inaccessible or unusable to other researchers. This is an absolutely critical issue that has not received proper attention in the literature. The paper is well-written and well-referenced. Overall this is clearly an important contribution, which I highly recommend for publication. However, I do think a few of the particularly thorny issues of the paper are not properly developed, and they ought to be addressed head-on if the paper is to convince the broader community and not merely those already receptive to the message. (1) Is it possible to quantify in any way the extent of data requiring rescue? (2) How do we prioritize data for rescue? (3) How do we navigate the issues so nicely summarized in the CARE principles to the extent that they conflict with FAIR (or else convince the reader that there is no inherent conflict between these)?
The devil's advocate position is that much scientific data does not need rescuing, because science is rooted in the production of knowledge, not the production of data. This perspective is not without merit -- had we lost Brahe's measurements of planetary motion or Galileo's observations of the phases of Venus or the moons of Jupiter, the theory of orbital mechanics would be in no way imperiled -- the conclusions and methods are still preserved in the scientific literature, so the data can be re-collected and the results confirmed. High school physics students can confirm Millikan's oil-drop measurement of the charge of the electron. Obviously this does not apply to data about processes such as global change, where observations made today will differ from those of the past, but much experimental data does not fit neatly into that category. Does that mean we should prioritize the preservation of observational data over experimental data, because the latter can be more easily re-generated from scratch? Or field data over lab data? (What about numerical simulation data or model output, such as forecasts?) Do we trade off between reproducibility and priority for rescue? Or should priority be based on the (perceived) scientific significance of the data, regardless of whether or not it could be recreated?
I think the paper would benefit significantly more from guidance on how to prioritize data for rescue than from details on how to go about rescuing it, which largely echo existing (and ever-evolving) advice on best practices for archiving ecological data (new or rescued). In particular, identifying a few examples or case studies of specific at-risk data products would greatly strengthen the paper. Can the authors point to some key datasets (at least key to a specific sub-field) they see as being at risk of loss, or walk through an example of identification or prioritization?
One area that would be particularly instructive as an example is a case study that involves navigating the essential issues raised in the CARE principles the authors cite: how to execute data rescue without crossing boundaries into what may appear to be data appropriation. This has always been a contentious issue, but it seems particularly acute in the context of data rescue; for example, cases where indigenous data sovereignty requires that data not be made immediately accessible on some of the public data archives already cited in the paper.
The examples provided in the boxes are excellent.
More minor technical details the authors may want to consider:
Regarding metadata creation (L162): the authors mention XML, and EML in particular.
I'm as big a fan of EML as you could find, but I think the standard has shortcomings that make it relatively unfriendly both for authors creating the EML and for data reuse. I highly recommend some mention of more modern standards, such as schema.org, which can be expressed in JSON-LD, a format that is both simpler for creators and consumers and also more powerful (thanks to the 'linked data' part, which makes this metadata RDF-compliant). These technologies did not exist at the time EML was created. Good guidelines and examples of scientific use of schema.org can be found from the Federation of Earth Science Information Partners (ESIP), https://wiki.esipfed.org/Main_Page, and also https://bioschemas.org/. Notably, schema.org is backed by the major search engines and used throughout federal agencies and data.gov (e.g. see https://developers.google.com/search/docs/advanced/structured-data/dataset). Some tooling exists to help generate schema.org markup as well as translate it to EML: https://cran.r-project.org/package=dataspice.
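To make this concrete, here is a minimal sketch of what a schema.org Dataset record might look like, expressed as JSON-LD (built and serialized with Python purely for illustration). The dataset name, creator, URLs, and other values are hypothetical; the ESIP and bioschemas guidelines linked above remain the authoritative references for which properties to use.

```python
import json

# Hypothetical schema.org "Dataset" record expressed as JSON-LD.
# All values are illustrative only.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Rescued stream temperature records, 1975-1990",
    "description": "Digitized field notebooks of daily stream temperature.",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "creator": {"@type": "Person", "name": "A. Researcher"},
    "variableMeasured": {
        "@type": "PropertyValue",
        "name": "water temperature",
        "unitText": "degrees Celsius",
    },
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data/stream_temp.csv",
    },
}

print(json.dumps(dataset, indent=2))
```

A record like this can be embedded in a dataset landing page inside a script tag of type "application/ld+json", which is how the search engines typically discover it.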
Regarding Data structure (L187):
The discussion of 'tidy data' is most welcome. It would be worth acknowledging in the text, as Wickham does in his paper, that "tidy" data is more technically known as Codd's third normal form (see https://en.wikipedia.org/wiki/Third_normal_form and references therein), a part of relational database theory that has been around since the 1970s, though certainly we all owe a debt to Wickham for successfully introducing this concept more broadly. Most ecological data would certainly benefit from more attention to relational database design; nevertheless, this section ought to at least mention alternative models that may be more appropriate for certain data.
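For illustration only, a small sketch (using Python/pandas, with invented values) of reshaping a "wide" field sheet, where the site variable is hidden in the column names, into the tidy/long form:

```python
import pandas as pd

# Hypothetical "wide" field sheet: one column per site, which hides the fact
# that site is itself a variable.
wide = pd.DataFrame({
    "date": ["2021-06-01", "2021-06-02"],
    "site_A_temp": [14.2, 15.1],
    "site_B_temp": [13.8, 14.6],
})

# Reshape to a tidy (third-normal-form-like) long format:
# one observation per row, one column per variable (date, site, temperature).
tidy = wide.melt(id_vars="date", var_name="site", value_name="temp_C")
tidy["site"] = tidy["site"].str.replace("_temp", "", regex=False)
print(tidy)
```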
The most obvious is array-based data, which is particularly prominent for spatial data (think netcdf, hdf5, geotiff, and the myriad other raster formats, as well as more recent models like zarr and xarray), though array-based data can be used in many other situations where the dimensions are not merely x/y/z spatial or temporal coordinates. One advantage many of these formats have over the relational database model is that essential metadata (units, column descriptions, etc.) can be embedded in the data files themselves, where it is understood by a range of optimized software. Emphasis might be placed on modern, cloud-optimized formats (https://www.cogeo.org/) over legacy formats (netcdf) for explicitly spatial data.
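As a sketch of what embedded metadata looks like in practice, here is a hypothetical example using the xarray Python library (it assumes xarray, a netCDF backend, and the zarr package are installed; all variable names and values are invented):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical gridded temperature field; units and descriptions travel with
# the data as attributes rather than living in a separate metadata file.
temp = xr.DataArray(
    np.random.rand(2, 3, 4),
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2021-06-01", periods=2),
        "lat": [44.0, 44.5, 45.0],
        "lon": [-123.0, -122.5, -122.0, -121.5],
    },
    attrs={"units": "degC", "long_name": "surface water temperature"},
)
ds = xr.Dataset({"temperature": temp})

# Write to self-describing array formats; to_netcdf() needs a netCDF backend
# (e.g. netCDF4) and to_zarr() needs the zarr package.
ds.to_netcdf("temperature.nc")
ds.to_zarr("temperature.zarr")
```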
It may also be worth mentioning that some data does not lend itself well to a relational database structure (which is quite rigid and harder to extend), and can benefit from a tree or graph data model like JSON. The Resource Description Framework (RDF) is a widely implemented W3C model that you can think of as an alternative to Codd's third normal form of relational database theory -- in RDF, all data is essentially in a maximally 'long' form of three columns, or 'triples' (subject, predicate, object). Such data is queried not with SQL, as a relational database would be, but with a graph query language, SPARQL. That's probably well beyond the scope of this paper; my point is only that researchers ought to be aware that "tidy data" is not the only theoretically grounded generic data model (even if it remains the most relevant).
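A minimal sketch of the triple model and a SPARQL query, using the rdflib Python library with hypothetical identifiers and values:

```python
from rdflib import Graph, Literal, Namespace, URIRef

# Hypothetical example: one observation expressed as RDF triples
# (subject, predicate, object) rather than as a table row.
EX = Namespace("http://example.org/")
g = Graph()
obs = URIRef("http://example.org/obs/1")
g.add((obs, EX.site, Literal("site_A")))
g.add((obs, EX.date, Literal("2021-06-01")))
g.add((obs, EX.temperature_C, Literal(14.2)))

# Query the graph with SPARQL instead of SQL.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?site ?temp WHERE { ?obs ex:site ?site ; ex:temperature_C ?temp . }
""")
for site, temp in results:
    print(site, temp)
```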
More obviously, it may be worth mentioning that sub-field-specific practices may dictate the encoding of particular data objects, such as phylogenetic trees.
Regarding data archiving (L248):
The mention of csv/txt is welcome, though as per the discussion above, alternatives might be preferred for data not structured in a relational data model (e.g. geotiff, geojson, etc.). Text encoding of tabular data has the disadvantage of not encoding data types (logical, integer, text, Date, etc.). Good metadata takes care of this, but in practice that is imperfect -- it may be worth also mentioning the now widely used open-source Apache Parquet standard for large data (https://parquet.apache.org/documentation/latest/), which encodes types, compresses data, and is optimized for database queries. Breaking very large tables into multiple files is also an accepted best practice that reduces the chance of data corruption during network transfer, etc. I'd love to see somewhere in here the recommendation to include a checksum (ideally something strong like sha256) of the data files in the metadata. Cryptographic checksums are critical for ensuring data integrity. Most data repositories already compute these automatically. They can also be used to identify and retrieve data files mentioned in the metadata records.
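As a sketch of both suggestions (it assumes pandas with a parquet engine such as pyarrow installed; file names and values are hypothetical):

```python
import hashlib

import pandas as pd

# Hypothetical table; to_parquet() requires pyarrow (or fastparquet).
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-06-01", "2021-06-02"]),
    "site": ["site_A", "site_B"],
    "temp_C": [14.2, 13.8],
})

# Parquet preserves column types (dates stay dates, floats stay floats)
# and compresses well, unlike plain csv/txt.
df.to_parquet("temperature.parquet")

# Compute a cryptographic checksum of the archived file so anyone can later
# verify that the bytes have not been altered or corrupted.
with open("temperature.parquet", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print(digest)
```

The resulting hex digest could then be recorded in the metadata record alongside the file name, so that anyone retrieving the file can verify its integrity.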
© 2022 the Reviewer.
Content of review 2, reviewed on May 16, 2022
The authors have carefully addressed all of the concerns I had previously raised. I congratulate them on a well-written piece raising this critical and previously overlooked issue.
© 2022 the Reviewer.
References
K., B. E., B., B. J., T., H. G., G., R. D., A., B. S., Kerri, F., Jason, P., S., P. L., M., S. J., S., S. D. 2022. Data rescue: saving environmental data from extinction. Proceedings of the Royal Society B: Biological Sciences.