Content of review 1, reviewed on January 26, 2015

Soranno and colleagues describe the considerable effort they expended to compile dark data for 50,000 lakes from 17 US states into an integrated and easily used database. Dark data are what Heidorn (2008) and others have called the data produced by small-scale projects that are typically process-based. These data are not described for others' use, are rarely indexed, are invisible to scientists, and are therefore rarely used (Heidorn 2008).

The database focuses on lake water composition (more later) and contains a wealth of watershed characteristics. The authors describe in detail the methods they used to compile this highly valuable database, including harmonizing and integrating the data, as well as ascertaining the quality of the contents. The paper ends with a set of lessons learned for synthesizing such diverse data.

This paper represents important work describing how to compile large integrated databases for synthesis and helping individual researchers see the benefit of following data conventions when conducting research. There are a number of topics that need to be addressed:

  1. The paper would have benefitted from more information about the data themselves. What are the main variables (maybe top 10 variables) in the database? Did the authors have criteria for selecting a study for inclusion in their database? At times in the manuscript they mention lake chemistry (line 111), but they seem to be focusing on lake nutrient chemistry (line 267). A summary of the data would clarify that.

  2. A demonstration of the basic characteristics of the data (numbers of lakes with lake water nutrients, range, frequency distribution, means, etc.) is one way to show that the effort expended to develop this database is worthwhile; another measure of the usefulness of these data should come later, when they are used to address regional limnology / macroecology questions, large-scale questions that can only be addressed with this type of data.

  3. Are these data to be made publicly available? A major shortcoming in this paper is that the authors do not provide a link for readers to obtain the data.

  4. There are many examples of previous researchers compiling diverse data into integrated and easily analyzed data products (line 162). The paper would benefit from referencing studies compiling flux tower data (Falge and Baldocchi), root characteristics (Jackson and students), terrestrial net primary production (Olson’s NCEAS project with Gower and others), and soil respiration (Bond-Lamberty).

  5. Interdisciplinary teams are needed for pulling together databases of the type described in this paper (line 205, Figure 1). With the range of data being integrated, one would have expected to see other science disciplines besides ecology included. For example, some science expertise is needed in the Geo part of the database (landscape ecology, terrestrial ecology, etc.), as well as some hydrology and biogeochemistry in addition to ecology. Importantly, an interdisciplinary study should use interdisciplinary standards, including international metadata standards such as ISO standard metadata, instead of a discipline-based standard such as EML (lines 337, 341, 544, etc.).

  6. Did the authors apply rules to include / exclude lakes from their efforts? For example, if no lake water nutrients (N or P) were included, was the lake excluded?

  7. The discussion of open science is good and important (line 221). But how will future researchers add information to the database? Will a controlled version be placed online, with additional lake water nutrient chemistry data checked in by one of the authors? See Bond-Lamberty and Thomson (2010) for one way to enable others to extend the database by adding more data points. One detail that the authors need to correct is that DOIs are actually just one of many different locators, not identifiers per se, that could be used. The authors should be more open to the use of other locators (e.g., ARK, UUID, etc.). A related comment: how will the authors distinguish future versions of the database, with either additional data or corrections, from the original and from other subsequent versions?

  8. The authors and the journal need to work together to clarify the “additional files.” It is currently quite confusing. The URL links have 26 “additional files” (last page of ms), but the text reference has 22 “additional files” (see p. 25). When clicking on a couple of these to find more details, the expected file did not appear. When the URL labeled “Additional file 14” was accessed, a document called “Additional file 11” appeared, not #14. After a couple of times of clicking on one link and not finding the expected document, I stopped and did not review these additional files. Fortunately, this is an easy thing to fix.
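The data summary requested in point 2 and the inclusion rule raised in point 6 need not be elaborate. As a minimal sketch only, with hypothetical lake records, field names, and an assumed rule (keep a lake if it has at least one of total P or total N), none of which are taken from the manuscript:

```python
from statistics import mean, median

# Hypothetical records, one per lake; None marks a missing measurement.
lakes = [
    {"lake_id": "A1", "tp_ug_per_l": 12.0, "tn_ug_per_l": 410.0},
    {"lake_id": "A2", "tp_ug_per_l": None, "tn_ug_per_l": 650.0},
    {"lake_id": "A3", "tp_ug_per_l": None, "tn_ug_per_l": None},
    {"lake_id": "A4", "tp_ug_per_l": 48.0, "tn_ug_per_l": None},
]


def has_nutrients(lake):
    """Assumed inclusion rule: the lake has at least one of TP or TN."""
    return lake["tp_ug_per_l"] is not None or lake["tn_ug_per_l"] is not None


def summarize(records, variable):
    """Return count, range, mean, and median for one variable."""
    values = [r[variable] for r in records if r[variable] is not None]
    if not values:
        return {"n_lakes": 0}
    return {
        "n_lakes": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


included = [lake for lake in lakes if has_nutrients(lake)]
tp_summary = summarize(included, "tp_ug_per_l")
tn_summary = summarize(included, "tn_ug_per_l")
```

A table built from summaries of this kind for the main variables, together with an explicit statement of the inclusion rule, would answer both points.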

Detailed comments

Line 235. This paragraph, and particularly the opening sentence, needs to be recast. Does anyone collect data without a clear plan linking samples and analyses back to the hypotheses? Ready, fire, aim? The point that data structure and data management should be fully integrated into the research activity is great, but the first sentence could be removed to your advantage.

Line 344: The authors should have used scripting languages, as stated here, but instead they used spreadsheet manipulations (Figure 4, Excel), which are problematic for a number of reasons and should be avoided (see Borer et al. 2009).
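To illustrate the kind of scripted step meant here, the following is a minimal sketch of unit harmonization written as code rather than as spreadsheet edits. The column names, sample values, and conversion table are hypothetical and are not taken from the manuscript.

```python
import csv
import io

# Hypothetical conversion factors to a common unit (micrograms per litre).
TO_UG_PER_L = {"ug/L": 1.0, "mg/L": 1000.0}


def harmonize(rows):
    """Convert values to ug/L; set aside rows whose unit is not recognized."""
    clean, rejected = [], []
    for row in rows:
        factor = TO_UG_PER_L.get(row["unit"])
        if factor is None:
            rejected.append(row)  # keep for manual review, do not guess
            continue
        clean.append({
            "lake_id": row["lake_id"],
            "variable": row["variable"],
            "value_ug_per_l": float(row["value"]) * factor,
        })
    return clean, rejected


# Hypothetical input standing in for one contributed data file.
SAMPLE = """lake_id,variable,value,unit
A1,TP,0.02,mg/L
A2,TP,35,ug/L
A3,TP,4,unknown
"""

clean, rejected = harmonize(csv.DictReader(io.StringIO(SAMPLE)))
```

Because every conversion and every rejected record is captured in the script, the step can be rerun and audited, which is difficult to do with manual spreadsheet manipulation.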

Line 527 (missing figure, wrong figure call-out): The CUAHSI Data Model is not included in this manuscript (see Figure 6 caption on line 770).

Line 649: Please drop the Techopedia reference and use another (e.g., Heidorn 2008), which defines dark data thoroughly and in a manner more relevant to this paper.

Figure 2: The text box labeled “Figure 2” describes community “standards” for sharing data, when the authors mean a policy for sharing data, based on contributors' requirements, that users agree to.

Figure 5 has many details about the contents of the Geo data (see red-colored text), but none about the Limno data. This figure requires more details about the contents of the Limno database, similar to the level of detail in the Geo database.

Figure 6: Some of the lakes in this map appear to lie or extend outside the defined study area (lakes / watersheds look like they are in Maryland, Arkansas, western Ontario, Quebec, and New Brunswick). This characteristic of the database should be described appropriately in the text.

References

Bond-Lamberty, B., and A. M. Thomson. 2010. A global database of soil respiration measurements. Biogeosciences 7:1321-1344. doi:10.5194/bgd-7-1321-2010.

Borer, E. T., E. W. Seabloom, M. B. Jones, and M. Schildhauer. 2009. Some simple guidelines for effective data management. Bulletin of the Ecological Society of America 90:205–214. http://dx.doi.org/10.1890/0012-9623-90.2.205

Falge, E., D. Baldocchi, R. J. Olson, P. Anthoni, M. Aubinet, C. Bernhofer, G. Burba, R. Ceulemans, R. Clement, H. Dolman, A. Granier, P. Gross, T. Grünwald, D. Hollinger, N.-O. Jensen, G. Katul, P. Keronen, A. Kowalski, C. Ta Lai, B. E. Law, T. Meyers, J. Moncrieff, E. Moors, J. W. Munger, K. Pilegaard, Ü. Rannik, C. Rebmann, A. Suyker, J. Tenhunen, K. Tu, S. Verma, T. Vesala, K. Wilson, and S. Wofsy. 2001a. Gap filling strategies for defensible annual sums of net ecosystem exchange. Agricultural and Forest Meteorology 107:43-69.

Gill, R., and R. B. Jackson. 2000. Global patterns of root turnover for terrestrial ecosystems. New Phytologist 147:13-31.

Gower, S. T., O. Krankina, R. J. Olson, M. Apps, S. Linder, and C. Wang. 2001. Net primary production and carbon allocation patterns of boreal forest ecosystems. Ecological Applications 11:1395-1411.

Heidorn, P. B. 2008. Shedding light on the dark data in the long tail of science. Library Trends 57(2):280-299.

Schenk, H. J., and R. B. Jackson. 2002. The global biogeography of roots. Ecological Monographs 72(3):311-328.

Scurlock, J. M. O., W. Cramer, R. J. Olson, W. J. Parton, and S. D. Prince. 1999. Terrestrial NPP: Towards a consistent data set for global model evaluation. Ecological Applications 9(3):913-919.

Level of interest: An article of importance in its field

Quality of written English: Acceptable

Statistical review: No, the manuscript does not need to be seen by a statistician.

Declaration of competing interests: I declare that I have no competing interests.

Authors' response to reviewers: (http://www.gigasciencejournal.com/imedia/1248946842171461_comment.pdf)

Source

    © 2015 the Reviewer (CC BY 4.0 - source).

References

    Soranno, P. A., et al. 2015. Building a multi-scaled geospatial temporal ecology database from disparate data sources: fostering open science and data reuse. GigaScience.