Content of review 1, reviewed on December 12, 2023

The paper reports the results of planned inter-laboratory comparison measurements to investigate errors associated with selection of RMs used for calibration with regard to extrapolation, matrix-matching and number of RMs.

Many of my comments are minor in nature; however, it is the lack of consideration of measurement uncertainty that I think holds this manuscript back. Simply focusing on “normalization error” is not, to my mind, enough to support recommendations about “best practice”.

I recommend major revision to include discussion of measurement uncertainty.

General comments:
While many isotopic RMs do come with a certificate of some sort, few are strictly certified RMs (CRMs) as few are produced following the international quality standard ISO 17034:2016. Please consider carefully your use of the word “certified.”

Similarly, “standard” can refer to a RM of some sort, or to a paper document (such as the ISO standard mentioned above). For simplicity I would suggest using “reference material” in place of “standard” throughout when referring to a material for analysis.

Symbols (e.g. “n”, “p” and delta) should be in italic font.

There is no consideration of measurement uncertainty, only of accuracy. A value might be less accurate, but if it has a larger uncertainty it may still overlap the expected value. Many of the median errors shown in the boxplot figures are <0.1 permil – generally smaller than the measurement uncertainty for EA-IRMS. This is the major shortcoming of this work. Please add discussion of measurement uncertainty in relation to the types of calibration investigated.
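The overlap criterion can be sketched as follows (all delta values and uncertainties below are hypothetical illustrations, not taken from the manuscript; a coverage factor k = 2 gives approximately 95% confidence for normally distributed errors):

```python
# Hypothetical check: does a measured isotope delta agree with its expected
# value once measurement uncertainty is taken into account?
import math

def agree_within_uncertainty(measured, expected, u_measured, u_expected, k=2.0):
    """True if |measured - expected| <= k * combined standard uncertainty."""
    u_combined = math.sqrt(u_measured**2 + u_expected**2)
    return abs(measured - expected) <= k * u_combined

# A "normalization error" of 0.08 permil looks like a bias in isolation, but
# with a typical EA-IRMS uncertainty of 0.15 permil it is not significant:
print(agree_within_uncertainty(-26.31, -26.39, u_measured=0.15, u_expected=0.04))  # True
```

The point is not the specific numbers but that a sub-0.1 permil median error cannot be interpreted without the uncertainties of both terms.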

Specific comments:
Line 94 – While the Carter & Fry paper is excellent, there have been many other publications concerning inter-laboratory studies of isotope delta measurements since 2013. These include inter-laboratory comparisons among metrology institutes (https://doi.org/10.1088/0026-1394/54/1A/08005, https://doi.org/10.1088/0026-1394/59/1A/08004), members of the FIRMS Network (https://doi.org/10.1016/j.forc.2021.100306, https://doi.org/10.1016/j.scijus.2018.08.003) and inter-laboratory efforts to characterize new RMs (e.g. https://doi.org/10.1021/acs.analchem.5b04392 and https://doi.org/10.1021/acs.jafc.0c02610), among others. Not all of them show the same “poor interlaboratory comparability”.

Line 106 – It’s not always a least-squares linear regression – see Meija & Chartrand (https://doi.org/10.1007/s00216-017-0659-1)

Line 109 – this paragraph could benefit from citation of the recent paper from Chartrand et al (https://doi.org/10.1088/0026-1394/60/1A/08028) concerning the inter-laboratory comparison CCQM-P212 that looked at the coherence of RMs (rather than the performance of the participating laboratories). This study looked at calibrations performed with different numbers of RMs and the choice of RMs. Some of its findings are in agreement with those discussed in this paper.

Line 114 – I’d put a hyphen between EA and IRMS.

Line 122 – There’s at least one paper that has looked at the effect of extrapolation on measurement uncertainty (https://doi.org/10.1002/rcm.8453) – admittedly it’s hidden away in a paper about carbon isotopic analysis of methylmercury by GC/C-IRMS and therefore not easy to find…

Line 133 (and 209) – the interlaboratory comparison CCQM-K140 and the parallel comparison involving members of the FIRMS Network as described in two papers (https://doi.org/10.1088/0026-1394/54/1A/08005 and https://doi.org/10.1016/j.scijus.2018.08.003) did seem to show a linearity (mass) effect between participants (see figure 3 of the latter paper) – so you’re not quite the first to look into it, although I’m pretty sure you are the first to investigate with planned measurements.

Line 140 – single-point calibration has not been recommended practice for carbon isotope delta measurements since 2006. I would therefore suggest removing all mention of single-point calibration from the main paper. Some of the findings in particular could be taken to mean that single-point calibration is better than two-point (and this is in fact stated at line 463).

I suggest that the discussion points relating to single-point calibration are moved to the supplementary information, as they are worth including. I would also add to that part of the supplementary information a discussion of the difference in isotope delta between the single point and the sample, and its effect on bias and uncertainty – it’s not clear to me that this has been included.

Line 142 – this isn’t strictly true, certainly for Thermo Isodat and carbon isotope delta. This follows the calculation approach detailed by Santrock, Studley and Hayes (https://doi.org/10.1021/ac00284a060) in which a correction for 17O is applied at the same time as a calibration using the working gas is applied.

Line 154 – please be consistent and use “working gas” rather than “reference gas” here and throughout the manuscript (as you have done on the previous page).

Line 226 – how were the in-house materials’ values assigned? Given that you show later that the selection of the number and identity of RMs used for calibration can lead to error, it is important to show the calibration information for these materials, as you use them later on to demonstrate error. You could then see whether the dataset you discuss in the manuscript allows you to re-create the calibration(s) used for these in-house materials and, if it does, whether the error is smaller than if different calibrations are applied.

Are the elemental compositions of the in-house materials available (C and N mass fractions)? If they are, please add them, together with the values for the commercial RMs, to Table 1.

Line 243 – presumably you tested that the standard deviations among the groups were approximately equal before pooling?
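As a sketch of such a check (the replicate data are entirely hypothetical, and a crude Hartley’s-Fmax-style variance-ratio screen stands in for a formal test such as Levene’s):

```python
# Pool standard deviations across groups only after checking that the group
# variances are roughly comparable (hypothetical replicate data).
import statistics

def pooled_sd(groups):
    """Pooled standard deviation, assuming approximately equal group variances."""
    num = sum((len(g) - 1) * statistics.variance(g) for g in groups)
    den = sum(len(g) - 1 for g in groups)
    return (num / den) ** 0.5

def variances_comparable(groups, max_ratio=5.0):
    """Crude rule-of-thumb screen: ratio of largest to smallest group variance."""
    vs = [statistics.variance(g) for g in groups]
    return max(vs) / min(vs) <= max_ratio

groups = [
    [-26.40, -26.30, -26.50],   # group A replicates (hypothetical)
    [-26.35, -26.45, -26.40],   # group B
    [-26.20, -26.40, -26.30],   # group C
]
if variances_comparable(groups):
    print(round(pooled_sd(groups), 3))  # pooled SD, ~0.087 permil here
```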

Line 248 (table 1) – the “certified” values for USGS88, 89, 90 and 91 are not given with 1-sigma uncertainty. The ones you show are expanded uncertainties at 95% confidence (see https://doi.org/10.1021/acs.jafc.0c02610).

Line 287 (and Table 1) - How did you define “high organic” in terms of a matrix classification? It seems to me you have simple single compounds (caffeine and L-glutamic acid) and then complex macromolecules (collagens) and then protein and muscle tissue which are more complex still.

Line 296 – missing apostrophe in “Dunns”

Line 303 – do you mean “matrix-mixed and/or extrapolated” or were you only comparing the matrix-matched and bounded calibrations to other calibrations that violated both conditions? – it might be interesting to look at the two issues separately.

Line 310 – Is the accuracy of the normalization (or normalization error) the difference between the obtained and expected isotope delta values of the two selected quality control materials? – I don’t think you’ve specifically defined it in the paper.

If it is and you want to know the significance of the accuracy/error you need to consider the uncertainty in each term – you have the uncertainty in the expected values of the selected QC materials, but have you estimated the uncertainty in their measured values following calibration? You can't compare two values without the uncertainty associated with each.

Line 314 – the letters on the box plots indicating which groups are not significantly different are confusing here: you mention “C normalizations” while discussing figure 2, and there is indeed an uppercase C above one of the boxes. I think you are using C in the text to mean carbon (since you use N later in the same paragraph), whereas on the figure C is used to show that the group is significantly different from the others.

Line 318 (fig 2) – I think the issue with two-point normalizations that you haven’t mentioned is that when one of the points changes, then unless the change is between IAEA-600 and USGS91, there is a corresponding change in the calibration range. For calibrations with three or more points, the total range changes only when one of the end members is substituted for a different RM. This is likely why two-point calibration shows such high (and variable) “normalization errors”.

Perhaps a fairer comparison looking only at the number of points used would be to use the materials with highest and lowest isotope delta as the two points and then see what happens when more points are added in the middle? That way you can remove the confounding factor of change in calibration range. You could then repeat for other two-point calibrations that allow you to add further RMs within their span.

Are the two-point calibrations that show abnormally large errors (as mentioned at line 448) those involving small calibration ranges?
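The fixed-end-member design suggested above could be enumerated along these lines (the RM names and delta values are purely hypothetical placeholders):

```python
# Fix the two end-member RMs (widest span) and enumerate calibrations that
# only add RMs between them, so the calibration range stays constant while
# the number of points varies.
from itertools import combinations

# Hypothetical candidate RMs with carbon isotope delta values (permil)
rm_deltas = {"RM1": -30.0, "RM2": -20.0, "RM3": -12.0, "RM4": -5.0, "RM5": 2.0}

lo = min(rm_deltas, key=rm_deltas.get)   # most negative end member
hi = max(rm_deltas, key=rm_deltas.get)   # most positive end member
middle = [name for name in rm_deltas if name not in (lo, hi)]

calibrations = []
for k in range(len(middle) + 1):
    for extra in combinations(middle, k):
        calibrations.append((lo, *extra, hi))

print(len(calibrations))  # 8 calibrations, all spanning the same 32 permil range
```

This removes the change in calibration range as a confounding factor when comparing two-, three-, four- and five-point calibrations.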

Line 348 – if you want to isolate and examine the effect of extrapolation, then you need to keep other confounding factors the same. This section seems to look at extrapolation and at the width of the calibration range and matrix matching/mixing at the same time and is confusing as a result.

Line 399 onwards (and fig 6) – rather than comparing peak amplitudes, what happens if the comparison is done in relation to amount of the element analysed?

Line 403 – which m/z signals were used to measure amplitude?

Line 404 – why not convert either V to nA or vice versa using the amplification resistance – that would make for easier comparison between the two systems?
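The conversion is just Ohm’s law; a minimal sketch (the 3e8 ohm feedback resistor below is an assumed typical value for an m/z 44 amplifier, not a value taken from the manuscript – the actual resistance of each instrument should be used):

```python
# Faraday cup amplifier output voltage relates to ion beam current via the
# feedback resistor: I = V / R.

def volts_to_nanoamps(v, r_ohm=3e8):
    """Convert amplifier output (V) to ion beam current (nA)."""
    return v / r_ohm * 1e9

print(volts_to_nanoamps(6.0))  # 6 V across 3e8 ohm -> 20 nA
```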

Line 420 – could you simply plot amplitude vs amount of element analysed and test for linearity of the relationship or look at the residuals of a linear fit?
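A minimal sketch of that residual check, with entirely hypothetical amounts and amplitudes; structure or trend in the residuals would indicate non-linearity:

```python
# Fit amplitude vs amount of element analysed and inspect the residuals of
# an ordinary least-squares line (hypothetical data).

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

amount_ug = [20, 40, 60, 80, 100]        # amount of element (hypothetical)
amplitude_v = [1.1, 2.0, 3.1, 3.9, 5.0]  # m/z 44 amplitude (hypothetical)

slope, intercept = linear_fit(amount_ug, amplitude_v)
residuals = [y - (slope * x + intercept) for x, y in zip(amount_ug, amplitude_v)]
print(max(abs(r) for r in residuals) < 0.15)  # True -> approximately linear
```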

Line 438 – this is in line with the findings of CCQM-P212 mentioned earlier.

Line 447 (and 479, 493) – The use of percent is confusing – I would stick to expressing differences in magnitude in absolute terms in permil (or if convenient using something like "doubling")

Line 449 onwards – I like this discussion, thanks for including it.

Line 460 (Fig 7) – here it becomes apparent that for two-point calibrations, the ones with large errors are those with narrow calibration ranges – something repeated in the other calibrations too. I would say that this figure suggests that it is the range that’s important, rather than the number of RMs used. In all three cases (2-, 3- or 4-point calibration), once the range is above 20 permil the errors are similar and stable with additional expansion of the range.

I don’t think that this plot supports the conclusion/recommendation that two-point calibration is poor.

Line 482 – something to consider is that, in the Schimmelmann et al. 2020 study that characterized the USGS88–91 RMs, all measurements that contributed to the assigned values were performed on a single element at a time (i.e. separate analyses for C and N rather than a single measurement with a jump in magnet configuration). The paper briefly discusses the problems uncovered during initial measurements that did use peak jumping. It might be worth considering whether peak jumping has affected the results in your work too.

Line 486 – the reduction in normalization errors discussed in your reference 28 is for nitrogen only – please make this clear here.

Line 502 – this is unsurprising – after all, working-gas linearity diagnostics only provide evidence of the linearity of the mass spectrometer (the gas goes there directly), while your study investigated the linearity of the entire EA-IRMS system. Nonetheless, I think it’s nice to have these results and discussion in the manuscript.

Something else that you may wish to consider is the effect of dilution – does running a larger amount of an element but with a correspondingly higher dilution give a constant delta? As you measure both C and N in the same run, you presumably need to alter the amount of dilution of the C signal depending on the C:N ratio in the materials. Can you investigate the effect of sample dilution with the data you have? If you can, it would be a useful addition to this section.

Line 517 – if linearity is important for material with low N, that implies to me that it is incomplete conversion of material to analyte gas that is important.

Line 521 onwards – As you have not considered measurement uncertainty, I’m not sure you can recommend best practices with the data you do present.

Line 538-40 – using a large range and matrix-matched RMs for calibration isn’t different from current recommendations. There are definite advantages to using more than two points for calibration that aren’t discussed in your paper (such as the ability to check how linear the calibration line is, or being able to detect contamination in one of the calibration RMs), but the data you’ve presented in your figure 7 don’t support the claim that two points are not enough.

Source

    © 2023 the Reviewer.

Content of review 2, reviewed on February 21, 2024

I have carefully reviewed how the authors have addressed the comments from the previous round of review (including my own). They have addressed each comment satisfactorily and the revised manuscript is very much improved. There remain only a few minor errors as detailed below.

I recommend acceptance of this manuscript following their correction.

Specific comments (line numbers refer to the revised manuscript without tracked changes highlighted):

line 70 – please add “delta” between “isotope” and “values” in the first line of the rationale.

line 124 – I don’t think “inaccuracy margin” as suggested by the other reviewer and applied by the authors in this revision is correct. The authors are trying to describe the difference between a true value (the assigned value for an RM) and a measurement result (for that same RM). That’s the definition of a measurement error (e.g. from the EURACHEM Guide to uncertainty section 2.4 https://www.eurachem.org/index.php/publications/guides/quam) which requires knowledge of the true value (estimation of measurement uncertainty does not need this knowledge). It could also be referred to as bias (e.g. see the EURACHEM Guide for validation section 6.5 https://www.eurachem.org/index.php/publications/guides/mv).

Lines 200 and 202 – please add the “+” symbols in front of the positive isotope delta values as you have elsewhere.

Line 213 – where desired RMs are not commercially available, end users have been encouraged for many years to prepare their own and guidance to that end is available (https://doi.org/10.1002/rcm.9177). Clearly when preparing in-house RMs (or indeed commercial ones) it might be necessary to extrapolate or be impossible to matrix match when using existing RMs for their calibration (noted in https://doi.org/10.1002/rcm.9177 and https://doi.org/10.1002/rcm.8711). Of course, the valuable results obtained by the authors in this work can be applied to improve characterization measurements of those new materials be they in-house or commercial to minimize the influence of extrapolation and non-matrix-matching.

Line 279 Table 1 – There are still inconsistencies between the column headings and the values given for the uncertainties in “reported” isotope delta values for the RMs listed. As the column headings now specify 95% confidence, the uncertainties for USGS61 and USGS63 derived from reference 23 need to be multiplied by a coverage factor of 2, as the reference provides only standard uncertainties. Further, for IAEA-600, ref 23 is given, but the values and uncertainties in the table do not match those in that paper, which reports -27.73 ± 0.04 permil (standard uncertainty) for the carbon isotope delta and +1.02 ± 0.05 permil (standard uncertainty) for the nitrogen isotope delta of IAEA-600.

Line 298 – please move the number to before “normalized” for clarity

Line 355 – A total of 8 RMs were available. The more that are used for calibration, the fewer remain to be used as the quality controls with which you determine bias. If a single quality control material is found to have a significant bias, it will represent a larger proportion of the remaining quality control materials when a larger number of RMs has been used for calibration (if four are used for calibration it would be one quarter, but if only two are used it would be only a sixth). Given that you have found the opposite (i.e. the proportion of significant bias decreases with more RMs used for calibration, and therefore with fewer quality control materials available), it would be nice to see (in supplementary information) the absolute numbers of QC materials with significant bias for each of the groupings shown in Figure 2, together with the total number of QC measurement results considered.
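The arithmetic behind that proportion argument, as a small sketch (the total of 8 RMs is taken from the comment above; the count of one biased QC material is illustrative):

```python
# With 8 RMs in total, using more for calibration leaves fewer as quality
# controls, so a single biased QC material weighs more heavily in the
# reported proportion.
TOTAL_RMS = 8

def biased_fraction(n_calibration, n_biased=1):
    """Fraction of remaining QC materials that show significant bias."""
    n_qc = TOTAL_RMS - n_calibration
    return n_biased / n_qc

print(biased_fraction(4))  # 1 of 4 QC materials -> 0.25
print(biased_fraction(2))  # 1 of 6 QC materials -> ~0.167
```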

Line 390 – typo in the first median value given

Source

    © 2024 the Reviewer.

Content of review 3, reviewed on May 20, 2024

I have carefully gone through the authors’ responses to the previous round of reviews (including my own) and the resultant changes to the manuscript. The authors have addressed them all and no further issues remain other than one minor thing:

Lines 356-359 – the two instances of “error” that are highlighted in yellow should be “deviation” as suggested by the comment.

I would add that reporting carbon isotope-delta values on a scale where NBS 19 and LSVEC both have exactly-defined carbon isotope-delta values is absolutely fine. It’s the use of the LSVEC carbonate itself for calibration of instrumental results that should be avoided – and indeed the authors did not analyse the LSVEC material directly.

Source

    © 2024 the Reviewer.

References

    Sawyer, B., Morgan, S., N., F. D., Stella, L., Cromratie, C. S., K., B. L. 2024. Experimental assessment of elemental analyzer isotope ratio mass spectrometry normalization methodologies for environmental stable isotopes. Rapid Communications in Mass Spectrometry.