Content of review 1, reviewed on August 18, 2020
The authors have developed an app, ProVoc, to improve vocabulary in children. The present study assesses the effectiveness of this app with third- and fourth-grade French-speaking children. In particular, the training focused on improving vocabulary depth (the quality of knowledge about words) rather than vocabulary breadth (the number of words that are familiar to an individual). The two main findings are: (1) an improvement in vocabulary depth for the trained words, and (2) a transfer to general comprehension – that is, scores on standardised comprehension tests improved. This latter result is intriguing, as almost all previous studies have reported weak or no transfer to general comprehension following vocabulary training.
While this study is interesting, there are a number of major drawbacks which would prevent it from being published in its present form. My main concern is the form of the analysis undertaken. However, I also have concerns about the introduction and discussion. I set out these concerns below.
Introduction
In general I found the introduction to be imprecise. I don’t strongly object to anything that the authors have said, although I think there are inaccuracies.
P2.L16. The Simple View of Reading is not a model of reading; it is something more general. To be a model it would need to include more detail about how each process occurs – for example, how decoding takes place. Castles, Rastle, & Nation (2018) accordingly use the term framework. Furthermore, the authors describe the SVR as having “two aspects of reading”. While this is strictly speaking correct, the SVR has three components, with the third being oral language comprehension. Perhaps the authors could make it clearer that the SVR has three components, with two of those being directly related to reading.
In the same section, the authors begin to discuss the important role of vocabulary (P2.L16), but then mix concepts in the examples that follow. For example, in the SVR, vocabulary is oral vocabulary. The authors talk about orthographic representations, reading via the lexical route (assuming a dual-route model), which is not decoding, as well as the conceptual lexicon. If the purpose of this text was to link back to the SVR, neither orthography nor lexical reading is relevant. I am not suggesting that these concepts are not important to reading. However, I think the authors need to be clearer in their descriptions of previous ideas, as they are conflating concepts.
P3.L15 “In general, two types of poor readers can be identified, based on Gough and Tunmer’s (1986) simple view of reading model: those who have difficulties with identifying written words but not with language comprehension (so called poor decoders or dyslexics); and those with persistent difficulties with comprehension but average word-reading performances (poor comprehenders)”. There are actually three types: poor decoding, poor comprehension, or both poor decoding and poor comprehension in the same individual.
P3.L26 “Research has generally shown …” The Cavalli et al. (2016) paper cited included only 20 participants with dyslexia. Can the authors provide other citations to support this claim?
P3.L33. “Poor comprehenders, however, tend to experience difficulties in tasks requiring access to word meanings.” I think this oversimplifies the situation. Other factors may be partly or fully contributing to the comprehension difficulties. For example, Adlof & Catts (2015) showed that poor comprehenders had problems in addition to oral language. And in normal readers, syntactic and morphological knowledge are often predictors of language comprehension after controlling for oral language skills (although this may be language dependent, as per Simpson, Moreno-Pérez, Rodríguez-Ortiz, Valdés-Coronel, & Saldaña, 2020). I think the authors should acknowledge that poor access to word meanings is one of several possible causes of poor comprehension.
P9.L17 “most intervention studies still focus on helping students gain knowledge about form-meaning connections and enlarge their lexicon” Firstly, I think there are more recent examples that focus on vocabulary depth, for example Gomes-Koban et al. (2017). Also, please see my following comment.
P8.L38. The authors review Wright and Cervetti (2017), who conducted a systematic review of oral vocabulary interventions, assessing their impact on general reading comprehension. These authors found that there is very limited evidence that direct teaching of word meanings can improve generalized comprehension. In the manuscript under review, the authors suggest that this failure is due to the fact that “previous vocabulary interventions may have been unproductive because they did not specifically focus on vocabulary depth” (P9.L13). However, Wright and Cervetti (2017) explicitly state that “Many of these direct teaching studies focused on active processing and depth over breadth in vocabulary instruction, using rich, multidimensional, and extended vocabulary instruction” (approx. 3rd page of article).
In summary, I agree that vocabulary depth is more important than vocabulary breadth for reading comprehension. However, I get the impression that the authors are suggesting that their study is one of the first interventions which has targeted vocabulary depth, something which I don’t think is the case. If the authors believe this “discrepancy” is due to differences in how vocab depth is defined, they should provide a clearer definition in the introduction.
Methodology
Vocabulary Breadth. I am concerned that the instrument used only contained 20 words, with 10 drawn from the PPVT and 10 other words used in the app. The PPVT has almost 200 words, ranging from very easy to very hard, to allow it to be used with a range of ages. Given that the present study recruited children from both grades 3 and 4, there would presumably be a broad range of oral vocabulary knowledge. How can we be sure that the 10 words selected from the PPVT were sufficient to cover this entire range? Were there floor effects for the grade 3 children or ceiling effects for the grade 4 children? At the very least, ranges should be provided in Table 1, in addition to the means and SDs, to allow the reader to better assess the suitability of these items.
If the range was excessively large, it may be necessary to analyse the grades separately.
Vocabulary depth: Ranges should also be provided for this measure. The description states “until > 90% interrater agreement was reached”. What does “until” mean here? It seems to imply that if the two raters varied greatly in their scoring, they adjusted their scores until they were more similar. More detail needs to be provided on this procedure. Regardless, with two raters one would expect the inter-rater reliability (kappa) to be provided rather than percentage agreement, which has long been criticised (Cohen, 1960), and for kappa to have a value of at least .7 (see for example McHugh, 2012). This comment also applies to the reading comprehension measure.
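To illustrate why raw percentage agreement can be misleading, here is a minimal sketch (the ratings are invented, on the 0/1/2 depth scale; scikit-learn is assumed to be available):

```python
# Illustrative only: invented ratings for 10 items on a 0/1/2 scale.
# High raw agreement can coexist with a modest kappa once chance agreement
# (inflated here by the predominance of 2s) is taken into account.
from sklearn.metrics import cohen_kappa_score

rater_a = [2, 2, 2, 2, 2, 2, 2, 2, 1, 0]
rater_b = [2, 2, 2, 2, 2, 2, 2, 2, 0, 1]

agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohen_kappa_score(rater_a, rater_b)

print(f"raw agreement = {agreement:.2f}")  # 0.80 -- looks acceptable
print(f"Cohen's kappa = {kappa:.2f}")      # ~0.41 -- below the .7 guideline
```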
ProVoc. Please include the range of word frequencies for the items used.
I am confused about the number of trained and untrained words. Initially, the authors say that each task has a small set of trained and untrained words – for example, for vocabulary depth, 10 words presented in the app (trained) plus another 10 not seen in the app (untrained). However, in the description of the app (P11.L54), the authors describe 20 series consisting of 20 words each (400 words in total). What is the relation between these 400 words and the previous description of 10 trained words? Are there just a lot of filler words? I also note that, depending on the task, additional words appeared in the task. For example, in Figure 1, for each trained word, 12 attributes (i.e. words) were shown. Thus, during this task, children would see 400 x (1+12) = 5200 words. My concern is that some of the untrained words may have appeared as attributes, and thus been (implicitly) trained. Can the authors please confirm that none of the untrained words appeared as attributes, synonyms, etc. in any of the tasks?
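A trivial sanity check on these counts (the series and attribute numbers are as I read them from the manuscript; please correct me if I have misunderstood the design):

```python
# Word-exposure arithmetic as I understand the design; the numbers are taken
# from the manuscript and may be wrong if I have misread it.
series = 20
words_per_series = 20
attributes_per_word = 12  # attribute words shown alongside each target word

total_words = series * words_per_series
print(total_words)                              # 400 (not 40)
print(total_words * (1 + attributes_per_word))  # 5200 word presentations
```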
The “dropout rate” is approximately 25% (only 131 of the 173 children completed all tasks). This seems a little high. Is there any reason for this?
Analysis
P13.L24 “A crossed-lag paradigm was used so that ultimately all the children underwent the training”. I have seen “waitlist control” designs, in which the participants in the control group subsequently receive the training so that they benefit from their participation. However, I have never before seen a design in which each group serves as both control and training group, and the term “cross-lagged paradigm” means something completely different to me (a repeated-measures design with two or more variables that predict each other). It may just be my ignorance, but could the authors please provide a reference for this type of cross-lagged design?
The authors state that they carried out a two-way repeated-measures ANOVA with the between-participants factor being group, with two levels (experimental vs. control). How is this possible? If I understand the design correctly, each group served as both “control” and “intervention”. Do the authors mean that “experimental” and “control” refer to the initial assignments at pretest (so that the “control group” at posttest2 would actually be a trained group)? Please clarify.
Even though the interaction is the test of interest, it is customary to provide the statistics for the overall ANOVA, along with the main effects before launching into the interactions.
The values in Table 1 don’t seem to match the values in Figure 1 in some instances. For example, in Figure 1 the pretest scores for vocabulary depth appear to be approximately 14.0 and 15.5, but in Table 1 they are given as 14.9 and 15.0, respectively. Why is this?
More generally, the reporting of the results is confusing and, in my opinion, not correct. In the first instance, I don’t believe that the trained and non-trained words should be pooled. As the authors pointed out in the introduction, there is little reason to think that non-trained words will benefit from the training, and this is borne out by the results on P15.L3. Reporting the combined totals serves little purpose; results for just the trained words, and just the non-trained words, should be reported. Secondly, the authors mix reporting “differences between groups at a single time point” with “differences within groups between time points” (for example, the paragraph starting at P14.L19). Furthermore, a repeated-measures ANOVA is not the best way to analyse these data. A statistically more powerful approach is to use ANCOVA, with the pretest values as covariates (see Van Breukelen, 2006). Given the (in my opinion) unusual design, with both groups serving as control and training groups, I believe two separate tests would be required: (1) a one-way ANCOVA of posttest1 with pretest as the covariate, and (2) a one-way ANCOVA of posttest2 with posttest1 as the covariate. In addition to using ANCOVA instead of ANOVA, I would suggest adopting a linear mixed-effects framework, which allows ANOVA/ANCOVA/regression-style analyses that simultaneously take into account random effects for both participants and items (see for example Baayen et al., 2008).
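As a concrete sketch of the two suggested tests (purely illustrative: the file name and all column names below are hypothetical placeholders, with one row per child), in statsmodels syntax:

```python
# A minimal sketch of the two suggested ANCOVAs, not the authors' actual
# analysis; "scores.csv", "group", "pretest", "posttest1" and "posttest2"
# are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("scores.csv")  # one row per child

# (1) one-way ANCOVA of posttest1, with pretest as the covariate
m1 = smf.ols("posttest1 ~ C(group) + pretest", data=df).fit()

# (2) one-way ANCOVA of posttest2, with posttest1 as the covariate
m2 = smf.ols("posttest2 ~ C(group) + posttest1", data=df).fit()

print(m1.summary())
print(m2.summary())

# An item-level mixed-effects variant could add a random intercept per
# participant via smf.mixedlm(..., groups=...); fully crossed participant-
# and-item random effects (Baayen et al., 2008) would need a package such
# as lme4 in R, as statsmodels directly supports one grouping factor.
```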
In general, the authors provide effect sizes for the interactions (which tend to be very weak – the authors should comment on this in the discussion), but they do not provide effect sizes for the follow-up tests. These are actually the more interesting effect sizes, as they are the direct comparisons between the control and training groups. They should be included.
In summary, I think there are too many ambiguities in the results, as reported, to be able to fully assess the validity of the claims made by the authors.
Conclusion
The finding that there was a transfer effect to reading comprehension, if correct, would be remarkable. This is especially the case given that there are studies which have trained vocabulary depth, but failed to find a transfer effect (for example, Gomes-Koban, et al, 2017). Why do the authors think there was no effect found with untrained words, yet there was an effect on general comprehension? The authors suggest that training vocabulary depth is the key to them having found a transfer effect. However, as I have pointed out, I believe there are sufficient other studies which have trained vocab depth and which have not found a general transfer effect (I am happy for the authors to argue against this). Do the authors have an alternative explanation? What is the key difference between this study and other studies which have not found such a transfer?
Minor
Coltheart, Rastle, Perry, Langdon, & Ziegler, 2011 – the correct year is 2001
Figure 1. I appreciate that the study was carried out with French-speaking children, but it would be normal to provide the English translation (along with the original French) for each word in the example, to assist non-French speakers. This applies to Figures 2 & 3 as well.
P12.L19 The “essential attributes” mentioned here seem to be author defined. I suggest that in future studies the authors confirm their intuitions by running confirmatory/pilot tests: each word appears with a list of possible attributes and participants are asked to indicate the essential attributes. The items most frequently selected would then form the essential list.
I think a major flaw is the lack of a follow-up assessment to determine whether the gains reported after training were just short-term gains, or whether they persisted some months after training had concluded. This should be mentioned as a limitation.
References
Adlof, S. M., & Catts, H. W. (2015). Morphosyntax in Poor Comprehenders. Reading and Writing, 28(7), 1051-1070. doi: 10.1007/s11145-015-9562-3
Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390-412. doi: 10.1016/j.jml.2007.12.005
Castles, A., Rastle, K., & Nation, K. (2018). Ending the Reading Wars: Reading Acquisition From Novice to Expert. Psychological Science in the Public Interest, 19(1), 5-51. doi: 10.1177/1529100618772271
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Gomes-Koban, C., Simpson, I. C., Valle, A., & Defior, S. (2017). Oral vocabulary training program for Spanish third-graders with low socio-economic status: A randomized controlled trial. PLoS ONE, 12(11), e0188157. doi: 10.1371/journal.pone.0188157
McHugh M. L. (2012). Interrater reliability: the kappa statistic. Biochemia medica, 22(3), 276–282.
Simpson, I. C., Moreno-Pérez, F. J., Rodríguez-Ortiz, I. d. l. R., Valdés-Coronel, M., & Saldaña, D. (2020). The effects of morphological and syntactic knowledge on reading comprehension in Spanish speaking children. Reading and Writing, 33(2), 329–348. doi: 10.1007/s11145-019-09964-5
Van Breukelen, G. J. P. (2006). ANCOVA versus change from baseline had more power in randomized studies and more bias in nonrandomized studies. Journal of Clinical Epidemiology, 59(9), 920-925. doi: 10.1016/j.jclinepi.2006.02.007
Content of review 2, reviewed on January 13, 2021
I thank the authors for their responses. Many of my concerns have been addressed, but some still remain, and they are of sufficient importance that the manuscript cannot be published in its present form.
Analysis
The values in Table 1 do not match the figures. Table 1 appears to contain “total scores”, that is, the sum of trained and untrained words, whereas the two figures appear to show trained words only. Given that all of the analyses were conducted on the trained and untrained words, the values shown in Table 1 are not particularly useful to the reader. Instead, this table should be broken down by trained and untrained words, to provide the reader with the descriptive statistics of what was actually used in the analyses.
Some details of the description of the analyses do not appear to match the reported results. On page 16, the general form of the analyses is described as having two fixed effects (group and session) and three random effects (participant, item and grade).
Firstly, the decision to include grade as a random factor seems odd. Random effects are designed to account for the fact that if the experiment were repeated with different participants or different items, there would be random variation in the results. But here we can safely say that the children in grade 4 will perform better than the children in grade 3. Furthermore, this seems important to examine explicitly, given that there may well be a difference in how well the app works for children 12 months apart in age. I suggest that grade be included as a fixed effect, not a random factor, as it may reveal some interesting age-related effects. Adding it as a random effect does not address reviewer 1’s valid issue about grade differences. While on this topic, it should be stated how many children from each grade were assigned to each group.
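In rough statsmodels-style notation, the suggestion amounts to something like the sketch below; every name is a placeholder and the synthetic data exist only so the snippet runs, so this should not be read as the authors' model:

```python
# Sketch only: grade in the fixed part of the model (its interactions probe
# whether the app works differently across grades); participant remains the
# random grouping factor. All names and data are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for pid in range(40):
    grade = 3 + pid % 2                 # grades 3 and 4
    group = "EC" if pid < 20 else "CE"
    for session in ("pretest", "posttest1"):
        for item in range(10):
            score = int(rng.integers(0, 3)) + (grade - 3)  # grade 4 scores higher
            rows.append((pid, grade, group, session, item, score))
df_items = pd.DataFrame(rows, columns=["participant", "grade", "group",
                                       "session", "item", "score"])

model = smf.mixedlm("score ~ C(group) * C(session) * C(grade)",
                    data=df_items, groups=df_items["participant"]).fit()
print(model.summary())
```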
If item is included as a random effect, this implies that the analysis was done at the item level, that is, at the trial level (scores of 0, 1 or 2 for vocabulary depth; scores of 0, 0.5 and 1 for comprehension). Yet the two figures show values that suggest that the analysis was performed at the subject level (Figure 5: values between 4 and 8 would be consistent with the 10 test items; Figure 6: values between 5 and 8 would be consistent with the 12 items). This needs to be clarified.
However the analysis was done, the y-axes of the figures need to be labelled, and error bars representing either standard deviations, standard errors or CIs (whichever the authors prefer) should be included on the graphs.
Another query I have regards the degrees of freedom (values such as 3554 and 3547). Allowing for some missing data, this seems to suggest that all 20 items for vocabulary depth were included in each analysis (173 participants x 20 items = 3460), rather than just the 10 items (trained or untrained). Perhaps this is a quirk of Jamovi, but could the authors please check this?
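The back-of-envelope check, for reference (ignores missing data):

```python
# Rough observation counts to compare against residual df of ~3550.
participants = 173
print(participants * 20)  # 3460 -- all 20 vocabulary-depth items per child
print(participants * 10)  # 1730 -- only the 10 trained (or untrained) items
```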
More importantly, the analysis undertaken is inadequate for the stated research goals and may have led to incorrect conclusions. I’ll explain by referring to Figure 5. The authors claim that at posttest1 there was a difference between the two groups, with the experimental-control (EC) group scoring higher than the control-experimental (CE) group. This difference may be significant, but it does not demonstrate that the EC group improved from pretest to posttest1 significantly more than the CE group; it just shows that the difference at posttest1 is significant. As can be seen in the graph, the EC group started from a higher base, and this was not controlled for in the analysis. Had this pretest difference been taken into account, the difference found at posttest1 may not have been significant. More generally, an ANOVA is incorrect in this situation (be it within a mixed-model framework with random effects or a traditional least-squares framework). The correct analysis is an ANCOVA in which the prescores are entered as the covariate to control for initial differences (Van Breukelen, 2006).
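To make the point concrete, here is a toy simulation (all numbers invented): both groups receive exactly the same true gain, yet the raw posttest comparison comes out significant purely because one group started higher, while the ANCOVA contrast correctly finds no group effect beyond baseline:

```python
# Toy simulation only: identical true gains in both groups, but the EC group
# starts ~1.5 points higher at pretest.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 80
pre_ec = rng.normal(15.5, 2.0, n)               # EC starts higher
pre_ce = rng.normal(14.0, 2.0, n)
post_ec = pre_ec + 1.0 + rng.normal(0, 1.0, n)  # same true gain of 1.0
post_ce = pre_ce + 1.0 + rng.normal(0, 1.0, n)

df = pd.DataFrame({"group": ["EC"] * n + ["CE"] * n,
                   "pre": np.concatenate([pre_ec, pre_ce]),
                   "post": np.concatenate([post_ec, post_ce])})

raw = smf.ols("post ~ C(group)", data=df).fit()        # ANOVA-style contrast
adj = smf.ols("post ~ C(group) + pre", data=df).fit()  # ANCOVA contrast

print(raw.pvalues["C(group)[T.EC]"])  # "significant", but reflects baseline only
print(adj.pvalues["C(group)[T.EC]"])  # no group effect once pretest is controlled
```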
Furthermore, I believe this analysis should exclude posttest2 for the following reasons:
• If the authors were interested in assessing the long-term stability of the effects of the app, they needed to compare the scores at posttest2 for the experimental-control group with those of a group of children who had not received training.
• Comparing the means at posttest2 is simply comparing the means of two groups of children who have received the same training, and tells the reader nothing useful (for example, P17L11: “but the difference between the groups was not significant at Posttest 2”).
• There seems to be no theoretical motivation for comparing a group of children who had previously received the training to a group of children who had just received it (that is, assessing the change in scores from posttest1 to posttest2 for the two groups).
• As I said in the first round of reviews, I have never seen a design in which each group serves as both control and training group. I again offer the authors the opportunity to cite a study where this has been done, to justify the inclusion of the posttest2 data in the analyses.
Conclusions
P2.L33 The authors state that “The main finding of the present study was the transfer we observed from vocabulary training to reading comprehension. This is an original finding, insofar as it shows a clearly significant effect, where previous vocabulary intervention studies found at most a weak transfer to comprehension”. Even if I accept the analysis as presented (which I don’t, for the reasons outlined above), the data simply do not support this finding. For example:
• P17L42: “Post hoc analyses (Bonferroni correction) revealed no significant difference between the groups either at pretest or at posttest.” Hence, the experimental-control group did not show significantly better comprehension after training than the control-experimental group.
• P17L47: “Comparing the progress made by the children from each group during the period in which they were trained, we observed that both groups show significant improvement between the pretest and Posttest 1”. So the untrained group, as well as the trained group, significantly improved in reading comprehension between pretest and posttest1.
• The only result suggesting a transfer effect is P17L54: “Between Posttest 1 and Posttest 2, only the control-experimental group exhibited significant improvements in reading comprehension”. However, this result does not represent a direct comparison between the two groups.
Thus, the facts that the first experimental group did not outperform the control group, and that the second experimental group improved but was never directly compared with a control group, do not constitute sufficient evidence to claim that there was a transfer effect.
I’m left with the conclusion that an app designed to improve vocabulary improved vocabulary, but that there was no transfer effect to reading comprehension.
Regarding the difference between vocabulary depth and vocabulary breadth (the app appears to improve depth but not breadth), the following alternative explanation occurs to me. Breadth was measured on a binary scale (response correct or incorrect), whereas depth was measured on a three-point scale (0, 1 or 2 awarded). Thus the possible range of scores for breadth is 0-20 (20 items scored 0/1), but 0-40 for depth (again 20 items, scored 0-2). Hence, the system used to score vocabulary depth is more sensitive. I suggest the authors mention this in the limitations.
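For concreteness (item counts per my reading of the Methods; hypothetical if I have misread them):

```python
# Maximum attainable scores under the two scoring schemes.
items = 20
print(items * 1)  # 20 -- breadth: binary 0/1 scoring
print(items * 2)  # 40 -- depth: 0/1/2 scoring, hence a finer-grained scale
```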
Minor points
P2.L34. The two Coltheart and colleagues references are fine, but Harm and Seidenberg (2004) would not consider their model to be a dual-route model containing a lexical route with lexical entries (see the footnote on p. 669). I suggest removing this reference. If the authors want another, non-Coltheart reference, I suggest Perry et al. (2010).
With regard to the inter-rater reliability, perhaps the authors could consider using “scoring guidelines” instead of “rating grid”. I think this might make it easier for the reader to understand what was done.
References
Perry, C., Ziegler, J. C., & Zorzi, M. (2010). Beyond single syllables: Large-scale modeling of reading aloud with the Connectionist Dual Process (CDP++) model. Cognitive Psychology, 61(2), 106-151. doi: 10.1016/j.cogpsych.2010.04.001
Van Breukelen, G. J. P. (2006). ANCOVA versus change from baseline had more power in randomized studies and more bias in nonrandomized studies. Journal of Clinical Epidemiology, 59(9), 920-925. doi: 10.1016/j.jclinepi.2006.02.007
