Content of review 1, reviewed on November 11, 2016
Knowledge Gap: Functional annotation, especially for non-coding regions, remains challenging. Current tools (CADD) do not use the tissue-specific information provided by projects like ENCODE/RME/GTEx.
Strategy:
- They propose to integrate RME data for more than 100 cell types to derive 7 tissue-specific predictive models (brain, GI, lung, heart, blood, muscle and epithelium), meaning that one locus might be functional in one tissue but not in others.
- Their model estimation relies on unsupervised learning and mixture modelling: they assume that the joint distribution of their 8 annotations (histone marks) is a mixture over functional and non-functional positions.
- To ease the model fitting, they make another assumption (which I think is a bit strong): the annotation marks are conditionally independent given functional status. From this they derive a posterior probability of functionality for each locus.
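To make the conditional-independence assumption concrete, here is a minimal sketch of how a per-locus posterior could be computed under a two-component mixture. It is an illustration only, not the authors' implementation: I assume binary (present/absent) marks with Bernoulli emission probabilities, and the function name, the prior and all parameter values are hypothetical. In practice the parameters would be estimated from the data (for example by EM), which is not shown here.

```python
import math

def posterior_functional(marks, prior, p_func, p_nonfunc):
    """Posterior P(functional | marks) for one locus.

    Assumes a two-component mixture (functional / non-functional) in which
    each binary mark is conditionally independent given the component.
    marks     : sequence of 0/1 values, one per annotation
    prior     : mixture weight of the functional component
    p_func    : P(mark present | functional), one per annotation
    p_nonfunc : P(mark present | non-functional), one per annotation
    """
    log_f = math.log(prior)
    log_n = math.log(1.0 - prior)
    # Conditional independence: the joint likelihood factorises over marks,
    # so the log-likelihoods simply add up.
    for x, pf, pn in zip(marks, p_func, p_nonfunc):
        log_f += math.log(pf if x else 1.0 - pf)
        log_n += math.log(pn if x else 1.0 - pn)
    # Normalise in log space to avoid underflow.
    m = max(log_f, log_n)
    num = math.exp(log_f - m)
    return num / (num + math.exp(log_n - m))

# Hypothetical parameters for 8 marks (illustrative values, not from the paper).
P_FUNC = [0.7, 0.6, 0.8, 0.5, 0.6, 0.7, 0.4, 0.6]
P_NONFUNC = [0.1, 0.2, 0.1, 0.2, 0.1, 0.1, 0.3, 0.2]
PRIOR = 0.2

score_all_marks = posterior_functional([1] * 8, PRIOR, P_FUNC, P_NONFUNC)
score_no_marks = posterior_functional([0] * 8, PRIOR, P_FUNC, P_NONFUNC)
```

Because the likelihood is a product over marks, correlated marks (which histone modifications often are) are counted as independent evidence, which is why I consider the assumption strong: it tends to push posteriors towards 0 or 1.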
Main findings:
- They estimate that 22% of the genome is functional in at least one tissue, and ~2% is functional across all 7 tissues.
- They validate GenoSkyline on known tissue-specific annotations (regions for blood and heart, VISTA enhancers for brain and heart).
- On Psychiatric Genomics Consortium data (SCZ), they demonstrate better prioritization using the brain model than the heart model; the opposite was observed on 23 CARDIoGRAM loci.
Strengths:
- UCSC track for whole-genome annotation
- preprint of GenoSkyline-Plus that includes RNA-Seq information from RME too
Weaknesses:
- strong assumption on the probabilistic model
- only integrates one level of information (epigenetics, even if there are multiple marks)
© 2016 the Reviewer (CC BY 4.0).