New Paper: Genetic Architecture of Complex Traits and Disease Risk Predictors

New paper! Non-coding regions contribute significantly to genetic disease risk -- think twice before you opt for exome sequencing over array genotyping. Also, pleiotropy between common disease risks seems to be weak.

Genetic Architecture of Complex Traits and Disease Risk Predictors


Soke Yuen Yong, Timothy G. Raben, Louis Lello, Stephen D. H. Hsu


Genomic prediction of complex human traits (e.g., height, cognitive ability, bone density) and disease risks (e.g., breast cancer, diabetes, heart disease, atrial fibrillation) has advanced considerably in recent years. Predictors have been constructed using penalized algorithms that favor sparsity: i.e., which use as few genetic variants as possible. We analyze the specific genetic variants (SNPs) utilized in these predictors, which can vary from dozens to as many as thirty thousand. We find that the fraction of SNPs in or near genic regions varies widely by phenotype. For the majority of disease conditions studied, a large amount of the variance is accounted for by SNPs outside of coding regions. The state of these SNPs cannot be determined from exome-sequencing data. This suggests that exome data alone will miss much of the heritability for these traits - i.e., existing PRS cannot be computed from exome data alone. We also study the fraction of SNPs and of variance that is in common between pairs of predictors. The DNA regions used in disease risk predictors so far constructed seem to be largely disjoint (with a few interesting exceptions), suggesting that individual genetic disease risks are largely uncorrelated. It seems possible in theory for an individual to be a low-risk outlier in all conditions simultaneously.


https://www.biorxiv.org/content/10.1101/2020.02.12.946608v1

doi: https://doi.org/10.1101/2020.02.12.946608


From the conclusions:

III. The DNA regions used in disease risk predictors so far constructed seem to be largely disjoint (with a few interesting exceptions), suggesting that individual genetic disease risks are largely uncorrelated.


Observation III has interesting implications for pleiotropy [63–65]. We found that genetic risks are largely uncorrelated for different conditions. This suggests that there can exist individuals with, e.g., low risk simultaneously in each of multiple conditions, for essentially any combination of conditions. There is no trade-off required between different disease risks ... One could speculate that a lucky individual with exceptionally low risk across multiple conditions might have an unusually long life expectancy.



Note added: Some clarifying remarks from the comments.

1. We used the output of the UK Biobank (UKB) variant-calling pipeline for the 50k exomes they released; it is essentially the output data that researchers have available from these exomes. This is discussed in great detail in some of the references, as there were some technical issues with the pipeline. SNPs that are not called by this process are (presumably) not determined from the exome reads. Exome sequencing only probes a small fraction of the whole genome, after all.


In any case, we independently analyzed the locations of the SNPs and plenty are outside of coding regions, etc. Referring to the exome process specifically is just to give another "operational" definition of what is in coding vs non-coding regions since the boundaries of these regions are a bit ill-defined in the literature.
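To make the position-based check concrete, here is a minimal Python sketch of how one might compute the fraction of predictor SNPs, and of their variance contribution, that falls inside a set of coding intervals. It is an illustration only, not the pipeline used in the paper: the function names, the single-chromosome simplification, the assumption of non-overlapping intervals, and the idea that each SNP comes with a precomputed variance contribution are all assumptions made for this example.

```python
from bisect import bisect_right

def in_any_interval(pos, starts, ends):
    """True if pos lies inside one of the sorted, non-overlapping [start, end] intervals."""
    i = bisect_right(starts, pos) - 1  # last interval whose start is <= pos
    return i >= 0 and pos <= ends[i]

def coding_fractions(snps, intervals):
    """Return (fraction of SNPs, fraction of variance) inside the given intervals.

    snps      -- list of (position, variance_contribution) for one predictor;
                 the variance values are assumed to be precomputed elsewhere
    intervals -- list of (start, end) coding regions on the same chromosome,
                 assumed not to overlap one another
    """
    if not snps:
        raise ValueError("predictor has no SNPs")
    intervals = sorted(intervals)
    starts = [s for s, _ in intervals]
    ends = [e for _, e in intervals]
    inside = [(p, v) for p, v in snps if in_any_interval(p, starts, ends)]
    total_var = sum(v for _, v in snps)
    frac_snps = len(inside) / len(snps)
    frac_var = sum(v for _, v in inside) / total_var
    return frac_snps, frac_var

# Hypothetical toy data: three SNPs, two coding intervals.
toy_snps = [(150, 0.5), (500, 0.25), (950, 0.25)]
toy_intervals = [(100, 200), (900, 1000)]
frac_snps, frac_var = coding_fractions(toy_snps, toy_intervals)
```

In the toy data, two of the three SNPs sit inside an interval, and together they carry three quarters of the total variance. The "operational" exome-based definition mentioned above would replace the interval test with a lookup of whether the SNP was called in the exome data.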


2. Plenty of people believe in strong pleiotropy, and are likely surprised by this result. High dimensionality alone is enough to make low pleiotropy plausible, but it *might* have been the case that some special genomic regions play an important role across many diseases. Lots of people with "strong biomedical intuition" told me this would be the case, but apparently not...


There is no way to know until you compare detailed genetic architectures on a disease-by-disease basis. We are the first to do that.


We don't claim that genetic correlations are close to zero. We just characterize the correlation/overlap between known predictor SNPs for the various diseases. (There is still plenty of heritability not yet discovered for each of the diseases; more training data is needed. The predictors will improve a lot in time, but these are the most significant SNPs, i.e., the easiest to discover.)
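As a rough illustration of what "overlap between known predictor SNPs" can mean, the Python sketch below computes, for every pair of predictors, the fraction of one predictor's SNPs that also appear in the other. This is a simplified stand-in rather than the paper's exact measure: it matches SNPs by identical ID only (ignoring nearby SNPs in linkage disequilibrium), and the predictor names and SNP IDs are hypothetical.

```python
from itertools import combinations

def shared_fraction(a, b):
    """Fraction of the SNPs in predictor a that also appear in predictor b."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0

def pairwise_overlap(predictors):
    """Map (x, y) -> fraction of x's SNPs shared with y, for every ordered pair."""
    out = {}
    for x, y in combinations(sorted(predictors), 2):
        out[(x, y)] = shared_fraction(predictors[x], predictors[y])
        out[(y, x)] = shared_fraction(predictors[y], predictors[x])
    return out

# Hypothetical predictors, each given as a list of SNP IDs.
toy_predictors = {
    "condition_a": ["snp1", "snp2", "snp3", "snp4"],
    "condition_b": ["snp4", "snp5"],
    "condition_c": ["snp6"],
}
overlaps = pairwise_overlap(toy_predictors)
```

The measure is deliberately asymmetric, because predictors differ widely in size: one shared SNP is a quarter of condition_a's SNPs but half of condition_b's. A variance-weighted version would sum each shared SNP's variance contribution instead of counting SNPs.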


From our results one can at least put a lower bound on the amount of "risk reduction" (or longevity gain!) that is independently and simultaneously available across various diseases (e.g., if one could make edits freely). It's a lot.
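For intuition about simultaneous low risk, here is a small Python sketch of the idealized case in which risk scores for different conditions are exactly independent. That is stronger than anything claimed above (we only characterize overlap between known predictor SNPs), so treat it as a back-of-the-envelope illustration. The function names, the quantile, the number of conditions, and the population size are all made-up inputs.

```python
def prob_low_in_all(q, k):
    """Probability of falling in the lowest-risk fraction q for each of k conditions,
    assuming the k risk scores are statistically independent."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a fraction between 0 and 1")
    return q ** k

def expected_count(q, k, population):
    """Expected number of such individuals in a population of the given size."""
    return population * prob_low_in_all(q, k)

# Hypothetical numbers: bottom 20% of risk in each of 5 conditions,
# in a population of one million people.
p = prob_low_in_all(0.2, 5)
n = expected_count(0.2, 5, 1_000_000)
```

With these inputs the probability is 0.2 to the fifth power, or about 3 in 10,000, which corresponds to roughly 320 people per million. Positive correlation between risks would make such individuals more common; trade-offs (negative correlation) would make them rarer.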


© Genomic Prediction, 2020