Our newest paper (released to bioRxiv here) describes over a dozen genomic predictors for common disease risk, constructed via machine learning on hundreds of thousands of genotypes. The predictors use anywhere from a few tens (e.g., 20 or 50) to thousands of SNPs to compute the polygenic score (PGS) for risk of a specific disease.

The figure above (Atrial Fibrillation) shows out-of-sample testing of risk prediction (black dots with error bars) compared to theoretical prediction (red line). The theoretical prediction uses the empirical fact that cases and controls are each normally distributed in PGS, with the two distributions shifted relative to each other. Cases have, on average, higher scores, and come to dominate in high PGS percentile bins. So, conditional on a high PGS (e.g., 99th percentile), the probability of the condition can be significantly elevated (e.g., ~8 times the typical probability of developing atrial fibrillation).
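To make the two-distribution picture concrete, here is a minimal Python sketch of how a conditional risk could be computed from it with Bayes' rule. This is an illustration under stated assumptions, not the paper's code: the function name `risk_given_pgs`, the convention that controls are centered at zero, and the numbers for prevalence and case shift are all hypothetical placeholders rather than values from the paper.

```python
from statistics import NormalDist

def risk_given_pgs(z, prevalence, case_shift, sd_case=1.0, sd_control=1.0):
    """P(case | PGS = z) when cases and controls are each normal in PGS.

    Assumed convention: controls are centered at 0 and cases at
    `case_shift`, both measured in the same PGS units.
    """
    controls = NormalDist(0.0, sd_control)
    cases = NormalDist(case_shift, sd_case)
    weighted_case = prevalence * cases.pdf(z)
    weighted_control = (1.0 - prevalence) * controls.pdf(z)
    return weighted_case / (weighted_case + weighted_control)

# Hypothetical inputs, chosen only to exercise the function.
prevalence = 0.03   # assumed population prevalence
case_shift = 0.6    # assumed shift of the case mean, in control SDs

# Approximate the 99th percentile by the 99th percentile among controls.
z_99 = NormalDist().inv_cdf(0.99)
relative_risk = risk_given_pgs(z_99, prevalence, case_shift) / prevalence
print(f"risk at the 99th percentile relative to prevalence: {relative_risk:.2f}x")
```

When the two distributions coincide (`case_shift = 0`) the function returns the prevalence itself, which is a quick sanity check that the score then carries no information.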

**We can identify, from SNP genotype alone, a subset of the population with unusual risk for conditions like Atrial Fibrillation or Diabetes or Breast Cancer or Prostate Cancer.** Just a year or two ago this would have seemed like science fiction to most biomedical researchers, but the demonstrable validity of PGS is now increasingly mainstream.

Empirical validation of risk is limited by availability of out-of-sample populations for whom we have genotype and disease status. However, it is clear from the results that the theoretical models do a competent job of predicting odds ratios once the properties of the case and control normal distributions (mean and standard deviation of PGS) are known.
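As a rough sketch of how an odds ratio follows from the means and standard deviations alone, the Python snippet below compares the odds of being a case above a PGS threshold with the odds below it. The function name and the example parameters are hypothetical, and the paper may define its odds ratios over different bins.

```python
from statistics import NormalDist

def odds_ratio_above_threshold(threshold, cases, controls):
    """Odds of being a case above `threshold` divided by the odds below it.

    The population prevalence cancels in this ratio, so only the two
    PGS distributions are needed.
    """
    case_above = 1.0 - cases.cdf(threshold)
    control_above = 1.0 - controls.cdf(threshold)
    case_below = cases.cdf(threshold)
    control_below = controls.cdf(threshold)
    return (case_above / control_above) / (case_below / control_below)

# Hypothetical distributions: controls N(0, 1), cases shifted up by 0.6.
controls = NormalDist(0.0, 1.0)
cases = NormalDist(0.6, 1.0)

# Threshold at the 96th percentile of controls (an arbitrary choice).
threshold = controls.inv_cdf(0.96)
print(f"odds ratio above threshold: {odds_ratio_above_threshold(threshold, cases, controls):.2f}")
```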

Once the genotype has been determined, *all* of the disease risks in the above can be computed for an individual patient or embryo. It is only a matter of time before genotyping of this kind becomes Standard of Care in health systems and IVF practices around the world.

In the paper we also analyze the rate of improvement of prediction AUC as training sample size increases. With more data, our extant predictors are becoming significantly more accurate, and more predictors for many more risk traits are currently undergoing similar validation. News of actionable developments in predictor performance is being released at an accelerating pace, on predictor development timescales ranging from a few months to a few years.
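For intuition about what an AUC improvement means in the normal-distribution picture, the standard binormal result gives AUC as the probability that a randomly chosen case scores higher than a randomly chosen control. The sketch below is illustrative Python, not taken from the paper; the function name and the shift values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def binormal_auc(mean_case, sd_case, mean_control, sd_control):
    """AUC when case and control scores are independent normals.

    The case-minus-control difference is normal with mean
    (mean_case - mean_control) and variance sd_case**2 + sd_control**2,
    and AUC is the probability that this difference is positive.
    """
    separation = (mean_case - mean_control) / sqrt(sd_case**2 + sd_control**2)
    return NormalDist().cdf(separation)

# Hypothetical shifts, only to show AUC rising as the distributions separate.
for shift in (0.2, 0.4, 0.6, 0.8):
    print(f"shift={shift:.1f}  AUC={binormal_auc(shift, 1.0, 0.0, 1.0):.3f}")
```

In this reading, a better-trained predictor corresponds to a larger separation between the case and control distributions, and hence a larger AUC.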

© Genomic Prediction, 2019