Genomic data has been growing explosively in the past few years. To date, more than 500,000 gene expression profiles are available in public databases (e.g., NCBI Gene Expression Omnibus). The Encyclopedia of DNA Elements (ENCODE) Consortium has generated vast amounts of annotation data using next-generation sequencing, including gene expression (RNA-seq) and transcription factor binding sites (ChIP-seq). As of September 2012, more than 1,600 ENCODE data sets from 147 cell lines had been produced. Meanwhile, thousands of genome-wide association studies (GWAS) have directly genotyped millions of SNP markers to study the genetic bases of complex diseases, and more than 10,000 loci have been reported to be associated with at least one disease. Whole genome sequencing aims to detect all genetic variants directly, and it is rapidly becoming a primary tool for characterizing the genetic bases of human diseases.
In 2010, Yang et al. showed that 45% of the variance in human height can be explained by using all genotyped common SNPs. This result suggests that most of the "missing heritability" is not missing but remains hidden in the genome: because of limited sample sizes, many individual effects of genetic markers are too weak to reach genome-wide significance, so those risk variants remain undiscovered. Similar genetic architectures have since been found for many other complex diseases, such as psychiatric disorders: the phenotype is affected by many genetic variants with small or moderate effects, a property referred to as "polygenicity". The polygenicity of complex diseases is further supported by recent GWAS with larger sample sizes, in which more associated common SNPs with moderate effects have been identified (e.g., GWAS data from 34,840 patients and 114,981 healthy people were analyzed to understand the genetic architecture of type 2 diabetes). However, recruiting large samples can be expensive and time-consuming, so identifying those hidden risk variants remains very challenging. Given the wealth of genomic data now available, statistical methods can help by borrowing relevant information from two sources:
Shared information in multiple GWAS: accumulating evidence suggests that different complex human traits are genetically correlated, i.e., multiple diseases share common genetic risk bases, which is known as "pleiotropy". Based on a systematic analysis of published GWAS, 16.9% of genes and 4.6% of SNPs have been reported to show pleiotropic effects.
Data enrichment with functional annotation: SNPs are not equally important, and functionally annotated genetic variants have revealed a consistent pattern of enrichment. Associated SNPs are more likely to be eQTLs, and SNPs in genes preferentially expressed in the central nervous system have been shown to be more important in psychiatric disorders. The ENCODE Project Consortium reported that 12% of disease-associated SNPs overlap transcription factor binding regions and 34% overlap DNase I hypersensitive sites.
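To make the idea of annotation enrichment concrete, here is a minimal Python sketch. It is not the GPA or LSMM method; it only illustrates one simple way to ask whether associated SNPs fall in an annotated category more often than expected by chance, using a fold-enrichment ratio and a hypergeometric upper-tail probability. The function names and all counts are hypothetical and chosen purely for illustration.

```python
from math import comb


def fold_enrichment(n_assoc_annot, n_assoc, n_annot, n_total):
    """Annotated fraction among associated SNPs divided by the
    annotated fraction among all SNPs."""
    return (n_assoc_annot / n_assoc) / (n_annot / n_total)


def hypergeom_upper_tail(k, n_assoc, n_annot, n_total):
    """P(X >= k), where X is the number of annotated SNPs among n_assoc
    SNPs drawn without replacement from n_total SNPs, of which n_annot
    are annotated."""
    denom = comb(n_total, n_assoc)
    upper = min(n_assoc, n_annot)
    numer = sum(
        comb(n_annot, i) * comb(n_total - n_annot, n_assoc - i)
        for i in range(k, upper + 1)
    )
    return numer / denom


# Hypothetical toy counts (not taken from any real study):
n_total = 1000        # SNPs tested
n_annot = 100         # SNPs carrying the annotation
n_assoc = 50          # SNPs called associated with the trait
n_assoc_annot = 17    # associated SNPs that also carry the annotation

fold = fold_enrichment(n_assoc_annot, n_assoc, n_annot, n_total)
p_value = hypergeom_upper_tail(n_assoc_annot, n_assoc, n_annot, n_total)
print(f"fold enrichment: {fold:.2f}")
print(f"hypergeometric p-value: {p_value:.3g}")
```

A test of this kind treats SNPs as independent draws, which ignores linkage disequilibrium, so it should be read as an illustration of the enrichment concept rather than a recommended analysis.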
Clearly, there is a great need to develop statistically rigorous and computationally efficient methods to integrate genomic data. Such methods allow biomedical researchers to make the most efficient use of the vast amounts of valuable data that have been generated to dissect complex disease genetics. The methods developed here are also broadly applicable to many other disciplines where diverse, rich, and multiscale data are available to address challenging scientific problems.
For more information, see our recent papers:
D. Chung*, C. Yang*, C. Li, J. Gelernter and H. Zhao. GPA: A statistical approach to prioritizing GWAS results by integrating pleiotropy information and annotation data. PLoS Genetics, 2014. *Joint first authors. [Software and more information about GPA].
J. Ming, M. Dai, M. Cai, X. Wan, J. Liu, C. Yang. LSMM: A statistical approach to integrating functional annotations with genome-wide association studies. [arXiv][Software]