UNDERGRAD RESEARCH

Project Topics

Note: The field of statistical genetics continues to grow rapidly, and new technologies and ideas are constantly opening up new research areas. The projects below are therefore proposals only: we will select the most timely, relevant, and meaningful research for completion during summer 2014, and the list may change between now and the start of the program.

Project Area #1. Directly connecting human disease etiology with rare variant test performance

One of the most active areas in statistical genetics is the development of new rare variant tests of association. The general approach for most of these tests is to combine genotype-phenotype association signals at multiple sites in a gene of interest into a single test of significance. Numerous gene-based rare variant tests of this type have been proposed (e.g., Li and Leal 2008; Madsen and Browning 2009; Morris and Zeggini 2010; Zawistowski et al. 2010; Li et al. 2010a; Li et al. 2010b; Price et al. 2010; Neale et al. 2011; Pan and Shen 2011; Wu et al. 2011, among others). The methods differ in how variants are summarized individually, weighted, and then combined to create a single statistic.
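The idea of combining signals across sites can be illustrated with a minimal sketch. The function below is a toy collapsing test and not the exact statistic of any method cited above: it assumes genotypes are coded as minor-allele counts (0, 1, 2), reduces each individual to a carrier indicator, and compares carrier frequencies in cases and controls with a 2x2 chi-square test. The function name and the example data are hypothetical.

```python
import math

def carrier_burden_test(case_genotypes, control_genotypes):
    """Toy collapsing test: one row per individual, one column per rare site.

    Each individual is collapsed to a carrier indicator (at least one minor
    allele anywhere in the gene); carrier counts are then compared between
    cases and controls with a 1-df chi-square test.
    """
    def count_carriers(genotypes):
        return sum(1 for person in genotypes if any(g > 0 for g in person))

    a = count_carriers(case_genotypes)        # carrier cases
    b = len(case_genotypes) - a               # non-carrier cases
    c = count_carriers(control_genotypes)     # carrier controls
    d = len(control_genotypes) - c            # non-carrier controls
    n = a + b + c + d

    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A margin is empty, so the table carries no information.
        return 0.0, 1.0

    chi2 = n * (a * d - b * c) ** 2 / denom
    # Upper tail of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical data: 4 cases and 4 controls genotyped at 3 rare sites.
cases = [[0, 1, 0], [1, 0, 0], [0, 0, 2], [0, 0, 0]]
controls = [[0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0]]
chi2, p_value = carrier_burden_test(cases, controls)
print(f"chi-square = {chi2:.3f}, p = {p_value:.3f}")
```

Replacing the carrier indicator with a weighted sum of allele counts would give a different member of the same family, which is the sense in which the tests differ in how variants are summarized, weighted, and combined.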

Recent comprehensive analyses of data simulated under numerous genotype-phenotype models (e.g., Basu and Pan 2011) have shown that different tests are optimal under different models. However, we still lack an intuitive understanding of why these methods perform the way they do, particularly with respect to the parameters of the genotype-phenotype model and the study design.

We will continue work started in summer 2011 to evaluate a geometric representation of case-control sequencing data that facilitates an analytic connection between the parameters of the disease model, the observed sequence data, and the equations for many existing rare variant test statistics. This framework allows a geometric classification of proposed rare variant tests that explains some of the observed differences in performance and can be used to directly connect genetic disease models with statistical power. We anticipate that this framework will help guide researchers in prospective test selection and study design, and will provide the opportunity to evaluate novel rare variant statistics analytically.

Specific student research tasks will involve analytic evaluation of systems of equations along with computational and simulation approaches to complement analytic analyses.

Project Area #2. Considering genotype uncertainty in gene-based rare variant tests

The new class of rare variant tests has generally been evaluated assuming perfect genotype information. In reality, rare variant genotypes from next generation sequencing, SNP chip, and imputed data will all contain genotyping errors, and the gene-based tests must be robust enough to accommodate imperfect data. Errors in SNP genotyping are already known to dramatically impact statistical power for single marker tests on common variants (Gordon et al. 2002; Kang et al. 2004a; Kang et al. 2004b; Gordon and Finch 2005; Ahn et al. 2007; Huang et al. 2009) and, in some cases, to inflate the type I error rate (Moskvina et al. 2006; Ahn et al. 2009; Mitry et al. 2011).

Recent results show that errors in genotype calls derived from sequencing reads are dependent on several study factors, including read depth, calling algorithm, the number of alleles present in the sample, and the frequency at which an allele segregates in the population. Imputation accuracy of rare variants is known to depend on frequency of the allele to be imputed as well as the size of the reference panel and its genetic relatedness to the study sample.
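The dependence of calling error on read depth can be made concrete with a deliberately simplified calculation. The sketch below is not a model taken from the studies referred to above: it assumes error-free reads, that each read at a truly heterozygous site samples the alternate allele with probability 0.5, and a hypothetical calling rule that requires at least `min_alt_reads` alternate reads before the rare allele is called.

```python
import math

def het_miss_probability(depth, min_alt_reads=1):
    """P(a true heterozygote is called homozygous reference) under a toy model.

    The number of alternate-allele reads is Binomial(depth, 0.5); the rare
    allele is missed when fewer than `min_alt_reads` alternate reads appear.
    """
    ways = sum(math.comb(depth, k) for k in range(min_alt_reads))
    return ways * 0.5 ** depth

# Missed-heterozygote probability at several hypothetical read depths.
for depth in (2, 4, 8, 16, 30):
    print(depth,
          het_miss_probability(depth),
          het_miss_probability(depth, min_alt_reads=2))
```

Even in this idealized setting the error rate falls geometrically with depth and rises with the stringency of the calling rule, so low-depth designs systematically undercount rare-allele carriers.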

We will use rare variant genotype error models for sequencing calls and imputation to quantify precisely how the study design variables that affect genotype accuracy influence rare variant tests. These analyses will provide a realistic framework for assessing power and type I error for rare variant tests in real studies and will identify genetic disease model/test combinations that are particularly vulnerable or resistant to genotype uncertainty.

Specific student research tasks will involve analytic evaluation of systems of equations along with computational and simulation approaches to complement analytic analyses.

Project Area #3. Translating significant gene-based rare variant associations into precise formulations of disease architecture

The current frenzy in developing rare variant tests has focused on increasingly complex tests that maximize statistical power to detect genomic regions harboring multiple rare variants that influence disease risk. Thus far, little effort has been devoted to the subsequent steps of replicating and characterizing significant gene-based rare variant associations. Genome-wide association studies (GWAS) have shown that replicating a significant statistical association and precisely defining its genetic architecture are not straightforward. Bias in effect size estimates owing to the winner’s curse (Zöllner and Pritchard 2007; Xiao and Boehnke 2011), differences in LD patterns between populations, and heterogeneity in effect sizes all complicate replication of true GWAS associations (Ioannidis 2007).
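The winner's curse can be seen in a small simulation. This is a sketch under assumed values rather than an analysis from the cited papers: each hypothetical study returns a normally distributed effect estimate with a known standard error, and only studies passing a one-sided significance threshold are "discovered". All parameter values and the function name are illustrative.

```python
import random
import statistics

def winners_curse_demo(true_effect=0.1, std_error=0.1, z_threshold=1.96,
                       n_studies=10000, seed=1):
    """Compare the mean estimate over all studies with the mean over
    studies that reached significance (estimate / std_error > z_threshold)."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_studies)]
    significant = [b for b in estimates if b / std_error > z_threshold]
    mean_all = statistics.mean(estimates)
    # Guard against the (unlikely) case that no study is significant.
    mean_sig = statistics.mean(significant) if significant else float("nan")
    return mean_all, mean_sig, len(significant)

mean_all, mean_sig, n_sig = winners_curse_demo()
print(f"mean of all estimates:         {mean_all:.3f}")
print(f"mean of significant estimates: {mean_sig:.3f} ({n_sig} studies)")
```

With these settings every significant estimate must exceed 0.196, nearly twice the true effect of 0.1, so a replication study powered on the discovery estimate would be underpowered for the true effect.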

Further, determining the exact causal variant is confounded by the fact that the SNP showing the strongest association in the GWAS may not be causal, but instead may be in linkage disequilibrium with the true causal variant (Wang et al. 2010). These issues will persist for gene-based rare variant tests, and several new challenges are likely to emerge. Notably, the significant gene-based statistic may include a mix of causal, risk and neutral variants, each with a unique weighting term. Thus, a primary challenge will be decomposing the pooled test statistic to determine precisely which variants influence disease risk. Further, replication studies for rare risk variants identified through sequencing experiments may include sequencing, genotyping or imputation in an independent sample.

The appropriateness of each of these methods will depend on the frequency spectrum of causal variants. We will develop statistical methods for post-hoc analysis of significant rare variant test results to gain insight into the underlying genetic architecture. We anticipate that these methods will provide a basis to guide future analyses, including powerful replication studies, and ultimately yield a clearer picture of disease etiology.

Specific student tasks will involve in-depth analysis of simulation data and development of novel identification/clustering algorithms, along with the development of analytic cost-efficiency models for replication study design.

Project Area #4. Bayesian approaches to regulatory network inference in prokaryotes

Regulatory network inference (RNI) is an attempt to characterize the regulatory network of prokaryotes through the integration of multiple genomic data sources. Traditional RNI approaches such as CLR (Faith et al. 2007) and the Inferelator (Bonneau et al. 2006) are described as agnostic in that they ignore a priori knowledge about regulatory mechanisms and use only gene expression measurements. We will explore the development of a Bayesian framework to integrate a priori knowledge about sets of genes that may be co-regulated in order to improve statistical power for regulatory network inference.

Specific student tasks will involve the development of the Bayesian framework and extensive simulation studies. Successful efforts in this area will result in application of the methods to large amounts of prokaryotic microarray data in an internationally recognized knowledgebase for prokaryote genetic data.

References

Ahn K., Gordon D., Finch S.J. (2009): “Increase of rejection rate in case-control studies with differential genotyping error rates.” Statistical Applications in Genetics and Molecular Biology 8:25.

Ahn K., Haynes C., Kim W., St. Fleur R., Gordon D., Finch S.J. (2007): “The effects of SNP genotyping errors on the power of the Cochran-Armitage linear trend test for case/control association studies.” Annals of Human Genetics 71:249-262.

Basu S., Pan W. (2011): “Comparison of statistical tests for disease association with rare variants.” Genetic Epidemiology 35(7):606-619.

Bonneau R., Reiss D.J., Shannon P., Facciotti M., Hood L., Baliga N.S., Thorsson V. (2006): “The Inferelator: An algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo.” Genome Biology 7(5):R36.

Faith J.J., Hayete B., Thaden J.T., Mogno I., Wierzbowski J., Cottarel G., Kasif S., Collins J.J., Gardner T.S. (2007): “Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles.” PLoS Biology 5(1):e8.

Gordon D., Finch S.J. (2005): “Factors affecting statistical power in the detection of genetic association.” Journal of Clinical Investigation 115:1408-1418.

Gordon D., Finch S.J., Nothnagel M., Ott J. (2002): “Power and sample size calculations for case-control genetic association tests when errors are present: Application to single nucleotide polymorphisms.” Human Heredity 54:22-33.

Huang L., Wang C., Rosenberg N.A. (2009): “The relationship between imputation error and statistical power in genetic association studies in diverse populations.” American Journal of Human Genetics 85:692-698.

Ioannidis J.P. (2007): “Non-replication and inconsistency in the genome-wide association setting.” Human Heredity 64(4):203-13.

Kang S.J., Finch S.J., Haynes C., Gordon D. (2004): “Quantifying the percent increase in minimum sample size necessary for SNP genotyping errors in genetic model-based association studies.” Human Heredity 58:139-144.

Kang S.J., Gordon D., Finch S.J. (2004): “What SNP genotyping errors are most costly for genetic association studies?” Genetic Epidemiology 26:132-141.

Li B., Leal S.M. (2008): “Methods for detecting associations with rare variants for common diseases: Application to analysis of sequence data.” American Journal of Human Genetics 83:311-321.

Li Q., Zhang H., Yu K. (2010): “Approaches for evaluating rare polymorphisms in genetic association studies.” Human Heredity 69:219-228.

Li Y., Byrnes A.E., Li M. (2010): “To identify associations with rare variants, just WHaIT: Weighted Haplotype and Imputation-based Tests.” American Journal of Human Genetics 87:728-735.

Madsen B.E., Browning S.R. (2009): “A group-wise association test for rare mutations using a weighted sum statistic.” PLoS Genetics 5:e1000384.

Mitry D., Campbell H., Charteris DG, Fleck B.W., Tenesa A., Dunlop M.G., Hayward C., Wright A.F., Vitart V. (2011): “SNP mistyping in genotyping arrays – an important cause of spurious association in case-control studies.” Genetic Epidemiology 35:423-426.

Morris A.P., Zeggini E. (2010): “An evaluation of statistical approaches to rare variant analysis in genetic association studies.” Genetic Epidemiology 34:188-193.

Moskvina V., Craddock N., Holmans P., Owen M.J., O’Donovan M.C. (2006): “Effects of differential genotyping error rate on the type I error probability of case-control studies.” Human Heredity 61:55-64.

Neale B.M., Rivas M.A., Voight B.F., Altshuler D., Devlin B., Orho-Melander M., Kathiresan S., Purcell S.M., Roeder K., Daly M.J. (2011): “Testing for an unusual distribution of rare variants.” PLoS Genetics 7:e1001322.

Pan W., Shen X. (2011): “Adaptive tests for association analysis of rare variants.” Genetic Epidemiology 35:381-388.

Price A.L., Kryukov G.V., de Bakker P.I.W., Purcell S.M., Staples J., Wei L.J., Sunyaev S.R. (2010): “Pooled association tests for rare variants in exon-resequencing studies.” American Journal of Human Genetics 86:832-838.

Wang K., Dickson S.P., Stolle C.A., Krantz I.D., Goldstein D.B., Hakonarson H. (2010): “Interpretation of association signals and identification of causal variants from genome-wide association studies.” American Journal of Human Genetics 86(5):730-42.

Wu M.C., Lee S., Cai T., Li Y., Boehnke M., Lin X. (2011): “Rare-variant association testing for sequencing data with the sequence kernel association test.” American Journal of Human Genetics 89:82-93.

Xiao R., Boehnke M. (2011): “Quantifying and correcting for the winner’s curse in quantitative trait association studies.” Genetic Epidemiology 35:133-138.

Zawistowski M., Gopalakrishnan S., Ding J., Li Y., Grimm S., Zöllner S. (2010): “Extending rare-variant testing strategies: Analysis of noncoding sequence and imputed genotypes.” American Journal of Human Genetics 87:604-617.

Zöllner S., Pritchard J.K. (2007): “Overcoming the winner's curse: Estimating penetrance parameters from case-control data.” American Journal of Human Genetics 80:605-615.

The 1000 Genomes Project Consortium (2010): “A map of human genome variation from population-scale sequencing.” Nature 467:1061-1073.