The first two problem sets will be an exercise in integrative genomics. To study the problem of Parkinson’s Disease, you have the raw data from two experiments. You will need to analyze these experimental raw data and integrate the results.
First, read the publication by Hon-Chung Fung, et al. Genome-wide Genotyping in Parkinson’s Disease and Neurologically Normal Controls: First Stage Analysis and Public Release of Data. Lancet Neurology 2006; 5:911-916. Paper (This is a Stanford Library page. Click “Get Full text” at the top of the page to get to the article.)
This publication describes a genome-wide association study on Parkinson's Disease. More than 408,000 single nucleotide polymorphisms (SNPs) were measured (or genotyped) across 276 patients with Parkinson’s Disease, and 276 normal control individuals. Each SNP is a single-nucleotide position at which individuals may differ. Recall that there are estimated to be as many as 10 million SNPs in the human genome, so this collection does not encompass all of them.
The raw data for this study are here:
The cc subdirectory contains data for the Caucasian control individuals, while the pd subdirectory contains data for the individuals with Parkinson's Disease.
The file chr22.map in the cc subdirectory starts with
    22 15,407,252 rs5747620 T C 0.527 0.473 13
    22 15,447,037 rs2236639 G A 0.921 0.079 0
    22 15,447,620 rs5747988 G A 0.919 0.081 0
    22 15,449,907 rs5747999 A C 0.835 0.165 2
    22 15,462,210 rs11089263 A C 0.634 0.366 3
    ...
The first column indicates the chromosome. The second column indicates the specific base-pair (nucleotide) on the chromosome for the location of this SNP. The third column indicates the dbSNP identifier for this SNP. The fourth column indicates the major allele found at this SNP, or the variant (i.e. base-pair) most commonly seen. The fifth column indicates the minor allele found at this SNP, or the variant (i.e. base-pair) least commonly seen. The sixth and seventh columns indicate the frequency of the major and minor alleles seen in this population. The eighth column indicates the number of missing genotypes (i.e. missing measurements).
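The column layout above can be read into a small record. The following is a minimal Python sketch, assuming the fields are whitespace-separated and that positions carry thousands separators as in the excerpt; the names `MapRow` and `parse_map_line` are hypothetical and not part of the course materials.

```python
from typing import NamedTuple


class MapRow(NamedTuple):
    """One row of a .map file (field names are illustrative)."""
    chromosome: str
    position: int
    snp_id: str
    major_allele: str
    minor_allele: str
    major_freq: float
    minor_freq: float
    missing: int


def parse_map_line(line: str) -> MapRow:
    """Parse one whitespace-separated .map row into a MapRow."""
    chrom, pos, snp_id, major, minor, major_f, minor_f, missing = line.split()
    return MapRow(
        chromosome=chrom,
        position=int(pos.replace(",", "")),  # "15,407,252" -> 15407252
        snp_id=snp_id,
        major_allele=major,
        minor_allele=minor,
        major_freq=float(major_f),
        minor_freq=float(minor_f),
        missing=int(missing),
    )


# First row of the chr22.map excerpt shown above.
row = parse_map_line("22 15,407,252 rs5747620 T C 0.527 0.473 13")
print(row.snp_id, row.position, row.major_allele, row.minor_allele, row.missing)
```

If the actual files use a different delimiter or omit the thousands separators, only the `split()` and `replace()` calls would need adjusting.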
The file chr22.pre in the cc subdirectory starts with:
    ND 412 1 T C G G G G A A A A C C ...
    ND 528 1 T T G G G G A A ...
Each row of this file represents a single individual in the study. “ND 412” indicates the code for the individual. 1 indicates an unaffected individual, while 2 indicates an affected individual. After this number, a series of A, T, C and G characters appear. Each pair of characters represents the genotyped alleles (for two chromosomes: one maternal and one paternal) at a single locus for a single individual. For example, the first two alleles for individual “ND 412” are T and C. These first two alleles correspond with the first row of chr22.map. In other words, individual “ND 412” has at locus rs5747620 a T base-pair on one chromosome, and a C base-pair on the other chromosome. Individual “ND 412” has at locus rs2236639 a G base-pair on both chromosomes.
Since an individual has two chromosomes, and there are typically two possible alleles at each SNP locus, individuals can either have two of the major alleles (i.e. homozygous for the major allele, also known as “AA”), two of the minor alleles (i.e. homozygous for the minor allele, also known as “aa”), or one of each (i.e. heterozygous, also known as “Aa”). This makes three possible genotypes per locus. Continuing the example above, individual “ND 412” above is heterozygous at locus rs5747620, having a T on one chromosome and a C on the other. Individual “ND 412” is homozygous for the major allele at locus rs2236639, with a G base-pair on both chromosomes.
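The genotype labels described above can be assigned mechanically. Here is a minimal Python sketch, assuming the individual code occupies the first two whitespace-separated tokens (as in “ND 412”) and that alleles alternate one character per token; `parse_pre_line` and `classify` are hypothetical helper names. How missing genotypes are encoded in the real files is not stated here, so any pair that does not match the major and minor alleles is simply returned as `None`.

```python
from collections import Counter


def parse_pre_line(line):
    """Split a .pre row into (individual code, status, list of allele pairs)."""
    tokens = line.split()
    code = " ".join(tokens[:2])   # assumption: code is two tokens, e.g. "ND 412"
    status = int(tokens[2])       # 1 = unaffected, 2 = affected
    alleles = tokens[3:]
    # Consecutive alleles form one pair per locus, in .map row order.
    pairs = list(zip(alleles[0::2], alleles[1::2]))
    return code, status, pairs


def classify(pair, major, minor):
    """Return "AA", "Aa", or "aa" for one allele pair, or None if unrecognized."""
    if pair == (major, major):
        return "AA"
    if pair == (minor, minor):
        return "aa"
    if set(pair) == {major, minor}:
        return "Aa"
    return None  # e.g. a missing genotype; the encoding is an open assumption


# The two individuals from the chr22.pre excerpt (trailing "..." omitted).
lines = [
    "ND 412 1 T C G G G G A A A A C C",
    "ND 528 1 T T G G G G A A",
]
code, status, pairs = parse_pre_line(lines[0])
first = classify(pairs[0], "T", "C")    # locus rs5747620: major T, minor C
second = classify(pairs[1], "G", "A")   # locus rs2236639: major G, minor A
print(code, status, first, second)

# Tally the genotypes at the first locus across both individuals.
counts = Counter(classify(parse_pre_line(l)[2][0], "T", "C") for l in lines)
print(dict(counts))
```

Repeating the tally for every locus, separately for the cc and pd files, would give the per-genotype counts needed for a case-control comparison.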
At any SNP locus, we can use the control individuals to provide the expected distribution of the three possible genotypes (“AA”, “Aa”, and “aa”). We can then test whether the distribution of these genotypes is significantly different in the affected individuals. For example, these distributions may look like:
| | AA | Aa | aa |
|---|---|---|---|
| Control | 192 individuals | 59 individuals | 19 individuals |
| Parkinson’s Disease | 167 individuals | 100 individuals | 3 individuals |
We can first use the chi-squared test to determine whether the genotype distribution seen in Parkinson’s Disease patients differs from that seen in control individuals (i.e. a case-control study). A chi-squared test for these data with 2 degrees of freedom yields a p-value of 6.301 × 10⁻⁶, indicating that it is highly unlikely that the Parkinson’s Disease distribution matches the control distribution. Fisher's exact test could also be used, yielding similar results. Either way, this would indicate that the genotype distribution at this locus is significantly associated with the presence of Parkinson’s Disease.
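To make the arithmetic behind this test concrete, here is a minimal Python sketch of the Pearson chi-squared statistic for the 2 × 3 example table. It uses only the standard library and relies on the fact that, for exactly 2 degrees of freedom, the upper-tail probability of the chi-squared distribution is exp(-x/2). The function names are hypothetical, and the sketch assumes every genotype column has a nonzero total.

```python
import math


def chi_squared_2xk(control, cases):
    """Pearson chi-squared statistic for a 2 x k table of counts."""
    total = sum(control) + sum(cases)
    stat = 0.0
    for row in (control, cases):
        row_total = sum(row)
        for observed, c, d in zip(row, control, cases):
            # Expected count = row total * column total / grand total.
            expected = row_total * (c + d) / total
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df2(stat):
    """Upper-tail chi-squared p-value, valid only for 2 degrees of freedom."""
    return math.exp(-stat / 2.0)


# Counts from the example table: AA, Aa, aa.
control = [192, 59, 19]
cases = [167, 100, 3]

stat = chi_squared_2xk(control, cases)
p = p_value_df2(stat)
print(f"chi-squared = {stat:.4f}, p = {p:.3e}")
```

The statistic comes out near 23.95, and the resulting p-value agrees with the value quoted for this table. For tables with other degrees of freedom, a statistics package would be needed in place of the closed-form `p_value_df2`.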
To work on your local machine:
To work on our server do the following:
1. Focus on the genotype data available for just chromosome 11 (chr11). Ignore the other chromosomes, for simplicity. Using all the control and affected individuals, calculate for each SNP locus the number of individuals having each of the three possible genotypes (“AA”, “Aa”, and “aa”). At each locus, use chi-squared testing to determine whether the genotype distribution differs significantly between Parkinson’s Disease individuals and control individuals.
List the top ten SNP loci on chromosome 11 associated with Parkinson’s Disease, ordered by chi-squared test p-value. (45 pts)
2. Why is chi-square an appropriate statistic to use for this analysis? (5 pts)
3. Draw out a representative chi-square table for the SNP locus (rs3741411) on chromosome 11, manually calculate the X^2 statistic, and use R to get the p-value. Show all work. (5 pts)
4. How many SNP loci have a p-value of < 0.05? (5 pts)
Extra Credit: Carry out the analysis above on all 22 autosomes, i.e. excluding X and Y (data can be obtained using the links above). List the top ten SNP loci associated with Parkinson's Disease, ordered by chi-squared test p-value. (10 pts)
Things to note:
Three steps:
1. Create a directory containing the following files:
2. Zip the directory into one file called ps1_your_sunet_id.zip.
3. Email the ps1_your_sunet_id.zip file to bmi217submit@gmail.com
You will need to explain your work.
You can talk with others about this problem set, but to be fair to online students, you must not compare answers. You must submit your own individual work.