Problem Set 1

  • Released January 12, 2009
  • Due January 19, 2009 at 5pm (Pacific time)

Introduction

The first two problem sets are an exercise in integrative genomics. To study Parkinson’s Disease, you are given the raw data from two experiments. You will need to analyze these raw data and integrate the results.

First, read the publication by Hon-Chung Fung, et al. Genome-wide Genotyping in Parkinson’s Disease and Neurologically Normal Controls: First Stage Analysis and Public Release of Data. Lancet Neurology 2006; 5:911-916. Paper (This is a Stanford Library page. Click “Get Full text” at the top of the page to get to the article.)

This publication describes a genome-wide association study on Parkinson's Disease. More than 408,000 single nucleotide polymorphisms (SNPs) were measured (or genotyped) across 276 patients with Parkinson’s Disease, and 276 normal control individuals. Each SNP is a potentially differing nucleotide between individuals. Recall that there are estimated to be as many as 10 million SNPs in the human genome, so this collection does not encompass all of them.

The raw data for this study are here:

The cc subdirectory indicates data for the Caucasian control individuals, while the pd subdirectory indicates data for the individuals with Parkinson's Disease.

The file chr22.map in the cc subdirectory starts with:

22      15,407,252      rs5747620       T       C       0.527   0.473   13
22      15,447,037      rs2236639       G       A       0.921   0.079   0
22      15,447,620      rs5747988       G       A       0.919   0.081   0
22      15,449,907      rs5747999       A       C       0.835   0.165   2
22      15,462,210      rs11089263      A       C       0.634   0.366   3
...

The first column indicates the chromosome. The second column gives the base-pair position of the SNP on that chromosome. The third column gives the dbSNP identifier (rs number) for the SNP. The fourth column gives the major allele, i.e. the variant most commonly seen at this SNP. The fifth column gives the minor allele, i.e. the variant least commonly seen. The sixth and seventh columns give the frequencies of the major and minor alleles in this population. The eighth column gives the number of missing genotypes (i.e. missing measurements).
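As a rough illustration of the column layout, the first few rows quoted above can be read into an R data frame. This is only a sketch: the column names are made up for readability (the file has no header row), and it assumes the fields are separated by whitespace.

```r
# First rows of chr22.map as quoted above, with fields separated by single spaces.
map_text <- "22 15,407,252 rs5747620 T C 0.527 0.473 13
22 15,447,037 rs2236639 G A 0.921 0.079 0
22 15,447,620 rs5747988 G A 0.919 0.081 0"

# Column names are illustrative only. Allele columns are forced to character
# so that an allele "T" is never mistaken for the logical value TRUE.
map <- read.table(text = map_text, header = FALSE, stringsAsFactors = FALSE,
                  colClasses = c("integer", "character", "character", "character",
                                 "character", "numeric", "numeric", "integer"),
                  col.names = c("chr", "pos", "rsid", "major", "minor",
                                "major_freq", "minor_freq", "n_missing"))

# Positions are written with thousands separators, so strip the commas
# before converting them to integers.
map$pos <- as.integer(gsub(",", "", map$pos))
print(map)
```

Reading the position as text first avoids a conversion error on values such as 15,407,252.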

The file chr22.pre in the cc subdirectory starts with:

ND 412 1 T C G G G G A A A A C C ...
ND 528 1 T T G G G G A A ...

Each row of this file represents a single individual in the study. “ND 412” is the code for the individual. The number that follows is 1 for an unaffected individual and 2 for an affected individual. After this number, a series of A, T, C and G characters appears. Each pair of characters represents the genotyped alleles (one per chromosome copy: one maternal and one paternal) at a single locus for that individual. For example, the first two alleles for individual “ND 412” are T and C. These correspond to the first row of chr22.map. In other words, individual “ND 412” has at locus rs5747620 a T on one chromosome and a C on the other. At locus rs2236639, individual “ND 412” has a G on both chromosomes.
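One way to picture this layout is to split a single line into its identifier, affection status, and allele pairs. The sketch below uses the first example line, cut off after six loci, and assumes the fields are separated by spaces and that the individual code spans the first two fields; the variable names are illustrative.

```r
# One line in the style of chr22.pre, cut off after the six loci shown above.
line <- "ND 412 1 T C G G G G A A A A C C"

fields <- strsplit(line, " +")[[1]]
id     <- paste(fields[1:2], collapse = " ")  # individual code, e.g. "ND 412"
status <- as.integer(fields[3])               # 1 = unaffected, 2 = affected

# The remaining fields are alleles; each consecutive pair is one locus,
# in the same order as the rows of the matching .map file.
alleles <- matrix(fields[-(1:3)], ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("allele1", "allele2")))
print(alleles)
```

Row i of `alleles` then lines up with row i of the map file.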

Since an individual has two copies of each chromosome, and there are typically two possible alleles at each SNP locus, an individual can have two of the major alleles (i.e. homozygous for the major allele, also known as “AA”), two of the minor alleles (i.e. homozygous for the minor allele, also known as “aa”), or one of each (i.e. heterozygous, also known as “Aa”). This makes three possible genotypes per locus. Continuing the example above, individual “ND 412” is heterozygous at locus rs5747620, having a T on one chromosome and a C on the other. Individual “ND 412” is homozygous for the major allele at locus rs2236639, with a G on both chromosomes.
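The three-way classification can be sketched as a small function. `classify_genotype` is a hypothetical helper, not part of the provided data or of any library. Returning NA for any allele that is neither the major nor the minor allele is an assumption about how missing calls might be handled.

```r
# Hypothetical helper: label an allele pair relative to the major/minor alleles.
classify_genotype <- function(a1, a2, major, minor) {
  pair <- c(a1, a2)
  # Anything that is not the major or minor allele is treated as missing.
  if (!all(pair %in% c(major, minor))) return(NA_character_)
  n_minor <- sum(pair == minor)      # 0, 1, or 2 copies of the minor allele
  c("AA", "Aa", "aa")[n_minor + 1]
}

classify_genotype("T", "C", major = "T", minor = "C")  # "Aa" (rs5747620, ND 412)
classify_genotype("G", "G", major = "G", minor = "A")  # "AA" (rs2236639, ND 412)
```

Counting copies of the minor allele keeps the result independent of the order of the two alleles.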

At any SNP locus, we can use the control individuals to provide the expected distribution of the three possible genotypes (“AA”, “Aa”, and “aa”). We can then test whether the distribution of these genotypes is significantly different in the affected individuals. For example, these distributions may look like:

                      AA               Aa               aa
Control               192 individuals  59 individuals   19 individuals
Parkinson’s Disease   167 individuals  100 individuals  3 individuals

We can first use the chi-squared test to determine whether the genotype distribution seen in Parkinson’s Disease patients differs from that of the control individuals (i.e. a case-control study). A chi-squared test for these data with 2 degrees of freedom yields a p-value of 6.301 x 10^-6, indicating that it is highly unlikely that the Parkinson’s Disease distribution matches the control distribution. Fisher's exact test could also be used, yielding similar results. Either way, this would indicate that the genotype at this locus is significantly associated with the presence of Parkinson’s Disease.
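The example counts can be checked in R with the built-in chisq.test function. The sketch below also recomputes the statistic by hand as the sum over cells of (observed - expected)^2 / expected; the variable names are illustrative.

```r
# Genotype counts from the example table: rows are groups, columns are genotypes.
counts <- matrix(c(192,  59, 19,
                   167, 100,  3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Control", "Parkinson's Disease"),
                                 c("AA", "Aa", "aa")))

# Pearson chi-squared test of independence: (2 - 1) * (3 - 1) = 2 degrees of freedom.
test <- chisq.test(counts)
test$statistic   # X-squared statistic
test$parameter   # degrees of freedom
test$p.value     # approximately 6.301e-06

# The same statistic by hand. Expected counts are row total * column total / grand total.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
x2 <- sum((counts - expected)^2 / expected)
pchisq(x2, df = 2, lower.tail = FALSE)   # upper-tail p-value, same as test$p.value
```

For a table larger than 2 x 2, chisq.test applies no continuity correction, so the manual calculation and the built-in test agree. Calling fisher.test(counts) on the same matrix runs the exact alternative.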

To get started

To work on your local machine:

  • Download the following RData file here.
  • This data file contains raw data for chromosome 11 imported into R data structures.
  • If you are using the R GUI, click on File → Load Workspace and select ps1.RData

To work on our server do the following:

  • Use your terminal window to connect to bmi217compute.stanford.edu
  • Use your sunetid and password to login.
  • Run the following command without the quotes: “cp /data/shared/ps1.RData .”
  • Run the following command without the quotes: “R”
  • Type into the prompt: load("ps1.RData") (use plain straight quotes; R does not accept curly quotes)
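After loading a workspace, it helps to check which objects it contains before writing any analysis code. The sketch below uses a small stand-in file rather than ps1.RData, so the object name `demo_genotypes` is invented and says nothing about what the real file holds.

```r
# Stand-in for ps1.RData so the sketch runs anywhere; demo_genotypes is made up.
demo_genotypes <- matrix(c("AA", "Aa", "aa", "AA"), nrow = 2)
rdata_path <- file.path(tempdir(), "demo.RData")
save(demo_genotypes, file = rdata_path)
rm(demo_genotypes)

# load() invisibly returns the names of the objects it restored.
loaded <- load(rdata_path)
print(loaded)

# str() gives a compact summary of an object's type and dimensions.
str(get(loaded[1]))
```

With the real file, ls() after load("ps1.RData") lists the same names.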

Questions

1. Focus on the genotyping data available for just chromosome 11 (chr11). Ignore the other chromosomes, for simplicity. Using all the control and affected individuals, calculate for each SNP locus the number of individuals having each of the three possible genotypes (“AA”, “Aa”, and “aa”). At each locus, use chi-squared testing to determine whether the genotype distribution differs significantly between Parkinson’s Disease individuals and control individuals. List the top ten SNP loci on chromosome 11 associated with Parkinson’s Disease, ordered by chi-squared test p-value. (45 pts)

2. Why is chi-squared an appropriate statistic to use for this analysis? (5 pts)

3. Draw out a representative chi-squared table for the SNP locus (rs3741411) on chromosome 11, manually calculate the X^2 statistic and use R to get the p-value. Show all work. (5 pts)

4. How many SNP loci have a p-value of < 0.05? (5 pts)

Extra Credit: Carry out the analysis above on all 22 autosomes, i.e. excluding X and Y (data can be obtained using the links above). List the top ten SNP loci associated with Parkinson's Disease, ordered by chi-squared test p-value. (10 pts)
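Once a p-value is available for every locus, ranking and thresholding take only a few lines. The sketch below uses invented p-values and placeholder locus names purely to show the mechanics; none of the numbers are real results.

```r
# Made-up p-values for illustration only; rsA..rsE are placeholder names.
p_values <- c(rsA = 0.20, rsB = 0.003, rsC = NA, rsD = 0.04, rsE = 0.65)

# sort() orders ascending and drops NA values by default.
ranked <- sort(p_values)
head(ranked, 10)                     # the smallest p-values (here only four exist)

# Count loci below the threshold, ignoring loci without a p-value.
sum(p_values < 0.05, na.rm = TRUE)
```

Keeping the locus identifiers as names on the vector means the ranking carries its labels along.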

Things to note:

  • Some SNPs could not be measured in some individuals. Ignore these missing measurements when computing the statistics.
  • Some individuals were ignored in the publication, resulting in slightly different statistics. You do not need to eliminate these individuals, and can instead use all the individuals provided in the files.
  • Your statistical results may not exactly match those in the publication. That’s ok. We are simplifying this problem.
  • For this problem set, we are ignoring the specific role of genetics at each locus, such as additive, dominant and recessive genetic models. If you do not know what these are, look them up.
  • For now, we are not compensating for the multiple tests and hypotheses we are studying.
  • Try to use apply instead of a for loop for faster running time. Use ?apply or help(apply) to learn more about it.
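The notes on missing measurements and on apply can be illustrated together. The sketch below assumes a toy matrix with one row per SNP locus and one column per individual, with missing calls stored as NA. The real data structures in ps1.RData may be organized differently, and `count_genotypes` is a hypothetical helper.

```r
# Toy genotype matrix: rows are SNP loci, columns are individuals; NA = missing call.
geno <- matrix(c("AA", "Aa", NA,   "aa", "AA", "Aa",
                 "AA", "AA", "Aa", NA,   "AA", "AA"),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("snp1", "snp2"), NULL))

# Hypothetical helper: count each genotype at one locus, skipping missing calls.
count_genotypes <- function(calls) {
  sapply(c(AA = "AA", Aa = "Aa", aa = "aa"),
         function(g) sum(calls == g, na.rm = TRUE))
}

# apply() with MARGIN = 1 calls the helper once per row (locus) and returns
# one column per locus, so transpose to get one row per locus.
counts <- t(apply(geno, 1, count_genotypes))
print(counts)
```

The same pattern extends to a helper that builds the case-control table for a locus and returns its test p-value.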

Submission

Three steps:

1. Create a directory containing the following files:

  • All code/scripts. We need to be able to run your code! Make sure we will be able to do so.
  • A file called “readme.txt” explaining your technical code details. Write down exactly how to run your code. If you used libraries that we should install, note them here.
  • A PDF file called “ps1.pdf”, which is a summary of your approach (1 page max, 12pt Arial font, single spaced) and answers to our questions. This is where you explain your work, which is important for assigning partial credit. Free tools exist for converting Word documents to PDF like http://www.zamzar.com/

2. Zip the directory into one file called ps1_your_sunet_id.zip.

3. Email the ps1_your_sunet_id.zip file to bmi217submit@gmail.com

Grade breakdown

You will need to explain your work.

  • 20 pts for well-commented working code. You can get partial credit for partially working code or non-working code that is well-commented.
  • 5 pts for readme.txt that clearly, concisely describes how to run your code.
  • 75 pts for ps1.pdf, for a clear, concise summary (15 pts) of what you did and answers to the questions (60 pts)

Collaboration policy

You can talk with others about this problem set, but to be fair to online students, you must not compare answers. You must submit your own individual work.

 
public/pset1_win0809.txt · Last modified: 2009/01/22 10:55 by ecoronap
 