Read the publication by Renee Miller, et al. Dysregulation of Gene Expression in the 1-Methyl-4-Phenyl-1,2,3,6-Tetrahydropyridine-Lesioned Mouse Substantia Nigra. The Journal of Neuroscience 2004; 24:7445-7454. In this publication the authors used a chemical (MPTP) to induce the degeneration of specific neurons in mice, as a model of Parkinson’s Disease. The authors measured genome-wide changes in expression using microarrays. The data for this time-series study are publicly available at the NCBI Gene Expression Omnibus (GEO) at http://www.ncbi.nlm.nih.gov/geo/ under DataSet accession number GDS2053. When viewing the dataset, download the DataSet SOFT file. There are 12 microarrays, but the ones of interest for us are from the control group (4 microarrays) and from 7 days after treatment (4 microarrays). Use the GEO website to determine which samples correspond to those two conditions.
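The expression values sit in a tab-delimited table inside the SOFT file. The following is a minimal Python sketch for pulling that table out. It assumes the table is bracketed by `!dataset_table_begin` and `!dataset_table_end` marker lines and starts with a header row (probe identifier, gene identifier, then one column per sample); check this against the file you actually download. The function name `read_gds_table` and the embedded example text are made up for illustration and are not real GDS2053 values.

```python
import io

def read_gds_table(handle):
    """Return (header, rows) for the data table of a GDS SOFT file.

    Assumes the table lies between '!dataset_table_begin' and
    '!dataset_table_end' and is tab-delimited (an assumption to verify).
    """
    header, rows, in_table = None, [], False
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("!dataset_table_begin"):
            in_table = True
        elif line.startswith("!dataset_table_end"):
            break
        elif in_table:
            fields = line.split("\t")
            if header is None:
                header = fields      # first table line holds column names
            else:
                rows.append(fields)  # values stay as strings here
    return header, rows

# Made-up miniature file in the assumed layout.
example = io.StringIO(
    "^DATASET = GDS0000\n"
    "!dataset_table_begin\n"
    "ID_REF\tIDENTIFIER\tGSM_A\tGSM_B\n"
    "100000_at\tGeneX\t10.5\t11.0\n"
    "100001_at\tGeneY\tnull\t7.2\n"
    "!dataset_table_end\n"
)
header, rows = read_gds_table(example)
```

Values are left as strings so that you can decide explicitly how to treat missing entries before converting to numbers.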
Comparing two groups of 4 microarrays each is not ideal (one would want more measurements than this), but it will suffice for this problem set. One commonly used way to test whether each gene is differentially expressed between the two groups is the Significance Analysis of Microarrays (SAM) method. To understand the SAM method, you will want to read Virginia Tusher, et al. Significance Analysis of Microarrays Applied to the Ionizing Radiation Response. Proceedings of the National Academy of Sciences 2001; 98:5116-5121. You will also want to use the SAM software (http://www-stat.stanford.edu/~tibs/SAM), which has both R and Excel implementations. You should read the SAM help files and examples for instructions.
One of the challenges in using statistical tests on data from gene expression microarrays is the question of where to set the significance threshold, especially given the number of comparisons being made across all the genes on the microarray. One solution to this problem is to control for the False Discovery Rate, or the expected proportion of false predictions being made in the set of predictions. For example, when comparing one group of microarrays with another, one sure way to generate false predictions is to shuffle the group assignments of the microarrays.
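The shuffling idea can be sketched in a few lines of Python. This is an illustration of permutation-based FDR estimation only, not the SAM algorithm: it uses a plain absolute difference of group means instead of SAM's moderated statistic, and the function names, toy data, and threshold are all made up. For each relabeling of the 8 arrays into two groups of 4, it counts how many genes pass the threshold; the median of those counts serves as the expected number of false calls.

```python
import itertools
import statistics

def mean_diff(values, group_a, group_b):
    """Difference of group means for one gene."""
    a = [values[i] for i in group_a]
    b = [values[i] for i in group_b]
    return statistics.mean(a) - statistics.mean(b)

def count_calls(data, group_a, group_b, threshold):
    """Number of genes whose |mean difference| reaches the threshold."""
    return sum(
        1 for values in data.values()
        if abs(mean_diff(values, group_a, group_b)) >= threshold
    )

def permutation_fdr(data, group_a, group_b, threshold):
    """Estimate FDR as (median calls under shuffled labels) / (observed calls)."""
    samples = sorted(group_a + group_b)
    observed = count_calls(data, group_a, group_b, threshold)
    null_counts = []
    # Every way of relabeling the arrays (includes the original labeling).
    for perm_a in itertools.combinations(samples, len(group_a)):
        perm_b = [i for i in samples if i not in perm_a]
        null_counts.append(count_calls(data, list(perm_a), perm_b, threshold))
    expected_false = statistics.median(null_counts)
    fdr = min(expected_false / observed, 1.0) if observed else 0.0
    return observed, expected_false, fdr

# Toy data: gene name -> 8 expression values (4 control, then 4 treated).
toy = {
    "up":    [1, 1, 1, 1, 5, 5, 5, 5],
    "flat":  [2, 2, 2, 2, 2, 2, 2, 2],
    "noise": [1, 3, 1, 3, 1, 3, 1, 3],
}
observed, expected_false, fdr = permutation_fdr(
    toy, [0, 1, 2, 3], [4, 5, 6, 7], threshold=3
)
```

With two groups of 4 there are only 70 distinct relabelings, which is one reason such small group sizes limit how finely the FDR can be estimated.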
In the human genome-wide association study (Problem Set 1), you found that multiple genes can be found having statistically significant associations with Parkinson’s Disease. These genes may play a role in a multitude of ways. For one silly (though plausible) example, one of these genes may predispose individuals to imbibe MPTP, thus indirectly leading to Parkinson’s Disease. Other genes may play a more direct causal role in the area of the brain involved in Parkinson’s Disease. Here, we want you to integrate the findings between the two analyses you have performed, searching for genes significantly associated with Parkinson’s Disease in humans, and significantly different in the brain of a mouse model of Parkinson’s Disease. One would perform this kind of integration to find those genes that are potentially more directly or functionally associated with a disease.
We are providing three additional translation tables to do this. They are located at:
/afs/ir.stanford.edu/class/biomedin217/WWW/pset2
The first file, called translate_affy_geneid.txt, starts with:
id          GeneID
100000_at   101118
100001_at   12502
100002_at   16426
100003_at   20190
...
The first column indicates a probe-set identifier on the Affymetrix microarray used in the Parkinson’s Disease experiment above. The second column indicates an NCBI Gene identifier, directly usable on the website http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene. These identifiers are unique to a specific gene in a specific species. For example, probe “100000_at” is a measurement of the gene Tmem168 (Gene 101118). Note the duplicates in this table and consider how you will have to deal with them. There are multiple ways to deal with duplicates. Choose one way and justify your choice in the methods section of your PDF report.
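As one illustration of handling duplicates, the Python sketch below averages a per-probe value over all probes that map to the same gene. Averaging is only one of several defensible choices (others include keeping the most extreme probe or discarding ambiguous probes), and it is not prescribed by this problem set. The function names, the whitespace-delimited file layout, the probe `hypothetical_dup_at`, and the per-probe values are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

def load_pairs(lines, has_header=True):
    """Parse two-column, whitespace-delimited lines into (key, gene_id) pairs."""
    pairs = []
    for n, line in enumerate(lines):
        fields = line.split()
        if len(fields) != 2 or (has_header and n == 0):
            continue  # skip the header row and malformed lines
        pairs.append((fields[0], fields[1]))
    return pairs

def collapse_by_gene(pairs, probe_values):
    """Average the values of all probes that map to the same gene."""
    by_gene = defaultdict(list)
    for probe, gene in pairs:
        if probe in probe_values:
            by_gene[gene].append(probe_values[probe])
    return {gene: mean(values) for gene, values in by_gene.items()}

lines = [
    "id GeneID",
    "100000_at 101118",
    "100001_at 12502",
    "hypothetical_dup_at 12502",  # made-up second probe for gene 12502
]
# Made-up per-probe scores (e.g. a test statistic per probe set).
probe_values = {"100000_at": 1.0, "100001_at": 2.0, "hypothetical_dup_at": 4.0}
gene_values = collapse_by_gene(load_pairs(lines), probe_values)
```

Whichever rule you choose, apply it consistently and state it in your methods section.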
The second file, called translate_rsid_geneid.txt, starts with:
rs243   7093
rs541   5214
rs542   5214
rs543   5214
...
The first column indicates a SNP locus id, used in the genome-wide association study above. The second column indicates an NCBI Gene identifier. For example, SNP locus id rs243 represents a SNP near or within the gene TLL2 (Gene 7093). Note the duplicates in this table and consider how you will have to deal with them; this is a many-to-many table.
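Because the table is many-to-many, a gene can inherit p-values from several SNPs. The Python sketch below shows one possible rule, keeping the smallest p-value among a gene's SNPs. This is an assumption for illustration rather than a required method, and note that it favors genes covered by many SNPs. The function name and the p-values are made up; the SNP and gene identifiers are the ones shown above.

```python
def best_p_per_gene(snp_gene_pairs, snp_pvalues):
    """Map each gene to (smallest p-value, SNP that produced it)."""
    best = {}
    for snp, gene in snp_gene_pairs:
        p = snp_pvalues.get(snp)
        if p is None:
            continue  # SNP was not tested in the association study
        if gene not in best or p < best[gene][0]:
            best[gene] = (p, snp)
    return best

pairs = [("rs243", "7093"), ("rs541", "5214"), ("rs542", "5214"), ("rs543", "5214")]
# Made-up chi-squared p-values; rs543 is deliberately left untested.
pvalues = {"rs243": 0.2, "rs541": 0.5, "rs542": 0.01}
best = best_p_per_gene(pairs, pvalues)
```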
The third file, called translate_hid_geneid.txt, starts with:
HID   GeneID
3     34
3     11364
5     37
5     11370
...
The first column indicates a HomoloGene identifier (HID), which represents genes that are similar across different species. The second column indicates a human or mouse NCBI Gene identifier. For example, human ACADM (Gene 34) is significantly similar (i.e. orthologous) to mouse Acadm (Gene 11364) when their DNA sequences are compared. You can use this table to “convert” gene identifiers between species. Again, note the duplicates in this table (3 or more GeneIDs for a single HID) and consider how you will have to deal with them.
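The conversion between species can be sketched in Python as a lookup through shared HIDs. The table itself does not say which species a GeneID belongs to, so this sketch returns every other gene in the same homology group and leaves it to the caller to keep only identifiers from the target species (for example, those that also appear in the SNP table). The function names are hypothetical; the rows are the four shown above.

```python
from collections import defaultdict

def build_homolog_index(hid_gene_pairs):
    """Index the table in both directions: HID -> genes and gene -> HIDs."""
    hid_to_genes = defaultdict(set)
    gene_to_hids = defaultdict(set)
    for hid, gene in hid_gene_pairs:
        hid_to_genes[hid].add(gene)
        gene_to_hids[gene].add(hid)
    return hid_to_genes, gene_to_hids

def homologs(gene, hid_to_genes, gene_to_hids):
    """All other genes sharing at least one HID with the given gene."""
    found = set()
    for hid in gene_to_hids.get(gene, ()):
        found |= hid_to_genes[hid]
    found.discard(gene)
    return found

pairs = [("3", "34"), ("3", "11364"), ("5", "37"), ("5", "11370")]
hid_to_genes, gene_to_hids = build_homolog_index(pairs)
human_candidates = homologs("11364", hid_to_genes, gene_to_hids)
```

When an HID has three or more GeneIDs, `homologs` returns more than one candidate, which is exactly the duplicate case you need a rule for.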
1. Using SAM, find the genes that are significantly increased and decreased, at a median false discovery rate (FDR) of no worse than 0.05. You will need to examine the delta table and tweak the delta parameter to achieve the right FDR. Submit the top 20 upregulated and top 20 downregulated genes as well as their q-values. You may use the default seed of 100 for samr. (25 pts)
2. Of those genes having a q-value under 0.05 by SAM analysis of the mouse microarray data, which gene has the best p-value on the chi-squared test of the human genome-wide association data from chromosome 11 (Problem Set 1)? (5 pts)
(In this problem set, you can ignore the role of haplotype blocks. In reality, these SNP loci represent more than a single gene and in fact cover a region of the chromosome. A significant SNP could be in linkage disequilibrium with another gene that is significantly differentially expressed.)
3. What does a q-value represent? Recall your results from Problem Set 1; why would one want to use a q-value as opposed to a p-value to assess significance? (5 pts)
4. Why does it not make sense to run SAM across microarray experiments from different datasets? What can you do to enable such an analysis? (5 pts)
5. Gene Ontology (GO) is an increasingly popular structured vocabulary used to describe the functions of proteins. Genes may have zero, one, or more Gene Ontology categories assigned, depending on what is known about the function of their proteins. For any given Gene Ontology category, some genes have been assigned to that category, and a larger number have not. This implicitly defines a ratio: the proportion of genes in the genome (or in another relevant large baseline collection of genes) assigned to the category. For a new set of genes, such as those resulting from a bioinformatics analysis, the ratio seen for this category can be tested against the ratio seen in the baseline set of genes using the hypergeometric distribution. Of those genes that are significantly different by microarray with q-value < 0.10 and that have SNPs with p-values < 0.10, which Gene Ontology category is most over-represented according to the hypergeometric distribution? This translation table between NCBI Gene identifiers and GO categories will be useful: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz (20 pts)
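The over-representation test can be sketched with the standard library alone. With N genes in the baseline, K of them annotated to the category, n genes in your selected set, and k of those annotated, the p-value is the upper tail P(X >= k) of the hypergeometric distribution. The function name and the numbers in the example are made up; choosing the baseline set (N and K) is part of the exercise and is not settled by this sketch.

```python
from math import comb

def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) when drawing n genes from N, of which K are in the category."""
    total = comb(N, n)
    tail = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(n, K) + 1)
    )
    return tail / total

# Made-up numbers: 20 baseline genes, 5 in the category,
# 4 selected genes, 2 of which are in the category.
p = hypergeom_upper_tail(N=20, K=5, n=4, k=2)
```

Because this test is repeated for every GO category, the resulting p-values face the same multiple-comparison issue discussed earlier for genes.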
Extra Credit: Run SAM on all combinations of GEO Data Set (GDS) subsets annotated with time for GDS2053 (i.e. control vs. day 1, control vs. day 7, day 1 vs. day 7). Are there genes that are significantly increased and decreased, at a median false discovery rate (FDR) of no worse than 0.05 that are common in all three comparisons? Two comparisons? If so, what are they? What conclusions can you draw from this result? (10 pts)
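Once each comparison has produced its list of significant genes, the overlap question reduces to set arithmetic. The Python sketch below assumes one set of gene identifiers per comparison; the function name, comparison labels, and gene names are hypothetical. Run it separately for increased and decreased genes if you want the direction of change to match.

```python
from collections import Counter

def overlap_summary(sig_sets):
    """Return (genes found in every comparison, genes found in exactly two)."""
    counts = Counter(
        gene for genes in sig_sets.values() for gene in set(genes)
    )
    in_all = {g for g, c in counts.items() if c == len(sig_sets)}
    in_two = {g for g, c in counts.items() if c == 2}
    return in_all, in_two

# Made-up significant gene sets for the three comparisons.
sig_sets = {
    "control_vs_day1": {"geneA", "geneB", "geneC"},
    "control_vs_day7": {"geneA", "geneC", "geneD"},
    "day1_vs_day7":    {"geneA", "geneE"},
}
in_all, in_two = overlap_summary(sig_sets)
```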
Three steps:
1. Create a directory containing the following files:
2. Zip the directory into one file called ps2_your_sunet_id.zip.
3. Email the ps2_your_sunet_id.zip file to bmi217submit@gmail.com
You will need to explain your work.
You can talk with others in the class about this problem set, but may not compare answers. You must turn in your own individual work.