Problem Set 4

  • Released Feb 2, 2009
  • Due 5:00 PM (Pacific Time) - February 9, 2009

Introduction

PART I

Personal Genomics: James Watson's Genome

James Watson and Craig Venter have had their genomes sequenced and released to the public. At the 58th Annual ASHG (American Society of Human Genetics) conference, Craig Venter recently expressed frustration that he has not learned as much about his genome as he thought he would. Despite the various GWAS publications to date, it is still much easier to predict how tall a child will be as an adult by looking at the parents than by genotyping. Read the following paper, which focuses on what we can't learn from having one's genome sequenced.

http://www.nature.com/news/2008/081105/full/456018a.html

Data required for this problem set can be downloaded here: http://biomedin217.stanford.edu/pset4/

James Watson's genome in its original format, along with more data, can be downloaded here: http://jimwatsonsequence.cshl.edu/cgi-perl/gbrowse/jwsequence/

The original format in which the SNPs are given is not very intuitive, so I converted it to a more manageable format for your convenience. Below is a description of the converted file.

  • watson.txt
  • Col 1: rs_number of SNP
  • Col 2: allele_1 (A,G,T, or C)
  • Col 3: allele_2 (A,G,T, or C)

In order to investigate James Watson's genome, we have supplied a database, which can be downloaded here, that links specific alleles of SNPs to increased susceptibility to disease. Below is a description of this diseases database.

  • diseases.csv
  • Col 1: Name of the disease or condition, hereafter called disease.
  • Col 2: SNP the disease is associated with.
  • Col 3: Allele that increases susceptibility to disease
  • Col 4: The “other” allele which does not increase susceptibility to disease
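The two file layouts above can be read with a few lines of Python. This is only a minimal sketch under stated assumptions: it assumes watson.txt is whitespace-delimited, diseases.csv is comma-separated, and neither file has a header row (check the actual files before relying on this). The function names and the sample rows are made up for illustration.

```python
import csv
import io


def load_genotypes(handle):
    """Map rs_number -> (allele_1, allele_2) from a watson.txt-style file."""
    genotypes = {}
    for line in handle:
        fields = line.split()  # assumption: columns separated by whitespace
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        rs_number, allele_1, allele_2 = fields
        genotypes[rs_number] = (allele_1.upper(), allele_2.upper())
    return genotypes


def load_associations(handle):
    """Return (disease, rs_number, risk_allele, other_allele) tuples."""
    associations = []
    for row in csv.reader(handle):  # assumption: comma-separated, no header
        if len(row) != 4:
            continue  # skip blank or malformed lines
        disease, rs_number, risk, other = (field.strip() for field in row)
        associations.append((disease, rs_number, risk.upper(), other.upper()))
    return associations


# Made-up sample rows standing in for the real files.
SAMPLE_WATSON = "rs0000001\tA\tA\nrs0000002\tC\tT\n"
SAMPLE_DISEASES = "example disease,rs0000001,A,G\n"

genotypes = load_genotypes(io.StringIO(SAMPLE_WATSON))
associations = load_associations(io.StringIO(SAMPLE_DISEASES))
```

For the real data, pass open file handles (for example `open("watson.txt")`) in place of the `io.StringIO` samples.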

PART II

Towards Chemoinformatics Approaches to Disease

In a rare stroke of bad luck, Watson is suddenly afflicted with a variety of diseases after a whirlwind tour of the world. He is stranded on a tropical island, where he discovers mines full of small molecules, his only hope.

Some small molecules have surfaced as important players in disease. For example, astemizole, a small molecule, has recently been shown to be an antimalarial drug (link).

Because of the sudden onset of disease, as well as the shock from being stranded, Watson is too weak to mine more than 1 small molecule. You must determine which molecule he must mine.

In your toolbox is PubChem. The PubChem database is managed by the National Center for Biotechnology Information (NCBI), under the NLM/NIH. It is part of the Molecular Libraries Initiative (MLI), which aims to further research in basic biomedicine, as well as basic research in public health and therapeutics. An article accompanying the initiative was published in Science in 2004.

PubChem contains 3 databases: PubChem Substances, PubChem Compounds, and PubChem BioAssays. Each BioAssay screens a panel of small molecules from a standardized catalog for activity within a particular assay (e.g., an assay for binding to protein A measures how much affinity (activity) each small molecule has for protein A). The following problems involve BioAssay data from PubChem.

You are given the following information:

Watson is diagnosed with:

  • cancer
  • diabetes
  • HIV
  • neuronal loss

As far-fetched as it may seem, our protagonist has no choice but to mine the small molecules, with the hope that one precious drug may keep him alive until help arrives (which may be months, even years).

You are given the following tables, a subset of PubChem, in the bmi217_winter0809 database on bmi217compute:

diseases, which has 3 columns:

  • AID - This is the unique identifier of each BioAssay
  • SUMMARY - A brief summary of the BioAssay experiment
  • DISEASE - the disease with which this BioAssay is associated

substances, which has 6 columns:

  • PUBCHEM_SID - This is the number that we will be using to identify our substances
  • PUBCHEM_CID - The compound number, we will not use this for the pset
  • PUBCHEM_ACTIVITY_OUTCOME - we define 2 as active, 1 means no activity
  • PUBCHEM_ACTIVITY_SCORE - a score that quantifies activity; most values are between 0 and 100, where 100 is very high activity.
  • PUBCHEM_ASSAYDATA_COMMENT - Comments stored by the experimenters who developed the assay
  • AID - this column matches AID in the diseases table
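To experiment with queries before connecting to the course database, the two tables can be mimicked locally. The sketch below uses Python's built-in sqlite3 module as a stand-in, which is an assumption: the course server may run a different database engine, and the column types and toy rows here are invented for illustration. Only the table and column names come from the description above.

```python
import sqlite3

# In-memory stand-in for the course tables; column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE diseases (AID INTEGER, SUMMARY TEXT, DISEASE TEXT);
    CREATE TABLE substances (
        PUBCHEM_SID INTEGER,
        PUBCHEM_CID INTEGER,
        PUBCHEM_ACTIVITY_OUTCOME INTEGER,
        PUBCHEM_ACTIVITY_SCORE INTEGER,
        PUBCHEM_ASSAYDATA_COMMENT TEXT,
        AID INTEGER
    );
    """
)

# Made-up toy rows, not real PubChem data.
conn.executemany(
    "INSERT INTO diseases VALUES (?, ?, ?)",
    [(1, "toy assay one", "cancer"), (2, "toy assay two", "HIV")],
)
conn.executemany(
    "INSERT INTO substances VALUES (?, ?, ?, ?, ?, ?)",
    [
        (101, 9001, 2, 85, "", 1),  # active in assay 1
        (101, 9001, 1, 0, "", 2),   # inactive in assay 2
        (102, 9002, 2, 90, "", 2),  # active in assay 2
    ],
)

ACTIVE = 2  # per the column description: 2 is active, 1 is no activity

# Join each active substance record to the disease of its assay via AID.
active_rows = conn.execute(
    """
    SELECT s.PUBCHEM_SID, d.DISEASE, s.PUBCHEM_ACTIVITY_SCORE
    FROM substances AS s
    JOIN diseases AS d ON d.AID = s.AID
    WHERE s.PUBCHEM_ACTIVITY_OUTCOME = ?
    ORDER BY s.PUBCHEM_SID
    """,
    (ACTIVE,),
).fetchall()
```

The join on AID is the key step: it is what ties a substance's activity record to the disease named in the diseases table.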

During your quest you will be building disease-substance networks. One way to visualize network data is Cytoscape, which was introduced in 2003 as an easy but powerful visualization tool. You will also be asked whether or not your network follows a power-law (scale-free) distribution. Please read the following article to learn more about different network properties.

Questions

Answer the following questions:

PART I

1. Genome-wide association studies (GWAS) produce plenty of data associating SNPs with diseases. The file “diseases.csv” contains a list of SNPs associated with diseases. This file also indicates which allele of each SNP is the “risky” allele (the one associated with the disease). For how many diseases is James Watson homozygous for the risky allele? (10 pts)

2. For how many diseases is James Watson homozygous for the “other” non-risky allele? (8 pts)
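The genotype comparison behind questions 1 and 2 can be sketched as follows. This is one possible approach, not a reference solution: it assumes both files report alleles on the same strand, and the function names, labels, and sample data are hypothetical.

```python
def classify_genotype(genotype, risk_allele, other_allele):
    """Label a two-allele genotype relative to a SNP's risk/other alleles."""
    allele_1, allele_2 = genotype
    if allele_1 == allele_2 == risk_allele:
        return "homozygous_risk"
    if allele_1 == allele_2 == other_allele:
        return "homozygous_other"
    if {allele_1, allele_2} == {risk_allele, other_allele}:
        return "heterozygous"
    return "unknown"  # e.g. an allele listed in neither column


def diseases_by_class(genotypes, associations):
    """Group distinct disease names by the genotype class at their SNP."""
    grouped = {}
    for disease, rs_number, risk, other in associations:
        if rs_number not in genotypes:
            continue  # SNP not present in the genome file
        label = classify_genotype(genotypes[rs_number], risk, other)
        grouped.setdefault(label, set()).add(disease)
    return grouped


# Made-up sample data for illustration only.
sample_genotypes = {"rs1": ("A", "A"), "rs2": ("C", "T"), "rs3": ("G", "G")}
sample_associations = [
    ("disease X", "rs1", "A", "G"),
    ("disease Y", "rs2", "T", "C"),
    ("disease Z", "rs3", "A", "G"),
    ("disease W", "rs9", "A", "G"),  # SNP missing from the genome
]
sample_groups = diseases_by_class(sample_genotypes, sample_associations)
```

Because diseases are collected into sets, a disease linked to several SNPs is counted once per class. Whether that matches the intended reading of "how many diseases" is a judgment call worth stating in your write-up.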

3. Are these results what you expected? Why or why not? (5 pts)

4. Should James Watson be worried about contracting the diseases for which he is homozygous for the “risky” allele? Generally speaking, is it significantly more likely that he will get these diseases if he has the risky alleles for them? (5 pts)

Extra Credit:

Does Watson have homozygous “risky” alleles for the same disease at different SNPs? If so, what are these diseases, and should he be concerned about getting them? (5 pts)

PART II

5. How many unique substances in total display some sort of activity with a disease? (2 pts)

6. Return a list of the PUBCHEM_SIDs that are (1) active and (2) associated with the highest number of distinct diseases from their assays. From your query, what is the largest number of distinct diseases associated with one substance? (5 pts)

7. Do any of our drugs cover exactly the diseases Watson was diagnosed with above? If so, list them. Which experiments (AID numbers) were they from? Obtain the structure and real name of these three substances by searching for them in PubChem. Also from this search, how many times has each been found to be active in a BioAssay? Inactive? (10 pts)

8. Create a network by making a list of associations between substance IDs and diseases. Only pick active substances that were tested with the disease. Thus, your nodes will be disease and substance IDs, and your edges are an indication of activity. Use only connections with an activity score of 80 or above. If you have more than one edge between a disease and substance, take the averaged score for the edge weight. How many nodes total do you have? Edges? (5 pts)
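The edge-building rule in question 8 could be sketched like this. It is one interpretation under stated assumptions: the input is a list of (substance ID, disease, outcome, score) records already joined from the two tables, and the average is taken over the records that pass the score threshold (averaging over all active records first would be a different, equally defensible reading). The function name and records are made up.

```python
from collections import defaultdict

ACTIVE = 2  # outcome code for "active", per the table description


def build_network(records, min_score=80):
    """Return (nodes, edges) for a substance-disease network.

    edges maps (sid, disease) to the mean score of its qualifying records.
    """
    scores = defaultdict(list)
    for sid, disease, outcome, score in records:
        if outcome == ACTIVE and score >= min_score:
            scores[(sid, disease)].append(score)

    # Collapse repeated substance-disease pairs into one averaged edge.
    edges = {pair: sum(vals) / len(vals) for pair, vals in scores.items()}

    # Tag node kinds so a substance ID can never collide with a disease name.
    nodes = {("substance", sid) for sid, _ in edges}
    nodes |= {("disease", disease) for _, disease in edges}
    return nodes, edges


# Made-up records: (PUBCHEM_SID, DISEASE, outcome, score)
toy_records = [
    (101, "cancer", 2, 80),
    (101, "cancer", 2, 90),  # duplicate pair, averaged with the row above
    (101, "HIV", 2, 70),     # active but below the score threshold
    (102, "HIV", 2, 95),
    (103, "HIV", 1, 99),     # high score but not active
]
toy_nodes, toy_edges = build_network(toy_records)
```

Writing the edges out as tab-separated lines of substance, disease, and weight gives a simple table that network tools can generally import.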

9. We wish to analyze our interaction network to determine whether or not it follows a power-law distribution. For more information, read this article. For your network in problem 8, a node is defined as a substance or disease. An edge is an interaction with activity = 2 and a score of 80 or above. From your results in problem 8, determine whether your edges make a scale-free network. Give us any graphs you make, as well as any analysis (hint: linear regression). (10 pts)
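The linear-regression hint can be illustrated with a small sketch. In a scale-free network the fraction of nodes with degree k falls off roughly as a power of k, so log P(k) plotted against log k is close to a straight line with negative slope. The code below computes the degree distribution from an edge list and fits that line by ordinary least squares; the function names are hypothetical, and a good fit is suggestive evidence rather than proof of a power law.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Return {degree k: fraction of nodes with degree k} for an edge list."""
    degrees = Counter()
    for node_a, node_b in edges:
        degrees[node_a] += 1
        degrees[node_b] += 1
    counts = Counter(degrees.values())
    total = sum(counts.values())
    return {k: n / total for k, n in sorted(counts.items())}


def fit_log_log(distribution):
    """Least-squares fit of log10 P(k) = slope * log10 k + intercept."""
    xs = [math.log10(k) for k in distribution]
    ys = [math.log10(p) for p in distribution.values()]
    if len(set(xs)) < 2:
        raise ValueError("need at least two distinct degrees to fit a line")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


# Toy star graph: one hub connected to three leaves.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
star_distribution = degree_distribution(star)

# Synthetic distribution proportional to k**-2, to sanity-check the fit.
synthetic = {1: 16 / 21, 2: 4 / 21, 4: 1 / 21}
slope, intercept, r_squared = fit_log_log(synthetic)
```

On the synthetic distribution the fitted slope comes out at about -2, the exponent it was built with. On real data, report the slope and the R² value alongside a log-log plot of the degree distribution.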

Extra credit:

Plot your network by importing your data from (8) into Cytoscape. Please give us two plots of your choice that will clearly show us characteristics of the data. (5 pts)

Submission

Three steps:

1. Create a directory containing the following files:

  • All code/scripts. We need to be able to run your code! Make sure we will be able to do so.
  • A file called “readme.txt” explaining your technical code details. Write down exactly how to run your code. If you used libraries that we should install, note them here.
  • A PDF file called “ps4.pdf”, which is a summary of your approach (1 page max, 12pt Arial font, single spaced) and answers to our questions. This is where you explain your work, which is important for assigning partial credit. Free tools for converting Word documents to PDF exist, such as http://www.zamzar.com/

2. Zip the directory into one file called ps4_your_sunet_id.zip.

3. Email the ps4_your_sunet_id.zip file to bmi217submit@gmail.com.

Grade breakdown

You will need to explain your work.

  • 20 pts for well-commented working code. You can get partial credit for partially working code or non-working code that is well-commented.
  • 5 pts for readme.txt that clearly, concisely describes how to run your code.
  • 75 pts for ps4.pdf, for a clear, concise summary of what you did and answers to the questions.

Collaboration policy

You can talk with others in the class about this problem set, but you must turn in your own individual work and may not compare answers.

 
public/pset4_win0809.txt · Last modified: 2009/02/07 13:27 by tjchen
 