Welcome to BMI 217 (Winter 2008-2009). Please post questions in this space
A. Should be up now.
-Erik
Q. A few of you have asked about how to run items on the server, and leave them running even if you're logged off.
A. Use screen. To make a new screen when logged into bmi217compute, type “screen”. This will place you in a session that even if you are disconnected, will still remain. To detach the screen so you can work with it later, press ctrl + a + d. To resume a screen, type screen -r. If you have multiple, type screen -r <thenameofthescreen>. For more information about screen, see here.
-Tiffany
Q. Where do we get the PubChem tables referenced in the homework? Are we supposed to download the entire database from the PubChem website? - Douglas Greiman
A. They are available in the bmi217_winter0809 database in the MySQL server on bmi217compute.stanford.edu.
-Erik
Q. In question 1. Should the homozygosity be for every locus asociated with a disease or is it enough for even one locus to be homozygous? -Sarim
A. One locus is enough.
-Erik
Q. Similarly, for problem 2 should we count every disease for which Watson has at least one homozygous “other” allele? In other words, is it fine to potentially count the same disease for 1 and 2? If not, how should we define “homozygous for the 'other' non-risky allele” (e.g. disease has at least one homozygous non-risky and no homozygous risky)?
A. It's fine if the same disease pops up in the answer to problem 1 and problem 2.
-Erik
Q. For #7, do you want the drugs that have activity for any of the conditions, or all of the conditions? -JL
A. A drug that has activity for all four of the conditions. -Tiffany
Q. Is the answer to question 4 in the readings listed?
A. The answer has been mentioned in class and can also be found in the linked article about Watson's genome in the reading for the problem set.
-Erik
Q. For Part 1 Extra Credit: Are we still looking only at homozygous alleles or should we take into account heterozygous as well?
A. They have to be homozygous as well.
-Erik
Q. For Question 8. How would you like us to deal with duplicate disease/substance pair experiments (there are a few). A) Use multiple Edges? B) Use a single Edge with an averaged score? C) Use a single Edge with the maximum score? D) Some other way?
A. Take an averaged score. This will be updated on the pset.
-Tiffany
Q. For Question 9. Could you give some more information/pointers on how to compute the 'Cumulative Distribution of Connectivities'? Is it the number of edges incident with vertexes of degree X divided by the total number of edges in the graph?
A. You will want to plot number of nodes vs. node degree on a log-log plot.
-Tiffany
Q. In the changed requirement for question 8. Do you mean that the average score should be greater than 80 in case of duplicates?
A. Yes. -Tiffany
Secondly, we are not being asked to show weights for the edges in the graph, is that correct? If so, then it shouldn't matter what exactly is the average weight as long as it is greater than 80? -sarim.
A. Yes, you do not need to show the edge weights for the edges in the graph. If you do manage to put them in and visualize your graph well, we wouldn't mind seeing it. :)
Also, why the average? Doesn't max make sense here? -JL
A. You could make an argument for the max. We think, however, that the average is a more conservative estimate. If you look more closely at how the activity measurements are made, this quantifier actually differs between experiments–it varies in how it is are determined. A better way to do it instead would be to use a normalized IC50 number, and also take into account the exact protein which is being bound, instead of completely generalizing them into larger disease classes. Here, to keep the problem set less complex (and since IC50 numbers are much rarer in pubchem) we are simplifying the problem.
Q. In #9, there are two classes of nodes where all connections between one of many substances and one of very few diseases. Should disease and substance nodes be considered as peers or separate classes in the determination of power law distribution?-JL
A. They should be considered as peers.
Q. What does a PUBCHEM_ACTIVITY_OUTCOME of 3 represent? Can those be ignored? -RB
A. Good job for finding them! Yes, ignore them. If you look at the pubchem website, they have different color codes for outcome, so there are more than 2.
-Tiffany
Q. For problem 6, is the boxplot simply supposed to show the range of hemoglobin values for the CRF vs. the Non-CRF patients? If so, R seems to automatically highlight the Median value of the set - do we need to force it to highlight the mean instead?
-RB
-Erik
Q. In problem 10, when I try to extract all of the distinct lab test names from the pat_test_histv table (i.e. “select distinct TEST_ABBR from pat_test_histv”), some of the test names are duplicates because of the way the string is formatted (for example, CARNITINE appears once as “CARNITINE” and once as ” CARNITINE”). Is there a good way to get around this problem, so that we don't end up counting the same test as two different ones?
-Erik
Q. In problem 5, is every other patient supposed to be all patients not in the CRF group from pat_test_histv or every other patient that is not in the CRF group but in both pat_test_histv and pat_fin_acct? There's a difference since in pat_test_histv, there are bunch of pat_nums that are not present in pat_fin_acct. The latter is actually easier to search.
-Erik
Q. In problem 5, some patients have never had a hemoglobin test. However, when grouping by patient to get the averages for the patients, these show up as zero averages. We do not want to include these zeros in the average of averages since they don't represent real data. How can we remove them?
-Tiffany
Q. When trying to load the tab delimited files into a local mysql database, they fail to load correctly for some tables, such as pat_fin_acct. All lines contain warnings when loaded, and are essentially blank in the database (though not in the file). When the warnings are examined, they are of the form “Warning | 1366 | Incorrect integer value: '' for column 'ACTIV_PAT_NUM' at row 1” Suggestions?
-Tiffany
Q. The yEd Graph Editor doesn't seem to parse the GraphML format exactly. The example page with attributes should show colored nodes, but when I import it into the yEd, none of the colors show up. -EC
A. Is this how you're formatting your node? This was verified to work.
<node id="FUROSEMIDE"><data key="d0"><y:ShapeNode><y:Fill color="#0000FF" transparent="false"/><y:NodeLabel>FUROSEMIDE</y:NodeLabel><y:Shape type="circle" /></y:ShapeNode></data></node>
It's formatted like that now, and yEd won't even open it now. Is there some kind of key definition that is needed? -EC
Q. I'm also having similar problems when following the GraphML primer (http://graphml.graphdrawing.org/primer/graphml-primer.html). The node structure shows up correctly but yEd doesn't seem to be interpreting the node/edge properties or labels. -RB
A. Please see answer above.
-Erik
Q. I'm working on question 10 and I'm still debugging it so I haven't run it on the entire data set yet. However, I'm worried how long my code will take to run as there are several for loops and I was wondering if I should be avoiding them like in pset1. -ML
Mine had like 3 nested for loops and it took about an hour to fully compute on the cluster. I wasn't sure if there was an easier way. It could have been done in apply but the variable passing would have been tedious. -EC
A. Using apply with a function that needs variables is tricky but is a huge time saver. Here's a small tip that might help you use apply.
Say you have a function that takes in a vector and a constant
myFxn ← function (values,someConstant){return(sum(values) + someConstant);}
vec = 1:20;
dim(vec) = c(4,5);
Say I'd like to get the sum of each row plus a constant using the 'myFxn' function. This is how to do it.
apply(vec,1,myFxn,someConstant=50);
This can eliminate a nested for loop, although you might still have to use apply within a loop.
-Erik
Q. For problem 10, is there any way to make sure you group all dosages of the same drug together? For example, there are both 10,000 U/ML and 3000 U/ML dosages of epoetin, but “select distinct medication_name from pharmacy” would count them as separate. There doesn't appear to be any unique identifying code for a particular drug in the table.
-Tiffany
Q. When doing Spearman Rank Correlation, is it important to you how we calculate significance? Is a Monte Carlo Permutation Test acceptable?
-Tiffany
-Tiffany
2. Question: Are mice the same in the control group (before treatment) and treatment groups?
-Tiffany
3. Question: I don't have Excel so maybe I'm missing something about SAM. How do I read in the *.soft data using either R or MySQL? -JL
-Tiffany
4. Question: What are the inputs that SAM uses? Can you provide a sample SAM function-call that illustrates the syntax and what inputs are required, or are there any hints/instructions available describing what command(s)/function(s) to use?? Alternatively, if no code/examples are available, it would be very helpful to see a flow chart/diagram conceptually describing the correct use of SAM in this context. Thanks!
-Tiffany
5. How do I access the soft file from bmi217compute.stanford.edu? Is it located in a folder or do I need to ftp it over?
-Tiffany
6. Is there a way to confirm my answers for Q1? I'm getting so many 0 q-value(%)?
-TiffanyI'm using delta=2.1. According to mean FDR in the delta table.
-Tiffany
7. Are we supposed to use the top 10 p-values from chr 11 for problem 2?
You should use all the SNPs and their corresponding p-values from on pset 1 for problem 2.
-Erik
8. So i'm trying to create my table of significant genes and this is what happens:
siggenes.table←samr.compute.siggenes.table(samr.obj,delta, data, delta.table)
Error in dimnames(res.up) = temp.names :
length of 'dimnames' [2] not equal to array extent
I'm not really sure how to correct this. I've followed the example from ?samr pretty closely.
9. The question relates to #8. I also have the same error. But it works if I use data=list(x=x,y=y, geneid=as.character(1:nrow(x)),genenames=paste(“g”,as.character(1:nrow(x)),sep=””), logged2=TRUE) (x and y are the real data from the Problem Set 2)
The example:
data=list(x=x,y=y, geneid=as.character(1:nrow(x)),genenames=paste(“g”,as.character(1:nrow(x)),sep=””), logged2=TRUE)
…
siggenes.table←samr.compute.siggenes.table(samr.obj,delta, data, delta.table)
The results of the example looks like:
Row Gene ID Gene
[1,] “212” “g211” “211”
Why Row number is 212 and Gene ID g211? What's the real Row number from the data? 212 or 211?
I found that you can pass in the column of gene ids and names for the arguments geneid and genenames. When you look at the genes.up and genes.lo table, it should be correctly labeled. -Molong
- Tiffany
10. For the text file that contains the the list of homologous genes listed by their HID: If there is more than 2 geneids for a HID (i've also noticed when there's only one geneid for a HID) how do we determine if it's a mouse geneid or a human geneid?
-Tiffany
11. I'm trying to read the gene2go file into R, but it gives me this error:
gene2go = read.table('gene2go', header = FALSE, sep = '\t')
Warning message: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
number of items read is not a multiple of the number of columns
If I just read in the first few lines, it works fine, but not when I try to read the whole file in. Is there something special about the way it's formatted? It's pretty large so I can't really open it in excel or notepad to take a look.
Also, I have less than 10 genes that satisfy the thresholds for question 3 and I was wondering if perhaps the sample size is too small to consider the proportion of various categories (since any category will automatically have 10% representation). I'm worried that I should be getting more genes out. I noticed that I get the same number of genes that have qvalue<.05 as with a threshold of .1 (there are a bunch of 0 qvalues and the rest are above .1). -molong
-Tiffany
12. Working with translate_rsid_geneid.txt (the 129 MB file) seems to take forever no matter how I code for it? Are there any tips you can give me?
-Tiffany
I'm not even merging anything yet. Even searching for 100 entries takes several minutes.
-Tiffany
13. In Question 5, are we supposed to be looking at the Category field (“Process”,”Component”,”Function”) or the GO term field (“axon”, “cell adhesion”, etc)? The GO term makes the most sense, but doesn't work with this data because I've only got 7 distinct genes that make it through the filter, so every term has at most 2 occurences, not enough for any kind of analysis. The Category field I could actually work with, but… what relevance is it? Does this question assume that we've done the extra credit and included genes from all chromosomes, not just chromosome 11? Did I screw up the analysis and I've got way too few genes? [JL] Yes, please specify what the “gene ontology category” refers to. In addition, what GENEIDs should we be using for this lookup? Mouse or human? Thanks. [/JL]
-Tiffany
-Erik
4. (again) Hi, I checked the SAM manual and did not see any examples in “R”. the ?samr helpfile does not show much either. are there any other SAMR references showing what the options are? some of the options are not documented at all! for example: data$logged2, data$genenames, data$geneid in the example from the man page. how are we supposed to know how to set the options if there is no documentation about what what does logged2 means?
thank you.
The samr package supplies a help file/manual here We will take a look at how the logged2 call affects the results, since we know it was difficult to determine this – the data itself is not logged.
-Tiffany
-Erik
1: Question: I am a little confused about one aspect of the problem. We need to count how many individuals from each population (control and disease) have which genotype (AA, Aa, aa), where A is the major allele, and a is the minor allele. However, what is the major allele in the control population may not be what is the major allele in the disease population. In these cases our comparisons wouldn't mean very much. So I'm confused on how we define what is the Major and Minor allele for the entire population (control + disease), so that we can compare apples to apples.
Answer: For question 1, it doesn't matter what we're calling a major or minor allele, all that matters is that we're comparing the same genotypes. So GG must be compared to GG even if GG is major in the control and GG is minor in the disease.
-Erik TA ecoronap@stanford.edu
2: Question: Continuing your example (ND412): rs2845379 is aa. Correct? What about rs7286463? Major C, Minor T. The pair is TC Thanks, Kate
Answer: Yes, it is aa, but it can also be AA as long as you stick to the same definition of aa and AA in the disease group. CT is always going to be Aa because it is heterozygous. To avoid confusion, we can just let the control group define major or minor alleles. The answers will be the same because ultimately we want to compare the same genotypes. Letting the control group define which is the 'A'(major) allele and which is the 'a'(minor) allele will get rid of confusion.
Here is an example: Let's say the control group states the major allele as C and the minor as G and the disease group states the major allele is G and the minor is C. Control A = C, a = G Disease A = G, a = C
If we were to compare the major alleles for the control and disease group we'd be comparing CC with GG, which is an obvious mistake. In fact, we don't really care to label any one of these alleles as major or minor so long as we compare the same genotypes. In other words, as long as we are comparing CC in the control group with CC in the disease group, GG in the control group with GG in the disease group, and GT in the control group with GT in the disease group, then calling one major or minor serves no purpose. I hope this helps. If not, just let the control group define the major and minor alleles and you'll be fine.
-Erik TA ecoronap@stanford.edu
3:
Question/Issue: on bmi217compute.stanford.edu
-bash-3.2$ cp /data/shared/ps1.RData . cp: cannot stat `/data/shared/ps1.RData': No such file or directory
Please grant access to the directory or there is no such directory
Thanks, Kate
Answer: We are working on getting this up ASAP. Meanwhile, you can just download ps1.RData from [[http://www.stanford.edu/~dpchen/ps1.RData|here].
You can then upload it to bmi217compute.stanford.edu like this… scp <location of ps1.RData> <your sunet ID>@bmi17compute.stanford.edu:~
Sorry for the temporary inconvenience.
-Erik TA ecoronap@stanford.edu
4:
“Some of the SNPs were unable to be measured in some individuals. These missing measurements need to be ignored in the statistics” Does this mean we ignore that patient for this computation?
-Tiffany
5:
In my computation some (very few) loci do not correspond to any of the (AA,Aa or aa) genotypes. i.e. some other base in place of the ones specified for this locus in the .map file. Should this even happen? or am I making an error. If it does happen then should it be ignored(i.e. not counted for any genotype)?
-Tiffany
6:
The first part of the problem set asks us to calculate the fraction of AA/Aa/aa genotypes at each gene locus among all individuals. My solution to this problem takes a very long time to run. From the R review session, I know that for loops are very slow in R, and my solution uses a for loop. However, since we have to look at two different elements of the individual matrix (one for each chromosome) to determine what genotype an individual is, I do not see how the problem can be solved with apply instead of a loop. Is there some method of making this calculation faster that I am missing, or should I just bite the bullet and let the code run for however many hours it takes?
myFunction = function(current){
#your data here
};
When you do apply, you are passing in the entire row or column into your function. e.g. apply(dataset, 1, myFunction) . This way, you can pass a parameter (current) into your function, which contains all of the items in your current row or column.
Also, don't hesitate to take a look at the other types of apply where you can reformat your data (sapply, lapply), or other types of functions in R which will help you do your counting.
-Tiffany
Question: The problem is how to apply “apply” if the function should work with individual members of the dataset. I can do something like - if dataset_pre[1,4]<>dataset_pre[1,5] then “Aa” in a loop In other words, I use individual indexes or work with cells in loop There is an other logic and I can not understand it.
7:
01/14/2009 (Wednesday) lection was not published . :(
The lecture will be up soon. Sorry for any inconvenience this may cause.
-Erik TA ecoronap@stanford.edu
8:
Question: Having trouble using rbind() with two matrices.
Answer: To use rbind(), you should make sure that the names of the columns match or else R will keep you from binding them. You can just make the names match like so ... names(firstMatrix) = names(secondMatrix) and then rbind() will work properly.
-Erik TA ecoronap@stanford.edu
9:
Question: Draw out a representative chi-square table for the SNP locus (rs... I do not understand what is representative chi-square table for a specific locus
Answer: A chi-square table refers to the table that is typically used to compute the chi square value. It involves expected and observed values for the genotype frequencies. The following website has an example: http://math.hws.edu/javamath/ryan/ChiSquare.html.
10:
Question: (an FYI really) control_chr11_pre and parkinson_chr11_pre seem to me to have different dimensions (control is 1 wider). Specifically for parkinson Field 1 contains ND-#### whereas for control the information is split into fields 1 and 2 ie ND and ####. This wrecked considerable havoc to my computations.
-Erik TA ecoronap@stanford.edu
11: Access for the extra files. http://biomedin217.stanford.edu/pset1 -Tiffany
12: Question: So I'm trying to convert the matrix of SNPs into a matrix that simply tells me whether the allele was major or minor. I wanted to use apply with MARGIN = c(1,2) so that it would iterate through each element in the matrix. However, the function that determines whether the allele is major or minor needs to know what position it's in in order to reference the map file. Is there any way to parse that kind of information to the function within apply? I've tried to look for this in the documentation, but am unable to find it.
-Tiffany
13: Question: I was wondering, for the chi-squared test, are we supposed to use a contingency test or a goodness-of-fit test?
-Tiffany
Question: Also, how do we deal with an expected frequency of zero? (This happens when the control group has a frequency of 0 for a genotype, in a goodness-of-fit test, or when both control and parkinson groups have frequency of 0 for a genotype, in the contingency test.)
-Tiffany
14: Question: is the Chi-Squared statistics expected to give correct results when the counts of the different groups (control vs parkinson) are different, or does it rely on same counts? Do we then have to only compute statistics for loci that have the same example counts?
I would first resolve the different total counts between observed and expected frequencies, and then if it is still an extremely significant SNP, we can look into why it was excluded from the original published results. -Erik
15: Question: Do we have to include a readme.txt? It's easier for me to talk about the technical details of the script just by commenting in the code.
-Tiffany
16: This is a question about your answer to question 12. So I tried to have an external variable from the function as an index which I could update every time the function is called and thereby keep track of which iteration the function is in. However, when I edit the variable within the function, the variable does not change outside the function and remains constant. I would return the variable except that what I actually want the function to return is new value of the matrix that I used apply on. Is there anyway to specify within a function that the variable is in essence a “global” variable and that changes to it should hold?
-Tiffany
17: Question: Table 1 from the study notes (at the very bottom line) that they only include SNPs which had measured genotypes from more than 95% of the patients. The instructions from homework 1 seem to indicate that we should skip this complication. My homework results are fairly different from the study's results, which is kind of alarming, so I tried this modification, and it changed my results to be more like the study's results. I just want make sure whether we're intended to do this filtering step or not. (Douglas Greiman)
-Tiffany
18: Question: I read the data from cc_chr2.pre like control_chr_pre=read.table(“cc_chr2.pre”); and use only nucleotides part: ind=control_chr_pre[,4:65415] and get weird data in the ind: length(ind[ind==“G”]) [1] 4579088 length(ind[ind==“C”])
[1] 4564784
length(ind[ind==“T”])
[1] 4212623
length(ind[ind==“A”])
[1] 4243610
length(ind[ind==“0”])
[1] 45518
length(ind[ind==“TRUE”])
[1] 81029 What is “TRUE”? The total sum is exactly 271*65412 I do not see “TRUE” in the file. It looks like there is an error in the reading data from file or something.
-Tiffany
**Thanks. It helped.=Kate