Welcome to BMI 217 (Winter 2008-2009). Please post questions in this space

General Questions


Q. 01/21/2009 Lection is still not published.:-/.

A. Should be up now.

-Erik

Q. A few of you have asked about how to run items on the server, and leave them running even if you're logged off.

A. Use screen. To make a new screen when logged into bmi217compute, type “screen”. This will place you in a session that even if you are disconnected, will still remain. To detach the screen so you can work with it later, press ctrl + a + d. To resume a screen, type screen -r. If you have multiple, type screen -r <thenameofthescreen>. For more information about screen, see here.

-Tiffany

Problem Set 4


Q. Where do we get the PubChem tables referenced in the homework? Are we supposed to download the entire database from the PubChem website? - Douglas Greiman

A. They are available in the bmi217_winter0809 database in the MySQL server on bmi217compute.stanford.edu.

-Erik

Q. In question 1. Should the homozygosity be for every locus asociated with a disease or is it enough for even one locus to be homozygous? -Sarim

A. One locus is enough.

-Erik

Q. Similarly, for problem 2 should we count every disease for which Watson has at least one homozygous “other” allele? In other words, is it fine to potentially count the same disease for 1 and 2? If not, how should we define “homozygous for the 'other' non-risky allele” (e.g. disease has at least one homozygous non-risky and no homozygous risky)?

A. It's fine if the same disease pops up in the answer to problem 1 and problem 2.

-Erik

Q. For #7, do you want the drugs that have activity for any of the conditions, or all of the conditions? -JL

A. A drug that has activity for all four of the conditions. -Tiffany

Q. Is the answer to question 4 in the readings listed?

A. The answer has been mentioned in class and can also be found in the linked article about Watson's genome in the reading for the problem set.

-Erik

Q. For Part 1 Extra Credit: Are we still looking only at homozygous alleles or should we take into account heterozygous as well?

A. They have to be homozygous as well.

-Erik

Q. For Question 8. How would you like us to deal with duplicate disease/substance pair experiments (there are a few). A) Use multiple Edges? B) Use a single Edge with an averaged score? C) Use a single Edge with the maximum score? D) Some other way?

A. Take an averaged score. This will be updated on the pset.

-Tiffany

Q. For Question 9. Could you give some more information/pointers on how to compute the 'Cumulative Distribution of Connectivities'? Is it the number of edges incident with vertexes of degree X divided by the total number of edges in the graph?

A. You will want to plot number of nodes vs. node degree on a log-log plot.

-Tiffany

Q. In the changed requirement for question 8. Do you mean that the average score should be greater than 80 in case of duplicates?

A. Yes. -Tiffany

Secondly, we are not being asked to show weights for the edges in the graph, is that correct? If so, then it shouldn't matter what exactly is the average weight as long as it is greater than 80? -sarim.

A. Yes, you do not need to show the edge weights for the edges in the graph. If you do manage to put them in and visualize your graph well, we wouldn't mind seeing it. :)

Also, why the average? Doesn't max make sense here? -JL

A. You could make an argument for the max. We think, however, that the average is a more conservative estimate. If you look more closely at how the activity measurements are made, this quantifier actually differs between experiments–it varies in how it is are determined. A better way to do it instead would be to use a normalized IC50 number, and also take into account the exact protein which is being bound, instead of completely generalizing them into larger disease classes. Here, to keep the problem set less complex (and since IC50 numbers are much rarer in pubchem) we are simplifying the problem.

Q. In #9, there are two classes of nodes where all connections between one of many substances and one of very few diseases. Should disease and substance nodes be considered as peers or separate classes in the determination of power law distribution?-JL

A. They should be considered as peers.

Q. What does a PUBCHEM_ACTIVITY_OUTCOME of 3 represent? Can those be ignored? -RB

A. Good job for finding them! Yes, ignore them. If you look at the pubchem website, they have different color codes for outcome, so there are more than 2.

Problem Set 3


Q. I didn't see a field for brand-names (for drugs) in the pharmacy table. There's only the MEDICATION_NAME, which gives a formulation. Is it safe to use te first word in each entry of the MEDICATION_NAME as the brand name? -Sarim
  • Use MEDICATION_NAME but search the entire field.

-Tiffany

Q. For problem 6, is the boxplot simply supposed to show the range of hemoglobin values for the CRF vs. the Non-CRF patients? If so, R seems to automatically highlight the Median value of the set - do we need to force it to highlight the mean instead?

-RB

  • No need to modify the way boxplots are displayed. It's fine if the boxplot shows the median instead of the mean.

-Erik

Q. In problem 10, when I try to extract all of the distinct lab test names from the pat_test_histv table (i.e. “select distinct TEST_ABBR from pat_test_histv”), some of the test names are duplicates because of the way the string is formatted (for example, CARNITINE appears once as “CARNITINE” and once as ” CARNITINE”). Is there a good way to get around this problem, so that we don't end up counting the same test as two different ones?

  • TRIM() or LTRIM() will handle your problem. LTRIM() removes whitespace from the beginning of any string.

-Erik

Q. In problem 5, is every other patient supposed to be all patients not in the CRF group from pat_test_histv or every other patient that is not in the CRF group but in both pat_test_histv and pat_fin_acct? There's a difference since in pat_test_histv, there are bunch of pat_nums that are not present in pat_fin_acct. The latter is actually easier to search.

  • The one in pat_test would be the one to use since this holds the actual values of the test for which the mean is sought.

-Erik

Q. In problem 5, some patients have never had a hemoglobin test. However, when grouping by patient to get the averages for the patients, these show up as zero averages. We do not want to include these zeros in the average of averages since they don't represent real data. How can we remove them?

  • Yes, you will need to first filter for empty and “not done” values for the test.

-Tiffany

Q. When trying to load the tab delimited files into a local mysql database, they fail to load correctly for some tables, such as pat_fin_acct. All lines contain warnings when loaded, and are essentially blank in the database (though not in the file). When the warnings are examined, they are of the form “Warning | 1366 | Incorrect integer value: '' for column 'ACTIV_PAT_NUM' at row 1” Suggestions?

  • Have not tried it yet, but it could possibly deal with whitespace in the data and how you're reading it in. Do you have access to the server? The tables are already placed on mysql for your convenience.

-Tiffany

Q. The yEd Graph Editor doesn't seem to parse the GraphML format exactly. The example page with attributes should show colored nodes, but when I import it into the yEd, none of the colors show up. -EC

A. Is this how you're formatting your node? This was verified to work.

  <node id="FUROSEMIDE"><data key="d0"><y:ShapeNode><y:Fill color="#0000FF" transparent="false"/><y:NodeLabel>FUROSEMIDE</y:NodeLabel><y:Shape type="circle" /></y:ShapeNode></data></node>

It's formatted like that now, and yEd won't even open it now. Is there some kind of key definition that is needed? -EC

Q. I'm also having similar problems when following the GraphML primer (http://graphml.graphdrawing.org/primer/graphml-primer.html). The node structure shows up correctly but yEd doesn't seem to be interpreting the node/edge properties or labels. -RB

A. Please see answer above.

-Erik

Q. I'm working on question 10 and I'm still debugging it so I haven't run it on the entire data set yet. However, I'm worried how long my code will take to run as there are several for loops and I was wondering if I should be avoiding them like in pset1. -ML

Mine had like 3 nested for loops and it took about an hour to fully compute on the cluster. I wasn't sure if there was an easier way. It could have been done in apply but the variable passing would have been tedious. -EC

A. Using apply with a function that needs variables is tricky but is a huge time saver. Here's a small tip that might help you use apply.

Say you have a function that takes in a vector and a constant

myFxn ← function (values,someConstant){return(sum(values) + someConstant);}
vec = 1:20;
dim(vec) = c(4,5);

Say I'd like to get the sum of each row plus a constant using the 'myFxn' function. This is how to do it.

apply(vec,1,myFxn,someConstant=50);

This can eliminate a nested for loop, although you might still have to use apply within a loop.

-Erik

Q. For problem 10, is there any way to make sure you group all dosages of the same drug together? For example, there are both 10,000 U/ML and 3000 U/ML dosages of epoetin, but “select distinct medication_name from pharmacy” would count them as separate. There doesn't appear to be any unique identifying code for a particular drug in the table.

  • You can collapse them if you like, but we are not requiring you to do so.

-Tiffany

Q. When doing Spearman Rank Correlation, is it important to you how we calculate significance? Is a Monte Carlo Permutation Test acceptable?

  • We are looking for an actual value.

-Tiffany

Problem Set 2


1. Quesion: Does SAM work in Linux? There are R packages for Linux but no any installation instructions, examples or I can not find them. Do we need setup.exe, Excel add-in for the Problem Set 2? Was it installed on bmi217compute.stanford.edu? =Kate
  • samr does work on linux, it is installed on bmi217compute. to load it, you type library('samr'). To install, it also requires package impute.

-Tiffany

2. Question: Are mice the same in the control group (before treatment) and treatment groups?

  • No.

-Tiffany

3. Question: I don't have Excel so maybe I'm missing something about SAM. How do I read in the *.soft data using either R or MySQL? -JL

  • You are not using MySQL for this assignment, you only need R. You can read the .soft data file in using R. If you are uncertain about the format of the file, you can use a text editor to edit it and take a look.

-Tiffany

4. Question: What are the inputs that SAM uses? Can you provide a sample SAM function-call that illustrates the syntax and what inputs are required, or are there any hints/instructions available describing what command(s)/function(s) to use?? Alternatively, if no code/examples are available, it would be very helpful to see a flow chart/diagram conceptually describing the correct use of SAM in this context. Thanks!

  • The inputs to samr are in the helpfile (e.g. type ?samr when you have R open and the samr library loaded). Part of the problem set is to understand what to use for samr, based on the paper and the online manual, which we have supplied the links to on the pset. Please let us know if you have specific questions about the arguments that are passed into samr.

-Tiffany

5. How do I access the soft file from bmi217compute.stanford.edu? Is it located in a folder or do I need to ftp it over?

  • You need to download the file from geo. Please make sure you have the right file.

-Tiffany

6. Is there a way to confirm my answers for Q1? I'm getting so many 0 q-value(%)?

  • Did you determine your significant genes with a specific delta value? tiny q-values are ok.

-TiffanyI'm using delta=2.1. According to mean FDR in the delta table.

  • You don't want to use the mean FDR, we specify a threshold in the problem set, Q1.

-Tiffany

7. Are we supposed to use the top 10 p-values from chr 11 for problem 2?

  You should use all the SNPs and their corresponding p-values from on pset 1 for problem 2.

-Erik

8. So i'm trying to create my table of significant genes and this is what happens:

siggenes.table←samr.compute.siggenes.table(samr.obj,delta, data, delta.table)

Error in dimnames(res.up) = temp.names :

length of 'dimnames' [2] not equal to array extent

I'm not really sure how to correct this. I've followed the example from ?samr pretty closely.

9. The question relates to #8. I also have the same error. But it works if I use data=list(x=x,y=y, geneid=as.character(1:nrow(x)),genenames=paste(“g”,as.character(1:nrow(x)),sep=””), logged2=TRUE) (x and y are the real data from the Problem Set 2)

The example:

data=list(x=x,y=y, geneid=as.character(1:nrow(x)),genenames=paste(“g”,as.character(1:nrow(x)),sep=””), logged2=TRUE)

siggenes.table←samr.compute.siggenes.table(samr.obj,delta, data, delta.table)

The results of the example looks like:

     Row   Gene ID Gene 

[1,] “212” “g211” “211”

Why Row number is 212 and Gene ID g211? What's the real Row number from the data? 212 or 211?

I found that you can pass in the column of gene ids and names for the arguments geneid and genenames. When you look at the genes.up and genes.lo table, it should be correctly labeled. -Molong

  • As a side note, you may want to take a look at how you are calling the information–is your data actually logged?

- Tiffany

10. For the text file that contains the the list of homologous genes listed by their HID: If there is more than 2 geneids for a HID (i've also noticed when there's only one geneid for a HID) how do we determine if it's a mouse geneid or a human geneid?

  • You can have multiple GeneIDs, and you don't know if they are human or not. You do, however, have a list of human genes that you can merge with this (hint: remember pset 1).

-Tiffany

11. I'm trying to read the gene2go file into R, but it gives me this error:

gene2go = read.table('gene2go', header = FALSE, sep = '\t')

Warning message: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :

number of items read is not a multiple of the number of columns

If I just read in the first few lines, it works fine, but not when I try to read the whole file in. Is there something special about the way it's formatted? It's pretty large so I can't really open it in excel or notepad to take a look.

  • I had the same problem. There is a field with a single apostrophe in it which confuses the parser. Add quote=”” as an argument to read.table. By the way, I have the same issue with q values, and I don't know the answer to that one. (Douglas Greiman)

Also, I have less than 10 genes that satisfy the thresholds for question 3 and I was wondering if perhaps the sample size is too small to consider the proportion of various categories (since any category will automatically have 10% representation). I'm worried that I should be getting more genes out. I noticed that I get the same number of genes that have qvalue<.05 as with a threshold of .1 (there are a bunch of 0 qvalues and the rest are above .1). -molong

  • If I understand right, you are talking about the unique genes that you have after all of your merges. That doesn't sound like a lot. You may want to check your q-values. You won't have many results total, but you will have many GO categories.

-Tiffany

12. Working with translate_rsid_geneid.txt (the 129 MB file) seems to take forever no matter how I code for it? Are there any tips you can give me?

  • make sure that you aren't doing redundant merges at each step. You may be merging over duplicates, and that will take a WHILE.

-Tiffany

I'm not even merging anything yet. Even searching for 100 entries takes several minutes.

  • Please send me your code snippet of how you are searching.

-Tiffany

13. In Question 5, are we supposed to be looking at the Category field (“Process”,”Component”,”Function”) or the GO term field (“axon”, “cell adhesion”, etc)? The GO term makes the most sense, but doesn't work with this data because I've only got 7 distinct genes that make it through the filter, so every term has at most 2 occurences, not enough for any kind of analysis. The Category field I could actually work with, but… what relevance is it? Does this question assume that we've done the extra credit and included genes from all chromosomes, not just chromosome 11? Did I screw up the analysis and I've got way too few genes? [JL] Yes, please specify what the “gene ontology category” refers to. In addition, what GENEIDs should we be using for this lookup? Mouse or human? Thanks. [/JL]

  • You are using the geneids that you have found at the very end. Do your calculations based on the GO category ID (GO term). When you supply your answer, however, you can give the category field as well. More info on these terms is here

-Tiffany

  • The GO Category ID is the 2nd column in the file and looks like this GO:0003676. For example, GO:0003676 is for “nucleic acid binding” which is the first entry.

-Erik

4. (again) Hi, I checked the SAM manual and did not see any examples in “R”. the ?samr helpfile does not show much either. are there any other SAMR references showing what the options are? some of the options are not documented at all! for example: data$logged2, data$genenames, data$geneid in the example from the man page. how are we supposed to know how to set the options if there is no documentation about what what does logged2 means?

thank you.

The samr package supplies a help file/manual here We will take a look at how the logged2 call affects the results, since we know it was difficult to determine this – the data itself is not logged.

-Tiffany

  • logged2 refers to whether the data is logged.
  • genenames should be the “IDENTIFIER” column in the mouse data you downloaded from GEO.
  • geneid This is the “GENE_ID” column in the “translate_affy_geneid.txt” file. NOTE: you want to add the geneid to the mouse data

-Erik

Problem Set 1


1: Question: I am a little confused about one aspect of the problem. We need to count how many individuals from each population (control and disease) have which genotype (AA, Aa, aa), where A is the major allele, and a is the minor allele. However, what is the major allele in the control population may not be what is the major allele in the disease population. In these cases our comparisons wouldn't mean very much. So I'm confused on how we define what is the Major and Minor allele for the entire population (control + disease), so that we can compare apples to apples.

Answer: For question 1, it doesn't matter what we're calling a major or minor allele, all that matters is that we're comparing the same genotypes. So GG must be compared to GG even if GG is major in the control and GG is minor in the disease.

-Erik TA ecoronap@stanford.edu

2: Question: Continuing your example (ND412): rs2845379 is aa. Correct? What about rs7286463? Major C, Minor T. The pair is TC Thanks, Kate

Answer: Yes, it is aa, but it can also be AA as long as you stick to the same definition of aa and AA in the disease group. CT is always going to be Aa because it is heterozygous. To avoid confusion, we can just let the control group define major or minor alleles. The answers will be the same because ultimately we want to compare the same genotypes. Letting the control group define which is the 'A'(major) allele and which is the 'a'(minor) allele will get rid of confusion.

Here is an example: Let's say the control group states the major allele as C and the minor as G and the disease group states the major allele is G and the minor is C. Control A = C, a = G Disease A = G, a = C

If we were to compare the major alleles for the control and disease group we'd be comparing CC with GG, which is an obvious mistake. In fact, we don't really care to label any one of these alleles as major or minor so long as we compare the same genotypes. In other words, as long as we are comparing CC in the control group with CC in the disease group, GG in the control group with GG in the disease group, and GT in the control group with GT in the disease group, then calling one major or minor serves no purpose. I hope this helps. If not, just let the control group define the major and minor alleles and you'll be fine.

-Erik TA ecoronap@stanford.edu

3:

Question/Issue: on bmi217compute.stanford.edu

-bash-3.2$ cp /data/shared/ps1.RData . cp: cannot stat `/data/shared/ps1.RData': No such file or directory

Please grant access to the directory or there is no such directory

Thanks, Kate

Answer: We are working on getting this up ASAP. Meanwhile, you can just download ps1.RData from [[http://www.stanford.edu/~dpchen/ps1.RData|here].

You can then upload it to bmi217compute.stanford.edu like this… scp <location of ps1.RData> <your sunet ID>@bmi17compute.stanford.edu:~

Sorry for the temporary inconvenience.

-Erik TA ecoronap@stanford.edu

4:

“Some of the SNPs were unable to be measured in some individuals. These missing measurements need to be ignored in the statistics” Does this mean we ignore that patient for this computation?

  • Answer: No, just ignore the specific missing measurement.

-Tiffany

5:

In my computation some (very few) loci do not correspond to any of the (AA,Aa or aa) genotypes. i.e. some other base in place of the ones specified for this locus in the .map file. Should this even happen? or am I making an error. If it does happen then should it be ignored(i.e. not counted for any genotype)?

  • This occurs quite rarely. Do not count the genotype in this case, simply do the summation over your counts of A and a.

-Tiffany

6:

The first part of the problem set asks us to calculate the fraction of AA/Aa/aa genotypes at each gene locus among all individuals. My solution to this problem takes a very long time to run. From the R review session, I know that for loops are very slow in R, and my solution uses a for loop. However, since we have to look at two different elements of the individual matrix (one for each chromosome) to determine what genotype an individual is, I do not see how the problem can be solved with apply instead of a loop. Is there some method of making this calculation faster that I am missing, or should I just bite the bullet and let the code run for however many hours it takes?

  • Yes, for loops will SIGNIFICANTLY slow your code down in R. Your code may take many many many hours if you use for loops. When using apply, however, you can pass in an entire row or column. This will have multiple elements in it. You can also rbind or cbind to combine items to pass into a function. Please also remember that when you use apply, you can pass in any function, including one that you create on your own. An example of creating a function in R is the following:
   myFunction = function(current){
        #your data here
   };

When you do apply, you are passing in the entire row or column into your function. e.g. apply(dataset, 1, myFunction) . This way, you can pass a parameter (current) into your function, which contains all of the items in your current row or column.

Also, don't hesitate to take a look at the other types of apply where you can reformat your data (sapply, lapply), or other types of functions in R which will help you do your counting.

-Tiffany

Question: The problem is how to apply “apply” if the function should work with individual members of the dataset. I can do something like - if dataset_pre[1,4]<>dataset_pre[1,5] then “Aa” in a loop In other words, I use individual indexes or work with cells in loop There is an other logic and I can not understand it.

7:

01/14/2009 (Wednesday) lection was not published . :(
The lecture will be up soon. Sorry for any inconvenience this may cause.

-Erik TA ecoronap@stanford.edu

8:

Question: Having trouble using rbind() with two matrices.
Answer: To use rbind(), you should make sure that the names of the columns match or else R will keep you from binding them.
You can just make the names match like so ... names(firstMatrix) = names(secondMatrix)
and then rbind() will work properly.

-Erik TA ecoronap@stanford.edu

9:

Question: Draw out a representative chi-square table for the SNP locus (rs...
I do not understand what is representative chi-square table for a specific locus
Answer: A chi-square table refers to the table that is typically used to compute the chi square value.
It involves expected and observed values for the genotype frequencies.
The following website has an example: http://math.hws.edu/javamath/ryan/ChiSquare.html.

10:

Question: (an FYI really) control_chr11_pre and parkinson_chr11_pre seem to me to have different dimensions (control is 1 wider).
Specifically for parkinson Field 1 contains ND-#### whereas for control the information is split into fields 1 and 2 ie ND and ####.
This wrecked considerable havoc to my computations.

-Erik TA ecoronap@stanford.edu

11: Access for the extra files. http://biomedin217.stanford.edu/pset1 -Tiffany

12: Question: So I'm trying to convert the matrix of SNPs into a matrix that simply tells me whether the allele was major or minor. I wanted to use apply with MARGIN = c(1,2) so that it would iterate through each element in the matrix. However, the function that determines whether the allele is major or minor needs to know what position it's in in order to reference the map file. Is there any way to parse that kind of information to the function within apply? I've tried to look for this in the documentation, but am unable to find it.

  • A lot of people have banged their head over this problem. There are many different ways to approach it, but one of the most common (and simplest) ways is to make your own index. hint: you can store indexes separately.

-Tiffany

13: Question: I was wondering, for the chi-squared test, are we supposed to use a contingency test or a goodness-of-fit test?

  • The question you are asking is the underlying question of #2 on the problem set. Why is it appropriate for us to use the Chi-square? Think about the two data sets you are studying (Control and Parkinsons), and the significant loci you are determining–what are we trying to get from this data, and what does it mean that things are “significant”?

-Tiffany

Question: Also, how do we deal with an expected frequency of zero? (This happens when the control group has a frequency of 0 for a genotype, in a goodness-of-fit test, or when both control and parkinson groups have frequency of 0 for a genotype, in the contingency test.)

  • You are allowed to ignore columns or rows which sum to 0 in your chi-squared calculations

-Tiffany

14: Question: is the Chi-Squared statistics expected to give correct results when the counts of the different groups (control vs parkinson) are different, or does it rely on same counts? Do we then have to only compute statistics for loci that have the same example counts?

  • Although in some cases you can compare chi square values even when the total observed and expected counts aren't equal, this is not one of those cases. Missing genotypes would erroneously boost differences between parkinson's and the control population. As far as getting around this issue, as long as it is based on sound statistics and you justify it so we can verify, you should be fine. Normalization to the same total counts may introduce variation on the side with less counts, but at least we'd be comparing apples to apples in terms of proportion of AA, Aa, and aa genotypes.

I would first resolve the different total counts between observed and expected frequencies, and then if it is still an extremely significant SNP, we can look into why it was excluded from the original published results. -Erik

15: Question: Do we have to include a readme.txt? It's easier for me to talk about the technical details of the script just by commenting in the code.

  • Yes, you are REQUIRED to submit a readme. Still make sure your code is commented, however, so we can understand it. The readme is a basic overview, enough so that we can get an idea of what you did.

-Tiffany

16: This is a question about your answer to question 12. So I tried to have an external variable from the function as an index which I could update every time the function is called and thereby keep track of which iteration the function is in. However, when I edit the variable within the function, the variable does not change outside the function and remains constant. I would return the variable except that what I actually want the function to return is new value of the matrix that I used apply on. Is there anyway to specify within a function that the variable is in essence a “global” variable and that changes to it should hold?

  • You can make global variables in R, but they are rarely used and pretty tricky. You can make them using the following operator: ”«-” Another hint is that instead of storing just one index and iterating over it, think about the utility of storing all your indices.

-Tiffany

17: Question: Table 1 from the study notes (at the very bottom line) that they only include SNPs which had measured genotypes from more than 95% of the patients. The instructions from homework 1 seem to indicate that we should skip this complication. My homework results are fairly different from the study's results, which is kind of alarming, so I tried this modification, and it changed my results to be more like the study's results. I just want make sure whether we're intended to do this filtering step or not. (Douglas Greiman)

  • We did not expect you to do this filtering step. Please justify your changes in your README and we will take a look at it. We prefer your solution to be without the filtering, but will also take a look at your filtered results.

-Tiffany

18: Question: I read the data from cc_chr2.pre like control_chr_pre=read.table(“cc_chr2.pre”); and use only nucleotides part: ind=control_chr_pre[,4:65415] and get weird data in the ind: length(ind[ind==“G”]) [1] 4579088 length(ind[ind==“C”])

[1] 4564784

length(ind[ind==“T”])

[1] 4212623

length(ind[ind==“A”])

[1] 4243610

length(ind[ind==“0”])

[1] 45518

length(ind[ind==“TRUE”])

[1] 81029 What is “TRUE”? The total sum is exactly 271*65412 I do not see “TRUE” in the file. It looks like there is an error in the reading data from file or something.

  • You are reading the file and not adding any parameters. make sure header is false, as.is is true, and you have the right colClass type.

-Tiffany

**Thanks. It helped.=Kate

 
public/q_a_win0809.txt · Last modified: 2009/02/08 18:27 by tjchen
 
Recent changes RSS feed Creative Commons License Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki