R learning: September 2007

2007-09-26

The main dish is coming? Q1.1 ~ 4... XD

Q1.1) fold change:
easy thought, easy Done! Sorting data by AVE(schz)/AVE(ctrl) and then pick up the first 200 genes. But how?

>aa<-array(round(rnorm(24),2),dim=c(6,4))  # randomly generate number based on normal dist
>aa
      [,1]  [,2]  [,3]  [,4]
[1,] -0.53 -1.26  0.87  1.92
[2,]  1.00  0.26 -0.40 -0.67
[3,]  1.39 -0.16 -0.11 -0.06
[4,] -0.51  0.35  0.28  0.07
[5,] -1.59  1.80  0.52 -0.63
[6,]  0.03 -0.60  0.61 -0.17
>order(aa[ ,2])                            # sort by column 2
>order(aa[3, ])                            # sort by row 5
>order(aa[3, ], decreasing=TRUE)           # this one is the default
>order(aa[3, ], decreasing=FALSE)
>aa[order(aa[3,]),]
>aa[order(aa[3,])[1:2],3:5]             #aa[order(aa[3,])[row],col]

As you can see the order() returning the order of row or column as a vector, and that's why it's a beautiful thing for us when selecting genes by kinda order.

Heatmap is so pretty and vivid.. XD

Don't just try heatmap..There are more options such as heatmap.2 from gplots package, and heatmap_2 from Heatplus package.
Both heatmap.2 and heatmap_2 can show the bar of color code for arbitrary unit (color key ); in addition to that, heatmap_2 can repress the dendrogram for rows (but unable to suppress the clustering) and show different colors for dendrogram of columns based on their clusters.
Don't worry about the data format! They "eat" the same format of data, meaning what heatmap can load is good for heatmap.2 & heatmap_2 too.
Some miscellaneous setup just for pretty look
1) color pattern
Color palette can be easily assigned by using col=ColorPattern.
Here are some internal/pre-defined color palettes:
heat.colors(n), topo.colors(n), redgreen(n) and cm.colors(n)
Change (n) to set up how many levels you want in the color gradient
rev(): reverse the color pattern, for example rev(redgreen(100)) will get a gradient from green to red instead of the one from red to green.

2) color-stripe for specific group of sample
scol=c(...) is a vector with the length equaling to the number of your column.
Just assign the color you want in th vector, like c("red","red","skyblue","skyblue") for a 4-coulmned dataset.
3) scaling
scaling="row", "col", "none"; although keeping this option ON usually makes the heatmap pretty, I thought turning it OFF can help us have an ideal how many genes is in relatively low expression levels.

How to feed the program ?

The first challenge met is what kind of function I should use to feed the program.
Checked the data file, even with .xls; actually it's a tab separated text file, and thus I decided to use read.table() function.

Read.table has very flexible parameters allowed to set up for reading in titles/column names, change of separators(^t, ^p, "." and so on). However, R also provides several subtypes of read.table, which with mild difference of default.

One of them, read.delim(), fits to my needs. It reads in the first row as the column names (check it by using colnames()), recognizes "tab" as separator (type ?read.delim to see detail).

So far so good, right? Don't be happy too early..
Since our purpose is to visualize the data by heatmap, I'm so eager to see it first.. (what?! just put the fold change or something else on the back strove). I'm so naive that it's not a big deal to throw the data into the heatmap(). Here comes the problem... Orz

>t<-read.delim("sch.dat") >t
>colnames(t)
>dim(t)
>heatmap(t[2:7072,6:29],col=rev(heat.colors(100)))
Error in heatmap(t[2:7072, 6:29], col = rev(heat.colors(100))) : 
        'x' must be a numeric matrix
#since the column(1:5) are not numeric, they are probe/gene information
#It's so weird, the range I select t[2:7072, 6:29] should be ALL numeric.
#NOW learn a lesson what you saw is NOT what you think!!!
> t[1,7:8]
    C2.A..  C3.A..
1 18291.58 19402.7
> is.numeric(t[1,7:8]) # check if it is numeric.. XD XD XD it's NOT.. Orz.. no wonder..
[1] FALSE
>NewSet <- array(rnorm(7071*25),dim=c(7071,25)) # put some number to array (just for sure it's numeric!!!)
>for(i in 1:7071){for(j in 1:24){NewSet[i,j]=t[i,5+j]}} # transfer data from original array to new one
>NewSet[1,2:3] # look almost identical to t[1,7:8]
[1] 18291.58 19402.70
>is.numeric(NewSet[1,2:3])
[1] TRUE
>rownames(NewSet)
NULL
>colnames(NewSet)
NULL
#now the data matrix has no row/col names, which is not good for us to know which is which!
#assign the names to them
>rownames(NewSet)=t[,1] # row=genes
>colnames(NewSet)<-c("c1","c2","c3","c4","c5","c6","c7","c8","c9","c10","c11","c12","p1","p2","p3","p4","p5","p6","p7","p8","p9","p10","p11","p12","") ## The reason I would like to assign the row/col names is the heatmap function can automatically pick up the names and show on the plot
## OK! now the data is ready for heatmap !!!!?

2007-09-24

Some hints for analyzing microarray data

http://strimmerlab.org/notes/rexpress.html

2007-09-18

Good workshops !!

Found out some workshop materials for practice,
http://www.bioconductor.org/workshops/
http://bioinf.wehi.edu.au/marray/ibc2004/
Enjoy yourself..!

Identify the differentially expressed genes

A slight difference of parameter will make a totally different gene list.. Must watch out what we did.. As I knew currently, the most important things are the design matrix and the "coefficient". There is other minor adjustment such as topTable(adjust), which is for p-value adjustment, but I would like to stick to "FDR" first.

>design<-cbind(Dye=1, Beta7=c(1,-1,-1,1,1,-1))
>fit<-lmFit(nordata,design,weights=NULL)
>fit<-eBayes(fit)
>Table1<-topTable(fit,coef="Dye",adjust="fdr")
>Table1$Name<-substring(Table1$Name,1,12)
>Table1
             ID         Name    logFC   AveExpr        t      P.Value  adj.P.Val        B
10747 H200005892 HLA-DQB1 - M 1.566674 12.002316 13.24244 9.728991e-06 0.02544872 3.807952
8819  H200006035 TYRP1 - Tyro 1.661024 12.159866 13.00588 1.082891e-05 0.02544872 3.733779
15579 H200006025 DAP - Death- 1.435546 12.644359 12.47848 1.384254e-05 0.02544872 3.559907
9305  H200005780 C19orf7 - Ch 1.353122 13.101057 12.39516 1.440280e-05 0.02544872 3.531313
12312 H200011083 KIAA0692 - K 1.358942 10.883141 12.14326 1.626307e-05 0.02544872 3.442918
10913 H200013564 RBMX - RNA b 1.289367 12.517112 11.79361 1.932612e-05 0.02544872 3.315169
15126 H200007318 HAP1 - Hunti 1.623051 11.536758 11.66654 2.060115e-05 0.02544872 3.267230
17192 H200013376 SLC8A2 - Sol 1.737073  4.501546 11.66589 2.060792e-05 0.02544872 3.266983
16546 H200006019 PPP5C - Prot 1.489265 10.131609 11.66234 2.064493e-05 0.02544872 3.265631
20416 H200006376 ARHA - Ras h 1.884627 13.675812 11.62780 2.100903e-05 0.02544872 3.252450

>design<-c(1,-1,-1,1,1,-1)
>fit<-lmFit(nordata,design,weights=NULL)
>fit<-eBayes(fit)
>Table2<-topTable(fit,adjust="fdr")
>Table2$Name<-substring(Table2$Name,1,12)
>Table2

          ID         Name      logFC   AveExpr          t      P.Value adj.P.Val            B
6647  H200017286 GPR2 - G pro -2.4511865  7.787727 -10.770494 8.123551e-06 0.1834054  1.171025368
11025 H200018884 Homo sapiens -1.5975971  6.616640  -8.764280 3.424187e-05 0.2828096  0.707261894
3152  H200012024 ITGA1 - Inte  1.2673703  6.979029   8.390349 4.617165e-05 0.2828096  0.594759959
21405 H200007427 CENTG2 - Cen -1.1868237  6.092311  -8.290405 5.010579e-05 0.2828096  0.562965478
4910  H200003784 SEMA5A - Sem -1.3489761  6.809355  -6.893363 1.723876e-04 0.6096875  0.027552798
7832  H200004937 Homo sapiens -1.2424628  6.301858  -6.851093 1.794831e-04 0.6096875  0.008303493
18925 H200003977 F5 - Coagula -1.1046977  7.498769  -6.547843 2.410666e-04 0.6096875 -0.135922701
20812 H200001929 EPLIN - Epit -1.0963658  8.615247  -6.294852 3.107515e-04 0.6096875 -0.264929553
16442 H200001079 EGFL5 - EGF- -0.9630056  7.739278  -6.273861 3.174729e-04 0.6096875 -0.276006763
1755  H200014446 P2Y5 - Purin -0.9152066 12.226345  -6.147830 3.613902e-04 0.6096875 -0.343757901

2007-09-15

Guestbook

Here is the guestbook.
Welcome any suggestions and comments. Hopefully, we can learn more from ALL of our try-and-error.

2007-09-13

Check the results before next move...

Just found out the maQualityPlots() produced several diagnostic plots including both MA plts before and after normalization.. Orz Orz Orz Orz Orz..

Well.. but from this "mistake" we did learn things including how to extract subsets of data from an object, and to get MA plots after normalization.

Next, I would try to generate diagnostic plots without maQualityPlots()...

Damn it! It's a matter of memory...

I re-started the R, and try the following commands. My purpose is to find out the maximal numbers of objects able to be print out in a single ps file.

>x<-rawdata@maRf[,1]
>y<-rawdata@maGf[,1]
>plot(y~x)
>x<-rawdata@maRf[,1:2]
>y<-rawdata@maGf[,1:2]
>plot(y~x)
>x<-rawdata@maRf[,1:3]
>y<-rawdata@maGf[,1:3]
>plot(y~x)
>x<-rawdata@maRf[,1:4]
>y<-rawdata@maGf[,1:4]
>plot(y~x)
>x<-rawdata@maRf[,1:5]
>y<-rawdata@maGf[,1:5]
>plot(y~x)
>x<-rawdata@maRf[,1:6]
>y<-rawdata@maGf[,1:6]
>plot(y~x)
>x<-rawdata@maRf
>y<-rawdata@maGf
>plot(y~x)
> maQualityPlots(rawdata[,1], dev="postscript")
Error in frame() : figure margins too large

They are ALL fine for plots()
But maQualityPlots() still sucks!

A primitive way to generate plots

>rawdata
An object of class "marrayRaw"
@maRf
6Hs.195.1.gpr 6Hs.168.gpr 6Hs.166.gpr 6Hs.187.1.gpr 6Hs.194.gpr
[1,] 227 370 139 69 187
[2,] 208 273 181 80 213
.
.
.

@maGf
6Hs.195.1.gpr 6Hs.168.gpr 6Hs.166.gpr 6Hs.187.1.gpr 6Hs.194.gpr
[1,] 357 304 189 106 218
[2,] 292 235 255 133 216
[3,] 284 223 256 123 290
.
.
.
>x <- rawdata@maRf
>y <- rawdata@maGf
>plot(y~x)
Error in plot.new() : figure margins too large

1) check what variables we got within the dataset; here we have @maRf, @maGf and so on
2) Get them
3) plot it, but encounter the same results (this problem has been partially resloved by re-starting R [see NEXT article] )

2007-09-12

Got a new toy!

The marray Package..
Now I can directly generate MA plots from either rawdata (Green/Red intensities) or transformed data (M/A values).

RG -> MA !!!!!!

Ha~
Just figured it out ~~~~~
The program read in the red/green raw data (intensities), and then convert them to M/A values. So if checking the data before/after normalization, we can find out the variables were changed, from $R & $G to $M & $A. That's why we can not get MA plots by using maQualityPlots, which converts $R/$G to $M/$A. How could we expect it to convert a dataset which is already $M/$A !? Right?

BEFORE normalization

AFTER normalization

>rawdata <- read.GenePix(target=ArrayInfo) >rawdata
An object of class "marrayRaw"
@maRf
6Hs.195.1.gpr 6Hs.168.gpr 6Hs.166.gpr 6Hs.187.1.gpr 6Hs.194.gpr
[1,] 227 370 139 69 187
[2,] 208 273 181 80 213
[3,] 239 288 238 77 377
[4,] 247 301 228 91 329
[5,] 221 290 218 91 351
6Hs.243.1.gpr
[1,] 241
[2,] 241
[3,] 307
[4,] 280
[5,] 236
23179 more rows ...

@maGf
6Hs.195.1.gpr 6Hs.168.gpr 6Hs.166.gpr 6Hs.187.1.gpr 6Hs.194.gpr
[1,] 357 304 189 106 218
[2,] 292 235 255 133 216
[3,] 284 223 256 123 290
[4,] 320 254 302 141 301
[5,] 361 290 328 156 327
6Hs.243.1.gpr
[1,] 165
[2,] 183
[3,] 194
[4,] 252
[5,] 247
23179 more rows ...
.
.
.

>NormalizedData=maNorm(rawdata)
>NormalizedData
An object of class "marrayNorm"
@maA
6Hs.195.1.gpr 6Hs.168.gpr 6Hs.166.gpr 6Hs.187.1.gpr 6Hs.194.gpr 6Hs.243.1.gpr
[1,] 6.895683 7.298095 NA NA NA 3.336213
[2,] 5.965369 6.322153 5.608070 3.953445 NA 5.213656
[3,] 6.332001 5.879320 6.272724 3.390680 6.759634 5.960176
[4,] 6.765325 6.701772 6.569456 4.761781 6.698837 6.356586
[5,] 6.821026 6.923822 6.699372 5.099223 6.954853 5.620098
23179 more rows ...

@maM
6Hs.195.1.gpr 6Hs.168.gpr 6Hs.166.gpr 6Hs.187.1.gpr 6Hs.194.gpr 6Hs.243.1.gpr
[1,] -0.4520881 0.11756657 NA NA NA 5.2940001
[2,] -0.1146129 0.02906567 -0.5395663 -0.9326132 NA 1.2865839
[3,] 0.7722506 0.91978728 0.8007346 -0.6557331 0.69575087 2.0297161
[4,] 0.2966088 0.07101715 -0.2995052 -0.2816260 0.01501081 0.3527075
[5,] -0.5549713 -0.56805515 -0.6199517 -0.6012264 -0.21962736 -0.2105729
23179 more rows ...
.
.
.

2007-09-11

MA Plots after Normalization ? We have tried limma too !

Yes..!!! We got a NEW problem!
We use maNorm() to normalize the dataset and want to compare the MA plots before/after normalization. However when we will apply maQualityPlots, it showed error message again!

>NormalizedData=maNorm(rawdata)
>maQualityPlots(NormalizedData, dev="JPEG")
Error in 1:dim(mrawObj)[2] : NA/NaN argument
> summary(NormalizedData)
Normalized intensity data: Object of class marrayNorm.
.
.
> NormalizedData
An object of class "marrayNorm"
@maA
6Hs.195.1.gpr 6Hs.168.gpr 6Hs.166.gpr 6Hs.187.1.gpr 6Hs.194.gpr 6Hs.243.1.gpr
[1,] 6.895683 7.298095 NA NA NA 3.336213
[2,] 5.965369 6.322153 5.608070 3.953445 NA 5.213656
[3,] 6.332001 5.879320 6.272724 3.390680 6.759634 5.960176
.
.
.
>write.marray(NormalizedData)
NULL

1) maNorm() to normalize the dataset
2) tried to get MA plots
3) check the normalized dataset -> Yes, it's normailized
4) By looking the data, we found the NA data within normalized dataset. We guess that's the reason made maQualityPlots() not work!
5) But we still can write the normalized data into a new file by using write.marray().

We decided to try another package for normalization, which is limma

>source("http://bioconductor.org/biocLite.R")
>biocLite("limma")
>library(limma)
>target <- readTargets("TargetBeta7.txt") >f <- function(x) as.numeric(x$Flags > -99)
>RG<-read.maimages(target,source="genepix",wt.fun=f)
Read 6Hs.195.1.gpr
Read 6Hs.168.gpr
Read 6Hs.166.gpr
Read 6Hs.187.1.gpr
Read 6Hs.194.gpr
Read 6Hs.243.1.gpr
>RG<-backgroundCorrect(RG, method="normexp",offset=50) Corrected array 1
Corrected array 2
Corrected array 3
Corrected array 4
Corrected array 5
Corrected array 6
>maQualityPlots(RG, dev="jpeg")
>nRG<-normalizeWithinArrays(RG) >maQualityPlots(nRG, dev="jpeg")
Error in do.call("gpFlagWt", list(mraw@maW)) :
no slot of name "maW" for this object of class "MAList"

1) Install it first
2) then load the package
3) read in the list for *.gpr file
4) Set up a filter so that any spot with a flag of −99 or less gets zero weight (based on limma userguide)
5) Read in the Red/Green raw intensities to "RG"
6) Do an adaptive background correction which is optional
but recommended for GenePix data
7) get the MA plots
8) run normalization
9) try to get MA plots after normalization -> failed again

2007-09-07

The postscript problem in HW #1

The first part for reading in the dataset is fine, like this:

>library("arrayQuality")

>ArrayInfo <- read.marrayInfo("targetBeta7.txt")

>rawdata <- read.GenePix(target=ArrayInfo)

1) Loading the external package first
2) Read array layout into the customer variable "ArrayInfo"
3) Read in the raw data

then we can get MA plots as PNG or JPEG format,

>maQualityPlots(rawdata, dev="png")
or
>maQualityPlots(rawdata, dev="jpeg")

Either output as PNG or JPEG format

However, we can't get the output as postscript format expect some error messages,

> maQualityPlots(rawdata, dev="ps")
[1] "Format error, format will be set to PNG"
[1] TRUE
[1] TRUE
[1] TRUE
.
.
.
> maQualityPlots(rawdata, dev="postscript")
Error in frame() : figure margins too large
>postscript(maQualityPlots(rawdata))

Basically, dev="ps" is not a right parameter in R 2.5.1 under win32. We checked the code for maQualityPlots(), and it actually recognizes "postscript".
If using dev="postscript", we got the error message.
But we did generate postscript ouptout by postscript(cmd), which can output the first plot from maQualityPlots() only (check the ps file here). (we didn't know why yet!)

We checked the manual for maQualityPlots (said nothing) and frame(), and found out frame() is an alias for plot.new(). Then we searched the R-help/discussion group, and it showed there are lots of such problems existed for plot.new() or frame() about "figure margins too large", and it said this problem came from the too small margins or too big/complicated figures, for example, too many points or lines.
People suggested to re-set the margins to resolve this problem. So first we check the margin set for postscript,

>ps.options()
.
$width
[1] 0
$height
[1] 0
.
>ps.options(width=1600, height=1200)
>ps.options()
.
$width
[1] 1600
$height
[1] 1200
.

it surprised us that both width and height are "ZERO".
So we tried to fix it by using ps.options(width=1600, height=1200), and the results look good.

However, after setting up the width and height, maQualityPlots still showed the error "Error in frame() : figure margins too large". Next we approached the issue by decreasing the margins.

>mai=c(0.2,0.2,0.2,0.2)
>mar=c(1,1,1,1)
>par()
.
.
$mai
[1] 0.612 0.492 0.492 0.252
$mar
[1] 5.1 4.1 4.1 2.1
.
.

>par(mai=c(0.2,0.2,0.2,0.2),mar=c(1,1,1,1))

After all, we still can't get the right results. It still showed "Error in frame() : figure margins too large"!

Here is the link for homework #1 (DUE 9/20)

http://www.stat.uiuc.edu/~pingma/stat530/stat530hw.html

R learning