R 繪制 heatmap - fhqdddddd的日志 - 網(wǎng)易博客

文汐茂祥 2010-12-10

展開全文

R 繪制 heatmap

R 繪制 heatmap NBA聯(lián)盟50位頂級球員的指標(biāo)表現(xiàn)
介紹如何使用 R 繪制 heatmap 的文章。

今天無意間在Flowingdata看到一篇關(guān)于如何使用 R 來做 heatmap 的文章（請移步到這里）。雖然 heatmap 只是 R 中一個很普通的圖形函數(shù)，但這個例子使用了2008-2009賽季 NBA 50個頂級球員數(shù)據(jù)做了一個極佳的演示，效果非常不錯。對 R 大致了解的童鞋可以直接在 R console 上敲

？heatmap

直接查看幫助即可。

沒有接觸過 R 的童鞋繼續(xù)圍觀，下面會仔細介紹如何使用 R 實現(xiàn) NBA 50位頂級球員指標(biāo)表現(xiàn)熱圖：

關(guān)于 heatmap，中文一般翻譯為“熱圖”，其統(tǒng)計意義wiki上解釋的很清楚：

A heat map is a graphical representation of data where the values taken by a variable in a two-dimensional map are represented as colors.Heat maps originated in 2D displays of the values in a data matrix. Larger values were represented by small dark gray or black squares (pixels) and smaller values by lighter squares.

下面這個圖即是Flowingdata用一些 R 函數(shù)對2008-2009 賽季NBA 50名頂級球員指標(biāo)做的一個熱圖（點擊參看大圖）：

先解釋一下數(shù)據(jù)：

這里共列舉了50位球員，估計愛好籃球的童鞋對上圖右邊的每個名字都會耳熟能詳。這些球員每個人會有19個指標(biāo)，包括打了幾場球（G)、上場幾分鐘（MIN)、得分（PTS)……這樣就行成了一個50行×19列的矩陣。但問題是，數(shù)據(jù)有些多，需要使用一種比較好的辦法來展示，So it comes, heatmap!

簡單的說明：

比如從上面的熱圖上觀察得分前3名（Wade、James、Bryant）PTS、FGM、FGA比較高，但Bryant的FTM、FTA和前兩者就差一些；Wade在這三人中STL是佼佼者；而James的DRB和TRB又比其他兩人好一些……

姚明的3PP（3 Points Percentage）這條數(shù)據(jù)很有意思，非常出色！仔細查了一下這個數(shù)值，居然是100%。仔細回想一下，似乎那個賽季姚明好像投過一個3分，并且中了，然后再也沒有3p。這樣本可真夠小的！

最后是如何做這個熱圖（做了些許修改）：

Step 0. Download R

R 官網(wǎng)：http://www.，它是免費的。官網(wǎng)上面提供了Windows,Mac,Linux版本（或源代碼）的R程序。

Step 1. Load the data

R 可以支持網(wǎng)絡(luò)路徑，使用讀取csv文件的函數(shù)read.csv。

讀取數(shù)據(jù)就這么簡單：

nba <-read.csv("http://datasets./ppg2008.csv", sep=",")

Step 2. Sort data

按照球員得分，將球員從小到大排序：

nba <- nba[order(nba$PTS),]

當(dāng)然也可以選擇MIN,BLK,STL之類指標(biāo)

Step 3. Prepare data

把行號換成行名（球員名稱）：

row.names(nba) <- nba$Name

去掉第一列行號：

nba <- nba[,2:20] # or nba <- nba[,-1]

Step 4. Prepare data, again

把 data frame 轉(zhuǎn)化為我們需要的矩陣格式：

nba_matrix <- data.matrix(nba)

Step 5. Make a heatmap

# R 的默認(rèn)還會在圖的左邊和上邊繪制 dendrogram，使用Rowv=NA, Colv=NA去掉

heatmap(nba_matrix, Rowv=NA, Colv=NA, col=cm.colors(256), revC=FALSE, scale='column')

這樣就得到了上面的那張熱圖。

Step 6. Color selection

或者想把熱圖中的顏色換一下：

heatmap(nba_matrix, Rowv=NA, Colv=NA, col=heat.colors(256), revC=FALSE, scale="column", margins=c(5,10))

Bioinformatics and Computational Biology Solutions Using R and Bioconductor 第10章的

例子：

Heatmaps, or false color images have a reasonably long history, as has the
notion of rearranging the columns and rows to show structure in the data.
They were applied to microarray data by Eisen et al. (1998) and have
become a standard visualization method for this type of data.
A heatmap is a two-dimensional, rectangular, colored grid. It displays
data that themselves come in the form of a rectangular matrix. The color
of each rectangle is determined by the value of the corresponding entry
in the matrix. The rows and columns of the matrix can be rearranged
independently. Usually they are reordered so that similar rows are placed
next to each other, and the same for columns. Among the orderings that
are widely used are those derived from a hierarchical clustering, but many
other orderings are possible. If hierarchical clustering is used, then it is
customary that the dendrograms are provided as well. In many cases the
resulting image has rectangular regions that are relatively homogeneous
and hence the graphic can aid in determining which rows (generally the
genes) have similar expression values within which subgroups of samples
(generally the columns).
The function heatmap is an implementation with many options. In particular,
users can control the ordering of rows and columns independently
from each other. They can use row and column labels of their own choosing
or select their own color scheme.

> library("ALL")
> data("ALL")
> selSamples <- ALL$mol.biol %in% c("ALL1/AF4",
+ "E2A/PBX1")
> ALLs <- ALL[, selSamples]
> ALLs$mol.biol <- factor(ALLs$mol.biol)
> colnames(exprs(ALLs)) <- paste(ALLs$mol.biol,
+ colnames(exprs(ALLs)))

>library("genefilter")
> meanThr <- log2(100)
> g <- ALLs$mol.biol
> s1 <- rowMeans(exprs(ALLs)[, g == levels(g)[1]]) >
+ meanThr
> s2 <- rowMeans(exprs(ALLs)[, g == levels(g)[2]]) >
+ meanThr
> s3 <- rowttests(ALLs, g)$p.value < 2e-04
> selProbes <- (s1 | s2) & s3
> ALLhm <- ALLs[selProbes, ]

>library(RColorBrewer)

> hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
> spcol <- ifelse(ALLhm$mol.biol == "ALL1/AF4",
+ "goldenrod", "skyblue")
> heatmap(exprs(ALLhm), col = hmcol, ColSideColors = spcol)

最后，可以

>help(heatmap) 查找?guī)椭纯磶椭o提供的例子

也可以看這的例子：

http://www2./fac/sci/moac/students/peter_cock/r/heatmap/

Using R to draw a Heatmap from Microarray Data

[c]

The first section of this page uses R to analyse an Acute lymphocytic leukemia (ALL) microarray dataset, producing a heatmap (with dendrograms) of genes differentially expressed between two types of leukemia.

There is a follow on page dealing with how to do this from Python using RPy.

The original citation for the raw data is "Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival" by Chiaretti et al. Blood 2004. (PMID: 14684422)

The analysis is a "step by step" recipe based on this paper, Bioconductor: open software development for computational biology and bioinformatics, Gentleman et al. 2004. Their Figure 2 Heatmap, which we recreate, is reproduced here:
[Published heatmap, gentleman et al. 2004]

Assuming you have a recent version of R (from The R Project) and BioConductor (see Windows XP installation instructions), the example dataset can be downloaded as the BioConductor ALL package.

You should be able to install this from within R as follows:

> source("http://www./biocLite.R")
> biocLite("ALL")

Running bioCLite version 0.1  with R version  2.1.1 
...

Alternatively, you can download the package by hand from here or here.

If you are using Windows, download ALL_1.0.2.zip (or similar) and save it. Then from within the R program, use the menu option "Packages", "Install package(s) from local zip files..." and select the ZIP file.

On Linux, download ALL_1.0.2.tar.gz (or similar) and use sudo R CMD INSTALL ALL_1.0.2.tar.gz at the command prompt.

With that out of the way, you should be able to start R and load this package with the library and data commands:

> library("ALL")
Loading required package: Biobase
Loading required package: tools
Welcome to Bioconductor 
         Vignettes contain introductory material.  To view, 
         simply type: openVignette() 
         For details on reading vignettes, see
         the openVignette help page.
> data("ALL")

If you inspect the resulting ALL variable, it contains 128 samples with 12625 genes, and associated phenotypic data.

> ALL
Expression Set (exprSet) with 
        12625 genes
        128 samples
                 phenoData object with 21 variables and 128 cases
         varLabels
                cod:  Patient ID
                diagnosis:  Date of diagnosis
                sex:  Gender of the patient
                age:  Age of the patient at entry
                BT:  does the patient have B-cell or T-cell ALL
                remission:  Complete remission(CR), refractory(REF) or NA. Derived from CR
                CR:  Original remisson data
                date.cr:  Date complete remission if achieved
                t(4;11):  did the patient have t(4;11) translocation. Derived from citog
                t(9;22):  did the patient have t(9;22) translocation. Derived from citog
                cyto.normal:  Was cytogenetic test normal? Derived from citog
                citog:  original citogenetics data, deletions or t(4;11), t(9;22) status
                mol.biol:  molecular biology
                fusion protein:  which of p190, p210 or p190/210 for bcr/able
                mdr:  multi-drug resistant
                kinet:  ploidy: either diploid or hyperd.
                ccr:  Continuous complete remission? Derived from f.u
                relapse:  Relapse? Derived from f.u
                transplant:  did the patient receive a bone marrow transplant? Derived from f.u
                f.u:  follow up data available
                date last seen:  date patient was last seen

We can looks at the results of molecular biology testing for the 128 samples:

> ALL$mol.biol
  [1] BCR/ABL  NEG      BCR/ABL  ALL1/AF4 NEG      NEG      NEG      NEG      NEG     
 [10] BCR/ABL  BCR/ABL  NEG      E2A/PBX1 NEG      BCR/ABL  NEG      BCR/ABL  BCR/ABL 
 [19] BCR/ABL  BCR/ABL  NEG      BCR/ABL  BCR/ABL  NEG      ALL1/AF4 BCR/ABL  ALL1/AF4
      ...  
[127] NEG      NEG     
Levels: ALL1/AF4 BCR/ABL E2A/PBX1 NEG NUP-98 p15/p16

Ignoring the samples which came back negative on this test (NEG), most have been classified as having a translocation between chromosomes 9 and 22 (BCR/ABL), or a translocation between chromosomes 4 and 11 (ALL1/AF4).

For the purposes of this example, we are only interested in these two subgroups, so we will create a filtered version of the dataset using this as a selection criteria:

> eset <- ALL[, ALL$mol.biol %in% c("BCR/ABL", "ALL1/AF4")]

The resulting variable, eset, contains just 47 samples - each with the full 12,625 gene expression levels.

This is far too much data to draw a heatmap with, but we can do one for the first 100 genes as follows:

> heatmap(exprs(eset[1:100,]))

According to the BioConductor paper we are following, the next step in the analysis was to use the lmFit function (from the limma package) to look for genes differentially expressed between the two groups. The fitted model object is further processed by the eBayes function to produce empirical Bayes test statistics for each gene, including moderated t-statistics, p-values and log-odds of differential expression.

> library(limma)
> f <- factor(as.character(eset$mol.biol))
> design <- model.matrix(~f)
> fit <- eBayes(lmFit(eset,design))

If the limma package isn't installed, you'll need to install it first:

> source("http://www./biocLite.R")
> biocLite("limma")

Running bioCLite version 0.1  with R version  2.1.1 
...

We can now reproduce Figure 1 from the paper.

> topTable(fit, coef=2)
              ID         M        A         t      P.Value        B
1016     1914_at -3.076231 4.611284 -27.49860 5.887581e-27 56.32653
7884    37809_at -3.971906 4.864721 -19.75478 1.304570e-20 44.23832
6939    36873_at -3.391662 4.284529 -19.61497 1.768670e-20 43.97298
10865   40763_at -3.086992 3.474092 -17.00739 7.188381e-18 38.64615
4250    34210_at  3.618194 8.438482  15.45655 3.545401e-16 35.10692
11556   41448_at -2.500488 3.733012 -14.83924 1.802456e-15 33.61391
3389    33358_at -2.269730 5.191015 -12.96398 3.329289e-13 28.76471
8054    37978_at -1.036051 6.937965 -10.48777 6.468996e-10 21.60216
10579 40480_s_at  1.844998 7.826900  10.38214 9.092033e-10 21.27732
330      1307_at  1.583904 4.638885  10.25731 1.361875e-09 20.89145

The leftmost numbers are row indices, ID is the Affymetrix HGU95av2 accession number, M is the log ratio of expression, A is the log average expression, t the moderated t-statistic, and B is the log odds of differential expression.

Next, we select those genes that have adjusted p-values below 0.05, using a very stringent Holm method to select a small number (165) of genes.

> selected  <- p.adjust(fit$p.value[, 2]) <0.05
> esetSel <- eset [selected, ]

The variable esetSel has data on (only) 165 genes for all 47 samples . We can easily produce a heatmap as follows (in R-2.1.1 this defaults to a yellow/red "heat" colour scheme):

> heatmap(exprs(esetSel))

[Heatmap picture, default colours]

If you have the topographical colours installed (yellow-green-blue), you can do this:

> heatmap(exprs(esetSel), col=topo.colors(100))

[Heatmap figure]

This is getting very close to Gentleman et al.'s Figure 2, except they have added a red/blue banner across the top to really emphasize how the hierarchical clustering has correctly split the data into the two groups (10 and 37 patients).

To do that, we can use the heatmap function's optional argument of ColSideColors. I created a small function to map the eselSet$mol.biol values to red (#FF0000) and blue (#0000FF), which we can apply to each of the molecular biology results to get a matching list of colours for our columns:

> color.map <- function(mol.biol) { if (mol.biol=="ALL1/AF4") "#FF0000" else "#0000FF" }
> patientcolors <- unlist(lapply(esetSel$mol.bio, color.map))
> heatmap(exprs(esetSel), col=topo.colors(100), ColSideColors=patientcolors)

[Heatmap figure]

Looks pretty close now, doesn't it:
[Heatmap figure]

To recap, this is "all" we needed to type into R to achieve this:

library("ALL")
data("ALL")
eset <- ALL[, ALL$mol.biol %in% c("BCR/ABL", "ALL1/AF4")]
library("limma")
f <- factor(as.character(eset$mol.biol))
design <- model.matrix(~f)
fit <- eBayes(lmFit(eset,design))
selected  <- p.adjust(fit$p.value[, 2]) <0.05
esetSel <- eset [selected, ]
color.map <- function(mol.biol) { if (mol.biol=="ALL1/AF4") "#FF0000" else "#0000FF" }
patientcolors <- unlist(lapply(esetSel$mol.bio, color.map))
heatmap(exprs(esetSel), col=topo.colors(100), ColSideColors=patientcolors)

One subtle point in the previous examples is that the heatmap function has automatically scaled the colours for each row (i.e. each gene has been individually normalised across patients). This can be disabled using scale="none", which you might want to do if you have already done your own normalisation (or this may not be appropriate for your data):

heatmap(exprs(esetSel), col=topo.colors(75), scale="none", ColSideColors=patientcolors, cexRow=0.5)

[Heatmap figure]

You might also have noticed in the above snippet, that I have shrunk the row captions which were so big they overlapped each other. The relevant options are cexRow and cexCol.

So far so good - but what if you wanted a key to the colours shown? The heatmap function doesn't offer this, but the good news is that heatmap.2 from the gplots library does. In fact, it offers a lot of other features, many of which I deliberately turn off in the following example:

library("gplots")
heatmap.2(exprs(esetSel), col=topo.colors(75), scale="none", ColSideColors=patientcolors,
          key=TRUE, symkey=FALSE, density.info="none", trace="none", cexRow=0.5)

[Heatmap picture, topographical colours without scaling, with patient type colour bar and color key]

By default, heatmap.2 will also show a trace on each data point (removed this with trace="none"). If you ask for a key (using key=TRUE) this function will actually give you a combined "color key and histogram", but that can be overridden (with density.info="none").

Don't like the colour scheme? Try using the functions bluered/redblue for a red-white-blue spread, or redgreen/greenred for the red-black-green colour scheme often used with two-colour microarrays:

library("gplots")
heatmap.2(exprs(esetSel), col=redgreen(75), scale="row", ColSideColors=patientcolors,
           key=TRUE, symkey=FALSE, density.info="none", trace="none", cexRow=0.5)

[Heatmap figure]

So, how can we do that from within Python? One way is using RPy (R from Python), and this is discussed on this page.

P.S. If you want to use heatmap.2 from within python using RPy, use the syntax heatmap_2 due to the differences in how R and Python handle full stops and underscores.

What about other microarray data?

Well, I have also documented how you can load NCBI GEO SOFT files into R as a BioConductor expression set object. As long as you can get your data into R as a matrix or data frame, converting it into an exprSet shouldn't be too hard.

本站是提供個人知識管理的網(wǎng)絡(luò)存儲空間，所有內(nèi)容均由用戶發(fā)布，不代表本站觀點。請注意甄別內(nèi)容中的聯(lián)系方式、誘導(dǎo)購買等信息，謹(jǐn)防詐騙。如發(fā)現(xiàn)有害或侵權(quán)內(nèi)容，請點擊一鍵舉報。

小男孩‘自慰网亚洲一区二区,亚洲一级在线播放毛片,亚洲中文字幕av每天更新,黄aⅴ永久免费无码,91成人午夜在线精品,色网站免费在线观看,亚洲欧洲wwwww在线观看

R 繪制 heatmap - fhqdddddd的日志 - 網(wǎng)易博客

R 繪制 heatmap