【原】aglient芯片原始數(shù)據(jù)處理

健明 2021-07-14

展開全文

導(dǎo)讀

我多次在學(xué)徒作業(yè)強調(diào)了 3大基因芯片產(chǎn)商里面，就Agilent公司的芯片比較難搞，比如Agilent芯片表達(dá)矩陣處理(學(xué)徒作業(yè)) 以及 oligo包可以處理agilent芯片嗎，這個作業(yè)難度非常高，不過我們生信技能樹優(yōu)秀講師：小潔在繁重的授課壓力下抽空整理了相關(guān)數(shù)據(jù)處理經(jīng)驗分享給大家，下面看她的表演：

本文講的是aglient芯片原始數(shù)據(jù)的處理，參考資料是limma 的userguide文檔。GEO數(shù)據(jù)庫下載的表達(dá)矩陣不符合預(yù)期，比如是空的，或者是有負(fù)值的，那我們就處理一下它的原始數(shù)據(jù)。aglient的芯片應(yīng)用也很廣泛，舉個OSCC的栗子：GSE23558，跟著學(xué)習(xí)學(xué)習(xí)。

1.下載和讀取數(shù)據(jù)

1.1獲取臨床信息數(shù)據(jù)

從前，提到GEO數(shù)據(jù)下載，我們只有GEOquery，神功蓋世，但是死于網(wǎng)速。后來就有了中國人寄幾的GEO鏡像，AnnoProbe包。還沒有正式發(fā)表，就已經(jīng)初露鋒芒了，因簡單易學(xué)，下載迅速，在我們的粉絲圈子里很受歡迎。

rm(list = ls())
library(stringr)
library(AnnoProbe)
library(GEOquery)
library(limma)
gse = "GSE23558"
geoChina(gse)

## you can also use getGEO from GEOquery, by 
## getGEO("GSE23558", destdir=".", AnnotGPL = F, getGPL = F)

## $GSE23558_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 41000 features, 32 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM577914 GSM577915 ... GSM577945 (32 total)
##   varLabels: title geo_accession ... tissue:ch1 (42 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 22072328
## 28433800 
## Annotation: GPL6480

提供一個GSE編號就可以下載啦。因為表達(dá)矩陣是處理過的，我們不要，所以只提取臨床信息表格，從中獲得分組信息。

load("GSE23558_eSet.Rdata")
pd <- pData(gset[[1]])

調(diào)整pd的行名與文件讀取的順序一致，并定義分組信息。

raw_dir = "rawdata/GSE23558_RAW/"
raw_datas = paste0(raw_dir,"/",dir(raw_dir))

#調(diào)整pd與rawdata的順序一致
raw_order = str_extract(raw_datas,"GSM\\d*")
pd = pd[match(raw_order,rownames(pd)),]

group_list <- ifelse(stringr::str_detect(pd$title,"HealthyControl"),"normal","tumor")
group_list <- factor(group_list, levels=c("normal","tumor"))

1.2 讀取原始數(shù)據(jù)

這個原始數(shù)據(jù)下載，在GEO主頁，可能對大家的網(wǎng)絡(luò)有一點點要求，可以參考：下載GEO數(shù)據(jù)太慢？快用axel

x <- read.maimages(raw_datas,
                   source="agilent", 
                   green.only=TRUE,
                   other.columns = "gIsWellAboveBG")

## Read rawdata/GSE23558_RAW//GSM577914.txt 
## Read rawdata/GSE23558_RAW//GSM577915.txt 
## Read rawdata/GSE23558_RAW//GSM577916.txt 
## Read rawdata/GSE23558_RAW//GSM577917.txt 
## Read rawdata/GSE23558_RAW//GSM577918.txt 
## Read rawdata/GSE23558_RAW//GSM577919.txt 
## Read rawdata/GSE23558_RAW//GSM577920.txt 
## Read rawdata/GSE23558_RAW//GSM577921.txt 
## Read rawdata/GSE23558_RAW//GSM577922.txt 
## Read rawdata/GSE23558_RAW//GSM577923.txt 
## Read rawdata/GSE23558_RAW//GSM577924.txt 
## Read rawdata/GSE23558_RAW//GSM577925.txt 
## Read rawdata/GSE23558_RAW//GSM577926.txt 
## Read rawdata/GSE23558_RAW//GSM577927.txt 
## Read rawdata/GSE23558_RAW//GSM577928.txt 
## Read rawdata/GSE23558_RAW//GSM577929.txt 
## Read rawdata/GSE23558_RAW//GSM577930.txt 
## Read rawdata/GSE23558_RAW//GSM577931.txt 
## Read rawdata/GSE23558_RAW//GSM577932.txt 
## Read rawdata/GSE23558_RAW//GSM577933.txt 
## Read rawdata/GSE23558_RAW//GSM577934.txt 
## Read rawdata/GSE23558_RAW//GSM577935.txt 
## Read rawdata/GSE23558_RAW//GSM577936.txt 
## Read rawdata/GSE23558_RAW//GSM577937.txt 
## Read rawdata/GSE23558_RAW//GSM577938.txt 
## Read rawdata/GSE23558_RAW//GSM577939.txt 
## Read rawdata/GSE23558_RAW//GSM577940.txt 
## Read rawdata/GSE23558_RAW//GSM577941.txt 
## Read rawdata/GSE23558_RAW//GSM577942.txt 
## Read rawdata/GSE23558_RAW//GSM577943.txt 
## Read rawdata/GSE23558_RAW//GSM577944.txt 
## Read rawdata/GSE23558_RAW//GSM577945.txt

dim(x)

## [1] 45015    32

2.背景校正和標(biāo)準(zhǔn)化

y <- backgroundCorrect(x, method="normexp")

## Array 1 corrected
## Array 2 corrected
## Array 3 corrected
## Array 4 corrected
## Array 5 corrected
## Array 6 corrected
## Array 7 corrected
## Array 8 corrected
## Array 9 corrected
## Array 10 corrected
## Array 11 corrected
## Array 12 corrected
## Array 13 corrected
## Array 14 corrected
## Array 15 corrected
## Array 16 corrected
## Array 17 corrected
## Array 18 corrected
## Array 19 corrected
## Array 20 corrected
## Array 21 corrected
## Array 22 corrected
## Array 23 corrected
## Array 24 corrected
## Array 25 corrected
## Array 26 corrected
## Array 27 corrected
## Array 28 corrected
## Array 29 corrected
## Array 30 corrected
## Array 31 corrected
## Array 32 corrected

y <- normalizeBetweenArrays(y, method="quantile")
class(y)

## [1] "EList"
## attr(,"package")
## [1] "limma"

3. 基因過濾

去除對照探針
去除匹配不到genesymbol的探針
去除不表達(dá)的探針，去除的標(biāo)準(zhǔn)是：至少在一半樣本中高于背景，通過y(other)gIsWellAboveBG來判斷。
我自己加上了一個，測到多次的基因，只保留一個探針。

Control <- y$genes$ControlType==1L;table(Control)

## Control
## FALSE  TRUE 
## 43529  1486

NoSymbol <- is.na(y$genes$GeneName);table(NoSymbol)

## NoSymbol
## FALSE 
## 45015

IsExpr <- rowSums(y$other$gIsWellAboveBG > 0) >= 16;table(IsExpr)

## IsExpr
## FALSE  TRUE 
## 13088 31927

Isdup <- duplicated(y$genes$GeneName);table(Isdup)

## Isdup
## FALSE  TRUE 
## 30328 14687

yfilt <- y[!Control & !NoSymbol & IsExpr & !Isdup, ]
dim(yfilt)

## [1] 20650    32

可以看到，過濾后剩下了2萬多個探針。

4.得到表達(dá)矩陣

exp = yfilt@.Data[[1]]
boxplot(exp)

exp[1:2,1:2]

##      rawdata/GSE23558_RAW//GSM577914 rawdata/GSE23558_RAW//GSM577915
## [1,]                        9.284154                       11.473334
## [2,]                        7.341236                        7.474406

得到的表達(dá)矩陣沒問題，但行名和列名均有問題。行名應(yīng)該是探針名，列名是樣本名，調(diào)整一下。

4.1獲得樣本名

colnames(exp) = str_extract(colnames(exp),"GSM\\d*")
exp[1:2,1:2]

##      GSM577914 GSM577915
## [1,]  9.284154 11.473334
## [2,]  7.341236  7.474406

4.2 獲得基因名

limma文檔里寫的是用了注釋R包，在本例的原文件是里有探針注釋的，這里直接使用。

可以直接將exp的行名轉(zhuǎn)為基因名。行名不能重復(fù)，所以先去重

anno = yfilt$genes
nrow(anno);nrow(exp)

## [1] 20650

## [1] 20650

rownames(exp)=rownames(anno)
ids = unique(anno$GeneName)
exp = exp[!duplicated(anno$GeneName),]
rownames(exp) = ids
exp[1:4,1:4]

##             GSM577914 GSM577915 GSM577916 GSM577917
## APOBEC3B     9.284154 11.473334 10.439071 11.661000
## A_32_P77178  7.341236  7.474406  7.310818  7.397149
## ATP11B       9.963452  8.915621 10.193873  9.321954
## DNAJA1      13.469790 13.201078 12.827357 13.389431

至此，得到了標(biāo)準(zhǔn)的表達(dá)矩陣。后面要做什么就看你啦，這就相當(dāng)于修復(fù)了一下數(shù)據(jù)庫里那個被標(biāo)準(zhǔn)化過的表達(dá)矩陣。

5.差異分析

design <- model.matrix(~group_list)
fit <- lmFit(exp,design)
fit <- eBayes(fit,trend=TRUE,robust=TRUE)
summary(decideTests(fit))

##        (Intercept) group_listtumor
## Down             0            2102
## NotSig           0           16928
## Up           20650            1620

deg = topTable(fit,coef=2,n=dim(y)[1])
boxplot(exp[rownames(deg)[1],]~group_list)

這里直接走limma的簡易流程，可以畫差異最顯著的那個基因表達(dá)量看看，可以看到差異是超級明顯了！

save(exp,group_list,deg,gse,file = paste0(gse,"deg.Rdata"))

后面的步驟就是我們GEO數(shù)據(jù)挖掘課程的標(biāo)配啦，如果大家對這一系列“騷操作”感興趣，歡迎報名我們的GEO數(shù)據(jù)挖掘課程哈，全年滾動開班，直播互動教學(xué)以及答疑，下一期是7月6號開課，可以花時間了解一下：

生信爆款入門-全球聽（買一得五）（第5期）（可能是最后一期）你的生物信息學(xué)入門課
(必看！)數(shù)據(jù)挖掘第3期（兩天變?nèi)?，實力加量?/a>，醫(yī)學(xué)生/臨床醫(yī)師首選技能提高課

贊賞

共11人贊賞

轉(zhuǎn)藏 分享

QQ空間 QQ好友新浪微博微信

獻(xiàn)花（0） +1

來自：健明 > 《待分類》

舉報/認(rèn)領(lǐng)

0條評論

發(fā)表

請遵守用戶評論公約

類似文章 更多

健明

關(guān)注對話

TA的最新館藏

（IF=48.8）頂刊雜志的帶有超詳細(xì)python單細(xì)胞代碼和數(shù)據(jù)學(xué)習(xí)的文獻(xiàn)分享
給我那剛?cè)腴T單細(xì)胞的朋友畫一個單細(xì)胞中的堆積柱狀圖
六大scRNA測序CNV檢測工具大比拼；新一代seismic工具發(fā)布；用深度學(xué)習(xí)解鎖單細(xì)胞間“對話”；單細(xì)胞助力AD治療新突破等
一行代碼給你的單細(xì)胞UMAP圖添加左下角小箭頭坐標(biāo)軸
社交指南10步曲：與人交談時，如何傳達(dá)生物信息學(xué)是一個令人興奮、迷人且節(jié)奏快速的領(lǐng)域？
基于Signac的scATAC-seq流程學(xué)習(xí)

喜歡該文的人也喜歡更多

熱門閱讀換一換

小男孩‘自慰网亚洲一区二区,亚洲一级在线播放毛片,亚洲中文字幕av每天更新,黄aⅴ永久免费无码,91成人午夜在线精品,色网站免费在线观看,亚洲欧洲wwwww在线观看