【原】單細(xì)胞測序最好的教程（十一）：差異表達(dá)基因分析｜或許比pseudobulk更優(yōu)

健明 2024-01-04 發(fā)布于廣東

展開全文

作者按
本章節(jié)主要講解了單細(xì)胞數(shù)據(jù)的差異表達(dá)基因分析方法，詳細(xì)對比了ttest與DEseq2在所有細(xì)胞進行差異表達(dá)分析的異同，以及為什么要使用元細(xì)胞分析的原因。本教程首發(fā)于單細(xì)胞最好的中文教程^[1]，未經(jīng)授權(quán)許可，禁止轉(zhuǎn)載。
全文字?jǐn)?shù)|預(yù)計閱讀時間: 3000|5min
——Starlitnightly（星夜）

1. 背景

我們在前面注釋的章節(jié)中，研究了不同細(xì)胞的特異性marker（標(biāo)記）基因，但很多時候，我們更關(guān)心在某一類細(xì)胞中，兩種不同狀態(tài)下的組別差異，例如藥物治療與未經(jīng)藥物治療，腫瘤細(xì)胞與正常細(xì)胞（癌旁）細(xì)胞等。因此，我們希望能在單細(xì)胞水平上，進行差異表達(dá)分析。

單細(xì)胞RNA-seq與批量RNA-seq數(shù)據(jù)相比，有著更強的稀疏性，例如dropout現(xiàn)象的普遍出現(xiàn)，這使得我們?nèi)绻麅H僅將每一個細(xì)胞視作樣本進行差異表達(dá)分析，由于顯著性檢驗的P值對于樣本量的敏感度很高，因此如果使用全部細(xì)胞進行差異表達(dá)分析，那么分析結(jié)果中的顯著性差異可能是不可靠的。我們在下面的對比實驗會進一步說明這種現(xiàn)象。為了解決單細(xì)胞稀疏性與樣本過大的問題，我們可以采用一種叫做偽Bulk（pseudo-bulk）的方法來聚合單細(xì)胞數(shù)據(jù)，從而進行差異表達(dá)分析。但這又引入另一個問題，最近的一項研究強調(diào)了偽Bulk的問題，其中推論統(tǒng)計應(yīng)用于統(tǒng)計上不獨立的生物復(fù)制。未能考慮重復(fù)（來自同一個體的細(xì)胞）的內(nèi)在相關(guān)性會增加錯誤發(fā)現(xiàn)率（FDR）。因此，應(yīng)在差異表達(dá)分析之前，應(yīng)先應(yīng)用批量效應(yīng)校正或通過每個個體的總和、平均或隨機效應(yīng)（即偽Bulk生成）對個體內(nèi)的細(xì)胞類型特異性表達(dá)值進行聚合，以解釋樣本內(nèi)相關(guān)性。

元細(xì)胞的概念（代表不同細(xì)胞狀態(tài)的細(xì)胞組，其中元細(xì)胞內(nèi)的變異是由于技術(shù)而非生物來源）被提出作為保持統(tǒng)計效用同時最大化有效數(shù)據(jù)分辨率的一種方式。與偽Bulk不同，SEACells 尋求以與數(shù)據(jù)模態(tài)無關(guān)的方式將單個細(xì)胞聚合成代表不同細(xì)胞狀態(tài)的元細(xì)胞。使用計數(shù)矩陣作為輸入，它提供每個元單元的每個單元權(quán)重、每個元單元的每個單元硬分配以及每個元單元的聚合計數(shù)作為輸出，故在本教程中，我們將展示如何使用SEACells完成差異表達(dá)分析。

import omicverse as ov
import scanpy as sc
import scvelo as scv

ov.utils.ov_plot_set()

       ____            _     _    __                  
      / __ \____ ___  (_)___| |  / /__  _____________ 
     / / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \ 
    / /_/ / / / / / / / /__ | |/ /  __/ /  (__  )  __/ 
    \____/_/ /_/ /_/_/\___/ |___/\___/_/  /____/\___/                                              
    
    Version: 1.5.5, Tutorials: https://omicverse./

2. 加載數(shù)據(jù)

在這里，我們使用pertpy的演示數(shù)據(jù)Kang 數(shù)據(jù)集，它是來自 8 名狼瘡患者 INF-β 治療前后 6 小時前后的 10 倍基于 scRNA-seq 的外周血單核細(xì)胞 (PBMC) 數(shù)據(jù)（總共 16 個樣本）

下載地址：https:///ndownloader/files/34464122

adata = ov.read('data/kang.h5ad')
adata

    # AnnData object with n_obs × n_vars = 24673 × 15706
    #     obs: 'nCount_RNA', 'nFeature_RNA', 'tsne1', 'tsne2', 'label', 'cluster', 'cell_type', 'replicate', 'nCount_SCT', 'nFeature_SCT', 'integrated_snn_res.0.4', 'seurat_clusters'
    #     var: 'name'
    #     obsm: 'X_pca', 'X_umap'

adata.X.max()

與別的方法不一樣，我們在這里使用ttest方法計算差異表達(dá)基因，因此我們需要使用log對數(shù)化處理后的數(shù)據(jù)，因此我們對數(shù)據(jù)進行預(yù)處理

#quantity control
adata=ov.pp.qc(adata,
              tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250})
#normalize and high variable genes (HVGs) calculated
adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,)

#save the whole genes and filter the non-HVGs
adata.raw = adata
adata = adata[:, adata.var.highly_variable_features]

#scale the adata.X
ov.pp.scale(adata)

#Dimensionality Reduction
ov.pp.pca(adata,layer='scaled',n_pcs=50)

adata.obsm['X_mde']=ov.utils.mde(adata.obsm['scaled|original|X_pca'])

ov.utils.embedding(adata,
                   basis='X_mde',
                    frameon='small',
                   color=['replicate','label','cell_type'])

3. 批次效應(yīng)的校正

我們發(fā)現(xiàn)數(shù)據(jù)存在著一定的批次效應(yīng)，我們這里使用scVU進行批次效應(yīng)的校正。當(dāng)然，你也可以使用別的方法，如harmony等，更多的批次效應(yīng)校正的方法見omicverse的教程：https://omicverse./en/latest/Tutorials-single/t_single_batch/

adata=ov.single.batch_correction(adata,batch_key='replicate',
                           methods='scVI',n_layers=2, n_latent=30, gene_likelihood="nb")
adata.obsm["X_mde_scVI"] = ov.utils.mde(adata.obsm["X_scVI"])

#     ...Begin using scVI to correct batch effect


#     GPU available: True (cuda), used: True
#     TPU available: False, using: 0 TPU cores
#     IPU available: False, using: 0 IPUs
#     HPU available: False, using: 0 HPUs
#     LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


#     Epoch 326/326: 100%|██████████| 326/326 [13:22<00:00,  2.47s/it, loss=599, v_num=1]
    
#     `Trainer.fit` stopped: `max_epochs=326` reached.


#     Epoch 326/326: 100%|██████████| 326/326 [13:22<00:00,  2.46s/it, loss=599, v_num=1]



# ```python
# ov.utils.embedding(adata,
#                    basis='X_mde_scVI',
#                     frameon='small',
#                    color=['replicate','label','cell_type'])

4. 差異表達(dá)分析（全細(xì)胞水平）

在使用元細(xì)胞分析前，我想先使用全細(xì)胞水平進行分析，以展示全細(xì)胞水平下的差異表達(dá)分析效果，在教程中，我們只分析CD4 T cells的差異表達(dá)基因，希望能通過CD4 T cells的差異來表明治療前后免疫應(yīng)答響應(yīng)的變化。

我們這里可以使用ttest或者是DESeq2兩種模型，兩種模型的效果我將分別演示：

4.1 基于ttest進行差異表達(dá)分析

對于ttest統(tǒng)計方法而言，數(shù)據(jù)要求滿足正態(tài)分布的基本形式。中心極限定理，是指概率論中討論隨機變量和分布漸近于正態(tài)分布的定理，是數(shù)理統(tǒng)計學(xué)和誤差分析的理論基礎(chǔ)，指出了大量隨機變量近似服從正態(tài)分布的條件。故我們使用原數(shù)據(jù)的對數(shù)標(biāo)準(zhǔn)化后的內(nèi)容，來進行差異表達(dá)分析。

test_adata=adata[adata.obs['cell_type'].isin(['CD4 T cells'])]
dds=ov.bulk.pyDEG(test_adata.to_df(layer='lognorm').T)
dds.drop_duplicates_index()
print('... drop_duplicates_index success')

    # ... drop_duplicates_index success

我們選取label分別為Ctrl與Stim作為需要進行差異分析的兩個組別

treatment_groups=test_adata.obs[test_adata.obs['label']=='stim'].index.tolist()
control_groups=test_adata.obs[test_adata.obs['label']=='ctrl'].index.tolist()
dds.deg_analysis(treatment_groups,control_groups,method='ttest')

并且我們不指定差異表達(dá)倍數(shù)，由算法根據(jù)數(shù)據(jù)的分布自動計算合適的閾值

# -1 means automatically calculates
dds.foldchange_set(fc_threshold=-1,
                   pval_threshold=0.05,
                   logp_max=10)

    # ... Fold change threshold: 0.8351151943206787

我們使用自帶的火山圖函數(shù)plot_volcano來觀察差異表達(dá)基因的分布

dds.plot_volcano(title='DEG Analysis',figsize=(4,4),
                 plot_genes_num=8,plot_genes_fontsize=12,)

我們還可以使用自帶的箱線圖函數(shù)plot_boxplot來觀察差異表達(dá)基因在不同組的差異情況

dds.plot_boxplot(genes=['NELL2','TNFSF13B'],treatment_groups=treatment_groups,
                control_groups=control_groups,figsize=(2,3),fontsize=12,
                 legend_bbox=(2,0.55))

除此之外，我們還可以在低維流型圖umap中，觀察差異表達(dá)基因在不同組的分布情況

ov.utils.embedding(test_adata,
                   basis='X_mde_scVI',
                    frameon='small',
                   color=['label','NELL2','TNFSF13B'])

4.2 基于DEseq2進行差異表達(dá)分析

DEseq2被廣泛使用在差異表達(dá)分析的各種場合，得益于其負(fù)二項分布模型的設(shè)計。使得我們可以使用原始計數(shù)Counts進行直接分析。DESeq2將對原始reads進行建模，使用標(biāo)準(zhǔn)化因子(scale factor)來解釋庫深度的差異。然后，DESeq2估計基因的離散度，并縮小這些估計值以生成更準(zhǔn)確的離散度估計，從而對reads count進行建模。最后，DESeq2擬合負(fù)二項分布的模型，并使用Wald檢驗或似然比檢驗進行假設(shè)檢驗。

與ttest不同，使用DEseq2時不用考慮數(shù)據(jù)與正態(tài)分布之間的關(guān)系，也不用考慮數(shù)據(jù)特征。

test_adata=adata[adata.obs['cell_type'].isin(['CD4 T cells'])]
dds=ov.bulk.pyDEG(test_adata.to_df(layer='counts').T)
dds.drop_duplicates_index()
print('... drop_duplicates_index success')

treatment_groups=test_adata.obs[test_adata.obs['label']=='stim'].index.tolist()
control_groups=test_adata.obs[test_adata.obs['label']=='ctrl'].index.tolist()
dds.deg_analysis(treatment_groups,control_groups,method='DEseq2')

    # ... drop_duplicates_index success
    # Fitting size factors...
    # ... done in 0.20 seconds.
    
    # Fitting dispersions...
    # ... done in 3.69 seconds.
    
    # Fitting dispersion trend curve...
    # ... done in 0.60 seconds.
    
    # logres_prior=nan, sigma_prior=nan
    # Fitting MAP dispersions...
    # ... done in 73.18 seconds.
    
    # Fitting LFCs...
    # ... done in 299.53 seconds.
    
    # Refitting 5 outliers.
    
    # Fitting dispersions...
    # ... done in 0.09 seconds.
    
    # Fitting MAP dispersions...
    # ... done in 0.23 seconds.
    
    # Fitting LFCs...
    # ... done in 0.70 seconds.
    
    # Running Wald tests...
    # ... done in 71.47 seconds.
    
    # Log2 fold change & Wald test p-value: condition Treatment vs Control

# -1 means automatically calculates
dds.foldchange_set(fc_threshold=1.2,
                   pval_threshold=0.05,
                   logp_max=10)

    #... Fold change threshold: 1.2

dds.plot_volcano(title='DEG Analysis',figsize=(4,4),
                 plot_genes_num=8,plot_genes_fontsize=12,)

dds.plot_boxplot(genes=['IFI6','RGCC'],treatment_groups=treatment_groups,
                control_groups=control_groups,figsize=(2,3),fontsize=12,
                 legend_bbox=(2,0.55))

ov.utils.embedding(test_adata,
                   basis='X_mde_scVI',
                    frameon='small',
                   color=['label','IFI6','RGCC'])

5. 基于元細(xì)胞的差異表達(dá)分析

我們可以看到，無論是使用ttest與DEseq2，都可以一定程度上計算出兩個組別的細(xì)胞差異表達(dá)基因，并且也確實在兩個組別中實現(xiàn)了分離。但是，這也帶來了大量的生物學(xué)噪音。兩種細(xì)胞并沒有實現(xiàn)非常完美的分離，差異表達(dá)基因的表達(dá)在兩類細(xì)胞中也有著一定的豐度。

為了解決這一問題，擴大樣本（細(xì)胞）的深度，我們使用SEACells聚合細(xì)胞，然后在元細(xì)胞水平上，執(zhí)行差異表達(dá)分析。值得一提的是，SEACells聚合的元細(xì)胞在信號聚合和細(xì)胞分辨率之間實現(xiàn)了最佳平衡，并且它們捕獲整個表型譜中的細(xì)胞狀態(tài)，包括罕見狀態(tài)。元細(xì)胞保留了樣本之間微妙的生物學(xué)差異，這些差異通過替代方法作為批次效應(yīng)被消除，因此，為數(shù)據(jù)集成提供了比稀疏單個細(xì)胞更好的起點。

5.1 元細(xì)胞初始化

在這里，我們使用scVI作為元細(xì)胞特征向量的初始化，你也可以使用X_pca或者是omicverse計算得到的scaled|original|X_pca來作為元細(xì)胞的初始輸入。

meta_obj=ov.single.MetaCell(adata,use_rep='X_scVI',n_metacells=250,
                           use_gpu=True)

    # Welcome to SEACells GPU!
    # Computing kNN graph using scanpy NN ...
    # computing neighbors
    #     finished: added to `.uns['neighbors']`
    #     `.obsp['distances']`, distances for each pair of neighbors
    #     `.obsp['connectivities']`, weighted adjacency matrix (0:00:03)
    # Computing radius for adaptive bandwidth kernel...



    #   0%|          | 0/24528 [00:00<?, ?it/s]


    # Making graph symmetric...
    # Parameter graph_construction = union being used to build KNN graph...
    # Computing RBF kernel...



    #   0%|          | 0/24528 [00:00<?, ?it/s]


    # Building similarity LIL matrix...



    #   0%|          | 0/24528 [00:00<?, ?it/s]


    # Constructing CSR matrix...

我們需要初始化元細(xì)胞的原型，也就是其在特征空間scVI上的初始位置，這里可能有一些難理解，如果我們在umap圖上來理解的話，可以表示為umap圖上的元細(xì)胞的位置，或許會更加通俗一些。

meta_obj.initialize_archetypes()

    # Building kernel on X_scVI
    # Computing diffusion components from X_scVI for waypoint initialization ... 
    # computing neighbors
    #     finished: added to `.uns['neighbors']`
    #     `.obsp['distances']`, distances for each pair of neighbors
    #     `.obsp['connectivities']`, weighted adjacency matrix (0:00:02)
    # Done.
    # Sampling waypoints ...
    # Done.
    # Selecting 239 cells from waypoint initialization.
    # Initializing residual matrix using greedy column selection
    # Initializing f and g...


    # 100%|██████████| 21/21 [00:00<00:00, 42.44it/s]
    
    # Selecting 11 cells from greedy initialization.

5.2 元細(xì)胞模型訓(xùn)練

初始完模型結(jié)構(gòu)后，我們就需要對元細(xì)胞進行聚合，在這里我們通過訓(xùn)練SEACells模型來完成這一目的

meta_obj.train(min_iter=10, max_iter=50)

    # Randomly initialized A matrix.
    # Setting convergence threshold at 0.28753
    # Starting iteration 1.
    # Completed iteration 1.
    # Converged after 8 iterations.
    # Converged after 9 iterations.
    # Starting iteration 10.
    # Completed iteration 10.
    # Converged after 10 iterations.

我們提供了一個保存與讀取函數(shù)，避免重復(fù)計算

meta_obj.save('seacells/model.pkl')
meta_obj.load('seacells/model.pkl')

5.3 預(yù)測元細(xì)胞

我們可以使用 predicted 來預(yù)測原始 scRNA-seq 數(shù)據(jù)的元胞。有兩種方法可供選擇，一種是 "soft"方法，另一種是 "hard"方法。

在 "soft"方法中，匯總每個 SEACell 中的細(xì)胞，對所有原始數(shù)據(jù)求和 x 為屬于一個 SEACell 的所有細(xì)胞分配權(quán)重。數(shù)據(jù)未標(biāo)準(zhǔn)化，偽原始聚合計數(shù)存儲在 .layer['raw']中。與變量（.var）相關(guān)的屬性會被復(fù)制過來，但每個 SEACell 的相關(guān)屬性必須手動復(fù)制，因為某些屬性可能需要求和或取平均值等，具體取決于屬性。

在 "hard"方法中，匯總每個 SEACell 中的單元格，對屬于一個 SEACell 的所有單元格的所有原始數(shù)據(jù)求和。數(shù)據(jù)未經(jīng)標(biāo)準(zhǔn)化處理，原始匯總計數(shù)存儲在 .layer['raw']中。與變量（.var）相關(guān)的屬性會被復(fù)制過來，但每個 SEACell 的相關(guān)屬性必須手動復(fù)制，因為某些屬性可能需要求和或取平均值等，具體取決于屬性。

我們首先需要串聯(lián)細(xì)胞類型(cell_type)與組別(label)的標(biāo)簽用于預(yù)測。

meta_obj.adata.obs['celltype-label']=[i+'-'+j for i,j in zip(meta_obj.adata.obs['cell_type'],
                                                           meta_obj.adata.obs['label'])]

ad=meta_obj.predicted(method='soft',celltype_label='celltype-label',
                     summarize_layer='lognorm')

  #  100%|██████████| 250/250 [01:03<00:00,  3.94it/s]

ad

    # AnnData object with n_obs × n_vars = 250 × 2000
    #     obs: 'Pseudo-sizes', 'celltype', 'celltype_purity'

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4,4))
ov.utils.embedding(
    meta_obj.adata,
    basis="X_mde_scVI",
    color=['cell_type'],
    frameon='small',
    title="Meta cells",
    #legend_loc='on data',
    legend_fontsize=14,
    legend_fontoutline=2,
    size=10,
    ax=ax,
    alpha=0.2,
    #legend_loc='', 
    add_outline=False, 
    #add_outline=True,
    outline_color='black',
    outline_width=1,
    show=False,
    #palette=ov.utils.blue_color[:],
    #legend_fontweight='normal'
)
ov.single._metacell.plot_metacells(ax,meta_obj.adata,
                                   use_rep='X_mde_scVI',color='#CB3E35',
                                  )

sc.pp.highly_variable_genes(ad,  n_top_genes=1500, inplace=True)
sc.tl.pca(ad, use_highly_variable=True)
sc.pp.neighbors(ad, use_rep='X_pca')
sc.tl.umap(ad)

    # If you pass `n_top_genes`, all cutoffs are ignored.
    # extracting highly variable genes
    #     finished (0:00:00)
    # --> added
    #     'highly_variable', boolean vector (adata.var)
    #     'means', float vector (adata.var)
    #     'dispersions', float vector (adata.var)
    #     'dispersions_norm', float vector (adata.var)
    # computing PCA
    #     on highly variable genes
    #     with n_comps=50
    #     finished (0:00:00)
    # computing neighbors
    #     finished: added to `.uns['neighbors']`
    #     `.obsp['distances']`, distances for each pair of neighbors
    #     `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
    # computing UMAP
    #     finished: added
    #     'X_umap', UMAP coordinates (adata.obsm) (0:00:00)

import matplotlib.pyplot as plt
from matplotlib import patheffects
fig, ax = plt.subplots(figsize=(4,4))
ov.utils.embedding(
    ad,
    basis="X_umap",
    color=['celltype'],
    frameon='small',
    title="metacells celltypes",
    #legend_loc='on data',
    legend_fontsize=14,
    legend_fontoutline=2,
    #size=10,
    ax=ax,
    legend_loc=None, add_outline=False, 
    #add_outline=True,
    outline_color='black',
    outline_width=1,
    show=False,
    #palette=ov.palette()[12:],
    #legend_fontweight='normal'
)
ov.utils.gen_mpl_labels(
    ad,
    'celltype',
    exclude=("None",),  
    basis='X_umap',
    ax=ax,
    adjust_kwargs=dict(arrowprops=dict(arrowstyle='-', color='black')),
    text_kwargs=dict(fontsize= 12 ,weight='bold',
                     path_effects=[patheffects.withStroke(linewidth=2, foreground='w')] ),
)

test_adata=ad[ad.obs['celltype'].isin(['CD4 T cells-stim','CD4 T cells-ctrl'])]
dds_meta=ov.bulk.pyDEG(test_adata.to_df().T)
dds_meta.drop_duplicates_index()
print('... drop_duplicates_index success')

treatment_groups=test_adata.obs[test_adata.obs['celltype']=='CD4 T cells-stim'].index.tolist()
control_groups=test_adata.obs[test_adata.obs['celltype']=='CD4 T cells-ctrl'].index.tolist()
dds_meta.deg_analysis(treatment_groups,control_groups,method='ttest')

# -1 means automatically calculates
dds_meta.foldchange_set(fc_threshold=-1,
                   pval_threshold=0.05,
                   logp_max=10)

   # ... drop_duplicates_index success
   # ... Fold change threshold: 1.4815978396550413

dds_meta.plot_volcano(title='DEG Analysis',figsize=(4,4),
                 plot_genes_num=8,plot_genes_fontsize=12,)

dds_meta.plot_boxplot(genes=['IFIT1','OSCP1'],treatment_groups=treatment_groups,
                control_groups=control_groups,figsize=(2,3),fontsize=12,
                 legend_bbox=(2,0.55))

ov.utils.embedding(test_adata,
                   basis='X_umap',
                    frameon='small',
                   color=['IFIT1','OSCP1','celltype'],
                  cmap='RdBu_r')

ov.utils.embedding(adata,
                   basis='X_mde_scVI',
                    frameon='small',
                   color=['label','IFIT1','OSCP1'])

我們發(fā)現(xiàn)，IFIT1和OSCP1分別在stim和ctrl組中特異性下調(diào)與上調(diào)，其中IFIT1編碼一種含有四肽重復(fù)序列的蛋白質(zhì)，最初被鑒定為干擾素治療誘導(dǎo)的。元細(xì)胞顯示出了更優(yōu)的差異基因的識別優(yōu)勢，并且排除了dropout的干擾，我們對比箱線圖可得到相同的觀點。

test_adata=adata[adata.obs['cell_type'].isin(['CD4 T cells'])]
treatment_groups=test_adata.obs[test_adata.obs['label']=='stim'].index.tolist()
control_groups=test_adata.obs[test_adata.obs['label']=='ctrl'].index.tolist()
dds.plot_boxplot(genes=['IFIT1','OSCP1'],treatment_groups=treatment_groups,
                control_groups=control_groups,figsize=(2,3),fontsize=12,
                 legend_bbox=(2,0.55))

5. 總結(jié)

我們在對單細(xì)胞數(shù)據(jù)進行差異表達(dá)分析的時候，可以從全細(xì)胞和元細(xì)胞兩個角度去考慮，通常來說為了避免生物學(xué)噪音和dropout的干擾，我們會選擇元細(xì)胞作為分析策略來進行。但更多時候，我們會同時使用兩種方法，并進行相互印證，來表明分析的準(zhǔn)確性與可靠性。