希望得知TCGA的每个癌种中,T cell占比分布。
或者说哪种肿瘤T cell在所有细胞中占比多(浸润的多/热肿瘤)
gene expression matrix:HUGO gene symbols as row names and the columns are samples
xCell uses the expression levels ranking and not the actual values, thus normalization does not have an effect, however normalizing to gene length is required,也就是说microarray,FPKM,TPM值可以直接作为输入,但是CPM,RSEM不可以。
As a minimal requirement, at least 5000 genes are required to be available in the input gene expression matrix; however, low number of shared genes may affect the accuracy of the results. 也就是输入矩阵不可以是稀疏矩阵比如单细胞的转录组数据,矩阵最少有5000个基因(考虑有些样本不表达一些基因,最好是输入行多一些较好)
# full list of signatures
xCell.data$signatures
# 查看内部数据结构
str(xCell.data$signatures)
# 实质是 GSEAbase结构的存放方式,共489个gene set
rawEnrichmentAnalysis function reads a gene expression matrix and returns an enrichment score for each of the 64 cell types across the input gene expression samples.
该软件运行第一步,计算enrichment score
For each cell type, the average of the multiple scores from the multiple corresponding signatures is calculated. Finally, average scores are shifted such that the minimal score for each cell type is zero.
这一句话的意思是说,一个cell type enrichment score 将跨越整个矩阵的列(跨样本)进行数值矫正,(我记得做GSEA分析时enrichment score是可以为负数的,可能是这个原因)将最小值转变为0,也就是整个数列偏移让最小值等于0.
当你的某个 cell type enrichment score 跨样本一样时,enrichment score会变得很小或者同时为0,所以要求样本中细胞组成占比最要多样化。
transformScores function is used to transform scores from raw enrichment scores to a linear scale that resembles percentages, xCell uses precomputed calibration parameters for the transformations.
这是该软件运行的第二步,将enrichment score 转化为 百分比。
spillOver function is used to perform the spill-over compensation applied by xCell to reduce dependency between scores of
closely related cell types. The spill over is performed using a precomputed dependency matrix ‘K‘.
这是该软件运行的第三步,说实话我没理解。
# install
devtools::install_github("dviraran/xCell")
# load packages
library(xCell)
# gene symbols
length(xCell.data$genes)
# gene set
str(xCell.data$signatures)
?xCellAnalysis
# xCellAnalysis(expr, signatures = NULL, genes = NULL, spill = NULL,
# rnaseq = TRUE, file.name = NULL, scale = TRUE, alpha = 0.5,
# save.raw = FALSE, parallel.sz = 4, parallel.type = "SOCK",
# cell.types.use = NULL)
# expr: the gene expression data set. A matrix with row names as symbols and columns as samples.
# signatures: a GMT object of signatures.
# genes: list of genes to use in the analysis.
# spill: the Spillover object for adjusting the scores.
# rnaseq: if true than use RNAseq spillover and calibration paramters, else use array parameters.
# file.name: string for the file name for saving the scores. Default is NULL.
# scale: if TRUE, uses scaling to trnasform scores using fit.vals
# alpha: a value to override the spillover alpha parameter. Deafult = 0.5
# save.raw: TRUE to save a raw
# parallel.sz: integer for the number of threads to use. Default is 4.
# parallel.type: Type of cluster architecture when using snow.
# 'SOCK' or 'FORK'. Fork is faster, but is not supported in windows.
# cell.types.use:
# a character list of the cell types to use in the analysis. If NULL runs xCell with all cell types.
# The spillover compensation step may over compensate, thus it is always better to run xCell
# with a list of cell types that are expected to be in the mixture. The names of cell types
# in this list must be a subset of the cell types that are inferred by xCell.
# gene expression matrix
setwd("G:/20240103-- T cell enrichment score/")
list.files()
expr_file <- "G:/20240103-- T cell enrichment score/tcga_RSEM_gene_fpkm/tcga_RSEM_gene_fpkm"
# ananlysis own gene expression martix
exprMatrix = read.table(expr_file,header=TRUE,row.names=1, as.is=TRUE)
dim(exprMatrix)
# [1] 60498 10535
exprMatrix[1:5,1:5]
# TCGA.19.1787.01 TCGA.S9.A7J2.01 TCGA.G3.A3CH.11 TCGA.EK.A2RE.01 TCGA.44.6778.01
# ENSG00000242268.2 -9.966 -0.394 -9.966 -9.966 -9.966
# ENSG00000259041.1 -9.966 -9.966 -9.966 -9.966 -9.966
# ENSG00000270112.3 -4.293 -3.816 -9.966 -9.966 -6.506
# ENSG00000167578.16 4.832 4.196 3.395 3.910 4.903
# ENSG00000278814.1 -9.966 -9.966 -9.966 -9.966 -9.966
# 需要转换一下 rownames
probe <- "G:/20240103-- T cell enrichment score/tcga_RSEM_gene_fpkm/probeMap%2Fgencode.v23.annotation.gene.probemap"
probe <- read.table(probe,header=TRUE,row.names=1)
head(probe)
# gene chrom chromStart chromEnd strand
# ENSG00000223972.5 DDX11L1 chr1 11869 14409 +
# ENSG00000227232.5 WASH7P chr1 14404 29570 -
# ENSG00000278267.1 MIR6859-1 chr1 17369 17436 -
# ENSG00000243485.3 RP11-34P13.3 chr1 29554 31109 +
# ENSG00000274890.1 MIR1302-2 chr1 30366 30503 +
# ENSG00000237613.2 FAM138A chr1 34554 36081 -
probe <- read.table(probe,header=TRUE,row.names=1)
probe <- probe[!duplicated(probe$gene),]
# sum(duplicated(probe$gene))
newrow <- rownames(probe)
exprMatrix <- exprMatrix[newrow,]
rownames(exprMatrix) <- probe$gene
# This function performs all three steps in xCell, which can be performed seperately as well:
# rawEnrichmentAnalysis
# transformScores
# spillOver
scores <- xCellAnalysis(exprMatrix)
dim(scores)
scores[1:10,1:10]
# write.table(scores,file = "xCell_scores_TCGA_XENA.csv",sep=",")
Tcells <- rownames(scores)[grep(rownames(scores),pattern = "^CD")]
NK <- c("NK cells","NKT")
Tregs <- c("Tregs")
sub_scores <- scores[Tcells,]
dim(sub_scores)
total_Tscore <- apply(sub_scores,MARGIN = 2,sum,na.rm=TRUE)
total_Tscore <- round(total_Tscore,digits = 4)
# 读取病例信息,进行癌种分类以及可视化
list.files()
phenotype <- read.table("TCGA_phenotype_denseDataOnlyDownload.tsv",sep="\t",header = T,row.names = 1)
dim(phenotype)
head(phenotype)
table(phenotype$X_primary_disease)
patiens <- str_replace_all(colnames(sub_scores),pattern = "\\.",replacement = "-")
sub_phenotype <- phenotype[patiens,]
df <- cbind(sub_phenotype,total_Tscore)
head(df)
df <- na.omit(df)
ggplot(df,aes(x = X_primary_disease,y = total_Tscore,group = X_primary_disease))+
geom_boxplot(fill="darkblue")+
xlab("")+
theme(axis.text.x = element_text(angle = 60, hjust = 1))+
mytheme
# apply(scores,MARGIN = 2,sum,na.rm=T)
df2 <- df[df$X_primary_disease != 'diffuse large B-cell lymphoma' & df$X_primary_disease != "thymoma",]
df2[df2$total_Tscore >=1,"total_Tscore"] <- 1
ggplot(df2,aes(x = X_primary_disease,y = total_Tscore,group = X_primary_disease))+
geom_boxplot(fill="darkblue")+
xlab("")+
theme(axis.text.x = element_text(angle = 60, hjust = 1))+
mytheme
XENA 网站放了一个数据集:immune signature scores (denis
e wolf et al),可以和这个数据集相互验证一下
这个工具的结果也仅仅是有参考意义,enrichment score 是对基因表达量的一种抽象总结,enrichment score又会转化为相关性有限的百分比,基因表达量和细胞占比也仅仅是一种相关性关系,这些数据都不能真实的反应细胞分类后的统计结果。
# list.files()
xena_score <- read.table("TCGA_pancancer_10852whitelistsamples_68ImmuneSigs.xena",
header = T,sep="\t")
xena_score[1:10,1:10]
xena_score$X
colnames(xena_score) <- str_replace_all(colnames(xena_score),pattern = "\\.",replacement ="-")
col <- intersect(rownames(sub_phenotype),colnames(xena_score))
xena_Tscore <- xena_score[xena_score$X == "Tcell_21978456",col]
xena_Tscore <- as.data.frame(t(xena_Tscore))
xena_Tscore$cancertype <- sub_phenotype[rownames(xena_Tscore),"X_primary_disease"]
head(xena_Tscore)
colnames(xena_Tscore) <- c("score","cancertype")
ggplot(xena_Tscore,aes(x = cancertype,y = score,group = cancertype))+
geom_boxplot(fill="darkblue")+
xlab("")+
theme(axis.text.x = element_text(angle = 60, hjust = 1))+
mytheme