Kidney Renal Papillary Cell Carcinoma (KIRP) is a distinct subtype of renal cell carcinoma that represents approximately 10–15% of all kidney cancers. It is characterized by unique histological features and molecular alterations that differ from the more common clear cell renal carcinoma. KIRP is further classified into two main subtypes, type 1 and type 2, which vary in their cellular characteristics and clinical behavior (Delahunt & Eble, 1997). Despite advances in understanding some genetic mutations involved in KIRP, such as alterations in the MET gene and fumarate hydratase (FH), much remains unknown about the full spectrum of molecular changes driving this disease (TCGA Research Network, 2016; Menko et al., 2014).
RNA sequencing (RNA-seq) is a powerful technology that enables the comprehensive analysis of gene expression across the entire transcriptome. By examining RNA-Seq data from 50 tumor samples and 50 matched normal kidney tissue controls, we can identify differentially expressed genes (DEGs) and disrupted molecular pathways specific to KIRP. This analysis will deepen our understanding of the biological mechanisms underlying tumor initiation and progression. Additionally, it may reveal novel biomarkers for diagnosis and potential therapeutic targets, ultimately contributing to improved clinical management of KIRP patients (Choueiri et al., 2015; Truong & Shen, 2011).
This project focuses on the analysis of RNA-seq count data derived from KIRP samples obtained from The Cancer Genome Atlas (TCGA). The goal is to explore the molecular differences between tumor and normal tissues through a comprehensive bioinformatics pipeline. Key steps include data preprocessing, filtering for protein-coding genes using the biomaRt package, and identifying DEGs via the edgeR package. Visualization techniques such as volcano plots and heatmaps are used to interpret the differential expression results. Gene set enrichment analysis (GSEA) is performed using the fgsea package to uncover significantly altered pathways in cancer. In addition, we investigate transcription factors (TFs) with enriched binding motifs in the promoters of regulated genes and examine protein-protein interaction (PPI) networks using STRING and igraph. This multi-layered approach provides insights into the transcriptional landscape of KIRP and highlights potential biomarkers or therapeutic targets for further research.
library(biomaRt) # For querying Ensembl databases to retrieve gene information
library(edgeR) # For differential expression analysis of RNA-seq data
library(ggplot2) # For data visualization
library(ComplexHeatmap) # For creating customizable heatmaps
library(circlize) # For being able to use color functions in ComplexHeatmap
library(RColorBrewer) # For color palettes
library(fgsea) # For performing gene set enrichment analysis
library(msigdbr) # For loading MSigDB gene sets
library(MotifDb) # For working with transcription factor binding motifs
library(PWMEnrich) # For motif enrichment analysis
library(PWMEnrich.Hsapiens.background) # For human background motif data
library(Biostrings) # For handling biological sequences
library(GenomicRanges) # For working with genomic ranges and annotations
library(TxDb.Hsapiens.UCSC.hg38.knownGene) # For gene annotations and genomic features (hg38)
library(TxDb.Hsapiens.UCSC.hg19.knownGene) # For gene annotations and genomic features (hg19)
library(BSgenome.Hsapiens.UCSC.hg38) # For accessing the human genome sequence (hg38)
library(dplyr) # For data manipulation and transformation
library(STRINGdb) # For accessing the STRING database for protein-protein interactions
library(igraph) # For creating and analyzing networks
library(ggraph) # For visualizing networks with ggplot2 syntax
library(tidygraph) # For tidy data manipulation of graph objects
# Setting dynamic working directory
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
# Reading the KIRP data
load("./BR_project_data_kidney_renal_papillary_cell_carcinoma.RData")
# Exploring the data
ls()
## [1] "c_anno_df" "params" "r_anno_df" "raw_counts_df"
## TCGA-B0-5696-11A TCGA-B0-5706-11A TCGA-CJ-6030-11A
## ENSG00000278704 0 0 0
## ENSG00000277400 0 0 0
## ENSG00000274847 0 0 0
## ENSG00000277428 0 0 0
## ENSG00000276256 0 0 0
## ENSG00000276932 0 0 0
## TCGA-CJ-5682-01A TCGA-CW-6090-11A
## ENSG00000278704 0 0
## ENSG00000277400 0 0
## ENSG00000274847 0 0
## ENSG00000277428 0 0
## ENSG00000276256 0 0
## ENSG00000276932 0 0
## TCGA-B0-5696-11A TCGA-B0-5706-11A TCGA-CJ-6030-11A TCGA-CJ-5682-01A
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0.0
## Median : 1 Median : 1.0 Median : 1 Median : 1.0
## Mean : 1067 Mean : 967.9 Mean : 1491 Mean : 828.3
## 3rd Qu.: 169 3rd Qu.: 164.0 3rd Qu.: 236 3rd Qu.: 126.0
## Max. :1671266 Max. :1921435.0 Max. :2356239 Max. :1311502.0
## TCGA-CW-6090-11A TCGA-CW-5585-11A TCGA-CJ-6033-11A TCGA-CZ-4864-11A
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 1 Median : 1 Median : 1 Median : 2
## Mean : 1202 Mean : 1340 Mean : 1009 Mean : 1476
## 3rd Qu.: 176 3rd Qu.: 200 3rd Qu.: 162 3rd Qu.: 210
## Max. :1871220 Max. :2338889 Max. :2478321 Max. :4221197
## TCGA-CZ-5984-11A TCGA-B8-5549-11A TCGA-BP-4992-01A TCGA-CJ-4872-01A
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 1 Median : 1 Median : 3 Median : 2
## Mean : 1235 Mean : 1167 Mean : 1132 Mean : 1166
## 3rd Qu.: 202 3rd Qu.: 196 3rd Qu.: 155 3rd Qu.: 204
## Max. :3612075 Max. :1182164 Max. :8586200 Max. :1669574
## TCGA-B2-5636-11A TCGA-BP-4763-01A TCGA-AK-3427-01A TCGA-B8-A54I-01A
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 2 Median : 1 Median : 1 Median : 1
## Mean : 1195 Mean : 1047 Mean : 1160 Mean : 1070
## 3rd Qu.: 216 3rd Qu.: 163 3rd Qu.: 92 3rd Qu.: 134
## Max. :2530340 Max. :1021672 Max. :2486228 Max. :5473939
## TCGA-B0-5711-01A TCGA-CW-5581-11A TCGA-CZ-5985-11A TCGA-B0-5402-11A
## Min. : 0 Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0
## Median : 1 Median : 1.0 Median : 1.0 Median : 2
## Mean : 1128 Mean : 931.5 Mean : 777.6 Mean : 1397
## 3rd Qu.: 177 3rd Qu.: 140.0 3rd Qu.: 100.0 3rd Qu.: 240
## Max. :1030598 Max. :1796358.0 Max. :2238533.0 Max. :3691184
## TCGA-B0-5705-11A TCGA-CZ-5453-11A TCGA-CZ-5458-11A TCGA-CZ-5452-11A
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 2 Median : 2 Median : 2 Median : 1
## Mean : 1331 Mean : 1397 Mean : 1798 Mean : 1168
## 3rd Qu.: 208 3rd Qu.: 235 3rd Qu.: 286 3rd Qu.: 193
## Max. :3454246 Max. :1367858 Max. :1774756 Max. :1136014
## TCGA-AK-3456-01A TCGA-CJ-5689-11A TCGA-CZ-5986-01A TCGA-DV-5573-01A
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 1.0 Median : 1.0 Median : 1.0 Median : 2.0
## Mean : 891.7 Mean : 853.6 Mean : 713.8 Mean : 963.6
## 3rd Qu.: 119.0 3rd Qu.: 159.0 3rd Qu.: 106.0 3rd Qu.: 206.0
## Max. :2175270.0 Max. :1137302.0 Max. :573980.0 Max. :847629.0
## TCGA-B8-5163-01A TCGA-CZ-5451-11A TCGA-B8-A54J-01A TCGA-BP-4999-01A
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 1 Median : 1 Median : 2 Median : 3
## Mean : 1617 Mean : 1456 Mean : 1047 Mean : 1611
## 3rd Qu.: 244 3rd Qu.: 231 3rd Qu.: 185 3rd Qu.: 280
## Max. :1208410 Max. :2571419 Max. :1090268 Max. :2670620
## TCGA-BP-4982-01A TCGA-B8-5552-11A TCGA-BP-4760-01A TCGA-B0-5097-01A
## Min. : 0.0 Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0
## Median : 3.0 Median : 1 Median : 1.0 Median : 1
## Mean : 1452.3 Mean : 1134 Mean : 1104.1 Mean : 1032
## 3rd Qu.: 277.2 3rd Qu.: 192 3rd Qu.: 170.2 3rd Qu.: 161
## Max. :1404119.0 Max. :710732 Max. :630137.0 Max. :751463
## TCGA-BP-4975-01A TCGA-CZ-5462-11A TCGA-BP-5192-01A TCGA-CW-6087-11A
## Min. : 0.0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 2.0 Median : 2 Median : 1 Median : 2
## Mean : 1777.0 Mean : 1811 Mean : 1261 Mean : 1446
## 3rd Qu.: 277.2 3rd Qu.: 283 3rd Qu.: 180 3rd Qu.: 198
## Max. :2102684.0 Max. :1992403 Max. :1142488 Max. :4231548
## TCGA-B0-5703-01A TCGA-AS-3777-01A TCGA-CW-6097-01A TCGA-B0-5711-11A
## Min. : 0.0 Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0
## Median : 1.0 Median : 2 Median : 1.0 Median : 2
## Mean : 900.3 Mean : 1435 Mean : 978.1 Mean : 1332
## 3rd Qu.: 122.0 3rd Qu.: 144 3rd Qu.: 171.0 3rd Qu.: 255
## Max. :2035471.0 Max. :3540438 Max. :396749.0 Max. :1305357
## TCGA-BP-4756-01A TCGA-BP-4968-01A TCGA-CW-5584-11A TCGA-CZ-5469-11A
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 2 Median : 2 Median : 1 Median : 2
## Mean : 1220 Mean : 1421 Mean : 1009 Mean : 1441
## 3rd Qu.: 202 3rd Qu.: 236 3rd Qu.: 183 3rd Qu.: 244
## Max. :3352500 Max. :2533030 Max. :1199558 Max. :1554627
## TCGA-B0-4710-01A TCGA-A3-3358-11A TCGA-B0-5115-01A TCGA-AK-3436-01A
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0
## Median : 3 Median : 1.0 Median : 2 Median : 1
## Mean : 1092 Mean : 931.6 Mean : 1144 Mean : 1024
## 3rd Qu.: 247 3rd Qu.: 155.0 3rd Qu.: 210 3rd Qu.: 122
## Max. :854364 Max. :1967076.0 Max. :1209950 Max. :964895
## TCGA-A3-3306-01A TCGA-B8-4620-11A TCGA-CZ-5470-11A TCGA-B0-5712-11A
## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0
## Median : 1.0 Median : 1.0 Median : 2 Median : 2
## Mean : 783.4 Mean : 702.8 Mean : 1517 Mean : 1198
## 3rd Qu.: 115.0 3rd Qu.: 129.0 3rd Qu.: 263 3rd Qu.: 229
## Max. :568914.0 Max. :1368123.0 Max. :1914992 Max. :1378659
## TCGA-CZ-4857-01A TCGA-CJ-5672-11A TCGA-CJ-5677-11A TCGA-B0-5702-01A
## Min. : 0.0 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0.0
## Median : 2.0 Median : 2 Median : 1 Median : 0.0
## Mean : 933.5 Mean : 1363 Mean : 1189 Mean : 430.8
## 3rd Qu.: 179.0 3rd Qu.: 228 3rd Qu.: 194 3rd Qu.: 32.0
## Max. :722856.0 Max. :2539216 Max. :2716893 Max. :331564.0
## TCGA-B8-4622-11A TCGA-A3-3387-11A TCGA-CZ-5455-11A TCGA-CZ-5989-11A
## Min. : 0.0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 1.0 Median : 1 Median : 1 Median : 2
## Mean : 909.8 Mean : 1039 Mean : 1416 Mean : 1101
## 3rd Qu.: 134.0 3rd Qu.: 143 3rd Qu.: 213 3rd Qu.: 210
## Max. :2023546.0 Max. :2420963 Max. :2239105 Max. :1227352
## TCGA-A3-3367-01A TCGA-B8-4619-11A TCGA-B2-5641-11A TCGA-CW-5590-01A
## Min. : 0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 3 Median : 1.0 Median : 1.0 Median : 1.0
## Mean : 1435 Mean : 745.5 Mean : 921.7 Mean : 1186.3
## 3rd Qu.: 246 3rd Qu.: 108.0 3rd Qu.: 154.0 3rd Qu.: 183.2
## Max. :1715223 Max. :2021786.0 Max. :1487356.0 Max. :1317656.0
## TCGA-CJ-5679-11A TCGA-B2-5633-01A TCGA-CZ-5461-11A TCGA-B2-3923-01A
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 2 Median : 2 Median : 2.0 Median : 2.0
## Mean : 1224 Mean : 1308 Mean : 1638.7 Mean : 888.9
## 3rd Qu.: 228 3rd Qu.: 162 3rd Qu.: 275.2 3rd Qu.: 103.0
## Max. :1455094 Max. :6503844 Max. :2569014.0 Max. :1708724.0
## TCGA-A3-A6NL-01A TCGA-CJ-6031-01A TCGA-BP-5178-01A TCGA-CJ-4923-01A
## Min. : 0.0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 2.0 Median : 1 Median : 1 Median : 3
## Mean : 859.7 Mean : 1265 Mean : 1436 Mean : 1141
## 3rd Qu.: 169.0 3rd Qu.: 194 3rd Qu.: 158 3rd Qu.: 234
## Max. :435298.0 Max. :1964829 Max. :4046892 Max. :1662324
## TCGA-BP-5009-01A TCGA-BP-4347-01A TCGA-B0-5694-11A TCGA-CJ-5680-11A
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 2 Median : 2 Median : 1 Median : 2
## Mean : 1517 Mean : 1578 Mean : 1098 Mean : 1045
## 3rd Qu.: 216 3rd Qu.: 275 3rd Qu.: 208 3rd Qu.: 215
## Max. :4500670 Max. :2466437 Max. :2100458 Max. :1055485
## TCGA-B0-4824-01A TCGA-B0-4700-11A TCGA-CW-5589-11A TCGA-CJ-4870-01A
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 2 Median : 2 Median : 1 Median : 1
## Mean : 803 Mean : 1014 Mean : 1201 Mean : 1020
## 3rd Qu.: 148 3rd Qu.: 142 3rd Qu.: 205 3rd Qu.: 127
## Max. :1644719 Max. :4518414 Max. :1621172 Max. :1119356
## TCGA-A3-3380-01A TCGA-AK-3450-01A TCGA-CZ-5987-11A TCGA-CZ-5986-11A
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 2 Median : 1 Median : 1.0 Median : 1.0
## Mean : 1230 Mean : 1060 Mean : 700.7 Mean : 903.2
## 3rd Qu.: 224 3rd Qu.: 161 3rd Qu.: 128.0 3rd Qu.: 153.0
## Max. :1099466 Max. :1142804 Max. :712959.0 Max. :2298241.0
## TCGA-MM-A563-01A TCGA-CZ-5465-11A TCGA-B4-5843-01A TCGA-CJ-4907-01A
## Min. : 0.0 Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 1.0 Median : 2 Median : 1.0 Median : 2.0
## Mean : 803.5 Mean : 1479 Mean : 924.5 Mean : 1351.1
## 3rd Qu.: 173.0 3rd Qu.: 225 3rd Qu.: 144.0 3rd Qu.: 229.2
## Max. :951818.0 Max. :1982372 Max. :1037058.0 Max. :1249624.0
## TCGA-CW-6088-11A TCGA-CJ-5678-11A TCGA-CJ-4897-01A TCGA-CW-5587-11A
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 2 Median : 1 Median : 2 Median : 1
## Mean : 1588 Mean : 1243 Mean : 1657 Mean : 1242
## 3rd Qu.: 252 3rd Qu.: 163 3rd Qu.: 247 3rd Qu.: 195
## Max. :2418690 Max. :4820798 Max. :1679652 Max. :1908773
## TCGA-BP-4338-01A TCGA-BP-5190-01A TCGA-B0-5701-01A TCGA-CZ-5457-11A
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0
## Median : 1 Median : 1 Median : 1.0 Median : 1
## Mean : 1580 Mean : 1185 Mean : 702.8 Mean : 1341
## 3rd Qu.: 165 3rd Qu.: 136 3rd Qu.: 106.0 3rd Qu.: 205
## Max. :2069060 Max. :4347907 Max. :882388.0 Max. :1738977
## sample condition
## Length:100 Length:100
## Class :character Class :character
## Mode :character Mode :character
## ensembl_gene_id external_gene_name length
## Length:62940 Length:62940 Min. : 7
## Class :character Class :character 1st Qu.: 534
## Mode :character Mode :character Median : 3280
## Mean : 30993
## 3rd Qu.: 22689
## Max. :2473538
## [1] 62940 100
## [1] 100 2
## [1] 62940 3
# Preprocessing required fields
c_anno_df$condition <- factor(c_anno_df$condition)
summary(c_anno_df$condition)
## case control
## 50 50
The RData file contains three data frames:
Data Frame | Description | Dimensions |
---|---|---|
raw_counts_df |
Raw RNA-seq counts | 62940 x 100 |
c_anno_df |
Sample names and conditions | 100 x 2 |
r_anno_df |
Ensembl gene IDs, gene lengths, and gene symbols | 62940 x 3 |
Additionally, the condition column in c_anno_df
was
converted to a factor to facilitate downstream analysis.
# Querying the Ensembl database for homo sapiens gene information
ensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")
# Getting gene information for the Ensembl IDs in r_anno_df
gene_info <- getBM(attributes = c("ensembl_gene_id", "gene_biotype"),
filters = "ensembl_gene_id",
values = r_anno_df$ensembl_gene_id,
mart = ensembl)
# Extracting protein-coding genes from the gene_info data frame
protein_coding_genes <- gene_info[gene_info$gene_biotype == "protein_coding", ]
cat("> Number of protein-coding genes:", nrow(protein_coding_genes), "\n")
## > Number of protein-coding genes: 22152
# Updating data frames to include only protein-coding genes
raw_counts_df <- raw_counts_df[rownames(raw_counts_df) %in% protein_coding_genes$ensembl_gene_id, ]
r_anno_df <- r_anno_df[r_anno_df$ensembl_gene_id %in% protein_coding_genes$ensembl_gene_id, ]
From the annotated gene list, we filtered for protein-coding genes using the Ensembl BioMart database. This resulted in a subset of 22152 protein-coding genes out of the original 62940 input genes. The raw count matrix and annotation were aligned accordingly to ensure consistency for downstream expression analysis.
The updated dimensions of the data frames are as follows:
Data Frame | Dimensions Before | Dimensions After |
---|---|---|
raw_counts_df |
62940 x 100 | 22152 x 100 |
r_anno_df |
62940 x 3 | 22152 x 3 |
group <- c_anno_df$condition
# Creating DGEList object
dge <- DGEList(counts = raw_counts_df, group = group)
# Filtering genes
# Condition: keep if count > 20 in at least 5 samples of either group
keep <- rowSums(dge$counts[, group == "case"] > 20) >= 5 |
rowSums(dge$counts[, group == "control"] > 20) >= 5
dge <- dge[keep, , keep.lib.sizes = FALSE]
# Extracting results
results <- topTags(lrt, n = Inf)$table
# Adding average log CPM using edgeR model-based estimate
results$logCPM <- aveLogCPM(dge)
# Classifying DEGs
results$DEG <- "not_significant"
results$DEG[results$FDR < 0.05 & results$logFC > 1.5 & results$logCPM > 1] <- "up"
results$DEG[results$FDR < 0.05 & results$logFC < -1.5 & results$logCPM > 1] <- "down"
# Displaying summary
table(results$DEG)
##
## down not_significant up
## 1566 14543 955
After applying differential expression analysis using the edgeR package, a total of 2521 genes were identified as significantly altered between the case and control groups. Among these, 955 genes showed significant up-regulation, and 1566 genes were significantly down-regulated, based on the thresholds of FDR < 0.05, log2 fold change > 1.5 (or < -1.5), and logCPM > 1.
The majority of analyzed genes (14543 genes) did not show statistically significant expression changes under these criteria. These results indicate a focused subset of genes potentially involved in the biological differences between the two conditions.
Regulation | Count | Condition |
---|---|---|
Up | 955 | if FDR < 0.05, log2FC > 1.5, and logCPM > 1 |
Down | 1566 | if FDR < 0.05, log2FC < -1.5, and logCPM > 1 |
Not Significant | 14543 | if FDR >= 0.05 or logCPM <= 1 |
Total | 17064 |
# Plotting the volcano plot
ggplot(results, aes(x = logFC, y = -log10(FDR), color = DEG)) +
geom_point(alpha = 0.6, size = 1.2) +
scale_color_manual(
values = c("up" = "#D62728", "down" = "#1F77B4", "not_significant" = "grey80")
) +
theme_minimal(base_size = 14) +
labs(
title = "Volcano Plot of Differential Expression",
x = expression("Log"[2]*" Fold Change"),
y = expression("-Log"[10]*" FDR"),
color = "Regulation"
) +
geom_vline(xintercept = c(-1.5, 1.5), linetype = "dashed", color = "black", alpha = 0.5) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "black", alpha = 0.5) +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "top",
panel.grid.minor = element_blank()
)
This volcano plot
visualizes the differential
gene expression between tumor (case) and normal (control)
samples in KIRP. Each point represents a gene, plotted by its
log2 fold change (x-axis) and statistical significance as −log10 FDR
(y-axis). Genes that are significantly up-regulated
(lo2FC > 1.5, FDR < 0.05, and logCPM > 1) are shown in
red, while significantly
down-regulated genes (log2FC < −1.5) are in
blue. Non-significant genes are shown
in grey. The dashed vertical and horizontal lines
indicate the fold-change and FDR thresholds used. The plot highlights a
substantial number of both up- and down-regulated genes, reflecting
widespread transcriptional alterations in the tumor
samples, with several genes showing both strong expression
changes and high statistical significance. Notably, the plot reveals a
slight asymmetry, with more down-regulated genes than
up-regulated ones.
# Generating the expression matrix for heatmap
logCPM_mat <- cpm(dge, log = TRUE)
deg_genes <- rownames(results)[results$DEG %in% c("up", "down")]
heat_data <- logCPM_mat[deg_genes, ]
heat_data_scaled <- t(scale(t(heat_data)))
# Generating the annotation data frame
condition <- c_anno_df$condition
names(condition) <- colnames(heat_data_scaled)
# Definining heatmap annotation and colors
ha <- HeatmapAnnotation(
Condition = condition,
col = list(Condition = c("case" = "#02BA0F", "control" = "#3DED97")),
annotation_legend_param = list(title = "Condition")
)
col_fun <- colorRamp2(c(-8, 0, 8), c("#053061", "#f7f7f7", "#67001f"))
# Plotting the heatmap
ht_opt$message = FALSE # Just to supress messages from ComplexHeatmap library
Heatmap(
matrix = heat_data_scaled,
name = "Z-score logCPM",
top_annotation = ha,
col = col_fun,
show_row_names = FALSE,
show_column_names = FALSE,
cluster_rows = TRUE,
cluster_columns = TRUE,
column_title = "Differentially Expressed Genes (Z-score logCPM)",
heatmap_legend_param = list(
title = "Z-score logCPM"
)
)
This heatmap
shows the expression profiles of
DEGs across tumor (case) and normal (control) samples, using
Z-score normalized logCPM values. Samples cluster clearly by condition,
reflecting distinct gene expression patterns between groups. Red
and blue blocks indicate genes consistently up- or
down-regulated within each condition. Z-score normalization
emphasizes relative expression changes, facilitating comparison across
genes with different baseline levels. The observed clustering of both
samples and genes highlights strong transcriptional
differences and supports the reliability of the differential
expression analysis.
# Matching gene symbols from annotation
results$gene <- r_anno_df$external_gene_name[match(rownames(results), r_anno_df$ensembl_gene_id)]
# Cleaning gene names by removing empty and duplicated entries
results_clean <- results[results$gene != "" & !is.na(results$gene), ]
results_clean <- results_clean[!duplicated(results_clean$gene), ]
# Creating named vector of logFC values with gene symbols
fc_vector <- results_clean$logFC
names(fc_vector) <- results_clean$gene
# Loading gene sets from MSigDB for Homo sapiens
msig_c4 <- msigdbr(species = "Homo sapiens", category = "C4")
msig_c6 <- msigdbr(species = "Homo sapiens", category = "C6")
# Converting to list format for fgsea
c4_list <- split(msig_c4$gene_symbol, msig_c4$gs_name)
c6_list <- split(msig_c6$gene_symbol, msig_c6$gs_name)
The Molecular Signatures Database (MSigDB) is a curated collection of annotated gene sets used for gene set enrichment analysis. It organizes genes based on shared biological functions, regulatory motifs, and pathways, enabling deeper interpretation of gene expression data.
In this section of our study, we focus on two key MSigDB collections:
The C4 collection
includes computational gene sets
derived from unsupervised clustering of cancer-related gene expression
data, often reflecting tissue-specific programs, immune responses, and
co-expression modules.
The C6 collection
consists of oncogenic signatures
based on gene expression changes resulting from experimental activation
or suppression of oncogenes and tumor suppressors.
# Setting seed for reproducibility
set.seed(101)
# Running GSEA with fgsea function
fgsea_c4 <- fgsea(pathways = c4_list, stats = fc_vector, nperm = 10000)
fgsea_c6 <- fgsea(pathways = c6_list, stats = fc_vector, nperm = 10000)
# Sorting the results by Normalized Enrichment Score (NES)
fgsea_c4 <- fgsea_c4[order(fgsea_c4$NES, decreasing = TRUE), ]
fgsea_c6 <- fgsea_c6[order(fgsea_c6$NES, decreasing = TRUE), ]
# Getting the top 10 up- and down-regulated gene sets from C4 and C6
top10_up_c4 <- head(fgsea_c4, 10)
top10_down_c4 <- tail(fgsea_c4, 10)
top10_up_c6 <- head(fgsea_c6, 10)
top10_down_c6 <- tail(fgsea_c6, 10)
# Displaying summaries
print(top10_up_c4[, c("pathway", "NES", "padj")])
## pathway NES padj
## <char> <num> <num>
## 1: GAVISH_3CA_METAPROGRAM_EPITHELIAL_METABOLISM_KIDNEY_2 2.544611 0.002075391
## 2: GAVISH_3CA_METAPROGRAM_EPITHELIAL_RESPIRATION 2.280733 0.002090007
## 3: GAVISH_3CA_METAPROGRAM_EPITHELIAL_METABOLISM_KIDNEY_1 2.243889 0.002075391
## 4: MODULE_221 2.123694 0.001983454
## 5: MODULE_184 2.083426 0.001983454
## 6: MODULE_78 2.025267 0.003355783
## 7: GAVISH_3CA_METAPROGRAM_EPITHELIAL_EPI_2 2.015994 0.003733247
## 8: MODULE_305 1.952085 0.003355783
## 9: MODULE_180 1.925057 0.002565122
## 10: MODULE_141 1.824045 0.011641183
## pathway NES padj
## <char> <num> <num>
## 1: GNF2_ITGB2 -2.207099 0.001160725
## 2: MODULE_45 -2.230670 0.001160725
## 3: GNF2_HPX -2.240167 0.001160725
## 4: MODULE_84 -2.244901 0.001160725
## 5: GNF2_LCAT -2.266469 0.001160725
## 6: GAVISH_3CA_METAPROGRAM_CD8_T_CELLS_CYTOTOXIC -2.281415 0.001160725
## 7: GAVISH_3CA_METAPROGRAM_CD4_T_CELLS_CYTOTOXIC -2.295184 0.001160725
## 8: GNF2_CEBPA -2.299420 0.001160725
## 9: GNF2_IL2RB -2.309512 0.001160725
## 10: CAR_HPX -2.374754 0.001160725
## pathway NES padj
## <char> <num> <num>
## 1: BMI1_DN_MEL18_DN.V1_DN 1.772549 0.01750811
## 2: MEL18_DN.V1_DN 1.735040 0.01750811
## 3: KRAS.KIDNEY_UP.V1_UP 1.715648 0.01750811
## 4: SINGH_KRAS_DEPENDENCY_SIGNATURE 1.511300 0.18228148
## 5: BMI1_DN.V1_DN 1.495581 0.05235193
## 6: KRAS.300_UP.V1_UP 1.455893 0.08255735
## 7: IL2_UP.V1_DN 1.370885 0.08255735
## 8: KRAS.PROSTATE_UP.V1_DN 1.367707 0.12240803
## 9: IL21_UP.V1_DN 1.345186 0.09336492
## 10: LTE2_UP.V1_UP 1.338744 0.08579564
## pathway NES padj
## <char> <num> <num>
## 1: ATM_DN.V1_UP -1.503579 0.06243118
## 2: RPS14_DN.V1_UP -1.510817 0.04216269
## 3: BMI1_DN_MEL18_DN.V1_UP -1.532090 0.04216269
## 4: PRC2_EED_DN.V1_DN -1.533858 0.02759890
## 5: MEL18_DN.V1_UP -1.543990 0.04115537
## 6: ALK_DN.V1_UP -1.568256 0.04216269
## 7: P53_DN.V2_UP -1.616406 0.02759890
## 8: SNF5_DN.V1_UP -1.638530 0.01257485
## 9: CTIP_DN.V1_UP -1.644272 0.02759890
## 10: KRAS.KIDNEY_UP.V1_DN -1.838788 0.01257485
We conducted GSEA on the ranked expression profile of our samples to identify key biological and regulatory pathways. Specifically, we compared our ranked gene list against the C4 and C6 collections from MSigDB.
For the C4 collection
, the top
upregulated gene sets were enriched for epithelial
metabolism and kidney-specific transcriptional programs, suggesting a
metabolically active epithelial phenotype. Conversely, the top
downregulated gene sets were associated with cytotoxic
T cell activity and immune signaling, indicating an immune-suppressed or
immune-excluded tumor microenvironment.
In the C6 collection
, upregulated gene
sets revealed activation of KRAS signaling, cytokine response pathways
(including IL2 and IL21), and repression signatures linked to BMI1 and
MEL18 inhibition, supporting a profile of oncogenic activation and
proliferative signaling. Downregulated gene sets
included p53 and ATM tumor suppressor pathways, along with chromatin
regulators such as PRC2 and SNF5, pointing to disrupted tumor suppressor
activity and potential epigenetic dysregulation.
Together, these findings suggest that our tumor samples exhibit a transcriptional profile characterized by epithelial dominance, immune coldness, and oncogenic signaling activation, consistent with an aggressive tumor-like phenotype.
# Ensuring that the fgsea results are data frames to avoid plotting errors
fgsea_c4_df <- as.data.frame(fgsea_c4)
fgsea_c6_df <- as.data.frame(fgsea_c6)
# Plotting GSEA table for top 10 up-regulated in c4
plotGseaTable(
pathways = c4_list[top10_up_c4$pathway],
stats = fc_vector,
fgseaRes = fgsea_c4_df,
gseaParam = 0.5
)
# Plotting GSEA table for top 10 down-regulated in c4
plotGseaTable(
pathways = c4_list[top10_down_c4$pathway],
stats = fc_vector,
fgseaRes = fgsea_c4_df,
gseaParam = 0.5
)
# Plotting GSEA table for top 10 up-regulated in c6
plotGseaTable(
pathways = c6_list[top10_up_c6$pathway],
stats = fc_vector,
fgseaRes = fgsea_c6_df,
gseaParam = 0.5
)
# Plotting GSEA table for top 10 down-regulated in c6
plotGseaTable(
pathways = c6_list[top10_down_c6$pathway],
stats = fc_vector,
fgseaRes = fgsea_c6_df,
gseaParam = 0.5
)
To further illustrate the enrichment results, we generated GSEA tables highlighting the top up- and downregulated gene sets from the C4 and C6 collections.
# Plotting the enrichment score for the single top up-regulated pathway in C4
plotEnrichment(c4_list[[top10_up_c4$pathway[1]]], fc_vector) +
labs(title = paste("C4 Top Upregulated:", top10_up_c4$pathway[1]))
# Plotting the enrichment score for the single top down-regulated pathway in C4
plotEnrichment(c4_list[[top10_down_c4$pathway[1]]], fc_vector) +
labs(title = paste("C4 Top Downregulated:", top10_down_c4$pathway[1]))
# Plotting the enrichment score for the single top up-regulated pathway in C6
plotEnrichment(c6_list[[top10_up_c6$pathway[1]]], fc_vector) +
labs(title = paste("C6 Top Upregulated:", top10_up_c6$pathway[1]))
# Plotting the enrichment score for the single top down-regulated pathway in C6
plotEnrichment(c6_list[[top10_down_c6$pathway[1]]], fc_vector) +
labs(title = paste("C6 Top Downregulated:", top10_down_c6$pathway[1]))
Each enrichment plot displays a green line representing the enrichment score (ES), a running-sum statistic that increases when a gene from the selected set is encountered in the ranked list and decreases otherwise. Tick marks along the x-axis indicate the positions of these genes within the ranked list. A sharp rise near the beginning of the plot suggests strong enrichment at the top of the gene list (upregulation), while a pronounced decline indicates enrichment at the bottom of the gene list (downregulation).
In the C4 collection
, the top
upregulated gene set was
GAVISH_3CA_METAPROGRAM_EPITHELIAL_METABOLISM, where the ES
sharply peaks early, showing strong enrichment of epithelial metabolism
genes at the top of the ranked list, which is consistent with active
metabolic and biosynthetic processes in epithelial cells. In contrast,
GNF2_ITGB2 was the top downregulated C4 gene
set, with a trough in the ES at the bottom of the ranked list,
indicating strong suppression of immune-related genes, particularly
those associated with integrin β2 (ITGB2), a key player in leukocyte
adhesion and immune cell signaling.
For the C6 oncogenic signature plots
, the top
upregulated gene set, BMI1_DN_MEL18_DN_V1_DN, had an
enrichment score peak early in the ranked gene list. This reflects
upregulation of genes normally repressed by Polycomb group proteins BMI1
and MEL18, suggesting epigenetic derepression, which is often linked to
stemness and aggressive tumor phenotypes. Lastly, the
ATM_DN.V1_UP gene set was the most
downregulated, with its ES curve dipping in the mid-to-late
portion of the ranked list. This pattern reflects decreased expression
of genes normally activated during ATM pathway inhibition, indicating a
potentially impaired DNA damage response and greater genomic
instability—hallmarks of tumor progression.
# Preparing the DEGs
de_results <- results_clean
de_up <- de_results %>% filter(logFC > 1.5 & FDR < 0.05 & logCPM > 1)
de_down <- de_results %>% filter(logFC < -1.5 & FDR < 0.05 & logCPM > 1)
# Getting gene lists from the top GSEA sets
top_pathways_c4 <- c(top10_up_c4$pathway, top10_down_c4$pathway)
top_pathways_c6 <- c(top10_up_c6$pathway, top10_down_c6$pathway)
# Extracting genes from the C4 and C6 lists
genes_from_c4 <- unlist(c4_list[top_pathways_c4])
genes_from_c6 <- unlist(c6_list[top_pathways_c6])
all_genes <- unique(c(genes_from_c4, genes_from_c6))
# Filtering the DEGs by pathway genes
selected_up_genes <- intersect(de_up$gene, all_genes)
selected_down_genes <- intersect(de_down$gene, all_genes)
# Mapping gene symbols to Entrez IDs using biomaRt
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
all_genes_info <- getBM(attributes = c("external_gene_name", "entrezgene_id"),
filters = "external_gene_name",
values = unique(c(selected_up_genes, selected_down_genes)),
mart = mart)
# Loading gene coordinates from TxDb and mapping to gene names
txdb_genes <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb_genes$gene_name <- all_genes_info$external_gene_name[match(txdb_genes$gene_id,
all_genes_info$entrezgene_id)]
# Getting 500 bp upstream promoters for selected genes
promoters_up <- subset(promoters(txdb_genes, upstream = 500, downstream = 0),
gene_name %in% selected_up_genes)
promoters_down <- subset(promoters(txdb_genes, upstream = 500, downstream = 0),
gene_name %in% selected_down_genes)
To investigate upstream regulatory elements of DEGs, we first selected genes that were both significantly altered in expression and members of key GSEA-enriched pathways. These genes were mapped to their corresponding Entrez Gene IDs, and their promoter regions, defined as the 500 base pairs upstream of the transcription start site (TSS), were retrieved using the Ensembl database via the biomaRt package. This set of promoter sequences served as the input for downstream motif enrichment analysis.
# Retrieving sequences from hg19 genome
seqs_up <- getSeq(BSgenome.Hsapiens.UCSC.hg38, promoters_up)
seqs_down <- getSeq(BSgenome.Hsapiens.UCSC.hg38, promoters_down)
# Loading motif database
data(PWMLogn.hg19.MotifDb.Hsap)
# Running motif enrichment (output suppressed)
set.seed(444) # Setting seed for reproducibility
res_up <- motifEnrichment(seqs_up, PWMLogn.hg19.MotifDb.Hsap, score = "affinity")
## Calculating motif enrichment scores ...
## Calculating motif enrichment scores ...
# Reporting top motifs
report_up <- sequenceReport(res_up, 1)
report_down <- sequenceReport(res_down, 1)
For motif enrichment analysis, we employed the PWMEnrich package, which requires background models computed for a specific genome build. We used the hg19 human genome as the reference for statistical scoring, as it is the only build currently supported by the precomputed background models embedded in PWMEnrich. Although the hg38 genome assembly is more recent and biologically accurate, PWMEnrich does not provide corresponding background models, and conflicts with other packages under recent R versions.
To address this, we extracted promoter sequences based on the hg38 genome build, which allowed us to leverage improved gene annotations and coordinate accuracy. These sequences were then analyzed using PWMEnrich with the hg19-based background model, which could introduce a minor inconsistency due to the mismatch between genome builds. However, this approach allows reliable statistical scoring while benefiting from the improved annotations of hg38.
# Plotting the top enriched motifs for up- and down-regulated genes
plot(report_up[report_up$p.value < 0.01], fontsize = 7, id.fontsize = 6)
To uncover transcriptional regulators behind gene expression changes, we analyzed motifs in the 500 bp promoter regions of genes significantly differentially expressed and enriched in key GSEA pathways (C4 and C6), based on the hg19 genome.
Promoters of upregulated genes were enriched for motifs linked to TFs like FOXB1, FOXJ3, DUXA, and GSC2. These factors are known for roles in embryonic development, cell fate decisions, and maintaining pluripotency or regenerative states. Some enriched motifs lack detailed functional annotation, suggesting novel or context-specific regulatory elements. Overall, this points to upregulated genes being controlled by TFs that promote cellular plasticity and stem-like transcriptional programs, indicative of a shift toward a more primitive or regenerative state, relevant in tissue remodeling or tumorigenesis.
In contrast, downregulated genes showed motif enrichment for TFs such as SOX9, SOX10, FOXC1, TFAP2A, and ELK1, which are involved in neural crest development, differentiation, vascular formation, and immune response. The enrichment suggests repression of immune activity and lineage-specific functions. Some under-characterized motifs hint at unknown TFs contributing to suppressing mature cellular identities.
Together, these results reveal a biphasic regulatory shift: activation of developmental plasticity factors alongside suppression of differentiation and immune-related TFs. This pattern reflects a biological transition toward less differentiated, more flexible cellular states, commonly observed in cancer progression, regeneration, and reprogramming.
# Extracting significant TFs
sig_up <- as.data.frame(report_up[report_up$p.value < 0.01, ])
sig_down <- as.data.frame(report_down[report_down$p.value < 0.01, ])
tf_up <- sig_up %>%
group_by(target) %>%
slice_min(p.value, n = 1) %>%
ungroup()
tf_down <- sig_down %>%
group_by(target) %>%
slice_min(p.value, n = 1) %>%
ungroup()
# Getting unique TFs
unique_tf_up <- unique(tf_up$target)
unique_tf_down <- unique(tf_down$target)
# Displaying the counts of unique TFs
length(unique_tf_up)
## [1] 22
## [1] 18
In this step, we extracted significant TFs by filtering for interactions with a false discovery rate (FDR)-adjusted p-value less than 0.01 from both upregulated and downregulated gene sets. This adjustment controls for multiple testing and reduces the likelihood of false positives. For each target gene, we selected the most significant TF (with the lowest adjusted p-value) to avoid redundancy. We then identified the unique TFs involved in each set and counted them, resulting in 22 unique TFs associated with upregulated targets and 18 with downregulated targets. This process highlights the most statistically robust regulators potentially driving differential gene expression in the dataset.
# Extracting SOX9 PWMs from MotifDb for Homo sapiens
motifs_sox9 <- subset(MotifDb, organism == 'Hsapiens' & geneSymbol == 'SOX9')
pwms_sox9 <- toPWM(as.list(motifs_sox9))
names(pwms_sox9) <- sapply(names(pwms_sox9), function(x) strsplit(x, "-")[[1]][3])
# Ensuring that the promoter sequences of downregulated genes are available
stopifnot(exists("seqs_down")) # Should be a DNAStringSet object
# Computing raw motif scores for each promoter sequence
scores_list <- lapply(seqs_down, function(prom) motifScores(prom, pwms_sox9, raw.score = TRUE))
names(scores_list) <- names(seqs_down)
# Computing empirical background score distributions using hg19 for SOX9 motifs
ecdf <- motifEcdf(pwms_sox9, organism = "hg19", quick = TRUE)
# Calculating 99.75% log2-transformed threshold for each motif
sox9_thresholds <- sapply(names(pwms_sox9), function(name) {
log2(quantile(ecdf[[name]], 1 - 0.0025, na.rm = TRUE))
})
names(sox9_thresholds) <- names(pwms_sox9)
For downstream analysis, SOX9, one of the top-enriched TFs, was selected to investigate its potential binding in the promoters of downregulated genes. We retrieved 16 SOX9 PWMs for H. sapiens from MotifDb, for which we computed raw motif scores across the promoter sequences of downregulated genes. To assess significance, we calculated empirical score distributions using the hg19 genomic background and determined log2-transformed thresholds at the 99.75th percentile. These thresholds represent cutoff values above which SOX9 binding sites are unlikely to occur by chance. From these PWMs, we selected one representative binding model, SOX9_HUMAN.H10MO.B from the HOCOMOCO v10 database, to visualize predicted binding affinities in promoter regions. For this PWM, a threshold score of 4.265822 was computed from the empirical background distribution.
# Selecting motif and gene to plot
motif_name_to_plot <- names(sox9_thresholds)[1]
cutoff_to_plot <- sox9_thresholds[[motif_name_to_plot]]
# Plotting the motif scores without cutoff
plotMotifScores(scores_list[[1]], sel.motifs = motif_name_to_plot,
cols = c("steelblue", "forestgreen", "tomato"),
main = paste("SOX9 Motif:", motif_name_to_plot, "| Gene:", names(scores_list)[1]))
# Plotting the motif scores with cutoff
plotMotifScores(scores_list[[1]], sel.motifs = motif_name_to_plot,
cols = c("steelblue", "forestgreen", "tomato"),
cutoff = cutoff_to_plot,
main = paste("SOX9 Motif w/ Cutoff:", motif_name_to_plot, "| Gene:", names(scores_list)[1]))
The two graphs illustrate SOX9 motif matches within a 500 bp region of Gene 100, with and without applying a score cutoff.
In the first graph
, all potential binding sites are
shown regardless of their strength, including both strong and weak
matches on the forward and reverse strands. This provides a
comprehensive view but may include low-confidence, biologically
insignificant sites.
In contrast, the second graph
applies a cutoff score of
4.265, filtering out weaker matches and retaining only the stronger,
more likely functional binding sites—resulting in six clear matches on
the forward strand.
Overall, applying a cutoff enhances the specificity of the analysis by highlighting high-confidence SOX9 binding regions, making the results more biologically meaningful.
# Initializing the STRINGdb object
string_db <- STRINGdb$new(version = "11.5", species = 9606, score_threshold = 400)
# Species 9606 = the NCBI taxonomy ID of Homo sapiens
# Score threshold = the minimum confidence score for interactions, here set to 400
# Preparing the DEG gene list
selected_up_genes <- results$gene[results$DEG == "up"]
selected_down_genes <- results$gene[results$DEG == "down"]
deg_genes <- unique(c(selected_up_genes, selected_down_genes))
# Converting to a data frame
deg_genes_df <- data.frame(gene = deg_genes, stringsAsFactors = FALSE)
# Mapping gene symbols to STRING IDs
mapped <- string_db$map(deg_genes_df, "gene", removeUnmappedRows = TRUE)
## Warning: we couldn't map to STRING 2% of your identifiers
In this step, we initialized the STRINGdb object to access the STRING PPI database for Homo sapiens (taxonomy ID: 9606). We compiled a list of DEGs, combining both up- and down-regulated genes, and mapped their gene symbols to STRING identifiers. Using these STRING IDs, we retrieved high-confidence PPIs (combined score >= 400), which represent biologically meaningful associations based on multiple evidence channels such as experimental data, co-expression, and database annotations. This forms the foundation of our interaction network for downstream analysis.
# Creating a look-up table to map each gene id to its corresponding gene name
id_to_gene <- mapped[, c("STRING_id", "gene")]
# Merging gene symbols for both ends of the interaction
labeled_ppi_network <- ppi_network %>%
left_join(id_to_gene, by = c("from" = "STRING_id")) %>%
rename(from_gene = gene) %>%
left_join(id_to_gene, by = c("to" = "STRING_id")) %>%
rename(to_gene = gene)
# Reordering columns for readability
labeled_ppi_network <- labeled_ppi_network[, c("from", "from_gene", "to", "to_gene", "combined_score")]
# Saving labeled network as a TSV file
write.table(labeled_ppi_network, file = "labeled_ppi_network.tsv", sep = "\t", quote = FALSE, row.names = FALSE)
The retrieved PPI network initially used internal STRING IDs, which are not easily interpretable. In this step, we mapped each STRING ID back to its corresponding gene symbol using our mapping table. This resulted in a labeled PPI network where each node is annotated with a human-readable gene name. We then exported this labeled network as a TSV file for easier interpretation, visualization, or integration with external tools such as Cytoscape or network analysis pipelines.
# Using the full interaction table with gene names
ppi_edges <- labeled_ppi_network %>% select(from_gene, to_gene, combined_score)
# Creating an undirected igraph object with gene symbols
g <- graph_from_data_frame(ppi_edges, directed = FALSE)
# Displaying some graph information
cat("Number of nodes in the graph:", gorder(g), "\nNumber of edges in the graph:", gsize(g), "\n")
## Number of nodes in the graph: 2371
## Number of edges in the graph: 49824
Utilizing the generated labeled PPI network, we created an undirected graph object using the igraph package, which allows us to represent the PPI network as a graph structure. The edges of the graph are defined by the interactions between genes, with each edge weighted by the combined score from STRING, indicating the confidence of the interaction. This graph structure enables us to perform various network analyses and visualizations.
# Calculating the degree of each node using the degree() function from igraph
deg <- degree(g)
deg_df <- data.frame(gene = names(deg), degree = deg)
# Obtaining the frequency of each degree value
deg_freq <- deg_df %>%
count(degree)
# Plotting the degree distribution
ggplot(deg_freq, aes(x = degree, y = n)) +
geom_col(fill = "#800080", color = "#800080") +
theme_minimal() +
labs(title = "Degree Distribution of the PPI Network", x = "Degree", y = "Frequency") +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
axis.title = element_text(face = "bold")
)
The degree distribution of the PPI network is highly right-skewed, indicating that the majority of genes participate in only a few interactions, while a small subset of genes act as hubs with over 500 connections. This pattern is characteristic of a scale-free network architecture, where a few highly connected nodes play a central role in maintaining connectivity. Although the average node has around 20 connections, the network consists of 49,824 edges out of a possible 2.8 million \[\frac{N(N-1)}{2} = \frac{2371 \times 2370}{2} = 2,\!809,\!635\], resulting in an edge density of approximately 1.6%. This confirms that the network is sparse overall, a typical feature of biological systems, which favor efficiency, modularity, and robustness through selective connectivity.
# Computing centrality metrics
centrality_df <- data.frame(
gene = V(g)$name,
degree = degree(g),
closeness = closeness(g, normalized = TRUE),
betweenness = betweenness(g, normalized = TRUE)
)
# Obtaining the top 10 genes by each centrality measure
top_degree <- centrality_df %>% arrange(desc(degree)) %>% slice(1:10)
top_closeness <- centrality_df %>% arrange(desc(closeness)) %>% slice(1:10)
top_betweenness <- centrality_df %>% arrange(desc(betweenness)) %>% slice(1:10)
top_degree
In our network analysis, we examined three key centrality measures to uncover the roles and influence of individual genes within the PPI network:
Degree Centrality
Closeness Centrality
Betweenness Centrality
Each metric offers a distinct perspective on how genes contribute to the network’s structure and function.
Degree centrality
reflects the number of direct
connections a gene holds in the network. Genes with a high degree act as
hubs, serving as major points of interaction within the
network. For instance, ALB (Albumin) was identified as the most
connected gene, with 602 interactions. Similarly, genes like
PTPRC, GAPDH, CD8A, and VEGFA also showed high connectivity,
indicating their involvement in crucial pathways such as immune
signaling, metabolism, and blood vessel formation.
Closeness centrality
, on the other hand, measures
how close a gene is to all others in the network, reflecting how
quickly it can affect or be affected by the entire system. Some
genes, such as C6ORF52, SLFN12L, TMSB15B and C18ORF63, had
perfect closeness scores of 1.0, but these were
confined to small disconnected subgraphs, where all
nodes are directly adjacent. While mathematically high, these values
do not reflect true global centrality. In contrast, key
genes within the main connected component, such as ALB, GAPDH, FN1,
VEGFA, and PTPRC, also ranked highly in closeness, highlighting
their importance in rapidly influencing a broad range of other genes
within the core of the network.
Betweenness centrality
identifies genes that
serve as bridges within the network, controlling the flow of information
along the shortest paths between other genes. Genes with high
betweenness scores are often critical for maintaining communication
within the network. In our analysis, ALB again ranked highest,
alongside GAPDH, FN1, EGF, and CDH1. Notably, genes like
EGF, CDH1, and CFTR showed moderate degree but high
betweenness, suggesting their importance as bottlenecks
or mediators connecting otherwise distant network modules.
Overall, the recurring prominence of genes such as ALB, GAPDH, FN1, PTPRC, and VEGFA across all three centrality measures underscores their central regulatory roles. These genes are likely key drivers of immune response, metabolism, cell adhesion, and vascular remodeling, reflecting their functional significance in the biological system under study.
# Decompose the graph into components
components_info <- components(g)
# Extract LCC
lcc <- induced_subgraph(g, which(components_info$membership == which.max(components_info$csize)))
# Summary
cat("Number of connected components:", components_info$no, "\n")
## Number of connected components: 3
## Size of largest component: 2367 nodes and 49818 edges
To better understand the structural organization of the DEG-based PPI network, we decomposed the graph into its connected components. A connected component is a subgraph in which any two nodes are connected to each other by a path.
Our analysis revealed 3 total connected components,
with the largest connected component (LCC)
comprising
2367 nodes and 49,818 edges. This dominant cluster
contains the majority of DEGs and their interactions, reflecting a
highly interconnected core of the biological network. Such a large LCC
suggests strong functional integration and extensive signaling
cross-talk among the deregulated genes, reinforcing the biological
coherence of the differential expression results.
ggraph(lcc, layout = "fr") +
geom_edge_link(alpha = 0.3, color = "gray60") +
geom_node_point(size = 4, color = "#ff7f0e") +
geom_node_text(aes(label = name), size = 3, repel = TRUE) +
theme_void() +
labs(title = "Largest Connected Component of the DEG PPI Network") +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))
Visualization of the LCC using a force-directed layout reveals a visually complex yet structurally sparse network. Notably, the LCC comprises virtually the entire network (over 99% of all nodes), indicating that the majority of DEGs are part of a single, coherent interaction space. This extensive connectivity, though sparse, highlights widespread co-regulation and functional interaction among the genes, especially around central hubs and key signaling modules. The force-directed layout helps expose these topological features by clustering highly connected genes and spreading apart loosely connected ones, allowing clearer insight into the network’s functional architecture.
This study presents a comprehensive transcriptomic analysis of KIRP, identifying 2,521 DEGs that reflect widespread molecular disruptions in tumor biology. Enrichment analysis revealed activation of oncogenic pathways such as KRAS and suppression of tumor suppressor signals like p53 and ATM, indicating a transcriptional landscape conducive to tumor progression. Motif analysis highlighted the involvement of developmental TFs and a concurrent reduction in immune and differentiation-associated regulators. PPI analysis uncovered a sparse, scale-free network architecture with prominent hub genes, underscoring key nodes in tumor-associated signaling. Together, these findings offer a detailed molecular profile of KIRP and suggest potential biomarkers and regulatory targets for future diagnostic or therapeutic development.