Introduction

Kidney Renal Papillary Cell Carcinoma (KIRP) is a distinct subtype of renal cell carcinoma that represents approximately 10–15% of all kidney cancers. It is characterized by unique histological features and molecular alterations that differ from the more common clear cell renal carcinoma. KIRP is further classified into two main subtypes, type 1 and type 2, which vary in their cellular characteristics and clinical behavior (Delahunt & Eble, 1997). Despite advances in understanding some genetic mutations involved in KIRP, such as alterations in the MET gene and fumarate hydratase (FH), much remains unknown about the full spectrum of molecular changes driving this disease (TCGA Research Network, 2016; Menko et al., 2014).

RNA sequencing (RNA-seq) is a powerful technology that enables the comprehensive analysis of gene expression across the entire transcriptome. By examining RNA-Seq data from 50 tumor samples and 50 matched normal kidney tissue controls, we can identify differentially expressed genes (DEGs) and disrupted molecular pathways specific to KIRP. This analysis will deepen our understanding of the biological mechanisms underlying tumor initiation and progression. Additionally, it may reveal novel biomarkers for diagnosis and potential therapeutic targets, ultimately contributing to improved clinical management of KIRP patients (Choueiri et al., 2015; Truong & Shen, 2011).

Project Pipeline

This project focuses on the analysis of RNA-seq count data derived from KIRP samples obtained from The Cancer Genome Atlas (TCGA). The goal is to explore the molecular differences between tumor and normal tissues through a comprehensive bioinformatics pipeline. Key steps include data preprocessing, filtering for protein-coding genes using the biomaRt package, and identifying DEGs via the edgeR package. Visualization techniques such as volcano plots and heatmaps are used to interpret the differential expression results. Gene set enrichment analysis (GSEA) is performed using the fgsea package to uncover significantly altered pathways in cancer. In addition, we investigate transcription factors (TFs) with enriched binding motifs in the promoters of regulated genes and examine protein-protein interaction (PPI) networks using STRING and igraph. This multi-layered approach provides insights into the transcriptional landscape of KIRP and highlights potential biomarkers or therapeutic targets for further research.

Loading Dependencies

library(biomaRt) # For querying Ensembl databases to retrieve gene information
library(edgeR) # For differential expression analysis of RNA-seq data
library(ggplot2) # For data visualization
library(ComplexHeatmap) # For creating customizable heatmaps
library(circlize) # For being able to use color functions in ComplexHeatmap
library(RColorBrewer) # For color palettes
library(fgsea) # For performing gene set enrichment analysis
library(msigdbr) # For loading MSigDB gene sets
library(MotifDb) # For working with transcription factor binding motifs
library(PWMEnrich) # For motif enrichment analysis
library(PWMEnrich.Hsapiens.background) # For human background motif data
library(Biostrings) # For handling biological sequences
library(GenomicRanges) # For working with genomic ranges and annotations
library(TxDb.Hsapiens.UCSC.hg38.knownGene) # For gene annotations and genomic features (hg38)
library(TxDb.Hsapiens.UCSC.hg19.knownGene) # For gene annotations and genomic features (hg19)
library(BSgenome.Hsapiens.UCSC.hg38) # For accessing the human genome sequence (hg38)
library(dplyr) # For data manipulation and transformation
library(STRINGdb) # For accessing the STRING database for protein-protein interactions
library(igraph) # For creating and analyzing networks
library(ggraph) # For visualizing networks with ggplot2 syntax
library(tidygraph) # For tidy data manipulation of graph objects

1. Loading, Exploring & Preprocessing the Data

# Setting dynamic working directory
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

# Reading the KIRP data
load("./BR_project_data_kidney_renal_papillary_cell_carcinoma.RData")

# Exploring the data
ls()
## [1] "c_anno_df"     "params"        "r_anno_df"     "raw_counts_df"
head(raw_counts_df[, 1:5])
##                 TCGA-B0-5696-11A TCGA-B0-5706-11A TCGA-CJ-6030-11A
## ENSG00000278704                0                0                0
## ENSG00000277400                0                0                0
## ENSG00000274847                0                0                0
## ENSG00000277428                0                0                0
## ENSG00000276256                0                0                0
## ENSG00000276932                0                0                0
##                 TCGA-CJ-5682-01A TCGA-CW-6090-11A
## ENSG00000278704                0                0
## ENSG00000277400                0                0
## ENSG00000274847                0                0
## ENSG00000277428                0                0
## ENSG00000276256                0                0
## ENSG00000276932                0                0
head(c_anno_df)
head(r_anno_df)
summary(raw_counts_df)
##  TCGA-B0-5696-11A  TCGA-B0-5706-11A    TCGA-CJ-6030-11A  TCGA-CJ-5682-01A   
##  Min.   :      0   Min.   :      0.0   Min.   :      0   Min.   :      0.0  
##  1st Qu.:      0   1st Qu.:      0.0   1st Qu.:      0   1st Qu.:      0.0  
##  Median :      1   Median :      1.0   Median :      1   Median :      1.0  
##  Mean   :   1067   Mean   :    967.9   Mean   :   1491   Mean   :    828.3  
##  3rd Qu.:    169   3rd Qu.:    164.0   3rd Qu.:    236   3rd Qu.:    126.0  
##  Max.   :1671266   Max.   :1921435.0   Max.   :2356239   Max.   :1311502.0  
##  TCGA-CW-6090-11A  TCGA-CW-5585-11A  TCGA-CJ-6033-11A  TCGA-CZ-4864-11A 
##  Min.   :      0   Min.   :      0   Min.   :      0   Min.   :      0  
##  1st Qu.:      0   1st Qu.:      0   1st Qu.:      0   1st Qu.:      0  
##  Median :      1   Median :      1   Median :      1   Median :      2  
##  Mean   :   1202   Mean   :   1340   Mean   :   1009   Mean   :   1476  
##  3rd Qu.:    176   3rd Qu.:    200   3rd Qu.:    162   3rd Qu.:    210  
##  Max.   :1871220   Max.   :2338889   Max.   :2478321   Max.   :4221197  
##  TCGA-CZ-5984-11A  TCGA-B8-5549-11A  TCGA-BP-4992-01A  TCGA-CJ-4872-01A 
##  Min.   :      0   Min.   :      0   Min.   :      0   Min.   :      0  
##  1st Qu.:      0   1st Qu.:      0   1st Qu.:      0   1st Qu.:      0  
##  Median :      1   Median :      1   Median :      3   Median :      2  
##  Mean   :   1235   Mean   :   1167   Mean   :   1132   Mean   :   1166  
##  3rd Qu.:    202   3rd Qu.:    196   3rd Qu.:    155   3rd Qu.:    204  
##  Max.   :3612075   Max.   :1182164   Max.   :8586200   Max.   :1669574  
##  TCGA-B2-5636-11A  TCGA-BP-4763-01A  TCGA-AK-3427-01A  TCGA-B8-A54I-01A 
##  Min.   :      0   Min.   :      0   Min.   :      0   Min.   :      0  
##  1st Qu.:      0   1st Qu.:      0   1st Qu.:      0   1st Qu.:      0  
##  Median :      2   Median :      1   Median :      1   Median :      1  
##  Mean   :   1195   Mean   :   1047   Mean   :   1160   Mean   :   1070  
##  3rd Qu.:    216   3rd Qu.:    163   3rd Qu.:     92   3rd Qu.:    134  
##  Max.   :2530340   Max.   :1021672   Max.   :2486228   Max.   :5473939  
##  TCGA-B0-5711-01A  TCGA-CW-5581-11A    TCGA-CZ-5985-11A    TCGA-B0-5402-11A 
##  Min.   :      0   Min.   :      0.0   Min.   :      0.0   Min.   :      0  
##  1st Qu.:      0   1st Qu.:      0.0   1st Qu.:      0.0   1st Qu.:      0  
##  Median :      1   Median :      1.0   Median :      1.0   Median :      2  
##  Mean   :   1128   Mean   :    931.5   Mean   :    777.6   Mean   :   1397  
##  3rd Qu.:    177   3rd Qu.:    140.0   3rd Qu.:    100.0   3rd Qu.:    240  
##  Max.   :1030598   Max.   :1796358.0   Max.   :2238533.0   Max.   :3691184  
##  TCGA-B0-5705-11A  TCGA-CZ-5453-11A  TCGA-CZ-5458-11A  TCGA-CZ-5452-11A 
##  Min.   :      0   Min.   :      0   Min.   :      0   Min.   :      0  
##  1st Qu.:      0   1st Qu.:      0   1st Qu.:      0   1st Qu.:      0  
##  Median :      2   Median :      2   Median :      2   Median :      1  
##  Mean   :   1331   Mean   :   1397   Mean   :   1798   Mean   :   1168  
##  3rd Qu.:    208   3rd Qu.:    235   3rd Qu.:    286   3rd Qu.:    193  
##  Max.   :3454246   Max.   :1367858   Max.   :1774756   Max.   :1136014  
##  TCGA-AK-3456-01A    TCGA-CJ-5689-11A    TCGA-CZ-5986-01A   TCGA-DV-5573-01A  
##  Min.   :      0.0   Min.   :      0.0   Min.   :     0.0   Min.   :     0.0  
##  1st Qu.:      0.0   1st Qu.:      0.0   1st Qu.:     0.0   1st Qu.:     0.0  
##  Median :      1.0   Median :      1.0   Median :     1.0   Median :     2.0  
##  Mean   :    891.7   Mean   :    853.6   Mean   :   713.8   Mean   :   963.6  
##  3rd Qu.:    119.0   3rd Qu.:    159.0   3rd Qu.:   106.0   3rd Qu.:   206.0  
##  Max.   :2175270.0   Max.   :1137302.0   Max.   :573980.0   Max.   :847629.0  
##  TCGA-B8-5163-01A  TCGA-CZ-5451-11A  TCGA-B8-A54J-01A  TCGA-BP-4999-01A 
##  Min.   :      0   Min.   :      0   Min.   :      0   Min.   :      0  
##  1st Qu.:      0   1st Qu.:      0   1st Qu.:      0   1st Qu.:      0  
##  Median :      1   Median :      1   Median :      2   Median :      3  
##  Mean   :   1617   Mean   :   1456   Mean   :   1047   Mean   :   1611  
##  3rd Qu.:    244   3rd Qu.:    231   3rd Qu.:    185   3rd Qu.:    280  
##  Max.   :1208410   Max.   :2571419   Max.   :1090268   Max.   :2670620  
##  TCGA-BP-4982-01A    TCGA-B8-5552-11A TCGA-BP-4760-01A   TCGA-B0-5097-01A
##  Min.   :      0.0   Min.   :     0   Min.   :     0.0   Min.   :     0  
##  1st Qu.:      0.0   1st Qu.:     0   1st Qu.:     0.0   1st Qu.:     0  
##  Median :      3.0   Median :     1   Median :     1.0   Median :     1  
##  Mean   :   1452.3   Mean   :  1134   Mean   :  1104.1   Mean   :  1032  
##  3rd Qu.:    277.2   3rd Qu.:   192   3rd Qu.:   170.2   3rd Qu.:   161  
##  Max.   :1404119.0   Max.   :710732   Max.   :630137.0   Max.   :751463  
##  TCGA-BP-4975-01A    TCGA-CZ-5462-11A  TCGA-BP-5192-01A  TCGA-CW-6087-11A 
##  Min.   :      0.0   Min.   :      0   Min.   :      0   Min.   :      0  
##  1st Qu.:      0.0   1st Qu.:      0   1st Qu.:      0   1st Qu.:      0  
##  Median :      2.0   Median :      2   Median :      1   Median :      2  
##  Mean   :   1777.0   Mean   :   1811   Mean   :   1261   Mean   :   1446  
##  3rd Qu.:    277.2   3rd Qu.:    283   3rd Qu.:    180   3rd Qu.:    198  
##  Max.   :2102684.0   Max.   :1992403   Max.   :1142488   Max.   :4231548  
##  TCGA-B0-5703-01A    TCGA-AS-3777-01A  TCGA-CW-6097-01A   TCGA-B0-5711-11A 
##  Min.   :      0.0   Min.   :      0   Min.   :     0.0   Min.   :      0  
##  1st Qu.:      0.0   1st Qu.:      0   1st Qu.:     0.0   1st Qu.:      0  
##  Median :      1.0   Median :      2   Median :     1.0   Median :      2  
##  Mean   :    900.3   Mean   :   1435   Mean   :   978.1   Mean   :   1332  
##  3rd Qu.:    122.0   3rd Qu.:    144   3rd Qu.:   171.0   3rd Qu.:    255  
##  Max.   :2035471.0   Max.   :3540438   Max.   :396749.0   Max.   :1305357  
##  TCGA-BP-4756-01A  TCGA-BP-4968-01A  TCGA-CW-5584-11A  TCGA-CZ-5469-11A 
##  Min.   :      0   Min.   :      0   Min.   :      0   Min.   :      0  
##  1st Qu.:      0   1st Qu.:      0   1st Qu.:      0   1st Qu.:      0  
##  Median :      2   Median :      2   Median :      1   Median :      2  
##  Mean   :   1220   Mean   :   1421   Mean   :   1009   Mean   :   1441  
##  3rd Qu.:    202   3rd Qu.:    236   3rd Qu.:    183   3rd Qu.:    244  
##  Max.   :3352500   Max.   :2533030   Max.   :1199558   Max.   :1554627  
##  TCGA-B0-4710-01A TCGA-A3-3358-11A    TCGA-B0-5115-01A  TCGA-AK-3436-01A
##  Min.   :     0   Min.   :      0.0   Min.   :      0   Min.   :     0  
##  1st Qu.:     0   1st Qu.:      0.0   1st Qu.:      0   1st Qu.:     0  
##  Median :     3   Median :      1.0   Median :      2   Median :     1  
##  Mean   :  1092   Mean   :    931.6   Mean   :   1144   Mean   :  1024  
##  3rd Qu.:   247   3rd Qu.:    155.0   3rd Qu.:    210   3rd Qu.:   122  
##  Max.   :854364   Max.   :1967076.0   Max.   :1209950   Max.   :964895  
##  TCGA-A3-3306-01A   TCGA-B8-4620-11A    TCGA-CZ-5470-11A  TCGA-B0-5712-11A 
##  Min.   :     0.0   Min.   :      0.0   Min.   :      0   Min.   :      0  
##  1st Qu.:     0.0   1st Qu.:      0.0   1st Qu.:      0   1st Qu.:      0  
##  Median :     1.0   Median :      1.0   Median :      2   Median :      2  
##  Mean   :   783.4   Mean   :    702.8   Mean   :   1517   Mean   :   1198  
##  3rd Qu.:   115.0   3rd Qu.:    129.0   3rd Qu.:    263   3rd Qu.:    229  
##  Max.   :568914.0   Max.   :1368123.0   Max.   :1914992   Max.   :1378659  
##  TCGA-CZ-4857-01A   TCGA-CJ-5672-11A  TCGA-CJ-5677-11A  TCGA-B0-5702-01A  
##  Min.   :     0.0   Min.   :      0   Min.   :      0   Min.   :     0.0  
##  1st Qu.:     0.0   1st Qu.:      0   1st Qu.:      0   1st Qu.:     0.0  
##  Median :     2.0   Median :      2   Median :      1   Median :     0.0  
##  Mean   :   933.5   Mean   :   1363   Mean   :   1189   Mean   :   430.8  
##  3rd Qu.:   179.0   3rd Qu.:    228   3rd Qu.:    194   3rd Qu.:    32.0  
##  Max.   :722856.0   Max.   :2539216   Max.   :2716893   Max.   :331564.0  
##  TCGA-B8-4622-11A    TCGA-A3-3387-11A  TCGA-CZ-5455-11A  TCGA-CZ-5989-11A 
##  Min.   :      0.0   Min.   :      0   Min.   :      0   Min.   :      0  
##  1st Qu.:      0.0   1st Qu.:      0   1st Qu.:      0   1st Qu.:      0  
##  Median :      1.0   Median :      1   Median :      1   Median :      2  
##  Mean   :    909.8   Mean   :   1039   Mean   :   1416   Mean   :   1101  
##  3rd Qu.:    134.0   3rd Qu.:    143   3rd Qu.:    213   3rd Qu.:    210  
##  Max.   :2023546.0   Max.   :2420963   Max.   :2239105   Max.   :1227352  
##  TCGA-A3-3367-01A  TCGA-B8-4619-11A    TCGA-B2-5641-11A    TCGA-CW-5590-01A   
##  Min.   :      0   Min.   :      0.0   Min.   :      0.0   Min.   :      0.0  
##  1st Qu.:      0   1st Qu.:      0.0   1st Qu.:      0.0   1st Qu.:      0.0  
##  Median :      3   Median :      1.0   Median :      1.0   Median :      1.0  
##  Mean   :   1435   Mean   :    745.5   Mean   :    921.7   Mean   :   1186.3  
##  3rd Qu.:    246   3rd Qu.:    108.0   3rd Qu.:    154.0   3rd Qu.:    183.2  
##  Max.   :1715223   Max.   :2021786.0   Max.   :1487356.0   Max.   :1317656.0  
##  TCGA-CJ-5679-11A  TCGA-B2-5633-01A  TCGA-CZ-5461-11A    TCGA-B2-3923-01A   
##  Min.   :      0   Min.   :      0   Min.   :      0.0   Min.   :      0.0  
##  1st Qu.:      0   1st Qu.:      0   1st Qu.:      0.0   1st Qu.:      0.0  
##  Median :      2   Median :      2   Median :      2.0   Median :      2.0  
##  Mean   :   1224   Mean   :   1308   Mean   :   1638.7   Mean   :    888.9  
##  3rd Qu.:    228   3rd Qu.:    162   3rd Qu.:    275.2   3rd Qu.:    103.0  
##  Max.   :1455094   Max.   :6503844   Max.   :2569014.0   Max.   :1708724.0  
##  TCGA-A3-A6NL-01A   TCGA-CJ-6031-01A  TCGA-BP-5178-01A  TCGA-CJ-4923-01A 
##  Min.   :     0.0   Min.   :      0   Min.   :      0   Min.   :      0  
##  1st Qu.:     0.0   1st Qu.:      0   1st Qu.:      0   1st Qu.:      0  
##  Median :     2.0   Median :      1   Median :      1   Median :      3  
##  Mean   :   859.7   Mean   :   1265   Mean   :   1436   Mean   :   1141  
##  3rd Qu.:   169.0   3rd Qu.:    194   3rd Qu.:    158   3rd Qu.:    234  
##  Max.   :435298.0   Max.   :1964829   Max.   :4046892   Max.   :1662324  
##  TCGA-BP-5009-01A  TCGA-BP-4347-01A  TCGA-B0-5694-11A  TCGA-CJ-5680-11A 
##  Min.   :      0   Min.   :      0   Min.   :      0   Min.   :      0  
##  1st Qu.:      0   1st Qu.:      0   1st Qu.:      0   1st Qu.:      0  
##  Median :      2   Median :      2   Median :      1   Median :      2  
##  Mean   :   1517   Mean   :   1578   Mean   :   1098   Mean   :   1045  
##  3rd Qu.:    216   3rd Qu.:    275   3rd Qu.:    208   3rd Qu.:    215  
##  Max.   :4500670   Max.   :2466437   Max.   :2100458   Max.   :1055485  
##  TCGA-B0-4824-01A  TCGA-B0-4700-11A  TCGA-CW-5589-11A  TCGA-CJ-4870-01A 
##  Min.   :      0   Min.   :      0   Min.   :      0   Min.   :      0  
##  1st Qu.:      0   1st Qu.:      0   1st Qu.:      0   1st Qu.:      0  
##  Median :      2   Median :      2   Median :      1   Median :      1  
##  Mean   :    803   Mean   :   1014   Mean   :   1201   Mean   :   1020  
##  3rd Qu.:    148   3rd Qu.:    142   3rd Qu.:    205   3rd Qu.:    127  
##  Max.   :1644719   Max.   :4518414   Max.   :1621172   Max.   :1119356  
##  TCGA-A3-3380-01A  TCGA-AK-3450-01A  TCGA-CZ-5987-11A   TCGA-CZ-5986-11A   
##  Min.   :      0   Min.   :      0   Min.   :     0.0   Min.   :      0.0  
##  1st Qu.:      0   1st Qu.:      0   1st Qu.:     0.0   1st Qu.:      0.0  
##  Median :      2   Median :      1   Median :     1.0   Median :      1.0  
##  Mean   :   1230   Mean   :   1060   Mean   :   700.7   Mean   :    903.2  
##  3rd Qu.:    224   3rd Qu.:    161   3rd Qu.:   128.0   3rd Qu.:    153.0  
##  Max.   :1099466   Max.   :1142804   Max.   :712959.0   Max.   :2298241.0  
##  TCGA-MM-A563-01A   TCGA-CZ-5465-11A  TCGA-B4-5843-01A    TCGA-CJ-4907-01A   
##  Min.   :     0.0   Min.   :      0   Min.   :      0.0   Min.   :      0.0  
##  1st Qu.:     0.0   1st Qu.:      0   1st Qu.:      0.0   1st Qu.:      0.0  
##  Median :     1.0   Median :      2   Median :      1.0   Median :      2.0  
##  Mean   :   803.5   Mean   :   1479   Mean   :    924.5   Mean   :   1351.1  
##  3rd Qu.:   173.0   3rd Qu.:    225   3rd Qu.:    144.0   3rd Qu.:    229.2  
##  Max.   :951818.0   Max.   :1982372   Max.   :1037058.0   Max.   :1249624.0  
##  TCGA-CW-6088-11A  TCGA-CJ-5678-11A  TCGA-CJ-4897-01A  TCGA-CW-5587-11A 
##  Min.   :      0   Min.   :      0   Min.   :      0   Min.   :      0  
##  1st Qu.:      0   1st Qu.:      0   1st Qu.:      0   1st Qu.:      0  
##  Median :      2   Median :      1   Median :      2   Median :      1  
##  Mean   :   1588   Mean   :   1243   Mean   :   1657   Mean   :   1242  
##  3rd Qu.:    252   3rd Qu.:    163   3rd Qu.:    247   3rd Qu.:    195  
##  Max.   :2418690   Max.   :4820798   Max.   :1679652   Max.   :1908773  
##  TCGA-BP-4338-01A  TCGA-BP-5190-01A  TCGA-B0-5701-01A   TCGA-CZ-5457-11A 
##  Min.   :      0   Min.   :      0   Min.   :     0.0   Min.   :      0  
##  1st Qu.:      0   1st Qu.:      0   1st Qu.:     0.0   1st Qu.:      0  
##  Median :      1   Median :      1   Median :     1.0   Median :      1  
##  Mean   :   1580   Mean   :   1185   Mean   :   702.8   Mean   :   1341  
##  3rd Qu.:    165   3rd Qu.:    136   3rd Qu.:   106.0   3rd Qu.:    205  
##  Max.   :2069060   Max.   :4347907   Max.   :882388.0   Max.   :1738977
summary(c_anno_df)
##     sample           condition        
##  Length:100         Length:100        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character
summary(r_anno_df)
##  ensembl_gene_id    external_gene_name     length       
##  Length:62940       Length:62940       Min.   :      7  
##  Class :character   Class :character   1st Qu.:    534  
##  Mode  :character   Mode  :character   Median :   3280  
##                                        Mean   :  30993  
##                                        3rd Qu.:  22689  
##                                        Max.   :2473538
dim(raw_counts_df)
## [1] 62940   100
dim(c_anno_df)
## [1] 100   2
dim(r_anno_df)
## [1] 62940     3
# Preprocessing required fields
c_anno_df$condition <- factor(c_anno_df$condition)
summary(c_anno_df$condition)
##    case control 
##      50      50

The RData file contains three data frames:

Data Frame Description Dimensions
raw_counts_df Raw RNA-seq counts 62940 x 100
c_anno_df Sample names and conditions 100 x 2
r_anno_df Ensembl gene IDs, gene lengths, and gene symbols 62940 x 3

Additionally, the condition column in c_anno_df was converted to a factor to facilitate downstream analysis.

2. Extracting Protein-Coding Genes

# Querying the Ensembl database for homo sapiens gene information
ensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")

# Getting gene information for the Ensembl IDs in r_anno_df
gene_info <- getBM(attributes = c("ensembl_gene_id", "gene_biotype"),
                   filters = "ensembl_gene_id",
                   values = r_anno_df$ensembl_gene_id,
                   mart = ensembl)

# Extracting protein-coding genes from the gene_info data frame
protein_coding_genes <- gene_info[gene_info$gene_biotype == "protein_coding", ]
cat("> Number of protein-coding genes:", nrow(protein_coding_genes), "\n")
## > Number of protein-coding genes: 22152
# Updating data frames to include only protein-coding genes
raw_counts_df <- raw_counts_df[rownames(raw_counts_df) %in% protein_coding_genes$ensembl_gene_id, ]
r_anno_df <- r_anno_df[r_anno_df$ensembl_gene_id %in% protein_coding_genes$ensembl_gene_id, ]

From the annotated gene list, we filtered for protein-coding genes using the Ensembl BioMart database. This resulted in a subset of 22152 protein-coding genes out of the original 62940 input genes. The raw count matrix and annotation were aligned accordingly to ensure consistency for downstream expression analysis.

The updated dimensions of the data frames are as follows:

Data Frame Dimensions Before Dimensions After
raw_counts_df 62940 x 100 22152 x 100
r_anno_df 62940 x 3 22152 x 3

3. Differential Expression Analysis

3.1. Setup & Filtering

group <- c_anno_df$condition

# Creating DGEList object
dge <- DGEList(counts = raw_counts_df, group = group)

# Filtering genes
# Condition: keep if count > 20 in at least 5 samples of either group
keep <- rowSums(dge$counts[, group == "case"] > 20) >= 5 |
        rowSums(dge$counts[, group == "control"] > 20) >= 5
dge <- dge[keep, , keep.lib.sizes = FALSE]

3.2. Normalization & Model Fitting

# Normalizing and constructing the design matrix
dge <- calcNormFactors(dge)
design <- model.matrix(~ group)

# Estimating dispersion and fitting the model
dge <- estimateDisp(dge, design)
fit <- glmFit(dge, design)
lrt <- glmLRT(fit, coef = 2)

3.3. Annotating DEGs

# Extracting results
results <- topTags(lrt, n = Inf)$table

# Adding average log CPM using edgeR model-based estimate
results$logCPM <- aveLogCPM(dge)

# Classifying DEGs
results$DEG <- "not_significant"
results$DEG[results$FDR < 0.05 & results$logFC > 1.5 & results$logCPM > 1] <- "up"
results$DEG[results$FDR < 0.05 & results$logFC < -1.5 & results$logCPM > 1] <- "down"

# Displaying summary
table(results$DEG)
## 
##            down not_significant              up 
##            1566           14543             955
# combine the up and down 

After applying differential expression analysis using the edgeR package, a total of 2521 genes were identified as significantly altered between the case and control groups. Among these, 955 genes showed significant up-regulation, and 1566 genes were significantly down-regulated, based on the thresholds of FDR < 0.05, log2 fold change > 1.5 (or < -1.5), and logCPM > 1.

The majority of analyzed genes (14543 genes) did not show statistically significant expression changes under these criteria. These results indicate a focused subset of genes potentially involved in the biological differences between the two conditions.

Regulation Count Condition
Up 955 if FDR < 0.05, log2FC > 1.5, and logCPM > 1
Down 1566 if FDR < 0.05, log2FC < -1.5, and logCPM > 1
Not Significant 14543 if FDR >= 0.05 or logCPM <= 1
Total 17064

3.4. Results Visualization

3.4.1. Volcano Plot

# Plotting the volcano plot
ggplot(results, aes(x = logFC, y = -log10(FDR), color = DEG)) +
  geom_point(alpha = 0.6, size = 1.2) +
  scale_color_manual(
    values = c("up" = "#D62728", "down" = "#1F77B4", "not_significant" = "grey80")
  ) +
  theme_minimal(base_size = 14) +
  labs(
    title = "Volcano Plot of Differential Expression",
    
    x = expression("Log"[2]*" Fold Change"),
    y = expression("-Log"[10]*" FDR"),
    color = "Regulation"
  ) +
  geom_vline(xintercept = c(-1.5, 1.5), linetype = "dashed", color = "black", alpha = 0.5) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "black", alpha = 0.5) +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "top",
    panel.grid.minor = element_blank()
)

This volcano plot visualizes the differential gene expression between tumor (case) and normal (control) samples in KIRP. Each point represents a gene, plotted by its log2 fold change (x-axis) and statistical significance as −log10 FDR (y-axis). Genes that are significantly up-regulated (lo2FC > 1.5, FDR < 0.05, and logCPM > 1) are shown in red, while significantly down-regulated genes (log2FC < −1.5) are in blue. Non-significant genes are shown in grey. The dashed vertical and horizontal lines indicate the fold-change and FDR thresholds used. The plot highlights a substantial number of both up- and down-regulated genes, reflecting widespread transcriptional alterations in the tumor samples, with several genes showing both strong expression changes and high statistical significance. Notably, the plot reveals a slight asymmetry, with more down-regulated genes than up-regulated ones.

3.4.2. Heatmap Focused on Up & Down Regulated Genes

# Generating the expression matrix for heatmap
logCPM_mat <- cpm(dge, log = TRUE)
deg_genes <- rownames(results)[results$DEG %in% c("up", "down")]
heat_data <- logCPM_mat[deg_genes, ]
heat_data_scaled <- t(scale(t(heat_data)))

# Generating the annotation data frame
condition <- c_anno_df$condition
names(condition) <- colnames(heat_data_scaled)

# Definining heatmap annotation and colors
ha <- HeatmapAnnotation(
  Condition = condition,
  col = list(Condition = c("case" = "#02BA0F", "control" = "#3DED97")),
  annotation_legend_param = list(title = "Condition")
)
col_fun <- colorRamp2(c(-8, 0, 8), c("#053061", "#f7f7f7", "#67001f"))

# Plotting the heatmap
ht_opt$message = FALSE # Just to supress messages from ComplexHeatmap library
Heatmap(
  matrix = heat_data_scaled,
  name = "Z-score logCPM",
  top_annotation = ha,
  col = col_fun,
  show_row_names = FALSE,
  show_column_names = FALSE,
  cluster_rows = TRUE,
  cluster_columns = TRUE,
  column_title = "Differentially Expressed Genes (Z-score logCPM)",
  heatmap_legend_param = list(
    title = "Z-score logCPM"
  )
)

This heatmap shows the expression profiles of DEGs across tumor (case) and normal (control) samples, using Z-score normalized logCPM values. Samples cluster clearly by condition, reflecting distinct gene expression patterns between groups. Red and blue blocks indicate genes consistently up- or down-regulated within each condition. Z-score normalization emphasizes relative expression changes, facilitating comparison across genes with different baseline levels. The observed clustering of both samples and genes highlights strong transcriptional differences and supports the reliability of the differential expression analysis.

4. Gene Set Enrichment Analysis (GSEA)

4.1. Preparing Ranked Gene List for GSEA

# Matching gene symbols from annotation
results$gene <- r_anno_df$external_gene_name[match(rownames(results), r_anno_df$ensembl_gene_id)]

# Cleaning gene names by removing empty and duplicated entries
results_clean <- results[results$gene != "" & !is.na(results$gene), ]
results_clean <- results_clean[!duplicated(results_clean$gene), ]

# Creating named vector of logFC values with gene symbols
fc_vector <- results_clean$logFC
names(fc_vector) <- results_clean$gene

4.2. Loading MSigDB Gene Sets (C4 & C6)

# Loading gene sets from MSigDB for Homo sapiens
msig_c4 <- msigdbr(species = "Homo sapiens", category = "C4")
msig_c6 <- msigdbr(species = "Homo sapiens", category = "C6")

# Converting to list format for fgsea
c4_list <- split(msig_c4$gene_symbol, msig_c4$gs_name)
c6_list <- split(msig_c6$gene_symbol, msig_c6$gs_name)

The Molecular Signatures Database (MSigDB) is a curated collection of annotated gene sets used for gene set enrichment analysis. It organizes genes based on shared biological functions, regulatory motifs, and pathways, enabling deeper interpretation of gene expression data.

In this section of our study, we focus on two key MSigDB collections:

  • The C4 collection includes computational gene sets derived from unsupervised clustering of cancer-related gene expression data, often reflecting tissue-specific programs, immune responses, and co-expression modules.

  • The C6 collection consists of oncogenic signatures based on gene expression changes resulting from experimental activation or suppression of oncogenes and tumor suppressors.

4.3. Running GSEA

# Setting seed for reproducibility
set.seed(101)

# Running GSEA with fgsea function
fgsea_c4 <- fgsea(pathways = c4_list, stats = fc_vector, nperm = 10000)
fgsea_c6 <- fgsea(pathways = c6_list, stats = fc_vector, nperm = 10000)

# Sorting the results by Normalized Enrichment Score (NES)
fgsea_c4 <- fgsea_c4[order(fgsea_c4$NES, decreasing = TRUE), ]
fgsea_c6 <- fgsea_c6[order(fgsea_c6$NES, decreasing = TRUE), ]

4.4. Extracting Top 10 Enriched Gene Sets

# Getting the top 10 up- and down-regulated gene sets from C4 and C6
top10_up_c4 <- head(fgsea_c4, 10)
top10_down_c4 <- tail(fgsea_c4, 10)
top10_up_c6 <- head(fgsea_c6, 10)
top10_down_c6 <- tail(fgsea_c6, 10)

# Displaying summaries
print(top10_up_c4[, c("pathway", "NES", "padj")])
##                                                   pathway      NES        padj
##                                                    <char>    <num>       <num>
##  1: GAVISH_3CA_METAPROGRAM_EPITHELIAL_METABOLISM_KIDNEY_2 2.544611 0.002075391
##  2:         GAVISH_3CA_METAPROGRAM_EPITHELIAL_RESPIRATION 2.280733 0.002090007
##  3: GAVISH_3CA_METAPROGRAM_EPITHELIAL_METABOLISM_KIDNEY_1 2.243889 0.002075391
##  4:                                            MODULE_221 2.123694 0.001983454
##  5:                                            MODULE_184 2.083426 0.001983454
##  6:                                             MODULE_78 2.025267 0.003355783
##  7:               GAVISH_3CA_METAPROGRAM_EPITHELIAL_EPI_2 2.015994 0.003733247
##  8:                                            MODULE_305 1.952085 0.003355783
##  9:                                            MODULE_180 1.925057 0.002565122
## 10:                                            MODULE_141 1.824045 0.011641183
print(top10_down_c4[, c("pathway", "NES", "padj")])
##                                          pathway       NES        padj
##                                           <char>     <num>       <num>
##  1:                                   GNF2_ITGB2 -2.207099 0.001160725
##  2:                                    MODULE_45 -2.230670 0.001160725
##  3:                                     GNF2_HPX -2.240167 0.001160725
##  4:                                    MODULE_84 -2.244901 0.001160725
##  5:                                    GNF2_LCAT -2.266469 0.001160725
##  6: GAVISH_3CA_METAPROGRAM_CD8_T_CELLS_CYTOTOXIC -2.281415 0.001160725
##  7: GAVISH_3CA_METAPROGRAM_CD4_T_CELLS_CYTOTOXIC -2.295184 0.001160725
##  8:                                   GNF2_CEBPA -2.299420 0.001160725
##  9:                                   GNF2_IL2RB -2.309512 0.001160725
## 10:                                      CAR_HPX -2.374754 0.001160725
print(top10_up_c6[, c("pathway", "NES", "padj")])
##                             pathway      NES       padj
##                              <char>    <num>      <num>
##  1:          BMI1_DN_MEL18_DN.V1_DN 1.772549 0.01750811
##  2:                  MEL18_DN.V1_DN 1.735040 0.01750811
##  3:            KRAS.KIDNEY_UP.V1_UP 1.715648 0.01750811
##  4: SINGH_KRAS_DEPENDENCY_SIGNATURE 1.511300 0.18228148
##  5:                   BMI1_DN.V1_DN 1.495581 0.05235193
##  6:               KRAS.300_UP.V1_UP 1.455893 0.08255735
##  7:                    IL2_UP.V1_DN 1.370885 0.08255735
##  8:          KRAS.PROSTATE_UP.V1_DN 1.367707 0.12240803
##  9:                   IL21_UP.V1_DN 1.345186 0.09336492
## 10:                   LTE2_UP.V1_UP 1.338744 0.08579564
print(top10_down_c6[, c("pathway", "NES", "padj")])
##                    pathway       NES       padj
##                     <char>     <num>      <num>
##  1:           ATM_DN.V1_UP -1.503579 0.06243118
##  2:         RPS14_DN.V1_UP -1.510817 0.04216269
##  3: BMI1_DN_MEL18_DN.V1_UP -1.532090 0.04216269
##  4:      PRC2_EED_DN.V1_DN -1.533858 0.02759890
##  5:         MEL18_DN.V1_UP -1.543990 0.04115537
##  6:           ALK_DN.V1_UP -1.568256 0.04216269
##  7:           P53_DN.V2_UP -1.616406 0.02759890
##  8:          SNF5_DN.V1_UP -1.638530 0.01257485
##  9:          CTIP_DN.V1_UP -1.644272 0.02759890
## 10:   KRAS.KIDNEY_UP.V1_DN -1.838788 0.01257485

We conducted GSEA on the ranked expression profile of our samples to identify key biological and regulatory pathways. Specifically, we compared our ranked gene list against the C4 and C6 collections from MSigDB.

For the C4 collection, the top upregulated gene sets were enriched for epithelial metabolism and kidney-specific transcriptional programs, suggesting a metabolically active epithelial phenotype. Conversely, the top downregulated gene sets were associated with cytotoxic T cell activity and immune signaling, indicating an immune-suppressed or immune-excluded tumor microenvironment.

In the C6 collection, upregulated gene sets revealed activation of KRAS signaling, cytokine response pathways (including IL2 and IL21), and repression signatures linked to BMI1 and MEL18 inhibition, supporting a profile of oncogenic activation and proliferative signaling. Downregulated gene sets included p53 and ATM tumor suppressor pathways, along with chromatin regulators such as PRC2 and SNF5, pointing to disrupted tumor suppressor activity and potential epigenetic dysregulation.

Together, these findings suggest that our tumor samples exhibit a transcriptional profile characterized by epithelial dominance, immune coldness, and oncogenic signaling activation, consistent with an aggressive tumor-like phenotype.

4.5. Plotting GSEA Tables for Top Enriched Pathways

# Ensuring that the fgsea results are data frames to avoid plotting errors
fgsea_c4_df <- as.data.frame(fgsea_c4)
fgsea_c6_df <- as.data.frame(fgsea_c6)

# Plotting GSEA table for top 10 up-regulated in c4
plotGseaTable(
  pathways = c4_list[top10_up_c4$pathway],
  stats = fc_vector,
  fgseaRes = fgsea_c4_df,
  gseaParam = 0.5
)

# Plotting GSEA table for top 10 down-regulated in c4
plotGseaTable(
  pathways = c4_list[top10_down_c4$pathway],
  stats = fc_vector,
  fgseaRes = fgsea_c4_df,
  gseaParam = 0.5
)

# Plotting GSEA table for top 10 up-regulated in c6
plotGseaTable(
  pathways = c6_list[top10_up_c6$pathway],
  stats = fc_vector,
  fgseaRes = fgsea_c6_df,
  gseaParam = 0.5
)

# Plotting GSEA table for top 10 down-regulated in c6
plotGseaTable(
  pathways = c6_list[top10_down_c6$pathway],
  stats = fc_vector,
  fgseaRes = fgsea_c6_df,
  gseaParam = 0.5
)

To further illustrate the enrichment results, we generated GSEA tables highlighting the top up- and downregulated gene sets from the C4 and C6 collections.

4.6. Enrichment Plots for One of the Top Up- & Down-regulated Pathways in C4 & C6

# Plotting the enrichment score for the single top up-regulated pathway in C4
plotEnrichment(c4_list[[top10_up_c4$pathway[1]]], fc_vector) +
  labs(title = paste("C4 Top Upregulated:", top10_up_c4$pathway[1]))

# Plotting the enrichment score for the single top down-regulated pathway in C4
plotEnrichment(c4_list[[top10_down_c4$pathway[1]]], fc_vector) +
  labs(title = paste("C4 Top Downregulated:", top10_down_c4$pathway[1]))

# Plotting the enrichment score for the single top up-regulated pathway in C6
plotEnrichment(c6_list[[top10_up_c6$pathway[1]]], fc_vector) +
  labs(title = paste("C6 Top Upregulated:", top10_up_c6$pathway[1]))

# Plotting the enrichment score for the single top down-regulated pathway in C6
plotEnrichment(c6_list[[top10_down_c6$pathway[1]]], fc_vector) +
  labs(title = paste("C6 Top Downregulated:", top10_down_c6$pathway[1]))

Each enrichment plot displays a green line representing the enrichment score (ES), a running-sum statistic that increases when a gene from the selected set is encountered in the ranked list and decreases otherwise. Tick marks along the x-axis indicate the positions of these genes within the ranked list. A sharp rise near the beginning of the plot suggests strong enrichment at the top of the gene list (upregulation), while a pronounced decline indicates enrichment at the bottom of the gene list (downregulation).

In the C4 collection, the top upregulated gene set was GAVISH_3CA_METAPROGRAM_EPITHELIAL_METABOLISM, where the ES sharply peaks early, showing strong enrichment of epithelial metabolism genes at the top of the ranked list, which is consistent with active metabolic and biosynthetic processes in epithelial cells. In contrast, GNF2_ITGB2 was the top downregulated C4 gene set, with a trough in the ES at the bottom of the ranked list, indicating strong suppression of immune-related genes, particularly those associated with integrin β2 (ITGB2), a key player in leukocyte adhesion and immune cell signaling.

For the C6 oncogenic signature plots, the top upregulated gene set, BMI1_DN_MEL18_DN_V1_DN, had an enrichment score peak early in the ranked gene list. This reflects upregulation of genes normally repressed by Polycomb group proteins BMI1 and MEL18, suggesting epigenetic derepression, which is often linked to stemness and aggressive tumor phenotypes. Lastly, the ATM_DN.V1_UP gene set was the most downregulated, with its ES curve dipping in the mid-to-late portion of the ranked list. This pattern reflects decreased expression of genes normally activated during ATM pathway inhibition, indicating a potentially impaired DNA damage response and greater genomic instability—hallmarks of tumor progression.

5. Transcription Factors (TFs) Identification

5.1. Preparing DEGs & Pathway Genes

# Preparing the DEGs
de_results <- results_clean
de_up <- de_results %>% filter(logFC > 1.5 & FDR < 0.05 & logCPM > 1)
de_down <- de_results %>% filter(logFC < -1.5 & FDR < 0.05 & logCPM > 1)

# Getting gene lists from the top GSEA sets
top_pathways_c4 <- c(top10_up_c4$pathway, top10_down_c4$pathway)
top_pathways_c6 <- c(top10_up_c6$pathway, top10_down_c6$pathway)

# Extracting genes from the C4 and C6 lists
genes_from_c4 <- unlist(c4_list[top_pathways_c4])
genes_from_c6 <- unlist(c6_list[top_pathways_c6])
all_genes <- unique(c(genes_from_c4, genes_from_c6))

# Filtering the DEGs by pathway genes
selected_up_genes <- intersect(de_up$gene, all_genes)
selected_down_genes <- intersect(de_down$gene, all_genes)

5.2. Mapping to Entrez IDs & Retrieving Promoters

# Mapping gene symbols to Entrez IDs using biomaRt
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
all_genes_info <- getBM(attributes = c("external_gene_name", "entrezgene_id"),
                        filters = "external_gene_name",
                        values = unique(c(selected_up_genes, selected_down_genes)),
                        mart = mart)

# Loading gene coordinates from TxDb and mapping to gene names
txdb_genes <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb_genes$gene_name <- all_genes_info$external_gene_name[match(txdb_genes$gene_id,
                                                                all_genes_info$entrezgene_id)]
# Getting 500 bp upstream promoters for selected genes
promoters_up <- subset(promoters(txdb_genes, upstream = 500, downstream = 0),
                       gene_name %in% selected_up_genes)
promoters_down <- subset(promoters(txdb_genes, upstream = 500, downstream = 0),
                         gene_name %in% selected_down_genes)

To investigate upstream regulatory elements of DEGs, we first selected genes that were both significantly altered in expression and members of key GSEA-enriched pathways. These genes were mapped to their corresponding Entrez Gene IDs, and their promoter regions, defined as the 500 base pairs upstream of the transcription start site (TSS), were retrieved using the Ensembl database via the biomaRt package. This set of promoter sequences served as the input for downstream motif enrichment analysis.

5.3. Extracting Sequences & Running Motif Enrichment

# Retrieving sequences from hg19 genome
seqs_up <- getSeq(BSgenome.Hsapiens.UCSC.hg38, promoters_up)
seqs_down <- getSeq(BSgenome.Hsapiens.UCSC.hg38, promoters_down)

# Loading motif database
data(PWMLogn.hg19.MotifDb.Hsap)

# Running motif enrichment (output suppressed)
set.seed(444) # Setting seed for reproducibility
res_up <- motifEnrichment(seqs_up, PWMLogn.hg19.MotifDb.Hsap, score = "affinity")
## Calculating motif enrichment scores ...
res_down <- motifEnrichment(seqs_down, PWMLogn.hg19.MotifDb.Hsap, score = "affinity")
## Calculating motif enrichment scores ...
# Reporting top motifs
report_up <- sequenceReport(res_up, 1)
report_down <- sequenceReport(res_down, 1)

For motif enrichment analysis, we employed the PWMEnrich package, which requires background models computed for a specific genome build. We used the hg19 human genome as the reference for statistical scoring, as it is the only build currently supported by the precomputed background models embedded in PWMEnrich. Although the hg38 genome assembly is more recent and biologically accurate, PWMEnrich does not provide corresponding background models, and conflicts with other packages under recent R versions.

To address this, we extracted promoter sequences based on the hg38 genome build, which allowed us to leverage improved gene annotations and coordinate accuracy. These sequences were then analyzed using PWMEnrich with the hg19-based background model, which could introduce a minor inconsistency due to the mismatch between genome builds. However, this approach allows reliable statistical scoring while benefiting from the improved annotations of hg38.

5.4. Visualizing Motif Enrichment Results

# Plotting the top enriched motifs for up- and down-regulated genes
plot(report_up[report_up$p.value < 0.01], fontsize = 7, id.fontsize = 6)

plot(report_down[report_down$p.value < 0.01], fontsize = 7, id.fontsize = 6)

To uncover transcriptional regulators behind gene expression changes, we analyzed motifs in the 500 bp promoter regions of genes significantly differentially expressed and enriched in key GSEA pathways (C4 and C6), based on the hg19 genome.

Promoters of upregulated genes were enriched for motifs linked to TFs like FOXB1, FOXJ3, DUXA, and GSC2. These factors are known for roles in embryonic development, cell fate decisions, and maintaining pluripotency or regenerative states. Some enriched motifs lack detailed functional annotation, suggesting novel or context-specific regulatory elements. Overall, this points to upregulated genes being controlled by TFs that promote cellular plasticity and stem-like transcriptional programs, indicative of a shift toward a more primitive or regenerative state, relevant in tissue remodeling or tumorigenesis.

In contrast, downregulated genes showed motif enrichment for TFs such as SOX9, SOX10, FOXC1, TFAP2A, and ELK1, which are involved in neural crest development, differentiation, vascular formation, and immune response. The enrichment suggests repression of immune activity and lineage-specific functions. Some under-characterized motifs hint at unknown TFs contributing to suppressing mature cellular identities.

Together, these results reveal a biphasic regulatory shift: activation of developmental plasticity factors alongside suppression of differentiation and immune-related TFs. This pattern reflects a biological transition toward less differentiated, more flexible cellular states, commonly observed in cancer progression, regeneration, and reprogramming.

5.5. Extracting Significant TFs

# Extracting significant TFs
sig_up <- as.data.frame(report_up[report_up$p.value < 0.01, ])
sig_down <- as.data.frame(report_down[report_down$p.value < 0.01, ])

tf_up <- sig_up %>%
  group_by(target) %>%
  slice_min(p.value, n = 1) %>%
  ungroup()

tf_down <- sig_down %>%
  group_by(target) %>%
  slice_min(p.value, n = 1) %>%
  ungroup()

# Getting unique TFs
unique_tf_up <- unique(tf_up$target)
unique_tf_down <- unique(tf_down$target)

# Displaying the counts of unique TFs
length(unique_tf_up)
## [1] 22
length(unique_tf_down)
## [1] 18

In this step, we extracted significant TFs by filtering for interactions with a false discovery rate (FDR)-adjusted p-value less than 0.01 from both upregulated and downregulated gene sets. This adjustment controls for multiple testing and reduces the likelihood of false positives. For each target gene, we selected the most significant TF (with the lowest adjusted p-value) to avoid redundancy. We then identified the unique TFs involved in each set and counted them, resulting in 22 unique TFs associated with upregulated targets and 18 with downregulated targets. This process highlights the most statistically robust regulators potentially driving differential gene expression in the dataset.

6. Empirical Distribution of Top-Enriched TF

6.1. Extracting SOX9 Motifs & Preparing Promoter Sequences

# Extracting SOX9 PWMs from MotifDb for Homo sapiens
motifs_sox9 <- subset(MotifDb, organism == 'Hsapiens' & geneSymbol == 'SOX9')
pwms_sox9 <- toPWM(as.list(motifs_sox9))
names(pwms_sox9) <- sapply(names(pwms_sox9), function(x) strsplit(x, "-")[[1]][3])

# Ensuring that the promoter sequences of downregulated genes are available
stopifnot(exists("seqs_down")) # Should be a DNAStringSet object

# Computing raw motif scores for each promoter sequence
scores_list <- lapply(seqs_down, function(prom) motifScores(prom, pwms_sox9, raw.score = TRUE))
names(scores_list) <- names(seqs_down)

6.2. Calculating Background Thresholds & Enrichment Cutoffs

# Computing empirical background score distributions using hg19 for SOX9 motifs
ecdf <- motifEcdf(pwms_sox9, organism = "hg19", quick = TRUE)

# Calculating 99.75% log2-transformed threshold for each motif
sox9_thresholds <- sapply(names(pwms_sox9), function(name) {
  log2(quantile(ecdf[[name]], 1 - 0.0025, na.rm = TRUE))
})
names(sox9_thresholds) <- names(pwms_sox9)

For downstream analysis, SOX9, one of the top-enriched TFs, was selected to investigate its potential binding in the promoters of downregulated genes. We retrieved 16 SOX9 PWMs for H. sapiens from MotifDb, for which we computed raw motif scores across the promoter sequences of downregulated genes. To assess significance, we calculated empirical score distributions using the hg19 genomic background and determined log2-transformed thresholds at the 99.75th percentile. These thresholds represent cutoff values above which SOX9 binding sites are unlikely to occur by chance. From these PWMs, we selected one representative binding model, SOX9_HUMAN.H10MO.B from the HOCOMOCO v10 database, to visualize predicted binding affinities in promoter regions. For this PWM, a threshold score of 4.265822 was computed from the empirical background distribution.

6.3. Visualizing SOX9 Motif Scores in a Promoter Sequence

# Selecting motif and gene to plot
motif_name_to_plot <- names(sox9_thresholds)[1]
cutoff_to_plot <- sox9_thresholds[[motif_name_to_plot]]

# Plotting the motif scores without cutoff
plotMotifScores(scores_list[[1]], sel.motifs = motif_name_to_plot,
                cols = c("steelblue", "forestgreen", "tomato"),
                main = paste("SOX9 Motif:", motif_name_to_plot, "| Gene:", names(scores_list)[1]))

# Plotting the motif scores with cutoff
plotMotifScores(scores_list[[1]], sel.motifs = motif_name_to_plot,
                cols = c("steelblue", "forestgreen", "tomato"),
                cutoff = cutoff_to_plot,
                main = paste("SOX9 Motif w/ Cutoff:", motif_name_to_plot, "| Gene:", names(scores_list)[1]))

The two graphs illustrate SOX9 motif matches within a 500 bp region of Gene 100, with and without applying a score cutoff.

In the first graph, all potential binding sites are shown regardless of their strength, including both strong and weak matches on the forward and reverse strands. This provides a comprehensive view but may include low-confidence, biologically insignificant sites.

In contrast, the second graph applies a cutoff score of 4.265, filtering out weaker matches and retaining only the stronger, more likely functional binding sites—resulting in six clear matches on the forward strand.

Overall, applying a cutoff enhances the specificity of the analysis by highlighting high-confidence SOX9 binding regions, making the results more biologically meaningful.

7. Protein-Protein Interaction (PPI) Network Construction

7.1. Retrieving PPI Interactions from STRING

# Initializing the STRINGdb object
string_db <- STRINGdb$new(version = "11.5", species = 9606, score_threshold = 400)
# Species 9606 = the NCBI taxonomy ID of Homo sapiens
# Score threshold = the minimum confidence score for interactions, here set to 400

# Preparing the DEG gene list
selected_up_genes <- results$gene[results$DEG == "up"]
selected_down_genes <- results$gene[results$DEG == "down"]
deg_genes <- unique(c(selected_up_genes, selected_down_genes))

# Converting to a data frame
deg_genes_df <- data.frame(gene = deg_genes, stringsAsFactors = FALSE)

# Mapping gene symbols to STRING IDs
mapped <- string_db$map(deg_genes_df, "gene", removeUnmappedRows = TRUE)
## Warning:  we couldn't map to STRING 2% of your identifiers
# Retrieving PPI interactions
ppi_network <- string_db$get_interactions(mapped$STRING_id)

In this step, we initialized the STRINGdb object to access the STRING PPI database for Homo sapiens (taxonomy ID: 9606). We compiled a list of DEGs, combining both up- and down-regulated genes, and mapped their gene symbols to STRING identifiers. Using these STRING IDs, we retrieved high-confidence PPIs (combined score >= 400), which represent biologically meaningful associations based on multiple evidence channels such as experimental data, co-expression, and database annotations. This forms the foundation of our interaction network for downstream analysis.

7.2. Mapping Gene Symbols & Exporting the Labeled PPI Network

# Creating a look-up table to map each gene id to its corresponding gene name
id_to_gene <- mapped[, c("STRING_id", "gene")]

# Merging gene symbols for both ends of the interaction
labeled_ppi_network <- ppi_network %>%
  left_join(id_to_gene, by = c("from" = "STRING_id")) %>%
  rename(from_gene = gene) %>%
  left_join(id_to_gene, by = c("to" = "STRING_id")) %>%
  rename(to_gene = gene)

# Reordering columns for readability
labeled_ppi_network <- labeled_ppi_network[, c("from", "from_gene", "to", "to_gene", "combined_score")]

# Saving labeled network as a TSV file
write.table(labeled_ppi_network, file = "labeled_ppi_network.tsv", sep = "\t", quote = FALSE, row.names = FALSE)

The retrieved PPI network initially used internal STRING IDs, which are not easily interpretable. In this step, we mapped each STRING ID back to its corresponding gene symbol using our mapping table. This resulted in a labeled PPI network where each node is annotated with a human-readable gene name. We then exported this labeled network as a TSV file for easier interpretation, visualization, or integration with external tools such as Cytoscape or network analysis pipelines.

8. PPI Network Visualization & Analysis

8.1. Creating the PPI Network Graph

# Using the full interaction table with gene names
ppi_edges <- labeled_ppi_network %>% select(from_gene, to_gene, combined_score)

# Creating an undirected igraph object with gene symbols
g <- graph_from_data_frame(ppi_edges, directed = FALSE)

# Displaying some graph information
cat("Number of nodes in the graph:", gorder(g), "\nNumber of edges in the graph:", gsize(g), "\n")
## Number of nodes in the graph: 2371 
## Number of edges in the graph: 49824

Utilizing the generated labeled PPI network, we created an undirected graph object using the igraph package, which allows us to represent the PPI network as a graph structure. The edges of the graph are defined by the interactions between genes, with each edge weighted by the combined score from STRING, indicating the confidence of the interaction. This graph structure enables us to perform various network analyses and visualizations.

8.2. Visualizing the Network Degree Distribution

# Calculating the degree of each node using the degree() function from igraph
deg <- degree(g)
deg_df <- data.frame(gene = names(deg), degree = deg)

# Obtaining the frequency of each degree value
deg_freq <- deg_df %>%
  count(degree)

# Plotting the degree distribution
ggplot(deg_freq, aes(x = degree, y = n)) +
  geom_col(fill = "#800080", color = "#800080") +
  theme_minimal() +
  labs(title = "Degree Distribution of the PPI Network", x = "Degree", y = "Frequency") +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.title = element_text(face = "bold")
  )

The degree distribution of the PPI network is highly right-skewed, indicating that the majority of genes participate in only a few interactions, while a small subset of genes act as hubs with over 500 connections. This pattern is characteristic of a scale-free network architecture, where a few highly connected nodes play a central role in maintaining connectivity. Although the average node has around 20 connections, the network consists of 49,824 edges out of a possible 2.8 million \[\frac{N(N-1)}{2} = \frac{2371 \times 2370}{2} = 2,\!809,\!635\], resulting in an edge density of approximately 1.6%. This confirms that the network is sparse overall, a typical feature of biological systems, which favor efficiency, modularity, and robustness through selective connectivity.

8.3. Calculating Centrality Metrics & Identifying the Top Genes for Each

# Computing centrality metrics
centrality_df <- data.frame(
  gene = V(g)$name,
  degree = degree(g),
  closeness = closeness(g, normalized = TRUE),
  betweenness = betweenness(g, normalized = TRUE)
)

# Obtaining the top 10 genes by each centrality measure
top_degree <- centrality_df %>% arrange(desc(degree)) %>% slice(1:10)
top_closeness <- centrality_df %>% arrange(desc(closeness)) %>% slice(1:10)
top_betweenness <- centrality_df %>% arrange(desc(betweenness)) %>% slice(1:10)
top_degree
top_closeness
top_betweenness

In our network analysis, we examined three key centrality measures to uncover the roles and influence of individual genes within the PPI network:

  • Degree Centrality
  • Closeness Centrality
  • Betweenness Centrality

Each metric offers a distinct perspective on how genes contribute to the network’s structure and function.

Degree centrality reflects the number of direct connections a gene holds in the network. Genes with a high degree act as hubs, serving as major points of interaction within the network. For instance, ALB (Albumin) was identified as the most connected gene, with 602 interactions. Similarly, genes like PTPRC, GAPDH, CD8A, and VEGFA also showed high connectivity, indicating their involvement in crucial pathways such as immune signaling, metabolism, and blood vessel formation.

Closeness centrality, on the other hand, measures how close a gene is to all others in the network, reflecting how quickly it can affect or be affected by the entire system. Some genes, such as C6ORF52, SLFN12L, TMSB15B and C18ORF63, had perfect closeness scores of 1.0, but these were confined to small disconnected subgraphs, where all nodes are directly adjacent. While mathematically high, these values do not reflect true global centrality. In contrast, key genes within the main connected component, such as ALB, GAPDH, FN1, VEGFA, and PTPRC, also ranked highly in closeness, highlighting their importance in rapidly influencing a broad range of other genes within the core of the network.

Betweenness centrality identifies genes that serve as bridges within the network, controlling the flow of information along the shortest paths between other genes. Genes with high betweenness scores are often critical for maintaining communication within the network. In our analysis, ALB again ranked highest, alongside GAPDH, FN1, EGF, and CDH1. Notably, genes like EGF, CDH1, and CFTR showed moderate degree but high betweenness, suggesting their importance as bottlenecks or mediators connecting otherwise distant network modules.

Overall, the recurring prominence of genes such as ALB, GAPDH, FN1, PTPRC, and VEGFA across all three centrality measures underscores their central regulatory roles. These genes are likely key drivers of immune response, metabolism, cell adhesion, and vascular remodeling, reflecting their functional significance in the biological system under study.

8.4. Identifying the Largest Connected Component (LCC)

# Decompose the graph into components
components_info <- components(g)

# Extract LCC
lcc <- induced_subgraph(g, which(components_info$membership == which.max(components_info$csize)))

# Summary
cat("Number of connected components:", components_info$no, "\n")
## Number of connected components: 3
cat("Size of largest component:", gorder(lcc), "nodes and", gsize(lcc), "edges\n")
## Size of largest component: 2367 nodes and 49818 edges

To better understand the structural organization of the DEG-based PPI network, we decomposed the graph into its connected components. A connected component is a subgraph in which any two nodes are connected to each other by a path.

Our analysis revealed 3 total connected components, with the largest connected component (LCC) comprising 2367 nodes and 49,818 edges. This dominant cluster contains the majority of DEGs and their interactions, reflecting a highly interconnected core of the biological network. Such a large LCC suggests strong functional integration and extensive signaling cross-talk among the deregulated genes, reinforcing the biological coherence of the differential expression results.

8.5. Visualizing the LCC

ggraph(lcc, layout = "fr") +
  geom_edge_link(alpha = 0.3, color = "gray60") +
  geom_node_point(size = 4, color = "#ff7f0e") +
  geom_node_text(aes(label = name), size = 3, repel = TRUE) +
  theme_void() +
  labs(title = "Largest Connected Component of the DEG PPI Network") +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))

Visualization of the LCC using a force-directed layout reveals a visually complex yet structurally sparse network. Notably, the LCC comprises virtually the entire network (over 99% of all nodes), indicating that the majority of DEGs are part of a single, coherent interaction space. This extensive connectivity, though sparse, highlights widespread co-regulation and functional interaction among the genes, especially around central hubs and key signaling modules. The force-directed layout helps expose these topological features by clustering highly connected genes and spreading apart loosely connected ones, allowing clearer insight into the network’s functional architecture.

9. Conclusion

This study presents a comprehensive transcriptomic analysis of KIRP, identifying 2,521 DEGs that reflect widespread molecular disruptions in tumor biology. Enrichment analysis revealed activation of oncogenic pathways such as KRAS and suppression of tumor suppressor signals like p53 and ATM, indicating a transcriptional landscape conducive to tumor progression. Motif analysis highlighted the involvement of developmental TFs and a concurrent reduction in immune and differentiation-associated regulators. PPI analysis uncovered a sparse, scale-free network architecture with prominent hub genes, underscoring key nodes in tumor-associated signaling. Together, these findings offer a detailed molecular profile of KIRP and suggest potential biomarkers and regulatory targets for future diagnostic or therapeutic development.