Single Sample Gene Set Enrichment Analysis (ssGSEA)

Single-sample Gene Set Enrichment Analysis (ssGSEA) is an variation of the GSEA algorithm that instead of calculating enrichment scores for groups of samples (i.e Control vs Disease) and sets of genes (i.e pathways), it provides a score for each each sample and gene set pair (https://www.genepattern.org/modules/docs/ssGSEAProjection/4).

In CKG this function is implemented in the function: run_ssgsea (analytics_core.analytics).

To visualize the results, they can be coupled with a Principal Component Analysis (PCA).

In this case, we are applied ssGSEA on two existing projects in CKG: PP4-interactomics experiments in two ovarian cancer cell lines (OVCAR-5 and COV318) expressing N-terminal FLAG- and C-terminal V5-tagged CT45, respectively from Coscia et al 2018 (https://pubmed.ncbi.nlm.nih.gov/30241606/).

Other References: - https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideTEXT.htm

[1]:
from ckg.report_manager import project
from ckg.analytics_core.analytics import analytics
from ckg.analytics_core.viz import viz

from plotly.offline import init_notebook_mode, iplot
%matplotlib inline
init_notebook_mode(connected=True)
c:\users\sande\.conda\envs\pip_rev\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning:

The package pingouin is out of date. Your version is 0.3.11, the latest is 0.3.12.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.

WGCNA functions will not work. Module Rpy2 not installed.
R functions will not work. Module Rpy2 not installed.

Reading the Project Data

These projects are stored in CKG’s graph database and the project reports need to be generated.

[2]:
project1 = "P0000004"
project2 = "P0000005"
[3]:
p1 = project.Project(project1, datasets={}, knowledge=None, report={}, configuration_files=None)
p1.build_project(force=False)
p1.generate_report()
[4]:
p2 = project.Project(project2, datasets={}, knowledge=None, report={}, configuration_files=None)
p2.build_project(force=False)
p2.generate_report()
[5]:
proteomics_dataset1 = p1.get_dataset("interactomics")
proteomics_dataset2 = p2.get_dataset("interactomics")

Gene Ontology Annotation

We can use the Gene Ontology annotation generated during the report creation and available as a dataframe to provide the gene/protein sets for the ssGSEA analysis.

One could also obtain this annotation directly from the graph database using one of the existing Cypher queries and the list of proteins (set).

[6]:
annotations1 = proteomics_dataset1.get_dataframe("go annotation")
annotations2 = proteomics_dataset2.get_dataframe("go annotation")
[7]:
annotations1
[7]:
annotation group identifier source
0 mitochondrial genome maintenance None TYMP~P19971 UniProt
1 mitochondrial genome maintenance None MGME1~Q9BQP7 UniProt
2 mitochondrial genome maintenance None SLC25A4~P12235 UniProt
3 mitochondrial genome maintenance None SLC25A33~Q9BSK2 UniProt
4 mitochondrial genome maintenance None OPA1~O60313 UniProt
5 mitochondrial genome maintenance None MSTO1~Q9BUK6 UniProt
6 single strand break repair None ERCC6~Q03468 UniProt
7 single strand break repair None SIRT1~Q96EB6 UniProt
8 single strand break repair None APTX~Q7Z2E3 UniProt
9 single strand break repair None APLF~Q8IW19 UniProt
10 single strand break repair None XRCC1~P18887 UniProt
11 regulation of DNA recombination None THOC1~Q96FV9 UniProt
12 regulation of DNA recombination background KPNA2~P52292 UniProt
13 regulation of DNA recombination background ALYREF~Q86V81 UniProt
14 regulation of DNA recombination None KPNA1~P52294 UniProt
15 regulation of DNA recombination None SMARCAD1~Q9H4L7 UniProt
16 regulation of mitotic recombination None MRE11~P49959 UniProt
17 regulation of mitotic recombination None RAD50~Q92878 UniProt
18 mitotic spindle elongation None PRC1~O43663 UniProt
19 mitotic spindle elongation None KIF23~Q02241 UniProt
20 maltose metabolic process None MGAM~O43451 UniProt
21 maltose metabolic process None GANC~Q8TET4 UniProt
22 maltose metabolic process None GAA~P10253 UniProt
23 ribosomal large subunit assembly None MDN1~Q9NU22 UniProt
24 ribosomal large subunit assembly background RPL11~P62913 UniProt
25 ribosomal large subunit assembly background BRIX1~Q8TDN6 UniProt
26 ribosomal large subunit assembly background RPL12~P30050 UniProt
27 ribosomal large subunit assembly background RPL6~Q02878 UniProt
28 ribosomal large subunit assembly background RPL3~P39023 UniProt
29 ribosomal large subunit assembly None PPAN~Q9NQ55 UniProt
... ... ... ... ...
70530 regulation of store-operated calcium entry None SPINK1~P00995 UniProt
70531 negative regulation of cation channel activity None MMP9~P14780 UniProt
70532 positive regulation of cation channel activity None ANK3~Q12955 UniProt
70533 positive regulation of cation channel activity None CTSS~P25774 UniProt
70534 regulation of semaphorin-plexin signaling pathway None NCAM1~P13591 UniProt
70535 negative regulation of cysteine-type endopepti... None PIH1D1~Q9NWS0 UniProt
70536 negative regulation of cysteine-type endopepti... None PLAUR~Q03405 UniProt
70537 negative regulation of cysteine-type endopepti... background PARK7~Q99497 UniProt
70538 negative regulation of cysteine-type endopepti... None CAAP1~Q9H8G2 UniProt
70539 negative regulation of cysteine-type endopepti... None MMP9~P14780 UniProt
70540 negative regulation of cysteine-type endopepti... None NLE1~Q9NVX2 UniProt
70541 positive regulation of cysteine-type endopepti... None TFAP4~Q01664 UniProt
70542 positive regulation of cysteine-type endopepti... None HTRA2~O43464 UniProt
70543 positive regulation of cysteine-type endopepti... None GSN~P06396 UniProt
70544 positive regulation of cysteine-type endopepti... None LGALS9~O00182 UniProt
70545 positive regulation of cysteine-type endopepti... None FAS~P25445 UniProt
70546 negative regulation of cysteine-type endopepti... None PAK2~Q13177 UniProt
70547 positive regulation of cysteine-type endopepti... background RPS3~P23396 UniProt
70548 positive regulation of unsaturated fatty acid ... None ABCD1~P33897 UniProt
70549 regulation of caveolin-mediated endocytosis None SRC~P12931 UniProt
70550 negative regulation of caveolin-mediated endoc... None PROM2~Q8N271 UniProt
70551 positive regulation of caveolin-mediated endoc... None ITSN1~Q15811 UniProt
70552 positive regulation of caveolin-mediated endoc... None NEDD4L~Q96PU5 UniProt
70553 malonyl-CoA catabolic process None MLYCD~O95822 UniProt
70554 malonyl-CoA biosynthetic process None ACACA~Q13085 UniProt
70555 malonyl-CoA biosynthetic process None ACACB~O00763 UniProt
70556 lipoxin biosynthetic process None ALOX5~P09917 UniProt
70557 lipoxin biosynthetic process None ALOX5AP~P20292 UniProt
70558 lipoxin A4 biosynthetic process None ALOX15~P16050 UniProt
70559 lysobisphosphatidic acid metabolic process None ACP6~Q9NPH0 UniProt

70560 rows × 4 columns

[8]:
processed_dataset1 = proteomics_dataset1.get_dataframe('processed')
processed_dataset2 = proteomics_dataset2.get_dataframe('processed')
[9]:
processed_dataset1.head()
[9]:
ABCD3~P28288 ABLIM1~O14639 ACAP2~Q15057 ACIN1~Q9UKV3 ACTA1~P68133 ACTG1~P63261 ACTN4~O43707 ACTR1A~P61163 ACTR3~P61158 ADAR~P55265 ... ZCCHC17~Q9NP64 ZFR~Q96KR1 ZNF326~Q5BKZ1 ZNF629~Q9UEG4 ZNF638~Q14966 ZNRD2~O60232 ZRANB2~O95218 group sample subject
0 27.472355 24.510301 31.742221 29.535102 30.760022 35.118418 26.787807 26.299604 27.753916 24.588465 ... 28.397584 25.921480 27.055173 24.576694 23.742444 29.115204 29.057783 FLAG AS943 S138
1 27.460296 26.011139 31.011320 31.247341 30.421617 34.972458 29.027229 26.210650 25.878253 25.974869 ... 28.232973 26.284133 26.887625 23.669367 25.411049 28.268343 28.489300 FLAG AS944 S139
2 27.505821 23.608262 31.572265 29.430139 31.061174 35.345990 25.906052 25.960580 27.496556 24.543980 ... 28.640687 25.899775 27.219189 24.685440 24.414733 29.080731 29.234413 FLAG AS945 S140
3 27.012120 25.254434 31.353738 29.941257 30.312787 34.827996 25.787982 25.991854 26.494452 23.478376 ... 28.908190 25.790942 26.534750 24.141646 25.989279 28.972269 30.331381 FLAG+CT45 AS946 S141
4 26.591493 25.377726 31.171406 31.171585 30.140900 35.013676 27.243544 25.666928 26.412464 23.679287 ... 28.588887 25.749550 27.004746 24.608964 26.024305 28.598112 29.874860 FLAG+CT45 AS947 S142

5 rows × 923 columns

ssGSEA

The function needs only the processed dataset (interactomics data matrix) and the annotation matrix. The result provides two dataframes:

  • es – Enrichment scores – degree of overrepresentation

  • nes – Normalized Enrichment scores – corrects by set size and correlations between gene sets and the dataset

[10]:
ssgsea_result1 = analytics.run_ssgsea(processed_dataset1, annotations1, annotation_col='annotation', identifier_col='identifier', set_index=['group', 'sample','subject'], outdir=None, min_size=10, scale=False, permutations=0)
[11]:
ssgsea_result1['nes']
[11]:
DNA duplex unwinding DNA repair DNA replication Fc-epsilon receptor signaling pathway Fc-gamma receptor signaling pathway involved in phagocytosis G2/M transition of mitotic cell cycle MAPK cascade NIK/NF-kappaB signaling RNA export from nucleus RNA metabolic process ... translational initiation transmembrane transport tumor necrosis factor-mediated signaling pathway ubiquitin-dependent protein catabolic process vesicle-mediated transport viral process viral transcription group sample subject
FLAG_AS943_S138 0.011977 0.176476 0.052199 0.299403 0.211067 0.152337 0.336622 0.251237 0.117620 0.200810 ... 0.429929 0.254752 0.304584 0.069326 0.002255 0.120948 0.464412 FLAG AS943 S138
FLAG_AS944_S139 0.063279 0.147166 0.069440 0.248400 0.064426 0.096150 0.328776 0.239456 0.221997 0.254738 ... 0.432992 0.261562 0.314577 0.021409 0.002524 0.098014 0.468801 FLAG AS944 S139
FLAG_AS945_S140 0.044976 0.164713 0.038787 0.294322 0.225206 0.159580 0.339241 0.246767 0.111281 0.173833 ... 0.432074 0.266003 0.300269 0.058290 0.017175 0.099421 0.468696 FLAG AS945 S140
FLAG+CT45_AS946_S141 0.023172 0.153221 0.062612 0.289909 0.137606 0.088625 0.355786 0.289914 0.111026 0.147257 ... 0.496105 0.260089 0.307746 0.076835 -0.011761 0.069829 0.538429 FLAG+CT45 AS946 S141
FLAG+CT45_AS947_S142 0.062469 0.151500 0.055752 0.247872 0.110875 0.067426 0.325069 0.252344 0.209442 0.207055 ... 0.484496 0.233157 0.285062 0.057515 -0.024422 0.071443 0.524618 FLAG+CT45 AS947 S142
FLAG+CT45_AS948_S143 0.080360 0.165221 0.061832 0.278415 0.106264 0.091682 0.341581 0.266505 0.128023 0.169994 ... 0.496060 0.245163 0.298267 0.059050 -0.000553 0.079137 0.535138 FLAG+CT45 AS948 S143

6 rows × 153 columns

[12]:
pca_result, args = analytics.run_pca(ssgsea_result1['nes'], drop_cols=['sample', 'subject'], group='group')
[13]:
args.update({"loadings":15, "title":'PCA plot groups', 'height':600, 'width':700, 'factor':1})
plot = viz.get_pca_plot(pca_result, identifier='pca', args=args)
iplot(plot.figure)
[14]:
ssgsea_result2 = analytics.run_ssgsea(processed_dataset2, annotations2, annotation_col='annotation', identifier_col='identifier', set_index=['group', 'sample','subject'], outdir=None, min_size=10, scale=False, permutations=0)
[15]:
ssgsea_result2['nes']
[15]:
ATP-dependent chromatin remodeling COPII vesicle coating DNA duplex unwinding DNA repair Fc-epsilon receptor signaling pathway Fc-gamma receptor signaling pathway involved in phagocytosis G2/M transition of mitotic cell cycle MAPK cascade NIK/NF-kappaB signaling RNA export from nucleus ... translation translational initiation transmembrane transport tumor necrosis factor-mediated signaling pathway ubiquitin-dependent protein catabolic process viral process viral transcription group sample subject
V5_AS921_S116 0.044262 -0.010696 0.204359 0.170231 -0.166454 0.278150 0.180737 -0.070821 -0.434774 0.058232 ... 0.387313 0.381839 -0.212196 -0.191530 -0.261737 0.112756 0.417030 V5 AS921 S116
V5_AS922_S117 0.053901 -0.048848 0.160694 0.154869 -0.225246 0.328772 0.165380 -0.101982 -0.542500 0.041150 ... 0.380391 0.377981 -0.269878 -0.224860 -0.292436 0.100454 0.410628 V5 AS922 S117
V5_AS923_S118 -0.052357 -0.011790 0.169802 0.169539 -0.167112 0.219772 0.192022 -0.040777 -0.411418 0.043397 ... 0.395837 0.390451 -0.190605 -0.173792 -0.278322 0.108900 0.428926 V5 AS923 S118
V5+CT45_AS924_S119 0.062566 -0.088893 0.197607 0.224874 -0.112584 0.238658 0.229253 -0.034617 -0.182225 -0.050214 ... 0.387171 0.373498 -0.169060 -0.074586 -0.128237 0.103801 0.417750 V5+CT45 AS924 S119
V5+CT45_AS925_S120 0.073479 -0.093424 0.194265 0.216908 -0.069579 0.207769 0.211539 -0.013577 -0.146667 -0.012140 ... 0.382935 0.375096 -0.141614 -0.058460 -0.130907 0.093867 0.413617 V5+CT45 AS925 S120
V5+CT45_AS926_S121 0.079657 -0.091373 0.180172 0.183295 -0.134906 0.270098 0.203470 -0.042416 -0.206953 -0.033957 ... 0.379303 0.368977 -0.177034 -0.106843 -0.090751 0.082380 0.413381 V5+CT45 AS926 S121

6 rows × 167 columns

[16]:
pca_result, args = analytics.run_pca(ssgsea_result2['nes'], drop_cols=['sample', 'subject'], group='group')
[17]:
args.update({"loadings":15, "title":'PCA plot groups', 'height':600, 'width':700, 'factor':1})
plot = viz.get_pca_plot(pca_result, identifier='pca', args=args)
iplot(plot.figure)
[ ]: