Single Sample Gene Set Enrichment Analysis (ssGSEA)¶

Single-sample Gene Set Enrichment Analysis (ssGSEA) is an variation of the GSEA algorithm that instead of calculating enrichment scores for groups of samples (i.e Control vs Disease) and sets of genes (i.e pathways), it provides a score for each each sample and gene set pair (https://www.genepattern.org/modules/docs/ssGSEAProjection/4).

In CKG this function is implemented in the function: run_ssgsea (analytics_core.analytics).

To visualize the results, they can be coupled with a Principal Component Analysis (PCA).

In this case, we are applied ssGSEA on two existing projects in CKG: PP4-interactomics experiments in two ovarian cancer cell lines (OVCAR-5 and COV318) expressing N-terminal FLAG- and C-terminal V5-tagged CT45, respectively from Coscia et al 2018 (https://pubmed.ncbi.nlm.nih.gov/30241606/).

Other References: - https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideTEXT.htm

https://pubmed.ncbi.nlm.nih.gov/16199517/

[1]:

from ckg.report_manager import project
from ckg.analytics_core.analytics import analytics
from ckg.analytics_core.viz import viz

from plotly.offline import init_notebook_mode, iplot
%matplotlib inline
init_notebook_mode(connected=True)

c:\users\sande\.conda\envs\pip_rev\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning:

The package pingouin is out of date. Your version is 0.3.11, the latest is 0.3.12.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.

WGCNA functions will not work. Module Rpy2 not installed.
R functions will not work. Module Rpy2 not installed.

Reading the Project Data¶

These projects are stored in CKG’s graph database and the project reports need to be generated.

[2]:

project1 = "P0000004"
project2 = "P0000005"

[3]:

p1 = project.Project(project1, datasets={}, knowledge=None, report={}, configuration_files=None)
p1.build_project(force=False)
p1.generate_report()

[4]:

p2 = project.Project(project2, datasets={}, knowledge=None, report={}, configuration_files=None)
p2.build_project(force=False)
p2.generate_report()

[5]:

proteomics_dataset1 = p1.get_dataset("interactomics")
proteomics_dataset2 = p2.get_dataset("interactomics")

Gene Ontology Annotation¶

We can use the Gene Ontology annotation generated during the report creation and available as a dataframe to provide the gene/protein sets for the ssGSEA analysis.

One could also obtain this annotation directly from the graph database using one of the existing Cypher queries and the list of proteins (set).

[6]:

annotations1 = proteomics_dataset1.get_dataframe("go annotation")
annotations2 = proteomics_dataset2.get_dataframe("go annotation")

[7]:

annotations1

[7]:

	annotation	group	identifier	source
0	mitochondrial genome maintenance	None	TYMP~P19971	UniProt
1	mitochondrial genome maintenance	None	MGME1~Q9BQP7	UniProt
2	mitochondrial genome maintenance	None	SLC25A4~P12235	UniProt
3	mitochondrial genome maintenance	None	SLC25A33~Q9BSK2	UniProt
4	mitochondrial genome maintenance	None	OPA1~O60313	UniProt
5	mitochondrial genome maintenance	None	MSTO1~Q9BUK6	UniProt
6	single strand break repair	None	ERCC6~Q03468	UniProt
7	single strand break repair	None	SIRT1~Q96EB6	UniProt
8	single strand break repair	None	APTX~Q7Z2E3	UniProt
9	single strand break repair	None	APLF~Q8IW19	UniProt
10	single strand break repair	None	XRCC1~P18887	UniProt
11	regulation of DNA recombination	None	THOC1~Q96FV9	UniProt
12	regulation of DNA recombination	background	KPNA2~P52292	UniProt
13	regulation of DNA recombination	background	ALYREF~Q86V81	UniProt
14	regulation of DNA recombination	None	KPNA1~P52294	UniProt
15	regulation of DNA recombination	None	SMARCAD1~Q9H4L7	UniProt
16	regulation of mitotic recombination	None	MRE11~P49959	UniProt
17	regulation of mitotic recombination	None	RAD50~Q92878	UniProt
18	mitotic spindle elongation	None	PRC1~O43663	UniProt
19	mitotic spindle elongation	None	KIF23~Q02241	UniProt
20	maltose metabolic process	None	MGAM~O43451	UniProt
21	maltose metabolic process	None	GANC~Q8TET4	UniProt
22	maltose metabolic process	None	GAA~P10253	UniProt
23	ribosomal large subunit assembly	None	MDN1~Q9NU22	UniProt
24	ribosomal large subunit assembly	background	RPL11~P62913	UniProt
25	ribosomal large subunit assembly	background	BRIX1~Q8TDN6	UniProt
26	ribosomal large subunit assembly	background	RPL12~P30050	UniProt
27	ribosomal large subunit assembly	background	RPL6~Q02878	UniProt
28	ribosomal large subunit assembly	background	RPL3~P39023	UniProt
29	ribosomal large subunit assembly	None	PPAN~Q9NQ55	UniProt
...	...	...	...	...
70530	regulation of store-operated calcium entry	None	SPINK1~P00995	UniProt
70531	negative regulation of cation channel activity	None	MMP9~P14780	UniProt
70532	positive regulation of cation channel activity	None	ANK3~Q12955	UniProt
70533	positive regulation of cation channel activity	None	CTSS~P25774	UniProt
70534	regulation of semaphorin-plexin signaling pathway	None	NCAM1~P13591	UniProt
70535	negative regulation of cysteine-type endopepti...	None	PIH1D1~Q9NWS0	UniProt
70536	negative regulation of cysteine-type endopepti...	None	PLAUR~Q03405	UniProt
70537	negative regulation of cysteine-type endopepti...	background	PARK7~Q99497	UniProt
70538	negative regulation of cysteine-type endopepti...	None	CAAP1~Q9H8G2	UniProt
70539	negative regulation of cysteine-type endopepti...	None	MMP9~P14780	UniProt
70540	negative regulation of cysteine-type endopepti...	None	NLE1~Q9NVX2	UniProt
70541	positive regulation of cysteine-type endopepti...	None	TFAP4~Q01664	UniProt
70542	positive regulation of cysteine-type endopepti...	None	HTRA2~O43464	UniProt
70543	positive regulation of cysteine-type endopepti...	None	GSN~P06396	UniProt
70544	positive regulation of cysteine-type endopepti...	None	LGALS9~O00182	UniProt
70545	positive regulation of cysteine-type endopepti...	None	FAS~P25445	UniProt
70546	negative regulation of cysteine-type endopepti...	None	PAK2~Q13177	UniProt
70547	positive regulation of cysteine-type endopepti...	background	RPS3~P23396	UniProt
70548	positive regulation of unsaturated fatty acid ...	None	ABCD1~P33897	UniProt
70549	regulation of caveolin-mediated endocytosis	None	SRC~P12931	UniProt
70550	negative regulation of caveolin-mediated endoc...	None	PROM2~Q8N271	UniProt
70551	positive regulation of caveolin-mediated endoc...	None	ITSN1~Q15811	UniProt
70552	positive regulation of caveolin-mediated endoc...	None	NEDD4L~Q96PU5	UniProt
70553	malonyl-CoA catabolic process	None	MLYCD~O95822	UniProt
70554	malonyl-CoA biosynthetic process	None	ACACA~Q13085	UniProt
70555	malonyl-CoA biosynthetic process	None	ACACB~O00763	UniProt
70556	lipoxin biosynthetic process	None	ALOX5~P09917	UniProt
70557	lipoxin biosynthetic process	None	ALOX5AP~P20292	UniProt
70558	lipoxin A4 biosynthetic process	None	ALOX15~P16050	UniProt
70559	lysobisphosphatidic acid metabolic process	None	ACP6~Q9NPH0	UniProt

70560 rows × 4 columns

[8]:

processed_dataset1 = proteomics_dataset1.get_dataframe('processed')
processed_dataset2 = proteomics_dataset2.get_dataframe('processed')

[9]:

processed_dataset1.head()

[9]:

	ABCD3~P28288	ABLIM1~O14639	ACAP2~Q15057	ACIN1~Q9UKV3	ACTA1~P68133	ACTG1~P63261	ACTN4~O43707	ACTR1A~P61163	ACTR3~P61158	ADAR~P55265	...	ZCCHC17~Q9NP64	ZFR~Q96KR1	ZNF326~Q5BKZ1	ZNF629~Q9UEG4	ZNF638~Q14966	ZNRD2~O60232	ZRANB2~O95218	group	sample	subject
0	27.472355	24.510301	31.742221	29.535102	30.760022	35.118418	26.787807	26.299604	27.753916	24.588465	...	28.397584	25.921480	27.055173	24.576694	23.742444	29.115204	29.057783	FLAG	AS943	S138
1	27.460296	26.011139	31.011320	31.247341	30.421617	34.972458	29.027229	26.210650	25.878253	25.974869	...	28.232973	26.284133	26.887625	23.669367	25.411049	28.268343	28.489300	FLAG	AS944	S139
2	27.505821	23.608262	31.572265	29.430139	31.061174	35.345990	25.906052	25.960580	27.496556	24.543980	...	28.640687	25.899775	27.219189	24.685440	24.414733	29.080731	29.234413	FLAG	AS945	S140
3	27.012120	25.254434	31.353738	29.941257	30.312787	34.827996	25.787982	25.991854	26.494452	23.478376	...	28.908190	25.790942	26.534750	24.141646	25.989279	28.972269	30.331381	FLAG+CT45	AS946	S141
4	26.591493	25.377726	31.171406	31.171585	30.140900	35.013676	27.243544	25.666928	26.412464	23.679287	...	28.588887	25.749550	27.004746	24.608964	26.024305	28.598112	29.874860	FLAG+CT45	AS947	S142

5 rows × 923 columns

ssGSEA¶

The function needs only the processed dataset (interactomics data matrix) and the annotation matrix. The result provides two dataframes:

es – Enrichment scores – degree of overrepresentation
nes – Normalized Enrichment scores – corrects by set size and correlations between gene sets and the dataset

[10]:

ssgsea_result1 = analytics.run_ssgsea(processed_dataset1, annotations1, annotation_col='annotation', identifier_col='identifier', set_index=['group', 'sample','subject'], outdir=None, min_size=10, scale=False, permutations=0)

[11]:

ssgsea_result1['nes']

[11]:

	DNA duplex unwinding	DNA repair	DNA replication	Fc-epsilon receptor signaling pathway	Fc-gamma receptor signaling pathway involved in phagocytosis	G2/M transition of mitotic cell cycle	MAPK cascade	NIK/NF-kappaB signaling	RNA export from nucleus	RNA metabolic process	...	translational initiation	transmembrane transport	tumor necrosis factor-mediated signaling pathway	ubiquitin-dependent protein catabolic process	vesicle-mediated transport	viral process	viral transcription	group	sample	subject
FLAG_AS943_S138	0.011977	0.176476	0.052199	0.299403	0.211067	0.152337	0.336622	0.251237	0.117620	0.200810	...	0.429929	0.254752	0.304584	0.069326	0.002255	0.120948	0.464412	FLAG	AS943	S138
FLAG_AS944_S139	0.063279	0.147166	0.069440	0.248400	0.064426	0.096150	0.328776	0.239456	0.221997	0.254738	...	0.432992	0.261562	0.314577	0.021409	0.002524	0.098014	0.468801	FLAG	AS944	S139
FLAG_AS945_S140	0.044976	0.164713	0.038787	0.294322	0.225206	0.159580	0.339241	0.246767	0.111281	0.173833	...	0.432074	0.266003	0.300269	0.058290	0.017175	0.099421	0.468696	FLAG	AS945	S140
FLAG+CT45_AS946_S141	0.023172	0.153221	0.062612	0.289909	0.137606	0.088625	0.355786	0.289914	0.111026	0.147257	...	0.496105	0.260089	0.307746	0.076835	-0.011761	0.069829	0.538429	FLAG+CT45	AS946	S141
FLAG+CT45_AS947_S142	0.062469	0.151500	0.055752	0.247872	0.110875	0.067426	0.325069	0.252344	0.209442	0.207055	...	0.484496	0.233157	0.285062	0.057515	-0.024422	0.071443	0.524618	FLAG+CT45	AS947	S142
FLAG+CT45_AS948_S143	0.080360	0.165221	0.061832	0.278415	0.106264	0.091682	0.341581	0.266505	0.128023	0.169994	...	0.496060	0.245163	0.298267	0.059050	-0.000553	0.079137	0.535138	FLAG+CT45	AS948	S143

6 rows × 153 columns

[12]:

pca_result, args = analytics.run_pca(ssgsea_result1['nes'], drop_cols=['sample', 'subject'], group='group')

[13]:

args.update({"loadings":15, "title":'PCA plot groups', 'height':600, 'width':700, 'factor':1})
plot = viz.get_pca_plot(pca_result, identifier='pca', args=args)
iplot(plot.figure)

[14]:

ssgsea_result2 = analytics.run_ssgsea(processed_dataset2, annotations2, annotation_col='annotation', identifier_col='identifier', set_index=['group', 'sample','subject'], outdir=None, min_size=10, scale=False, permutations=0)

[15]:

ssgsea_result2['nes']

[15]:

	ATP-dependent chromatin remodeling	COPII vesicle coating	DNA duplex unwinding	DNA repair	Fc-epsilon receptor signaling pathway	Fc-gamma receptor signaling pathway involved in phagocytosis	G2/M transition of mitotic cell cycle	MAPK cascade	NIK/NF-kappaB signaling	RNA export from nucleus	...	translation	translational initiation	transmembrane transport	tumor necrosis factor-mediated signaling pathway	ubiquitin-dependent protein catabolic process	viral process	viral transcription	group	sample	subject
V5_AS921_S116	0.044262	-0.010696	0.204359	0.170231	-0.166454	0.278150	0.180737	-0.070821	-0.434774	0.058232	...	0.387313	0.381839	-0.212196	-0.191530	-0.261737	0.112756	0.417030	V5	AS921	S116
V5_AS922_S117	0.053901	-0.048848	0.160694	0.154869	-0.225246	0.328772	0.165380	-0.101982	-0.542500	0.041150	...	0.380391	0.377981	-0.269878	-0.224860	-0.292436	0.100454	0.410628	V5	AS922	S117
V5_AS923_S118	-0.052357	-0.011790	0.169802	0.169539	-0.167112	0.219772	0.192022	-0.040777	-0.411418	0.043397	...	0.395837	0.390451	-0.190605	-0.173792	-0.278322	0.108900	0.428926	V5	AS923	S118
V5+CT45_AS924_S119	0.062566	-0.088893	0.197607	0.224874	-0.112584	0.238658	0.229253	-0.034617	-0.182225	-0.050214	...	0.387171	0.373498	-0.169060	-0.074586	-0.128237	0.103801	0.417750	V5+CT45	AS924	S119
V5+CT45_AS925_S120	0.073479	-0.093424	0.194265	0.216908	-0.069579	0.207769	0.211539	-0.013577	-0.146667	-0.012140	...	0.382935	0.375096	-0.141614	-0.058460	-0.130907	0.093867	0.413617	V5+CT45	AS925	S120
V5+CT45_AS926_S121	0.079657	-0.091373	0.180172	0.183295	-0.134906	0.270098	0.203470	-0.042416	-0.206953	-0.033957	...	0.379303	0.368977	-0.177034	-0.106843	-0.090751	0.082380	0.413381	V5+CT45	AS926	S121

6 rows × 167 columns

[16]:

pca_result, args = analytics.run_pca(ssgsea_result2['nes'], drop_cols=['sample', 'subject'], group='group')

[17]:

args.update({"loadings":15, "title":'PCA plot groups', 'height':600, 'width':700, 'factor':1})
plot = viz.get_pca_plot(pca_result, identifier='pca', args=args)
iplot(plot.figure)

[ ]: