CPTAC Glioblastoma (GBM) Discovery Study

Glioblastoma is the most common type of brain cancer in adults, with approximately 14,000 new diagnoses each year (NCI Cancer Currents, 2017). Tumors from patients with GBM were molecularly profiled by The Cancer Genome Atlas (TCGA), and these studies identified somatic mutations associated with essential signaling pathways (Nature 2008, Cell 2013). To elucidate the proteome, phosphoproteome, and acetylome profiles of GBM tumors, tissue from 99 patients was subjected to mass spectrometry analysis using 11-plex isobaric tandem mass tags (TMT-11). Normal brain samples from 10 participants of the Genotype-Tissue Expression (GTEx) program were also analyzed (CPTAC GBM discovery study).

Exploring CPTAC GBM proteomics data with CKG

In this notebook, we analyze the GBM proteomics data from CPTAC to identify significant proteomics differences between tumor and normal brain tissue. The objective is to then annotate the potentially relevant proteins with knowledge from CKG to interpret their involvement in the progression of the disease.

First, we format the data, mainly with the pandas library, so that it can be used within CKG’s platform. We then run a sequence of analyses:

Data preprocessing
Differential regulation
Knowledge annotation
Drug annotation

The data was downloaded from: https://cptac-data-portal.georgetown.edu/study-summary/S048

“Data used in this analysis were generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH).”

[1]:

import os
import pandas as pd

CPTAC Files Used

[2]:

data_dir = '/Users/sande/work/Multi_Omics_Analytics_Department/GBM_FilipMundt/CPTAC_Phase_III_Data/CPTAC_GBM_S048'
cptac_proteome_report_file = os.path.join(data_dir,'CPTAC_GBM_Proteome_CDAP_Protein_Report.r1/CPTAC3_Glioblastoma_Multiforme_Proteome.tmt11.tsv')
cptac_clinical_file = os.path.join(data_dir,'CPTAC_GBM_metadata/S048_CPTAC_GBM_Discovery_Cohort_Clinical_Data_Dec2019_r1.xlsx')
cptac_sample_mapping_file = os.path.join(data_dir,'CPTAC_GBM_metadata/S048_CPTAC_GBM_Discovery_Cohort_TMT11_CaseID_SampleID_AliquotID_Map_Dec2019_r1.xlsx')

Clinical Data

[3]:

cptac_clinical_data = pd.read_excel(cptac_clinical_file, sheet_name='Clinical_Attributes')

[4]:

cptac_clinical_data.head()

[4]:

	tumor_code	case_id	type_of_analyzed_samples	gender	age	height_at_time_of_surgery_cm	weight_at_time_of_surgery_kg	BMI	race	ethnicity	...	measure_of_success_of_outcome_at_completion_of_this_follow_up_form	tumor_status_at_date_of_last_contact_or_death	vital_status_at_date_of_last_contact	cause_of_death	days_from_date_of_initial_pathologic_diagnosis_to_date_of_death	performance_status_score_eastern_cooperative_oncology_group	performance_status_score_karnofsky_score_preoperative	performance_status_scale_timing	days_from_date_of_initial_pathologic_diagnosis_to_date_of_last_contact	is_this_patient_lost_to_follow_up
0	GBM	C3L-00104	Tumor	Male	58	188.0	115.0	32.54	White	Not-Hispanic or Latino	...	Patient Deceased	With Tumor	Deceased	NaN	129.0	Not Evaluated: Not provided or available	Not Evaluated: Not provided or available	Not Evaluated: Not provided or available	128.0	No
1	GBM	C3L-00365	Tumor	Female	59	162.0	54.0	20.61	White	Not-Hispanic or Latino	...	Patient Deceased	NaN	Deceased	Unknown; patient entered hospice care. At last...	322.0	Not Evaluated: Not provided or available	Not Evaluated: Not provided or available	Not Evaluated: Not provided or available	280.0	No
2	GBM	C3L-00674	Tumor	Male	45	193.0	102.0	27.44	White	Not-Hispanic or Latino	...	Patient Deceased	NaN	Deceased	Progression of glioblastoma	478.0	Not Evaluated: Not provided or available	90: Able to carry on normal activity; minor si...	Post-Adjuvant Therapy	385.0	No
3	GBM	C3L-00677	Tumor	Female	69	164.0	52.0	19.32	White	Not-Hispanic or Latino	...	Patient Deceased	NaN	Deceased	Progression of glioblastoma + Multiple organ f...	154.0	1: Symptomatic; Restricted in physically stren...	70: Cares for self; unable to carry on normal ...	Post-Adjuvant Therapy	154.0	No
4	GBM	C3L-01040	Tumor	Male	77	170.0	70.0	24.22	NaN	NaN	...	Persistent Disease	With Tumor	Living	NaN	NaN	1: Symptomatic; Restricted in physically stren...	70: Cares for self; unable to carry on normal ...	Post-Adjuvant Therapy	608.0	Yes

5 rows × 43 columns

[5]:

cptac_clinical_data.groupby(['type_of_analyzed_samples', 'gender']).count()[['case_id']]

[5]:

		case_id
type_of_analyzed_samples	gender
Normal	Female	5
Normal	Male	5
Tumor	Female	44
Tumor	Male	56

10 normal brain samples (5 female, 5 male) and 100 tumor samples (44 female, 56 male).

[6]:

cptac_clinical_data.shape

[6]:

(110, 43)

[7]:

list_of_samples = cptac_clinical_data['case_id'].tolist()

[8]:

len(list_of_samples)

[8]:

[9]:

cptac_clinical_data[cptac_clinical_data['case_id'].isin(list_of_samples)].groupby('type_of_analyzed_samples').count()

[9]:

	tumor_code	case_id	gender	age	height_at_time_of_surgery_cm	weight_at_time_of_surgery_kg	BMI	race	ethnicity	ethnicity_race_ancestry_identified	...	measure_of_success_of_outcome_at_completion_of_this_follow_up_form	tumor_status_at_date_of_last_contact_or_death	vital_status_at_date_of_last_contact	cause_of_death	days_from_date_of_initial_pathologic_diagnosis_to_date_of_death	performance_status_score_eastern_cooperative_oncology_group	performance_status_score_karnofsky_score_preoperative	performance_status_scale_timing	days_from_date_of_initial_pathologic_diagnosis_to_date_of_last_contact	is_this_patient_lost_to_follow_up
type_of_analyzed_samples
Normal	0	10	10	10	10	10	10	10	4	0	...	0	0	10	9	0	0	0	0	0	0
Tumor	100	100	100	100	100	100	100	29	31	100	...	80	73	94	31	49	94	94	94	94	94

2 rows × 42 columns

Sample Identifier Mapping

We use the file S048_CPTAC_GBM_Discovery_Cohort_TMT11_CaseID_SampleID_AliquotID_Map_Dec2019_r1.xlsx to match the clinical metadata (Case ID) and proteomics data (Aliquot ID).

We keep the sample type (normal, tumor) from the mapping file as the group label, together with gender and age from the clinical metadata as possible covariates in the differential regulation analysis.

[10]:

cptac_sample_mapping = pd.read_excel(cptac_sample_mapping_file, comment='#', header=2)

[11]:

cptac_sample_mapping.head()

[11]:

	Batch	TMT plex	TMT channel	Alias	Case ID (Participant ID)	Parent Sample ID(s)	Aliquot ID	Sample type	OCT	TCIA Slide ID	TCIA Image links	Unnamed: 11	Unnamed: 12	Unnamed: 13	Unnamed: 14
0	1	1	126	B1S1	ref	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	1	1	127N	B1S1	GTEX-Y8DK-0011-R10A-SM-HAKY1	GTEX-Y8DK-0011-R10A-SM-HAKY1	CPT0204410003	normal	No	NaN	NaN	NaN	NaN	NaN	NaN
2	1	1	127C	B1S1	C3N-03183	C3N-03183-02, C3N-03183-03	CPT0206670004	tumor	No	C3N-03183-22, C3N-03183-23	https://pathology.cancerimagingarchive.net/pat...	https://pathology.cancerimagingarchive.net/pat...	NaN	NaN	NaN
3	1	1	128N	B1S1	C3N-01505	C3N-01505-01	CPT0089150003	tumor	No	C3N-01505-21	https://pathology.cancerimagingarchive.net/pat...	NaN	NaN	NaN	NaN
4	1	1	128C	B1S1	C3N-03188	C3N-03188-02	CPT0207030003	tumor	No	C3N-03188-22	https://pathology.cancerimagingarchive.net/pat...	NaN	NaN	NaN	NaN

[12]:

cptac_sample_mapping = cptac_sample_mapping[cptac_sample_mapping['Case ID (Participant ID)'].isin(list_of_samples)]

[13]:

cptac_sample_mapping.shape

[13]:

(110, 15)

[14]:

metadata = cptac_sample_mapping[['Case ID (Participant ID)','Aliquot ID', 'Sample type']].set_index('Case ID (Participant ID)')
metadata = metadata.join(cptac_clinical_data.set_index('case_id')[['gender', 'age']]).reset_index()

[15]:

metadata.head()

[15]:

	Case ID (Participant ID)	Aliquot ID	Sample type	gender	age
0	GTEX-Y8DK-0011-R10A-SM-HAKY1	CPT0204410003	normal	Male	62
1	C3N-03183	CPT0206670004	tumor	Male	53
2	C3N-01505	CPT0089150003	tumor	Male	74
3	C3N-03188	CPT0207030003	tumor	Male	54
4	C3L-02984	CPT0190240004	tumor	Male	34
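As a quick sanity check on the mapping step, the sketch below rebuilds the same join on a toy table and verifies that no aliquot is duplicated and no covariate is missing. The two tumor rows reuse values shown in the outputs above; the `mapping` and `clinical` toy frames and the checks themselves are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the mapping and clinical tables (assumption: two cases only).
mapping = pd.DataFrame({
    "Case ID (Participant ID)": ["ref", "C3N-03183", "C3N-01505"],
    "Aliquot ID": [None, "CPT0206670004", "CPT0089150003"],
    "Sample type": [None, "tumor", "tumor"],
})
clinical = pd.DataFrame({
    "case_id": ["C3N-03183", "C3N-01505"],
    "gender": ["Male", "Male"],
    "age": [53, 74],
})

# Drop rows (such as the 'ref' channel) that have no clinical record.
known_cases = clinical["case_id"].tolist()
mapping = mapping[mapping["Case ID (Participant ID)"].isin(known_cases)]

# Same join as in the notebook: case ID on both sides, then back to columns.
metadata = (
    mapping.set_index("Case ID (Participant ID)")
    .join(clinical.set_index("case_id")[["gender", "age"]])
    .reset_index()
)

# Checks: one row per aliquot, and every row received its covariates.
aliquots_unique = metadata["Aliquot ID"].is_unique
covariates_complete = bool(metadata[["gender", "age"]].notna().all().all())
print(aliquots_unique, covariates_complete)
```

Running the same two checks on the real `metadata` frame would flag case IDs that are spelled differently in the two files before they silently turn into missing covariates.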

[16]:

proteomics_sample_ids = cptac_sample_mapping['Aliquot ID'].tolist()

[17]:

proteomics_sample_ids

[17]:

['CPT0204410003',
 'CPT0206670004',
 'CPT0089150003',
 'CPT0207030003',
 'CPT0190240004',
 'CPT0161730003',
 'CPT0218330004',
 'CPT0104220003',
 'CPT0182580003',
 'CPT0167860004',
 'CPT0205670004',
 'CPT0093450003',
 'CPT0002410011',
 'CPT0189460003',
 'CPT0167750004',
 'CPT0218690004',
 'CPT0217060003',
 'CPT0205890003',
 'CPT0204420003',
 'CPT0189570004',
 'CPT0205780003',
 'CPT0168720003',
 'CPT0093550003',
 'CPT0217190003',
 'CPT0218770003',
 'CPT0204390003',
 'CPT0189250003',
 'CPT0087950003',
 'CPT0190360004',
 'CPT0224330003',
 'CPT0217880003',
 'CPT0127420003',
 'CPT0218960004',
 'CPT0168480003',
 'CPT0204360003',
 'CPT0221180003',
 'CPT0218830004',
 'CPT0225760003',
 'CPT0168270003',
 'CPT0064650003',
 'CPT0206880003',
 'CPT0168380003',
 'CPT0206000004',
 'CPT0167530003',
 'CPT0189850004',
 'CPT0196850003',
 'CPT0206560003',
 'CPT0219080004',
 'CPT0204380003',
 'CPT0224600003',
 'CPT0125570003',
 'CPT0217000004',
 'CPT0207090003',
 'CPT0217430008',
 'CPT0168590003',
 'CPT0186100003',
 'CPT0168080003',
 'CPT0162020003',
 'CPT0201710003',
 'CPT0204330003',
 'CPT0224390004',
 'CPT0209440003',
 'CPT0218890004',
 'CPT0208980003',
 'CPT0123530003',
 'CPT0071100003',
 'CPT0182550003',
 'CPT0217710008',
 'CPT0125510003',
 'CPT0127480003',
 'CPT0167640003',
 'CPT0087680003',
 'CPT0224540004',
 'CPT0167970003',
 'CPT0162100003',
 'CPT0189750004',
 'CPT0204350003',
 'CPT0104330003',
 'CPT0162140003',
 'CPT0078580003',
 'CPT0204400003',
 'CPT0093510003',
 'CPT0087570003',
 'CPT0216920008',
 'CPT0228220003',
 'CPT0175060003',
 'CPT0206330003',
 'CPT0217100003',
 'CPT0199770003',
 'CPT0189650004',
 'CPT0125220003',
 'CPT0206450003',
 'CPT0182500003',
 'CPT0225730003',
 'CPT0093590003',
 'CPT0204340003',
 'CPT0093360003',
 'CPT0206780003',
 'CPT0218670003',
 'CPT0087730003',
 'CPT0064890003',
 'CPT0206230003',
 'CPT0171580008',
 'CPT0168830003',
 'CPT0079790003',
 'CPT0092440003',
 'CPT0205570003',
 'CPT0206110003',
 'CPT0204370003',
 'CPT0205450004']

Proteomics Data

[18]:

cptac_proteomics_data = pd.read_csv(cptac_proteome_report_file, sep='\t')

[19]:

cptac_proteomics_data.head()

[19]:

	Gene	CPT0204410003 Log Ratio	CPT0204410003 Unshared Log Ratio	CPT0206670004 Log Ratio	CPT0206670004 Unshared Log Ratio	CPT0089150003 Log Ratio	CPT0089150003 Unshared Log Ratio	CPT0207030003 Log Ratio	CPT0207030003 Unshared Log Ratio	CPT0190240004 Log Ratio	...	CPT0204370003 Log Ratio	CPT0204370003 Unshared Log Ratio	CPT0205450004 Log Ratio	CPT0205450004 Unshared Log Ratio	NCBIGeneID	Authority	Description	Organism	Chromosome	Locus
0	Mean	0.533903	0.536714	0.634488	0.638574	-0.042532	-0.030967	0.352712	0.358905	0.577073	...	0.412576	0.415405	0.502734	0.499886	NaN	NaN	NaN	NaN	NaN	NaN
1	Median	0.430830	0.429394	0.698806	0.702917	0.002671	0.012689	0.359163	0.363725	0.607895	...	0.307880	0.303315	0.529097	0.533102	NaN	NaN	NaN	NaN	NaN	NaN
2	StdDev	1.028880	1.054129	0.685146	0.710699	0.679727	0.705571	0.618481	0.665532	0.808809	...	0.898016	0.921603	0.561767	0.569397	NaN	NaN	NaN	NaN	NaN	NaN
3	A1BG	-0.878237	-0.876801	0.025279	0.021168	0.451471	0.441454	-0.206660	-0.211223	-0.835626	...	-0.816745	-0.812180	0.504218	0.500214	1.0	HGNC:5	alpha-1-B glycoprotein	Homo sapiens	19	19q13.43
4	A2M	-1.171150	-1.129142	0.156104	0.157726	0.476275	0.459270	-0.602418	-0.601337	-0.359150	...	-0.803398	-0.765856	0.553261	0.558857	2.0	HGNC:7	alpha-2-macroglobulin	Homo sapiens	12	12p13.31

5 rows × 227 columns

[20]:

cols = {c:c.split(' ')[0] for c in cptac_proteomics_data.columns if c.split(' ')[0] in proteomics_sample_ids}

[21]:

cptac_proteomics_data = cptac_proteomics_data[['Gene'] + list(cols.keys())].set_index('Gene').drop(['Mean', 'Median', 'StdDev'], axis=0)
cptac_proteomics_data = cptac_proteomics_data.rename(cols, axis=1)

[22]:

cptac_proteomics_data.head()

[22]:

	CPT0204410003	CPT0204410003	CPT0206670004	CPT0206670004	CPT0089150003	CPT0089150003	CPT0207030003	CPT0207030003	CPT0190240004	CPT0190240004	...	CPT0092440003	CPT0092440003	CPT0205570003	CPT0205570003	CPT0206110003	CPT0206110003	CPT0204370003	CPT0204370003	CPT0205450004	CPT0205450004
Gene
A1BG	-0.878237	-0.876801	0.025279	0.021168	0.451471	0.441454	-0.206660	-0.211223	-0.835626	-0.831638	...	-0.041234	-0.051084	0.762486	0.757234	0.797139	0.794111	-0.816745	-0.812180	0.504218	0.500214
A2M	-1.171150	-1.129142	0.156104	0.157726	0.476275	0.459270	-0.602418	-0.601337	-0.359150	-0.354726	...	0.235061	0.232168	0.445184	0.440900	0.762671	0.752278	-0.803398	-0.765856	0.553261	0.558857
AAAS	-0.406262	-0.404826	0.471849	0.467738	0.091538	0.081520	-0.125625	-0.130188	0.245751	0.249738	...	0.001848	-0.008002	0.082095	0.076843	0.108620	0.105592	-0.278136	-0.273572	0.211318	0.207313
AACS	0.926254	0.927690	0.285570	0.281459	-0.065078	-0.075096	-0.183277	-0.187840	-0.004768	-0.000780	...	-0.078248	-0.088098	0.139698	0.134446	0.151606	0.148578	0.288503	0.293067	0.053601	0.049597
AADAT	1.317162	1.318598	-0.497299	-0.501410	0.156067	0.146049	-0.174811	-0.179374	0.185505	0.189493	...	0.614136	0.604287	0.225270	0.220018	0.099918	0.096891	1.213109	1.217673	-0.100376	-0.104380

5 rows × 220 columns

[23]:

cptac_proteomics_data.shape

[23]:

(10977, 220)
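The table has 220 columns for 110 aliquots because both the `Log Ratio` and the `Unshared Log Ratio` column of each aliquot are renamed to the same aliquot ID. If only one measurement per aliquot is wanted, the renaming dictionary could be built as in the sketch below. Keeping the plain `Log Ratio` columns is an assumption made for illustration; the outputs in the rest of this notebook were produced with both columns kept.

```python
# Toy column names in the same format as the CDAP protein report header.
columns = [
    "Gene",
    "CPT0204410003 Log Ratio",
    "CPT0204410003 Unshared Log Ratio",
    "CPT0206670004 Log Ratio",
    "CPT0206670004 Unshared Log Ratio",
    "NCBIGeneID",
]
proteomics_sample_ids = ["CPT0204410003", "CPT0206670004"]

suffix = " Log Ratio"

# Keep a column only if stripping the suffix leaves exactly an aliquot ID.
# "<id> Unshared Log Ratio" becomes "<id> Unshared", which is not an ID,
# so the unshared columns are skipped.
cols = {
    c: c[: -len(suffix)]
    for c in columns
    if c.endswith(suffix) and c[: -len(suffix)] in proteomics_sample_ids
}
print(cols)
```

The resulting `cols` dictionary has the same shape as the one built in cell [20], so it could be passed to the same selection and `rename` calls.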

Clinical and Proteomics Data

[24]:

cptac_proteomics_data = cptac_proteomics_data.transpose().join(metadata.set_index('Aliquot ID'))

[25]:

cptac_proteomics_data.head()

[25]:

	A1BG	A2M	AAAS	AACS	AADAT	AAGAB	AAK1	AAMDC	AAMP	AAR2	...	ZWILCH	ZXDC	ZYG11B	ZYX	ZZEF1	ZZZ3	Case ID (Participant ID)	Sample type	gender	age
CPT0002410011	0.110003	0.432359	0.204539	-0.695101	NaN	0.133647	-0.524525	0.248913	-0.006661	0.482628	...	-0.043140	0.138792	0.013802	0.273392	-0.097031	0.641540	C3L-00365	tumor	Female	59
CPT0002410011	0.101098	0.421684	0.195633	-0.704007	NaN	0.124741	-0.540317	0.240007	-0.015566	0.473723	...	-0.052046	0.129887	0.004897	0.239006	-0.105937	0.632634	C3L-00365	tumor	Female	59
CPT0064650003	0.457406	0.702453	0.095134	-0.242212	0.072786	0.174392	-0.072472	-0.461447	-0.316991	0.703966	...	0.570851	1.089374	-0.198843	-0.016913	0.054600	-0.754282	C3L-00674	tumor	Male	45
CPT0064650003	0.443143	0.695942	0.080871	-0.256475	0.058524	0.160130	-0.070154	-0.475710	-0.331254	0.689704	...	0.556589	1.075111	-0.213106	-0.009587	0.040337	-0.768544	C3L-00674	tumor	Male	45
CPT0064890003	0.046885	-0.032622	0.358109	-0.121666	0.105781	-0.220744	-0.426095	-0.019649	-0.124015	0.396851	...	0.588521	0.192123	0.075653	0.275052	-0.047978	0.172529	C3L-01327	tumor	Male	74

5 rows × 10981 columns

Clinical Knowledge Graph Re-analysis

[26]:

from ckg.analytics_core.analytics import analytics
from ckg.analytics_core.viz import viz

from ckg.graphdb_connector import connector
driver = connector.getGraphDatabaseConnectionConfiguration()

from ckg.report_manager import knowledge


from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

c:\users\sande\.conda\envs\ckgenv\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning: The package pingouin is out of date. Your version is 0.3.12, the latest is 0.4.0.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.
  **kwargs

WGCNA functions will not work. Module Rpy2 not installed.
R functions will not work. Module Rpy2 not installed.

Imputation of missing values

We use the k-nearest neighbors (KNN) algorithm to impute missing values in the proteomics data.

[27]:

cptac_proteomics_data = analytics.imputation_KNN(cptac_proteomics_data.reset_index(), drop_cols=['gender', 'age', 'Sample type'], group='Sample type', cutoff=0.5)

[28]:

cptac_proteomics_data.head()

[28]:

	A1BG	A2M	AAAS	AACS	AADAT	AAGAB	AAK1	AAMDC	AAMP	AAR2	...	ZXDC	ZYG11B	ZYX	ZZEF1	ZZZ3	age	Sample type	gender	Case ID (Participant ID)	index
0	0.110003	0.432359	0.204539	-0.695101	0.039568	0.133647	-0.524525	0.248913	-0.006661	0.482628	...	0.138792	0.013802	0.273392	-0.097031	0.641540	59	tumor	Female	C3L-00365	CPT0002410011
1	0.101098	0.421684	0.195633	-0.704007	0.038556	0.124741	-0.540317	0.240007	-0.015566	0.473723	...	0.129887	0.004897	0.239006	-0.105937	0.632634	59	tumor	Female	C3L-00365	CPT0002410011
2	0.457406	0.702453	0.095134	-0.242212	0.072786	0.174392	-0.072472	-0.461447	-0.316991	0.703966	...	1.089374	-0.198843	-0.016913	0.054600	-0.754282	45	tumor	Male	C3L-00674	CPT0064650003
3	0.443143	0.695942	0.080871	-0.256475	0.058524	0.160130	-0.070154	-0.475710	-0.331254	0.689704	...	1.075111	-0.213106	-0.009587	0.040337	-0.768544	45	tumor	Male	C3L-00674	CPT0064650003
4	0.046885	-0.032622	0.358109	-0.121666	0.105781	-0.220744	-0.426095	-0.019649	-0.124015	0.396851	...	0.192123	0.075653	0.275052	-0.047978	0.172529	74	tumor	Male	C3L-01327	CPT0064890003

5 rows × 10789 columns
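To illustrate the idea behind KNN imputation, the following is a minimal plain-Python sketch: a missing value is replaced by the mean of that feature over the k most similar samples that have it measured. This is not CKG's implementation; the function `knn_impute`, the distance definition, and the toy matrix are assumptions, and the `group` and `cutoff` arguments of `analytics.imputation_KNN` are not modeled here.

```python
import math


def knn_impute(rows, k=2):
    """Return a copy of `rows` where None entries are filled with the mean
    of the same column over the k nearest rows that have it observed."""

    def distance(a, b):
        # Root-mean-square difference over columns observed in both rows.
        sq = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
        return math.sqrt(sum(sq) / len(sq)) if sq else math.inf

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other rows with column j observed
            # and at least one column shared with the current row.
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                d = distance(row, other)
                if math.isfinite(d):
                    candidates.append((d, other[j]))
            candidates.sort()
            nearest = candidates[:k]
            if nearest:
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Toy matrix: samples in rows, proteins in columns, None marks a missing value.
toy = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 2.8],
    [5.0, 6.0, 9.0],
]
filled = knn_impute(toy, k=2)
print(filled[0])
```

In the toy matrix the first sample is closest to the second and third, so its missing value is filled from those two rather than from the distant fourth sample.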

[29]:

cptac_proteomics_data = cptac_proteomics_data.rename({'index':'subject', 'Sample type': 'group', 'Case ID (Participant ID)': 'sample'}, axis=1)

Analysis of Covariance

We analyze the dataset to find differentially regulated proteins between normal and tumor brain tissue samples, taking age and gender as covariates.

[30]:

cptac_proteomics_data = cptac_proteomics_data.sort_values(by=['group'], ascending=False)

[31]:

results = analytics.run_ancova(cptac_proteomics_data, covariates=['age', 'gender'], drop_cols=['sample', 'subject'], subject='subject', group='group', alpha=0.01)

[32]:

results.head()

[32]:

	identifier	group1	group2	mean(group1)	std(group1)	mean(group2)	std(group2)	posthoc T-Statistics	posthoc pvalue	coef	...	log2FC	FC	F-statistics	pvalue	padj	correction	rejected	-log10 pvalue	Method	posthoc padj
0	A1BG	normal	tumor	-0.932974	0.087071	0.109841	0.529097	9.013360	1.086551e-16	1.054802	...	-1.042815	0.485379	81.240659	1.086551e-16	3.058566e-16	FDR correction BH	True	15.963950	One-way ancova	3.058566e-16
1	A2M	normal	tumor	-1.070228	0.289921	0.093514	0.576938	9.135619	4.820736e-17	1.177500	...	-1.163741	0.446353	83.459542	4.820736e-17	1.384469e-16	FDR correction BH	True	16.316887	One-way ancova	1.384469e-16
2	AAAS	normal	tumor	-0.485488	0.185624	0.136464	0.207010	13.078227	3.568030e-29	0.622227	...	-0.621952	0.649791	171.040029	3.568030e-29	2.066468e-28	FDR correction BH	True	28.447572	One-way ancova	2.066468e-28
3	AACS	normal	tumor	0.722912	0.218220	-0.049294	0.298136	-11.801273	3.937318e-25	-0.782880	...	0.772206	1.707879	139.270049	3.937318e-25	1.794592e-24	FDR correction BH	True	24.404799	One-way ancova	1.794592e-24
4	AADAT	normal	tumor	1.241883	0.282301	-0.151706	0.351998	-17.257875	1.597045e-42	-1.399042	...	1.393590	2.627316	297.834255	1.597045e-42	1.945476e-41	FDR correction BH	True	41.796683	One-way ancova	1.945476e-41

5 rows × 23 columns
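The `correction` column above reads "FDR correction BH", i.e. the adjusted p-values come from a Benjamini-Hochberg correction. As a reminder of what that procedure does, here is a plain-Python sketch on made-up p-values. The function name and the toy inputs are assumptions; CKG's own implementation may differ in details such as tie handling.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


toy_pvalues = [0.01, 0.04, 0.03, 0.005]   # made-up values
adjusted = benjamini_hochberg(toy_pvalues)
rejected = [p <= 0.05 for p in adjusted]  # toy alpha of 0.05
print(adjusted, rejected)
```

Each raw p-value is scaled by the number of tests divided by its rank, which is why `padj` in the table is always at least as large as `pvalue`.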

[33]:

fig = viz.run_volcano(results, identifier='volcano_plot', args={'alpha': 0.01,
                                                                      'fc': 2,
                                                                      'colorscale': 'Blues',
                                                                      'showscale': False,
                                                                      'marker_size': 8,
                                                                      'x_title': 'log2FC',
                                                                      'y_title': '-log10(pvalue)',
                                                                      'num_annotations': 1000,
                                                                      'annotate_list': []})

viz.save_DASH_plot(fig[0], 'volcano_plot_normal_tumor', plot_format='png', directory=data_dir)
iplot(fig[0].figure)
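To make the volcano plot thresholds concrete, the sketch below places the five proteins from the `results` preview on the plot axes and applies `alpha = 0.01` and `fc = 2` (a log2 fold-change cutoff of 1). The values are copied from the table above. Whether `run_volcano` thresholds on the raw or the adjusted p-value, and whether the boundaries are inclusive, is not shown in this notebook, so the raw p-value and strict inequalities are assumptions here.

```python
import math

ALPHA = 0.01                    # same value as the 'alpha' argument above
FC = 2                          # same value as the 'fc' argument above
LOG2FC_CUTOFF = math.log2(FC)   # fold change of 2 corresponds to |log2FC| of 1

# (identifier, log2FC, pvalue) copied from the first rows of `results`.
rows = [
    ("A1BG", -1.042815, 1.086551e-16),
    ("A2M", -1.163741, 4.820736e-17),
    ("AAAS", -0.621952, 3.568030e-29),
    ("AACS", 0.772206, 3.937318e-25),
    ("AADAT", 1.393590, 1.597045e-42),
]


def volcano_point(log2fc, pvalue):
    """Coordinates of a protein on the volcano plot."""
    return log2fc, -math.log10(pvalue)


def classify(log2fc, pvalue, alpha=ALPHA, cutoff=LOG2FC_CUTOFF):
    """Label a protein using both the significance and the effect-size cutoff."""
    if pvalue >= alpha or abs(log2fc) <= cutoff:
        return "below thresholds"
    # log2FC is mean(normal) - mean(tumor), so negative means higher in tumor.
    return "higher in tumor" if log2fc < 0 else "higher in normal"


labels = {name: classify(lfc, p) for name, lfc, p in rows}
a1bg_x, a1bg_y = volcano_point(-1.042815, 1.086551e-16)
print(labels)
```

AAAS and AACS have very small p-values but stay below the fold-change cutoff, which is why a volcano plot uses both axes rather than the p-value alone.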

Enrichment and Knowledge Annotation

Using the list of significantly regulated proteins, we use annotations from CKG to determine enriched biological processes and to reveal the knowledge graph associated with these protein hits, focusing on proteins up-regulated in tumor tissue compared to normal. These proteins could be targeted by drug inhibitors to try to reverse the progression of the tumor.

[34]:

annotation_query = '''MATCH (p:Protein)-[r:ASSOCIATED_WITH]-(bp:Biological_process)
                        WHERE p.name IN $protein_list
                        RETURN DISTINCT p.name AS identifier, bp.name AS annotation'''
driver = connector.getGraphDatabaseConnectionConfiguration()
annotation = connector.getCursorData(driver, annotation_query, parameters={'protein_list':results['identifier'].unique().tolist()})

[35]:

enrichment = analytics.run_up_down_regulation_enrichment(results, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', correction='fdr_bh', alpha=0.01, lfc_cutoff=1)

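# log2FC in `results` is mean(normal) - mean(tumor). Assuming 'downregulated'
# marks negative log2FC in the normal~tumor comparison, this direction holds
# the proteins that are up-regulated in tumor.
upregulated_gbm_enrichment = enrichment['normal~tumor']
upregulated_gbm_enrichment = upregulated_gbm_enrichment[upregulated_gbm_enrichment['direction'] == 'downregulated']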

e = {'normal~tumor': upregulated_gbm_enrichment}

[36]:

figures = viz.get_enrichment_plots(e, identifier='enrichment', args={'width':2200})
i = 0
for fig in figures:
    iplot(fig.figure)
    viz.save_DASH_plot(fig, name="enrichment_"+str(i), plot_format='png', directory=data_dir)
    i += 1
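The enrichment call above uses `method='fisher'`. As an illustration of the test behind that option, the sketch below computes a one-sided Fisher's exact test for a single annotation from a 2x2 contingency table. The function name and the one-sided alternative are assumptions; the notebook does not show which alternative CKG uses.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test (over-representation).

    a: regulated proteins with the annotation
    b: regulated proteins without it
    c: non-regulated proteins with the annotation
    d: non-regulated proteins without it
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    # Sum the hypergeometric probabilities of tables at least as extreme,
    # i.e. with a or more annotated proteins among the regulated ones.
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(row2, col1 - x) for x in range(a, upper + 1))
    return tail / total


# Toy table: 3 of 4 regulated proteins carry the annotation,
# against 1 of 4 non-regulated proteins.
p_value = fisher_exact_greater(3, 1, 1, 3)
print(p_value)
```

One such test is run per biological process, which is why the resulting p-values are then corrected for multiple testing (`correction='fdr_bh'` in the call above).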

Knowledge Summarization

[37]:

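# log2FC is mean(normal) - mean(tumor), so log2FC < -1 selects proteins
# with more than two-fold higher abundance in tumor.
upregulated_gbm = results[(results['rejected']) & (results['log2FC'] < -1)]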

[38]:

upregulated_gbm.shape

[38]:

(861, 23)

[39]:

kn = knowledge.Knowledge(identifier='GBM', data=None)

kn.annotate_list(query_list=upregulated_gbm['identifier'].tolist(),
                 entity_type='protein',
                 queries_file=None,
                 attribute='name',
                 diseases=['glioblastoma'],
                 entities=None)

[40]:

kn.generate_report(visualizations=['sankey', 'network'],
                   summarize=True,
                   method='betweenness',
                   inplace=True)

[41]:

kn.report.visualize_report(environment='notebook')[0]

[42]:

kn.save_report(data_dir)

[58]:

kn.keep_nodes

[58]:

['glioblastoma']

Finding Drug Inhibitors

We use the knowledge in CKG to filter the list of proteins up-regulated in tumor tissue compared to normal and to explore possible drug inhibitors that could reverse the progression of the tumor.

We first keep only the significantly up-regulated proteins that have already been associated with GBM, using the protein-disease associations in CKG.

We then use the filtered list to find inhibitors stored in CKG as drug-target relationships.

From the resulting list of candidate drugs, we further filter the results by finding evidence in publications that mention the drug together with the disease and a protein from the filtered list.

This last step provides a final list of connected protein hits, candidate drug inhibitors and publications that can be visualized as a CKG knowledge subgraph.
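The filtering steps described above can be pictured with plain Python collections before reading the Cypher queries. Everything in this sketch is hypothetical toy data (proteins `P1` to `P3`, drugs `D1` to `D3`, publications `pub-1` and `pub-2`); only the thresholds (score > 1.5, score > 0.7, degree < 10) are taken from the queries used below.

```python
# Hypothetical protein-disease association scores for glioblastoma.
disease_score = {"P1": 2.2, "P2": 1.8, "P3": 0.9}

# Hypothetical inhibition edges: (protein, drug, score).
inhibition_edges = [("P1", "D1", 0.8), ("P1", "D2", 0.5), ("P2", "D3", 0.9)]

# Hypothetical number of proteins each drug acts on.
drug_degree = {"D1": 3, "D2": 2, "D3": 25}

# Hypothetical publications with the entities they mention.
publications = {
    "pub-1": {"diseases": {"glioblastoma"}, "drugs": {"D1"}, "proteins": {"P1"}},
    "pub-2": {"diseases": {"other disease"}, "drugs": {"D1"}, "proteins": {"P1"}},
}

upregulated = ["P1", "P2", "P3"]

# Step 1: keep proteins already associated with the disease.
associated = [p for p in upregulated if disease_score.get(p, 0) > 1.5]

# Step 2: keep confident inhibition edges to reasonably specific drugs.
targets = [
    (p, d)
    for p, d, score in inhibition_edges
    if p in associated and score > 0.7 and drug_degree[d] < 10
]

# Step 3: keep drug/protein co-mentions in publications about the disease.
drug_list = {d for _, d in targets}
protein_list = {p for p, _ in targets}
evidence = [
    (pub_id, d, p)
    for pub_id, pub in publications.items()
    if "glioblastoma" in pub["diseases"]
    for d in sorted(pub["drugs"] & drug_list)
    for p in sorted(pub["proteins"] & protein_list)
]
print(associated, targets, evidence)
```

In this toy run `D2` is dropped for its low score, `D3` for acting on too many proteins, and `pub-2` because it does not mention the disease.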

[43]:

query = '''MATCH (p:Protein)-[r:ASSOCIATED_WITH]-(d:Disease{name:"glioblastoma"})
            WHERE p.name IN $protein_list AND r.score > 1.5 RETURN p.name, d.name, r.score'''

res = connector.getCursorData(driver, query, parameters={'protein_list': upregulated_gbm['identifier'].tolist()})

[44]:

res.head()

[44]:

	d.name	p.name	r.score
0	glioblastoma	CD276	1.602
1	glioblastoma	ANXA5	2.290
2	glioblastoma	THBS1	1.615
3	glioblastoma	IGFBP2	1.835
4	glioblastoma	MMP9	2.218

[45]:

len(res['p.name'].tolist())

[45]:

[46]:

query = '''MATCH (p:Protein)-[r:ACTS_ON{action:"inhibition"}]-(d:Drug)
            WHERE p.name IN $protein_list AND r.score > 0.7
            WITH p, d, r, SIZE((:Protein)-[:ACTS_ON]-(d)) as degree WHERE degree < 10
            RETURN p.name, d.name, r.score'''

res = connector.getCursorData(driver, query, parameters={'protein_list': res['p.name'].tolist()})

[47]:

res.head()

[47]:

	d.name	p.name	r.score
0	Pirfenidone	MMP9	0.8
1	3,5-Dimethyl-1-(3-Nitrophenyl)-1h-Pyrazole-4-C...	MMP9	0.8
2	Hydroflumethiazide	CA9	0.8
3	Pirfenidone	ALB	0.8
4	(10E,12Z)-octadecadienoic acid	MMP2	0.8

[48]:

res['d.name'].unique().shape

[48]:

(28,)

[49]:

project_knowledge = knowledge.Knowledge(identifier='targets',
                              data=None,
                              nodes={},
                              relationships={},
                              queries_file=None,
                              colors={},
                              graph=None,
                              report={})
project_knowledge.generate_knowledge_from_edgelist(edgelist=res,
                                                   entity1='Protein',
                                                   entity2='Drug',
                                                   source='p.name',
                                                   target='d.name',
                                                   rtype='associated_with',
                                                   weight='r.score')

[50]:

query = '''MATCH (d:Drug)-[:MENTIONED_IN_PUBLICATION]-(p:Publication)-[:MENTIONED_IN_PUBLICATION]-(di:Disease{name:"glioblastoma"})
            WHERE d.name IN $drug_list
            WITH d, p
            MATCH (pro:Protein)-[:MENTIONED_IN_PUBLICATION]-(p)
            WHERE pro.name IN $protein_list
            RETURN p.id, d.name, d.class, pro.name'''

res = connector.getCursorData(driver, query, parameters={'drug_list': res['d.name'].tolist(),
                                                         'protein_list':res['p.name'].unique().tolist()})

[51]:

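# The cursor data comes back with columns ordered as d.class, d.name, p.id, pro.name
# (alphabetical, as in the earlier outputs), which is the order assumed here.
res.columns = ['Drug class', 'Drug name', 'Publication', 'Protein name']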

res['Publication'] = ["PMID:"+p for p in res['Publication'].tolist()]

res[['Drug class', 'Drug name', 'Publication', 'Protein name']]

[51]:

	Drug class	Drug name	Publication	Protein name
0	Azoles	3,5-Dimethyl-1-(3-Nitrophenyl)-1h-Pyrazole-4-C...	PMID:30470262	MMP9
1	Pyridines and derivatives	Pirfenidone	PMID:31282197	MMP9
2	Purine nucleosides	Nelarabine	PMID:26899176	MMP9
3	Pyridines and derivatives	Pirfenidone	PMID:31282197	CA9
4	Benzene and substituted derivatives	Cetirizine	PMID:31022935	ALB
5	Carboxylic acids and derivatives	Tranexamic acid	PMID:22539956	ALB
6	Pyridines and derivatives	Pirfenidone	PMID:31117237	FN1
7	Purine nucleosides	Nelarabine	PMID:31950163	FN1
8	Pyridines and derivatives	Pirfenidone	PMID:31282197	FN1
9	Pyridines and derivatives	Pirfenidone	PMID:29996062	FN1
10	Pyridines and derivatives	Pirfenidone	PMID:31547567	FN1
11	Azoles	3,5-Dimethyl-1-(3-Nitrophenyl)-1h-Pyrazole-4-C...	PMID:31231472	FN1
12	Pyridines and derivatives	Pirfenidone	PMID:31231472	FN1
13	Pyridines and derivatives	Pirfenidone	PMID:29038232	FN1
14	Benzene and substituted derivatives	N-(4-sulfamoylphenyl)-1H-indazole-3-carboxamide	PMID:25707963	FN1
15	Pyridines and derivatives	Pirfenidone	PMID:25026295	FN1
16	Benzothiophenes	4-Iodobenzo[B]Thiophene-2-Carboxamidine	PMID:21976520	FN1
17	Pyridines and derivatives	Pirfenidone	PMID:29996062	ICAM1
18	Azoles	3,5-Dimethyl-1-(3-Nitrophenyl)-1h-Pyrazole-4-C...	PMID:31231472	CDK1
19	Pyridines and derivatives	Pirfenidone	PMID:31231472	CDK1
20	Benzothiophenes	4-Iodobenzo[B]Thiophene-2-Carboxamidine	PMID:21976520	PLG
21	Carboxylic acids and derivatives	Mimosine	PMID:20226717	PLG
22	Pyridines and derivatives	Pirfenidone	PMID:29996062	CCL2
23	Pyridines and derivatives	Pirfenidone	PMID:25026295	CHI3L1

[52]:

res.to_csv(os.path.join(data_dir, 'studies_drugs.tsv'), sep='\t', header=True, index=False, doublequote=None)

[53]:

len(res['Drug name'].tolist())

[53]:

[54]:

res['r.score'] = 1

[55]:

project_knowledge.generate_knowledge_from_edgelist(edgelist=res,
                                                   entity1='Drug',
                                                   entity2='Publication',
                                                   source='Drug name',
                                                   target='Publication',
                                                   rtype='associated_with',
                                                   weight='r.score')

project_knowledge.generate_knowledge_from_edgelist(edgelist=res,
                                                   entity1='Protein',
                                                   entity2='Publication',
                                                   source='Protein name',
                                                   target='Publication',
                                                   rtype='associated_with',
                                                   weight='r.score')

[60]:

project_knowledge.generate_report(visualizations=['sankey', 'network'], summarize=False)
project_knowledge.report.visualize_report(environment='notebook')[0]

[57]:

project_knowledge.save_report(data_dir)