CPTAC Glioblastoma (GBM) Discovery Study

Glioblastoma is the most common type of brain cancer in adults, with approximately 14,000 new diagnoses each year (NCI Cancer Currents, 2017). Tumors from patients with GBM were molecularly profiled by The Cancer Genome Atlas (TCGA), and these studies identified somatic mutations associated with essential signaling pathways (Nature 2008, Cell 2013). To elucidate the proteome, phosphoproteome, and acetylome profiles of GBM tumors, tissue from 99 patients was subjected to mass spectrometry analysis using 11-plexed isobaric tandem mass tags (TMT-11). Normal brain samples from 10 participants of the Genotype-Tissue Expression (GTEx) program were also analyzed (source: CPTAC GBM discovery study).

Exploring CPTAC GBM proteomics data with CKG

In this notebook, we analyze the GBM proteomics data from CPTAC to identify significant proteomic differences between tumor and normal brain tissue. We then annotate the potentially relevant proteins with knowledge from CKG to interpret their involvement in the progression of the disease.

We first format the data, mainly with the pandas library, so that it can be used within CKG's platform, and then run a sequence of analyses:

  • Data preprocessing

  • Differential regulation

  • Knowledge annotation

  • Drug annotation

The data was downloaded from: https://cptac-data-portal.georgetown.edu/study-summary/S048

“Data used in this analysis were generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH).”

[1]:
import os
import pandas as pd

CPTAC Files Used

[2]:
data_dir = '/Users/sande/work/Multi_Omics_Analytics_Department/GBM_FilipMundt/CPTAC_Phase_III_Data/CPTAC_GBM_S048'
cptac_proteome_report_file = os.path.join(data_dir,'CPTAC_GBM_Proteome_CDAP_Protein_Report.r1/CPTAC3_Glioblastoma_Multiforme_Proteome.tmt11.tsv')
cptac_clinical_file = os.path.join(data_dir,'CPTAC_GBM_metadata/S048_CPTAC_GBM_Discovery_Cohort_Clinical_Data_Dec2019_r1.xlsx')
cptac_sample_mapping_file = os.path.join(data_dir,'CPTAC_GBM_metadata/S048_CPTAC_GBM_Discovery_Cohort_TMT11_CaseID_SampleID_AliquotID_Map_Dec2019_r1.xlsx')

Clinical Data

[3]:
cptac_clinical_data = pd.read_excel(cptac_clinical_file, sheet_name='Clinical_Attributes')
[4]:
cptac_clinical_data.head()
[4]:
tumor_code case_id type_of_analyzed_samples gender age height_at_time_of_surgery_cm weight_at_time_of_surgery_kg BMI race ethnicity ... measure_of_success_of_outcome_at_completion_of_this_follow_up_form tumor_status_at_date_of_last_contact_or_death vital_status_at_date_of_last_contact cause_of_death days_from_date_of_initial_pathologic_diagnosis_to_date_of_death performance_status_score_eastern_cooperative_oncology_group performance_status_score_karnofsky_score_preoperative performance_status_scale_timing days_from_date_of_initial_pathologic_diagnosis_to_date_of_last_contact is_this_patient_lost_to_follow_up
0 GBM C3L-00104 Tumor Male 58 188.0 115.0 32.54 White Not-Hispanic or Latino ... Patient Deceased With Tumor Deceased NaN 129.0 Not Evaluated: Not provided or available Not Evaluated: Not provided or available Not Evaluated: Not provided or available 128.0 No
1 GBM C3L-00365 Tumor Female 59 162.0 54.0 20.61 White Not-Hispanic or Latino ... Patient Deceased NaN Deceased Unknown; patient entered hospice care. At last... 322.0 Not Evaluated: Not provided or available Not Evaluated: Not provided or available Not Evaluated: Not provided or available 280.0 No
2 GBM C3L-00674 Tumor Male 45 193.0 102.0 27.44 White Not-Hispanic or Latino ... Patient Deceased NaN Deceased Progression of glioblastoma 478.0 Not Evaluated: Not provided or available 90: Able to carry on normal activity; minor si... Post-Adjuvant Therapy 385.0 No
3 GBM C3L-00677 Tumor Female 69 164.0 52.0 19.32 White Not-Hispanic or Latino ... Patient Deceased NaN Deceased Progression of glioblastoma + Multiple organ f... 154.0 1: Symptomatic; Restricted in physically stren... 70: Cares for self; unable to carry on normal ... Post-Adjuvant Therapy 154.0 No
4 GBM C3L-01040 Tumor Male 77 170.0 70.0 24.22 NaN NaN ... Persistent Disease With Tumor Living NaN NaN 1: Symptomatic; Restricted in physically stren... 70: Cares for self; unable to carry on normal ... Post-Adjuvant Therapy 608.0 Yes

5 rows × 43 columns

[5]:
cptac_clinical_data.groupby(['type_of_analyzed_samples', 'gender']).count()[['case_id']]
[5]:
case_id
type_of_analyzed_samples gender
Normal Female 5
Male 5
Tumor Female 44
Male 56

10 normal brain samples (5 female, 5 male) and 100 tumor samples (44 female, 56 male).

[6]:
cptac_clinical_data.shape
[6]:
(110, 43)
[7]:
list_of_samples = cptac_clinical_data['case_id'].tolist()
[8]:
len(list_of_samples)
[8]:
110
[9]:
cptac_clinical_data[cptac_clinical_data['case_id'].isin(list_of_samples)].groupby('type_of_analyzed_samples').count()
[9]:
tumor_code case_id gender age height_at_time_of_surgery_cm weight_at_time_of_surgery_kg BMI race ethnicity ethnicity_race_ancestry_identified ... measure_of_success_of_outcome_at_completion_of_this_follow_up_form tumor_status_at_date_of_last_contact_or_death vital_status_at_date_of_last_contact cause_of_death days_from_date_of_initial_pathologic_diagnosis_to_date_of_death performance_status_score_eastern_cooperative_oncology_group performance_status_score_karnofsky_score_preoperative performance_status_scale_timing days_from_date_of_initial_pathologic_diagnosis_to_date_of_last_contact is_this_patient_lost_to_follow_up
type_of_analyzed_samples
Normal 0 10 10 10 10 10 10 10 4 0 ... 0 0 10 9 0 0 0 0 0 0
Tumor 100 100 100 100 100 100 100 29 31 100 ... 80 73 94 31 49 94 94 94 94 94

2 rows × 42 columns

Sample Identifier Mapping

We use the file S048_CPTAC_GBM_Discovery_Cohort_TMT11_CaseID_SampleID_AliquotID_Map_Dec2019_r1.xlsx to match the clinical metadata (Case ID) and proteomics data (Aliquot ID).

We keep the sample type (normal, tumor) from the mapping file as the grouping variable, and gender and age from the clinical metadata as possible covariates in the differential regulation analysis.

[10]:
cptac_sample_mapping = pd.read_excel(cptac_sample_mapping_file, comment='#', header=2)
[11]:
cptac_sample_mapping.head()
[11]:
Batch TMT plex TMT channel Alias Case ID (Participant ID) Parent Sample ID(s) Aliquot ID Sample type OCT TCIA Slide ID TCIA Image links Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14
0 1 1 126 B1S1 ref NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1 1 127N B1S1 GTEX-Y8DK-0011-R10A-SM-HAKY1 GTEX-Y8DK-0011-R10A-SM-HAKY1 CPT0204410003 normal No NaN NaN NaN NaN NaN NaN
2 1 1 127C B1S1 C3N-03183 C3N-03183-02, C3N-03183-03 CPT0206670004 tumor No C3N-03183-22, C3N-03183-23 https://pathology.cancerimagingarchive.net/pat... https://pathology.cancerimagingarchive.net/pat... NaN NaN NaN
3 1 1 128N B1S1 C3N-01505 C3N-01505-01 CPT0089150003 tumor No C3N-01505-21 https://pathology.cancerimagingarchive.net/pat... NaN NaN NaN NaN
4 1 1 128C B1S1 C3N-03188 C3N-03188-02 CPT0207030003 tumor No C3N-03188-22 https://pathology.cancerimagingarchive.net/pat... NaN NaN NaN NaN
[12]:
cptac_sample_mapping = cptac_sample_mapping[cptac_sample_mapping['Case ID (Participant ID)'].isin(list_of_samples)]
[13]:
cptac_sample_mapping.shape
[13]:
(110, 15)
[14]:
metadata = cptac_sample_mapping[['Case ID (Participant ID)','Aliquot ID', 'Sample type']].set_index('Case ID (Participant ID)')
metadata = metadata.join(cptac_clinical_data.set_index('case_id')[['gender', 'age']]).reset_index()
[15]:
metadata.head()
[15]:
Case ID (Participant ID) Aliquot ID Sample type gender age
0 GTEX-Y8DK-0011-R10A-SM-HAKY1 CPT0204410003 normal Male 62
1 C3N-03183 CPT0206670004 tumor Male 53
2 C3N-01505 CPT0089150003 tumor Male 74
3 C3N-03188 CPT0207030003 tumor Male 54
4 C3L-02984 CPT0190240004 tumor Male 34
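
The join above silently leaves `NaN` in `gender` and `age` if a case ID in the mapping file has no clinical record. A minimal sketch of how the match could be checked with `DataFrame.merge`, using a miniature stand-in for the two tables (the first two rows reuse IDs shown in this notebook; the third row, `C3N-99999` / `CPT9999999999`, is hypothetical and added only to show an unmatched case):

```python
import pandas as pd

# Miniature stand-ins for the mapping and clinical tables.
mapping = pd.DataFrame({
    'Case ID (Participant ID)': ['C3L-00365', 'C3L-00674', 'C3N-99999'],  # last ID is hypothetical
    'Aliquot ID': ['CPT0002410011', 'CPT0064650003', 'CPT9999999999'],
    'Sample type': ['tumor', 'tumor', 'tumor'],
})
clinical = pd.DataFrame({
    'case_id': ['C3L-00365', 'C3L-00674'],
    'gender': ['Female', 'Male'],
    'age': [59, 45],
})

# validate='many_to_one' raises if a case ID appears twice in the clinical table;
# indicator=True adds a '_merge' column telling where each row was found.
merged = mapping.merge(clinical,
                       left_on='Case ID (Participant ID)', right_on='case_id',
                       how='left', validate='many_to_one', indicator=True)

# Aliquots whose case ID has no clinical record.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Aliquot ID'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real tables would confirm that every aliquot received its covariates.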
[16]:
proteomics_sample_ids = cptac_sample_mapping['Aliquot ID'].tolist()
[17]:
proteomics_sample_ids
[17]:
['CPT0204410003',
 'CPT0206670004',
 'CPT0089150003',
 'CPT0207030003',
 'CPT0190240004',
 'CPT0161730003',
 'CPT0218330004',
 'CPT0104220003',
 'CPT0182580003',
 'CPT0167860004',
 'CPT0205670004',
 'CPT0093450003',
 'CPT0002410011',
 'CPT0189460003',
 'CPT0167750004',
 'CPT0218690004',
 'CPT0217060003',
 'CPT0205890003',
 'CPT0204420003',
 'CPT0189570004',
 'CPT0205780003',
 'CPT0168720003',
 'CPT0093550003',
 'CPT0217190003',
 'CPT0218770003',
 'CPT0204390003',
 'CPT0189250003',
 'CPT0087950003',
 'CPT0190360004',
 'CPT0224330003',
 'CPT0217880003',
 'CPT0127420003',
 'CPT0218960004',
 'CPT0168480003',
 'CPT0204360003',
 'CPT0221180003',
 'CPT0218830004',
 'CPT0225760003',
 'CPT0168270003',
 'CPT0064650003',
 'CPT0206880003',
 'CPT0168380003',
 'CPT0206000004',
 'CPT0167530003',
 'CPT0189850004',
 'CPT0196850003',
 'CPT0206560003',
 'CPT0219080004',
 'CPT0204380003',
 'CPT0224600003',
 'CPT0125570003',
 'CPT0217000004',
 'CPT0207090003',
 'CPT0217430008',
 'CPT0168590003',
 'CPT0186100003',
 'CPT0168080003',
 'CPT0162020003',
 'CPT0201710003',
 'CPT0204330003',
 'CPT0224390004',
 'CPT0209440003',
 'CPT0218890004',
 'CPT0208980003',
 'CPT0123530003',
 'CPT0071100003',
 'CPT0182550003',
 'CPT0217710008',
 'CPT0125510003',
 'CPT0127480003',
 'CPT0167640003',
 'CPT0087680003',
 'CPT0224540004',
 'CPT0167970003',
 'CPT0162100003',
 'CPT0189750004',
 'CPT0204350003',
 'CPT0104330003',
 'CPT0162140003',
 'CPT0078580003',
 'CPT0204400003',
 'CPT0093510003',
 'CPT0087570003',
 'CPT0216920008',
 'CPT0228220003',
 'CPT0175060003',
 'CPT0206330003',
 'CPT0217100003',
 'CPT0199770003',
 'CPT0189650004',
 'CPT0125220003',
 'CPT0206450003',
 'CPT0182500003',
 'CPT0225730003',
 'CPT0093590003',
 'CPT0204340003',
 'CPT0093360003',
 'CPT0206780003',
 'CPT0218670003',
 'CPT0087730003',
 'CPT0064890003',
 'CPT0206230003',
 'CPT0171580008',
 'CPT0168830003',
 'CPT0079790003',
 'CPT0092440003',
 'CPT0205570003',
 'CPT0206110003',
 'CPT0204370003',
 'CPT0205450004']

Proteomics Data

[18]:
cptac_proteomics_data = pd.read_csv(cptac_proteome_report_file, sep='\t')
[19]:
cptac_proteomics_data.head()
[19]:
Gene CPT0204410003 Log Ratio CPT0204410003 Unshared Log Ratio CPT0206670004 Log Ratio CPT0206670004 Unshared Log Ratio CPT0089150003 Log Ratio CPT0089150003 Unshared Log Ratio CPT0207030003 Log Ratio CPT0207030003 Unshared Log Ratio CPT0190240004 Log Ratio ... CPT0204370003 Log Ratio CPT0204370003 Unshared Log Ratio CPT0205450004 Log Ratio CPT0205450004 Unshared Log Ratio NCBIGeneID Authority Description Organism Chromosome Locus
0 Mean 0.533903 0.536714 0.634488 0.638574 -0.042532 -0.030967 0.352712 0.358905 0.577073 ... 0.412576 0.415405 0.502734 0.499886 NaN NaN NaN NaN NaN NaN
1 Median 0.430830 0.429394 0.698806 0.702917 0.002671 0.012689 0.359163 0.363725 0.607895 ... 0.307880 0.303315 0.529097 0.533102 NaN NaN NaN NaN NaN NaN
2 StdDev 1.028880 1.054129 0.685146 0.710699 0.679727 0.705571 0.618481 0.665532 0.808809 ... 0.898016 0.921603 0.561767 0.569397 NaN NaN NaN NaN NaN NaN
3 A1BG -0.878237 -0.876801 0.025279 0.021168 0.451471 0.441454 -0.206660 -0.211223 -0.835626 ... -0.816745 -0.812180 0.504218 0.500214 1.0 HGNC:5 alpha-1-B glycoprotein Homo sapiens 19 19q13.43
4 A2M -1.171150 -1.129142 0.156104 0.157726 0.476275 0.459270 -0.602418 -0.601337 -0.359150 ... -0.803398 -0.765856 0.553261 0.558857 2.0 HGNC:7 alpha-2-macroglobulin Homo sapiens 12 12p13.31

5 rows × 227 columns

[20]:
cols = {c:c.split(' ')[0] for c in cptac_proteomics_data.columns if c.split(' ')[0] in proteomics_sample_ids}
[21]:
cptac_proteomics_data = cptac_proteomics_data[['Gene'] + list(cols.keys())].set_index('Gene').drop(['Mean', 'Median', 'StdDev'], axis=0)
cptac_proteomics_data = cptac_proteomics_data.rename(cols, axis=1)
[22]:
cptac_proteomics_data.head()
[22]:
CPT0204410003 CPT0204410003 CPT0206670004 CPT0206670004 CPT0089150003 CPT0089150003 CPT0207030003 CPT0207030003 CPT0190240004 CPT0190240004 ... CPT0092440003 CPT0092440003 CPT0205570003 CPT0205570003 CPT0206110003 CPT0206110003 CPT0204370003 CPT0204370003 CPT0205450004 CPT0205450004
Gene
A1BG -0.878237 -0.876801 0.025279 0.021168 0.451471 0.441454 -0.206660 -0.211223 -0.835626 -0.831638 ... -0.041234 -0.051084 0.762486 0.757234 0.797139 0.794111 -0.816745 -0.812180 0.504218 0.500214
A2M -1.171150 -1.129142 0.156104 0.157726 0.476275 0.459270 -0.602418 -0.601337 -0.359150 -0.354726 ... 0.235061 0.232168 0.445184 0.440900 0.762671 0.752278 -0.803398 -0.765856 0.553261 0.558857
AAAS -0.406262 -0.404826 0.471849 0.467738 0.091538 0.081520 -0.125625 -0.130188 0.245751 0.249738 ... 0.001848 -0.008002 0.082095 0.076843 0.108620 0.105592 -0.278136 -0.273572 0.211318 0.207313
AACS 0.926254 0.927690 0.285570 0.281459 -0.065078 -0.075096 -0.183277 -0.187840 -0.004768 -0.000780 ... -0.078248 -0.088098 0.139698 0.134446 0.151606 0.148578 0.288503 0.293067 0.053601 0.049597
AADAT 1.317162 1.318598 -0.497299 -0.501410 0.156067 0.146049 -0.174811 -0.179374 0.185505 0.189493 ... 0.614136 0.604287 0.225270 0.220018 0.099918 0.096891 1.213109 1.217673 -0.100376 -0.104380

5 rows × 220 columns

[23]:
cptac_proteomics_data.shape
[23]:
(10977, 220)
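
Note that the table has 220 columns for 110 aliquots: both the `Log Ratio` and the `Unshared Log Ratio` column of each aliquot are renamed to the same aliquot ID, so every sample appears twice. If only one measurement per sample is wanted, the columns could be selected by suffix before renaming. The following is a sketch under that assumption, not part of the original analysis; it uses the first two genes and aliquots from the report shown above.

```python
import pandas as pd

# Two genes and two aliquots copied from the report preview above.
report = pd.DataFrame({
    'Gene': ['A1BG', 'A2M'],
    'CPT0204410003 Log Ratio': [-0.878237, -1.171150],
    'CPT0204410003 Unshared Log Ratio': [-0.876801, -1.129142],
    'CPT0206670004 Log Ratio': [0.025279, 0.156104],
    'CPT0206670004 Unshared Log Ratio': [0.021168, 0.157726],
})
sample_ids = ['CPT0204410003', 'CPT0206670004']

# Keep a column only if stripping ' Log Ratio' leaves exactly an aliquot ID.
# 'X Unshared Log Ratio' strips to 'X Unshared', which is not an ID, so it is skipped.
suffix = ' Log Ratio'
cols = {c: c[:-len(suffix)] for c in report.columns
        if c.endswith(suffix) and c[:-len(suffix)] in sample_ids}

one_per_sample = report.set_index('Gene')[list(cols)].rename(columns=cols)
print(one_per_sample.shape)
```

Applying this selection to the full report would change the downstream numbers reported in this notebook, which were computed with both columns per aliquot.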

Clinical and Proteomics Data

[24]:
cptac_proteomics_data = cptac_proteomics_data.transpose().join(metadata.set_index('Aliquot ID'))
[25]:
cptac_proteomics_data.head()
[25]:
A1BG A2M AAAS AACS AADAT AAGAB AAK1 AAMDC AAMP AAR2 ... ZWILCH ZXDC ZYG11B ZYX ZZEF1 ZZZ3 Case ID (Participant ID) Sample type gender age
CPT0002410011 0.110003 0.432359 0.204539 -0.695101 NaN 0.133647 -0.524525 0.248913 -0.006661 0.482628 ... -0.043140 0.138792 0.013802 0.273392 -0.097031 0.641540 C3L-00365 tumor Female 59
CPT0002410011 0.101098 0.421684 0.195633 -0.704007 NaN 0.124741 -0.540317 0.240007 -0.015566 0.473723 ... -0.052046 0.129887 0.004897 0.239006 -0.105937 0.632634 C3L-00365 tumor Female 59
CPT0064650003 0.457406 0.702453 0.095134 -0.242212 0.072786 0.174392 -0.072472 -0.461447 -0.316991 0.703966 ... 0.570851 1.089374 -0.198843 -0.016913 0.054600 -0.754282 C3L-00674 tumor Male 45
CPT0064650003 0.443143 0.695942 0.080871 -0.256475 0.058524 0.160130 -0.070154 -0.475710 -0.331254 0.689704 ... 0.556589 1.075111 -0.213106 -0.009587 0.040337 -0.768544 C3L-00674 tumor Male 45
CPT0064890003 0.046885 -0.032622 0.358109 -0.121666 0.105781 -0.220744 -0.426095 -0.019649 -0.124015 0.396851 ... 0.588521 0.192123 0.075653 0.275052 -0.047978 0.172529 C3L-01327 tumor Male 74

5 rows × 10981 columns

Clinical Knowledge Graph Re-analysis

[26]:
from ckg.analytics_core.analytics import analytics
from ckg.analytics_core.viz import viz

from ckg.graphdb_connector import connector
driver = connector.getGraphDatabaseConnectionConfiguration()

from ckg.report_manager import knowledge


from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
c:\users\sande\.conda\envs\ckgenv\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning: The package pingouin is out of date. Your version is 0.3.12, the latest is 0.4.0.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.
  **kwargs
WGCNA functions will not work. Module Rpy2 not installed.
R functions will not work. Module Rpy2 not installed.

Imputation of missing values

We use the k-nearest neighbors (KNN) algorithm to impute missing values in the proteomics data.

[27]:
cptac_proteomics_data = analytics.imputation_KNN(cptac_proteomics_data.reset_index(), drop_cols=['gender', 'age', 'Sample type'], group='Sample type', cutoff=0.5)
[28]:
cptac_proteomics_data.head()
[28]:
A1BG A2M AAAS AACS AADAT AAGAB AAK1 AAMDC AAMP AAR2 ... ZXDC ZYG11B ZYX ZZEF1 ZZZ3 age Sample type gender Case ID (Participant ID) index
0 0.110003 0.432359 0.204539 -0.695101 0.039568 0.133647 -0.524525 0.248913 -0.006661 0.482628 ... 0.138792 0.013802 0.273392 -0.097031 0.641540 59 tumor Female C3L-00365 CPT0002410011
1 0.101098 0.421684 0.195633 -0.704007 0.038556 0.124741 -0.540317 0.240007 -0.015566 0.473723 ... 0.129887 0.004897 0.239006 -0.105937 0.632634 59 tumor Female C3L-00365 CPT0002410011
2 0.457406 0.702453 0.095134 -0.242212 0.072786 0.174392 -0.072472 -0.461447 -0.316991 0.703966 ... 1.089374 -0.198843 -0.016913 0.054600 -0.754282 45 tumor Male C3L-00674 CPT0064650003
3 0.443143 0.695942 0.080871 -0.256475 0.058524 0.160130 -0.070154 -0.475710 -0.331254 0.689704 ... 1.075111 -0.213106 -0.009587 0.040337 -0.768544 45 tumor Male C3L-00674 CPT0064650003
4 0.046885 -0.032622 0.358109 -0.121666 0.105781 -0.220744 -0.426095 -0.019649 -0.124015 0.396851 ... 0.192123 0.075653 0.275052 -0.047978 0.172529 74 tumor Male C3L-01327 CPT0064890003

5 rows × 10789 columns
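
The column count drops from 10981 to 10789 after this step, which suggests that proteins with too many missing values are removed before imputation. The sketch below illustrates the general idea on toy data: filter columns by their fraction of valid values, then fill each remaining gap with the mean of the nearest samples. It is not CKG's implementation; `filter_by_valid_fraction` and `knn_impute` are hypothetical helpers, and how CKG applies `cutoff` within each group is not shown in this notebook.

```python
import numpy as np
import pandas as pd

def filter_by_valid_fraction(df, cutoff=0.5):
    """Keep columns whose fraction of non-missing values is at least `cutoff`."""
    return df.loc[:, df.notna().mean() >= cutoff]

def knn_impute(df, k=2):
    """Fill each missing value with the mean of the k nearest rows that have it."""
    values = df.to_numpy(dtype=float)
    filled = values.copy()
    for i, row in enumerate(values):
        missing = np.isnan(row)
        if not missing.any():
            continue
        # Distance to every other row, using only features observed in both rows.
        dists = []
        for j, other in enumerate(values):
            shared = ~missing & ~np.isnan(other)
            if j != i and shared.any():
                dists.append((np.sqrt(np.mean((row[shared] - other[shared]) ** 2)), j))
        dists.sort()
        for col in np.where(missing)[0]:
            donors = [values[j, col] for _, j in dists if not np.isnan(values[j, col])][:k]
            if donors:
                filled[i, col] = np.mean(donors)
    return pd.DataFrame(filled, index=df.index, columns=df.columns)

# Toy matrix: rows are samples, columns are proteins (all values hypothetical).
toy = pd.DataFrame({'P1': [1.0, 1.1, 0.9, 5.0],
                    'P2': [2.0, 2.1, 1.9, 6.0],
                    'P3': [np.nan, 3.0, 2.8, 9.0],
                    'P4': [np.nan, np.nan, np.nan, 1.0]},
                   index=['s1', 's2', 's3', 's4'])

kept = filter_by_valid_fraction(toy, cutoff=0.5)  # P4 has 25% valid values and is dropped
imputed = knn_impute(kept, k=2)                   # s1/P3 is filled from s2 and s3
print(imputed)
```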

[29]:
cptac_proteomics_data = cptac_proteomics_data.rename({'index':'subject', 'Sample type': 'group', 'Case ID (Participant ID)': 'sample'}, axis=1)

Analysis of Covariance

We analyze the dataset to find differentially regulated proteins between normal and tumor brain tissue samples, taking age and gender as covariates.

[30]:
cptac_proteomics_data = cptac_proteomics_data.sort_values(by=['group'], ascending=False)
[31]:
results = analytics.run_ancova(cptac_proteomics_data, covariates=['age', 'gender'], drop_cols=['sample', 'subject'], subject='subject', group='group', alpha=0.01)
[32]:
results.head()
[32]:
identifier group1 group2 mean(group1) std(group1) mean(group2) std(group2) posthoc T-Statistics posthoc pvalue coef ... log2FC FC F-statistics pvalue padj correction rejected -log10 pvalue Method posthoc padj
0 A1BG normal tumor -0.932974 0.087071 0.109841 0.529097 9.013360 1.086551e-16 1.054802 ... -1.042815 0.485379 81.240659 1.086551e-16 3.058566e-16 FDR correction BH True 15.963950 One-way ancova 3.058566e-16
1 A2M normal tumor -1.070228 0.289921 0.093514 0.576938 9.135619 4.820736e-17 1.177500 ... -1.163741 0.446353 83.459542 4.820736e-17 1.384469e-16 FDR correction BH True 16.316887 One-way ancova 1.384469e-16
2 AAAS normal tumor -0.485488 0.185624 0.136464 0.207010 13.078227 3.568030e-29 0.622227 ... -0.621952 0.649791 171.040029 3.568030e-29 2.066468e-28 FDR correction BH True 28.447572 One-way ancova 2.066468e-28
3 AACS normal tumor 0.722912 0.218220 -0.049294 0.298136 -11.801273 3.937318e-25 -0.782880 ... 0.772206 1.707879 139.270049 3.937318e-25 1.794592e-24 FDR correction BH True 24.404799 One-way ancova 1.794592e-24
4 AADAT normal tumor 1.241883 0.282301 -0.151706 0.351998 -17.257875 1.597045e-42 -1.399042 ... 1.393590 2.627316 297.834255 1.597045e-42 1.945476e-41 FDR correction BH True 41.796683 One-way ancova 1.945476e-41

5 rows × 23 columns
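
To make the `F-statistics` column more concrete, here is a minimal sketch of a one-way ANCOVA for a single protein, written as a comparison of two nested linear models: one with the covariates only, and one with the covariates plus the group. It is an illustration with NumPy and pandas on hypothetical data, not the code behind `analytics.run_ancova`; the helper name `ancova_f_statistic` is an assumption.

```python
import numpy as np
import pandas as pd

def ancova_f_statistic(df, value, group, covariates):
    """F statistic for the group effect after adjusting for the covariates."""
    y = df[value].to_numpy(dtype=float)
    # Reduced model: intercept plus covariates (categorical ones are dummy-coded).
    reduced = pd.get_dummies(df[covariates], drop_first=True).astype(float)
    reduced.insert(0, 'intercept', 1.0)
    # Full model: reduced model plus dummy-coded group membership.
    group_dummies = pd.get_dummies(df[group], drop_first=True).astype(float)
    full = pd.concat([reduced, group_dummies], axis=1)

    def rss(design):
        X = design.to_numpy()
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return float(resid @ resid)

    rss_reduced, rss_full = rss(reduced), rss(full)
    df_group = group_dummies.shape[1]      # numerator degrees of freedom
    df_resid = len(y) - full.shape[1]      # denominator degrees of freedom
    return ((rss_reduced - rss_full) / df_group) / (rss_full / df_resid)

# Hypothetical abundances for one protein in 4 normal and 4 tumor samples.
toy = pd.DataFrame({
    'group': ['normal'] * 4 + ['tumor'] * 4,
    'age': [50, 60, 55, 65, 52, 61, 58, 66],
    'gender': ['Male', 'Female'] * 4,
    'protein': [1.0, 1.2, 0.9, 1.1, -0.1, 0.1, 0.0, -0.2],
})
f_stat = ancova_f_statistic(toy, value='protein', group='group',
                            covariates=['age', 'gender'])
print(f_stat)
```

The p-value follows from the F distribution with `df_group` and `df_resid` degrees of freedom, and the `padj` column in the results table then adjusts these p-values across all proteins (Benjamini-Hochberg, per the `correction` column).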

[33]:
fig = viz.run_volcano(results, identifier='volcano_plot', args={'alpha': 0.01,
                                                                      'fc': 2,
                                                                      'colorscale': 'Blues',
                                                                      'showscale': False,
                                                                      'marker_size': 8,
                                                                      'x_title': 'log2FC',
                                                                      'y_title': '-log10(pvalue)',
                                                                      'num_annotations': 1000,
                                                                      'annotate_list': []})

viz.save_DASH_plot(fig[0], 'volcano_plot_normal_tumor', plot_format='png', directory=data_dir)
iplot(fig[0].figure)

Enrichment and Knowledge Annotation

Using the list of significantly regulated proteins, we use annotations from CKG to determine enriched biological processes and to reveal the knowledge graph associated with these protein hits, focusing on proteins up-regulated in tumor tissue compared to normal. These proteins could be targeted by drug inhibitors to try to reverse the progression of the tumor.

[34]:
annotation_query = '''MATCH (p:Protein)-[r:ASSOCIATED_WITH]-(bp:Biological_process)
                        WHERE p.name IN $protein_list
                        RETURN DISTINCT p.name AS identifier, bp.name AS annotation'''
driver = connector.getGraphDatabaseConnectionConfiguration()
annotation = connector.getCursorData(driver, annotation_query, parameters={'protein_list':results['identifier'].unique().tolist()})
[35]:
enrichment = analytics.run_up_down_regulation_enrichment(results, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', correction='fdr_bh', alpha=0.01, lfc_cutoff=1)

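# Directions are relative to group1 (normal) vs group2 (tumor): in the results table
# log2FC = mean(normal) - mean(tumor), so 'downregulated' means higher abundance in tumor.
upregulated_gbm_enrichment = enrichment['normal~tumor']
upregulated_gbm_enrichment = upregulated_gbm_enrichment[upregulated_gbm_enrichment['direction'] == 'downregulated']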

e = {'normal~tumor': upregulated_gbm_enrichment}
[36]:
figures = viz.get_enrichment_plots(e, identifier='enrichment', args={'width':2200})
i = 0
for fig in figures:
    iplot(fig.figure)
    viz.save_DASH_plot(fig, name="enrichment_"+str(i), plot_format='png', directory=data_dir)
    i += 1
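
The enrichment call above uses `method='fisher'`. As a rough illustration of what such a test measures, the sketch below computes a one-sided Fisher exact p-value (the hypergeometric upper tail) for a single annotation term using only the standard library. The function name, the counts, and the choice of a one-sided test are assumptions for illustration; CKG's own implementation may differ.

```python
from math import comb

def fisher_enrichment_pvalue(hits_in_term, hits_total, term_total, background_total):
    """P(overlap >= hits_in_term) when hits_total proteins are drawn at random
    from a background in which term_total proteins carry the annotation."""
    max_overlap = min(hits_total, term_total)
    tail = sum(comb(term_total, k) * comb(background_total - term_total, hits_total - k)
               for k in range(hits_in_term, max_overlap + 1))
    return tail / comb(background_total, hits_total)

# Hypothetical counts: 10 background proteins, 3 annotated with the term,
# 3 regulated proteins, all 3 of which carry the annotation.
p_all = fisher_enrichment_pvalue(hits_in_term=3, hits_total=3,
                                 term_total=3, background_total=10)
print(p_all)  # 1/120, i.e. about 0.0083
```

One such p-value is obtained per biological process, and the set of p-values is then corrected for multiple testing (`correction='fdr_bh'` in the call above).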

Knowledge Summarization

[37]:
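# log2FC = mean(normal) - mean(tumor), so log2FC < -1 selects proteins at least two-fold higher in tumor.
upregulated_gbm = results[(results['rejected']) & (results['log2FC'] < -1)]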
[38]:
upregulated_gbm.shape
[38]:
(861, 23)
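
The sign convention behind this filter can be checked against the first row of the ANCOVA results table (A1BG). The short sketch below recomputes `log2FC` and `FC` from the two group means reported there; only the variable names are introduced here.

```python
# Group means for A1BG, copied from the results table above.
mean_normal = -0.932974   # mean(group1), group1 = normal
mean_tumor = 0.109841     # mean(group2), group2 = tumor

# The table's log2FC equals mean(group1) - mean(group2); FC is 2 ** log2FC.
log2fc = mean_normal - mean_tumor
fc = 2 ** log2fc

# A log2FC below -1 therefore means the protein is more abundant in tumor.
higher_in_tumor = log2fc < -1
print(round(log2fc, 6), round(fc, 4), higher_in_tumor)
```

The recomputed values agree with the `log2FC` (-1.042815) and `FC` (0.485379) columns for A1BG.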
[39]:
kn = knowledge.Knowledge(identifier='GBM', data=None)

kn.annotate_list(query_list=upregulated_gbm['identifier'].tolist(),
                 entity_type='protein',
                 queries_file=None,
                 attribute='name',
                 diseases=['glioblastoma'],
                 entities=None)
[40]:
kn.generate_report(visualizations=['sankey', 'network'],
                   summarize=True,
                   method='betweenness',
                   inplace=True)
[41]:
kn.report.visualize_report(environment='notebook')[0]
[42]:
kn.save_report(data_dir)
[58]:
kn.keep_nodes
[58]:
['glioblastoma']

Finding Drug Inhibitors

We use the knowledge in CKG to filter the list of up-regulated proteins in the tumor tissue compared to normal to explore possible drug inhibitors that could reverse the progression of the tumor.

We first keep only the significantly up-regulated proteins that have already been associated with GBM, using the protein-disease associations in CKG.

We then use the filtered list to find inhibitors stored in CKG as drug-target relationships.

From the resulting list of candidate drugs, we further filter the results by looking for publications that mention the drug together with the disease and the protein it targets.

This last step provides a final list of connected protein hits, candidate drug inhibitors, and publications that can be visualized as a CKG knowledge subgraph.

[43]:
query = '''MATCH (p:Protein)-[r:ASSOCIATED_WITH]-(d:Disease{name:"glioblastoma"})
            WHERE p.name IN $protein_list AND r.score > 1.5 RETURN p.name, d.name, r.score'''

res = connector.getCursorData(driver, query, parameters={'protein_list': upregulated_gbm['identifier'].tolist()})
[44]:
res.head()
[44]:
d.name p.name r.score
0 glioblastoma CD276 1.602
1 glioblastoma ANXA5 2.290
2 glioblastoma THBS1 1.615
3 glioblastoma IGFBP2 1.835
4 glioblastoma MMP9 2.218
[45]:
len(res['p.name'].tolist())
[45]:
44
[46]:
query = '''MATCH (p:Protein)-[r:ACTS_ON{action:"inhibition"}]-(d:Drug)
            WHERE p.name IN $protein_list AND r.score > 0.7
            WITH p, d, r, SIZE((:Protein)-[:ACTS_ON]-(d)) as degree WHERE degree < 10
            RETURN p.name, d.name, r.score'''

res = connector.getCursorData(driver, query, parameters={'protein_list': res['p.name'].tolist()})
[47]:
res.head()
[47]:
d.name p.name r.score
0 Pirfenidone MMP9 0.8
1 3,5-Dimethyl-1-(3-Nitrophenyl)-1h-Pyrazole-4-C... MMP9 0.8
2 Hydroflumethiazide CA9 0.8
3 Pirfenidone ALB 0.8
4 (10E,12Z)-octadecadienoic acid MMP2 0.8
[48]:
res['d.name'].unique().shape
[48]:
(28,)
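
The drug query applies three filters: the relationship must be an inhibition, its score must exceed 0.7, and the drug must have fewer than 10 `ACTS_ON` relationships to proteins overall, which excludes promiscuous drugs. The sketch below reproduces that logic in pandas on a hypothetical edge list; the protein and drug names, the column names, and the helper `candidate_inhibitors` are illustrative assumptions, not CKG data or API.

```python
import pandas as pd

# Hypothetical drug-target edges (stand-in for all ACTS_ON relationships).
all_edges = pd.DataFrame({
    'protein': ['P1', 'P2', 'P3', 'P1', 'P4', 'P2'],
    'drug': ['DrugA', 'DrugA', 'DrugA', 'DrugB', 'DrugB', 'DrugC'],
    'action': ['inhibition', 'inhibition', 'activation',
               'inhibition', 'inhibition', 'inhibition'],
    'score': [0.8, 0.9, 0.8, 0.8, 0.6, 0.5],
})

def candidate_inhibitors(edges, protein_list, min_score=0.7, max_degree=10):
    """Inhibition edges to the listed proteins, for drugs with few targets overall."""
    # Degree counts every drug-protein edge, whatever its action or score.
    degree = edges.groupby('drug')['protein'].transform('count')
    mask = (edges['action'].eq('inhibition')
            & edges['protein'].isin(protein_list)
            & edges['score'].gt(min_score)
            & degree.lt(max_degree))
    return edges.loc[mask, ['protein', 'drug', 'score']].reset_index(drop=True)

# max_degree=3 keeps the toy example small; the notebook query uses 10.
hits = candidate_inhibitors(all_edges, ['P1', 'P2'], max_degree=3)
print(hits)
```

Here `DrugA` is excluded because it has three targets, and `DrugC` because its score is below the threshold, leaving only the `P1`/`DrugB` edge.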
[49]:
project_knowledge = knowledge.Knowledge(identifier='targets',
                              data=None,
                              nodes={},
                              relationships={},
                              queries_file=None,
                              colors={},
                              graph=None,
                              report={})
project_knowledge.generate_knowledge_from_edgelist(edgelist=res,
                                                   entity1='Protein',
                                                   entity2='Drug',
                                                   source='p.name',
                                                   target='d.name',
                                                   rtype='associated_with',
                                                   weight='r.score')
[50]:
query = '''MATCH (d:Drug)-[:MENTIONED_IN_PUBLICATION]-(p:Publication)-[:MENTIONED_IN_PUBLICATION]-(di:Disease{name:"glioblastoma"})
            WHERE d.name IN $drug_list
            WITH d, p
            MATCH (pro:Protein)-[:MENTIONED_IN_PUBLICATION]-(p)
            WHERE pro.name IN $protein_list
            RETURN p.id, d.name, d.class, pro.name'''

res = connector.getCursorData(driver, query, parameters={'drug_list': res['d.name'].tolist(),
                                                         'protein_list':res['p.name'].unique().tolist()})
[51]:
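# Rename by column name rather than by position, so the labels do not depend on column order.
res = res.rename(columns={'d.class': 'Drug class', 'd.name': 'Drug name', 'p.id': 'Publication', 'pro.name': 'Protein name'})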

res['Publication'] = ["PMID:"+p for p in res['Publication'].tolist()]

res[['Drug class', 'Drug name', 'Publication', 'Protein name']]
[51]:
Drug class Drug name Publication Protein name
0 Azoles 3,5-Dimethyl-1-(3-Nitrophenyl)-1h-Pyrazole-4-C... PMID:30470262 MMP9
1 Pyridines and derivatives Pirfenidone PMID:31282197 MMP9
2 Purine nucleosides Nelarabine PMID:26899176 MMP9
3 Pyridines and derivatives Pirfenidone PMID:31282197 CA9
4 Benzene and substituted derivatives Cetirizine PMID:31022935 ALB
5 Carboxylic acids and derivatives Tranexamic acid PMID:22539956 ALB
6 Pyridines and derivatives Pirfenidone PMID:31117237 FN1
7 Purine nucleosides Nelarabine PMID:31950163 FN1
8 Pyridines and derivatives Pirfenidone PMID:31282197 FN1
9 Pyridines and derivatives Pirfenidone PMID:29996062 FN1
10 Pyridines and derivatives Pirfenidone PMID:31547567 FN1
11 Azoles 3,5-Dimethyl-1-(3-Nitrophenyl)-1h-Pyrazole-4-C... PMID:31231472 FN1
12 Pyridines and derivatives Pirfenidone PMID:31231472 FN1
13 Pyridines and derivatives Pirfenidone PMID:29038232 FN1
14 Benzene and substituted derivatives N-(4-sulfamoylphenyl)-1H-indazole-3-carboxamide PMID:25707963 FN1
15 Pyridines and derivatives Pirfenidone PMID:25026295 FN1
16 Benzothiophenes 4-Iodobenzo[B]Thiophene-2-Carboxamidine PMID:21976520 FN1
17 Pyridines and derivatives Pirfenidone PMID:29996062 ICAM1
18 Azoles 3,5-Dimethyl-1-(3-Nitrophenyl)-1h-Pyrazole-4-C... PMID:31231472 CDK1
19 Pyridines and derivatives Pirfenidone PMID:31231472 CDK1
20 Benzothiophenes 4-Iodobenzo[B]Thiophene-2-Carboxamidine PMID:21976520 PLG
21 Carboxylic acids and derivatives Mimosine PMID:20226717 PLG
22 Pyridines and derivatives Pirfenidone PMID:29996062 CCL2
23 Pyridines and derivatives Pirfenidone PMID:25026295 CHI3L1
[52]:
res.to_csv(os.path.join(data_dir, 'studies_drugs.tsv'), sep='\t', header=True, index=False, doublequote=None)
[53]:
len(res['Drug name'].tolist())
[53]:
24
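
The count of 24 is the number of rows (drug, publication, protein combinations), not the number of distinct drugs. A per-drug summary may be easier to read; the sketch below shows one way to build it with `groupby`, using five rows copied from the table above. The column names `n_publications` and `n_proteins` are introduced here for illustration.

```python
import pandas as pd

# Five rows copied from the evidence table above.
evidence = pd.DataFrame({
    'Drug name': ['Pirfenidone', 'Nelarabine', 'Pirfenidone', 'Cetirizine', 'Pirfenidone'],
    'Publication': ['PMID:31282197', 'PMID:26899176', 'PMID:31282197',
                    'PMID:31022935', 'PMID:29996062'],
    'Protein name': ['MMP9', 'MMP9', 'CA9', 'ALB', 'ICAM1'],
})

# Count distinct publications and distinct target proteins for each drug.
summary = (evidence.groupby('Drug name')
           .agg(n_publications=('Publication', 'nunique'),
                n_proteins=('Protein name', 'nunique'))
           .sort_values('n_publications', ascending=False))
print(summary)
```

On the real table, the same call would be `res.groupby('Drug name')` with the columns named in the cell above.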
[54]:
res['r.score'] = 1
[55]:
project_knowledge.generate_knowledge_from_edgelist(edgelist=res,
                                                   entity1='Drug',
                                                   entity2='Publication',
                                                   source='Drug name',
                                                   target='Publication',
                                                   rtype='associated_with',
                                                   weight='r.score')

project_knowledge.generate_knowledge_from_edgelist(edgelist=res,
                                                   entity1='Protein',
                                                   entity2='Publication',
                                                   source='Protein name',
                                                   target='Publication',
                                                   rtype='associated_with',
                                                   weight='r.score')
[60]:
project_knowledge.generate_report(visualizations=['sankey', 'network'], summarize=False)
project_knowledge.report.visualize_report(environment='notebook')[0]
[57]:
project_knowledge.save_report(data_dir)