CPTAC Glioblastoma (GBM) Discovery Study¶
Glioblastoma is the most common type of brain cancer in adults, with approximately 14,000 new diagnoses each year (NCI Cancer Currents, 2017). Tumors from patients with GBM were molecularly profiled by The Cancer Genome Atlas (TCGA), and these studies identified somatic mutations associated with essential signaling pathways (Nature 2008, Cell 2013). To elucidate the proteome, phosphoproteome, and acetylome profiles of GBM tumors, tissue from 99 patients was subjected to mass spectrometry analysis using 11-plexed isobaric tandem mass tags (TMT-11). Normal brain samples from 10 participants of the Genotype-Tissue Expression (GTEx) program were also analyzed. CPTAC GBM discovery study
Exploring CPTAC GBM proteomics data with CKG¶
In this notebook, we analyze the GBM proteomics data from CPTAC to identify significant proteomics differences between tumor and normal brain tissue. The objective is to then annotate the potentially relevant proteins with knowledge from CKG to interpret their involvement in the progression of the disease.
First, we format the data, mainly with the pandas library, so that it can be used within CKG’s platform, and then run a sequence of analyses:
Data preprocessing
Differential regulation
Knowledge annotation
Drug annotation
The data was downloaded from: https://cptac-data-portal.georgetown.edu/study-summary/S048
“Data used in this analysis were generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH).”
[1]:
import os
import pandas as pd
CPTAC Files Used¶
[2]:
data_dir = '/Users/sande/work/Multi_Omics_Analytics_Department/GBM_FilipMundt/CPTAC_Phase_III_Data/CPTAC_GBM_S048'
cptac_proteome_report_file = os.path.join(data_dir,'CPTAC_GBM_Proteome_CDAP_Protein_Report.r1/CPTAC3_Glioblastoma_Multiforme_Proteome.tmt11.tsv')
cptac_clinical_file = os.path.join(data_dir,'CPTAC_GBM_metadata/S048_CPTAC_GBM_Discovery_Cohort_Clinical_Data_Dec2019_r1.xlsx')
cptac_sample_mapping_file = os.path.join(data_dir,'CPTAC_GBM_metadata/S048_CPTAC_GBM_Discovery_Cohort_TMT11_CaseID_SampleID_AliquotID_Map_Dec2019_r1.xlsx')
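The paths above are hard-coded to one machine, so a quick existence check before loading can save a confusing traceback later. The following is a minimal sketch; the helper name `missing_files` and the temporary demo files are hypothetical and not part of the original notebook.

```python
import os
import tempfile
from pathlib import Path


def missing_files(paths):
    """Return the paths (as strings) that do not point to an existing file."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


# Demo with throwaway files instead of the real CPTAC downloads.
with tempfile.TemporaryDirectory() as demo_dir:
    present = os.path.join(demo_dir, 'report.tsv')
    absent = os.path.join(demo_dir, 'clinical.xlsx')
    Path(present).touch()
    missing = missing_files([present, absent])
    n_missing = len(missing)
    only_absent_reported = missing == [absent]
```

In the notebook itself, the same helper could be called with `cptac_proteome_report_file`, `cptac_clinical_file` and `cptac_sample_mapping_file`.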
Clinical Data¶
[3]:
cptac_clinical_data = pd.read_excel(cptac_clinical_file, sheet_name='Clinical_Attributes')
[4]:
cptac_clinical_data.head()
[4]:
tumor_code | case_id | type_of_analyzed_samples | gender | age | height_at_time_of_surgery_cm | weight_at_time_of_surgery_kg | BMI | race | ethnicity | ... | measure_of_success_of_outcome_at_completion_of_this_follow_up_form | tumor_status_at_date_of_last_contact_or_death | vital_status_at_date_of_last_contact | cause_of_death | days_from_date_of_initial_pathologic_diagnosis_to_date_of_death | performance_status_score_eastern_cooperative_oncology_group | performance_status_score_karnofsky_score_preoperative | performance_status_scale_timing | days_from_date_of_initial_pathologic_diagnosis_to_date_of_last_contact | is_this_patient_lost_to_follow_up | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | GBM | C3L-00104 | Tumor | Male | 58 | 188.0 | 115.0 | 32.54 | White | Not-Hispanic or Latino | ... | Patient Deceased | With Tumor | Deceased | NaN | 129.0 | Not Evaluated: Not provided or available | Not Evaluated: Not provided or available | Not Evaluated: Not provided or available | 128.0 | No |
1 | GBM | C3L-00365 | Tumor | Female | 59 | 162.0 | 54.0 | 20.61 | White | Not-Hispanic or Latino | ... | Patient Deceased | NaN | Deceased | Unknown; patient entered hospice care. At last... | 322.0 | Not Evaluated: Not provided or available | Not Evaluated: Not provided or available | Not Evaluated: Not provided or available | 280.0 | No |
2 | GBM | C3L-00674 | Tumor | Male | 45 | 193.0 | 102.0 | 27.44 | White | Not-Hispanic or Latino | ... | Patient Deceased | NaN | Deceased | Progression of glioblastoma | 478.0 | Not Evaluated: Not provided or available | 90: Able to carry on normal activity; minor si... | Post-Adjuvant Therapy | 385.0 | No |
3 | GBM | C3L-00677 | Tumor | Female | 69 | 164.0 | 52.0 | 19.32 | White | Not-Hispanic or Latino | ... | Patient Deceased | NaN | Deceased | Progression of glioblastoma + Multiple organ f... | 154.0 | 1: Symptomatic; Restricted in physically stren... | 70: Cares for self; unable to carry on normal ... | Post-Adjuvant Therapy | 154.0 | No |
4 | GBM | C3L-01040 | Tumor | Male | 77 | 170.0 | 70.0 | 24.22 | NaN | NaN | ... | Persistent Disease | With Tumor | Living | NaN | NaN | 1: Symptomatic; Restricted in physically stren... | 70: Cares for self; unable to carry on normal ... | Post-Adjuvant Therapy | 608.0 | Yes |
5 rows × 43 columns
[5]:
cptac_clinical_data.groupby(['type_of_analyzed_samples', 'gender']).count()[['case_id']]
[5]:
type_of_analyzed_samples | gender | case_id |
---|---|---|
Normal | Female | 5 |
Normal | Male | 5 |
Tumor | Female | 44 |
Tumor | Male | 56 |
10 normal brain samples (5 female, 5 male) and 100 tumor samples (44 female, 56 male).
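The same kind of summary can also be produced as a two-way table with `pd.crosstab`, which adds row and column totals on request. This is a small sketch on toy rows, not on the CPTAC clinical file; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the clinical table (values are illustrative only).
toy = pd.DataFrame({
    'type_of_analyzed_samples': ['Normal', 'Normal', 'Tumor', 'Tumor', 'Tumor'],
    'gender': ['Female', 'Male', 'Female', 'Male', 'Male'],
})

# Rows: sample type, columns: gender, with 'All' margins for the totals.
counts = pd.crosstab(toy['type_of_analyzed_samples'], toy['gender'], margins=True)
print(counts)
```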
[6]:
cptac_clinical_data.shape
[6]:
(110, 43)
[7]:
list_of_samples = cptac_clinical_data['case_id'].tolist()
[8]:
len(list_of_samples)
[8]:
110
[9]:
cptac_clinical_data[cptac_clinical_data['case_id'].isin(list_of_samples)].groupby('type_of_analyzed_samples').count()
[9]:
tumor_code | case_id | gender | age | height_at_time_of_surgery_cm | weight_at_time_of_surgery_kg | BMI | race | ethnicity | ethnicity_race_ancestry_identified | ... | measure_of_success_of_outcome_at_completion_of_this_follow_up_form | tumor_status_at_date_of_last_contact_or_death | vital_status_at_date_of_last_contact | cause_of_death | days_from_date_of_initial_pathologic_diagnosis_to_date_of_death | performance_status_score_eastern_cooperative_oncology_group | performance_status_score_karnofsky_score_preoperative | performance_status_scale_timing | days_from_date_of_initial_pathologic_diagnosis_to_date_of_last_contact | is_this_patient_lost_to_follow_up | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
type_of_analyzed_samples | |||||||||||||||||||||
Normal | 0 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 4 | 0 | ... | 0 | 0 | 10 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |
Tumor | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 29 | 31 | 100 | ... | 80 | 73 | 94 | 31 | 49 | 94 | 94 | 94 | 94 | 94 |
2 rows × 42 columns
Sample Identifier Mapping¶
We use the file S048_CPTAC_GBM_Discovery_Cohort_TMT11_CaseID_SampleID_AliquotID_Map_Dec2019_r1.xlsx to match the clinical metadata (Case ID) and proteomics data (Aliquot ID).
We keep the sample type (normal, tumor) from the mapping file as the grouping variable, together with gender and age from the clinical metadata, which serve as possible covariates in the differential regulation analysis.
[10]:
cptac_sample_mapping = pd.read_excel(cptac_sample_mapping_file, comment='#', header=2)
[11]:
cptac_sample_mapping.head()
[11]:
Batch | TMT plex | TMT channel | Alias | Case ID (Participant ID) | Parent Sample ID(s) | Aliquot ID | Sample type | OCT | TCIA Slide ID | TCIA Image links | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 126 | B1S1 | ref | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 1 | 1 | 127N | B1S1 | GTEX-Y8DK-0011-R10A-SM-HAKY1 | GTEX-Y8DK-0011-R10A-SM-HAKY1 | CPT0204410003 | normal | No | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 1 | 1 | 127C | B1S1 | C3N-03183 | C3N-03183-02, C3N-03183-03 | CPT0206670004 | tumor | No | C3N-03183-22, C3N-03183-23 | https://pathology.cancerimagingarchive.net/pat... | https://pathology.cancerimagingarchive.net/pat... | NaN | NaN | NaN |
3 | 1 | 1 | 128N | B1S1 | C3N-01505 | C3N-01505-01 | CPT0089150003 | tumor | No | C3N-01505-21 | https://pathology.cancerimagingarchive.net/pat... | NaN | NaN | NaN | NaN |
4 | 1 | 1 | 128C | B1S1 | C3N-03188 | C3N-03188-02 | CPT0207030003 | tumor | No | C3N-03188-22 | https://pathology.cancerimagingarchive.net/pat... | NaN | NaN | NaN | NaN |
[12]:
cptac_sample_mapping = cptac_sample_mapping[cptac_sample_mapping['Case ID (Participant ID)'].isin(list_of_samples)]
[13]:
cptac_sample_mapping.shape
[13]:
(110, 15)
[14]:
metadata = cptac_sample_mapping[['Case ID (Participant ID)','Aliquot ID', 'Sample type']].set_index('Case ID (Participant ID)')
metadata = metadata.join(cptac_clinical_data.set_index('case_id')[['gender', 'age']]).reset_index()
[15]:
metadata.head()
[15]:
Case ID (Participant ID) | Aliquot ID | Sample type | gender | age | |
---|---|---|---|---|---|
0 | GTEX-Y8DK-0011-R10A-SM-HAKY1 | CPT0204410003 | normal | Male | 62 |
1 | C3N-03183 | CPT0206670004 | tumor | Male | 53 |
2 | C3N-01505 | CPT0089150003 | tumor | Male | 74 |
3 | C3N-03188 | CPT0207030003 | tumor | Male | 54 |
4 | C3L-02984 | CPT0190240004 | tumor | Male | 34 |
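When combining the mapping and clinical tables, it can help to verify that every case matches exactly one clinical record. The sketch below uses `DataFrame.merge` with `validate` and `indicator` on a toy excerpt. The two case rows reuse values printed elsewhere in this notebook, and the `ref` row stands in for the reference channel, which has no clinical record. The notebook itself uses `join`, so this is an alternative, not the original method.

```python
import pandas as pd

mapping = pd.DataFrame({
    'Case ID (Participant ID)': ['C3L-00365', 'C3L-00674', 'ref'],
    'Aliquot ID': ['CPT0002410011', 'CPT0064650003', None],
})
clinical = pd.DataFrame({
    'case_id': ['C3L-00365', 'C3L-00674'],
    'gender': ['Female', 'Male'],
    'age': [59, 45],
})

# validate raises if a key is duplicated on either side;
# indicator adds a '_merge' column showing where each row came from.
merged = mapping.merge(clinical,
                       left_on='Case ID (Participant ID)',
                       right_on='case_id',
                       how='left',
                       validate='one_to_one',
                       indicator=True)

unmatched = merged.loc[merged['_merge'] == 'left_only',
                       'Case ID (Participant ID)'].tolist()
```

Rows listed in `unmatched` have no clinical metadata and would end up with missing `gender` and `age`.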
[16]:
proteomics_sample_ids = cptac_sample_mapping['Aliquot ID'].tolist()
[17]:
proteomics_sample_ids
[17]:
['CPT0204410003',
'CPT0206670004',
'CPT0089150003',
'CPT0207030003',
'CPT0190240004',
'CPT0161730003',
'CPT0218330004',
'CPT0104220003',
'CPT0182580003',
'CPT0167860004',
'CPT0205670004',
'CPT0093450003',
'CPT0002410011',
'CPT0189460003',
'CPT0167750004',
'CPT0218690004',
'CPT0217060003',
'CPT0205890003',
'CPT0204420003',
'CPT0189570004',
'CPT0205780003',
'CPT0168720003',
'CPT0093550003',
'CPT0217190003',
'CPT0218770003',
'CPT0204390003',
'CPT0189250003',
'CPT0087950003',
'CPT0190360004',
'CPT0224330003',
'CPT0217880003',
'CPT0127420003',
'CPT0218960004',
'CPT0168480003',
'CPT0204360003',
'CPT0221180003',
'CPT0218830004',
'CPT0225760003',
'CPT0168270003',
'CPT0064650003',
'CPT0206880003',
'CPT0168380003',
'CPT0206000004',
'CPT0167530003',
'CPT0189850004',
'CPT0196850003',
'CPT0206560003',
'CPT0219080004',
'CPT0204380003',
'CPT0224600003',
'CPT0125570003',
'CPT0217000004',
'CPT0207090003',
'CPT0217430008',
'CPT0168590003',
'CPT0186100003',
'CPT0168080003',
'CPT0162020003',
'CPT0201710003',
'CPT0204330003',
'CPT0224390004',
'CPT0209440003',
'CPT0218890004',
'CPT0208980003',
'CPT0123530003',
'CPT0071100003',
'CPT0182550003',
'CPT0217710008',
'CPT0125510003',
'CPT0127480003',
'CPT0167640003',
'CPT0087680003',
'CPT0224540004',
'CPT0167970003',
'CPT0162100003',
'CPT0189750004',
'CPT0204350003',
'CPT0104330003',
'CPT0162140003',
'CPT0078580003',
'CPT0204400003',
'CPT0093510003',
'CPT0087570003',
'CPT0216920008',
'CPT0228220003',
'CPT0175060003',
'CPT0206330003',
'CPT0217100003',
'CPT0199770003',
'CPT0189650004',
'CPT0125220003',
'CPT0206450003',
'CPT0182500003',
'CPT0225730003',
'CPT0093590003',
'CPT0204340003',
'CPT0093360003',
'CPT0206780003',
'CPT0218670003',
'CPT0087730003',
'CPT0064890003',
'CPT0206230003',
'CPT0171580008',
'CPT0168830003',
'CPT0079790003',
'CPT0092440003',
'CPT0205570003',
'CPT0206110003',
'CPT0204370003',
'CPT0205450004']
Proteomics Data¶
[18]:
cptac_proteomics_data = pd.read_csv(cptac_proteome_report_file, sep='\t')
[19]:
cptac_proteomics_data.head()
[19]:
Gene | CPT0204410003 Log Ratio | CPT0204410003 Unshared Log Ratio | CPT0206670004 Log Ratio | CPT0206670004 Unshared Log Ratio | CPT0089150003 Log Ratio | CPT0089150003 Unshared Log Ratio | CPT0207030003 Log Ratio | CPT0207030003 Unshared Log Ratio | CPT0190240004 Log Ratio | ... | CPT0204370003 Log Ratio | CPT0204370003 Unshared Log Ratio | CPT0205450004 Log Ratio | CPT0205450004 Unshared Log Ratio | NCBIGeneID | Authority | Description | Organism | Chromosome | Locus | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Mean | 0.533903 | 0.536714 | 0.634488 | 0.638574 | -0.042532 | -0.030967 | 0.352712 | 0.358905 | 0.577073 | ... | 0.412576 | 0.415405 | 0.502734 | 0.499886 | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Median | 0.430830 | 0.429394 | 0.698806 | 0.702917 | 0.002671 | 0.012689 | 0.359163 | 0.363725 | 0.607895 | ... | 0.307880 | 0.303315 | 0.529097 | 0.533102 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | StdDev | 1.028880 | 1.054129 | 0.685146 | 0.710699 | 0.679727 | 0.705571 | 0.618481 | 0.665532 | 0.808809 | ... | 0.898016 | 0.921603 | 0.561767 | 0.569397 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | A1BG | -0.878237 | -0.876801 | 0.025279 | 0.021168 | 0.451471 | 0.441454 | -0.206660 | -0.211223 | -0.835626 | ... | -0.816745 | -0.812180 | 0.504218 | 0.500214 | 1.0 | HGNC:5 | alpha-1-B glycoprotein | Homo sapiens | 19 | 19q13.43 |
4 | A2M | -1.171150 | -1.129142 | 0.156104 | 0.157726 | 0.476275 | 0.459270 | -0.602418 | -0.601337 | -0.359150 | ... | -0.803398 | -0.765856 | 0.553261 | 0.558857 | 2.0 | HGNC:7 | alpha-2-macroglobulin | Homo sapiens | 12 | 12p13.31 |
5 rows × 227 columns
[20]:
cols = {c:c.split(' ')[0] for c in cptac_proteomics_data.columns if c.split(' ')[0] in proteomics_sample_ids}
[21]:
cptac_proteomics_data = cptac_proteomics_data[['Gene'] + list(cols.keys())].set_index('Gene').drop(['Mean', 'Median', 'StdDev'], axis=0)
cptac_proteomics_data = cptac_proteomics_data.rename(cols, axis=1)
[22]:
cptac_proteomics_data.head()
[22]:
CPT0204410003 | CPT0204410003 | CPT0206670004 | CPT0206670004 | CPT0089150003 | CPT0089150003 | CPT0207030003 | CPT0207030003 | CPT0190240004 | CPT0190240004 | ... | CPT0092440003 | CPT0092440003 | CPT0205570003 | CPT0205570003 | CPT0206110003 | CPT0206110003 | CPT0204370003 | CPT0204370003 | CPT0205450004 | CPT0205450004 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gene | |||||||||||||||||||||
A1BG | -0.878237 | -0.876801 | 0.025279 | 0.021168 | 0.451471 | 0.441454 | -0.206660 | -0.211223 | -0.835626 | -0.831638 | ... | -0.041234 | -0.051084 | 0.762486 | 0.757234 | 0.797139 | 0.794111 | -0.816745 | -0.812180 | 0.504218 | 0.500214 |
A2M | -1.171150 | -1.129142 | 0.156104 | 0.157726 | 0.476275 | 0.459270 | -0.602418 | -0.601337 | -0.359150 | -0.354726 | ... | 0.235061 | 0.232168 | 0.445184 | 0.440900 | 0.762671 | 0.752278 | -0.803398 | -0.765856 | 0.553261 | 0.558857 |
AAAS | -0.406262 | -0.404826 | 0.471849 | 0.467738 | 0.091538 | 0.081520 | -0.125625 | -0.130188 | 0.245751 | 0.249738 | ... | 0.001848 | -0.008002 | 0.082095 | 0.076843 | 0.108620 | 0.105592 | -0.278136 | -0.273572 | 0.211318 | 0.207313 |
AACS | 0.926254 | 0.927690 | 0.285570 | 0.281459 | -0.065078 | -0.075096 | -0.183277 | -0.187840 | -0.004768 | -0.000780 | ... | -0.078248 | -0.088098 | 0.139698 | 0.134446 | 0.151606 | 0.148578 | 0.288503 | 0.293067 | 0.053601 | 0.049597 |
AADAT | 1.317162 | 1.318598 | -0.497299 | -0.501410 | 0.156067 | 0.146049 | -0.174811 | -0.179374 | 0.185505 | 0.189493 | ... | 0.614136 | 0.604287 | 0.225270 | 0.220018 | 0.099918 | 0.096891 | 1.213109 | 1.217673 | -0.100376 | -0.104380 |
5 rows × 220 columns
[23]:
cptac_proteomics_data.shape
[23]:
(10977, 220)
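Note that the report contains two columns per aliquot ("Log Ratio" and "Unshared Log Ratio"), and the renaming in cell [20] maps both to the same aliquot ID. That is why the table has 220 columns for 110 samples and why column names repeat. The notebook does not say whether keeping both is intended. If one column per sample is wanted, a sketch like the following could select only the "Log Ratio" columns; the toy frame reuses a few values from the output above.

```python
import pandas as pd

report = pd.DataFrame({
    'Gene': ['A1BG', 'A2M'],
    'CPT0204410003 Log Ratio': [-0.878237, -1.171150],
    'CPT0204410003 Unshared Log Ratio': [-0.876801, -1.129142],
    'CPT0206670004 Log Ratio': [0.025279, 0.156104],
    'CPT0206670004 Unshared Log Ratio': [0.021168, 0.157726],
})

suffix = ' Log Ratio'
unshared_suffix = ' Unshared Log Ratio'

# Map 'ALIQUOT Log Ratio' -> 'ALIQUOT', skipping the 'Unshared' variant.
keep = {c: c[:-len(suffix)]
        for c in report.columns
        if c.endswith(suffix) and not c.endswith(unshared_suffix)}

one_per_sample = report.set_index('Gene')[list(keep)].rename(columns=keep)
```

Applying this instead of cell [20] would change the shapes and row counts shown in the outputs that follow, so it is offered only as an option.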
Clinical and Proteomics Data¶
[24]:
cptac_proteomics_data = cptac_proteomics_data.transpose().join(metadata.set_index('Aliquot ID'))
[25]:
cptac_proteomics_data.head()
[25]:
A1BG | A2M | AAAS | AACS | AADAT | AAGAB | AAK1 | AAMDC | AAMP | AAR2 | ... | ZWILCH | ZXDC | ZYG11B | ZYX | ZZEF1 | ZZZ3 | Case ID (Participant ID) | Sample type | gender | age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CPT0002410011 | 0.110003 | 0.432359 | 0.204539 | -0.695101 | NaN | 0.133647 | -0.524525 | 0.248913 | -0.006661 | 0.482628 | ... | -0.043140 | 0.138792 | 0.013802 | 0.273392 | -0.097031 | 0.641540 | C3L-00365 | tumor | Female | 59 |
CPT0002410011 | 0.101098 | 0.421684 | 0.195633 | -0.704007 | NaN | 0.124741 | -0.540317 | 0.240007 | -0.015566 | 0.473723 | ... | -0.052046 | 0.129887 | 0.004897 | 0.239006 | -0.105937 | 0.632634 | C3L-00365 | tumor | Female | 59 |
CPT0064650003 | 0.457406 | 0.702453 | 0.095134 | -0.242212 | 0.072786 | 0.174392 | -0.072472 | -0.461447 | -0.316991 | 0.703966 | ... | 0.570851 | 1.089374 | -0.198843 | -0.016913 | 0.054600 | -0.754282 | C3L-00674 | tumor | Male | 45 |
CPT0064650003 | 0.443143 | 0.695942 | 0.080871 | -0.256475 | 0.058524 | 0.160130 | -0.070154 | -0.475710 | -0.331254 | 0.689704 | ... | 0.556589 | 1.075111 | -0.213106 | -0.009587 | 0.040337 | -0.768544 | C3L-00674 | tumor | Male | 45 |
CPT0064890003 | 0.046885 | -0.032622 | 0.358109 | -0.121666 | 0.105781 | -0.220744 | -0.426095 | -0.019649 | -0.124015 | 0.396851 | ... | 0.588521 | 0.192123 | 0.075653 | 0.275052 | -0.047978 | 0.172529 | C3L-01327 | tumor | Male | 74 |
5 rows × 10981 columns
Clinical Knowledge Graph Re-analysis¶
[26]:
from ckg.analytics_core.analytics import analytics
from ckg.analytics_core.viz import viz
from ckg.graphdb_connector import connector
driver = connector.getGraphDatabaseConnectionConfiguration()
from ckg.report_manager import knowledge
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
c:\users\sande\.conda\envs\ckgenv\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning: The package pingouin is out of date. Your version is 0.3.12, the latest is 0.4.0.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.
**kwargs
WGCNA functions will not work. Module Rpy2 not installed.
R functions will not work. Module Rpy2 not installed.
Imputation of missing values¶
We use the k-nearest neighbors (KNN) algorithm to impute missing values in the proteomics data.
[27]:
cptac_proteomics_data = analytics.imputation_KNN(cptac_proteomics_data.reset_index(), drop_cols=['gender', 'age', 'Sample type'], group='Sample type', cutoff=0.5)
[28]:
cptac_proteomics_data.head()
[28]:
A1BG | A2M | AAAS | AACS | AADAT | AAGAB | AAK1 | AAMDC | AAMP | AAR2 | ... | ZXDC | ZYG11B | ZYX | ZZEF1 | ZZZ3 | age | Sample type | gender | Case ID (Participant ID) | index | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.110003 | 0.432359 | 0.204539 | -0.695101 | 0.039568 | 0.133647 | -0.524525 | 0.248913 | -0.006661 | 0.482628 | ... | 0.138792 | 0.013802 | 0.273392 | -0.097031 | 0.641540 | 59 | tumor | Female | C3L-00365 | CPT0002410011 |
1 | 0.101098 | 0.421684 | 0.195633 | -0.704007 | 0.038556 | 0.124741 | -0.540317 | 0.240007 | -0.015566 | 0.473723 | ... | 0.129887 | 0.004897 | 0.239006 | -0.105937 | 0.632634 | 59 | tumor | Female | C3L-00365 | CPT0002410011 |
2 | 0.457406 | 0.702453 | 0.095134 | -0.242212 | 0.072786 | 0.174392 | -0.072472 | -0.461447 | -0.316991 | 0.703966 | ... | 1.089374 | -0.198843 | -0.016913 | 0.054600 | -0.754282 | 45 | tumor | Male | C3L-00674 | CPT0064650003 |
3 | 0.443143 | 0.695942 | 0.080871 | -0.256475 | 0.058524 | 0.160130 | -0.070154 | -0.475710 | -0.331254 | 0.689704 | ... | 1.075111 | -0.213106 | -0.009587 | 0.040337 | -0.768544 | 45 | tumor | Male | C3L-00674 | CPT0064650003 |
4 | 0.046885 | -0.032622 | 0.358109 | -0.121666 | 0.105781 | -0.220744 | -0.426095 | -0.019649 | -0.124015 | 0.396851 | ... | 0.192123 | 0.075653 | 0.275052 | -0.047978 | 0.172529 | 74 | tumor | Male | C3L-01327 | CPT0064890003 |
5 rows × 10789 columns
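To illustrate the general idea behind KNN imputation, the sketch below fills each missing value with the mean of that protein in the `k` most similar samples. It is a generic, simplified version written for this explanation. The internals of `analytics.imputation_KNN` (including how `group` and `cutoff` are applied) are not shown in this notebook and may differ. The function name `knn_impute` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def knn_impute(df, k=2):
    """Fill each NaN with the column mean over the k nearest rows."""
    values = df.to_numpy(dtype=float)
    filled = values.copy()
    for i, row in enumerate(values):
        missing = np.isnan(row)
        if not missing.any():
            continue
        # Root-mean-square distance over columns observed in both rows.
        dists = []
        for j, other in enumerate(values):
            if j == i:
                continue
            shared = ~missing & ~np.isnan(other)
            if shared.any():
                d = np.sqrt(np.mean((row[shared] - other[shared]) ** 2))
                dists.append((d, j))
        dists.sort()
        for col in np.where(missing)[0]:
            # Nearest rows that actually have a value in this column.
            donors = [values[j, col] for _, j in dists
                      if not np.isnan(values[j, col])][:k]
            if donors:
                filled[i, col] = np.mean(donors)
    return pd.DataFrame(filled, index=df.index, columns=df.columns)


toy = pd.DataFrame({'P1': [1.0, 1.1, 5.0, 5.2],
                    'P2': [2.0, np.nan, 6.0, 6.1],
                    'P3': [3.0, 3.1, 7.0, 7.3]},
                   index=['s1', 's2', 's3', 's4'])
imputed = knn_impute(toy, k=1)
```

Here sample `s2` is closest to `s1`, so its missing `P2` value is taken from `s1`.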
[29]:
cptac_proteomics_data = cptac_proteomics_data.rename({'index':'subject', 'Sample type': 'group', 'Case ID (Participant ID)': 'sample'}, axis=1)
Analysis of Covariance¶
We analyze the dataset to find differentially regulated proteins between normal and tumor brain tissue samples, taking age and gender as covariates.
[30]:
cptac_proteomics_data = cptac_proteomics_data.sort_values(by=['group'], ascending=False)
[31]:
results = analytics.run_ancova(cptac_proteomics_data, covariates=['age', 'gender'], drop_cols=['sample', 'subject'], subject='subject', group='group', alpha=0.01)
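Conceptually, an analysis of covariance for one protein is a linear model of the abundance on the group plus the covariates, and the group coefficient is the difference between groups after adjusting for age and gender. The following sketch fits such a model with ordinary least squares on noise-free toy data, so the known effect is recovered exactly. It does not reproduce `analytics.run_ancova` (no p-values or post hoc tests), and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'group': ['normal'] * 3 + ['tumor'] * 3,
    'age': [60, 62, 58, 45, 70, 55],
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
})
is_tumor = (toy['group'] == 'tumor').astype(float).to_numpy()
is_male = (toy['gender'] == 'Male').astype(float).to_numpy()
age = toy['age'].to_numpy(dtype=float)

# Simulated abundance with a known tumor effect of 2.0 (assumption).
protein = 1.0 + 2.0 * is_tumor + 0.01 * age + 0.5 * is_male

# Design matrix: intercept, group indicator, and the two covariates.
X = np.column_stack([np.ones(len(toy)), is_tumor, age, is_male])
coef, *_ = np.linalg.lstsq(X, protein, rcond=None)
group_effect = coef[1]
```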
[32]:
results.head()
[32]:
identifier | group1 | group2 | mean(group1) | std(group1) | mean(group2) | std(group2) | posthoc T-Statistics | posthoc pvalue | coef | ... | log2FC | FC | F-statistics | pvalue | padj | correction | rejected | -log10 pvalue | Method | posthoc padj | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A1BG | normal | tumor | -0.932974 | 0.087071 | 0.109841 | 0.529097 | 9.013360 | 1.086551e-16 | 1.054802 | ... | -1.042815 | 0.485379 | 81.240659 | 1.086551e-16 | 3.058566e-16 | FDR correction BH | True | 15.963950 | One-way ancova | 3.058566e-16 |
1 | A2M | normal | tumor | -1.070228 | 0.289921 | 0.093514 | 0.576938 | 9.135619 | 4.820736e-17 | 1.177500 | ... | -1.163741 | 0.446353 | 83.459542 | 4.820736e-17 | 1.384469e-16 | FDR correction BH | True | 16.316887 | One-way ancova | 1.384469e-16 |
2 | AAAS | normal | tumor | -0.485488 | 0.185624 | 0.136464 | 0.207010 | 13.078227 | 3.568030e-29 | 0.622227 | ... | -0.621952 | 0.649791 | 171.040029 | 3.568030e-29 | 2.066468e-28 | FDR correction BH | True | 28.447572 | One-way ancova | 2.066468e-28 |
3 | AACS | normal | tumor | 0.722912 | 0.218220 | -0.049294 | 0.298136 | -11.801273 | 3.937318e-25 | -0.782880 | ... | 0.772206 | 1.707879 | 139.270049 | 3.937318e-25 | 1.794592e-24 | FDR correction BH | True | 24.404799 | One-way ancova | 1.794592e-24 |
4 | AADAT | normal | tumor | 1.241883 | 0.282301 | -0.151706 | 0.351998 | -17.257875 | 1.597045e-42 | -1.399042 | ... | 1.393590 | 2.627316 | 297.834255 | 1.597045e-42 | 1.945476e-41 | FDR correction BH | True | 41.796683 | One-way ancova | 1.945476e-41 |
5 rows × 23 columns
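The `correction` column reports a Benjamini-Hochberg (BH) FDR correction, which produces the `padj` values. As a reference for how BH-adjusted p-values are computed in general, here is a short NumPy sketch. The function name is hypothetical and the p-values are toy numbers, not results from this analysis.

```python
import numpy as np


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A protein is then called significant when its adjusted p-value falls below the chosen `alpha` (0.01 in this notebook).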
[33]:
fig = viz.run_volcano(results, identifier='volcano_plot', args={'alpha': 0.01,
'fc': 2,
'colorscale': 'Blues',
'showscale': False,
'marker_size': 8,
'x_title': 'log2FC',
'y_title': '-log10(pvalue)',
'num_annotations': 1000,
'annotate_list': []})
viz.save_DASH_plot(fig[0], 'volcano_plot_normal_tumor', plot_format='png', directory=data_dir)
iplot(fig[0].figure)
Enrichment and Knowledge Annotation¶
Using the identified list of significantly regulated proteins, we use annotations from CKG to determine enriched biological processes and to reveal the knowledge graph associated with these protein hits, focusing on proteins up-regulated in tumor tissue compared to normal. These proteins could be targeted by drug inhibitors to try to reverse the progression of the tumor.
[34]:
annotation_query = '''MATCH (p:Protein)-[r:ASSOCIATED_WITH]-(bp:Biological_process)
WHERE p.name IN $protein_list
RETURN DISTINCT p.name AS identifier, bp.name AS annotation'''
driver = connector.getGraphDatabaseConnectionConfiguration()
annotation = connector.getCursorData(driver, annotation_query, parameters={'protein_list':results['identifier'].unique().tolist()})
[35]:
enrichment = analytics.run_up_down_regulation_enrichment(results, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', correction='fdr_bh', alpha=0.01, lfc_cutoff=1)
upregulated_gbm_enrichment = enrichment['normal~tumor']
upregulated_gbm_enrichment = upregulated_gbm_enrichment[upregulated_gbm_enrichment['direction'] == 'downregulated']
e = {'normal~tumor': upregulated_gbm_enrichment}
[36]:
figures = viz.get_enrichment_plots(e, identifier='enrichment', args={'width':2200})
for i, fig in enumerate(figures):
    iplot(fig.figure)
    viz.save_DASH_plot(fig, name="enrichment_"+str(i), plot_format='png', directory=data_dir)
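The enrichment above is requested with `method='fisher'`. For intuition, an over-representation test asks how likely it is to see at least the observed number of annotated proteins among the hits if hits were drawn at random from the background. The sketch below computes this one-sided Fisher (hypergeometric tail) p-value with the standard library only. Whether CKG uses a one-sided or two-sided test is not stated here, and the function name and counts are hypothetical.

```python
from math import comb


def enrichment_pvalue(hits_annotated, hits_total,
                      background_annotated, background_total):
    """P(at least hits_annotated annotated proteins among hits_total draws)."""
    upper = min(hits_total, background_annotated)
    tail = sum(
        comb(background_annotated, i)
        * comb(background_total - background_annotated, hits_total - i)
        for i in range(hits_annotated, upper + 1)
    )
    return tail / comb(background_total, hits_total)


# Toy numbers: 20 background proteins, 5 carry the annotation,
# 4 regulated hits, 3 of which carry the annotation.
p = enrichment_pvalue(3, 4, 5, 20)
```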
Knowledge Summarization¶
[37]:
upregulated_gbm = results[(results['rejected']) & (results['log2FC'] < -1)]
[38]:
upregulated_gbm.shape
[38]:
(861, 23)
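The filter `log2FC < -1` selects proteins that are higher in tumor because, in the rows printed by `results.head()`, `log2FC` equals `mean(group1)` minus `mean(group2)` up to rounding, with group1 = normal and group2 = tumor. This is also why the enrichment cell keeps the `'downregulated'` direction of the `normal~tumor` comparison. The check below reuses those five printed rows.

```python
import pandas as pd

results_toy = pd.DataFrame({
    'identifier': ['A1BG', 'A2M', 'AAAS', 'AACS', 'AADAT'],
    'mean(group1)': [-0.932974, -1.070228, -0.485488, 0.722912, 1.241883],  # normal
    'mean(group2)': [0.109841, 0.093514, 0.136464, -0.049294, -0.151706],   # tumor
    'log2FC': [-1.042815, -1.163741, -0.621952, 0.772206, 1.393590],
    'rejected': [True] * 5,
})

# Difference of group means, normal minus tumor.
diff = results_toy['mean(group1)'] - results_toy['mean(group2)']
max_gap = (diff - results_toy['log2FC']).abs().max()

# Negative log2FC below -1 means at least two-fold higher in tumor.
higher_in_tumor = results_toy.loc[
    results_toy['rejected'] & (results_toy['log2FC'] < -1), 'identifier'
].tolist()
```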
[39]:
kn = knowledge.Knowledge(identifier='GBM', data=None)
kn.annotate_list(query_list=upregulated_gbm['identifier'].tolist(),
entity_type='protein',
queries_file=None,
attribute='name',
diseases=['glioblastoma'],
entities=None)
[40]:
kn.generate_report(visualizations=['sankey', 'network'],
summarize=True,
method='betweenness',
inplace=True)
[41]:
kn.report.visualize_report(environment='notebook')[0]
[42]:
kn.save_report(data_dir)
[58]:
kn.keep_nodes
[58]:
['glioblastoma']
Finding Drug Inhibitors¶
We use the knowledge in CKG to filter the list of up-regulated proteins in the tumor tissue compared to normal to explore possible drug inhibitors that could reverse the progression of the tumor.
We initially filter those significantly upregulated proteins that have been already associated with GBM by using the protein-disease associations in CKG.
We then use the filtered list to find inhibitors stored in CKG as drug-target relationships.
From the obtained list of candidate drugs, we further filter the results by finding evidence in publications mentioning the drug together with the disease and the protein they target.
This last step provides a final list of connected protein hits, candidate drug inhibitors and publications that can be visualized as a CKG knowledge subgraph.
[43]:
query = '''MATCH (p:Protein)-[r:ASSOCIATED_WITH]-(d:Disease{name:"glioblastoma"})
WHERE p.name IN $protein_list AND r.score > 1.5 RETURN p.name, d.name, r.score'''
res = connector.getCursorData(driver, query, parameters={'protein_list': upregulated_gbm['identifier'].tolist()})
[44]:
res.head()
[44]:
d.name | p.name | r.score | |
---|---|---|---|
0 | glioblastoma | CD276 | 1.602 |
1 | glioblastoma | ANXA5 | 2.290 |
2 | glioblastoma | THBS1 | 1.615 |
3 | glioblastoma | IGFBP2 | 1.835 |
4 | glioblastoma | MMP9 | 2.218 |
[45]:
len(res['p.name'].tolist())
[45]:
44
[46]:
query = '''MATCH (p:Protein)-[r:ACTS_ON{action:"inhibition"}]-(d:Drug)
WHERE p.name IN $protein_list AND r.score > 0.7
WITH p, d, r, SIZE((:Protein)-[:ACTS_ON]-(d)) as degree WHERE degree < 10
RETURN p.name, d.name, r.score'''
res = connector.getCursorData(driver, query, parameters={'protein_list': res['p.name'].tolist()})
[47]:
res.head()
[47]:
d.name | p.name | r.score | |
---|---|---|---|
0 | Pirfenidone | MMP9 | 0.8 |
1 | 3,5-Dimethyl-1-(3-Nitrophenyl)-1h-Pyrazole-4-C... | MMP9 | 0.8 |
2 | Hydroflumethiazide | CA9 | 0.8 |
3 | Pirfenidone | ALB | 0.8 |
4 | (10E,12Z)-octadecadienoic acid | MMP2 | 0.8 |
[48]:
res['d.name'].unique().shape
[48]:
(28,)
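The Cypher query in cell [46] keeps inhibition edges with a score above 0.7 and discards drugs connected to 10 or more proteins through `ACTS_ON`, which favors more specific compounds. The same filtering logic can be sketched on a toy edge list in pandas. Drug names, column names and the lower degree limit used here are hypothetical and chosen only to keep the example small.

```python
import pandas as pd

edges = pd.DataFrame({
    'drug': ['drug_A', 'drug_A', 'drug_B', 'drug_B', 'drug_B', 'drug_B', 'drug_C'],
    'protein': ['MMP9', 'ALB', 'MMP9', 'CA9', 'FN1', 'PLG', 'MMP2'],
    'action': ['inhibition', 'inhibition', 'inhibition', 'binding',
               'inhibition', 'inhibition', 'inhibition'],
    'score': [0.8, 0.8, 0.9, 0.8, 0.8, 0.8, 0.5],
})

max_degree = 4  # the notebook query uses 10; lowered for the toy data

# Degree counts every ACTS_ON edge of a drug, whatever the action.
degree = edges['drug'].map(edges['drug'].value_counts())

candidates = edges[(edges['action'] == 'inhibition')
                   & (edges['score'] > 0.7)
                   & (degree < max_degree)]
candidate_drugs = candidates['drug'].unique().tolist()
```

In this toy example `drug_B` is dropped for having too many targets and `drug_C` for its low score.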
[49]:
project_knowledge = knowledge.Knowledge(identifier='targets',
data=None,
nodes={},
relationships={},
queries_file=None,
colors={},
graph=None,
report={})
project_knowledge.generate_knowledge_from_edgelist(edgelist=res,
entity1='Protein',
entity2='Drug',
source='p.name',
target='d.name',
rtype='associated_with',
weight='r.score')
[50]:
query = '''MATCH (d:Drug)-[:MENTIONED_IN_PUBLICATION]-(p:Publication)-[:MENTIONED_IN_PUBLICATION]-(di:Disease{name:"glioblastoma"})
WHERE d.name IN $drug_list
WITH d, p
MATCH (pro:Protein)-[:MENTIONED_IN_PUBLICATION]-(p)
WHERE pro.name IN $protein_list
RETURN p.id, d.name, d.class, pro.name'''
res = connector.getCursorData(driver, query, parameters={'drug_list': res['d.name'].tolist(),
'protein_list':res['p.name'].unique().tolist()})
[51]:
res.columns = ['Drug class', 'Drug name', 'Publication', 'Protein name']
res['Publication'] = ["PMID:"+p for p in res['Publication'].tolist()]
res[['Drug class', 'Drug name', 'Publication', 'Protein name']]
[51]:
Drug class | Drug name | Publication | Protein name | |
---|---|---|---|---|
0 | Azoles | 3,5-Dimethyl-1-(3-Nitrophenyl)-1h-Pyrazole-4-C... | PMID:30470262 | MMP9 |
1 | Pyridines and derivatives | Pirfenidone | PMID:31282197 | MMP9 |
2 | Purine nucleosides | Nelarabine | PMID:26899176 | MMP9 |
3 | Pyridines and derivatives | Pirfenidone | PMID:31282197 | CA9 |
4 | Benzene and substituted derivatives | Cetirizine | PMID:31022935 | ALB |
5 | Carboxylic acids and derivatives | Tranexamic acid | PMID:22539956 | ALB |
6 | Pyridines and derivatives | Pirfenidone | PMID:31117237 | FN1 |
7 | Purine nucleosides | Nelarabine | PMID:31950163 | FN1 |
8 | Pyridines and derivatives | Pirfenidone | PMID:31282197 | FN1 |
9 | Pyridines and derivatives | Pirfenidone | PMID:29996062 | FN1 |
10 | Pyridines and derivatives | Pirfenidone | PMID:31547567 | FN1 |
11 | Azoles | 3,5-Dimethyl-1-(3-Nitrophenyl)-1h-Pyrazole-4-C... | PMID:31231472 | FN1 |
12 | Pyridines and derivatives | Pirfenidone | PMID:31231472 | FN1 |
13 | Pyridines and derivatives | Pirfenidone | PMID:29038232 | FN1 |
14 | Benzene and substituted derivatives | N-(4-sulfamoylphenyl)-1H-indazole-3-carboxamide | PMID:25707963 | FN1 |
15 | Pyridines and derivatives | Pirfenidone | PMID:25026295 | FN1 |
16 | Benzothiophenes | 4-Iodobenzo[B]Thiophene-2-Carboxamidine | PMID:21976520 | FN1 |
17 | Pyridines and derivatives | Pirfenidone | PMID:29996062 | ICAM1 |
18 | Azoles | 3,5-Dimethyl-1-(3-Nitrophenyl)-1h-Pyrazole-4-C... | PMID:31231472 | CDK1 |
19 | Pyridines and derivatives | Pirfenidone | PMID:31231472 | CDK1 |
20 | Benzothiophenes | 4-Iodobenzo[B]Thiophene-2-Carboxamidine | PMID:21976520 | PLG |
21 | Carboxylic acids and derivatives | Mimosine | PMID:20226717 | PLG |
22 | Pyridines and derivatives | Pirfenidone | PMID:29996062 | CCL2 |
23 | Pyridines and derivatives | Pirfenidone | PMID:25026295 | CHI3L1 |
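Assigning `res.columns` positionally relies on the order in which the columns come back from the query. A more defensive variant renames by key, so a change in column order cannot silently mislabel the data. The sketch below assumes the returned columns are named after the `RETURN` clause (`d.class`, `d.name`, `p.id`, `pro.name`), as suggested by the table above. The single toy row reuses values from that table.

```python
import pandas as pd

raw = pd.DataFrame({
    'p.id': ['31282197'],
    'd.name': ['Pirfenidone'],
    'd.class': ['Pyridines and derivatives'],
    'pro.name': ['MMP9'],
})

# Explicit mapping from query aliases to readable labels.
labels = {
    'd.class': 'Drug class',
    'd.name': 'Drug name',
    'p.id': 'Publication',
    'pro.name': 'Protein name',
}
tidy = raw.rename(columns=labels)[list(labels.values())]
tidy['Publication'] = 'PMID:' + tidy['Publication'].astype(str)
```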
[52]:
res.to_csv(os.path.join(data_dir, 'studies_drugs.tsv'), sep='\t', header=True, index=False)
[53]:
len(res['Drug name'].tolist())
[53]:
24
[54]:
res['r.score'] = 1
[55]:
project_knowledge.generate_knowledge_from_edgelist(edgelist=res,
entity1='Drug',
entity2='Publication',
source='Drug name',
target='Publication',
rtype='associated_with',
weight='r.score')
project_knowledge.generate_knowledge_from_edgelist(edgelist=res,
entity1='Protein',
entity2='Publication',
source='Protein name',
target='Publication',
rtype='associated_with',
weight='r.score')
[60]:
project_knowledge.generate_report(visualizations=['sankey', 'network'], summarize=False)
project_knowledge.report.visualize_report(environment='notebook')[0]
[57]:
project_knowledge.save_report(data_dir)