This notebook shows how CKG can be used to download data from the Proteomics Identifications Database, PRIDE (https://www.ebi.ac.uk/pride/), and quickly format them for analysis with the functionality in the analytics core.

[2]:

import os
import re  # used further down to derive group names from sample names

import ckg.ckg_utils as ckg_utils
from ckg.graphdb_builder import builder_utils
from ckg.graphdb_builder.experiments.parsers import proteomicsParser

from ckg.analytics_core.analytics import analytics

CKG path

[3]:

ckg_location = '/Users/albertosantos/Development/Clinical_Proteomics_Department/ClinicalKnowledgeGraph(CKG)/code'

Define where the data should be downloaded

[4]:

# Relative sub-path: a leading '/' would make os.path.join discard ckg_location
analysis_dir = os.path.join(ckg_location, 'data/tmp/Deshmukh2019')
ckg_utils.checkDirectory(analysis_dir)
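
One detail worth checking when building such paths: `os.path.join` discards every earlier component as soon as it meets an absolute one. A minimal sketch of that behaviour follows. It uses `posixpath` (the POSIX implementation behind `os.path`) so the result does not depend on the operating system, and the base path `/opt/ckg/code` is a hypothetical stand-in, not the location used in this notebook.

```python
import posixpath

# Hypothetical base directory, used only for illustration
base = '/opt/ckg/code'

# An absolute second component replaces everything before it
joined_abs = posixpath.join(base, '/data/tmp/Deshmukh2019')

# A relative second component is appended to the base
joined_rel = posixpath.join(base, 'data/tmp/Deshmukh2019')

print(joined_abs)
print(joined_rel)
```

If the download directory is meant to live inside the CKG code folder, the relative form is the one that achieves it.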

Specify the PRIDE identifier and file to be downloaded

[5]:

pxd_id = 'PXD008541'
file_name = 'SearchEngineResults_secretome.zip.rar'

Download data

We can use functionality in graphdb_builder to download data files directly from EBI’s PRIDE database (https://www.ebi.ac.uk/pride/). You only need to specify the PRIDE identifier for the project (PXD…) and the name of the file to download. In this case, the project identifier is PXD008541 and the file is SearchEngineResults_secretome.zip.rar, a RAR-compressed file with the output files from MaxQuant.

[6]:

builder_utils.download_PRIDE_data(pxd_id=pxd_id,
                                  file_name=file_name,
                                  to=analysis_dir)

'publicationDate'

[6]:

{'status': 'INTERNAL_SERVER_ERROR',
 'code': 500,
 'message': 'Internal Server Error: Could not open JPA EntityManager for transaction; nested exception is javax.persistence.PersistenceException: org.hibernate.exception.GenericJDBCException: Could not open connection',
 'developerMessage': 'Please report to pride-support@ebi.ac.uk',
 'moreInfoUrl': None,
 'throwable': None}
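
The output above is an error payload from the PRIDE service rather than a confirmation of a successful download, so it can be useful to check for it before moving on to decompression. The following is a minimal sketch of such a guard. It assumes only what the output shows (a dictionary with `status` and `code` keys); `looks_like_error` is a hypothetical helper, not part of CKG, and what `download_PRIDE_data` returns on success is not documented here.

```python
def looks_like_error(response):
    """Return True when `response` resembles the error payload shown above."""
    if not isinstance(response, dict):
        return False
    code = response.get('code')
    # HTTP-style codes of 400 and above signal a failed request
    return isinstance(code, int) and code >= 400


# Reduced copy of the payload printed above
example = {'status': 'INTERNAL_SERVER_ERROR', 'code': 500}
failed = looks_like_error(example)

if failed:
    print("Download failed with status {status} (code {code})".format(**example))
```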

[5]:

builder_utils.unrar(filepath=os.path.join(analysis_dir, file_name), to=analysis_dir)

The files extracted from the compressed folder can be listed using the listDirectoryFiles functionality in graphdb_builder.

[6]:

builder_utils.listDirectoryFiles(analysis_dir)

[6]:

['peptides.txt',
 'SearchEngineResults_secretome.zip.rar',
 'modificationSpecificPeptides.txt',
 'experimentalDesignTemplate.txt',
 'parameters.txt',
 'msms.txt',
 'proteinGroups.txt']

We use the proteinGroups.txt file, which contains the proteomics data processed with the MaxQuant software.

[7]:

proteinGroups_file = os.path.join(analysis_dir, 'proteinGroups.txt')

CKG has parsers for MaxQuant and Spectronaut output files. The default configuration used to parse these files must be updated with the names of the columns containing the protein quantifications for each sample. The default configuration can also be adapted to the experiment by selecting specific filters or removing unused columns. For example, in this study the output file did not have the columns Score and Q-value, so we removed them from the configuration, and the column ‘Potential contaminant’ was named ‘Contaminant’, so we changed the name in the filters.

[8]:

#d = pd.read_csv(proteinGroups_file, sep='\t')
#d.columns.tolist()

[9]:

columns = ['LFQ intensity BAT_NE1',
           'LFQ intensity BAT_NE2',
           'LFQ intensity BAT_NE3',
           'LFQ intensity BAT_NE4',
           'LFQ intensity BAT_NE5',
           'LFQ intensity BAT_woNE1',
           'LFQ intensity BAT_woNE2',
           'LFQ intensity BAT_woNE3',
           'LFQ intensity BAT_woNE4',
           'LFQ intensity BAT_woNE5',
           'LFQ intensity WAT_NE1',
           'LFQ intensity WAT_NE2',
           'LFQ intensity WAT_NE3',
           'LFQ intensity WAT_NE4',
           'LFQ intensity WAT_NE5',
           'LFQ intensity WAT_woNE1',
           'LFQ intensity WAT_woNE2',
           'LFQ intensity WAT_woNE3',
           'LFQ intensity WAT_woNE4',
           'LFQ intensity WAT_woNE5',
           'Contaminant']
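
Because the sample names follow a regular pattern, the same list could also be generated instead of typed out. This is a sketch under the assumption that every combination of prefix, condition and replicate number 1 to 5 exists in the file, as it does in the list above; the variable names are illustrative.

```python
prefixes = ['BAT', 'WAT']
conditions = ['NE', 'woNE']
replicates = range(1, 6)

# Same order as the hand-written list: BAT_NE1..5, BAT_woNE1..5, WAT_NE1..5, WAT_woNE1..5
generated_columns = ['LFQ intensity {}_{}{}'.format(p, c, r)
                     for p in prefixes
                     for c in conditions
                     for r in replicates]
generated_columns.append('Contaminant')

print(len(generated_columns))
```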

[10]:

configuration = proteomicsParser.update_configuration(data_type='proteins',
                                                      processing_tool='maxquant',
                                                      value_col='LFQ intensity',
                                                      columns=columns,
                                                      drop_cols=['Score', 'Q-value', 'Potential contaminant'],
                                                      filters=['Reverse', 'Only identified by site', 'Contaminant'])

[11]:

configuration

[11]:

{'columns': ['Majority protein IDs',
  'Razor + unique peptides',
  'id',
  'LFQ intensity \\w+_AS\\d+_?-?\\d*',
  'Intensity \\w+_AS\\d+_?-?\\d*',
  'Reverse',
  'Only identified by site',
  'is_razor',
  'LFQ intensity BAT_NE1',
  'LFQ intensity BAT_NE2',
  'LFQ intensity BAT_NE3',
  'LFQ intensity BAT_NE4',
  'LFQ intensity BAT_NE5',
  'LFQ intensity BAT_woNE1',
  'LFQ intensity BAT_woNE2',
  'LFQ intensity BAT_woNE3',
  'LFQ intensity BAT_woNE4',
  'LFQ intensity BAT_woNE5',
  'LFQ intensity WAT_NE1',
  'LFQ intensity WAT_NE2',
  'LFQ intensity WAT_NE3',
  'LFQ intensity WAT_NE4',
  'LFQ intensity WAT_NE5',
  'LFQ intensity WAT_woNE1',
  'LFQ intensity WAT_woNE2',
  'LFQ intensity WAT_woNE3',
  'LFQ intensity WAT_woNE4',
  'LFQ intensity WAT_woNE5',
  'Contaminant'],
 'generated_columns': ['is_razor'],
 'filters': ['Reverse', 'Only identified by site', 'Contaminant'],
 'proteinCol': 'Majority protein IDs',
 'contaminant_tag': 'CON__',
 'valueCol': 'LFQ intensity',
 'groupCol': 'id',
 'indexCol': 'Majority protein IDs',
 'attributes': {'cols': ['id', 'is_razor'], 'regex': ['Intensity']},
 'log': 'log2',
 'combine': 'regex',
 'file': 'proteinGroups.txt'}
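
The resulting configuration is consistent with three simple operations on a default dictionary: dropping some columns, appending the sample columns, and replacing the filters. The sketch below illustrates that idea in plain Python. It is not CKG's implementation of `update_configuration`; the function `update_config` and the contents of `default` are hypothetical and much smaller than the real defaults.

```python
def update_config(default, columns, drop_cols, filters):
    """Return a copy of `default` with columns dropped/added and filters replaced."""
    updated = dict(default)
    kept = [c for c in default['columns'] if c not in drop_cols]
    # Append the new columns, skipping any that are already present
    updated['columns'] = kept + [c for c in columns if c not in kept]
    updated['filters'] = list(filters)
    return updated


# Toy default configuration, for illustration only
default = {'columns': ['Majority protein IDs', 'Score', 'Q-value',
                       'Reverse', 'Potential contaminant'],
           'filters': ['Reverse', 'Potential contaminant'],
           'valueCol': 'LFQ intensity'}

updated = update_config(default,
                        columns=['LFQ intensity S1', 'Contaminant'],
                        drop_cols=['Score', 'Q-value', 'Potential contaminant'],
                        filters=['Reverse', 'Contaminant'])
print(updated['columns'])
```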

When we parse the data, we obtain the measurements as an edge list following CKG’s graph format: sample, protein, relationship_type, value, protein_group_id, is_razor
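
To make the edge-list layout concrete, here is a minimal sketch of reshaping a wide quantification table into that form with pandas. It is not CKG's parser: the values are toy numbers, only two samples and two proteins are used, and the `id` and `is_razor` columns are left out.

```python
import pandas as pd

# Toy wide table: one row per protein, one column per sample
wide = pd.DataFrame({
    'Majority protein IDs': ['P1', 'P2'],
    'LFQ intensity S1': [10.0, 20.0],
    'LFQ intensity S2': [11.0, 21.0],
})

# One row per (sample, protein) pair
edges = wide.melt(id_vars='Majority protein IDs',
                  var_name='START_ID', value_name='value')
# Keep only the sample name from the column header
edges['START_ID'] = edges['START_ID'].str.replace('LFQ intensity ', '', regex=False)
edges = edges.rename(columns={'Majority protein IDs': 'END_ID'})
edges['TYPE'] = 'HAS_QUANTIFIED_PROTEIN'
edges = edges[['START_ID', 'END_ID', 'TYPE', 'value']]

print(edges)
```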

[12]:

data = proteomicsParser.parser_from_file(proteinGroups_file, configuration=configuration, data_type='proteins', is_standard=False)[('proteins', 'w')]

/Users/albertosantos/Development/Clinical_Proteomics_Department/ClinicalKnowledgeGraph(CKG)/code/src/graphdb_builder/experiments/parsers/proteomicsParser.py:118: RuntimeWarning:

divide by zero encountered in log2
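
NumPy emits this warning when `log2` is applied to a value of 0, which yields `-inf`. That is plausible here if missing quantifications are stored as 0 in the input file (an assumption about this dataset). A minimal sketch of one way to avoid it, by turning zeros into missing values before the transformation:

```python
import numpy as np
import pandas as pd

# Toy intensities; 0.0 stands in for a missing quantification
intensities = pd.Series([1024.0, 0.0, 4096.0])

# Replace zeros with NaN so that log2 never sees a zero
logged = np.log2(intensities.replace(0, np.nan))

print(logged)
```

Whether zeros should be treated as missing depends on the tool that produced the file, so this is a choice to confirm rather than a rule.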

[13]:

data.head()

[13]:

	START_ID	END_ID	TYPE	value	id	is_razor
0	WAT_woNE5	A0AVL1	HAS_QUANTIFIED_PROTEIN	20.109423	0	False
1	WAT_woNE2	A1A441	HAS_QUANTIFIED_PROTEIN	23.658291	1	False
2	BAT_woNE3	A1A441	HAS_QUANTIFIED_PROTEIN	22.753054	1	False
3	WAT_woNE5	A1A441	HAS_QUANTIFIED_PROTEIN	23.363726	1	False
4	BAT_NE3	A1A441	HAS_QUANTIFIED_PROTEIN	23.317552	1	False

[14]:

data.columns = ['sample', 'identifier', 'relationship', 'LFQ intensity', 'id', 'is_razor']

[15]:

data.head()

[15]:

	sample	identifier	relationship	LFQ intensity	id	is_razor
0	WAT_woNE5	A0AVL1	HAS_QUANTIFIED_PROTEIN	20.109423	0	False
1	WAT_woNE2	A1A441	HAS_QUANTIFIED_PROTEIN	23.658291	1	False
2	BAT_woNE3	A1A441	HAS_QUANTIFIED_PROTEIN	22.753054	1	False
3	WAT_woNE5	A1A441	HAS_QUANTIFIED_PROTEIN	23.363726	1	False
4	BAT_NE3	A1A441	HAS_QUANTIFIED_PROTEIN	23.317552	1	False

[16]:

data.shape

[16]:

(57470, 6)

[17]:

# Keep razor proteins only; copy so that columns can be added later without pandas warnings
data = data[data.is_razor].copy()

[18]:

data.shape

[18]:

(17489, 6)

We can use the sample names to extract the group information: BAT_NE, WAT_NE, BAT_woNE, WAT_woNE.

With this last column, we obtain the original dataframe used as the starting point in CKG’s analysis pipelines.

[19]:

# Strip the replicate number from the sample name (raw string avoids an invalid-escape warning)
data['group'] = data['sample'].apply(lambda x: re.sub(r'\d', '', x))
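
As a small illustration of this step, the sketch below applies the same substitution to a few of the sample names from this dataset. It also shows a stricter variant that removes only trailing digits, which could matter for naming schemes where the group name itself contains a digit; `T2D_rep1` is a hypothetical name, not part of this dataset.

```python
import re

samples = ['BAT_NE1', 'WAT_woNE5', 'BAT_woNE3']

# Remove every digit, as in the cell above
groups_all_digits = [re.sub(r'\d', '', s) for s in samples]

# Remove only the digits at the end of the name
groups_trailing = [re.sub(r'\d+$', '', s) for s in samples]

# Hypothetical sample name where the two variants differ
all_digits_example = re.sub(r'\d', '', 'T2D_rep1')
trailing_example = re.sub(r'\d+$', '', 'T2D_rep1')

print(groups_all_digits, groups_trailing)
print(all_digits_example, trailing_example)
```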

[20]:

data.head()

[20]:

	sample	identifier	relationship	LFQ intensity	id	is_razor	group
7	BAT_NE4	A1L4H1	HAS_QUANTIFIED_PROTEIN	21.686542	2	True	BAT_NE
134	WAT_NE4	A6NCN2	HAS_QUANTIFIED_PROTEIN	27.843440	10	True	WAT_NE
135	BAT_woNE3	A6NDG6	HAS_QUANTIFIED_PROTEIN	23.307164	11	True	BAT_woNE
136	WAT_NE4	A6NDG6	HAS_QUANTIFIED_PROTEIN	23.586348	11	True	WAT_NE
137	BAT_woNE2	A6NDG6	HAS_QUANTIFIED_PROTEIN	23.446827	11	True	BAT_woNE

[21]:

original = data[['group', 'sample', 'identifier', 'LFQ intensity']]

The original dataframe is the starting point in CKG’s proteomics analysis

To prepare the data, we follow these steps:

Filtering based on missing values
Imputation of missing values using a mixed model strategy: KNN and MinProb

These steps will generate the processed dataframe, a complete matrix that can be used in the exploratory and statistical analysis.
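
The filtering step can be sketched as follows. This is an illustration, not CKG's code: it assumes that the "at least x" rule keeps a protein when it has at least `min_valid` non-missing values in at least one group, and `keep_min_valid` is a hypothetical helper operating on a wide table with one column per protein.

```python
import numpy as np
import pandas as pd


def keep_min_valid(wide, group_col='group', min_valid=3):
    """Keep protein columns with >= min_valid valid values in at least one group."""
    values = wide.drop(columns=[group_col])
    # Number of non-missing values per group and protein
    counts = values.notna().groupby(wide[group_col]).sum()
    keep = counts.columns[(counts >= min_valid).any(axis=0)]
    return wide[[group_col] + keep.tolist()]


# Toy data: P1 is complete in group A, P2 never reaches 3 valid values in a group
toy = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'P1': [1.0, 2.0, 3.0, np.nan, np.nan, np.nan],
    'P2': [1.0, np.nan, np.nan, 2.0, np.nan, 3.0],
})

filtered = keep_min_valid(toy, min_valid=3)
print(filtered.columns.tolist())
```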

[28]:

processed_data = analytics.get_proteomics_measurements_ready(original,
                                                             index_cols=['group', 'sample'],
                                                             drop_cols=['sample'],
                                                             group='group',
                                                             identifier='identifier',
                                                             extra_identifier=None,
                                                             imputation=True,
                                                             method='mixed',
                                                             knn_cutoff=0.4,
                                                             missing_method='at_least_x',
                                                             missing_per_group=True,
                                                             min_valid=3,
                                                             value_col='LFQ intensity',
                                                             shift=1.8,
                                                             nstd=0.3)
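
The `shift` and `nstd` arguments suggest an imputation that draws replacement values from a normal distribution shifted towards low intensities. The sketch below shows that general technique (a down-shifted normal distribution) on a single series. It assumes that `shift` scales how far the mean is lowered and `nstd` scales the width, both in units of the observed standard deviation; this is an interpretation of the parameter names, not a description of CKG's internals, and `impute_downshifted_normal` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def impute_downshifted_normal(values, shift=1.8, nstd=0.3, seed=0):
    """Fill NaNs with draws from N(mean - shift*std, (nstd*std)^2)."""
    rng = np.random.default_rng(seed)
    out = values.copy()
    missing = out.isna()
    std = out.std()            # NaNs are ignored by default
    mu = out.mean() - shift * std
    sigma = nstd * std
    out[missing] = rng.normal(mu, sigma, size=int(missing.sum()))
    return out


# Toy log2 intensities with two missing values
observed = pd.Series([24.0, 25.0, np.nan, 26.0, np.nan, 25.5])
imputed = impute_downshifted_normal(observed)
print(imputed)
```

Drawing from the low end of the distribution reflects the idea that values are often missing because the signal was below the detection limit.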

[29]:

processed_data.head()

[29]:

identifier	group	sample	A6NDG6	B3KW70	E9PAV3	E9PGF5	E9PHK0	F5GWP8	F8W031	G3V3G9	...	Q9Y4Y9	Q9Y5P4	Q9Y5X3	Q9Y5Z4	Q9Y600	Q9Y617	Q9Y646	Q9Y678	Q9Y696	Q9Y6I3
0	BAT_NE	BAT_NE1	22.948430	29.091339	26.578739	23.563604	27.322081	27.821164	24.420405	23.947173	...	22.109509	23.130157	22.569201	28.647187	24.888146	26.012226	22.746304	23.610317	24.676053	23.408469
1	BAT_NE	BAT_NE2	22.954981	28.610591	27.468243	23.811590	26.283462	27.452462	23.071491	23.891354	...	22.109780	23.312665	22.572965	28.026016	24.164153	26.294796	22.171377	25.493292	24.928906	22.570449
2	BAT_NE	BAT_NE3	22.817377	28.272086	26.773793	23.708936	25.299149	27.471115	22.773441	23.618293	...	22.104088	23.378619	22.493896	27.956985	25.221587	26.440457	22.150598	25.144410	26.972790	22.340063
3	BAT_NE	BAT_NE4	22.937885	30.812377	26.089441	23.757766	28.160532	28.601127	23.301110	23.881631	...	22.109074	22.237510	22.563142	27.506049	24.793327	24.906694	22.693322	23.264133	25.210343	22.709766
4	BAT_NE	BAT_NE5	23.047168	28.364277	26.971255	24.055731	25.989885	26.889135	22.317248	24.034814	...	22.113593	23.397543	22.625936	28.196715	25.171234	26.397037	21.778724	24.573057	26.251168	22.132167

5 rows × 1169 columns
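
The processed dataframe is in wide form: one row per sample and one column per protein. As a minimal sketch of that reshaping with pandas (toy values, and not necessarily how `get_proteomics_measurements_ready` does it internally), a long table like `original` can be pivoted as follows:

```python
import pandas as pd

# Toy long table with the same column names as `original`
long_df = pd.DataFrame({
    'group': ['BAT_NE', 'BAT_NE', 'WAT_NE', 'WAT_NE'],
    'sample': ['BAT_NE1', 'BAT_NE1', 'WAT_NE1', 'WAT_NE1'],
    'identifier': ['P1', 'P2', 'P1', 'P2'],
    'LFQ intensity': [22.5, 28.1, 23.0, 27.4],
})

# One row per (group, sample), one column per protein identifier
wide_df = long_df.pivot_table(index=['group', 'sample'],
                              columns='identifier',
                              values='LFQ intensity').reset_index()

print(wide_df)
```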