This notebook shows how CKG can be used to download data from the Proteomics Identifications Database, PRIDE (https://www.ebi.ac.uk/pride/), and quickly format them for analysis with the functionality in the analytics core.
[2]:
import os
import re  # used later to derive group names from sample names
import ckg.ckg_utils as ckg_utils
from ckg.graphdb_builder import builder_utils
from ckg.graphdb_builder.experiments.parsers import proteomicsParser
from ckg.analytics_core.analytics import analytics
CKG path
[3]:
ckg_location = '/Users/albertosantos/Development/Clinical_Proteomics_Department/ClinicalKnowledgeGraph(CKG)/code'
Define where the data should be downloaded
[4]:
# Use a relative second component: a leading '/' would make os.path.join discard ckg_location
analysis_dir = os.path.join(ckg_location, 'data/tmp/Deshmukh2019')
ckg_utils.checkDirectory(analysis_dir)
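One detail worth keeping in mind when building `analysis_dir`: `os.path.join` discards every earlier component as soon as a later component is absolute. The short sketch below illustrates this on a POSIX system; the `base` path is a hypothetical install location, not the one used in this notebook.

```python
import os

base = '/opt/ckg/code'  # hypothetical CKG location, for illustration only

# An absolute second component replaces everything before it.
wrong = os.path.join(base, '/data/tmp/Deshmukh2019')

# Relative components are appended to the base path as intended.
right = os.path.join(base, 'data', 'tmp', 'Deshmukh2019')

print(wrong)
print(right)
```

Printing the resulting path before creating directories is a cheap way to confirm the data will land inside the CKG folder.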
Specify the PRIDE identifier and file to be downloaded
[5]:
pxd_id = 'PXD008541'
file_name='SearchEngineResults_secretome.zip.rar'
Download data
We can use functionality in graphdb_builder to directly download data files from EBI’s PRIDE database (https://www.ebi.ac.uk/pride/). For that you just need to specify the PRIDE identifier for the project (PXD…) and the name of the file to download. In this case, the project identifier is PXD008541 and the file we will use is SearchEngineResults_secretome.zip.rar, a RAR compressed file with the output files from MaxQuant.
[6]:
builder_utils.download_PRIDE_data(pxd_id=pxd_id,
file_name=file_name,
to=analysis_dir)
'publicationDate'
[6]:
{'status': 'INTERNAL_SERVER_ERROR',
'code': 500,
'message': 'Internal Server Error: Could not open JPA EntityManager for transaction; nested exception is javax.persistence.PersistenceException: org.hibernate.exception.GenericJDBCException: Could not open connection',
'developerMessage': 'Please report to pride-support@ebi.ac.uk',
'moreInfoUrl': None,
'throwable': None}
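Because the PRIDE service can return an error response such as the one above, it may be worth confirming that the archive actually reached the analysis directory before trying to unpack it. The following is a minimal sketch; `archive_is_ready` is a hypothetical helper, not part of CKG, and a temporary directory stands in for `analysis_dir`.

```python
import os
import tempfile


def archive_is_ready(path):
    """Return True if `path` is an existing, non-empty regular file."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


# A temporary directory stands in for analysis_dir in this demonstration.
with tempfile.TemporaryDirectory() as tmp_dir:
    archive = os.path.join(tmp_dir, 'SearchEngineResults_secretome.zip.rar')
    missing = archive_is_ready(archive)       # nothing has been downloaded yet
    with open(archive, 'wb') as handle:
        handle.write(b'placeholder bytes')    # stand-in for the real archive
    present = archive_is_ready(archive)

print(missing, present)
```

A check like this only confirms that a non-empty file exists; it does not verify that the archive is complete or uncorrupted.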
[5]:
builder_utils.unrar(filepath=os.path.join(analysis_dir, file_name), to=analysis_dir)
The files extracted from the compressed archive can be listed using the listDirectoryFiles function in graphdb_builder.
[6]:
builder_utils.listDirectoryFiles(analysis_dir)
[6]:
['peptides.txt',
'SearchEngineResults_secretome.zip.rar',
'modificationSpecificPeptides.txt',
'experimentalDesignTemplate.txt',
'parameters.txt',
'msms.txt',
'proteinGroups.txt']
We use the proteinGroups.txt file, which contains the proteomics data processed with MaxQuant.
[7]:
proteinGroups_file = os.path.join(analysis_dir, 'proteinGroups.txt')
CKG has parsers for MaxQuant and Spectronaut output files. The default configuration used to parse these files must be updated with the names of the columns containing the protein quantifications for each sample. The default configuration can also be adapted to the experiment by selecting specific filters or removing unused columns. For example, in this study the output file did not have the columns Score and Q-value, so we removed them from the configuration. The column ‘Potential contaminant’ was named ‘Contaminant’, so we changed the name in the filters.
[8]:
#d = pd.read_csv(proteinGroups_file, sep='\t')
#d.columns.tolist()
[9]:
columns = ['LFQ intensity BAT_NE1',
'LFQ intensity BAT_NE2',
'LFQ intensity BAT_NE3',
'LFQ intensity BAT_NE4',
'LFQ intensity BAT_NE5',
'LFQ intensity BAT_woNE1',
'LFQ intensity BAT_woNE2',
'LFQ intensity BAT_woNE3',
'LFQ intensity BAT_woNE4',
'LFQ intensity BAT_woNE5',
'LFQ intensity WAT_NE1',
'LFQ intensity WAT_NE2',
'LFQ intensity WAT_NE3',
'LFQ intensity WAT_NE4',
'LFQ intensity WAT_NE5',
'LFQ intensity WAT_woNE1',
'LFQ intensity WAT_woNE2',
'LFQ intensity WAT_woNE3',
'LFQ intensity WAT_woNE4',
'LFQ intensity WAT_woNE5',
'Contaminant']
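Instead of typing the column list by hand, the quantification columns could also be collected from the header row of the file. The sketch below uses only the standard library and a toy header line standing in for proteinGroups.txt; `quantification_columns` is a hypothetical helper, not a CKG function.

```python
import csv
import io

# Toy stand-in for the first (tab-separated) line of proteinGroups.txt.
header_line = ('Majority protein IDs\tLFQ intensity BAT_NE1\t'
               'LFQ intensity WAT_woNE5\tIntensity BAT_NE1\tContaminant\n')


def quantification_columns(handle, prefix='LFQ intensity ', extra=('Contaminant',)):
    """Read only the header row and return the columns starting with
    `prefix`, followed by any `extra` columns present in the file."""
    header = next(csv.reader(handle, delimiter='\t'))
    selected = [name for name in header if name.startswith(prefix)]
    selected += [name for name in extra if name in header]
    return selected


selected_columns = quantification_columns(io.StringIO(header_line))
print(selected_columns)
```

With the real file, the same function could be called on `open(proteinGroups_file)`, assuming the file is tab-separated with a single header row.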
[10]:
configuration = proteomicsParser.update_configuration(data_type='proteins',
processing_tool='maxquant',
value_col='LFQ intensity',
columns=columns,
drop_cols=['Score', 'Q-value', 'Potential contaminant'],
filters=['Reverse', 'Only identified by site', 'Contaminant'])
[11]:
configuration
[11]:
{'columns': ['Majority protein IDs',
'Razor + unique peptides',
'id',
'LFQ intensity \\w+_AS\\d+_?-?\\d*',
'Intensity \\w+_AS\\d+_?-?\\d*',
'Reverse',
'Only identified by site',
'is_razor',
'LFQ intensity BAT_NE1',
'LFQ intensity BAT_NE2',
'LFQ intensity BAT_NE3',
'LFQ intensity BAT_NE4',
'LFQ intensity BAT_NE5',
'LFQ intensity BAT_woNE1',
'LFQ intensity BAT_woNE2',
'LFQ intensity BAT_woNE3',
'LFQ intensity BAT_woNE4',
'LFQ intensity BAT_woNE5',
'LFQ intensity WAT_NE1',
'LFQ intensity WAT_NE2',
'LFQ intensity WAT_NE3',
'LFQ intensity WAT_NE4',
'LFQ intensity WAT_NE5',
'LFQ intensity WAT_woNE1',
'LFQ intensity WAT_woNE2',
'LFQ intensity WAT_woNE3',
'LFQ intensity WAT_woNE4',
'LFQ intensity WAT_woNE5',
'Contaminant'],
'generated_columns': ['is_razor'],
'filters': ['Reverse', 'Only identified by site', 'Contaminant'],
'proteinCol': 'Majority protein IDs',
'contaminant_tag': 'CON__',
'valueCol': 'LFQ intensity',
'groupCol': 'id',
'indexCol': 'Majority protein IDs',
'attributes': {'cols': ['id', 'is_razor'], 'regex': ['Intensity']},
'log': 'log2',
'combine': 'regex',
'file': 'proteinGroups.txt'}
When we parse the data, we obtain an edge list following CKG’s graph format: sample, protein, relationship_type, value, protein_group_id, is_razor
[12]:
data = proteomicsParser.parser_from_file(proteinGroups_file, configuration=configuration, data_type='proteins', is_standard=False)[('proteins', 'w')]
/Users/albertosantos/Development/Clinical_Proteomics_Department/ClinicalKnowledgeGraph(CKG)/code/src/graphdb_builder/experiments/parsers/proteomicsParser.py:118: RuntimeWarning:
divide by zero encountered in log2
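To make the edge-list shape concrete, here is a rough sketch of how a wide intensity table can be reshaped into one row per sample and protein pair with pandas. It is not CKG's parser: the toy data are invented, and replacing zeros with NaN before taking log2 is one assumed way of avoiding the divide-by-zero warning shown above.

```python
import numpy as np
import pandas as pd

# Toy wide table: one row per protein group, one column per sample.
wide = pd.DataFrame({
    'Majority protein IDs': ['P1', 'P2'],
    'LFQ intensity BAT_NE1': [1024.0, 0.0],
    'LFQ intensity WAT_NE1': [2048.0, 4096.0],
})

# Reshape to long format: one row per (protein, sample) pair.
long = wide.melt(id_vars='Majority protein IDs',
                 var_name='sample', value_name='value')
long['sample'] = long['sample'].str.replace('LFQ intensity ', '', regex=False)

# MaxQuant reports missing quantifications as 0; treat them as missing
# before the log2 transform so that no -inf values are produced.
long['value'] = np.log2(long['value'].replace(0, np.nan))
long = long.dropna(subset=['value']).reset_index(drop=True)

# Name the columns as in the edge list produced by the parser.
long['TYPE'] = 'HAS_QUANTIFIED_PROTEIN'
edges = long.rename(columns={'sample': 'START_ID',
                             'Majority protein IDs': 'END_ID'})
edges = edges[['START_ID', 'END_ID', 'TYPE', 'value']]
print(edges)
```

The real parser output additionally carries the `id` and `is_razor` attributes, which this sketch leaves out.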
[13]:
data.head()
[13]:
 | START_ID | END_ID | TYPE | value | id | is_razor |
---|---|---|---|---|---|---|
0 | WAT_woNE5 | A0AVL1 | HAS_QUANTIFIED_PROTEIN | 20.109423 | 0 | False |
1 | WAT_woNE2 | A1A441 | HAS_QUANTIFIED_PROTEIN | 23.658291 | 1 | False |
2 | BAT_woNE3 | A1A441 | HAS_QUANTIFIED_PROTEIN | 22.753054 | 1 | False |
3 | WAT_woNE5 | A1A441 | HAS_QUANTIFIED_PROTEIN | 23.363726 | 1 | False |
4 | BAT_NE3 | A1A441 | HAS_QUANTIFIED_PROTEIN | 23.317552 | 1 | False |
[14]:
data.columns = ['sample', 'identifier', 'relationship', 'LFQ intensity', 'id', 'is_razor']
[15]:
data.head()
[15]:
 | sample | identifier | relationship | LFQ intensity | id | is_razor |
---|---|---|---|---|---|---|
0 | WAT_woNE5 | A0AVL1 | HAS_QUANTIFIED_PROTEIN | 20.109423 | 0 | False |
1 | WAT_woNE2 | A1A441 | HAS_QUANTIFIED_PROTEIN | 23.658291 | 1 | False |
2 | BAT_woNE3 | A1A441 | HAS_QUANTIFIED_PROTEIN | 22.753054 | 1 | False |
3 | WAT_woNE5 | A1A441 | HAS_QUANTIFIED_PROTEIN | 23.363726 | 1 | False |
4 | BAT_NE3 | A1A441 | HAS_QUANTIFIED_PROTEIN | 23.317552 | 1 | False |
[16]:
data.shape
[16]:
(57470, 6)
[17]:
# Keep only razor entries; copy so that adding columns later does not modify a view
data = data[data.is_razor].copy()
[18]:
data.shape
[18]:
(17489, 6)
We can use the sample names to extract the group information: BAT_NE, WAT_NE, BAT_woNE, WAT_woNE
With this last column, we obtain the original dataframe used as the starting point in CKG’s analysis pipelines.
[19]:
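# Remove the replicate number from each sample name to obtain its group (raw string avoids an invalid escape)
data['group'] = data['sample'].apply(lambda x: re.sub(r'\d', '', x))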
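As a small standalone illustration of the group extraction, the sketch below applies the same digit-stripping pattern to a few of the sample names from this study. The anchored variant and the sample name `T2D_ctrl1` are assumptions added for illustration: they show what could be used if group names themselves contained digits.

```python
import re

samples = ['BAT_NE1', 'BAT_woNE3', 'WAT_NE4', 'WAT_woNE5']

# Same pattern as in the notebook: remove every digit from the sample name.
groups = [re.sub(r'\d', '', name) for name in samples]
print(groups)

# Hypothetical sample whose group name contains a digit.
tricky = 'T2D_ctrl1'
all_digits_removed = re.sub(r'\d', '', tricky)    # also strips the '2'
trailing_only = re.sub(r'\d+$', '', tricky)       # strips only the replicate number
print(all_digits_removed, trailing_only)
```

For the sample names in this dataset both patterns give the same groups, since digits only appear as the trailing replicate number.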
[20]:
data.head()
[20]:
 | sample | identifier | relationship | LFQ intensity | id | is_razor | group |
---|---|---|---|---|---|---|---|
7 | BAT_NE4 | A1L4H1 | HAS_QUANTIFIED_PROTEIN | 21.686542 | 2 | True | BAT_NE |
134 | WAT_NE4 | A6NCN2 | HAS_QUANTIFIED_PROTEIN | 27.843440 | 10 | True | WAT_NE |
135 | BAT_woNE3 | A6NDG6 | HAS_QUANTIFIED_PROTEIN | 23.307164 | 11 | True | BAT_woNE |
136 | WAT_NE4 | A6NDG6 | HAS_QUANTIFIED_PROTEIN | 23.586348 | 11 | True | WAT_NE |
137 | BAT_woNE2 | A6NDG6 | HAS_QUANTIFIED_PROTEIN | 23.446827 | 11 | True | BAT_woNE |
[21]:
original = data[['group', 'sample', 'identifier', 'LFQ intensity']]
The original dataframe is the starting point in CKG’s proteomics analysis
To prepare the data, we follow these steps:
1. Filtering based on missing values
2. Imputation of missing values using a mixed strategy: KNN and MinProb
These steps will generate the processed dataframe, a complete matrix that can be used in the exploratory and statistical analysis.
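The two steps can be sketched on a toy wide matrix as follows. This is an illustration under stated assumptions and not CKG's implementation: `filter_at_least_x` and `impute_downshifted_normal` are hypothetical helpers, the imputation is a simple per-protein draw from a down-shifted normal distribution rather than the mixed KNN and MinProb method, and the parameter names `min_valid`, `shift` and `nstd` are borrowed from the call below only for readability.

```python
import numpy as np
import pandas as pd


def filter_at_least_x(df, group_col='group', min_valid=3):
    """Keep protein columns that have at least `min_valid` non-missing
    values in at least one group."""
    values = df.drop(columns=[group_col])
    valid_counts = values.notna().groupby(df[group_col]).sum()
    keep = [col for col in valid_counts.columns
            if (valid_counts[col] >= min_valid).any()]
    return df[[group_col] + keep]


def impute_downshifted_normal(df, group_col='group', shift=1.8, nstd=0.3, seed=0):
    """Fill missing values per protein with draws from a normal distribution
    centred `shift` standard deviations below the observed mean, with a
    width of `nstd` observed standard deviations."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns.drop(group_col):
        observed = out[col].dropna()
        missing = out[col].isna()
        if missing.any() and len(observed) > 1:
            mu = observed.mean() - shift * observed.std()
            sigma = nstd * observed.std()
            out.loc[missing, col] = rng.normal(mu, sigma, int(missing.sum()))
    return out


nan = np.nan
toy = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'P1': [20.0, 21.0, 22.0, nan, nan, nan],   # complete in group A only
    'P2': [20.0, nan, nan, 21.0, 22.0, nan],   # never 3 valid values in a group
    'P3': [25.0, 25.5, 26.0, 24.0, 24.5, 25.0],
})

filtered = filter_at_least_x(toy, min_valid=3)
imputed = impute_downshifted_normal(filtered)
print(filtered.columns.tolist())
print(imputed)
```

In this toy example P2 is removed by the filter, and the missing P1 values in group B are replaced by low-intensity draws, which reflects the common assumption that values are missing because they fall below the detection limit.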
[28]:
processed_data = analytics.get_proteomics_measurements_ready(original,
index_cols=['group', 'sample'],
drop_cols=['sample'],
group='group',
identifier='identifier',
extra_identifier=None,
imputation=True,
method='mixed',
knn_cutoff=0.4,
missing_method='at_least_x',
missing_per_group=True,
min_valid=3,
value_col='LFQ intensity',
shift=1.8,
nstd=0.3)
[29]:
processed_data.head()
[29]:
identifier | group | sample | A6NDG6 | B3KW70 | E9PAV3 | E9PGF5 | E9PHK0 | F5GWP8 | F8W031 | G3V3G9 | ... | Q9Y4Y9 | Q9Y5P4 | Q9Y5X3 | Q9Y5Z4 | Q9Y600 | Q9Y617 | Q9Y646 | Q9Y678 | Q9Y696 | Q9Y6I3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | BAT_NE | BAT_NE1 | 22.948430 | 29.091339 | 26.578739 | 23.563604 | 27.322081 | 27.821164 | 24.420405 | 23.947173 | ... | 22.109509 | 23.130157 | 22.569201 | 28.647187 | 24.888146 | 26.012226 | 22.746304 | 23.610317 | 24.676053 | 23.408469 |
1 | BAT_NE | BAT_NE2 | 22.954981 | 28.610591 | 27.468243 | 23.811590 | 26.283462 | 27.452462 | 23.071491 | 23.891354 | ... | 22.109780 | 23.312665 | 22.572965 | 28.026016 | 24.164153 | 26.294796 | 22.171377 | 25.493292 | 24.928906 | 22.570449 |
2 | BAT_NE | BAT_NE3 | 22.817377 | 28.272086 | 26.773793 | 23.708936 | 25.299149 | 27.471115 | 22.773441 | 23.618293 | ... | 22.104088 | 23.378619 | 22.493896 | 27.956985 | 25.221587 | 26.440457 | 22.150598 | 25.144410 | 26.972790 | 22.340063 |
3 | BAT_NE | BAT_NE4 | 22.937885 | 30.812377 | 26.089441 | 23.757766 | 28.160532 | 28.601127 | 23.301110 | 23.881631 | ... | 22.109074 | 22.237510 | 22.563142 | 27.506049 | 24.793327 | 24.906694 | 22.693322 | 23.264133 | 25.210343 | 22.709766 |
4 | BAT_NE | BAT_NE5 | 23.047168 | 28.364277 | 26.971255 | 24.055731 | 25.989885 | 26.889135 | 22.317248 | 24.034814 | ... | 22.113593 | 23.397543 | 22.625936 | 28.196715 | 25.171234 | 26.397037 | 21.778724 | 24.573057 | 26.251168 | 22.132167 |
5 rows × 1169 columns