Converting CKG to Sample and Data Relationship Format for Proteomics (SDRF-Proteomics)

Abstract

Metadata is essential in proteomics data repositories and is crucial to interpret and reanalyze the deposited data sets. For every proteomics data set, we should capture at least three levels of metadata: (i) data set description, (ii) the sample to data files related information, and (iii) standard data file formats (e.g., mzIdentML, mzML, or mzTab). While the data set description and standard data file formats are supported by all ProteomeXchange partners, the information regarding the sample to data files is mostly missing. Recently, members of the European Bioinformatics Community for Mass Spectrometry (EuBIC) have created an open-source project called Sample to Data file format for Proteomics (https://github.com/bigbio/proteomics-metadata-standard/) to enable the standardization of sample metadata of public proteomics data sets. Here, the project is presented to the proteomics community, and we call for contributors, including researchers, journals, and consortiums to provide feedback about the format. We believe this work will improve reproducibility and facilitate the development of new tools dedicated to proteomics data analysis.

Here, we show how to convert CKG projects into the SDRF-Proteomics standard. The SDRF representation is included in every report generated with CKG and can be shared when submitting to PRIDE.

[1]:
from ckg.graphdb_builder import builder_utils as utils
from ckg.report_manager import project
WGCNA functions will not work. Module Rpy2 not installed.
R functions will not work. Module Rpy2 not installed.

Get SDRF format for a specific project

Here, we create a Project object for project P0000014 and call its get_sdrf() method to convert the project from the CKG standard to SDRF format. This method maps the ontologies used in CKG to the Experimental Factor Ontology (EFO, https://www.ebi.ac.uk/efo/), which the SDRF format requires.

[2]:
my_project = project.Project(identifier="P0000014", configuration_files={}, datasets={}, knowledge=None, report={})
[3]:
sdrf_df = my_project.get_sdrf()
sdrf_df.head()
[3]:
| | characteristics[individual] | source name | comment[data file] | characteristics[phenotype] | characteristics[organism part] | characteristics[alkaline phosphatase measurement] | characteristics[aspartate aminotransferase measurement] | characteristics[bilirubin measurement] | characteristics[body mass index] | characteristics[fasting blood glucose measurement] | characteristics[low density lipoprotein cholesterol measurement] | characteristics[serum alanine aminotransferase measurement] | characteristics[waist circumference] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | 31 | 31_C6 | Healthy | blood plasma | 54.0 | 30.0 | 15.0 | 27.774423 | 5.07 | 2.1 | 24.0 | 108.0 |
| 1 | 32 | 32 | 32_C7 | Healthy | blood plasma | 27.0 | 28.0 | 17.0 | 28.727377 | 6.09 | 4.3 | 27.0 | 108.0 |
| 2 | 33 | 33 | 33_C8 | Healthy | blood plasma | 69.0 | 21.0 | 9.0 | 28.841532 | 4.93 | 4.1 | 18.0 | 90.0 |
| 3 | 34 | 34 | 34_C9 | Healthy | blood plasma | 101.0 | 26.0 | 12.0 | 42.056933 | 5.33 | 4.8 | 22.0 | 134.0 |
| 4 | 35 | 35 | 35_C10 | Healthy | blood plasma | 61.0 | 25.0 | 8.0 | 29.434851 | 4.80 | 3.9 | 18.0 | 102.0 |
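
SDRF-Proteomics files are tab-delimited, so a dataframe like the one above can be written out as TSV before sharing it. The following is a minimal sketch, not part of CKG: the helper names (unexpected_columns, to_sdrf_tsv) are hypothetical, the small dataframe is a stand-in built from the first two rows of the output above, and the header pattern is a simplification that only covers the column types seen in this example (the SDRF specification defines others).

```python
import io
import re

import pandas as pd

# Stand-in for the dataframe returned by get_sdrf(); values copied from the output above.
sdrf_df = pd.DataFrame(
    {
        "characteristics[individual]": [31, 32],
        "source name": [31, 32],
        "comment[data file]": ["31_C6", "32_C7"],
        "characteristics[phenotype]": ["Healthy", "Healthy"],
        "characteristics[organism part]": ["blood plasma", "blood plasma"],
        "characteristics[body mass index]": [27.774423, 28.727377],
    }
)

# Assumption: accept only "source name" or a bracketed characteristics/comment field.
SDRF_COLUMN = re.compile(r"^(source name|(characteristics|comment)\[[^\[\]]+\])$")


def unexpected_columns(df):
    """Return the column names that do not match the simplified pattern above."""
    return [col for col in df.columns if not SDRF_COLUMN.match(col)]


def to_sdrf_tsv(df):
    """Serialize the dataframe as tab-separated text without the pandas index."""
    buffer = io.StringIO()
    df.to_csv(buffer, sep="\t", index=False)
    return buffer.getvalue()


tsv_text = to_sdrf_tsv(sdrf_df)
print(unexpected_columns(sdrf_df))
print(tsv_text.splitlines()[0])
```

Dropping the pandas index matters here because the row numbers in the displayed output are not part of the SDRF table.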

Convert SDRF to CKG

Conversely, we can convert an SDRF dataframe to the CKG standard using the function convert_sdrf_to_ckg(). The resulting dataframe can be used as the Clinical data for a project and uploaded through the Data upload app. Note that it also contains the columns required by the Experimental design (subject external_id, biological_sample external_id, analytical_sample external_id and grouping1).

[4]:
df = utils.convert_sdrf_to_ckg(sdrf_df)
df.head()
[4]:
| | subject external_id | biological_sample external_id | analytical_sample external_id | grouping1 | tissue | Alkaline phosphatase measurement (88810008) | Aspartate aminotransferase measurement (45896001) | Bilirubin level (302787001) | Body mass index (60621009) | Fasting blood glucose level (271062006) | Low density lipoprotein cholesterol measurement (113079009) | Alanine aminotransferase measurement (34608000) | Waist circumference (276361009) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | 31 | 31_C6 | Healthy | blood plasma | 54.0 | 30.0 | 15.0 | 27.774423 | 5.07 | 2.1 | 24.0 | 108.0 |
| 1 | 32 | 32 | 32_C7 | Healthy | blood plasma | 27.0 | 28.0 | 17.0 | 28.727377 | 6.09 | 4.3 | 27.0 | 108.0 |
| 2 | 33 | 33 | 33_C8 | Healthy | blood plasma | 69.0 | 21.0 | 9.0 | 28.841532 | 4.93 | 4.1 | 18.0 | 90.0 |
| 3 | 34 | 34 | 34_C9 | Healthy | blood plasma | 101.0 | 26.0 | 12.0 | 42.056933 | 5.33 | 4.8 | 22.0 | 134.0 |
| 4 | 35 | 35 | 35_C10 | Healthy | blood plasma | 61.0 | 25.0 | 8.0 | 29.434851 | 4.80 | 3.9 | 18.0 | 102.0 |
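
Comparing the two outputs above suggests how the first five SDRF columns correspond to the CKG experimental design columns. The sketch below illustrates that correspondence with a plain pandas rename. It is not CKG's implementation: the mapping is inferred from the displayed headers only, the function name rename_design_columns is hypothetical, and the sketch does not reproduce the translation of clinical variables into coded names such as "Body mass index (60621009)".

```python
import pandas as pd

# Assumption: mapping inferred from the displayed outputs, not taken from CKG's source code.
SDRF_TO_CKG = {
    "characteristics[individual]": "subject external_id",
    "source name": "biological_sample external_id",
    "comment[data file]": "analytical_sample external_id",
    "characteristics[phenotype]": "grouping1",
    "characteristics[organism part]": "tissue",
}

# Columns the Experimental design requires, as listed in the text above.
REQUIRED = [
    "subject external_id",
    "biological_sample external_id",
    "analytical_sample external_id",
    "grouping1",
]


def rename_design_columns(sdrf):
    """Rename SDRF design columns and fail early if a required one is missing."""
    renamed = sdrf.rename(columns=SDRF_TO_CKG)
    missing = [col for col in REQUIRED if col not in renamed.columns]
    if missing:
        raise ValueError(f"Missing experimental design columns: {missing}")
    return renamed


# Stand-in for an SDRF dataframe; values copied from the first two rows shown earlier.
example = pd.DataFrame(
    {
        "characteristics[individual]": [31, 32],
        "source name": [31, 32],
        "comment[data file]": ["31_C6", "32_C7"],
        "characteristics[phenotype]": ["Healthy", "Healthy"],
        "characteristics[organism part]": ["blood plasma", "blood plasma"],
    }
)

design = rename_design_columns(example)
print(list(design.columns))
```

Checking for the required columns before upload gives a clearer error than discovering a missing identifier later in the Data upload app.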

Converting CKG’s Clinical data format to SDRF

If we already have a dataframe in CKG’s clinical data format, we can convert it to SDRF using the function convert_ckg_clinical_to_sdrf().

[5]:
sdrf_df = utils.convert_ckg_clinical_to_sdrf(df)
sdrf_df.head()
[5]:
| | subject external_id | biological_sample external_id | analytical_sample external_id | characteristics[phenotype] | characteristics[organism part] | characteristic[alkaline phosphatase measurement] | characteristic[aspartate aminotransferase measurement] | characteristic[bilirubin measurement] | characteristic[body mass index] | characteristic[fasting blood glucose measurement] | characteristic[low density lipoprotein cholesterol measurement] | characteristic[serum alanine aminotransferase measurement] | characteristic[waist circumference] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | 31 | 31_C6 | Healthy | blood plasma | 54.0 | 30.0 | 15.0 | 27.774423 | 5.07 | 2.1 | 24.0 | 108.0 |
| 1 | 32 | 32 | 32_C7 | Healthy | blood plasma | 27.0 | 28.0 | 17.0 | 28.727377 | 6.09 | 4.3 | 27.0 | 108.0 |
| 2 | 33 | 33 | 33_C8 | Healthy | blood plasma | 69.0 | 21.0 | 9.0 | 28.841532 | 4.93 | 4.1 | 18.0 | 90.0 |
| 3 | 34 | 34 | 34_C9 | Healthy | blood plasma | 101.0 | 26.0 | 12.0 | 42.056933 | 5.33 | 4.8 | 22.0 | 134.0 |
| 4 | 35 | 35 | 35_C10 | Healthy | blood plasma | 61.0 | 25.0 | 8.0 | 29.434851 | 4.80 | 3.9 | 18.0 | 102.0 |
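
The headers in this output are not identical to those returned by get_sdrf() earlier: the identifier columns keep their CKG names, and the clinical variables use the prefix characteristic[ rather than characteristics[. A small comparison like the one below makes such differences explicit before a file is shared. This is only a sketch: compare_headers is a hypothetical helper, and the two lists contain just the first six headers copied from each output.

```python
# First six headers copied from the get_sdrf() output shown earlier.
from_get_sdrf = [
    "characteristics[individual]",
    "source name",
    "comment[data file]",
    "characteristics[phenotype]",
    "characteristics[organism part]",
    "characteristics[alkaline phosphatase measurement]",
]

# First six headers copied from the convert_ckg_clinical_to_sdrf() output above.
from_clinical = [
    "subject external_id",
    "biological_sample external_id",
    "analytical_sample external_id",
    "characteristics[phenotype]",
    "characteristics[organism part]",
    "characteristic[alkaline phosphatase measurement]",
]


def compare_headers(left, right):
    """Split headers into those shared by both lists and those unique to each."""
    left_set, right_set = set(left), set(right)
    return {
        "shared": [h for h in left if h in right_set],
        "only_left": [h for h in left if h not in right_set],
        "only_right": [h for h in right if h not in left_set],
    }


report = compare_headers(from_get_sdrf, from_clinical)
for key, headers in report.items():
    print(key, headers)
```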