Batch correction¶
The Clinical Knowledge Graph (CKG) can be used to remove batch effects in the data. CKG uses the library pyComBat (https://github.com/epigenelabs/pyComBat), which uses empirical Bayes frameworks for adjusting data for batch effects.
Batch effects can arise from pre-analytical conditions such as sampling technique or storage conditions or they can arise during the experimental procedure itself, such as different plates, dates, ms-machine, column or lc.
When an experiment is uploaded to CKG, the batch of each sample can be specified in the experimental design file column: batch
. At the time of generating the report, you can choose to correct any batch effect by indicating in the configuration args two parameters:
batch_correction
(bool): whether or not to run the correctionbatch_col
(str): what column defines the batch (generally, batch)
CKG’s analytical pipeline will take these parameters into account and run batch correction accordingly.
Here, we exemplify how to use CKG analytic_core functionality to correct the batch effect from 2 different cohorts of COVID-19 plasma samples that were combined in this study. The files were downloaded from PRIDE - identifiers:
PXD020580
PXD023016
For convenience, files are provided in the tmp folder (tmp/batch_correction) together with a generated experimental design file with what we could extract from the metadata. There can be differences with the actual study due to mapping issues but it is a good example of how batch effect is corrected in CKG.
To upload these files in CKG:
Create a new project
In the Data upload page, upload the experimental design (contains all the samples) and one of the proteomics files (i.e PXD020580)
Once these files are uploaded, upload the second proteomics file (i.e PXD023016).
To visualize in a PCA how the batch effect correction is removed from the proceesed dataset, you can add in the analysis configuration a section like this:
uncorrected_pca_batch:
data: uncorrected_batch
analyses:
- pca
plots:
- pca
args:
hovering_cols:
- sample
drop_cols:
- sample
- subject
- group
group: batch
x_title: PC1
y_title: PC2
components: 2
init: 'pca'
width: 1000
height: 700
loadings: 15
factor: 50
title: 'Sample stratification with PCA - uncorrected batch effect - Batch'
Notebook¶
In this notebook, we will load the original dataset from the uploaded project and follow the processing and batch effect correction steps.
[1]:
import os
import pandas as pd
import ckg.ckg_utils as ckg_utils
from ckg.graphdb_builder import builder_utils
from ckg.graphdb_builder.experiments.parsers import proteomicsParser
from ckg.report_manager import project
from ckg.analytics_core.analytics import analytics
from ckg.analytics_core.viz import viz
from plotly.offline import init_notebook_mode, iplot
%matplotlib inline
init_notebook_mode(connected=True)
c:\users\sande\.conda\envs\pip_rev\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning:
The package pingouin is out of date. Your version is 0.3.11, the latest is 0.3.12.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.
WGCNA functions will not work. Module Rpy2 not installed.
R functions will not work. Module Rpy2 not installed.
The results from the CKG app can be visualized and further analyzed here¶
[14]:
p = project.Project(identifier='P0000017', configuration_files={}, datasets={}, knowledge=None, report={})
[15]:
p.build_project(force=False)
p.generate_report()
[16]:
original_data = p.get_dataset("proteomics").get_dataframe("original")
[17]:
original_data.head()
[17]:
LFQ_intensity | batch | group | identifier | name | sample | subject | |
---|---|---|---|---|---|---|---|
0 | 24.032210 | 2 | Negative | P04217 | A1BG | 39pool | PXD020580_68 |
1 | 22.818177 | 2 | Negative | P04217 | A1BG | 173_pool | PXD020580_11 |
2 | 21.312409 | 2 | Negative | P04217 | A1BG | 11_pool | PXD020580_37 |
3 | 22.561322 | 2 | Negative | P04217 | A1BG | 30_pool | PXD020580_18 |
4 | 23.135537 | 2 | Negative | P04217 | A1BG | 198_pool | PXD020580_165 |
[18]:
processed_data = analytics.get_proteomics_measurements_ready(original_data,
index_cols=['group', 'sample', 'subject', 'batch'],
drop_cols=[],
group='group',
identifier='identifier',
extra_identifier='name',
filter_samples=False,
imputation=True,
imputation_method='distribution',
missing_method='percentage',
missing_per_group=True,
missing_max=0.5,
value_col='LFQ_intensity',
shift=1.8,
nstd=0.3)
[19]:
processed_data.head()
[19]:
identifier | group | sample | subject | batch | A1BG~P04217 | A2M~P01023 | AARS1~P49588 | ABHD12B~Q7Z5M8 | ACBD3~Q9H3P7 | ACO2~Q99798 | ... | XRCC5~P13010 | XRCC6~P12956 | YBX1~P67809 | YWHAB~P31946 | YWHAE~P62258 | YWHAG~P61981 | YWHAH~Q04917 | YWHAQ~P27348 | YWHAZ~P63104 | ZG16B~Q96DA0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Negative | 11_pool | PXD020580_37 | 2 | 21.312409 | 26.548911 | 20.405800 | 20.855230 | 21.046427 | 23.970005 | ... | 20.459354 | 19.064410 | 19.861664 | 22.851252 | 18.866241 | 20.314623 | 21.532212 | 18.000125 | 20.192224 | 26.406718 |
1 | Negative | 12_pool | PXD020580_38 | 2 | 18.947086 | 20.850335 | 18.897968 | 20.118419 | 23.031672 | 20.386911 | ... | 19.522988 | 20.895987 | 19.656218 | 19.898359 | 22.884871 | 19.358429 | 17.088187 | 19.988028 | 25.780417 | 27.887858 |
2 | Negative | 13_pool | PXD020580_39 | 2 | 18.930766 | 19.620407 | 22.170997 | 19.579852 | 22.880057 | 17.551108 | ... | 17.905797 | 20.626525 | 19.705407 | 20.674868 | 19.741435 | 19.140199 | 19.315568 | 20.172576 | 25.787782 | 27.925526 |
3 | Negative | 173_pool | PXD020580_11 | 2 | 22.818177 | 26.481222 | 22.629387 | 20.448801 | 24.477475 | 27.418765 | ... | 22.193099 | 23.641741 | 23.097641 | 27.678553 | 29.066149 | 28.566815 | 23.855478 | 28.147635 | 29.543552 | 22.736469 |
4 | Negative | 175pool | PXD020580_12 | 2 | 17.761723 | 19.701719 | 19.539977 | 18.480560 | 27.456776 | 18.619836 | ... | 18.825366 | 20.068438 | 18.434687 | 20.849923 | 20.595838 | 18.757111 | 20.322045 | 19.980570 | 23.856714 | 27.073165 |
5 rows × 439 columns
[20]:
processed_data_corrected = analytics.combat_batch_correction(data=processed_data, index_cols=['group','sample','subject'], batch_col = 'batch')
Found 2 batches.
Adjusting for 0 covariate(s) or covariate level(s).
Standardizing Data across genes.
Fitting L/S model and finding priors.
Finding parametric adjustments.
Adjusting the Data
c:\users\sande\.conda\envs\pip_rev\lib\site-packages\numpy\core\_asarray.py:83: VisibleDeprecationWarning:
Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
[21]:
processed_data_corrected.head()
[21]:
identifier | group | sample | subject | A1BG~P04217 | A2M~P01023 | AARS1~P49588 | ABHD12B~Q7Z5M8 | ACBD3~Q9H3P7 | ACO2~Q99798 | ACOX1~Q15067 | ... | XRCC5~P13010 | XRCC6~P12956 | YBX1~P67809 | YWHAB~P31946 | YWHAE~P62258 | YWHAG~P61981 | YWHAH~Q04917 | YWHAQ~P27348 | YWHAZ~P63104 | ZG16B~Q96DA0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Negative | 11_pool | PXD020580_37 | 22.679424 | 26.405897 | 19.993245 | 24.080817 | 20.322720 | 22.965131 | 22.374179 | ... | 20.097775 | 18.957149 | 19.292474 | 22.088503 | 18.137226 | 19.858514 | 20.978543 | 17.626513 | 19.568032 | 25.653277 |
1 | Negative | 12_pool | PXD020580_38 | 20.108900 | 20.601108 | 18.573382 | 23.253694 | 22.041474 | 19.848725 | 19.299928 | ... | 19.285468 | 20.596626 | 19.111939 | 19.504370 | 21.697131 | 19.009884 | 17.024859 | 19.434463 | 24.727817 | 27.114564 |
2 | Negative | 13_pool | PXD020580_39 | 20.091164 | 19.348256 | 21.655458 | 22.649113 | 21.910212 | 17.382279 | 18.254541 | ... | 17.882538 | 20.355426 | 19.155163 | 20.183908 | 18.912517 | 18.816204 | 19.006477 | 19.602306 | 24.734618 | 27.151727 |
3 | Negative | 173_pool | PXD020580_11 | 24.315824 | 26.336947 | 22.087105 | 23.624571 | 23.293200 | 25.964700 | 20.581386 | ... | 21.601817 | 23.054399 | 22.136081 | 26.312967 | 27.172820 | 27.182400 | 23.045467 | 26.855436 | 28.202459 | 22.032220 |
4 | Negative | 175pool | PXD020580_12 | 18.820703 | 19.431083 | 19.177935 | 21.415079 | 25.872574 | 18.311807 | 23.261753 | ... | 18.680274 | 19.855873 | 18.038521 | 20.337102 | 19.669390 | 18.476210 | 19.901902 | 19.427681 | 22.951590 | 26.310791 |
5 rows × 438 columns
PCA with Uncorrected Data¶
As it can be seen, the PCA plot shows clear “batch effect” separating both proteomics experiments (batch 1: PXD023016, batch 2: PXD020580).
[22]:
r, nargs = analytics.run_pca(data = processed_data, drop_cols=['sample', 'subject', 'group'], group='batch', annotation_cols=['sample'], components=2, dropna=True)
plot = viz.get_pca_plot(r, identifier='pca', args=nargs)#{'loadings':15, 'title':'PCA Plot', 'x_title':'PC1', 'y_title':'PC2', 'group':'group','height':900, 'width':900})
iplot(plot.figure)
[24]:
r, nargs = analytics.run_pca(data = processed_data, drop_cols=['sample', 'subject', 'batch'], group='group', annotation_cols=['sample', 'batch'], components=2, dropna=True)
plot = viz.get_pca_plot(r, identifier='pca', args=nargs)#{'loadings':15, 'title':'PCA Plot', 'x_title':'PC1', 'y_title':'PC2', 'group':'group','height':900, 'width':900})
iplot(plot.figure)
PCA with Corrected Data¶
The PCA shows that both experiments are now integrated and the batch effect is removed.
[23]:
r, nargs = analytics.run_pca(data = processed_data_corrected, drop_cols=['sample', 'subject'], group='group', annotation_cols=['sample'], components=2, dropna=True)
plot = viz.get_pca_plot(r, identifier='pca', args=nargs)#{'loadings':15, 'title':'PCA Plot', 'x_title':'PC1', 'y_title':'PC2', 'group':'group','height':900, 'width':900})
iplot(plot.figure)
[ ]: