Analytics¶

analytics.py¶

wgcnaAnalysis.py¶

get_data(data, drop_cols_exp=['subject', 'group', 'sample', 'index'], drop_cols_cli=['subject', 'group', 'biological_sample', 'index'], sd_cutoff=0)[source]¶

This function cleans up and formats experimental and clinical data into similarly shaped dataframes.

Parameters

data (dict) – dictionary with processed clinical and proteomics datasets.
drop_cols_exp (list) – list of columns to drop from processed experimental (proteomics/rna-seq/dna-seq) dataframe.
drop_cols_cli (list) – list of columns to drop from processed clinical dataframe.

Returns

Dictionary with experimental and clinical dataframes (keys are the same as in the input dictionary).

get_dendrogram(df, labels, distfun='euclidean', linkagefun='ward', div_clusters=False, fcluster_method='distance', fcluster_cutoff=15)[source]¶

This function calculates the distance matrix and performs hierarchical cluster analysis on the resulting set of dissimilarities.

Parameters

df – pandas dataframe with samples/subjects as index and features as columns.
labels (list) – labels for the leaves of the tree.
distfun (str) – distance measure to be used (‘euclidean’, ‘maximum’, ‘manhattan’, ‘canberra’, ‘binary’, ‘minkowski’ or ‘jaccard’).
linkagefun (str) – hierarchical/agglomeration method to be used (‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’ or ‘ward’).
div_clusters (bool) – dividing dendrogram leaves into clusters (True or False).
fcluster_method (str) – criterion to use in forming flat clusters.
fcluster_cutoff (int) – maximum cophenetic distance between observations in each cluster.

Returns

Dictionary of data structures computed to render the dendrogram. Keys: ‘icoords’, ‘dcoords’, ‘ivl’ and ‘leaves’. If div_clusters is used, it will also return a dictionary of each cluster and respective leaves.
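
The distance, linkage and dendrogram steps described above can be sketched directly with SciPy. This is a minimal illustration on hypothetical toy data, not the implementation of get_dendrogram; note that SciPy itself names the coordinate keys icoord and dcoord.

```python
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Hypothetical toy data: four samples (index) by three features (columns).
df = pd.DataFrame(
    [[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [8.0, 9.0, 7.5], [8.2, 9.1, 7.4]],
    index=["s1", "s2", "s3", "s4"],
    columns=["f1", "f2", "f3"],
)

# Condensed vector of pairwise Euclidean distances between samples.
distances = pdist(df.values, metric="euclidean")
# Agglomerative clustering with Ward linkage.
linkage_matrix = linkage(distances, method="ward")
# no_plot=True returns only the data structures needed to draw the tree.
Z_dendrogram = dendrogram(linkage_matrix, labels=df.index.tolist(), no_plot=True)

print(Z_dendrogram["ivl"])  # leaf labels in plotting order
```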

get_clusters_elements(linkage_matrix, fcluster_method, fcluster_cutoff, labels)[source]¶

This function implements the generation of flat clusters from a hierarchical clustering with the same interface as scipy.cluster.hierarchy.fcluster.

Parameters

linkage_matrix (ndarray) – hierarchical clustering encoded with a linkage matrix.
fcluster_method (str) – criterion to use in forming flat clusters (‘inconsistent’, ‘distance’, ‘maxclust’, ‘monocrit’, ‘maxclust_monocrit’).
fcluster_cutoff (float) – maximum cophenetic distance between observations in each cluster.
labels (list) – labels for the leaves of the dendrogram.

Returns

A dictionary where keys are the cluster numbers and values are the dendrogram leaves.
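
A minimal sketch of how such a dictionary could be built from scipy.cluster.hierarchy.fcluster. The helper name clusters_from_linkage and the toy points are hypothetical; the actual function may differ in detail.

```python
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def clusters_from_linkage(linkage_matrix, criterion, cutoff, labels):
    """Group leaf labels by the flat cluster that fcluster assigns to them."""
    assignments = fcluster(linkage_matrix, t=cutoff, criterion=criterion)
    clusters = defaultdict(list)
    for label, cluster_id in zip(labels, assignments):
        clusters[int(cluster_id)].append(label)
    return dict(clusters)

# Hypothetical toy data: two tight pairs of points and one isolated point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
labels = ["a", "b", "c", "d", "e"]
linkage_matrix = linkage(points, method="average")

clusters = clusters_from_linkage(linkage_matrix, "distance", 1.0, labels)
```

With the 'distance' criterion, leaves end up in the same cluster only if their cophenetic distance does not exceed the cutoff.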

filter_df_by_cluster(df, clusters, number)[source]¶

Select only the members of a defined cluster.

Parameters

df – pandas dataframe with samples/subjects as index and features as columns.
clusters (dict) – clusters dictionary from get_dendrogram function if div_clusters option was True.
number (int) – cluster number (key).

Returns

Pandas dataframe with all the features (columns) and samples/subjects belonging to the defined cluster (index).
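
As an illustration only, the selection amounts to an index lookup. The helper name select_cluster and the toy data are hypothetical.

```python
import pandas as pd

def select_cluster(df, clusters, number):
    """Keep only the rows whose index label belongs to cluster `number`."""
    return df.loc[df.index.isin(clusters[number])]

df = pd.DataFrame(
    {"f1": [1, 2, 3, 4], "f2": [10, 20, 30, 40]},
    index=["s1", "s2", "s3", "s4"],
)
clusters = {1: ["s1", "s2"], 2: ["s3", "s4"]}

subset = select_cluster(df, clusters, 2)
```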

df_sort_by_dendrogram(df, Z_dendrogram)[source]¶

Reorders pandas dataframe by index and according to the dendrogram list of leaf nodes labels.

Parameters

df – pandas dataframe with the labels to be reordered as index.
Z_dendrogram (dict) – dictionary of data structures computed to render the dendrogram. Keys: ‘icoords’, ‘dcoords’, ‘ivl’ and ‘leaves’.

Returns

Reordered pandas dataframe.
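
A minimal sketch of the reordering, assuming the dataframe index holds exactly the labels listed under 'ivl'. The toy dictionary below is hypothetical and carries only the keys the example needs.

```python
import pandas as pd

df = pd.DataFrame({"f1": [1.0, 2.0, 3.0]}, index=["s1", "s2", "s3"])
# Hypothetical dendrogram output, reduced to the leaf ordering.
Z_dendrogram = {"ivl": ["s3", "s1", "s2"], "leaves": [2, 0, 1]}

# Reorder the rows to follow the dendrogram leaf labels.
reordered = df.reindex(Z_dendrogram["ivl"])
```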

get_percentiles_heatmap(df, Z_dendrogram, bydendro=True, bycols=False)[source]¶

This function transforms the absolute values in each row or column (option ‘bycols’) into relative values.

Parameters

df – pandas dataframe with samples/subjects as index and features as columns.
Z_dendrogram (dict) – dictionary of data structures computed to render the dendrogram. Keys: ‘icoords’, ‘dcoords’, ‘ivl’ and ‘leaves’.
bydendro (bool) – if labels should be ordered according to dendrogram list of leaf nodes labels set to True, otherwise set to False.
bycols (bool) – if False, relative values are calculated across rows (samples); if True, they are calculated across columns (features).

Returns

Pandas dataframe.

get_miss_values_df(data)[source]¶

Processes pandas dataframe so missing values can be plotted in heatmap with specific color.

Parameters: data – pandas dataframe.
Returns: Pandas dataframe with missing values as integer 1, and originally valid values as NaN.
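
One way to obtain such a mask with pandas is sketched below on hypothetical toy data. Because NaN forces a floating-point dtype, the marker is stored as 1.0 in this sketch.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {"f1": [1.5, np.nan], "f2": [np.nan, 2.0]},
    index=["s1", "s2"],
)

is_missing = data.isna()
# 1.0 where the original value was missing, NaN where it was valid.
missing_mask = is_missing.astype(float).where(is_missing)
```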

paste_matrices(matrix1, matrix2, rows, cols)[source]¶

Takes two matrices with analogous shapes and concatenates each value in matrix 1 with the corresponding one in matrix 2, returning a single pandas dataframe.

Parameters

matrix1 (ndarray) – input 1
matrix2 (ndarray) – input 2

Returns

Pandas dataframe.
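
The element-wise pasting can be sketched as follows. The helper name paste_values and the text template (value on one line, second value in parentheses below it) are assumptions made for illustration; the format used by paste_matrices is not specified here.

```python
import numpy as np
import pandas as pd

def paste_values(matrix1, matrix2, rows, cols, template="{:.2f}\n({:.3f})"):
    """Combine two equally shaped matrices into one dataframe of text cells."""
    a = np.asarray(matrix1, dtype=float)
    b = np.asarray(matrix2, dtype=float)
    if a.shape != b.shape:
        raise ValueError("matrix1 and matrix2 must have the same shape")
    cells = [template.format(x, y) for x, y in zip(a.ravel(), b.ravel())]
    text = np.array(cells, dtype=object).reshape(a.shape)
    return pd.DataFrame(text, index=rows, columns=cols)

# Hypothetical correlations and p-values for two modules and two traits.
correlations = np.array([[0.91, -0.12], [0.05, 0.77]])
pvalues = np.array([[0.001, 0.6], [0.8, 0.02]])
annotation = paste_values(
    correlations, pvalues, rows=["MEblue", "MEbrown"], cols=["age", "bmi"]
)
```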

cutreeDynamic(distmatrix, linkagefun='average', minModuleSize=50, method='hybrid', deepSplit=2, pamRespectsDendro=False, distfun=None)[source]¶

This function implements the R cutreeDynamic wrapper in Python, providing an access point for methods of adaptive branch pruning of hierarchical clustering dendrograms.

Parameters

distmatrix – pandas dataframe.
distfun (str) – distance measure to be used (‘euclidean’, ‘maximum’, ‘manhattan’, ‘canberra’, ‘binary’, ‘minkowski’ or ‘jaccard’).
linkagefun (str) – hierarchical/agglomeration method to be used (‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’ or ‘ward’).
minModuleSize (int) – minimum module size.
method (str) – method to use (‘hybrid’ or ‘tree’).
deepSplit (int) – provides a rough control over sensitivity to cluster splitting, the higher the value (with ‘hybrid’ method) or if True (with ‘tree’ method), the more and smaller modules.
pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.

Returns

Numpy array of numerical labels giving assignment of objects to modules. Unassigned objects are labeled 0, the largest module has label 1, next largest 2 etc.

build_network(data, softPower=6, networkType='unsigned', linkagefun='average', method='hybrid', minModuleSize=50, deepSplit=2, pamRespectsDendro=False, merge_modules=True, MEDissThres=0.4, verbose=0)[source]¶

Weighted gene network construction and module detection. Calculates co-expression similarity and adjacency, topological overlap matrix (TOM) and clusters features in modules.

Parameters

data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
softPower (int) – soft-thresholding power.
networkType (str) – network type (‘unsigned’, ‘signed’, ‘signed hybrid’, ‘distance’).
linkagefun (str) – hierarchical/agglomeration method to be used (‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’ or ‘ward’).
method (str) – method to use (‘hybrid’ or ‘tree’).
minModuleSize (int) – minimum module size.
pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.
merge_modules (bool) – if True, very similar modules are merged.
MEDissThres (float) – maximum dissimilarity (i.e., 1-correlation) that qualifies modules for merging.
verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.

deepSplit (int) – provides a rough control over sensitivity to cluster splitting, the higher the value (with ‘hybrid’ method) or if True (with ‘tree’ method), the more and smaller modules.

Returns

Tuple with TOM dissimilarity pandas dataframe, numpy array with module colors per experimental feature.
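
The adjacency and TOM steps can be sketched in NumPy for the 'unsigned' case. This is an illustration under stated assumptions (Pearson correlation, no missing values, hypothetical helper name tom_dissimilarity), not the wrapped R implementation, and it omits the clustering into modules.

```python
import numpy as np
import pandas as pd

def tom_dissimilarity(data, soft_power=6):
    """TOM-based dissimilarity between the columns (features) of `data`."""
    # Unsigned adjacency: |Pearson correlation| raised to the soft power.
    adjacency = np.abs(np.corrcoef(data.values, rowvar=False)) ** soft_power
    np.fill_diagonal(adjacency, 0.0)        # no self-connections
    shared = adjacency @ adjacency          # l_ij = sum_u a_iu * a_uj
    connectivity = adjacency.sum(axis=1)    # k_i = sum_u a_iu
    min_k = np.minimum.outer(connectivity, connectivity)
    tom = (shared + adjacency) / (min_k + 1.0 - adjacency)
    np.fill_diagonal(tom, 1.0)              # each feature fully overlaps itself
    return pd.DataFrame(1.0 - tom, index=data.columns, columns=data.columns)

# Hypothetical toy data: 20 samples (rows) by 5 features (columns).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(20, 5)), columns=list("abcde"))
diss_tom = tom_dissimilarity(data, soft_power=6)
```

Raising the correlation to a power suppresses weak correlations, and the topological overlap then rewards pairs of features that share many strong neighbours.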

pick_softThreshold(data, RsquaredCut=0.8, networkType='unsigned', verbose=0)[source]¶

Analysis of scale free topology for multiple soft thresholding powers. Aids the user in choosing a proper soft-thresholding power for network construction.

Parameters

data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
RsquaredCut (float) – desired minimum scale free topology fitting index R^2.
networkType (str) – network type (‘unsigned’, ‘signed’, ‘signed hybrid’, ‘distance’).
verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.

Returns

Estimated appropriate soft-thresholding power: the lowest power for which the scale free topology fit R^2 exceeds RsquaredCut.

Return type

int
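
The selection rule can be sketched as follows. This is a simplified illustration: the binning of connectivities, the candidate powers 1 to 20, the helper names, and returning None when no power qualifies are all assumptions, and the R routine being wrapped may differ in these details.

```python
import numpy as np

def scale_free_fit(connectivity, n_bins=10):
    """Signed R^2 of a linear fit of log10 p(k) against log10 k."""
    counts, edges = np.histogram(connectivity, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    freq = counts / counts.sum()
    x = np.log10(centers)
    y = np.log10(freq + 1e-9)  # small offset avoids log10(0) for empty bins
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    r_squared = 1.0 - residuals.var() / y.var()
    # A scale-free network has a negative slope, which yields a positive value.
    return -np.sign(slope) * r_squared

def pick_power(data, powers=range(1, 21), r2_cut=0.8):
    """Lowest power whose signed fit exceeds r2_cut, or None if none does."""
    abs_cor = np.abs(np.corrcoef(np.asarray(data, dtype=float), rowvar=False))
    np.fill_diagonal(abs_cor, 0.0)  # exclude self-connections
    for power in powers:
        connectivity = (abs_cor ** power).sum(axis=0)
        if scale_free_fit(connectivity) > r2_cut:
            return int(power)
    return None

# Hypothetical toy data: 30 samples (rows) by 40 features (columns).
rng = np.random.default_rng(0)
toy = rng.normal(size=(30, 40))
power = pick_power(toy)
```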

identify_module_colors(matrix, linkagefun='average', method='hybrid', minModuleSize=30, deepSplit=2, pamRespectsDendro=False)[source]¶

Identifies co-expression modules and converts the numeric labels into colors.

Parameters

matrix – dissimilarity structure as produced by R.stats dist.
minModuleSize (int) – minimum module size.
deepSplit (int) – provides a rough control over sensitivity to cluster splitting, the higher the value (with ‘hybrid’ method) or if True (with ‘tree’ method), the more and smaller modules.
pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.

Returns

Numpy array of strings with module color of each experimental feature.
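
The conversion from numeric labels to colors can be sketched as a lookup. The helper name and the palette are assumptions for illustration: the palette is assumed to match the start of the standard WGCNA color sequence, with label 0 (unassigned) shown as grey.

```python
import numpy as np

# Assumed palette; label 1 takes the first color, label 2 the second, and so on.
PALETTE = ["turquoise", "blue", "brown", "yellow", "green", "red", "black", "pink"]

def labels_to_colors(labels, palette=PALETTE, unassigned="grey"):
    """Map numeric module labels to color names (0 means unassigned)."""
    labels = np.asarray(labels, dtype=int)
    if labels.min(initial=0) < 0:
        raise ValueError("labels must be non-negative")
    if labels.max(initial=0) > len(palette):
        raise ValueError("more modules than colors in the palette")
    lookup = np.array([unassigned] + list(palette))
    return lookup[labels]

module_colors = labels_to_colors([1, 1, 2, 0, 3, 2])
```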

calculate_module_eigengenes(data, modColors, softPower=6, dissimilarity=True)[source]¶

Calculates module eigengenes to quantify co-expression similarity of entire modules.

Parameters

data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.
softPower (int) – soft-thresholding power.
dissimilarity (bool) – if True, also calculates the dissimilarity of module eigengenes.

Returns

Pandas dataframe with calculated module eigengenes. If dissimilarity is set to True, returns a tuple with two pandas dataframes, the first with the module eigengenes and the second with the eigengenes dissimilarity.
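
A module eigengene is commonly defined as the first principal component of the standardized expression of the features in a module. The sketch below illustrates that definition with NumPy; the helper name is hypothetical, the softPower argument is ignored, and missing values are assumed to be absent.

```python
import numpy as np
import pandas as pd

def module_eigengenes(data, mod_colors):
    """First principal component per module, plus 1 - correlation between them."""
    mod_colors = np.asarray(mod_colors)
    eigengenes = {}
    for color in np.unique(mod_colors):
        block = data.loc[:, mod_colors == color]
        # Standardize each feature so that none dominates the component.
        scaled = (block - block.mean()) / block.std(ddof=1)
        u, _, _ = np.linalg.svd(scaled.values, full_matrices=False)
        eigengene = u[:, 0]  # one value per sample
        # Orient the eigengene to correlate positively with the module average.
        if np.corrcoef(eigengene, scaled.mean(axis=1))[0, 1] < 0:
            eigengene = -eigengene
        eigengenes["ME" + str(color)] = eigengene
    MEs = pd.DataFrame(eigengenes, index=data.index)
    ME_diss = 1.0 - MEs.corr()
    return MEs, ME_diss

# Hypothetical toy data: 10 samples by 6 features in two modules.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(10, 6)), columns=list("abcdef"))
MEs, ME_diss = module_eigengenes(data, ["blue"] * 3 + ["brown"] * 3)
```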

merge_similar_modules(data, modColors, MEDissThres=0.4, verbose=0)[source]¶

Merges modules in co-expression network that are too close as measured by the correlation of their eigengenes.

Parameters

data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.
verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.

MEDissThres (float) – maximum dissimilarity (i.e., 1-correlation) that qualifies modules for merging.

Returns

Tuple containing pandas dataframe with eigengenes of the new merged modules, and array with module colors of each experimental feature.
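
The merging criterion can be sketched by clustering the eigengenes on 1 - correlation and cutting the tree at the threshold. This is an illustration only: the helper name, the average linkage, and the rule that a merged module takes the name of its first member are assumptions, and the sketch returns only the merged colors without recomputing eigengenes.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def merge_close_modules(MEs, mod_colors, threshold=0.4):
    """Relabel features so that modules with similar eigengenes share a color."""
    diss = 1.0 - MEs.corr().values
    np.fill_diagonal(diss, 0.0)  # remove floating-point noise on the diagonal
    tree = linkage(squareform(diss, checks=False), method="average")
    groups = fcluster(tree, t=threshold, criterion="distance")
    names = [c[2:] if c.startswith("ME") else c for c in MEs.columns]
    representative, rename = {}, {}
    for name, group in zip(names, groups):
        representative.setdefault(group, name)  # first member names the group
        rename[name] = representative[group]
    return np.array([rename.get(color, color) for color in mod_colors])

# Hypothetical eigengenes: blue and brown nearly identical, red unrelated.
rng = np.random.default_rng(1)
base = rng.normal(size=50)
MEs = pd.DataFrame({
    "MEblue": base,
    "MEbrown": base + 0.05 * rng.normal(size=50),
    "MEred": rng.normal(size=50),
})
merged = merge_close_modules(MEs, ["blue", "brown", "red", "brown"], threshold=0.4)
```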

calculate_ModuleTrait_correlation(df_exp, df_traits, MEs)[source]¶

Correlates eigengenes with external traits in order to identify the most significant module-trait associations.

Parameters

df_exp – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
df_traits – pandas dataframe containing clinical data, with samples/subjects as rows and clinical traits as columns.
MEs – pandas dataframe with module eigengenes.

Returns

Tuple with two pandas dataframes, first the correlation between all module eigengenes and all clinical traits, second a dataframe with concatenated correlation and p-value used for heatmap annotation.

calculate_ModuleMembership(data, MEs)[source]¶

For each module, calculates the correlation of the module eigengene and the feature expression profile (quantitative measure of module membership (MM)).

Parameters

data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
MEs – pandas dataframe with module eigengenes.

Returns

Tuple with two pandas dataframes, one with module membership correlations and another with p-values.

calculate_FeatureTraitSignificance(df_exp, df_traits)[source]¶

Quantifies associations of individual experimental features with the measured clinical traits, by defining Feature Significance (FS) as the absolute value of the correlation between the feature and the trait.

Parameters

df_exp – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
df_traits – pandas dataframe containing clinical data, with samples/subjects as rows and clinical traits as columns.

Returns

Tuple with two pandas dataframes, one with feature significance correlations and another with p-values.
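
Feature significance as defined above can be sketched in vectorized form, with p-values taken from the Student t distribution of a Pearson correlation. The helper name is hypothetical, and the sketch assumes matching sample indices and no missing values.

```python
import numpy as np
import pandas as pd
from scipy import stats

def feature_trait_significance(df_exp, df_traits):
    """Absolute Pearson correlation and p-value for every feature/trait pair."""
    n = len(df_exp)
    x = (df_exp - df_exp.mean()) / df_exp.std(ddof=0)
    y = (df_traits - df_traits.mean()) / df_traits.std(ddof=0)
    cor = x.T.dot(y) / n  # features x traits matrix of correlations
    r = cor.values
    t_stat = r * np.sqrt((n - 2) / np.clip(1.0 - r ** 2, 1e-12, None))
    pvalues = 2.0 * stats.t.sf(np.abs(t_stat), df=n - 2)
    significance = cor.abs()
    return significance, pd.DataFrame(pvalues, index=cor.index, columns=cor.columns)

# Hypothetical toy data: 12 samples, 3 features, 2 traits.
rng = np.random.default_rng(0)
samples = [f"s{i}" for i in range(12)]
df_exp = pd.DataFrame(rng.normal(size=(12, 3)), index=samples, columns=["f1", "f2", "f3"])
df_traits = pd.DataFrame(rng.normal(size=(12, 2)), index=samples, columns=["age", "bmi"])
FS, FS_pvalue = feature_trait_significance(df_exp, df_traits)
```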

get_FeaturesPerModule(data, modColors, mode='dictionary')[source]¶

Groups all experimental features by the co-expression module they belong to.

Parameters

data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.
mode (str) – type of the value returned by the function (‘dictionary’ or ‘dataframe’).

Returns

Depending on selected mode, returns a dictionary or dataframe with module color per experimental feature.
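
A minimal sketch of the grouping in both modes. The helper name and the column names 'name' and 'modColor' are hypothetical choices for this example.

```python
import pandas as pd

def features_per_module(data, mod_colors, mode="dictionary"):
    """Pair each feature (column of `data`) with its module color."""
    table = pd.DataFrame({"name": data.columns, "modColor": list(mod_colors)})
    if mode == "dataframe":
        return table
    # Dictionary mode: module color -> list of features in that module.
    return table.groupby("modColor")["name"].apply(list).to_dict()

data = pd.DataFrame({"p1": [1.0, 2.0], "p2": [3.0, 4.0], "p3": [5.0, 6.0]})
by_module = features_per_module(data, ["blue", "brown", "blue"])
as_table = features_per_module(data, ["blue", "brown", "blue"], mode="dataframe")
```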

get_ModuleFeatures(data, modColors, modules=[])[source]¶

Groups and returns a list of the experimental features clustered in specific co-expression modules.

Parameters

data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.
modules (list) – list of module colors of interest.

Returns

List of lists with experimental features in each selected module.

get_EigengenesTrait_correlation(MEs, data)[source]¶

Eigengenes are used as representative profiles of the co-expression modules, and correlation between them is used to quantify module similarity. Clinical traits are added to the eigengenes to see how the traits fit into the eigengene network.

Parameters

MEs – pandas dataframe with module eigengenes.
data – pandas dataframe containing clinical data, with samples/subjects as rows and clinical traits as columns.

Returns

Tuple with two pandas dataframes, one with the module eigengene dissimilarity recalculated with the clinical traits included, and another with all the overall correlations.
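
The idea of placing traits in the eigengene network can be sketched by correlating the combined table of eigengenes and traits. The helper name is hypothetical and the sketch assumes numeric traits and a shared sample index; the actual function may process the inputs differently.

```python
import numpy as np
import pandas as pd

def eigengene_trait_network(MEs, traits):
    """Correlation and 1 - correlation among eigengenes and traits together."""
    combined = MEs.join(traits, how="inner")  # align on shared samples
    correlation = combined.corr()
    dissimilarity = 1.0 - correlation
    return dissimilarity, correlation

# Hypothetical toy data for eight samples.
rng = np.random.default_rng(0)
samples = [f"s{i}" for i in range(8)]
MEs = pd.DataFrame(rng.normal(size=(8, 2)), index=samples, columns=["MEblue", "MEbrown"])
traits = pd.DataFrame(rng.normal(size=(8, 1)), index=samples, columns=["age"])
dissimilarity, correlation = eigengene_trait_network(MEs, traits)
```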

kaplan_meierAnalysis.py¶

get_data_ready_for_km(dfs_dict, args)[source]¶

group_data_based_on_marker(df, marker, index_col, how, value)[source]¶

run_km(data, time_col, event_col, group_col, args={})[source]¶

get_km_results(df, group_col, time_col, event_col)[source]¶

get_hazard_ratio_results(df, group_col, time_col, event_col)[source]¶