Analytics
analytics.py
wgcnaAnalysis.py
-
get_data
(data, drop_cols_exp=['subject', 'group', 'sample', 'index'], drop_cols_cli=['subject', 'group', 'biological_sample', 'index'], sd_cutoff=0) This function cleans up and formats experimental and clinical data into similarly shaped dataframes.
- Parameters
- Returns
Dictionary with experimental and clinical dataframes (keys are the same as in the input dictionary).
-
get_dendrogram
(df, labels, distfun='euclidean', linkagefun='ward', div_clusters=False, fcluster_method='distance', fcluster_cutoff=15) This function calculates the distance matrix and performs hierarchical cluster analysis on the resulting dissimilarities.
- Parameters
df – pandas dataframe with samples/subjects as index and features as columns.
labels (list) – labels for the leaves of the tree.
distfun (str) – distance measure to be used (‘euclidean’, ‘maximum’, ‘manhattan’, ‘canberra’, ‘binary’, ‘minkowski’ or ‘jaccard’).
linkagefun (str) – hierarchical/agglomeration method to be used (‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’ or ‘ward’).
div_clusters (bool) – whether to divide the dendrogram leaves into clusters (True or False).
fcluster_method (str) – criterion to use in forming flat clusters.
fcluster_cutoff (int) – maximum cophenetic distance between observations in each cluster.
- Returns
Dictionary of data structures computed to render the dendrogram. Keys: ‘icoords’, ‘dcoords’, ‘ivl’ and ‘leaves’. If div_clusters is True, it also returns a dictionary of each cluster and its respective leaves.
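The steps described above can be sketched with SciPy directly. This is a minimal illustration, assuming the function builds on scipy.spatial.distance.pdist and scipy.cluster.hierarchy (the documentation does not confirm this); the helper name dendrogram_sketch and the toy data are hypothetical. Note that SciPy itself names the coordinate keys 'icoord' and 'dcoord'.

```python
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist


def dendrogram_sketch(df, labels, distfun="euclidean", linkagefun="ward"):
    """Hypothetical helper: distance matrix -> linkage -> dendrogram data."""
    distances = pdist(df.values, metric=distfun)   # condensed distance matrix
    Z = linkage(distances, method=linkagefun)      # hierarchical clustering
    # no_plot=True returns the rendering data without drawing anything
    return Z, dendrogram(Z, labels=labels, no_plot=True)


# Toy data: samples as index, features as columns
df = pd.DataFrame(
    [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]],
    index=["s1", "s2", "s3", "s4"],
    columns=["f1", "f2"],
)
Z, dendro = dendrogram_sketch(df, labels=list(df.index))
print(dendro["ivl"])  # leaf labels in dendrogram order
```

The linkage matrix Z is returned alongside the dendrogram data because forming flat clusters (the div_clusters option) needs it.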
-
get_clusters_elements
(linkage_matrix, fcluster_method, fcluster_cutoff, labels) This function generates flat clusters from a hierarchical clustering, with the same interface as scipy.cluster.hierarchy.fcluster.
- Parameters
linkage_matrix (ndarray) – hierarchical clustering encoded with a linkage matrix.
fcluster_method (str) – criterion to use in forming flat clusters (‘inconsistent’, ‘distance’, ‘maxclust’, ‘monocrit’, ‘maxclust_monocrit’).
fcluster_cutoff (float) – maximum cophenetic distance between observations in each cluster.
labels (list) – labels for the leaves of the dendrogram.
- Returns
A dictionary where keys are the cluster numbers and values are the dendrogram leaves.
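A minimal sketch of how such a dictionary could be assembled from scipy.cluster.hierarchy.fcluster. The helper name clusters_elements_sketch and the toy points are hypothetical, and the real function may order or type its keys differently.

```python
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def clusters_elements_sketch(linkage_matrix, fcluster_method, fcluster_cutoff, labels):
    """Hypothetical helper: map each flat cluster number to its leaf labels."""
    assignments = fcluster(linkage_matrix, t=fcluster_cutoff, criterion=fcluster_method)
    clusters = defaultdict(list)
    for label, cluster_id in zip(labels, assignments):
        clusters[int(cluster_id)].append(label)
    return dict(clusters)


# Two well-separated pairs of points, so a distance cutoff of 5 yields two clusters
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
Z = linkage(points, method="ward")
clusters = clusters_elements_sketch(Z, "distance", 5, ["s1", "s2", "s3", "s4"])
print(clusters)
```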
-
filter_df_by_cluster
(df, clusters, number) Select only the members of a defined cluster.
- Parameters
- Returns
Pandas dataframe with all the features (columns) and samples/subjects belonging to the defined cluster (index).
-
df_sort_by_dendrogram
(df, Z_dendrogram) Reorders a pandas dataframe by index, according to the dendrogram's list of leaf node labels.
- Parameters
df – pandas dataframe with the labels to be reordered as index.
Z_dendrogram (dict) – dictionary of data structures computed to render the dendrogram. Keys: ‘icoords’, ‘dcoords’, ‘ivl’ and ‘leaves’.
- Returns
Reordered pandas dataframe.
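As an illustration, the reordering can be expressed with DataFrame.reindex. This sketch assumes the leaf labels stored under 'ivl' match the dataframe index exactly; the helper name and the toy dendrogram dictionary are hypothetical.

```python
import pandas as pd


def sort_by_dendrogram_sketch(df, Z_dendrogram):
    """Hypothetical helper: reorder rows to follow the dendrogram leaf labels."""
    # Assumption: every label in 'ivl' is present in df.index
    return df.reindex(Z_dendrogram["ivl"])


df = pd.DataFrame({"f1": [1, 2, 3]}, index=["a", "b", "c"])
Z_dendrogram = {"ivl": ["c", "a", "b"], "leaves": [2, 0, 1]}  # toy dendrogram data
sorted_df = sort_by_dendrogram_sketch(df, Z_dendrogram)
print(sorted_df)
```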
-
get_percentiles_heatmap
(df, Z_dendrogram, bydendro=True, bycols=False) This function transforms the absolute values in each row or column (option ‘bycols’) into relative values.
- Parameters
df – pandas dataframe with samples/subjects as index and features as columns.
Z_dendrogram (dict) – dictionary of data structures computed to render the dendrogram. Keys: ‘icoords’, ‘dcoords’, ‘ivl’ and ‘leaves’.
bydendro (bool) – set to True to order the labels according to the dendrogram list of leaf node labels, otherwise set to False.
bycols (bool) – set to False to calculate relative values across rows (samples), or to True to calculate them across columns (features).
- Returns
Pandas dataframe.
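One plausible reading of "relative values" is a percentile rank within each row or column. The sketch below takes that reading as an assumption; the actual function may normalise differently, and the helper name percentiles_sketch is hypothetical. Reordering by the dendrogram is left out.

```python
import pandas as pd


def percentiles_sketch(df, bycols=False):
    """Hypothetical helper: percentile rank of each value within its row or column."""
    # axis=1 ranks the values inside each row (sample);
    # axis=0 ranks the values inside each column (feature)
    axis = 0 if bycols else 1
    return df.rank(axis=axis, pct=True)


df = pd.DataFrame(
    [[1.0, 2.0, 4.0], [30.0, 10.0, 20.0]],
    index=["s1", "s2"],
    columns=["f1", "f2", "f3"],
)
by_rows = percentiles_sketch(df)               # relative values within each sample
by_cols = percentiles_sketch(df, bycols=True)  # relative values within each feature
print(by_rows)
```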
-
get_miss_values_df
(data) Processes a pandas dataframe so that missing values can be plotted in a heatmap with a specific color.
- Parameters
data – pandas dataframe.
- Returns
Pandas dataframe with missing values as integer 1, and originally valid values as NaN.
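A minimal sketch of this transformation in pandas. The helper name is hypothetical, and because NaN forces a floating-point dtype, the marker appears as 1.0 here.

```python
import numpy as np
import pandas as pd


def miss_values_sketch(data):
    """Hypothetical helper: 1 where a value was missing, NaN everywhere else."""
    missing = data.isnull()
    # where() keeps entries where the condition is True and sets the rest to NaN
    return missing.astype(float).where(missing)


data = pd.DataFrame({"f1": [1.0, np.nan], "f2": [np.nan, 2.0]}, index=["s1", "s2"])
miss = miss_values_sketch(data)
print(miss)
```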
-
paste_matrices
(matrix1, matrix2, rows, cols) Takes two matrices with analogous shapes and concatenates each value in matrix 1 with the corresponding one in matrix 2, returning a single pandas dataframe.
- Parameters
matrix1 (ndarray) – input 1
matrix2 (ndarray) – input 2
- Returns
Pandas dataframe.
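A sketch of the element-wise pasting, for example to annotate a heatmap with a correlation and its p-value. The output format ("value (value)"), the use of rows and cols as index and column labels, and the helper name are assumptions; the actual function may format the text differently.

```python
import numpy as np
import pandas as pd


def paste_matrices_sketch(matrix1, matrix2, rows, cols):
    """Hypothetical helper: combine two same-shaped matrices into text cells."""
    if matrix1.shape != matrix2.shape:
        raise ValueError("matrices must have the same shape")
    text = [
        [f"{a:.2f} ({b:.3f})" for a, b in zip(row1, row2)]  # assumed format
        for row1, row2 in zip(matrix1, matrix2)
    ]
    return pd.DataFrame(text, index=rows, columns=cols)


correlations = np.array([[0.8, -0.25]])
pvalues = np.array([[0.01, 0.4]])
pasted = paste_matrices_sketch(correlations, pvalues, rows=["ME1"], cols=["age", "bmi"])
print(pasted)
```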
-
cutreeDynamic
(distmatrix, linkagefun='average', minModuleSize=50, method='hybrid', deepSplit=2, pamRespectsDendro=False, distfun=None) This function wraps the R cutreeDynamic function in Python, providing an access point for methods of adaptive branch pruning of hierarchical clustering dendrograms.
- Parameters
data – pandas dataframe.
distfun (str) – distance measure to be used (‘euclidean’, ‘maximum’, ‘manhattan’, ‘canberra’, ‘binary’, ‘minkowski’ or ‘jaccard’).
linkagefun (str) – hierarchical/agglomeration method to be used (‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’ or ‘ward’).
minModuleSize (int) – minimum module size.
method (str) – method to use (‘hybrid’ or ‘tree’).
deepSplit (int) – provides a rough control over sensitivity to cluster splitting, the higher the value (with ‘hybrid’ method) or if True (with ‘tree’ method), the more and smaller modules.
pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.
- Returns
Numpy array of numerical labels giving assignment of objects to modules. Unassigned objects are labeled 0, the largest module has label 1, next largest 2 etc.
-
build_network
(data, softPower=6, networkType='unsigned', linkagefun='average', method='hybrid', minModuleSize=50, deepSplit=2, pamRespectsDendro=False, merge_modules=True, MEDissThres=0.4, verbose=0) Weighted gene network construction and module detection. Calculates co-expression similarity and adjacency, topological overlap matrix (TOM) and clusters features in modules.
- Parameters
data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
softPower (int) – soft-thresholding power.
networkType (str) – network type (‘unsigned’, ‘signed’, ‘signed hybrid’, ‘distance’).
linkagefun (str) – hierarchical/agglomeration method to be used (‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’ or ‘ward’).
method (str) – method to use (‘hybrid’ or ‘tree’).
minModuleSize (int) – minimum module size.
pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.
merge_modules (bool) – if True, very similar modules are merged.
MEDissThres (float) – maximum dissimilarity (i.e., 1-correlation) that qualifies modules for merging.
verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.
deepSplit (int) – provides a rough control over sensitivity to cluster splitting, the higher the value (with ‘hybrid’ method) or if True (with ‘tree’ method), the more and smaller modules.
- Returns
Tuple with the TOM dissimilarity pandas dataframe and a numpy array with module colors per experimental feature.
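The adjacency and TOM steps can be sketched in NumPy for the 'unsigned' network type, using the standard WGCNA definitions: adjacency a_ij = |cor(x_i, x_j)|^softPower and TOM_ij = (sum_u a_iu a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij). This is an illustration only; the documented function handles these steps internally and also covers clustering and module merging, which are omitted here. The helper name and the random toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def tom_dissimilarity_sketch(data, softPower=6):
    """Hypothetical helper: unsigned adjacency -> TOM -> TOM dissimilarity."""
    corr = np.corrcoef(data.values, rowvar=False)   # feature x feature correlation
    adjacency = np.abs(corr) ** softPower           # unsigned soft-thresholded adjacency
    np.fill_diagonal(adjacency, 0.0)                # no self-connections
    shared = adjacency @ adjacency                  # sum_u a_iu * a_uj
    k = adjacency.sum(axis=1)                       # connectivity of each feature
    k_min = np.minimum.outer(k, k)
    tom = (shared + adjacency) / (k_min + 1.0 - adjacency)
    np.fill_diagonal(tom, 1.0)
    return pd.DataFrame(1.0 - tom, index=data.columns, columns=data.columns)


rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(20, 5)), columns=[f"f{i}" for i in range(5)])
diss_tom = tom_dissimilarity_sketch(data)
print(diss_tom.round(3))
```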
-
pick_softThreshold
(data, RsquaredCut=0.8, networkType='unsigned', verbose=0) Analysis of scale free topology for multiple soft thresholding powers. Aids the user in choosing a proper soft-thresholding power for network construction.
- Parameters
data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
RsquaredCut (float) – desired minimum scale free topology fitting index R^2.
networkType (str) – network type (‘unsigned’, ‘signed’, ‘signed hybrid’, ‘distance’).
verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.
- Returns
Estimated appropriate soft-thresholding power: the lowest power for which the scale free topology fit R^2 exceeds RsquaredCut.
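The selection rule can be illustrated with a simplified NumPy sketch: for each candidate power, compute the connectivity of an unsigned network, fit log10 p(k) against log10 k, and return the first power whose signed R^2 exceeds the cutoff. The binning, the candidate powers and both helper names are assumptions made for illustration; the actual fitting procedure may differ in detail.

```python
import numpy as np


def scale_free_fit_sketch(connectivity, n_bins=10):
    """Hypothetical helper: signed R^2 of a linear fit of log10 p(k) on log10 k."""
    counts, edges = np.histogram(connectivity, bins=n_bins)
    keep = counts > 0                      # empty bins cannot be log-transformed
    if keep.sum() < 3:
        return 0.0
    midpoints = (edges[:-1] + edges[1:]) / 2.0
    x = np.log10(midpoints[keep])
    y = np.log10(counts[keep] / counts.sum())
    slope, intercept = np.polyfit(x, y, 1)
    ss_res = np.sum((y - (slope * x + intercept)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    if ss_tot < 1e-12:
        return 0.0
    # A scale-free network has a negative slope, so the sign is flipped
    return float(-np.sign(slope) * (1.0 - ss_res / ss_tot))


def pick_soft_threshold_sketch(data, powers=range(1, 21), RsquaredCut=0.8):
    """Return the lowest power whose fit exceeds RsquaredCut, or None."""
    corr = np.abs(np.corrcoef(data, rowvar=False))  # unsigned similarity
    for power in powers:
        adjacency = corr ** power
        np.fill_diagonal(adjacency, 0.0)
        if scale_free_fit_sketch(adjacency.sum(axis=0)) > RsquaredCut:
            return power
    return None


rng = np.random.default_rng(0)
data = rng.normal(size=(30, 40))            # samples x features, toy data
power = pick_soft_threshold_sketch(data)
fit = scale_free_fit_sketch(rng.random(200))
print(power, fit)
```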
- Return type
-
identify_module_colors
(matrix, linkagefun='average', method='hybrid', minModuleSize=30, deepSplit=2, pamRespectsDendro=False) Identifies co-expression modules and converts the numeric labels into colors.
- Parameters
matrix – dissimilarity structure as produced by R.stats dist.
minModuleSize (int) – minimum module size.
deepSplit (int) – provides a rough control over sensitivity to cluster splitting, the higher the value (with ‘hybrid’ method) or if True (with ‘tree’ method), the more and smaller modules.
pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.
- Returns
Numpy array of strings with module color of each experimental feature.
-
calculate_module_eigengenes
(data, modColors, softPower=6, dissimilarity=True) Calculates module eigengenes to quantify co-expression similarity of entire modules.
- Parameters
data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.
softPower (int) – soft-thresholding power.
dissimilarity (bool) – if True, also calculates the dissimilarity of module eigengenes.
- Returns
Pandas dataframe with calculated module eigengenes. If dissimilarity is set to True, returns a tuple with two pandas dataframes, the first with the module eigengenes and the second with the eigengenes dissimilarity.
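A module eigengene is conventionally the first principal component of the standardized expression of the features in a module. The sketch below computes it with a singular value decomposition and derives the eigengene dissimilarity as 1 minus the correlation. It is an illustration under those assumptions, not the implementation behind this function; the helper name, the 'ME' column prefix and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def module_eigengenes_sketch(data, modColors):
    """Hypothetical helper: first principal component per module, plus 1 - correlation."""
    modColors = np.asarray(modColors)
    eigengenes = {}
    for color in np.unique(modColors):
        block = data.loc[:, modColors == color]            # features of this module
        scaled = (block - block.mean()) / block.std(ddof=1)
        u, _, _ = np.linalg.svd(scaled.values, full_matrices=False)
        eigengene = u[:, 0]                                # one value per sample
        # Align the sign with the average expression of the module
        if np.corrcoef(eigengene, scaled.mean(axis=1).values)[0, 1] < 0:
            eigengene = -eigengene
        eigengenes["ME" + str(color)] = eigengene
    MEs = pd.DataFrame(eigengenes, index=data.index)
    return MEs, 1.0 - MEs.corr()


signal = np.linspace(0.0, 1.0, 8)
alternating = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
data = pd.DataFrame(
    {"f1": signal, "f2": 2 * signal + 1, "f3": alternating, "f4": 3 * alternating},
    index=[f"s{i}" for i in range(8)],
)
modColors = np.array(["blue", "blue", "red", "red"])
MEs, MEDiss = module_eigengenes_sketch(data, modColors)
print(MEs.round(3))
```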
-
merge_similar_modules
(data, modColors, MEDissThres=0.4, verbose=0) Merges modules in the co-expression network that are too close, as measured by the correlation of their eigengenes.
- Parameters
data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.
verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.
MEDissThres (float) – maximum dissimilarity (i.e., 1-correlation) that qualifies modules for merging.
- Returns
Tuple containing a pandas dataframe with eigengenes of the new merged modules, and an array with the module colors of each experimental feature.
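The grouping step behind such a merge can be sketched by clustering the eigengenes on 1 - correlation and cutting the tree at MEDissThres. This only shows which modules would be grouped together, assuming average linkage; recomputing the merged eigengenes and relabelling features are omitted. The helper name and toy eigengenes are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def merge_groups_sketch(MEs, MEDissThres=0.4):
    """Hypothetical helper: assign eigengenes to merge groups by dissimilarity."""
    diss = 1.0 - MEs.corr().values
    np.fill_diagonal(diss, 0.0)                     # remove floating-point residue
    Z = linkage(squareform(diss, checks=False), method="average")
    groups = fcluster(Z, t=MEDissThres, criterion="distance")
    return {name: int(group) for name, group in zip(MEs.columns, groups)}


base = np.linspace(0.0, 1.0, 10)
MEs = pd.DataFrame(
    {
        "MEblue": base,
        "MEturquoise": base + 0.01 * np.sin(np.arange(10)),  # nearly identical to blue
        "MEred": np.array([1.0, -1.0] * 5),                  # unrelated profile
    }
)
groups = merge_groups_sketch(MEs)
print(groups)
```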
-
calculate_ModuleTrait_correlation
(df_exp, df_traits, MEs) Correlates eigengenes with external traits in order to identify the most significant module-trait associations.
- Parameters
df_exp – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
df_traits – pandas dataframe containing clinical data, with samples/subjects as rows and clinical traits as columns.
MEs – pandas dataframe with module eigengenes.
- Returns
Tuple with two pandas dataframes: first, the correlation between all module eigengenes and all clinical traits; second, a dataframe with the concatenated correlation and p-value, used for heatmap annotation.
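A sketch of the correlation step using scipy.stats.pearsonr, assuming Pearson correlation and that eigengenes and traits share the same sample order. The helper name and toy values are hypothetical, and this version returns correlations and p-values as two separate dataframes rather than the annotated text described above.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def module_trait_correlation_sketch(MEs, df_traits):
    """Hypothetical helper: Pearson r and p-value for every module/trait pair."""
    corr = pd.DataFrame(index=MEs.columns, columns=df_traits.columns, dtype=float)
    pval = corr.copy()
    for module in MEs.columns:
        for trait in df_traits.columns:
            r, p = pearsonr(MEs[module], df_traits[trait])
            corr.loc[module, trait] = r
            pval.loc[module, trait] = p
    return corr, pval


samples = ["s1", "s2", "s3", "s4", "s5"]
MEs = pd.DataFrame({"ME1": [1.0, 2.0, 3.0, 4.0, 5.0], "ME2": [5.0, 3.0, 4.0, 1.0, 2.0]}, index=samples)
df_traits = pd.DataFrame({"age": [10.0, 20.0, 30.0, 40.0, 55.0], "bmi": [5.0, 3.0, 4.0, 1.0, 2.5]}, index=samples)
corr, pval = module_trait_correlation_sketch(MEs, df_traits)
print(corr.round(3))
```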
-
calculate_ModuleMembership
(data, MEs) For each module, calculates the correlation of the module eigengene and the feature expression profile (quantitative measure of module membership (MM)).
- Parameters
data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
MEs – pandas dataframe with module eigengenes.
- Returns
Tuple with two pandas dataframes, one with module membership correlations and another with p-values.
-
calculate_FeatureTraitSignificance
(df_exp, df_traits) Quantifies associations of individual experimental features with the measured clinical traits, by defining Feature Significance (FS) as the absolute value of the correlation between the feature and the trait.
- Parameters
df_exp – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
df_traits – pandas dataframe containing clinical data, with samples/subjects as rows and clinical traits as columns.
- Returns
Tuple with two pandas dataframes, one with feature significance correlations and another with p-values.
-
get_FeaturesPerModule
(data, modColors, mode='dictionary') Groups all experimental features by the co-expression module they belong to.
- Parameters
data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.
mode (str) – type of the value returned by the function (‘dictionary’ or ‘dataframe’).
- Returns
Depending on selected mode, returns a dictionary or dataframe with module color per experimental feature.
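A minimal sketch of the grouping in both modes. The dataframe column names ('name', 'modColor') and the helper name are assumptions; the actual output layout is not specified above.

```python
import numpy as np
import pandas as pd


def features_per_module_sketch(data, modColors, mode="dictionary"):
    """Hypothetical helper: group feature names by their module color."""
    if mode == "dataframe":
        # Assumed layout: one row per feature with its module color
        return pd.DataFrame({"name": list(data.columns), "modColor": list(modColors)})
    grouped = {}
    for feature, color in zip(data.columns, modColors):
        grouped.setdefault(str(color), []).append(feature)
    return grouped


data = pd.DataFrame(np.zeros((2, 3)), columns=["f1", "f2", "f3"])
modColors = np.array(["blue", "red", "blue"])
as_dict = features_per_module_sketch(data, modColors)
as_frame = features_per_module_sketch(data, modColors, mode="dataframe")
print(as_dict)
```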
-
get_ModuleFeatures
(data, modColors, modules=[]) Groups and returns a list of the experimental features clustered in specific co-expression modules.
- Parameters
data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.
modules (list) – list of module colors of interest.
- Returns
List of lists with experimental features in each selected module.
-
get_EigengenesTrait_correlation
(MEs, data) Eigengenes are used as representative profiles of the co-expression modules, and correlation between them is used to quantify module similarity. Clinical traits are added to the eigengenes to see how the traits fit into the eigengene network.
- Parameters
MEs – pandas dataframe with module eigengenes.
data – pandas dataframe containing clinical data, with samples/subjects as rows and clinical traits as columns.
- Returns
Tuple with two pandas dataframes, one with the module eigengene dissimilarity recalculated with the traits included, and another with all the overall correlations.