Knowledge Annotation and Visualization¶

Summarizing Knowledge using Network Analysis¶

The Clinical Knowledge Graph (CKG) can be used to annotate a list of proteins based on their connections in the Knowledge Graph. CKG generates a comprehensive graph with all the connections to Diseases, Drugs, Protein Complexes, Pathways and Biological processes.

All the connections extracted from CKG are then summarized into a smaller subgraph containing only the top 15 nodes of each type (Disease, Drug, Complex, Pathway, Biological_process, Publication), ranked with network analysis algorithms (betweenness or closeness centrality, or PageRank).
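As a rough illustration of this summarization step, the sketch below keeps the highest-scoring nodes of each type. The function name top_nodes_per_type, the node names and the scores are invented for this example and are not part of the CKG API; CKG's own implementation may differ.

```python
from collections import defaultdict


def top_nodes_per_type(node_types, scores, n=15):
    """Keep the n highest-scoring nodes of each type.

    node_types: dict mapping node name -> type label (e.g. 'Disease')
    scores: dict mapping node name -> centrality score
    """
    by_type = defaultdict(list)
    for node, node_type in node_types.items():
        by_type[node_type].append(node)

    selected = {}
    for node_type, nodes in by_type.items():
        # Sort by descending score; ties are broken alphabetically
        ranked = sorted(nodes, key=lambda name: (-scores.get(name, 0.0), name))
        selected[node_type] = ranked[:n]
    return selected


# Toy data, invented for illustration (not CKG output)
node_types = {'d1': 'Disease', 'd2': 'Disease', 'd3': 'Disease',
              'drugA': 'Drug', 'drugB': 'Drug'}
scores = {'d1': 0.2, 'd2': 0.9, 'd3': 0.5, 'drugA': 0.1, 'drugB': 0.4}

top = top_nodes_per_type(node_types, scores, n=2)
print(top)
```

With n=2, only two of the three toy diseases survive, which is the same pruning idea applied with n=15 in the summarized subgraph.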

The connections extracted from the graph are:

Protein-protein interactions
Protein-disease associations
Protein-drug associations
Drug-drug interactions
Drug-disease indications
Protein-complex associations
Protein-publication mentions
Disease-publication mentions

These connections are extracted with the queries defined in report_manager/queries/knowledge_annotation.yml, which can easily be extended by following the same query format.

Here, we show several examples of how to extract and visualize knowledge for a list of proteins, either specifying the disease or diseases being studied or letting CKG identify them.

[1]:

import pandas as pd
from ckg.report_manager import knowledge

WGCNA functions will not work. Module Rpy2 not installed.
R functions will not work. Module Rpy2 not installed.

Annotation of Proteins Linked to a Specific Disease¶

We use the Open Targets Platform (https://www.targetvalidation.org/) to obtain a list of genes associated with fibromyalgia. Open Targets compiled a list of 57 protein targets associated with fibromyalgia (https://www.targetvalidation.org/disease/EFO_0005687/associations?fcts=datatype:known_drug).

Fibromyalgia is a medical condition characterized by chronic widespread pain and a heightened pain response to pressure. Other symptoms include tiredness to a degree that normal activities are affected, sleep problems and troubles with memory (source: https://en.wikipedia.org/wiki/Fibromyalgia).

We feed the list of proteins to CKG to prioritize the knowledge gathered in the graph and reveal relationships to other possibly related diseases, as well as possible treatments and altered biological processes and pathways.

[2]:

# Open Targets export for fibromyalgia (EFO_0005687), tab-separated
opentargets_file = 'tmp/EFO_0005687-associated-diseases.tsv'
opentargets_data = pd.read_csv(opentargets_file, sep='\t', header=0)

[3]:

opentargets_data.head()

[3]:

	symbol	overallAssociationScore	geneticAssociations	somaticMutations	drugs	pathwaysSystemsBiology	textMining	rnaExpression	animalModels	targetName
0	SLC6A2	0.605935	No data	No data	0.6040872459884047	No data	0.03695798546847018	No data	No data	solute carrier family 6 member 2
1	SLC6A4	0.605472	No data	No data	0.6040991203885059	No data	0.02744951709599876	No data	No data	solute carrier family 6 member 4
2	CACNA2D1	0.602479	No data	No data	0.6024789602067924	No data	No data	No data	No data	calcium voltage-gated channel auxiliary subuni...
3	HTR2A	0.559159	No data	No data	0.5546461928861836	No data	0.09024626746046689	No data	No data	5-hydroxytryptamine receptor 2A
4	HTR2C	0.554646	No data	No data	0.5546461928861836	No data	No data	No data	No data	5-hydroxytryptamine receptor 2C

[4]:

target_list = opentargets_data['symbol'].tolist()

[5]:

len(target_list)

[5]:

Knowledge Object¶

To annotate the list of proteins, we create an empty object of type Knowledge.

Once we have the object, we can simply call the function annotate_list(), specifying the list of proteins, in this case the disease (or diseases), and the types of entities we want to annotate (Disease, Drug, Pathway, etc.).

[6]:

# Create Knowledge object
kn = knowledge.Knowledge(identifier='Fibromyalgia', data=None)

[7]:

# Annotate the list of proteins using function annotate_list
kn.annotate_list(query_list=target_list, # list of proteins
                 entity_type='protein', # type of items in the list
                 queries_file=None, # Allows YML file with customized queries or the default (None)
                 attribute='name',  # What we provide in the list (name, id)
                 diseases=['fibromyalgia'], # List of diseases
                 entities=None) # what types of annotations (Disease, Drug, Pathway, etc.)

This function runs all the queries in queries_file (default: report_manager/queries/knowledge_annotation.yml) associated with the entity_type (protein) and limits the queried information to relationships involving the proteins in the list provided.
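The restriction to the query list can be pictured with a small in-memory sketch. This is only a conceptual illustration: CKG presumably applies the restriction inside its database queries, and the helper filter_relationships, the rows marked as invented and the ACTS_ON relationship label are assumptions made for this example.

```python
def filter_relationships(relationships, query_list):
    """Keep relationships in which either endpoint is in query_list."""
    wanted = set(query_list)
    return [rel for rel in relationships
            if rel['source'] in wanted or rel['target'] in wanted]


relationships = [
    {'source': 'ACE', 'rel_type': 'ANNOTATED_IN_PATHWAY',
     'target': 'Metabolism of Angiotensinogen to Angiotensins'},
    {'source': 'TNF', 'rel_type': 'MENTIONED_IN_PUBLICATION',
     'target': 'PMID:25942533'},
    # The two rows below are invented for illustration
    {'source': 'drugX', 'rel_type': 'ACTS_ON', 'target': 'ACE'},
    {'source': 'XYZ1', 'rel_type': 'ANNOTATED_IN_PATHWAY',
     'target': 'some pathway'},
]

query_list = ['ACE', 'TNF']
filtered = filter_relationships(relationships, query_list)
for rel in filtered:
    print(rel['source'], rel['rel_type'], rel['target'])
```

The row for XYZ1 is dropped because neither of its endpoints appears in the query list.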

Summarization and Visualization¶

The graph contains millions of relationships, and the results of the annotation may be too cumbersome to interpret directly.

To summarize the results and make them easier to understand and navigate, CKG uses network analysis algorithms (betweenness centrality, closeness centrality and PageRank) to prioritize the nodes in the knowledge annotation graph.

The result summarizes the relationships of the top 15 nodes of each entity type (Disease, Drug, Pathway, Biological_process, Complex, Publication) according to the chosen algorithm.
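To give an intuition for how such a ranking prioritizes nodes, here is a minimal pure-Python sketch of PageRank by power iteration on a toy directed graph. It is not CKG's implementation (NetworkX offers pagerank, betweenness_centrality and closeness_centrality for real use); the function, its parameters and the node names are assumptions made for this example.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank on a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            if out_links[node]:
                share = damping * rank[node] / len(out_links[node])
                for target in out_links[node]:
                    new_rank[target] += share
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy protein -> disease edges, invented for illustration
edges = [('P1', 'D1'), ('P2', 'D1'), ('P3', 'D1'), ('P3', 'D2')]
rank = pagerank(edges)
for node, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

D1, which is linked from all three toy proteins, ends up with the highest score, so it would be kept ahead of D2 when only the top nodes of a type are retained.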

The summarized results can be visualized either as a Sankey plot or as a network.

[8]:

kn.generate_report(visualizations=['network', 'sankey'], # how to visualize the results (network, sankey)
                   summarize=True, # Whether or not to summarize the annotation
                   method='betweenness', # Method for summarizing the annotation (betweenness, closeness, pagerank)
                   inplace=True) # If True, the summarized graph is saved; otherwise the full graph is kept

[9]:

kn.report.visualize_report(environment='notebook')[0]

All the Knowledge is Accessible¶

All the relationships extracted from the CKG are stored as a pandas DataFrame in the data attribute of the Knowledge object.

[10]:

kn.data.shape

[10]:

(98877, 7)

[11]:

kn.data.head()

[11]:

	r.source	rel_type	source	source_type	target	target_type	weight
0	Reactome	ANNOTATED_IN_PATHWAY	ACE	[Protein]	Metabolism of Angiotensinogen to Angiotensins	[Pathway]	NaN
1	Reactome	ANNOTATED_IN_PATHWAY	ACHE	[Protein]	Synthesis of PC	[Pathway]	NaN
2	Reactome	ANNOTATED_IN_PATHWAY	ACHE	[Protein]	Neurotransmitter clearance	[Pathway]	NaN
3	Reactome	ANNOTATED_IN_PATHWAY	ACHE	[Protein]	Synthesis, secretion, and deacylation of Ghrelin	[Pathway]	NaN
4	Reactome	ANNOTATED_IN_PATHWAY	ACSS2	[Protein]	Transcriptional activation of mitochondrial bi...	[Pathway]	NaN

[12]:

kn.data.tail()

[12]:

	r.source	rel_type	source	source_type	target	target_type	weight
9209	None	MENTIONED_IN_PUBLICATION	PRKCA	[Protein]	PMID:25942533	[Publication]	NaN
9210	None	MENTIONED_IN_PUBLICATION	PTGS2	[Protein]	PMID:25942533	[Publication]	NaN
9211	None	MENTIONED_IN_PUBLICATION	SCN10A	[Protein]	PMID:25942533	[Publication]	NaN
9212	None	MENTIONED_IN_PUBLICATION	TNF	[Protein]	PMID:25942533	[Publication]	NaN
9213	None	MENTIONED_IN_PUBLICATION	TRPA1	[Protein]	PMID:25942533	[Publication]	NaN
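A quick way to get an overview of such a table is to count the relationships of each type. The sketch below does this with the standard library on a few records copied from the output above (columns reduced to three). With the real DataFrame, kn.data.to_dict('records') would yield records of this shape, and kn.data['rel_type'].value_counts() should give the equivalent tally directly in pandas.

```python
from collections import Counter

# A handful of rows taken from the head() and tail() outputs shown above
records = [
    {'source': 'ACE', 'rel_type': 'ANNOTATED_IN_PATHWAY', 'target_type': 'Pathway'},
    {'source': 'ACHE', 'rel_type': 'ANNOTATED_IN_PATHWAY', 'target_type': 'Pathway'},
    {'source': 'ACSS2', 'rel_type': 'ANNOTATED_IN_PATHWAY', 'target_type': 'Pathway'},
    {'source': 'PRKCA', 'rel_type': 'MENTIONED_IN_PUBLICATION', 'target_type': 'Publication'},
    {'source': 'PTGS2', 'rel_type': 'MENTIONED_IN_PUBLICATION', 'target_type': 'Publication'},
]

# Tally relationships by type
counts = Counter(record['rel_type'] for record in records)
for rel_type, count in counts.most_common():
    print(rel_type, count)
```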

The generated knowledge subgraph can also be accessed as a NetworkX directed graph (DiGraph).

[13]:

kn.graph

[13]:

<networkx.classes.digraph.DiGraph at 0x23ab638e2c8>

The report can also be downloaded to a specified directory. The directory will contain the Sankey visualization in PNG and SVG formats, the network in GML and JSON formats, and the node and edge (relationship) tables in TSV format.
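For readers who want to post-process the tables outside CKG, the sketch below shows what writing an edge table as TSV can look like with the standard library. The file name edges.tsv, the column selection and the helper write_tsv are assumptions for this example; the files actually produced by download_report may be named and structured differently.

```python
import csv
import os
import tempfile


def write_tsv(rows, path, fieldnames):
    """Write a list of dicts to a tab-separated file with a header row."""
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter='\t')
        writer.writeheader()
        writer.writerows(rows)


# Two relationships copied from the table shown earlier
edges = [
    {'source': 'ACE', 'target': 'Metabolism of Angiotensinogen to Angiotensins',
     'rel_type': 'ANNOTATED_IN_PATHWAY'},
    {'source': 'TNF', 'target': 'PMID:25942533',
     'rel_type': 'MENTIONED_IN_PUBLICATION'},
]

out_dir = tempfile.mkdtemp()                  # temporary directory for the demo
edges_path = os.path.join(out_dir, 'edges.tsv')  # hypothetical file name
write_tsv(edges, edges_path, ['source', 'target', 'rel_type'])

# Read the file back to check its contents
with open(edges_path, encoding='utf-8') as handle:
    lines = handle.read().splitlines()
print(lines[0])
```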

[14]:

kn.report.download_report('tmp/fibromyalgia')

In some cases, we are interested in annotating a list of proteins and identifying which diseases they may be related to.

In the previous example, by specifying a disease, we prioritized the relationships to that disease while also identifying many others associated with the list of proteins. Running the same annotation without specifying fibromyalgia brings up other diseases that may be relevant to investigate (e.g., opiate dependence).

[15]:

# Annotate the list of proteins using function annotate_list
kn.annotate_list(query_list=target_list, # list of proteins
                 entity_type='protein', # type of items in the list
                 queries_file=None, # Allows YML file with customized queries or the default (None)
                 attribute='name',  # What we provide in the list (name, id)
                 diseases=[], # List of diseases
                 entities=None) # what types of annotations (Disease, Drug, Pathway, etc.)

[ ]:

kn.generate_report(visualizations=['network'], summarize=True, method='betweenness', inplace=True)

[ ]:

kn.report.visualize_report(environment='notebook')[0]

Annotation of a List of Proteins¶

Providing a list of proteins without specifying a list of diseases also demonstrates the validity of the summarization method and the usefulness of the extracted knowledge.

Here, we show another example, annotating a list of proteins (n=84) related to Alzheimer’s disease from the Open Targets Platform (https://www.targetvalidation.org/disease/EFO_0000249/associations?fcts=datatype:affected_pathway).

[ ]:

opentargets_file = 'tmp/EFO_0000249-associated-diseases.tsv'
opentargets_data = pd.read_csv(opentargets_file, sep=',', header=0)
target_list = opentargets_data['symbol'].tolist()
len(target_list)

[ ]:

# Annotate the list of proteins using function annotate_list
kn.annotate_list(query_list=target_list, # list of proteins
                 entity_type='protein', # type of items in the list
                 queries_file=None, # Allows YML file with customized queries or the default (None)
                 attribute='name',  # What we provide in the list (name, id)
                 diseases=[], # List of diseases
                 entities=None) # what types of annotations (Disease, Drug, Pathway, etc.)

[ ]:

kn.generate_report(visualizations=['sankey'], summarize=True, method='betweenness', inplace=False)

[ ]:

kn.report.visualize_report(environment='notebook')

Betweenness centrality can be slow depending on the number of relationships in the graph. Other options for summarizing the knowledge annotation are closeness centrality and PageRank.
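As an illustration of what closeness centrality measures, the following sketch computes it with a breadth-first search from every node. It is a simplified stand-in, not CKG's code: edges are treated as undirected, the score is the number of reachable nodes divided by the sum of their distances, and the graph is an invented toy example.

```python
from collections import deque


def closeness_centrality(edges):
    """Closeness of each node: reachable nodes / sum of shortest distances."""
    neighbors = {}
    for a, b in edges:
        # Assumption: edges are treated as undirected in this sketch
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    scores = {}
    for start in neighbors:
        # Breadth-first search gives shortest path lengths from start
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
        total = sum(distance.values())
        reachable = len(distance) - 1
        scores[start] = reachable / total if total > 0 else 0.0
    return scores


# Toy star graph: one disease linked to three proteins (invented)
edges = [('P1', 'D1'), ('P2', 'D1'), ('P3', 'D1')]
closeness = closeness_centrality(edges)
for node in sorted(closeness):
    print(node, closeness[node])
```

The hub D1 is one step from every other node and gets the maximum score, while each protein is two steps from the other proteins and scores lower.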

[ ]:

kn.generate_report(visualizations=['sankey'], summarize=True, method='closeness', inplace=False)

[ ]:

kn.report.visualize_report(environment='notebook')

References¶

Ochoa, D. et al. (2021). Open Targets Platform: supporting systematic drug–target identification and prioritisation. Nucleic Acids Research. https://academic.oup.com/nar/article/49/D1/D1302/5983621
