Knowledge Annotation and Visualization

Summarizing Knowledge using Network Analysis

The Clinical Knowledge Graph (CKG) can be used to annotate a list of proteins based on their connections in the Knowledge Graph. CKG generates a comprehensive graph with all the connections to Diseases, Drugs, Protein Complexes, Pathways and Biological processes.

All the connections extracted from CKG are then summarized into a smaller subgraph containing only the top 15 nodes of each type (Disease, Drug, Complex, Pathway, Biological_process, Publication) based on different network analysis algorithms (centrality, pagerank).

The connections extracted from the graph are:

  • Protein-protein interactions

  • Protein-disease associations

  • Protein-drug associations

  • Drug-drug interactions

  • Drug-disease indications

  • Protein-complex associations

  • Protein-publication mentions

  • Disease-publication mentions

These connections are extracted using the queries defined in report_manager/queries/knowledge_annotation.yml, which can easily be extended by following the same query format.

Here, we show several examples of how to extract and visualize knowledge for a list of proteins, either specifying the disease/diseases being studied or letting CKG figure it out.

[1]:
import pandas as pd
from ckg.report_manager import knowledge

Annotation of Proteins Linked to a Specific Disease

We use the Open Targets platform (https://www.targetvalidation.org/) to obtain a list of genes associated with fibromyalgia. Open Targets compiled a list of 57 protein targets that are associated with fibromyalgia (https://www.targetvalidation.org/disease/EFO_0005687/associations?fcts=datatype:known_drug).

Fibromyalgia is a medical condition characterized by chronic widespread pain and a heightened pain response to pressure. Other symptoms include tiredness to a degree that normal activities are affected, sleep problems and troubles with memory (source: https://en.wikipedia.org/wiki/Fibromyalgia).

We feed the list of proteins to CKG to prioritize all the knowledge gathered in the graph to reveal relationships to other possibly related diseases as well as possible treatments and altered biological processes and pathways.

[2]:
opentargets_file = 'tmp/EFO_0005687-associated-diseases.tsv'
opentargets_data = pd.read_csv(opentargets_file, sep='\t', header=0)
[3]:
opentargets_data.head()
[3]:
   symbol    overallAssociationScore  geneticAssociations  somaticMutations  drugs               pathwaysSystemsBiology  textMining           rnaExpression  animalModels  targetName
0  SLC6A2    0.605935                 No data              No data           0.6040872459884047  No data                 0.03695798546847018  No data        No data       solute carrier family 6 member 2
1  SLC6A4    0.605472                 No data              No data           0.6040991203885059  No data                 0.02744951709599876  No data        No data       solute carrier family 6 member 4
2  CACNA2D1  0.602479                 No data              No data           0.6024789602067924  No data                 No data              No data        No data       calcium voltage-gated channel auxiliary subuni...
3  HTR2A     0.559159                 No data              No data           0.5546461928861836  No data                 0.09024626746046689  No data        No data       5-hydroxytryptamine receptor 2A
4  HTR2C     0.554646                 No data              No data           0.5546461928861836  No data                 No data              No data        No data       5-hydroxytryptamine receptor 2C
[4]:
target_list = opentargets_data['symbol'].tolist()
[5]:
len(target_list)
[5]:
335
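
The score columns in this export mix numbers with the placeholder string "No data", so pandas appears to read them as text. The following is a minimal sketch, not part of CKG, of converting such columns to numeric values and keeping only targets with known-drug evidence. It uses a small hand-made table in which the row GENE_X is hypothetical.

```python
import pandas as pd

# Toy table mimicking the Open Targets export shown above.
# The first two rows reuse values from the output; GENE_X is hypothetical.
toy = pd.DataFrame({
    "symbol": ["SLC6A2", "CACNA2D1", "GENE_X"],
    "drugs": ["0.6040872459884047", "0.6024789602067924", "No data"],
    "textMining": ["0.03695798546847018", "No data", "0.2"],
})

score_columns = ["drugs", "textMining"]

# "No data" cannot be parsed as a number, so errors="coerce" turns it into NaN
toy[score_columns] = toy[score_columns].apply(pd.to_numeric, errors="coerce")

# Keep only the targets that have a known-drug score
with_drug_evidence = toy.loc[toy["drugs"].notna(), "symbol"].tolist()
print(with_drug_evidence)
```

The resulting list could be passed as query_list in place of the full target list if only drug-supported targets are of interest.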

Knowledge Object

To annotate the list of proteins, we create an empty object of type Knowledge.

Once we have the object, we can simply call the function annotate_list() specifying the list of proteins and in this case the disease (or diseases) and what type of entities we want to annotate (Disease, Drug, Pathway, etc.).

[6]:
#Create Knowledge object
kn = knowledge.Knowledge(identifier='Fibromyalgia', data=None)
[7]:
# Annotate the list of proteins using function annotate_list
kn.annotate_list(query_list=target_list, # list of proteins
                 entity_type='protein', # type of items in the list
                 queries_file=None, # Allows YML file with customized queries or the default (None)
                 attribute='name',  # What we provide in the list (name, id)
                 diseases=['fibromyalgia'], # List of diseases
                 entities=None) # what types of annotations (Disease, Drug, Pathway, etc.)

This function runs all the queries in queries_file (default: report_manager/queries/knowledge_annotation.yml) associated with the entity_type (protein) and limits the queried information to relationships involving the proteins in the list provided.

Summarization and Visualization

The graph contains millions of relationships, and the results of the annotation may be too cumbersome to interpret directly.

In order to summarize the results and make them easier to understand and navigate, CKG uses network analysis algorithms (centrality (betweenness, closeness) and pagerank) to prioritize the nodes in the knowledge annotation graph.

The result summarizes the relationships of the top 15 nodes of each entity type according to these algorithms (Disease, Drug, Pathway, Biological_process, Complex, Publication).
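
The idea of keeping the best-ranked nodes of each type can be sketched with NetworkX as follows. This is an illustration under stated assumptions, not CKG's implementation: the toy graph, the node attribute name type, the helper top_nodes_per_type and the choice to keep every query protein are all hypothetical.

```python
from collections import defaultdict

import networkx as nx

# Toy annotation graph; node names and types are made up for illustration
G = nx.DiGraph()
proteins = ["P1", "P2", "P3"]
G.add_nodes_from(proteins, type="Protein")
G.add_nodes_from(["DiseaseA", "DiseaseB"], type="Disease")
G.add_nodes_from(["DrugX", "DrugY"], type="Drug")
G.add_edges_from([
    ("P1", "DiseaseA"), ("P2", "DiseaseA"), ("P3", "DiseaseA"), ("P1", "DiseaseB"),
    ("P1", "DrugX"), ("P2", "DrugX"), ("P3", "DrugY"),
    ("DrugX", "DiseaseA"), ("DrugY", "DiseaseB"),
])


def top_nodes_per_type(graph, scores, k=15):
    """Return the k best-scoring node names for every node type."""
    by_type = defaultdict(list)
    for node, node_type in graph.nodes(data="type"):
        by_type[node_type].append(node)
    # Sort by descending score; ties are broken alphabetically for determinism
    return {
        node_type: sorted(nodes, key=lambda n: (-scores[n], n))[:k]
        for node_type, nodes in by_type.items()
    }


scores = nx.betweenness_centrality(G)
top = top_nodes_per_type(G, scores, k=1)

# Assumption: all query proteins are kept, only the annotations are filtered
keep = set(proteins) | {n for t, nodes in top.items() if t != "Protein" for n in nodes}
summary = G.subgraph(keep)
print(top)
```

Replacing nx.betweenness_centrality with nx.closeness_centrality or nx.pagerank changes only the scores dictionary; the per-type selection stays the same.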

The summarized results can be visualized either as a Sankey plot or as a network.

[8]:
kn.generate_report(visualizations=['network', 'sankey'], # how to visualize the results (network, sankey)
                   summarize=True, # Whether or not to summarize the annotation
                   method='betweenness', # Method for summarizing the annotation (betweenness, closeness, pagerank)
                   inplace=True) # If True, the summarized graph is saved; otherwise the full graph is kept
[9]:
kn.report.visualize_report(environment='notebook')[0]

All the Knowledge is Accessible

All the relationships extracted from the CKG are stored as a dataframe in the class property data.

[10]:
kn.data.shape
[10]:
(98877, 7)
[11]:
kn.data.head()
[11]:
   r.source  rel_type              source  source_type  target                                             target_type  weight
0  Reactome  ANNOTATED_IN_PATHWAY  ACE     [Protein]    Metabolism of Angiotensinogen to Angiotensins      [Pathway]    NaN
1  Reactome  ANNOTATED_IN_PATHWAY  ACHE    [Protein]    Synthesis of PC                                    [Pathway]    NaN
2  Reactome  ANNOTATED_IN_PATHWAY  ACHE    [Protein]    Neurotransmitter clearance                         [Pathway]    NaN
3  Reactome  ANNOTATED_IN_PATHWAY  ACHE    [Protein]    Synthesis, secretion, and deacylation of Ghrelin   [Pathway]    NaN
4  Reactome  ANNOTATED_IN_PATHWAY  ACSS2   [Protein]    Transcriptional activation of mitochondrial bi...  [Pathway]    NaN
[12]:
kn.data.tail()
[12]:
      r.source  rel_type                  source  source_type  target         target_type    weight
9209  None      MENTIONED_IN_PUBLICATION  PRKCA   [Protein]    PMID:25942533  [Publication]  NaN
9210  None      MENTIONED_IN_PUBLICATION  PTGS2   [Protein]    PMID:25942533  [Publication]  NaN
9211  None      MENTIONED_IN_PUBLICATION  SCN10A  [Protein]    PMID:25942533  [Publication]  NaN
9212  None      MENTIONED_IN_PUBLICATION  TNF     [Protein]    PMID:25942533  [Publication]  NaN
9213  None      MENTIONED_IN_PUBLICATION  TRPA1   [Protein]    PMID:25942533  [Publication]  NaN
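
Because the relationships are a regular pandas dataframe, they can be explored with standard pandas operations. The sketch below uses a small stand-in table with the column names shown above (rows copied from the output) rather than kn.data itself; the variable names are only illustrative.

```python
import pandas as pd

# Stand-in for kn.data with a subset of its columns and a few rows from the output above
relationships = pd.DataFrame({
    "rel_type": [
        "ANNOTATED_IN_PATHWAY", "ANNOTATED_IN_PATHWAY", "ANNOTATED_IN_PATHWAY",
        "MENTIONED_IN_PUBLICATION", "MENTIONED_IN_PUBLICATION",
    ],
    "source": ["ACE", "ACHE", "ACHE", "TNF", "TRPA1"],
    "target": [
        "Metabolism of Angiotensinogen to Angiotensins",
        "Synthesis of PC",
        "Neurotransmitter clearance",
        "PMID:25942533",
        "PMID:25942533",
    ],
})

# How many relationships of each type were retrieved
counts = relationships["rel_type"].value_counts()

# Number of distinct pathways annotated for each protein
is_pathway = relationships["rel_type"] == "ANNOTATED_IN_PATHWAY"
pathways_per_protein = relationships[is_pathway].groupby("source")["target"].nunique()

print(counts)
print(pathways_per_protein)
```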

The generated knowledge subgraph can also be accessed as a NetworkX directed graph (DiGraph).

[13]:
kn.graph
[13]:
<networkx.classes.digraph.DiGraph at 0x23ab638e2c8>
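
A DiGraph can be queried with the usual NetworkX methods. The sketch below builds a toy directed graph from an edge table with nx.from_pandas_edgelist and walks its edges. How CKG builds kn.graph internally is not shown in this notebook, so the construction step and the variable names here are assumptions made for illustration.

```python
import networkx as nx
import pandas as pd

# Toy edge table reusing a few rows from the relationships shown earlier
edges = pd.DataFrame({
    "source": ["ACHE", "ACHE", "TNF", "TRPA1"],
    "target": ["Synthesis of PC", "Neurotransmitter clearance", "PMID:25942533", "PMID:25942533"],
    "rel_type": [
        "ANNOTATED_IN_PATHWAY", "ANNOTATED_IN_PATHWAY",
        "MENTIONED_IN_PUBLICATION", "MENTIONED_IN_PUBLICATION",
    ],
})

# Each row becomes a directed edge source -> target carrying rel_type as an attribute
toy_graph = nx.from_pandas_edgelist(
    edges, source="source", target="target", edge_attr="rel_type", create_using=nx.DiGraph
)

# Outgoing neighbours of a protein, and proteins pointing at a publication
ache_targets = sorted(toy_graph.successors("ACHE"))
proteins_in_publication = sorted(toy_graph.predecessors("PMID:25942533"))

print(ache_targets)
print(proteins_in_publication)
```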

And the report can be downloaded to a specified directory. The directory will contain the Sankey visualization in png and svg formats, the network in gml and json formats as well as the nodes and edges (relationships) tables in tsv format.

[14]:
kn.report.download_report('tmp/fibromyalgia')

In some cases, we are interested in annotating a list of proteins and identifying which diseases they may be related to.

In the previous example, by specifying a disease, we prioritized the relationships to that disease and at the same time identified many others also associated with the list of proteins. Running the same annotation without specifying fibromyalgia brings up other diseases that may be relevant to investigate (e.g., opiate dependence).

[15]:
# Annotate the list of proteins using function annotate_list
kn.annotate_list(query_list=target_list, # list of proteins
                 entity_type='protein', # type of items in the list
                 queries_file=None, # Allows YML file with customized queries or the default (None)
                 attribute='name',  # What we provide in the list (name, id)
                 diseases=[], # Empty list: no disease specified
                 entities=None) # what types of annotations (Disease, Drug, Pathway, etc.)
[ ]:
kn.generate_report(visualizations=['network'], summarize=True, method='betweenness', inplace=True)
[ ]:
kn.report.visualize_report(environment='notebook')[0]

Annotation of a List of Proteins

Providing a list of proteins without specifying a list of diseases also shows the validity of the summarization method and the usefulness of the extracted knowledge.

Here, we show another example annotating a list of proteins (n=84) related to Alzheimer’s disease from the Open Targets Platform (https://www.targetvalidation.org/disease/EFO_0000249/associations?fcts=datatype:affected_pathway).

[ ]:
opentargets_file = 'tmp/EFO_0000249-associated-diseases.tsv'
opentargets_data = pd.read_csv(opentargets_file, sep=',', header=0)
target_list = opentargets_data['symbol'].tolist()
len(target_list)
[ ]:
# Annotate the list of proteins using function annotate_list
kn.annotate_list(query_list=target_list, # list of proteins
                 entity_type='protein', # type of items in the list
                 queries_file=None, # Allows YML file with customized queries or the default (None)
                 attribute='name',  # What we provide in the list (name, id)
                 diseases=[], # Empty list: no disease specified
                 entities=None) # what types of annotations (Disease, Drug, Pathway, etc.)
[ ]:
kn.generate_report(visualizations=['sankey'], summarize=True, method='betweenness', inplace=False)
[ ]:
kn.report.visualize_report(environment='notebook')

Betweenness centrality can be slow depending on the number of relationships in the graph. There are other options for summarizing the knowledge annotation: closeness centrality or pagerank.

[ ]:
kn.generate_report(visualizations=['sankey'], summarize=True, method='closeness', inplace=False)
[ ]:
kn.report.visualize_report(environment='notebook')
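
To get a feel for the cost of betweenness centrality outside CKG, the sketch below compares exact betweenness, a sampled approximation and closeness on a random directed graph using NetworkX directly. The k and seed arguments belong to nx.betweenness_centrality; whether CKG exposes an equivalent option is not covered here, and the graph size and helper names are arbitrary choices for illustration.

```python
import time

import networkx as nx

# Random directed graph standing in for an annotation graph (size chosen arbitrarily)
G_demo = nx.gnp_random_graph(200, 0.03, seed=7, directed=True)


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


exact, t_exact = timed(nx.betweenness_centrality, G_demo)
# k=50 estimates betweenness from 50 sampled source nodes instead of all 200
approx, t_approx = timed(nx.betweenness_centrality, G_demo, k=50, seed=7)
closeness, t_close = timed(nx.closeness_centrality, G_demo)


def top(scores, n=10):
    """Names of the n best-scoring nodes, ties broken by node id."""
    return sorted(scores, key=lambda node: (-scores[node], node))[:n]


overlap = len(set(top(exact)) & set(top(approx)))
print(f"exact: {t_exact:.3f}s, sampled: {t_approx:.3f}s, closeness: {t_close:.3f}s")
print(f"top-10 overlap between exact and sampled betweenness: {overlap}")
```

Sampling trades accuracy for speed, so the overlap between the exact and sampled rankings is worth checking before relying on the approximation.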

References

Ochoa, D. et al. (2021). Open Targets Platform: supporting systematic drug–target identification and prioritisation. Nucleic Acids Research. https://academic.oup.com/nar/article/49/D1/D1302/5983621
