Graph Database Builder (graphdb_builder)
builder_utils.py
-
parse_contents
(contents, filename) Reads binary string files and returns a Pandas DataFrame.
-
export_contents
(data, dataDir, filename) Exports a Pandas DataFrame to file, with UTF-8 encoding.
-
write_relationships
(relationships, header, outputfile) Reads a set of relationships and saves them to a file.
-
write_entities
(entities, header, outputfile) Reads a set of entities and saves them to a file.
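A minimal sketch of what a writer such as write_entities or write_relationships might look like. The name `write_rows`, the comma delimiter and the sorting step are assumptions made for illustration; the actual output format used by builder_utils is not stated here.

```python
import csv


def write_rows(rows, header, outputfile):
    """Hypothetical stand-in for write_entities / write_relationships."""
    with open(outputfile, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)  # assumption: comma-separated output
        writer.writerow(header)
        # Sorting is only here to make the output deterministic for a set.
        for row in sorted(rows):
            writer.writerow(row)
```

Passing a set of tuples mirrors the "set of entities" wording above; any iterable of equal-length rows would work with this sketch.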
-
get_config
(config_name, data_type='databases') Reads YAML configuration file and converts it into a Python dictionary.
- Parameters
- Returns
Dictionary.
Note
Use this function to obtain configuration for individual database/ontology parsers.
-
expand_cols
(data, col, sep=';') Expands the rows of a DataFrame by splitting the specified column on the separator sep.
-
setup_config
(data_type='databases') Reads YAML configuration file and converts it into a Python dictionary.
- Parameters
data_type – configuration type (‘databases’, ‘ontologies’, ‘experiments’ or ‘builder’).
- Returns
Dictionary.
Note
This function should be used to obtain the configuration for databases_controller.py, ontologies_controller.py, experiments_controller.py and builder.py.
-
list_ftp_directory
(ftp_url, user='', password='') Lists all files present in a folder on an FTP server.
-
download_PRIDE_data
(pxd_id, file_name, to='.', user='', password='', date_field='publicationDate') Downloads a project file from the PRIDE repository.
- Parameters
pxd_id (str) – PRIDE project identifier (e.g. PXD013599).
file_name (str) – name of the file to download.
to (str) – local directory where the file should be downloaded
user (str) – username to access biomedical database server if required.
password (str) – password to access biomedical database server if required.
date_field (str) – projects deposited in PRIDE are searched by date, either submissionDate or publicationDate (default).
-
downloadDB
(databaseURL, directory=None, file_name=None, user='', password='', avoid_wget=False) Downloads the raw files from a biomedical database server when a link is provided.
- Parameters
databaseURL (str) – link to access biomedical database server.
file_name (str or None) – name of the file to download. If None, ‘databaseURL’ must contain the filename after the last ‘/’.
user (str) – username to access biomedical database server if required.
password (str) – password to access biomedical database server if required.
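The file_name rule above can be sketched as follows. `filename_from_url` is a hypothetical helper written for illustration, not a function of builder_utils, and the example URL is made up.

```python
def filename_from_url(database_url, file_name=None):
    """Hypothetical helper: choose the local file name for a download."""
    if file_name is not None:
        return file_name
    # Everything after the last '/' is taken as the file name.
    name = database_url.split("/")[-1]
    if not name:
        raise ValueError("URL has no file name after the last '/'; pass file_name")
    return name


print(filename_from_url("https://example.org/data/proteins.tsv.gz"))
```

A URL that ends in '/' has nothing after the last slash, which is why the sketch raises an error in that case instead of guessing a name.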
-
searchPubmed
(searchFields, sortby='relevance', num='10', resultsFormat='json') Searches the PubMed database for MeSH terms and other additional fields (‘searchFields’), sorts the results by relevance and returns the top ‘num’.
-
is_number
(s) Checks whether the given input is a float; returns True if so and False otherwise.
- Parameters
s – input
- Returns
Boolean.
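One common way to implement such a check is to attempt the conversion and catch the failure. This is a sketch under that assumption, with the hypothetical name `is_number_sketch`; the library's own implementation may differ.

```python
def is_number_sketch(s):
    """Return True if `s` can be converted to a float, else False."""
    try:
        float(s)
        return True
    except (TypeError, ValueError):
        # TypeError covers inputs such as None; ValueError covers text like "abc".
        return False
```

Note that with this approach strings such as "nan" and "inf" also count as numbers, because float() accepts them.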
-
getMedlineAbstracts
(idList) Accesses NCBI over the WWW and returns Medline data as a handle object, which is parsed and converted to a Pandas DataFrame.
-
listDirectoryFiles
(directory) Lists all files in a specified directory.
- Parameters
directory (str) – path to folder.
- Returns
List of file names.
-
listDirectoryFolders
(directory) Lists all directories in a specified directory.
- Parameters
directory (str) – path to folder.
- Returns
List of folder names.
-
listDirectoryFoldersNotEmpty
(directory) Lists all non-empty directories in a specified directory.
- Parameters
directory (str) – path to folder.
- Returns
List of folder names.
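The three listing functions above can be approximated with the standard library. The sketch below uses hypothetical names (`list_files`, `list_folders`, `list_nonempty_folders`) and sorts the results for predictable output, which is an assumption rather than documented behaviour.

```python
import os


def list_files(directory):
    """Names of regular files directly inside `directory`."""
    return sorted(e.name for e in os.scandir(directory) if e.is_file())


def list_folders(directory):
    """Names of sub-directories directly inside `directory`."""
    return sorted(e.name for e in os.scandir(directory) if e.is_dir())


def list_nonempty_folders(directory):
    """Names of sub-directories that contain at least one entry."""
    return [
        name
        for name in list_folders(directory)
        if os.listdir(os.path.join(directory, name))
    ]
```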
-
checkDirectory
(directory) Checks if the given directory exists and, if not, creates it.
- Parameters
directory (str) – path to folder.
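The check-then-create behaviour can be expressed in one standard-library call. `ensure_directory` is a hypothetical name used for this sketch; creating missing parent folders as well is an assumption.

```python
import os


def ensure_directory(directory):
    """Create `directory` (and any missing parents) if it does not exist."""
    # exist_ok=True makes the call a no-op when the folder is already there.
    os.makedirs(directory, exist_ok=True)
```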
-
flatten
(t) Generator that flattens a nested structure.
Code from: https://gist.github.com/shaxbee/0ada767debf9eefbdb6e
Acknowledgements: Zbigniew Mandziejewicz (shaxbee)
>>> list(flatten([2, [2, (4, 5, [7], [2, [6, 2, 6, [6], 4]], 6)]]))
[2, 2, 4, 5, 7, 2, 6, 2, 6, 6, 4, 6]
-
pretty_print
(data) Pretty-prints arbitrary Python data structures in a form that can be used as input to the interpreter. For more information visit https://docs.python.org/2/library/pprint.html.
- Parameters
data – python object.
-
convertOBOtoNet
(ontologyFile) Takes an .obo file and returns a NetworkX graph representation of the ontology that can hold multiple edges between two nodes.
- Parameters
ontologyFile (str) – path to ontology file.
- Returns
NetworkX graph.
-
getCurrentTime
() Returns current date (Year-Month-Day) and time (Hour-Minute-Second).
- Returns
Two strings: date and time.
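A sketch of returning the date and time as two strings. The name `current_date_and_time`, the optional `now` argument and the exact separators ("-" in the date, ":" in the time) are assumptions; the description above only fixes the order of the fields.

```python
from datetime import datetime


def current_date_and_time(now=None):
    """Return (date, time) strings for `now`, defaulting to the current moment."""
    now = now or datetime.now()
    # Assumed formats: Year-Month-Day and Hour:Minute:Second.
    return now.strftime("%Y-%m-%d"), now.strftime("%H:%M:%S")
```

Accepting an explicit `now` is only a convenience for testing; calling the function with no arguments uses the system clock.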
-
convert_bytes
(num) Converts a size in bytes to a larger unit (MB, GB, etc.).
- Parameters
num – float, integer or pandas.Series.
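A common way to do this conversion is to divide by 1024 until the value fits the current unit. The sketch below uses the hypothetical name `human_readable_size`, handles plain numbers only (not pandas.Series), and assumes 1024-based units and a one-decimal string format.

```python
def human_readable_size(num):
    """Format a byte count using the largest unit that keeps the value below 1024."""
    for unit in ("bytes", "KB", "MB", "GB", "TB"):
        if num < 1024.0:
            return f"{num:3.1f} {unit}"
        num /= 1024.0
    # Anything that survives the loop is at least 1024 TB.
    return f"{num:.1f} PB"
```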
-
buildStats
(count, otype, name, dataset, filename, updated_on=None) Returns a tuple with all the information needed to build a stats file.
- Parameters
- Returns
Tuple with date, time, database name, file where entities/relationships are stored, file size, number of entities/relationships imported, type and label.
-
unrar
(filepath, to) Decompresses a RAR file.
- Parameters
filepath (str) – path to rar file.
to (str) – where to extract all files.
-
unzip_file
(filepath, to) Decompresses a zipped file.
- Parameters
filepath (str) – path to zip file.
to (str) – where to extract all files.
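Extracting a zip archive can be sketched with the standard zipfile module. `unzip_sketch` is a hypothetical name, and this is not necessarily how unzip_file is implemented.

```python
import zipfile


def unzip_sketch(filepath, to):
    """Extract every member of the zip archive at `filepath` into folder `to`."""
    with zipfile.ZipFile(filepath) as archive:
        # extractall creates the destination folder if it is missing.
        archive.extractall(to)
```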
-
compress_directory
(folder_to_backup, dest_folder, file_name) Compresses folder to .tar.gz to create data backup archive file.
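A .tar.gz backup of a folder can be sketched with the standard tarfile module. The name `compress_directory_sketch`, the returned path, and the assumption that file_name is given without the ".tar.gz" extension are all illustrative choices.

```python
import os
import tarfile


def compress_directory_sketch(folder_to_backup, dest_folder, file_name):
    """Write folder_to_backup into dest_folder/<file_name>.tar.gz and return the path."""
    os.makedirs(dest_folder, exist_ok=True)
    archive_path = os.path.join(dest_folder, file_name + ".tar.gz")
    # Store members under the folder's own name, not its absolute path.
    top_level = os.path.basename(os.path.normpath(folder_to_backup))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(folder_to_backup, arcname=top_level)
    return archive_path
```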
-
read_gzipped_file
(filepath) Opens an underlying process to access a gzip file through the creation of a new pipe to the child.
- Parameters
filepath (str) – path to gzip file.
- Returns
A bytes sequence that specifies the standard output.
-
parse_fasta
(file_handler) Uses BioPython to read a FASTA file as SeqIO records.
- Parameters
file_handler (file_handler) – opened fasta file
- Returns
iterator of sequence objects
-
batch_iterator
(iterator, batch_size) Returns lists of length batch_size.
This can be used on any iterator, for example to batch up SeqRecord objects from Bio.SeqIO.parse(…), or to batch Alignment objects from Bio.AlignIO.parse(…), or simply lines from a file handle.
This is a generator function, and it returns lists of the entries from the supplied iterator. Each list will have batch_size entries, although the final list may be shorter.
- Parameters
iterator (iterator) – iterator from which the batches are extracted
batch_size (integer) – size of the batch
- Returns
list with the batch elements of size batch_size
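The batching behaviour described above can be sketched with itertools.islice. `batch_iterator_sketch` is a hypothetical name, and this is one possible implementation, not necessarily the one in builder_utils.

```python
from itertools import islice


def batch_iterator_sketch(iterator, batch_size):
    """Yield lists of up to batch_size items taken from `iterator`."""
    iterator = iter(iterator)  # accept any iterable, not only iterators
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch


for batch in batch_iterator_sketch(range(7), 3):
    print(batch)
```

With seven items and a batch size of three, the last batch holds a single item, matching the note that the final list may be shorter.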