Graph Database Builder (graphdb_builder)

builder_utils.py

readDataset(uri)[source]
readDataFromCSV(uri, sep=', ', header=0, comment=None)[source]

Reads data from a CSV file.

readDataFromTXT(uri)[source]

Reads data from a TSV or TXT file.

readDataFromExcel(uri)[source]

Reads data from an Excel file.

get_files_by_pattern(regex_path)[source]
get_extra_pairs(directory, extra_file)[source]
parse_contents(contents, filename)[source]

Reads binary string files and returns a Pandas DataFrame.

export_contents(data, dataDir, filename)[source]

Export Pandas DataFrame to file, with UTF-8 encoding.

parse_mztab_filehandler(mztabf)[source]
parse_mztab_file(mztab_file)[source]
parse_sdrf_filehandler(sdrf_fh)[source]
convert_ckg_to_sdrf(df)[source]
convert_sdrf_to_ckg(df)[source]
convert_ckg_clinical_to_sdrf(df)[source]
convert_sdrf_file_to_ckg(file_path)[source]
write_relationships(relationships, header, outputfile)[source]

Reads a set of relationships and saves them to a file.

Parameters
  • relationships (set) – set of tuples with relationship data: source node, target node, relationship type, source and other attributes.

  • header (list) – list of column names.

  • outputfile (str) – path to file to be saved (including filename and extension).
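
A minimal sketch of what such a writer might look like (illustrative only; the helper name and CSV layout below are assumptions, not the library's actual implementation):

```python
import csv

def write_rows(rows, header, outputfile):
    """Illustrative sketch: write a set of tuples to a CSV file with a header row."""
    with open(outputfile, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        for row in sorted(rows):  # sort the set for deterministic output
            writer.writerow(row)

# Hypothetical relationship tuples: (source node, target node, relationship type)
rels = {("P02788", "GO:0006826", "ASSOCIATED_WITH")}
write_rows(rels, ["START_ID", "END_ID", "TYPE"], "rels.csv")
```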

write_entities(entities, header, outputfile)[source]

Reads a set of entities and saves them to a file.

Parameters
  • entities (set) – set of tuples with entities data: identifier, label, name and other attributes.

  • header (list) – list of column names.

  • outputfile (str) – path to file to be saved (including filename and extension).

get_config(config_name, data_type='databases')[source]

Reads YAML configuration file and converts it into a Python dictionary.

Parameters
  • config_name (str) – name of the configuration YAML file.

  • data_type (str) – configuration type (‘databases’ or ‘ontologies’).

Returns

Dictionary.

Note

Use this function to obtain configuration for individual database/ontology parsers.
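
For orientation, a database parser configuration of this kind might look something like the fragment below; the keys shown are hypothetical and the actual fields depend on the individual parser:

```yaml
# Hypothetical example of a database parser configuration
database_url: 'https://example.org/data/releases/current/'
header: ['START_ID', 'END_ID', 'TYPE', 'score']
output_file_name: 'example_associations.tsv'
```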

expand_cols(data, col, sep=';')[source]

Expands the rows of a dataframe by splitting the specified column.

Parameters
  • data – dataframe to be expanded

  • col (str) – column that contains the string to be expanded (e.g. ‘P02788;E7EQB2;E7ER44;P02788-2;C9JCF5’)

  • sep (str) – separator (e.g. ‘;’)

Returns

expanded pandas dataframe
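
The same effect can be sketched with pandas alone (assuming pandas is available; this is an equivalent recipe, not necessarily the function's exact code), splitting the column into lists and exploding each list item onto its own row:

```python
import pandas as pd

df = pd.DataFrame({"gene": ["LTF"],
                   "proteins": ["P02788;E7EQB2;E7ER44"]})
# Split the string column into lists, then expand each list item to its own row
expanded = (df.assign(proteins=df["proteins"].str.split(";"))
              .explode("proteins")
              .reset_index(drop=True))
print(len(expanded))  # 3 rows, one per protein accession
```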

setup_config(data_type='databases')[source]

Reads YAML configuration file and converts it into a Python dictionary.

Parameters

data_type – configuration type (‘databases’, ‘ontologies’, ‘experiments’ or ‘builder’).

Returns

Dictionary.

Note

This function should be used to obtain the configuration for databases_controller.py, ontologies_controller.py, experiments_controller.py and builder.py.

list_ftp_directory(ftp_url, user='', password='')[source]

Lists all files in a given folder on an FTP server.

Parameters
  • ftp_url (str) – link to access ftp server.

  • user (str) – username to access ftp server if required.

  • password (str) – password to access ftp server if required.

Returns

List of files contained in the FTP server folder specified by ftp_url.

setup_logging(path='log.config', key=None)[source]

Sets up the logging configuration.

Parameters
  • path (str) – path to file containing configuration for logging file.

  • key (str) – name of the logger.

Returns

Logger with the name given by ‘key’. If key is None, returns the root logger of the hierarchy.
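
The underlying pattern can be sketched with the standard `logging` module (the real function reads its configuration from the file at `path`; `basicConfig` below is just a stand-in for that step):

```python
import logging

def get_logger(key=None):
    """Illustrative sketch: return the named logger, or the root logger if key is None."""
    logging.basicConfig(level=logging.INFO)  # stand-in for loading 'log.config'
    return logging.getLogger(key)

logger = get_logger("graphdb_builder")
logger.info("logger '%s' ready", logger.name)
```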

download_from_ftp(ftp_url, user, password, to, file_name)[source]
download_PRIDE_data(pxd_id, file_name, to='.', user='', password='', date_field='publicationDate')[source]

This function downloads a project file from the PRIDE repository.

Parameters
  • pxd_id (str) – PRIDE project identifier (e.g. PXD013599).

  • file_name (str) – name of the file to download.

  • to (str) – local directory where the file should be downloaded.

  • user (str) – username to access biomedical database server if required.

  • password (str) – password to access biomedical database server if required.

  • date_field (str) – projects deposited in PRIDE are searched by date, either ‘submissionDate’ or ‘publicationDate’ (default).

downloadDB(databaseURL, directory=None, file_name=None, user='', password='', avoid_wget=False)[source]

This function downloads the raw files from a biomedical database server when a link is provided.

Parameters
  • databaseURL (str) – link to access biomedical database server.

  • directory (str or None) – local directory where the file should be saved.

  • file_name (str or None) – name of the file to download. If None, ‘databaseURL’ must contain the filename after the last ‘/’.

  • user (str) – username to access biomedical database server if required.

  • password (str) – password to access biomedical database server if required.

searchPubmed(searchFields, sortby='relevance', num='10', resultsFormat='json')[source]

Searches the PubMed database for MeSH terms and other additional fields (‘searchFields’), sorts the results by ‘sortby’ and returns the top ‘num’ identifiers.

Parameters
  • searchFields (list) – list of search fields to query for.

  • sortby (str) – parameter to use for sorting.

  • num (str) – number of PubMed identifiers to return.

  • resultsFormat (str) – format of the PubMed result.

Returns

Dictionary with total number of PubMed ids, and top ‘num’ ids.
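
Such a search presumably goes through NCBI's E-utilities `esearch` endpoint; the sketch below shows how the query URL for it could be composed (the helper name is hypothetical; the parameter names follow the public E-utilities documentation):

```python
from urllib.parse import urlencode

def build_esearch_url(search_fields, sortby="relevance", num="10", fmt="json"):
    """Illustrative sketch: compose an E-utilities esearch query for PubMed."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {"db": "pubmed",
              "term": " AND ".join(search_fields),
              "sort": sortby,
              "retmax": num,
              "retmode": fmt}
    return base + "?" + urlencode(params)

url = build_esearch_url(["lactoferrin", "inflammation"])
```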

is_number(s)[source]

This function checks whether the given input can be interpreted as a number (float), returning True if so and False otherwise.

Parameters

s – input

Returns

Boolean.
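
A common way to implement such a check (a sketch, not necessarily the library's exact code):

```python
def is_number(s):
    """Return True if s can be converted to a float, False otherwise."""
    try:
        float(s)
        return True
    except (TypeError, ValueError):
        return False

print(is_number("3.14"), is_number("abc"))  # True False
```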

getMedlineAbstracts(idList)[source]

This function accesses NCBI over the WWW and returns Medline data as a handle object, which is parsed and converted to a Pandas DataFrame.

Parameters

idList (str or list) – single identifier or comma-delimited list of identifiers. All the identifiers must be from the database PubMed.

Returns

Pandas DataFrame with columns: ‘title’, ‘authors’, ‘journal’, ‘keywords’, ‘abstract’, ‘PMID’ and ‘url’.

remove_directory(directory)[source]
listDirectoryFiles(directory)[source]

Lists all files in a specified directory.

Parameters

directory (str) – path to folder.

Returns

List of file names.

listDirectoryFolders(directory)[source]

Lists all directories in a specified directory.

Parameters

directory (str) – path to folder.

Returns

List of folder names.
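
These directory-listing helpers can be sketched with `os.scandir` (illustrative; the actual implementations may filter differently):

```python
import os
import tempfile

def list_files(directory):
    """Illustrative sketch: names of plain files directly under 'directory'."""
    return [e.name for e in os.scandir(directory) if e.is_file()]

def list_folders(directory):
    """Illustrative sketch: names of subdirectories directly under 'directory'."""
    return [e.name for e in os.scandir(directory) if e.is_dir()]

with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.txt"), "w").close()
    os.mkdir(os.path.join(tmp, "sub"))
    print(list_files(tmp), list_folders(tmp))  # ['a.txt'] ['sub']
```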

listDirectoryFoldersNotEmpty(directory)[source]

Lists all non-empty directories in a specified directory.

Parameters

directory (str) – path to folder.

Returns

List of folder names.

checkDirectory(directory)[source]

Checks if given directory exists and if not, creates it.

Parameters

directory (str) – path to folder.

flatten(t)[source]

Generator that flattens a nested structure. Code from: https://gist.github.com/shaxbee/0ada767debf9eefbdb6e (acknowledgements: Zbigniew Mandziejewicz, shaxbee).

>>> list(flatten([2, [2, (4, 5, [7], [2, [6, 2, 6, [6], 4]], 6)]]))
[2, 2, 4, 5, 7, 2, 6, 2, 6, 6, 4, 6]

pretty_print(data)[source]

This function provides the capability to “pretty-print” arbitrary Python data structures in a form that can be used as input to the interpreter. For more information visit https://docs.python.org/2/library/pprint.html.

Parameters

data – python object.

convertOBOtoNet(ontologyFile)[source]

Takes an .obo file and returns a NetworkX graph representation of the ontology that holds multiple edges between two nodes.

Parameters

ontologyFile (str) – path to ontology file.

Returns

NetworkX graph.

getCurrentTime()[source]

Returns current date (Year-Month-Day) and time (Hour-Minute-Second).

Returns

Two strings: date and time.

convert_bytes(num)[source]

This function converts a size in bytes into a human-readable string (KB, MB, GB, etc.).

Parameters

num – float, integer or pandas.Series.
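
A typical implementation of this kind of conversion looks like the sketch below (the thresholds and formatting are assumptions, not the library's exact output):

```python
def convert_bytes(num):
    """Illustrative sketch: format a byte count as a human-readable string."""
    for unit in ["bytes", "KB", "MB", "GB", "TB"]:
        if num < 1024.0:
            return "%3.1f %s" % (num, unit)
        num /= 1024.0  # move to the next larger unit
    return "%3.1f PB" % num

print(convert_bytes(2048))  # 2.0 KB
```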

copytree(src, dst, symlinks=False, ignore=None)[source]
file_size(file_path)[source]

This function returns the file size.

Parameters

file_path (str) – path to file.

Returns

Size in bytes of a plain file.

Return type

str

buildStats(count, otype, name, dataset, filename, updated_on=None)[source]

Returns a tuple with all the information needed to build a stats file.

Parameters
  • count (int) – number of entities/relationships.

  • otype (str) – ‘entity’ or ‘relationship’.

  • name (str) – entity/relationship label.

  • dataset (str) – database/ontology.

  • filename (str) – path to file where entities/relationships are stored.

Returns

Tuple with date, time, database name, file where entities/relationships are stored, file size, number of entities/relationships imported, type and label.

unrar(filepath, to)[source]

Decompresses a RAR file.

Parameters
  • filepath (str) – path to RAR file.

  • to (str) – path where to extract all files.

unzip_file(filepath, to)[source]

Decompresses a zipped file.

Parameters
  • filepath (str) – path to zip file.

  • to (str) – path where to extract all files.

compress_directory(folder_to_backup, dest_folder, file_name)[source]

Compresses folder to .tar.gz to create data backup archive file.

Parameters
  • folder_to_backup (str) – path to folder to compress and backup.

  • dest_folder (str) – path where to save compressed folder.

  • file_name (str) – name of the compressed file.
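
This can be sketched with the standard `tarfile` module (illustrative; the archive naming convention below is an assumption):

```python
import os
import tarfile
import tempfile

def compress_directory(folder_to_backup, dest_folder, file_name):
    """Illustrative sketch: archive a folder into dest_folder/file_name.tar.gz."""
    archive = os.path.join(dest_folder, file_name + ".tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(folder_to_backup, arcname=os.path.basename(folder_to_backup))
    return archive

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "data")
    os.mkdir(src)
    open(os.path.join(src, "nodes.tsv"), "w").close()
    print(os.path.basename(compress_directory(src, tmp, "backup")))  # backup.tar.gz
```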

read_gzipped_file(filepath)[source]

Opens an underlying process to access a gzip file through the creation of a new pipe to the child.

Parameters

filepath (str) – path to gzip file.

Returns

A bytes sequence that specifies the standard output.
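
The same result can be obtained with the standard `gzip` module (a sketch; the library function itself pipes the file through a child process instead):

```python
import gzip
import os
import tempfile

# Create a small gzip file, then read it back as bytes
tmp = os.path.join(tempfile.gettempdir(), "example.tsv.gz")
with gzip.open(tmp, "wb") as fh:
    fh.write(b"identifier\tname\n")

with gzip.open(tmp, "rb") as fh:
    content = fh.read()
print(content)  # b'identifier\tname\n'
os.remove(tmp)
```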

parse_fasta(file_handler)[source]

Uses Biopython to read a FASTA file into SeqIO record objects.

Parameters

file_handler (file_handler) – opened FASTA file.

Returns

Iterator of sequence record objects.
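
This presumably wraps `Bio.SeqIO.parse(file_handler, 'fasta')`; for orientation, a dependency-free sketch of FASTA parsing (the record in the example is hypothetical):

```python
import io

def parse_fasta_plain(handle):
    """Illustrative dependency-free sketch: yield (header, sequence) pairs."""
    header, seq = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

records = list(parse_fasta_plain(io.StringIO(">sp|P02788|TRFL_HUMAN\nMKLVFLVL\nLFLGALGL\n")))
print(records[0][1])  # MKLVFLVLLFLGALGL
```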

batch_iterator(iterator, batch_size)[source]

Returns lists of length batch_size.

This can be used on any iterator, for example to batch up SeqRecord objects from Bio.SeqIO.parse(…), or to batch Alignment objects from Bio.AlignIO.parse(…), or simply lines from a file handle.

This is a generator function, and it returns lists of the entries from the supplied iterator. Each list will have batch_size entries, although the final list may be shorter.

Parameters
  • iterator (iterator) – iterator whose entries are to be batched.

  • batch_size (integer) – size of each batch.

Returns

Lists of entries, each of length batch_size (the final list may be shorter).

source: https://biopython.org/wiki/Split_large_file
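
The recipe linked above amounts to something like the following sketch using `itertools.islice`:

```python
from itertools import islice

def batch_iterator(iterator, batch_size):
    """Yield successive lists of up to batch_size items from any iterator."""
    iterator = iter(iterator)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

print(list(batch_iterator(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```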

mapping.py