Graph Database Builder (graphdb_builder)

builder_utils.py

readDataset(uri)[source]
readDataFromCSV(uri, sep=', ', header=0, comment=None)[source]

Reads data from a CSV file.

readDataFromTXT(uri)[source]

Reads data from a TSV or TXT file.

readDataFromExcel(uri)[source]

Reads data from an Excel file.

get_files_by_pattern(regex_path)[source]
get_extra_pairs(directory, extra_file)[source]
parse_contents(contents, filename)[source]

Reads binary string files and returns a Pandas DataFrame.

export_contents(data, dataDir, filename)[source]

Export Pandas DataFrame to file, with UTF-8 encoding.

parse_mztab_filehandler(mztabf)[source]
parse_mztab_file(mztab_file)[source]
parse_sdrf_filehandler(sdrf_fh)[source]
convert_ckg_to_sdrf(df)[source]
convert_sdrf_to_ckg(df)[source]
convert_ckg_clinical_to_sdrf(df)[source]
convert_sdrf_file_to_ckg(file_path)[source]
write_relationships(relationships, header, outputfile)[source]

Reads a set of relationships and saves them to a file.

Parameters
  • relationships (set) – set of tuples with relationship data: source node, target node, relationship type, source and other attributes.

  • header (list) – list of column names.

  • outputfile (str) – path to file to be saved (including filename and extension).
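
A minimal sketch of what such a writer might look like (illustrative only; the helper name and CSV layout below are assumptions, not the library's actual implementation):

```python
import csv

def write_rows(rows, header, outputfile):
    """Illustrative sketch: write a set of tuples to a CSV file with a header row."""
    with open(outputfile, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        for row in sorted(rows):  # sort the set for deterministic output
            writer.writerow(row)

# Hypothetical relationship tuples: (source node, target node, relationship type)
rels = {("P02788", "GO:0006826", "ASSOCIATED_WITH")}
write_rows(rels, ["START_ID", "END_ID", "TYPE"], "rels.csv")
```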

write_entities(entities, header, outputfile)[source]

Reads a set of entities and saves them to a file.

Parameters
  • entities (set) – set of tuples with entities data: identifier, label, name and other attributes.

  • header (list) – list of column names.

  • outputfile (str) – path to file to be saved (including filename and extension).

get_config(config_name, data_type='databases')[source]

Reads YAML configuration file and converts it into a Python dictionary.

Parameters
  • config_name (str) – name of the configuration YAML file.

  • data_type (str) – configuration type (‘databases’ or ‘ontologies’).

Returns

Dictionary.

Note

Use this function to obtain configuration for individual database/ontology parsers.
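
For orientation, a database parser configuration of this kind might look something like the fragment below; the keys shown are hypothetical and the actual fields depend on the individual parser:

```yaml
# Hypothetical example of a database parser configuration
database_url: 'https://example.org/data/releases/current/'
header: ['START_ID', 'END_ID', 'TYPE', 'score']
output_file_name: 'example_associations.tsv'
```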

expand_cols(data, col, sep=';')[source]

Expands the rows of a dataframe by splitting the specified column.

Parameters
  • data – dataframe to be expanded

  • col (str) – column that contains the string to be expanded (e.g. ‘P02788;E7EQB2;E7ER44;P02788-2;C9JCF5’)

  • sep (str) – separator (e.g. ‘;’)

Returns

expanded pandas dataframe
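
The same effect can be sketched with pandas alone (assuming pandas is available; this is an equivalent recipe, not necessarily the function's exact code), splitting the column into lists and exploding each list item onto its own row:

```python
import pandas as pd

df = pd.DataFrame({"gene": ["LTF"],
                   "proteins": ["P02788;E7EQB2;E7ER44"]})
# Split the string column into lists, then expand each list item to its own row
expanded = (df.assign(proteins=df["proteins"].str.split(";"))
              .explode("proteins")
              .reset_index(drop=True))
print(len(expanded))  # 3 rows, one per protein accession
```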

setup_config(data_type='databases')[source]

Reads YAML configuration file and converts it into a Python dictionary.

Parameters

data_type – configuration type (‘databases’, ‘ontologies’, ‘experiments’ or ‘builder’).

Returns

Dictionary.

Note

This function should be used to obtain the configuration for databases_controller.py, ontologies_controller.py, experiments_controller.py and builder.py.

list_ftp_directory(ftp_url, user='', password='')[source]

Lists all files in a given folder on an FTP server.

Parameters
  • ftp_url (str) – link to access ftp server.

  • user (str) – username to access ftp server if required.

  • password (str) – password to access ftp server if required.

Returns

List of files contained in the FTP server folder specified by ftp_url.

setup_logging(path='log.config', key=None)[source]

Sets up the logging configuration.

Parameters
  • path (str) – path to file containing configuration for logging file.

  • key (str) – name of the logger.

Returns

Logger with the name given by ‘key’. If key is None, returns the root logger of the hierarchy.
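
The underlying pattern can be sketched with the standard `logging` module (the real function reads its configuration from the file at `path`; `basicConfig` below is just a stand-in for that step):

```python
import logging

def get_logger(key=None):
    """Illustrative sketch: return the named logger, or the root logger if key is None."""
    logging.basicConfig(level=logging.INFO)  # stand-in for loading 'log.config'
    return logging.getLogger(key)

logger = get_logger("graphdb_builder")
logger.info("logger '%s' ready", logger.name)
```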

download_from_ftp(ftp_url, user, password, to, file_name)[source]
download_PRIDE_data(pxd_id, file_name, to='.', user='', password='', date_field='publicationDate')[source]

This function downloads a project file from the PRIDE repository.

Parameters
  • pxd_id (str) – PRIDE project identifier (e.g. PXD013599).

  • file_name (str) – name of the file to download.

  • to (str) – local directory where the file should be downloaded.

  • user (str) – username to access biomedical database server if required.

  • password (str) – password to access biomedical database server if required.

  • date_field (str) – projects deposited in PRIDE are searched by date, either ‘submissionDate’ or ‘publicationDate’ (default).

downloadDB(databaseURL, directory=None, file_name=None, user='', password='', avoid_wget=False)[source]

This function downloads the raw files from a biomedical database server when a link is provided.

Parameters
  • databaseURL (str) – link to access biomedical database server.

  • directory (str or None) – local directory where the file should be saved.

  • file_name (str or None) – name of the file to download. If None, ‘databaseURL’ must contain the filename after the last ‘/’.

  • user (str) – username to access biomedical database server if required.

  • password (str) – password to access biomedical database server if required.

searchPubmed(searchFields, sortby='relevance', num='10', resultsFormat='json')[source]

Searches the PubMed database for MeSH terms and other additional fields (‘searchFields’), sorts the results by ‘sortby’ and returns the top ‘num’ identifiers.

Parameters
  • searchFields (list) – list of search fields to query for.

  • sortby (str) – parameter to use for sorting.

  • num (str) – number of PubMed identifiers to return.

  • resultsFormat (str) – format of the PubMed result.

Returns

Dictionary with total number of PubMed ids, and top ‘num’ ids.
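
Such a search presumably goes through NCBI's E-utilities `esearch` endpoint; the sketch below shows how the query URL for it could be composed (the helper name is hypothetical; the parameter names follow the public E-utilities documentation):

```python
from urllib.parse import urlencode

def build_esearch_url(search_fields, sortby="relevance", num="10", fmt="json"):
    """Illustrative sketch: compose an E-utilities esearch query for PubMed."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {"db": "pubmed",
              "term": " AND ".join(search_fields),
              "sort": sortby,
              "retmax": num,
              "retmode": fmt}
    return base + "?" + urlencode(params)

url = build_esearch_url(["lactoferrin", "inflammation"])
```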

is_number(s)[source]

This function checks whether the given input can be interpreted as a number (float), returning True if so and False otherwise.

Parameters

s – input

Returns

Boolean.
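
A common way to implement such a check (a sketch, not necessarily the library's exact code):

```python
def is_number(s):
    """Return True if s can be converted to a float, False otherwise."""
    try:
        float(s)
        return True
    except (TypeError, ValueError):
        return False

print(is_number("3.14"), is_number("abc"))  # True False
```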

getMedlineAbstracts(idList)[source]

This function accesses NCBI over the WWW and returns Medline data as a handle object, which is parsed and converted to a Pandas DataFrame.

Parameters

idList (str or list) – single identifier or comma-delimited list of identifiers. All the identifiers must be from the database PubMed.

Returns

Pandas DataFrame with columns: ‘title’, ‘authors’, ‘journal’, ‘keywords’, ‘abstract’, ‘PMID’ and ‘url’.

remove_directory(directory)[source]
listDirectoryFiles(directory)[source]

Lists all files in a specified directory.

Parameters

directory (str) – path to folder.

Returns

List of file names.

listDirectoryFolders(directory)[source]

Lists all directories in a specified directory.

Parameters

directory (str) – path to folder.

Returns

List of folder names.
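
These directory-listing helpers can be sketched with `os.scandir` (illustrative; the actual implementations may filter differently):

```python
import os
import tempfile

def list_files(directory):
    """Illustrative sketch: names of plain files directly under 'directory'."""
    return [e.name for e in os.scandir(directory) if e.is_file()]

def list_folders(directory):
    """Illustrative sketch: names of subdirectories directly under 'directory'."""
    return [e.name for e in os.scandir(directory) if e.is_dir()]

with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.txt"), "w").close()
    os.mkdir(os.path.join(tmp, "sub"))
    print(list_files(tmp), list_folders(tmp))  # ['a.txt'] ['sub']
```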

listDirectoryFoldersNotEmpty(directory)[source]

Lists all non-empty directories in a specified directory.

Parameters

directory (str) – path to folder.

Returns

List of folder names.

checkDirectory(directory)[source]

Checks if given directory exists and if not, creates it.

Parameters

directory (str) – path to folder.

flatten(t)[source]

Generator that flattens a nested structure. Code from: https://gist.github.com/shaxbee/0ada767debf9eefbdb6e (acknowledgements: Zbigniew Mandziejewicz, shaxbee).

>>> list(flatten([2, [2, (4, 5, [7], [2, [6, 2, 6, [6], 4]], 6)]]))
[2, 2, 4, 5, 7, 2, 6, 2, 6, 6, 4, 6]

pretty_print(data)[source]

This function provides the capability to “pretty-print” arbitrary Python data structures in a form that can be used as input to the interpreter. For more information visit https://docs.python.org/2/library/pprint.html.

Parameters

data – python object.

convertOBOtoNet(ontologyFile)[source]

Takes an .obo file and returns a NetworkX graph representation of the ontology that holds multiple edges between two nodes.

Parameters

ontologyFile (str) – path to ontology file.

Returns

NetworkX graph.

getCurrentTime()[source]

Returns current date (Year-Month-Day) and time (Hour-Minute-Second).

Returns

Two strings: date and time.

convert_bytes(num)[source]

This function converts a size in bytes into a human-readable string (KB, MB, GB, etc.).

Parameters

num – float, integer or pandas.Series.
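
A typical implementation of this kind of conversion looks like the sketch below (the thresholds and formatting are assumptions, not the library's exact output):

```python
def convert_bytes(num):
    """Illustrative sketch: format a byte count as a human-readable string."""
    for unit in ["bytes", "KB", "MB", "GB", "TB"]:
        if num < 1024.0:
            return "%3.1f %s" % (num, unit)
        num /= 1024.0  # move to the next larger unit
    return "%3.1f PB" % num

print(convert_bytes(2048))  # 2.0 KB
```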

copytree(src, dst, symlinks=False, ignore=None)[source]
file_size(file_path)[source]

This function returns the file size.

Parameters

file_path (str) – path to file.

Returns

Size in bytes of a plain file.

Return type

str

buildStats(count, otype, name, dataset, filename, updated_on=None)[source]

Returns a tuple with all the information needed to build a stats file.

Parameters
  • count (int) – number of entities/relationships.

  • otype (str) – ‘entity’ or ‘relationship’.

  • name (str) – entity/relationship label.

  • dataset (str) – database/ontology.

  • filename (str) – path to file where entities/relationships are stored.

Returns

Tuple with date, time, database name, file where entities/relationships are stored, file size, number of entities/relationships imported, type and label.

unrar(filepath, to)[source]

Decompresses a RAR file.

Parameters
  • filepath (str) – path to RAR file.

  • to (str) – path where to extract all files.

unzip_file(filepath, to)[source]

Decompresses a zipped file.

Parameters
  • filepath (str) – path to zip file.

  • to (str) – path where to extract all files.

compress_directory(folder_to_backup, dest_folder, file_name)[source]

Compresses folder to .tar.gz to create data backup archive file.

Parameters
  • folder_to_backup (str) – path to folder to compress and backup.

  • dest_folder (str) – path where to save compressed folder.

  • file_name (str) – name of the compressed file.
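
This can be sketched with the standard `tarfile` module (illustrative; the archive naming convention below is an assumption):

```python
import os
import tarfile
import tempfile

def compress_directory(folder_to_backup, dest_folder, file_name):
    """Illustrative sketch: archive a folder into dest_folder/file_name.tar.gz."""
    archive = os.path.join(dest_folder, file_name + ".tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(folder_to_backup, arcname=os.path.basename(folder_to_backup))
    return archive

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "data")
    os.mkdir(src)
    open(os.path.join(src, "nodes.tsv"), "w").close()
    print(os.path.basename(compress_directory(src, tmp, "backup")))  # backup.tar.gz
```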

read_gzipped_file(filepath)[source]

Opens an underlying process to access a gzip file through the creation of a new pipe to the child.

Parameters

filepath (str) – path to gzip file.

Returns

A bytes sequence that specifies the standard output.
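
The same result can be obtained with the standard `gzip` module (a sketch; the library function itself pipes the file through a child process instead):

```python
import gzip
import os
import tempfile

# Create a small gzip file, then read it back as bytes
tmp = os.path.join(tempfile.gettempdir(), "example.tsv.gz")
with gzip.open(tmp, "wb") as fh:
    fh.write(b"identifier\tname\n")

with gzip.open(tmp, "rb") as fh:
    content = fh.read()
print(content)  # b'identifier\tname\n'
os.remove(tmp)
```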

parse_fasta(file_handler)[source]

Uses Biopython to read a FASTA file into SeqIO record objects.

Parameters

file_handler (file_handler) – opened FASTA file.

Returns

Iterator of sequence record objects.
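
This presumably wraps `Bio.SeqIO.parse(file_handler, 'fasta')`; for orientation, a dependency-free sketch of FASTA parsing (the record in the example is hypothetical):

```python
import io

def parse_fasta_plain(handle):
    """Illustrative dependency-free sketch: yield (header, sequence) pairs."""
    header, seq = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

records = list(parse_fasta_plain(io.StringIO(">sp|P02788|TRFL_HUMAN\nMKLVFLVL\nLFLGALGL\n")))
print(records[0][1])  # MKLVFLVLLFLGALGL
```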

batch_iterator(iterator, batch_size)[source]

Returns lists of length batch_size.

This can be used on any iterator, for example to batch up SeqRecord objects from Bio.SeqIO.parse(…), or to batch Alignment objects from Bio.AlignIO.parse(…), or simply lines from a file handle.

This is a generator function, and it returns lists of the entries from the supplied iterator. Each list will have batch_size entries, although the final list may be shorter.

Parameters
  • iterator (iterator) – iterator whose entries are to be batched.

  • batch_size (integer) – size of each batch.

Returns

Lists of entries, each of length batch_size (the final list may be shorter).

source: https://biopython.org/wiki/Split_large_file
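
The recipe linked above amounts to something like the following sketch using `itertools.islice`:

```python
from itertools import islice

def batch_iterator(iterator, batch_size):
    """Yield successive lists of up to batch_size items from any iterator."""
    iterator = iter(iterator)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

print(list(batch_iterator(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```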

mapping.py