bripipetools Core Packages

Overview

“Core” packages do most of the heavy lifting and are called by application-level modules to perform pipeline tasks. Packages are listed roughly in order of dependency: packages listed earlier depend on those listed later.

Note

Intended for developers!

The documentation below is effectively a dump of all low-level packages, modules, classes, and methods that are used to run bripipetools. This amount of detail shouldn’t be needed for most users, but provides a starting point for those looking to understand or modify the code.


Package details

annotation package

Includes critical functionality for identifying, locating, and describing data and results at various points (e.g., data generation, computational processing) in the bioinformatics pipeline. Each “annotator” class, contained in its respective module, is responsible for collecting and/or updating information for a specific object in the GenLIMS database. When possible, details for an object are retrieved directly from the database; for new objects or objects with missing fields, information is compiled, parsed, and formatted (as needed) from files on the server.
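The lookup-then-compile behavior described above can be sketched as follows. This is a minimal illustration, not the package's actual code: the class name LibraryAnnotator, the method get_library, and the two lookup callables are hypothetical stand-ins for the real database and file-system access.

```python
class LibraryAnnotator:
    """Hypothetical annotator: prefer the database, fall back to files."""

    def __init__(self, library_id, db_lookup, file_lookup):
        self.library_id = library_id
        self._db_lookup = db_lookup      # assumed callable: id -> dict or None
        self._file_lookup = file_lookup  # assumed callable: id -> dict

    def get_library(self):
        # Start from a copy of the database record, or from a bare record
        # for an object that is not in the database yet.
        record = dict(self._db_lookup(self.library_id) or {"_id": self.library_id})
        # Fill in only the fields that are absent or empty; values already
        # stored in the database are left alone.
        for key, value in self._file_lookup(self.library_id).items():
            if record.get(key) is None:
                record[key] = value
        return record
```

Passing the lookups in as callables keeps the sketch independent of any particular database client or directory layout.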

sequencedlibraries module

Classify / provide details for sequenced libraries (outputs of a flowcell sequencing run) and the associated raw data.

flowcellruns module

Classify / provide details for objects generated from an Illumina sequencing run performed by the BRI Genomics Core.

processedlibraries module

workflowbatches module

Classify / provide details for objects generated from a Globus Galaxy workflow processing batch performed by the BRI Bioinformatics Core.


qc package

Contains classes and methods for performing post-hoc quality control operations on raw or processed genomics data. Modules are organized according to the specific QC step performed. Unlike the routine quality metrics that standard bioinformatics tools report during processing workflows, modules here are aimed at identifying problems with sample handling or data generation. As such, outputs from these modules are designated as a special type, ‘validation’, to distinguish them from the QC, metrics, counts, and other output types generated through processing.

sexcheck module

Class and methods to perform routine sex check on all processed libraries.

sexverify module

Class and methods to perform routine sex check on all processed libraries.

sexpredict module

Class and methods to perform routine sex check on all processed libraries.


database package

Contains methods for low-level interaction with BRI databases (GenLIMS and ResDB): connecting to them, retrieving data from them, and inserting data into them. Under the hood, much of the functionality in this package relies on the pymongo client library for MongoDB. The database.operations module provides wrapper functions for getting/putting objects from/to commonly used database collections, while database.mapping helps to construct Python model class objects from database documents. Methods in the database.connection module manage the database connection, depending on environment and configurations.
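As a rough illustration of what get/put wrappers around pymongo can look like, here is a minimal sketch. The function names get_documents and put_document are hypothetical and are not taken from database.operations; the only pymongo calls assumed are Collection.find and Collection.update_one with upsert=True.

```python
def get_documents(db, collection, query):
    """Return all documents in `collection` that match `query`."""
    # `db` is assumed to be a pymongo Database (or any object with the same
    # interface); Collection.find returns an iterable cursor.
    return list(db[collection].find(query))


def put_document(db, collection, doc):
    """Insert `doc`, or update the stored document with the same `_id`.

    Assumes `doc` has an `_id` and at least one other field.
    """
    # `_id` selects the target document, so it is left out of the $set payload.
    fields = {key: value for key, value in doc.items() if key != "_id"}
    db[collection].update_one({"_id": doc["_id"]}, {"$set": fields}, upsert=True)
```

Using an upsert means the caller does not need to check whether the object already exists before writing it.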

connection module

Connect to a BRI Mongo database.

operations module

Basic operations for BRI Mongo databases.

mapping module

Methods to map from Mongo documents to model classes.


model package

Establishes the underlying data model linking data from bioinformatics processing pipelines to the GenLIMS/TG3 database. Python class representations of database objects (documents) are defined in the model.documents module. These classes include some basic functionality, mostly related to setting/formatting attributes, which are eventually fed back into the database as key-value pairs. However, model classes are also the basic “currency” for several other modules, where they are used to retrieve, modify, store, and return data.

Depends on the util and parsing modules.
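To make the idea of a model class that feeds its attributes back to the database as key-value pairs more concrete, here is a minimal sketch. The class SampleDocument, its fields, the to_json method, and the camelCase key convention are all assumptions made for illustration; the real classes are defined in model.documents.

```python
def to_camel_case(name):
    """Convert a snake_case attribute name to a camelCase document key."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


class SampleDocument:
    """Hypothetical model class standing in for a database document."""

    def __init__(self, _id, project_id=None):
        self._id = _id
        self.project_id = project_id

    def to_json(self):
        # Emit only attributes that are set. `_id` is kept verbatim because
        # it is the MongoDB primary key; other names are converted.
        return {
            (key if key == "_id" else to_camel_case(key)): value
            for key, value in vars(self).items()
            if value is not None
        }
```

For example, SampleDocument("lib7014", project_id="P43").to_json() yields a dictionary with the keys `_id` and `projectId`.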

documents module

Classes representing documents in the GenLIMS database.


io package

Contains class representations of various file types produced through the generation or processing of genomics data. In particular, most of these classes provide methods for reading and parsing raw data from files and storing/returning these data in a more usable format, such as dictionaries or data frames. Each module contains the representation of a file generated by a particular tool or routine; some modules may handle files from multiple methods within a tool (e.g., Picard). While not explicitly organized as such, modules adhere to a hierarchy based on the “type” of file, where current types include metrics, counts, QC, and validation.
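The read-parse-return pattern described above might look like the following for a counts file. This is a sketch only: the class name HtseqCountsFile and its parse method are hypothetical, and it assumes the tab-separated htseq-count output in which special counters (such as `__no_feature`) carry a `__` prefix. It returns plain dictionaries rather than a data frame to stay within the standard library.

```python
import csv


class HtseqCountsFile:
    """Hypothetical reader for an htseq-count output file."""

    def __init__(self, path):
        self.path = path

    def parse(self):
        """Return (gene_counts, summary_counts) as two dictionaries."""
        genes, summary = {}, {}
        with open(self.path, newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                if len(row) != 2:
                    continue  # skip blank or malformed lines
                name, count = row[0], int(row[1])
                # Special counters are separated from per-gene counts.
                target = summary if name.startswith("__") else genes
                target[name] = count
        return genes, summary
```

Keeping the summary counters apart from the gene counts lets downstream code total reads per gene without filtering out the special rows first.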

workflow module

Class for reading and parsing Galaxy workflow files.

workflowbatch module

Classes for reading, parsing, and writing workflow batch submit files for Globus Galaxy.

picardmetrics module

Class for reading and parsing Picard metrics files.

htseqmetrics module

Class for reading and parsing htseq metrics files.

tophatstats module

Class for reading and parsing Tophat Stats metrics files.

fastqc module

Class for reading and parsing FastQC report files.

htseqcounts module

Class for reading and parsing htseq count files.

sexcheck module

Class for reading and parsing sex check validation files.


parsing package

Provides functions for parsing and extracting information from strings that follow some expected nomenclature; these functions are slightly more specialized than the methods in the util.strings module. The primary examples of this information are IDs, names, labels, and other metadata for files and objects generated either by Illumina technology or the BRI Genomics Core (via GenLIMS). The parsing.processing module is also designed to handle specialized strings and labels related to processing workflows in Globus Galaxy.

Depends on the util module.
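As an example of parsing a string that follows an expected nomenclature, the sketch below extracts the parts of a HiSeq-style Illumina run folder name (date, instrument, run number, flowcell position, flowcell ID). The function name parse_flowcell_run_id and the returned field names are hypothetical, and other instruments may use a different layout that this pattern does not cover.

```python
import re

# Assumed layout: <YYMMDD>_<instrument>_<run number>_<A|B><flowcell ID>
RUN_ID_PATTERN = re.compile(
    r"(?P<date>\d{6})_(?P<instrument>[A-Za-z0-9]+)_(?P<run_number>\d+)"
    r"_(?P<position>[AB])(?P<flowcell_id>[A-Za-z0-9]+)"
)


def parse_flowcell_run_id(text):
    """Return the components of the first run ID found in `text`."""
    match = RUN_ID_PATTERN.search(text)
    if match is None:
        raise ValueError("no flowcell run ID found in {!r}".format(text))
    return match.groupdict()
```

Because the pattern is searched rather than anchored, the function accepts either a bare run ID or a longer path that contains one.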

gencore module

illumina module

processing module


util module

Includes convenience methods for handling and manipulating strings (util.strings) and file paths (util.files), and for user interactions via the command line (util.ui). These methods are used throughout other packages to streamline common operations.
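A typical string convenience function of this kind is a naming-style converter. The sketch below is illustrative only; to_snake_case is a hypothetical name and is not claimed to match anything in util.strings.

```python
import re


def to_snake_case(name):
    """Convert a camelCase name to snake_case."""
    # Insert an underscore wherever a lowercase letter or digit is followed
    # by an uppercase letter, then lowercase the whole string.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name).lower()
```

Names that are already snake_case pass through unchanged.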

strings submodule

files submodule