Processing Data¶
Data processing is the primary function of bripipetools, encompassing all bioinformatics operations performed on raw data (typically sequencing libraries) to generate processed output files. BRI processing pipelines do not include statistical analyses performed on output data.
Workflows¶
Workflows are the heart of bioinformatics processing operation. They comprise a repeatable series of steps performed on a collection of data files. Steps within a workflow can be broken down or categorized into specific modules, depending on the goal or output of the step.
Processing modules¶
data QC¶
Quality control refers to basic inspection and computation of quality metrics/statistics for raw sequencing data. Of course, quality assessment can and should occur at multiple stages from data generation to analysis. Many QC-related metrics can be evaluated to help diagnose library performance, but in practice we tend to focus on 3 main metrics:
The median CV of gene model coverage for the top-expressed genes (generated by
picard)The total number of reads (ie: “depth”) in the input fastq file (generated by
tophatStats)The number of reads that were aligned, either as singletons or pairs (generated by
picard)
Other metrics such as library complexity, GC content, percent duplication, etc. may be useful in some experiments, and can be retrieved from the genomicsMetrics collection in The Research Database.
Current tools: FastQC, picard, tophatStats
trimming¶
Trimming in an RNAseq workflow refers to two different processing steps.
Hard trimming: Removal of a fixed number of bases from raw sequencing reads. A nearly universal characteristic of sequencing data from the BRI Genomics Core is an unexpected spike in GC skew at the last position (on the 3’ end) of reads INCLUDE SCREENSHOT. For this reason, the first step in most workflows is the removal of the last 3’ base from each read.
Quality trimming: Modern aligners use “dynamic” or “adaptive” trimming to remove bases from either end of individual reads to improve mapping to the reference. Reads can also be pre-processed to remove bases that do not pass a certain quality threshold. Generally, BRI workflows trim reads from both ends until a minimum PHRED-scaled quality score of 30 is reached at both ends. In theory, removing low quality (and thus, low confidence) bases can also improve mapping rate; however, care must be taken to impose a minimum length for trimmed reads, as extremely short reads will lead to high duplication rates and biased results.
Current tools: FASTQ Trimmer, FASTQ Quality Trimmer, trimmomatic
adapter removal¶
For certain library prep procedures (e.g., Nextera), oligonucleotide indexes will be included as part of the PCR amplifcation step, prior to library construction. In these cases, these adapter sequences will appear within reads, in contrast to typical sequencing adapters (e.g., Illumina adapters and indexes used for demultiplexing) that are nominally removed by tools like CASAVA and bcl2fastq.
Current tools: FastqMcf, trimmomatic
alignment¶
This is the central step for almost all NGS processing workflows. Following any trimming, short reads are aligned to a reference genome and the results are stored in a Sequence Alignment Map (SAM) file or in the corresponding binary BAM format. For RNAseq data, it is important to use aligners that are “splice aware” (e.g., TopHat and STAR) to account for the fact that reads from mRNA transcripts may include one or more exons that are not adjacent in the genome (and therefore might not align). Alternatively, RNA reads could be aligned to a reference transcriptome.
In summer of 2018, after multiple comparisons between STAR-aligned and TopHat-aligned libraries, BRI decided to transition sequencing workflows to STAR from TopHat. Some ongoing projects may require continued support for the legacy TopHat workflows.
Current tools: STAR
Previous tools: TopHat
counting¶
After reads have been aligned to the genome, reference annotations (i.e., gene models) can be used to inspect and quantify read coverage within regions of interest (e.g., exons). Note that the exact numbers resulting from this counting are sensitive to the version of the gene model annotations (ie: GTF file) that are used.
Current tool: htseq-count
variant calling and SNP fingerprinting¶
In order to detect sample annotation errors, SNPs called in multiple libraries from the same subject can be compared for consistency. This is done using KING, which generates an estimate of a “kinship coefficient.” Additionally, SNPs can be called for association with traits of interest, such as gene expression or disease phenotype.
Current tools: samtools, bcftools, KING
metrics¶
Unlike QC, “metrics” is a catch-all category to describe any summary or quality measures of post-alignment data. The primary source of metrics is the Picard suite of tools for evaluating alignment in SAM/BAM files. For downstream tools that produce reports or logs (e.g., htseq-count), outputs are also categorized and stored under metrics.
Current tools: Picard
VDJ annotation¶
Many projects (eg: scRNAseq projects) involve identification of TCR sequences. To achieve sequencing of these highly polymorphic sequences, we first perform a de novo assembly using Trinity (see below), and then use MiXCR to identify the TCR chains from the assembly.
Current tools: mixcr
assembly¶
For experiments where it doesn’t make sense to align to a reference (eg: no reference available, huge number of polymorphisms), we perform a de novo assembly of the reads. This essentially aligns the reads to one another in a series of steps, building long sequences representing transcripts from the short reads of fragments. These long transcript sequences can then be used for a number of purposes, such as building a transcriptome, or aligning transcripts with partially conserved and partially variable regions (as in TCR identification).
Current tools: Trinity
peak calling¶
For some sequencing experiments (ChIPseq, ATACseq, CUT&RUN, CUT&TAG) it is necessary to call peaks where reads are found to pile up. MACS2 is the most common tool used to achieve this, but performs variably depending on the details of the experimental design and the settings used. In practice, we have found it best to assess the quality of peak calls from different peak callers on an experiment-by-experiment basis. Current comparisons in the literature suggest that there is a fair amount of room for optimization of peak calling precision and recall (<https://www.biorxiv.org/content/10.1101/306621v2.full/>_).
Current tools: MACS2 (used in Galaxy workflows), Genrich (used outside of Galaxy)
Workflow options¶
The following workflows are currently available for batch processing in Globus Genomics.
Production Workflows: RNAseq: TruSeq, single-end, stranded, STAR (with or without Trinity) RNAseq: Nextera, single-end, non-stranded, STAR (with or without Trinity)
Workflows in Development: ATACseq: Nextera, paired-end, non-stranded, STAR, MACS2 CUT&RUN/CUT&TAG: custom libraries, paired-end, non-stranded, STAR, MACS2
Deprecated Workflows: TruSeq, Stranded, TopHat (with or without Trinity) Nextera, Non-stranded, TopHat (with or without Trinity)
Composing a workflow¶
(in Globus Galaxy)
Implementing a new (production) workflow in Globus Galaxy consists of two steps: building a new workflow and annotating all input and output steps.
Building a workflow in Galaxy¶
Use the Workflow Editor in Globus Galaxy for the following steps:
Add all tools for processing modules (e.g., trimming, alignment, counting).
Connect inputs and outputs of individual tools.
Add workflow inputs: 1. Get Globus FASTQ data 2. Input Dataset (for reference/annotation files)
Add workflow outputs (Send Globus data)
Set all get/send data endpoint and path options to ‘set at runtime’
(optional) Set build-specific and other options to ‘set at runtime’
Annotate input and output steps (and potentially build-specific parameters)
Annotating parameters¶
For all parameters where values are to be set at runtime *, tags of the following format should be added to the Annotation / Notes field in the Globus Galaxy Workflow Editor.
* “option” parameters are recognized by the combination of their tag (in the Annotation field) as well as their name. This is different than the label field in the Galaxy workflow. In older versions of Galaxy the label was assigned automatically, but more recent versions require the user to specify a unique label name for each tool in the Galaxy workflow. This is important to remember when importing/editing workflows that were developed on a previous instance of Galaxy.
Input parameters¶
Input parameters — indicating local files that will be uploaded to Globus Galaxy nodes at the start of workflow processing — should have the following form:
extension_in
This typically only applies to fastq_in.
Output parameters¶
Output parameters are expected to have the following form:
<source>_<type>_<extension>_<out>
For example, the tag picard-rnaseq_metrics_html_out will be parsed into a dictionary like this::
{
'type': 'metrics',
'label': 'metrics',
'source': 'picard-rnaseq',
'extension': 'html'
}
Both source and label can be given added specificity with a hyphen-separated string (e.g., picard vs. picard-rnaseq or metrics vs. metrics-rmdup). The parsing code should automatically detect and group these clauses appropriately.
Annotation input parameters¶
Some workflows will access and load datasets stored in the Globus Galaxy library. These inputs (represented as Input Dataset in the workflow editor) should have annotation tags in the following form:
annotation_<type>
You can also give a name to the dataset to possibly ease navigation within the editor, but these names will not be used by downstream code.
The most common annotation input parameters are the following:
GTF gene model files:
annotation_gtf(optional name:gtfFile)Gene model refFlat files:
annotation_refflat(optional name:refFlatFile)Ribosomal interval files:
annotation_ribosomal-intervals(optional name:riboIntsFile)Adapter files:
annotation_adapters(optional name:adapterFile)
Saving the workflow for use in bripipetools¶
Once a workflow is finished and ready for testing, both the workflow template and the workflow detail files must be downloaded from Galaxy. The template file will be used to generate workflow batch files, and the workflow detail file will be used to store tool version information in the research database.
Save the workflow template¶
Click the arrow next to the workflow name in the Galaxy Workflows tab.
Select “Submit via API batch mode”.
On the following page, click the link to “Export Workflow Parameters for batch submission” and save the .txt file under
genomics/galaxy_workflows(wherever the path exists relative to your local system); make sure to remove the leadingGalaxy-API-Workflow-from the filename.
Save the workflow details¶
Click the arrow next to the workflow name in the Galaxy Workflows tab.
Select “Download or Export”
Click the link that says “Download workflow to file so that it can be saved or imported into another Galaxy server” and save the .ga file under
genomics/galaxy_workflows(wherever the path exists relative to your local system); make sure to remove the leadingGalaxy-Workflow-from the filename.
You should now have a template file with a .txt extension and a details file with a .ga extension, with otherwise identical file names that corresponding to your workflow. Note that bripipetools requires both of these files for a given workflow in order to function properly.
Importing a new workflow to the Research Database¶
When bripipetools wrapup is run on a workflow batch file for a new workflow, the tool details will be included in the new document in the genomicsWorkflowbatches collection (see Databases for more information).
Running a workflow¶
The following is a general overview of how to a workflow. For more details please see RNAseq Processing Quickstart.
All of the following steps except the initial BaseSpace download should work while on srvgalaxy01.
Pipeline steps¶
process-upload
process-collect
Downloading & prepping data¶
When a new flow cell is ready for processing, a notification email is sent from the Genomics Core via BaseSpace. Information about the flowcell and corresponding projects can be found in the Flowcell log.xlsx file under DFS_Chaussabel_LabShare/Illumina HiScan SQ/ on the [srvstor01](srvstor01.brivmrc.org) server. In particular, you’ll need to pay attention to the Lane Contents tab to determine the appropriate workflow to use for each project.
On srvgalaxy01 under /mnt/genomics/Illumina/<flowcell-folder>/, create a new folder called Unaligned/ (if it doesn’t already exist). Modify permissions such that all users can write to and read from the folder (chmod -R 777 Unaligned/). The new folder should look something like this:
FC_FOLDER="/mnt/genomics/Illumina/150615_D00565_0087_AC6VG0ANXX/Unaligned"
Using bripipetools¶
The bripipetools command (which calls bripipetools/__main__.py) is the entrypoint to application functionality. If you have the bripipetools package installed, you should be able to use this command from anywhere on your system.
bripipetools --help
Usage: bripipetools [OPTIONS] COMMAND [ARGS]...
Command line interface for the `bripipetools` library.
Options:
--quiet only display printed outputs in the console - i.e., no log messages
--debug include all debug log messages in the console
--help Show this message and exit.
Commands:
dbify Import data from a flowcell run or workflow...
postprocess Perform postprocessing operations on outputs...
qc Run quality control analyses on a target...
submit Prepare batch submission for unaligned...
wrapup Perform 'dbify' and 'postprocess' operations...
Preparing workflow batches for submission¶
At this point, you’ll need to identify the most applicable workflow. Important considerations are:
Species (mouse, human, E. coli, etc)
Genome assembly and gene annotation version (GRCh38.91 vs GRCh38.77, etc)
Library preparation method (TruSeq, Nextera, CUT&RUN, etc.)
Aligner (STAR unless the project needs to be combined with data from older, TopHat-based workflows)
Additional workflow requirements (Trinity/MiXCR, MACS2, etc)
Refer to flowcell log¶
The flowcell log can be found at DFS_Chaussabel_LabShare/Illumina HiScan SQ/Flowcell log.xlsx.
Using bripipetools to submit¶
bripipetools submit --help
Usage: bripipetools submit [OPTIONS] PATH
Prepare batch submission for unaligned samples from a flowcell run or from
a list of paths in a manifest file.
Options:
--endpoint TEXT Globus Online endpoint where input data is
stored and outputs will be saved
--workflow-dir TEXT path to folder containing Galaxy workflow
template files to be used for batch
processing
--all-workflows / --optimized-only
indicate whether to include all detected
workflows as options or to keep 'optimized'
workflows only
-s, --sort-samples sort samples from smallest to largest (based
on total size of raw data files) before
submitting; this is most useful when also
restricting the number of samples
-n, --num-samples INTEGER restrict the number of samples submitted for
each project on the flowcell
-m, --manifest indicates that input path is a manifest of
sample or folder paths (not a flowcell run)
from which a workflow batch is to be created
(note: options 'sort-samples' and 'num-
samples' will be ignored)
-o, --out-dir TEXT for input manifest, folder where outputs are
to be saved; default is current directory
--help Show this message and exit.
Here’s an example call::
bripipetools submit \
--workflow-dir /mnt/genomics/galaxy_workflows \
--endpoint benaroyaresearch#BRIGridFTP \
/mnt/genomics/Illumina/150615_D00565_0087_AC6VG0ANX
Here’s another example with a manifest file:
bripipetools submit \
--workflow-dir /Volumes/genomics/galaxy_workflows/ \
--out-dir /Volumes/genomics/ICAC/Gern/ -\
-tag gern \
--manifest <(find /Volumes/genomics/ICAC/Gern -name "Sample_*")
Submitting batches in Galaxy/Globus Genomics¶
Authenticating Globus endpoint¶
First, sign in to Globus <https://app.globus.org/endpoints/>_ and navigate to the ENDPOINTS page. Select benaroyaresearch#BRIGridFTP then Activate, after which you’ll be prompted to enter your login credentials for the srvgridftp01 BRI server. Make sure to expand the “advanced” options and set the “Credential Lifetime” to something large like 10000 hours (that way, you won’t need to reauthenticate for about a week).
Uploading batch submit files¶
In the web interface for the instance of Galaxy managed by Globus for BRI (currently at <bri.globusgenomics.org/>_), select “Batch Management -> Workflow batch submit” from the left-hand side menu.
Enter “benaroyaresearch#BRIGridFTP” into the Source Endpoint field, and enter the file path generated by
bripipetools submitfor your workflow batch into the Source Path field.
Submitting batch jobs¶
Note
Monitoring Batch Jobs
In general, it’s a good idea to monitor the status of jobs intermittently during a run. This can help diagnose any issues that come up early, which will save time and AWS resources. To view currently-running jobs, you can click on the gear in the top right corner of the Galaxy dashboard, then select “Saved Histories”. Any jobs with errors will appear with red boxes in the “Datasets” column.
Warning
Batch Submission Size
Depending on the number and type of jobs in the batch, it may take several hours or even a day or two for Galaxy to complete all of the jobs. It’s best to submit workflows with only a couple hundred jobs and wait for them to complete, in case there’s any troubleshooting that needs to take place during this phase. However, there’s nothing wrong with uploading all of your batch files at once and submitting them one at a time after each finishes.
In the Galaxy web interface select “Globus Data Transfer -> Get Data via Globus” from the left-hand side menu.
Select the job number for the workflow batch file you’ve uploaded, and click Execute.
Collecting workflow batch results¶
Usage: bripipetools wrapup [OPTIONS] PATH
Perform 'dbify' and 'postprocess' operations on all projects and workflow
batches from a flowcell run.
Options:
-t, --output-type [c|m|q|v|a] type of output file to combine: c [counts],
m [metrics], q [qc], v [validation], a [all]
-x, --exclude-types [c|m|q|v] type of output file to exclude: c [counts],
m [metrics], q [qc], v [validation]
--stitch-only / --stitch-and-compile
Do NOT compile and merge all summary (non-
count) data into a single file at the
project level
--clean-outputs / --outputs-as-is
Attempt to clean/organize output files
--help Show this message and exit.
Importing flowcell data into GenLIMS¶
Usage: bripipetools dbify [OPTIONS] PATH
Import data from a flowcell run or workflow processing batch into GenLIMS
database.
Options:
--help Show this message and exit.
Postprocessing workflow outputs¶
Usage: bripipetools postprocess [OPTIONS] PATH
Perform postprocessing operations on outputs of a workflow batch.
Options:
-t, --output-type [c|m|q|v|a] type of output file to combine: c [counts],
m [metrics], q [qc], v [validation], a [all]
-x, --exclude-types [c|m|q|v] type of output file to exclude: c [counts],
m [metrics], q [qc], v [validation]
--stitch-only / --stitch-and-compile
Do NOT compile and merge all summary (non-
count) data into a single file at the
project level
--clean-outputs / --outputs-as-is
Attempt to clean/organize output files
--help Show this message and exit.
Follow up steps¶
Not all pipeline steps have been integrated into the bripipetools application code base. Remaining steps are performed with scripts located in the scripts folder.
Generating gene model coverage plots¶
usage: plot_gene_coverage.py PATH
while read path; do \
python scripts/plot_gene_coverage.py $path;
done < <(find <path-to-flowcell-folder> -name "metrics" -maxdepth 2)
Running MiXCR (depending on workflow version)¶
Note: requires SLURM!! (must run on server srvgalaxy01)
/mnt/code/shared/bripipetools/
usage: run_mixcr.py [-h] -i INPUTDIR -o RESULTSDIR [-x EXCLUDENODES]
[-s SPECIES] [-c CHAINTYPE] [-k]
Notes on arguments:
-x nodename allows the user to exclude slurm nodes from use in processing the batch
-s species sets the species (‘hsa’ = human (default), ‘mmu’ = mouse)
-c chaintype defines the immunological chain to check for. This should be ‘TCR’ (default) or ‘ALL’ (for BCR identification)
-k sets the use of KAligner2 during the align phase, which is useful for alignments with large gaps (ie: BCR identification).
while read path; do \
outdir="$(dirname $path)/mixcrOutput_trinity";
python scripts/run_mixcr.py -i $path -o $outdir;
done < <(find <path-to-flowcell-folder> -name "Trinity" -maxdepth 2)
Handy shortcut::
# Custom formatted output from squeue
alias squeuel='squeue -o "%.7i %.9P %.30j %.10u %.8T %.10M %.6D %.5C %.8p %R"'
Concatenating Trinity outputs¶
usage: concatenate_trinity_output.py PATH
while read path; do \
python scripts/concatenate_trinity_output.py $path;
done < <(find <path-to-flowcell-folder> -name "Trinity" -maxdepth 2)
Generating project links¶
usage: generate_project_links.sh PATH
bash scripts/generate_project_links.sh <path-to-flowcell-folder>
Inspecting outputs¶
After data have been processed on Galaxy and sent back to BRI’s storage, results will be stored under the flowcell folder in a new folder that looks like Project_<project-id>Processed_<date>, where date is the YYMMDD string of the date on which the script was run — e.g., Project_P43-12Processed_151208.
Reprocessing old data¶
Because of the nature of NGS processing and rapid developments in the technology available to the field, we can expect periodic updates to the pipeline that will be unable to support older data. The following section describes how to address some of these situations.
Generating .fastqs: bcl2fastq¶
Overview¶
bcl2fastq is software provided by Illumina to help convert raw sequence data (binary call files - ‘.bcl’s) to .fastq data. This program will also perform ‘demultiplexing’, or the assignment of reads to libraries. Because we most commonly sequence multiple libraries on the same lane of a flow cell, it is necessary to find the index sequence for each read to determine which library the sequence data came from.
This step is usually performed automatically by BaseSpace based on the sample sheet that’s submitted when the sequencing run begins. However, it is sometimes necessary to manually run bcl2fastq, for example if BaseSpace encounters an error, or if there’s a need to recover .fastq files from older sequencing runs before BaseSpace.
Version¶
The exact version of bcl2fastq being used is important, because different versions support different sample sheet formats for use in demultiplexing (see details below). As of June 22nd 2020, bcl2fastq v2.20.0.422 is installed on srvgalaxy01 at /usr/local/bin/bcl2fastq. All details below are with respect to this version. Documentation for other versions is available at Illumina’s support site.
Demultiplexing Considerations¶
The connection between the index sequence and library/sample identifier is made through the sample sheet. bcl2fastq looks for this file in a default location at path/to/flowcell/run/SampleSheet.csv, but you can specify a different location for the sample sheet using the --sample-sheet argument.
If bcl2fastq cannot find a sample sheet or cannot read the format of the sample sheet, it won’t display any diagnostic message, but will assign all reads to a project-less “Undetermined_*.fastq.gz” output file.
Sample Sheet Format
It is important to make sure that the sample sheet conforms to the file specifications outlined in the bcl2fastq documentation. In particular, the sample sheet must have a ‘[data]’ section with columns named ‘Lane’, ‘Sample_ID’, ‘Sample_Name’, ‘Sample_Project’, and ‘index’ (note capitalization). Older versions of the sample sheet contained the camel case variants of these column names (eg: ‘SampleName’) and do not have a ‘[data]’ section designator. These files cannot be read by bcl2fastq and need to be re-formatted for use in demultiplexing.
Index Reads
By default, bcl2fastq will attmept to use the information contained in path/to/flowcell/run/RunInfo.xml to determine where the index sequence information is contained in the read data. In most cases this should work fine, but if you encounter issues with different types of indexing combined on one flow cell, you can explicitly tell bcl2fastq where to read the index data using the --use-bases-mask argument, as detailed in the bcl2fastq documentation. You can also specify which lanes/tiles to use with the --tiles argument.
For example, a sequencing run with Nextera dual-indexed libraries and TruSeq single indices may have a RunInfo.xml file that contains
<Reads>
<Read Number="1" NumCycles="100" IsIndexedRead="N" />
<Read Number="2" NumCycles="8" IsIndexedRead="Y" />
<Read Number="3" NumCycles="8" IsIndexedRead="Y" />
</Reads>
This is appropriate for the dual, 8-length indices used in the Nextera libraries, but not the 6-length single-indexed TruSeq libraries. If the TruSeq libs were run in lanes 1 and 5, processing the TruSeq libraries can be handled using the command
bcl2fastq \
--runfolder-dir /path/to/my/flowCell \
--output-dir /path/to/my/flowCell/Unaligned \
--sample-sheet /path/to/my/flowCell/SampleSheet.csv \
--tiles s_1,s_5 \
--use-bases-mask y*,i6n*,n*
The example base mask y*,i6n*,n* means “Read 1 is all fragment insert sequence data. Read 2 contains index data in the first 6 bases, then ignore the rest of the read. Ignore all of read 3.”
Retrieving details for old workflows (DEPRECATED)¶
Note
For Reference Only
The following details refer to an instance of Galaxy that was hosted and managed by BRI. This instance has been turned off and the data archived. The following information is retained for reference and completeness, but workflow and history tracking are now managed via the Databases collections.
To collect details about old workflows and histories from processing jobs on the local Galaxy server, one can either use the PostgreSQL database directly, or take advantage of an R script for interacting with the database.
Galaxy PostgreSQL database queries¶
Keeping track of various queries here with thought of eventually combining into scripts or functions.
Basic login to db::
svc_galaxy@srvgalaxy02:~$ psql svc_galaxy
History info for a project::
svc_galaxy=# select * from history where name like '%P15-8%';
svc_galaxy=# select id from history where name like '%P15-8%';
Dataset info for a specific History¶
List datasets::
svc_galaxy=# SELECT dataset_id FROM history_dataset_association WHERE history_id = '536';
Get full dataset info::
svc_galaxy=# SELECT * FROM dataset WHERE id IN (SELECT dataset_id FROM history_dataset_association WHERE history_id = '536');
Job info for a specific History¶
svc_galaxy=# SELECT * FROM job WHERE history_id = '536';
Job metrics for specific steps¶
svc_galaxy=# SELECT * FROM job_metric_numeric WHERE job_id IN (SELECT id FROM job WHERE history_id = '529' AND tool_id LIKE '%/tophat/%') AND metric_name = 'runtime_seconds';
Job metrics for datasets¶
svc_galaxy=# SELECT * FROM job_to_input_dataset WHERE dataset_id IN (SELECT dataset_id FROM history_dataset_association WHERE history_id = '536');