Assembly Workflows
The following three workflows describe the reference-based assembly methods for paired-end and single-end Illumina sequencing data and for ONT sequencing data. Each workflow accepts "sample" as the root entity type.
Illumina PE
File: SC2_illumina_pe_assembly.wdl
This workflow was developed for the assembly of Illumina 150 bp paired-end read data using the Illumina Nextera XT library prep protocol. The workflow accepts "sample" as the root entity type. The workflow will:
- Use Seqyclean to quality filter and trim raw fastq files
  - Seqyclean parameters include a minimum read length set to 70 bp and quality trimming set to a minimum Phred quality score of 30.
- Run FastQC on both the raw and cleaned reads
- Align reads to the reference genome using bwa and then sort the bam by coordinates using Samtools
- Use iVar trim to trim primer regions and then sort the trimmed bam by coordinates using Samtools
- Use iVar variants to call variants from the trimmed and sorted bam
  - iVar variants parameters include a minimum quality score set to 20, a minimum variant base frequency set to 0.6, and a minimum read depth set to 10.
- Use iVar consensus to call the consensus genome sequence from the trimmed and sorted bam
  - iVar consensus parameters include a minimum quality score set to 20, a minimum variant base frequency set to 0.6, and a minimum read depth set to 10.
- Use Samtools flagstat, stats, and coverage to output statistics from the bam
- Rename the fasta header of consensus sequences in the GISAID-acceptable format: CO-CDPHE-{sample_id}
- Calculate the percent coverage using the calc_percent_coverage.py script available in the python_scripts directory of this repo.
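As a rough sketch of what the percent-coverage step computes (this is an assumption about calc_percent_coverage.py, whose readme is authoritative): percent coverage is taken here as the share of the 29,903 bp MN908947.3 reference with a non-N consensus call.

```python
# Minimal sketch of a percent-coverage calculation. Assumption: the real
# calc_percent_coverage.py may define coverage differently; see its readme.
REFERENCE_LENGTH = 29903  # MN908947.3 genome length


def read_fasta_sequence(fasta_text: str) -> str:
    """Concatenate sequence lines, skipping '>' header lines."""
    return "".join(
        line.strip() for line in fasta_text.splitlines()
        if line and not line.startswith(">")
    )


def percent_coverage(consensus_seq: str, ref_length: int = REFERENCE_LENGTH) -> float:
    """Percent of reference positions with a non-N consensus call."""
    called = sum(1 for base in consensus_seq.upper() if base != "N")
    return round(100 * called / ref_length, 2)


fasta = ">CO-CDPHE-sample01\n" + "A" * 20000 + "N" * 9903
print(percent_coverage(read_fasta_sequence(fasta)))
```

A consensus with the standard iVar depth masking will typically contain leading and trailing Ns, which this calculation counts as uncovered positions.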
Inputs
1. Terra Data Table
The terra data table must include the columns listed below. Note that the optional columns are not necessary for the assembly workflow but must be present for the SC2_lineage_calling_and_results.wdl and transfer workflows described below under Lineage Calling Workflows and Transfer Workflows, respectively.
column header | description |
---|---|
entity:sample_id | column with the list of sample names (e.g. entity:covwwt-0203_id) |
fastq_1 | The google bucket path to the R1 fastq file. |
fastq_2 | The google bucket path to the R2 fastq file. |
out_dir | User-defined google bucket to which the files will be transferred during the transfer workflows. |
workbook_path | (optional; required for lineage calling workflow) |
project_name | (optional; required for lineage calling workflow) |
2. Terra Workspace Data
See setup.
3. Setting Up the Workflow Inputs
For setting up the workflow inputs, use the SC2_illumina_pe_assembly-input.json file in the workflow_inputs directory.
workflow variable | attribute (input syntax into workflow) |
---|---|
adapters_and_contaminants | workspace.adapters_and_contaminants_fa |
calc_percent_coverage_py | workspace.covid_calc_percent_coverage_py |
covid_genome | workspace.covid_genome_fa |
covid_gff | workspace.covid_genome_gff |
fastq_1 | this.fastq_1 |
fastq_2 | this.fastq_2 |
primer_bed | workspace.artic_v4-1_bed |
s_gene_amplicons | workspace.artic_v4-1_s_gene_amplicons |
sample_name | this.{entity_name}_id |
Outputs
WDL task name | software/program | variable name | description |
---|---|---|---|
seqyclean | seqyclean | filtered_reads_1 | file |
seqyclean | seqyclean | filtered_reads_2 | file |
seqyclean | seqyclean | seqyclean_summary | file |
fastqc as fastqc_raw | fastqc | fastqc_raw1_html | file |
fastqc as fastqc_raw | fastqc | fastqc_raw1_zip | file |
fastqc as fastqc_raw | fastqc | fastqc_raw2_html | file |
fastqc as fastqc_raw | fastqc | fastqc_raw2_zip | file |
fastqc as fastqc_cleaned | fastqc | fastqc_clean1_html | file |
fastqc as fastqc_cleaned | fastqc | fastqc_clean1_zip | file |
fastqc as fastqc_cleaned | fastqc | fastqc_clean2_html | file |
fastqc as fastqc_cleaned | fastqc | fastqc_clean2_zip | file |
align_reads | bwa and samtools | out_bam | file |
align_reads | bwa and samtools | out_bamindex | file |
align_reads | bwa and samtools | assembler_version | string recording the bwa version; this information is used later when submitting to public repositories |
ivar trim | ivar trim and samtools | trim_bam | file |
ivar trim | ivar trim and samtools | trimsort_bam | file |
ivar trim | ivar trim and samtools | trimsort_bamindex | file |
ivar variants | ivar variants | variants | vcf file formatted as a tsv |
ivar consensus | ivar consensus | consensus | fasta file of the consensus genome; Ns are called at positions with less than 10 bp read depth |
bam_stats | samtools flagstat, stats, percent_coverage | flagstat_out | file |
bam_stats | samtools flagstat, stats, percent_coverage | stats_out | file |
bam_stats | samtools flagstat, stats, percent_coverage | covhist_out | file |
bam_stats | samtools flagstat, stats, percent_coverage | cov_out | file |
bam_stats | samtools flagstat, stats, percent_coverage | cov_s_gene_amplcions_out | file |
bam_stats | samtools flagstat, stats, percent_coverage | cov_s_gene_out | file |
rename_fasta | N/A | renamed_consensus | fasta file; consensus genome sequence with the fasta header renamed to CO-CDPHE-{sample_name} |
calc_percent_cvg | calc_percent_coverage.py | percent_cvg_csv | csv file; see the calc_percent_cvg.py script readme in the ./python_scripts directory of this repository for details |
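The rename_fasta task's output format can be illustrated with a minimal sketch. This is not the workflow's actual implementation, just the CO-CDPHE-{sample_name} header convention applied to a fasta string:

```python
def rename_consensus_header(fasta_text: str, sample_name: str) -> str:
    """Replace the first fasta header line with the CO-CDPHE-{sample_name} format."""
    renamed = []
    header_done = False
    for line in fasta_text.splitlines():
        if line.startswith(">") and not header_done:
            renamed.append(f">CO-CDPHE-{sample_name}")
            header_done = True
        else:
            renamed.append(line)
    return "\n".join(renamed)


print(rename_consensus_header(">sample01_consensus\nACGT", "sample01"))
```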
Illumina SE
This workflow is no longer maintained.
File: SC2_illumina_se_assembly.wdl
This workflow was developed for the assembly of Illumina 72 bp single-end read data using the Illumina COVIDSEQ library prep protocol. The workflow accepts "sample" as the root entity type. The workflow will:
- Use Trimmomatic and bbduk to quality filter, trim, and remove adapters from raw fastq files
  - Trimmomatic parameters include a sliding window set to trim reads when the 4 bp sliding window quality score falls below a mean Phred quality score of 30 (i.e. 4:30) and a minimum read length of 25 bp.
  - bbduk parameters include adapter trimming set to trim everything to the right of a kmer match, and removal of PhiX sequences.
- Run FastQC on both the raw and cleaned reads
- Align reads to the reference genome using bwa and then sort the bam by coordinates using Samtools
- Use iVar trim to trim primer regions and then sort the trimmed bam by coordinates using Samtools
- Use iVar variants to call variants from the trimmed and sorted bam
  - iVar variants parameters include a minimum quality score set to 20, a minimum variant base frequency set to 0.6, and a minimum read depth set to 10.
- Use iVar consensus to call the consensus genome sequence from the trimmed and sorted bam
  - iVar consensus parameters include a minimum quality score set to 20, a minimum variant base frequency set to 0.6, and a minimum read depth set to 10.
- Use Samtools flagstat, stats, and coverage to output statistics from the bam
- Rename the fasta header of consensus sequences in the format: CO-CDPHE-{sample_id}
- Calculate the percent coverage using the calc_percent_coverage.py script available in the python_scripts directory of this repo.
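The Trimmomatic parameters above (SLIDINGWINDOW:4:30, MINLEN:25) can be approximated with a short sketch. This is a simplified illustration, not Trimmomatic's exact algorithm: scan 4-base windows from the 5' end, cut the read at the first window whose mean Phred quality falls below 30, then drop reads shorter than 25 bp.

```python
from typing import Optional


def sliding_window_trim(seq: str, quals: list,
                        window: int = 4, min_mean_q: int = 30,
                        min_len: int = 25) -> Optional[str]:
    """Cut at the first window whose mean quality is below min_mean_q;
    return None if the surviving read is shorter than min_len."""
    cut = len(seq)
    for i in range(len(seq) - window + 1):
        if sum(quals[i:i + window]) / window < min_mean_q:
            cut = i
            break
    trimmed = seq[:cut]
    return trimmed if len(trimmed) >= min_len else None


print(sliding_window_trim("A" * 32, [35] * 28 + [2] * 4))
```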
Inputs
1. Terra Data Table
The terra data table can be generated using the preprocess python scripts available in the data preprocessing repository. The terra data table must include the columns listed below. Note that the optional columns are not necessary for the assembly workflow but must be present for the SC2_lineage_calling_and_results.wdl and transfer workflows described below under Lineage Calling Workflows and Transfer Workflows, respectively.
- entity:sample_id: column with the list of sample names/ids. Note that if there is more than one data table in the Terra workspace, you need to add a number after the word sample to keep the data tables separate (e.g. entity:sample2_id).
- fastq: The google bucket path to the fastq file.
- seq_run (optional): the name of the sequencing run (e.g. NEXSEQ_101)
- tech_platform (optional): e.g. Illumina NexSeq
- read_type (optional): single
- primer_set (optional): e.g. COVIDSeqV3
- plate_name (optional): name of the sequencing plate
- plate_sample_well (optional): location of the well on the sequencing plate
- out_dir (optional): user-defined google bucket to which the files will be transferred during the transfer workflows.
2. Terra Workspace Data
The following reference files can be found in the workspace_data directory and the python_scripts directory. These files should be saved as workspace data in your Terra workspace. To do so, upload the files to a google bucket and link the file path to the workspace data variable. Once saved as workspace data variables, they can be used as inputs for the workflow.
- covid_genome: the path to the google bucket directory containing the SARS-CoV-2 reference genome fasta (we use NCBI GenBank ID MN908947.3).
- covid_gff: the path to the google bucket directory containing the SARS-CoV-2 reference genome gff annotation file (we use NCBI GenBank ID MN908947.3).
- primer_bed: the path to the google bucket directory containing a bed file with the primers used for amplicon sequencing; currently we have bed files for Artic V3, Artic V4, Artic V4.1, and Midnight.
- preprocess_python_script: [do we want to change the name of this variable in the WDL to match the python script name?] the path to the google bucket containing the calc_percent_coverage.py script.
Below is a summary of the workflow input variables along with the syntax used for the attribute column when setting up the workflow to run on Terra.bio. For the attributes, the "this." syntax tells Terra to pull the variable from the terra data table (#1 above), and the "workspace." syntax tells Terra to pull the variable from the terra workspace data (#2 above).
workflow variable | attribute (input syntax into workflow) |
---|---|
covid_genome | workspace.covid_genome |
covid_gff | workspace.covid_gff |
fastq | this.fastq |
preprocess_python_script | workspace.preprocess_python_script |
primer_bed | workspace.V4-1Artic |
sample_id | this.sample{terra_datatable_name}_id |
Outputs
1. Output Files from Trimmomatic and bbduk
- trimmed_reads: file
- trim_stats: file
- filtered_reads: file
- adapter_stats: file
- PhiX_stats: file
2. Output Files from FastQC
- fastqc_raw1_html: file
- fastqc_raw1_zip: file
- fastqc_raw2_html: file
- fastqc_raw2_zip: file
- fastqc_clean1_html: file
- fastqc_clean1_zip: file
- fastqc_clean2_html: file
- fastqc_clean2_zip: file
3. Output files from bwa and samtools (align reads)
- out_bam: file
4. Output files from iVar trim and samtools
- trim_bam: file
- trimsort_bam: file
- trimsort_bamindex: file
5. Output files from iVar variants
- variants: vcf file formatted as a tsv
6. Output files from iVar consensus
- consensus: fasta file of the consensus genome; Ns are called at positions with less than 10 bp read depth.
7. Output files from Samtools flagstat, stats, and percent_coverage
- flagstat_out: file
- stats_out: file
- covhist_out: file
- cov_out: file
8. Output from rename consensus fasta headers
- renamed_consensus: fasta file; consensus genome sequence with the fasta header renamed to CO-CDPHE-{sample_id}
9. Output from calc_percent_coverage.py
- percent_cvg_csv: csv file; see the calc_percent_cvg.py script readme for details.
10. bwa assembler version string output
- assembler_version: string recording the bwa version; this information is used later when submitting to public repositories.
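The iVar consensus thresholds used in this workflow (minimum base frequency 0.6, minimum read depth 10) can be illustrated per reference position. This is a conceptual sketch of the thresholding, not iVar's implementation:

```python
from collections import Counter


def call_consensus_base(pileup_bases: str, min_depth: int = 10,
                        min_freq: float = 0.6) -> str:
    """Return the majority base at a position, or 'N' when the read depth
    or the majority-base frequency falls below the thresholds."""
    depth = len(pileup_bases)
    if depth < min_depth:
        return "N"
    base, count = Counter(pileup_bases.upper()).most_common(1)[0]
    return base if count / depth >= min_freq else "N"


print(call_consensus_base("AAAAAAAAAGGG"))  # 9/12 = 0.75 majority A
```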
Oxford Nanopore Technologies (ONT)
File: SC2_ont_assembly.wdl
This workflow was developed for the assembly of Oxford Nanopore Technology (ONT) read data following the ARTIC SARS-CoV-2 sequencing protocol and using the ONT native barcoding kit. This workflow assumes that basecalling and conversion of fast5 files into fastq has already occurred (e.g. using MinKNOW). The workflow accepts "sample" as the root entity type. The workflow will:
- Demultiplex basecalled fastq files using guppy_barcoder
- Perform quality filtering using guppyplex
  - guppyplex includes a minimum length parameter set to 400 and a maximum length set to 700 for Artic primers, and a minimum length set to 400 and a maximum length set to 1500 for Midnight primers.
- Run artic minion --medaka for variant calling and to generate a consensus sequence
  - medaka uses minimap2 by default to align reads to the SARS-CoV-2 reference genome
  - the default medaka parameters for base calling are 20x depth and at least 60% of reads containing the base call
- Scaffold the assembly with pyScaf
  - this step ensures a single continuous consensus sequence with only one sequence in the consensus fasta file
- Rename the consensus fasta header to CO-CDPHE-{sample_id}
- Generate bam quality statistics using samtools
- Calculate the percent coverage using the calc_percent_coverage.py script
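The guppyplex quality filtering above amounts to a per-read length gate. A minimal sketch, assuming the 400-700 bp (Artic) and 400-1500 bp (Midnight) bounds stated above (artic guppyplex itself applies these via its length options):

```python
# Length bounds per primer set, taken from the workflow description above.
LENGTH_BOUNDS = {
    "artic": (400, 700),
    "midnight": (400, 1500),
}


def passes_length_filter(read_seq: str, primer_set: str) -> bool:
    """Keep reads whose length falls within the primer set's bounds (inclusive)."""
    lo, hi = LENGTH_BOUNDS[primer_set.lower()]
    return lo <= len(read_seq) <= hi


print(passes_length_filter("A" * 500, "artic"))
```

Reads far outside the expected amplicon length are usually chimeras or fragments, which is why both ends of the range are enforced.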
Inputs
1. Terra Data Table
The terra data table can be generated using the preprocess python scripts available in the data preprocessing repository. The terra data table must include the columns listed below. Note that the optional columns are not necessary for the assembly workflow but must be present for the SC2_lineage_calling_and_results.wdl and transfer workflows described below under Lineage Calling Workflows and Transfer Workflows, respectively.
column header | description |
---|---|
entity:sample_id | column with the list of sample names (e.g. entity:covwwt-0203_id) |
index_1_id | the ONT barcode associated with the sample |
fastq_dir | the google bucket path with the set of fastq files |
out_dir | User-defined google bucket to which the files will be transferred during the transfer workflows. |
workbook_path | (optional; required for lineage calling workflow) |
project_name | (optional; required for lineage calling workflow) |
2. Terra Workspace Data
See setup.
3. Setting Up the Workflow Inputs
For setting up the workflow inputs, use the SC2_ont_assembly-input.json file in the workflow_inputs directory.
workflow variable | attribute (input syntax into workflow) |
---|---|
calc_percent_coverage_py | workspace.covid_calc_percent_coverage_py |
covid_genome | workspace.covid_genome_fa |
gcs_fastq_dir | this.fastq_dir |
index_1_id | this.index_1_id |
primer_bed | workspace.artic_v4-1_bed |
primer_set | this.primer_set |
s_gene_amplicons | workspace.artic_v4-1_s_gene_amplicons |
s_gene_primer_bed | workspace.artic_v4-1_s_gene_primer_bed |
sample_name | this.{entity_name}_id |
Outputs
WDL task name | software/program | variable name | description |
---|---|---|---|
Demultiplex | guppy_barcoder | barcode_summary | file |
Demultiplex | guppy_barcoder | guppy_dmux_fastq | file |
guppyplex quality filtering | guppyplex | filtered_fatsq | file |
Medaka | medaka and minimap2 | sorted_bam | file |
Medaka | medaka and minimap2 | trim_sort_bam | file |
Medaka | medaka and minimap2 | trimsort_bai | file |
Medaka | medaka and minimap2 | variants | file |
Medaka | medaka and minimap2 | consensus | file |
Medaka | medaka and minimap2 | assembler_version | string recording the artic medaka version; this information is used later when submitting to public repositories |
Bam_stats | samtools flagstat, stats, percent_coverage | flagstat_out | file |
Bam_stats | samtools flagstat, stats, percent_coverage | stats_out | file |
Bam_stats | samtools flagstat, stats, percent_coverage | covhist_out | file |
Bam_stats | samtools flagstat, stats, percent_coverage | cov_out | file |
Scaffold | pyScaf | scaffold_consensus | consensus sequence as a fasta file |
rename_fasta | N/A | renamed_consensus | fasta file; consensus genome sequence with the fasta header renamed to CO-CDPHE-{sample_name} |
calc_percent_cvg | calc_percent_coverage.py | percent_cvg_csv | csv file; see the calc_percent_cvg.py script readme in the ./python_scripts directory of this repository for details |
get_primer_site_variants | bcftools | primer_site_variants | file |