SARS-CoV-2 Python Scripts

This repo contains three custom python scripts called in our SC2 wdl workflows. These scripts should be stored in a google bucket and linked to your workspace data within your Terra.bio workspace, so they can be used as inputs within each respective workflow. Below lists the three scripts and their associated workflows and further below provides greater detail about each script.

calc_percent_coverage.py is called in the SC2_ont_assembly.wdl, SC2_illumina_pe_assembly.wdl, and SC2_illumina_se_assembly.wdl.
nextclade_json_parser.py is called in the SC2_lineage_calling_and_results.wdl.
concat_seq_metrics_and_lineage_results.py is called in the Sc2_lineage_calling_and_results.wdl.

calc-percent-coverage-py

Overview

This script is called in the SC2_illumina_pe_assembly.wdl, SC2_ilumina_se_assembly.wdl, and SC2_nanopore_assembly.wdl WDL workflows. These workflows act on individual samples (as opposed to sample sets), therefore this script also works on individual samples. The script reads in a consensus genome as a fasta file and calculates the percent coverage of the sample consensus genome. Specifically, the percent coverage is calculated as:

percent_coverage = frac{number_non_ambigous_bases}{29903}, * 100

where the number_non_ambigous_bases is the number of basepair calls not including Ns in the sample consensus sequence and 29903 is the number of basepairs in the reference genome (NC_045512).

Inputs

flag	description
`--sample_name`	sample name as recorded in the `entity:sample_ID` column in the terra data table
`--fasta_file`	renamed consensus sequence fasta file; generated during the ivar_consnesus (illumina data) or mekdata (ont data) and rename fasta tasks of the assembly workflows

Outputs

The script also records the number of aligned bases, the number of ambiguous bases (Ns), and the number of nonambiguous bases (A,G,C,T). The output is a csv file called {sample_name}_consensus_cvg_stats.csv and has the following column headers:

column header	description
`sample_name`	sample name as recorded in the `entity:sample_ID` column in the terra data table
`number_aligned_bases`	the total length of the consensus genome (including Ns)
`number_N_bases`	the number of ambiguous (N) bases in the consensus genome
`number_non_ambiguous_bases`	the number of non-ambiguous (A, C, G, T) bases in the consensus genome
`percent_coverage`	percent coverage of the reference genome represented in the consensus sequence as calculated above

There is an example output in the example data directory within this repo.

nextclade-json-parser-py

Overview

This script is called in the SC2_lineage_calling_and_results WDL workflow. This workflow acts on sample sets and so therefore this script also works on a sample set. This script is called in the parse_nextclade task within the workflow which can be seen in the SC2_lineage_calling_and_results.wdl workflow diagram in the README.md one directory out. Briefly, the workflow concatentates all consesnus sequences of the samples in the sample set into a single fasta file (concatenate task). The concatentated fasta file is run through nextclade which generates a nextclade.json file (nextclade task). Within the nextclade.json file is data for each sample consensus sequence inlcuding the nextclade clade designation, AA substitutions, deletions, insertions, etc. Generally, this script reads in the nextclade.json file, parses the json file to extract the data of interest, formats the data into a table and saves it as a csv file.

Inputs

flag	description
`--nextclade_json`	the nextclade.json file generated in the `nextclade` task of the workflow.
`--project_name`	the project name of the sequencing run

Outputs

There are two outputs from this script, each accomplished from a seperate function within the script. Example outputs can be found in the example data directory within this repo. These functions are:

extract_variant_list() function : This function generates a summmary of the AA substitions, insertions, and deletions for each sample within the nextclade.json file. The output is a csv file called {project_name}_nextclade_variant_summary.csv which is one of the files that is transfered to the google bucket as outputs of the workflow. The data is formatted such that each row corresponds to a either an AA substition, insertion, or deletion, such that each consensus seuqence can have more than one row of data. The csv file has the following column headers:

column header	description
`sample_name`	the sample name as listed in the fasta header (therefore there will be a "CO-CDPHE-" prefix added to the sample_name as listed in the entity column in the terra datatable)
`variant_name`	the full name of the variant formatted as {gene}_{refAA}{AApos}{altAA} (e.g. S_L452Q or S_L24del); For insertions the gene is not listed, the refAA is defined as "ins", the AA position is the nucleotide position in the genome and the altAA is listed as the string of nucleotides. So the variant is formated to look something like this: "_ins1027T" which would be interpreted as an insertion of a T nucleotide occured at genome position 1027
`gene`	the gene where the AA substition, deletion or insertion occurs (e.g. N, S, ORF1a, M etc.). No gene is listed for insertions
`codon_position`	the codon position (or protien position) within the gene where the AA substition, deletion or insertion occured. For insertions it is the nucleotide genome position
`refAA`	the reference AA at the position where the AA substition, deletion, or insertion occured. For insertions the refAA is listed as "ins".
`altAA`	he AA in the consensus sequence at the position where the AA substition, deletion, or insertion occured. For insertions the altAA is the string of nucleotide base pairs that were inserted
`start_nuc_pos`	the starting nucleotide position within the genome where the AA substition, deletion, or insertion occured
`end_nuc_pos`	the ending nucleotide position within the genome where the AA substition, deletion, or insertion occured (for a single AA substition the start_ncu_pos and end_nuc_pos will be a difference of 3)

get_nextclade() function: This function generates a summary of the nextclade designation, total nucleotide and AA substitions, total nucleotide and AA deletions, and total nucleotide insertions. The output file is called {seq_run}_nextclade_results.csv and is used as input for the concat_seq_metrics_and_lineage_results.py called in the results_table task in the workflow. The output file has the following column headers:

column header	description
`sample_name`	the sample name as listed in the fasta header (therefore there will be a "CO-CDPHE-" prefix added to the sample_name as listed in the entity column in the terra datatable)
`nextclade`	nextclade clade designation (e.g. 22C (Omicron))
`total_nucleotide_mutations`	number of SNPs in consensus sequence
`total_nucleotide_deletions`	number of deletions in the consensus sequence
`total_nucleotide_insertions`	number of insertions in the consensus sequence
`total_AA_substitutions`	number of AA substitions in the consensus sequence
`total_AA_deletions`	number of AA deletions in the consensus sequence

concat-seq-metrics-and-lineage-results-py

Overview

This script is called in the SC2_lineage_calling_and_results WDL workflow. This workflow acts on sample sets and so therefore this script also works on a sample set. This script is called in the results_table task within the workflow which can be seen in the SC2_lineage_calling_and_results.wdl workflow diagram in the README.md one directory out. Generally, this script pulls together a bunch of metadata and data regarding the consensus sequence and outputs the data in csv file.

Inputs

The script takes the following inputs:

flag	description
`--sample_name`	the sample_name variables for the set of samples as an array and written to a txt file using the wdl function `write_lines`
`--workbook_path`	the gcp file path to the workbook. The workbook can inlcude any column you'd like but must include at minimum the following columns: `hsn`, `sample_id`, `--project_name`, `plate_name`, `run_name`. These columns can be left blank if needed.
`--cov_out_filles`	the list of the `cov_out` variable (column in the terra data table) for the worfklow written to a text file. This variable is a file path to a file with the bam stats generated in the `SC2_ont_assembly.wdl`, `SC2_ilumina_se_assembly.wdl` or `SC2_illumina_pe_assembly.wdl` from the bam stats task.
`--percent_cvg_files`	the list of the `percent_cvg_csv` variable (column in the terra data table) for the worfklow written to a text file. This variable is a file path to a file with the bam stats generated in the `SC2_ont_assembly.wdl`, `SC2_ilumina_se_assembly.wdl` or `SC2_illumina_pe_assembly.wdl` workflows from the `calc_percent_coverage.py` script called during the `calc_percent_cvg` task.
`--assembler_version`	the assembler_version variable (column in the terra data table) for the worfklow . This is written to the terra data table during the `SC2_ont_assembly.wdl`, `SC2_ilumina_se_assembly.wdl` or `SC2_illumina_pe_assembly.wdl` workflows.
`--pangolin_lineage_csv`	this is the lineage report csv file generated from pangolin during the `pangolin` task
`--nextclade_clades_csv`	this is the `{seq_run}_nextclade_results.csv` file generated from the `nextclade_json_parser.py` script during the `parse_nextclade` task.
`--nextclade_variants_csv`	this is the `{seq_run}_nextclade_variant_summary.csv` file generated from the `nextclade_json_parser.py` script during the `parse_nextclade` task.
`--nextclade_version`	this is the nextclade version which is defined as output during the `nextclade` task.
`--project_name`	project_name from column in terra data table

Outputs

There are three outputs from this script. Example outputs can be found in the example data directory within this repo.

{project_name}_sequencing_results.csv: summary of sequencing metrics for all samples within the sample set. Below is a table of the column headers and their description. Currently all headers from the sequencing workbook which get carried over to the terra datatable are inlcuded in this output file. The list below includes some but not all of the columns in the file.

column header name	description
`sample_name`	sample name
`hsn`	hsn (horizon serial number)
`primer_set`	name of primer set used for tiled amplicon squenicng (Artic V3, Artic V4, Artic V4.1, Midnight or COVIDSeqV3)
`percent_coverage`	percent coverage; the total proportion of the genome that is covered not including regions where an N is called for a basecall
`nextclade`	the nextclade clade assignment
`panoglin_lineage`	the pangolin lineage assignment
`pangolin_expanded_lineage`	the expanded panglin lineage
`assembler_version`	assembler software version (either bwa or minimpa depending on assembly workflow used)
`spike_mutations`	list of spike muations in the spike gene squence that correspond to key spike mutations identified in the sample consensus sequence (this column was created prior to VOCs and inlcudes spike mutatuations we were watching and has not been updated since)
`total_nucleotide_mutations`	number of SNPs in the consensus sequence genome
`total_AA substitutions`	number of amino acid substitions in the consensus sequence genome
`total_AA_deletions`	number of deletions in the consensus sequence genome
`mean_depth`	average number of reads per nucleotide site in the the conesnus sequnce genome
`number_aligned_bases`	total number of bases aligned to the refernece genome (including Ns; so pretty much tells you how much was cut of the ends of the genome)
`number_non_ambious_bases`	total number of non-N bases in the conesnus genome sequence
`number_seqs_in_fasta`	total number of sequences in the concensus fasta - should always be 1
`total_nucleotide_deletions`	number of deletions in the consensus genome sequence
`total_nucleotide_insertions`	number of insertions in the consensus genome seqeunce
`num_reads`	total sequencing reads
`mean_base_quality`	mean quality score across all reads
`mean_map_quality`	mean mapping quality score for reads mapping to reference genome sequence
`number_N_bases`	number of bases called as N in the consensus genome sequence
`nextclade_version`	nextclade version
`panolgin_version`	pangolin version
`pangoLEARN_conflict`	from pangolin lineage report file
`pangolin_ambiguity_score`	from pangolin lineage report file
`pangolin_scorpio_call`	from pangolin lineage report file
`pangolin_scropio_support`	from pangolin lineage report file
`pangolin_scropio_conflict`	from pangolin lineage report file
`panoglin_scorpio_notes`	from pangolin lineage report file
`pangolin_designation_Version`	from pangolin lineage report file
`pangolin_scorpio_version`	from pangolin lineage report file
`pangolin_constellation_version`	from pangolin lineage report file
`pngolin_is_designated`	from pangolin lineage report file
`pangolin_qc_status`	from pangolin lineage report file
`pangolin_qc_notes`	from pangolin lineage report file
`panoglin_note`	from pangolin lineage report file
`seq_run`	sequencing run name
`tech_platform`	seuqencing platform (e.g. Illumina MiSeq, Illumina NextSeq, Oxford Nanopore GridION)
`read_type`	single or paired end
`fasta_header`	name of the fasta header for gisaid submission (e.g. CO-CDPHE-{accession_id})
`analysis_date`	date assembly workflow ran

{seq_run}_wgs_horizon_report.csv: for internal use, parsing sequencing results into LIMS.Below is a table of the column headers and their description.

column header name	description
`accession_id`	sample name
`percent_coverage`	percent coverage
`pangolin_lineage`	pangolin lineage
`pangolin_version`	pangolin version
`report_to_epi`	this column is meaningless now but have to keep
`Run_Date`	date assembly workflow ran
`pangoLEARN_version`	this column is also not used but we have to keep it

Details About Working with Sample Sets

Here I describe a way to create a single summary data table output for sample sets using a wwdl workflow in terra. (It's a bit clunkly but seems to work). Essentially this method enables one to create python lists from the columns in the terra data table when workflows are run as a sample set. The easiest way to explain this is with an example.

So for example, in the concat_seq_metrics_and_lineage_results.py, there is the input flag --plate_name_file_list. As input for this flag I use ${write_lines(plate_name)}. The plate_name corresponds to the plate_name column in the terra data table. Each element in column is a string. The write_lines() wdl function will write each element it's own line to a text file. Thus, the input into the python script is really a text file with a list of plate names. I then wrote some code that reads in the text file and generates a python list, with each line being a new element in the list. So it looks something like this:

plate_name_list = []
with open(plate_name_file_list) as f:
  for line in f:
    plate_name_list.append(line.strip())

Similiarly in some cases, instead of string variables being stored in a column wihtin the terra data table, file paths are stored (ie. the data type is a File or Array[File] in the case of sample sets). For example, in the concat_seq_metrics_and_lineage_results.py, there is the input flag --percent_cvg_file_list. As input for this flag I use ${write_lines(percent_cvg_csv_non_empty)}, where the percent_cvg_csv_non_empty variable corresponds to the column percent_cvg_csv. (note the non-empty part just means that the variable may be empty for some samples in the terra data table. To set the variable as input at the begining of the wdl I use: Array[File?] percent_cvg_csv). Similiarly as above, the script will create a list of file paths from the text file. The script can then loop through the list of file paths, open each file, extract the data from that file, and store it in a list or other dataframe to be written out.

Changelog

updated 2023-03-09 to sync with universal naming switch over

SARS-CoV-2 Python Scripts

Table of Contents

calc-percent-coverage-py

Overview

Inputs

Outputs

nextclade-json-parser-py

Overview

Inputs

Outputs

concat-seq-metrics-and-lineage-results-py

Overview

Inputs

Outputs

Details About Working with Sample Sets

Changelog