PiGx SARS-CoV-2
Introduction
PiGx SARS-CoV-2 is a pipeline for detecting viral lineages in sequencing data obtained from enriched wastewater samples. It was developed with SARS-CoV-2 as its target virus, but other targets are theoretically possible. The viral lineages are provided by the user together with their characteristic signature mutations. The pipeline is very flexible, allowing the user to choose from multiple input and output files and giving them fine control over the parameters used by the individual tools. PiGx SARS-CoV-2 has been developed with a focus on reproducible outputs, and it can be used for continuous sampling. The output of the PiGx SARS-CoV-2 pipeline is summarized in a report which provides an intuitive visual overview of the development of lineage abundances and of individual, significantly increasing mutations over time and location. In addition, the pipeline generates more detailed reports per sample, covering the quality control of the samples, the detected variants, and a taxonomic classification of all unaligned reads. This version of the pipeline was designed to work with paired-end amplicon sequencing data, e.g. generated following the ARTIC protocol with ARTIC nCoV-2019 primers, but single-end sequencing reads are supported as well.
Workflow
In the first step, the pipeline takes the raw reads and the provided information about primers and adapters and performs extensive quality control. Primer trimming is done with iVar, and fastp is used for adapter trimming and quality filtering. Next, the trimmed reads are aligned to the reference genome of SARS-CoV-2 using BWA; the results are SAM/BAM files of aligned and unaligned reads. Following the alignment, a quality check on raw and processed reads is performed using MultiQC. Furthermore, samples are checked for genome coverage and for how many of the provided signature mutation sites are covered; based on this, every sample gets a quality score. Samples with genome coverage below a user-defined percentage threshold are reported as discarded samples and are not included in time series analyses and summaries.
Variant calling and inference of single nucleotide variants (SNVs) on the aligned reads is done with LoFreq. Mutations are annotated with VEP. Lineage frequencies are estimated by deconvolution (see Methods in the related publication for details). To investigate the abundance of RNA matching other species in the wastewater samples, the unaligned reads are taxonomically classified with Kraken2. Kraken2 requires a locally downloaded database of the genomes against which the reads are classified; for documentation on how to set this up, see Preparing the databases. For a better, interactive visualization of all species present in the wastewater, Krona is used; here too, a small database setup step is needed before running the pipeline (see Preparing the databases). Interactive reports are generated using R Markdown, with plotly for R for visualizations.
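As a rough orientation, the core per-sample steps correspond to the following tool invocations. This is a simplified, illustrative sketch with placeholder file names; the actual commands, options, reference indexing, and intermediate sorting/indexing steps are handled by the pipeline's snakemake rules.

```sh
# Simplified sketch of the core per-sample steps (placeholder file names;
# reference indexing and sorting/indexing of intermediates omitted).

# Adapter trimming and quality filtering
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
      -o sample_R1.trimmed.fastq.gz -O sample_R2.trimmed.fastq.gz

# Alignment to the SARS-CoV-2 reference genome
bwa mem NC_045512.2.fasta sample_R1.trimmed.fastq.gz sample_R2.trimmed.fastq.gz \
  | samtools sort -o sample_aligned.bam
samtools index sample_aligned.bam

# Primer trimming with iVar, guided by the primer BED file
ivar trim -i sample_aligned.bam -b nCoV-2019_primers.bed -p sample_primer_trimmed

# SNV calling with LoFreq
lofreq call -f NC_045512.2.fasta -o sample_snv.vcf sample_primer_trimmed.bam

# Taxonomic classification of the unaligned reads with Kraken2
kraken2 --db /path/to/kraken_db --report sample_kraken_report.txt \
        --output sample_kraken_output.txt sample_unaligned.fastq
```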
Pooling of samples for time series analysis and plots
For summaries across sampling time and location, the lineage frequencies are pooled by calculating a weighted average, using the total number of reads of each sample as weights (missing samples are removed).
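In other words, for each lineage, the pooled frequency over the samples $s$ grouped by time and location is the read-count-weighted mean

$$\text{pooled frequency} = \frac{\sum_s f_s \, n_s}{\sum_s n_s},$$

where $f_s$ is the lineage frequency estimated for sample $s$ and $n_s$ is its total number of reads.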
Output
- Overview reports including:
- Summary and visualization of the development of SARS-CoV-2 variants and mutations over time and locations from all samples provided.
- Quality Control reports per sample.
- Per sample variant report: variant analysis of SARS-CoV-2 from each wastewater sample and identification of variants of concern.
- Taxonomic classification of unaligned reads: overview of taxa inferred from all sequencing reads not aligning to the reference genome.
- Deconvoluted variant abundances
- Mutation abundances
- Single nucleotide variation call files
- VEP reports
- Kraken2 taxonomic classifications
- Numerous intermediate read, alignment, and statistics files
- Log files for all major analysis steps performed by the pipeline
Installation
Pre-built binaries for PiGx are available through GNU Guix, the functional package manager for reproducible, user-controlled software management. You can install the PiGx SARS-CoV-2 pipeline with
guix install pigx-sars-cov-2
If you want to install PiGx SARS-CoV-2 from source, please clone this repository and change directory accordingly:
git clone https://github.com/BIMSBbioinfo/pigx_sars-cov-2.git
cd pigx_sars-cov-2
To fetch code that is common to all PiGx pipelines run this:
git submodule update --init
Before setting everything up, though, make sure all dependencies are met by either installing them manually, or by entering the provided reproducible Guix environment. If you are using Guix we definitely recommend the latter. This command spawns a sub-shell in which all dependencies are available at exactly the same versions that we used to develop the pipeline:
USE_GUIX_INFERIOR=t guix environment --pure -m manifest.scm --preserve=GUIX_LOCPATH
To use your current Guix channels instead of the fixed set of channels, just omit the USE_GUIX_INFERIOR shell variable:
guix environment --pure -m manifest.scm --preserve=GUIX_LOCPATH
Note that --pure unsets all environment variables that are not explicitly preserved. To access other executables that are not part of the environment, please address them by their absolute file name.
Inside the environment you can then perform the usual build steps:
./bootstrap.sh # to generate the "configure" script
./configure
make
make check
At this point you are able to run PiGx SARS-CoV-2. To see all available options, run:
pigx-sars-cov-2 --help
Preparing the databases
Before the pipeline can work, three databases must be downloaded to a location specified in the settings file. Depending on the size of the databases this can take some time.
Without any user intervention, this will happen automatically via snakemake rules. This behaviour is controlled via parameters in the settings file. See the Settings file section for details.
Alternatively, the databases may be downloaded manually via the download_databases.sh script, accessible like so:
prefix="$(dirname "$(which pigx-sars-cov-2)")/.."
$prefix/libexec/pigx_sars-cov-2/scripts/download_databases.sh
However, unlike the automatic download, the download_databases.sh script does not pick up user-defined parameters from the settings file unless it is edited manually.
Note: The directory that the databases are downloaded to needs to match the database directories given in the settings file; otherwise the pipeline will download the databases again into the configured directory, unnecessarily using up space. By default, the two locations match.
Read on for details if you want to download the databases manually.
Kraken2 database
There are several libraries of genomes that can be used to classify the (unaligned) reads. It is up to you which one to use, but make sure it fulfills the requirements stated in the Kraken2 manual. For a general overview we recommend the Plus-PFP library provided here, which is also the default library used by the pipeline. If the classification is not of concern, or only viruses are of interest, we recommend using a smaller library; this will speed up the pipeline.
After downloading and unpacking the database files, use kraken2-build to download the taxonomy data and build the database.
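For example, assuming the library archive has already been unpacked into the Kraken2 database directory configured in the settings file, the remaining steps might look like this (the path is a placeholder):

```sh
# Placeholder path; must match kraken-db-dir in the settings file.
KRAKEN_DB=/path/to/databases/kraken2

# Download the NCBI taxonomy into the database directory
kraken2-build --download-taxonomy --db "$KRAKEN_DB"

# Build the Kraken2 database from the unpacked library files
kraken2-build --build --db "$KRAKEN_DB"
```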
Krona database
The way we use Krona, we only need the taxonomy database, as downloaded via its updateTaxonomy.sh script.
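Assuming Krona Tools is installed, this can be as simple as the following (the target path is a placeholder and should match krona-db-dir in the settings file):

```sh
# Download/refresh the Krona taxonomy database into the given directory
updateTaxonomy.sh /path/to/databases/krona
```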
VEP database
For our use of VEP in the pipeline, we need a pre-indexed cache of the VEP database of transcript models. This cache is the main factor determining which virus can be analysed with the pipeline; currently, VEP only provides data for SARS-CoV-2.
By default the pipeline uses the indexed cache archive at http://ftp.ensemblgenomes.org/pub/viruses/variation/indexed_vep_cache/sars_cov_2_vep_101_ASM985889v3.tar.gz, which only needs to be unpacked into the target directory. Currently this is the only available cache file.
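A manual download might look like this (the target directory is a placeholder and should match vep-db-dir in the settings file):

```sh
# Placeholder path; must match vep-db-dir in the settings file.
VEP_DB=/path/to/databases/vep
mkdir -p "$VEP_DB"

# Fetch and unpack the pre-indexed SARS-CoV-2 VEP cache
wget http://ftp.ensemblgenomes.org/pub/viruses/variation/indexed_vep_cache/sars_cov_2_vep_101_ASM985889v3.tar.gz
tar -xzf sars_cov_2_vep_101_ASM985889v3.tar.gz -C "$VEP_DB"
```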
Quick start
To check whether the pipeline and the databases have been properly set up, run the pipeline on a minimal test dataset. If you installed the pipeline from source, start at step 3.
1. Download the test data:
git clone https://github.com/BIMSBbioinfo/pigx_sars-cov-2 sarscov2-test
2. Enter the directory:
cd sarscov2-test
3. Run the pipeline using a preconfigured settings file. If you installed the pipeline from source, the database location defaults to the directory specified during the configure step. If you installed the databases manually, you will also need to adjust the database paths in the test settings file accordingly:
pigx-sars-cov-2 -s tests/setup_test_settings.yaml tests/sample_sheet.csv
Inside tests/ a new directory output_setup_test is created, which contains a subdirectory with the output data for each step of the pipeline. The file tests/output_setup_test/reports/index.html gives an overview of all merged reports for the test data.
Preparing the input
In order to run the pipeline, you need to supply:
- Sample sheet (CSV format) containing information about sampling date and location
- Settings file (YAML format) specifying the experimental setup and optional custom parameter adjustments
- Mutation sheet (CSV format) containing the lineages of interest and their signature mutations in nucleotide notation
- Mutation BED file containing the genomic coordinates of the mutation sites (see below for details)
- Reference genome of the target species in FASTA format (so far the pipeline is only optimized for SARS-CoV-2; other targets might work, but have not been tested)
- Primer BED file containing the PCR primer locations (e.g. the primers suggested by the ARTIC protocols)
In order to generate template settings and sample sheet files, type
pigx-sars-cov-2 --init
in the shell, and a boilerplate sample_sheet.csv and settings.yaml will be written to your current directory. An example of both files is provided in the tests/ directory.
Sample sheet
The sample sheet is a tabular file (CSV format) describing the experiment. The table has the following columns:
SampleName | Read | Read2 | date | location_name | coordinates_lat | coordinates_long |
---|---|---|---|---|---|---|
Test0 | Test0_R1.fastq | Test0_R2.fastq | 2021-01-01T08:00:00 | Berlin | 52.364 | 13.509 |
Test2 | Test2_R1.fastq | Test2_R2.fastq | 2021-01-03T08:00:00 | Munich | 48.208 | 11.628 |
- SampleName is the name for the sample
- Read & Read2 are the fastq file names of paired-end reads
  - the location of these files is specified in the settings file
  - in the case of single-end data, leave the Read2 column empty
- date is a date/time in ISO format (yyyy-mm-ddThh:mm:ss)
- location_name is the name of the location and should be unique per coordinates
- coordinates_lat & coordinates_long correspond to the latitude and longitude of the location name
Mutation sheet
The mutation sheet should contain one column of signature mutations per lineage that is to be tracked and analysed by deconvolution. Mutations should be given in the format GENE:RxxxV, where GENE is the name of the gene in which the mutation is found, R is a string of reference nucleotides in upper case, xxx is the position of the first reference nucleotide (a number), and V is a string of variant nucleotides, also in upper case. There is no upper or lower limit on the number of signature mutations per lineage; however, note that deconvolution results are more robust and precise with a higher number of mutations (tested with 10-30 mutations per lineage).
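A purely illustrative mutation sheet with two lineage columns might look like the following (lineage names and mutations are placeholders, not a curated signature set):

LineageA | LineageB |
---|---|
S:A23403G | S:C23604A |
ORF1ab:C3037T | N:G28881A |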
Mutation BED file
The BED file for checking whether the mutation sites are covered should have 4 columns:
- chromosome name
- start: 5 bp before the mutation location
- end: 5 bp after the mutation location
- name including the original location of the mutation, in the format name_MutationLocation_name, e.g. "nCoV-2019_210_SigmutLocation"

A row in the BED file should look like:
NC_045512.2 205 215 nCoV-2019_210_SigmutLocation
Please see the example file within the test directory for a detailed example.
Primer BED file
The primer file contains the locations of primer sequences on the reads, i.e. it defines where in the genome primer sequences may occur. This is determined by the primer scheme used when generating the sequencing reads. It is required by iVar for primer trimming. An example file can be found in the test directory (nCoV-2019_NCref.bed).
Settings file
The settings file contains parameters (in YAML format) to configure the execution of the PiGx SARS-CoV-2 pipeline. There are generally two settings files at play:
- The pipeline-internal settings file (source: etc/settings.yaml, installed: $prefix/share/pigx_sars-cov-2/settings.yaml), containing default settings. This file is not supposed to be modified by the user.
- An analysis-specific, user-provided settings file containing changes to the default settings. These are generally paths to the input files, databases, etc.
When the pipeline is executed, both settings files are combined into a run-specific config file (config.json), used by the internal snakemake workflow manager. It is always generated in the directory from which the pigx-sars-cov-2 program was called and will be overwritten on subsequent calls.
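As a quick orientation, a hypothetical minimal user settings file could look like the following; all paths are placeholders and must be adapted to your setup.

```yaml
# Hypothetical minimal user settings file; all paths are placeholders.
locations:
  output-dir: /data/pigx_output
  input-dir: /data/reads
  reference-fasta: /data/references/NC_045512.2.fasta
  primers-bed: /data/references/nCoV-2019_primers.bed
  mutations-bed: /data/references/sigmut_sites.bed
  mutation-sheet: /data/references/mutation_sheet.csv
  kraken-db-dir: /data/databases/kraken2
  krona-db-dir: /data/databases/krona
  vep-db-dir: /data/databases/vep
```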
The following sections explain each setting in detail.
locations
Paths to various input files and directories needed by the pipeline.
- output-dir: output directory for the pipeline
- input-dir: directory containing the input files; the files therein should match the file suffix given under control/start
- reference-fasta: reference genome of the target species (FASTA format)
- primers-bed: primer BED file
- mutations-bed: mutation BED file
- mutation-sheet: mutation sheet (CSV)
- kraken-db-dir: Kraken2 database directory
- krona-db-dir: Krona database directory
- vep-db-dir: VEP database (cache) directory
databases
Settings controlling the download and subsequent processing of databases needed for the pipeline. If the databases are already present, none of these settings will have any effect.
When prebuilt archives are to be used, the pipeline can download them via both ftp and http/https. If no protocol is included, http/https is assumed.
Note: Generally the official databases will be used, except when running make check/make distcheck without the databases pre-installed at the default location. In that case the settings file at tests/settings.yaml will be used, which configures the pipeline to use database archives prebuilt by us. This ensures that the GitHub Actions runs can complete smoothly; building the official archives requires at least 1 GB of disk space, which is not available on a GitHub Actions runner.
kraken2
The kraken2 database download is the most complex of the three database downloads. The download uses one of the database archive files provided at the Index Zone, except in the case outlined above.
When no downsampling is requested, the database archive is extracted and used as-is. Otherwise, everything except the hash file (hash.k2d) is extracted, the taxonomy is downloaded, and the database is rebuilt on disk. The taxonomy files are removed afterwards, as they are not used again and only take up space once the database is built.
archive-url
The url the kraken2 archive will be downloaded from. The default archive is the very large database including protozoa and fungi in addition to the standard database.
When prebuilt archives are to be used, the pipeline can download them via both ftp and http/https. If no protocol is included, http/https is assumed.
Default: https://genome-idx.s3.amazonaws.com/kraken/k2_pluspfp_20210127.tar.gz
downsample-db
Whether or not downsampling of the hash.k2d file should occur.
Default: false
max-db-size-bytes
If downsampling should occur, this is the maximum allowable size in bytes; the resulting file may be smaller than the given value. The value is passed directly to kraken2-build via the --max-db-size option. It only has an effect if downsample-db is true.
Default: 250000000
krona
The Krona database download is fairly simple. Either the Krona internal download script updateTaxonomy.sh (see its documentation) is used, or a prebuilt archive is downloaded.
use-prebuilt
Whether or not a prebuilt archive should be downloaded instead of running the update taxonomy script.
Default: false
archive-url
If a prebuilt archive should be downloaded, this tells the pipeline where to find it.
When prebuilt archives are to be used, the pipeline can download them via both ftp and http/https. If no protocol is included, http/https is assumed.
Default: ""
vep
The vep download is the simplest. The database is always in a compressed archive.
archive-url
Location from where the archive will be downloaded.
When prebuilt archives are to be used, the pipeline can download them via both ftp and http/https. If no protocol is included, http/https is assumed.
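Taken together, a databases block using the default kraken2 and vep archives documented here might look like this; the exact nesting is inferred from the setting names, and the downsampling values are illustrative only.

```yaml
databases:
  kraken2:
    archive-url: https://genome-idx.s3.amazonaws.com/kraken/k2_pluspfp_20210127.tar.gz
    downsample-db: true          # illustrative: downsample the large default database
    max-db-size-bytes: 250000000
  krona:
    use-prebuilt: false          # run updateTaxonomy.sh instead of downloading an archive
    archive-url: ""
  vep:
    archive-url: http://ftp.ensemblgenomes.org/pub/viruses/variation/indexed_vep_cache/sars_cov_2_vep_101_ASM985889v3.tar.gz
```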
parameters
vep
Parameters for rule vep. See the documentation of the vep arguments --species, --buffer_size, and --distance in the vep documentation.
species
Needs to match the downloaded VEP database cache. The vep default is "homo_sapiens", which is unlikely to be of use for this pipeline. Directly passed to the vep parameter --species.
Default: sars_cov_2
buffer-size
Number of variants kept in memory at one time. This trades run time against memory usage: the higher the value, the faster the run. Directly passed to the vep parameter --buffer_size.
Default: 5000
transcript-distance
Distance up- and downstream of a transcript within which a variant is classified as an upstream or downstream variant. Directly passed to the vep parameter --distance.
Default: 5000
db-version
This specifies the database version, which is needed when the version of the database differs from the version of the vep executable; by default, vep looks for a database version matching its own. This is required even when using an offline database cache, as this pipeline does, and therefore needs to be adjusted when the vep database is changed. Passed to vep as the cache/database version.
Default: 101
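For reference, a vep parameter block with the documented defaults could look like this (shown without its surrounding nesting in the settings file):

```yaml
vep:
  species: sars_cov_2
  buffer-size: 5000
  transcript-distance: 5000
  db-version: 101
```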
ivar_trimming
Parameters for rule ivar_primer_trim. See the documentation of the ivar arguments -q, -m, and -s in the iVar documentation.
ivar trims primers from the aligned reads by first removing the regions listed in the primer BED file and then performing quality trimming: a window is slid along each read from the 5' to the 3' end, and the read is clipped if the average window quality drops below a given cutoff.
quality-cutoff
If the average base call quality in the sliding window drops below this value, the read will be trimmed to the last base of sufficient average window quality.
Directly passed to the ivar parameter -q.
Default: 15
length-cutoff
Read length threshold: if a read is shorter than this after trimming, it is discarded. Directly passed to the ivar parameter -m.
Default: 30
window-width
Number of bases in the sliding window, i.e. the window width.
Directly passed to the ivar parameter -s.
Default: 4
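Similarly, an ivar_trimming block with the documented defaults could look like this (shown without its surrounding nesting):

```yaml
ivar_trimming:
  quality-cutoff: 15
  length-cutoff: 30
  window-width: 4
```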
reporting
These parameters are used to set quality control filters for the reports.
mutation-coverage-threshold
Coverage threshold in percent; results from samples without sufficient coverage are not included in the visualizations or the linear regression calculations.
Default: 90
deconvolution
method
Controls the deconvolution method used. Possible options are:
- rlm: robust linear regression as implemented in the MASS package, executed via the deconvR package
- nnls: non-negative least squares (deconvR)
- qp: quadratic programming (deconvR)
- svr: support vector regression (deconvR)

Prepending "weighted_" to the chosen method weights the bulk mutation abundances per variant by the inverse of the proportion of detected mutations to known mutations. This biases the abundance of variants with a high proportion of detected mutations towards higher values and, conversely, biases the regression for variants with a low proportion towards lower values.
Default: weighted_rlm
mutation-depth-threshold
Minimum sequencing depth per mutation for it to be used in the analysis.
Default: 100
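As an example, a deconvolution block with the documented defaults (shown without its surrounding nesting):

```yaml
deconvolution:
  method: weighted_rlm
  mutation-depth-threshold: 100
```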
control
The following settings control from which point the pipeline starts, which path it follows (i.e. which rules are executed on the way) and where it stops.
start
Start points for the analysis
- fastq(.gz): Raw reads in (gzipped) FASTQ format
- bam: Unfiltered alignments
- vcf: Variant calling files. Note: Ideally these should have been generated by LoFreq; the minimum requirement is that the INFO fields "AF" and "DP" are present.
When using a non-default start point, not all rules will be able to run and it won't be possible to generate all reports. The main reasons for this are the missing quality control statistics needed mainly for the final report, and the missing unaligned read files when starting after the alignment step.
Default: fastq.gz
targets
Desired results of the analysis. Multiple targets can be requested at the same time.
- help: Print all rules and their descriptions.
- final_reports: Produce a comprehensive report. This is the default target.
- deconvolution: Run deconvolution for all provided samples and create a summary table containing abundances from all samples.
- lofreq: Call variants and produce .vcf file and overview .csv file.
- multiqc: Create MultiQC reports covering raw and trimmed reads.
Default: - final_reports
run-ivar-primer-trimming
Whether the primer trimming step should be performed.
Default: yes
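Putting these together, a control block reproducing the documented defaults could look like this (shown without its surrounding nesting):

```yaml
control:
  start: fastq.gz
  targets:
    - final_reports
  run-ivar-primer-trimming: yes
```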
execution
Settings directly related to the program execution itself.
jobs
The number of jobs allowed to run at one time. If executed locally, this specifies the maximum number of cores used at a time. It may not extend to the number of threads used by individual tools. See the sections on --cores and --jobs in the snakemake docs.
Default: 6
submit-to-cluster
Whether cluster-specific settings are respected and jobs are submitted to the cluster via qsub.
Default: no
cluster
Settings specific to executing the pipeline on a computing cluster. None of these are relevant when submit-to-cluster is "no".
missing-file-timeout
How long before a rule output file is declared missing and the pipeline stopped.
When executing the pipeline on a cluster, file system latency can be higher than when the pipeline is executed on a single server. Therefore the time snakemake waits before declaring a rule output file as missing and stopping needs to be adjusted.
Default: 120
stack
Stack memory used for each rule. We recommend leaving it as it is, unless you really know what you are doing.
Default: 128M
queue
The name of a specific queue the whole pipeline should be submitted to.
Default: all
contact-email
How the cluster administration may contact you.
Default: none
args
Additional arguments passed to qsub, as a single string.
Default: ’’
rules
Per rule submission settings. Give the rule name as a heading to configure that specific rule.
__default__
Default settings used in absence of rule specific settings.
- threads: Number of threads. Default is 1.
- memory: RAM available for the rule, with unit (e.g. "90K", "5M"). If no unit is given, M is assumed. Default is 4G.
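For example, a rules block overriding the defaults for one rule might look like this; the rule name bwa_align and its values are purely hypothetical placeholders.

```yaml
rules:
  __default__:
    threads: 1
    memory: 4G
  bwa_align:        # hypothetical rule name, used only for illustration
    threads: 4
    memory: 8G
```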
tools
Overrides for the locations of specific tools used. Each tool has its own subheading (e.g. "bwa") with the following settings:
executable
Path to the executable file for the tool, defaults to the system installation.
arguments
Additional arguments to the tool, as one string. Only use this if you know what you are doing; these may conflict with the arguments supplied in each rule.
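For example, to point the pipeline at a specific bwa installation (the path is a placeholder):

```yaml
tools:
  bwa:
    executable: /opt/bwa/bin/bwa   # placeholder path
    arguments: ""
```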
Running the pipeline
PiGx SARS-CoV-2 wastewater is executed using the command pigx-sars-cov-2 -s settings.yaml sample_sheet.csv. See pigx-sars-cov-2 --help for information about additional command line arguments.
The execution section of the settings file provides some control over the execution of the pipeline.
Local / cluster execution
The workflow may be executed locally (on a single computer), or, if a Sun Grid Engine-compatible HPC environment is available, on a cluster. To enable cluster execution, specify submit-to-cluster: yes in the settings file.
Parallel execution
If the workflow is run on a cluster, or on a single computer with sufficient resources, some of the tasks may be computed in parallel. To specify the allowed level of parallelism, set the jobs setting under execution in the settings file. For instance,
execution:
submit-to-cluster: yes
jobs: 40
in the settings file, will submit up to 40 simultaneous compute jobs on the cluster.
Output description
PiGx SARS-CoV-2 wastewater creates an output directory, as specified in the settings file, that contains all of the following outputs.
Time series reports
This pipeline performs mutation analysis of SARS-CoV-2 and reports and quantifies the occurrence of variants of concern (VOC) and signature mutations by which they are characterised.
The visualizations in the time series report provide an overview of the evolution of VOCs and signature mutations found in the analyzed samples across given time points and locations. The abundance values for the variants are derived by deconvolution. The frequencies of the mutations are the output of LoFreq.
Quality control
A quality control report is generated for each sample. It includes reports on amplicon coverage and read coverage, as well as general quality control and preprocessing metrics.
General quality control metrics are computed using FastQC and MultiQC. The MultiQC report is particularly useful, collating quality control metrics from many steps of the pipeline in a single HTML report, which may be found under the multiqc directory in the PiGx output folder.
Taxonomic classification
This report gives an overview of the species other than SARS-CoV-2 found in the wastewater samples. Since the SARS-CoV-2 enriched samples are aligned to the virus genome, this report is based on the unaligned reads and provides insight into possible biases and contamination of the samples. If species very similar to SARS-CoV-2 are abundant, only the taxonomic family is reported, which can indicate that identification/alignment of SARS-CoV-2 may be biased or impossible. If a high percentage of the unaligned reads still matches SARS-CoV-2, refining the trimming parameters should be considered.
Variant report
This report shows the variant analysis of SARS-CoV-2 from wastewater samples. Mutations are identified by single nucleotide variant (SNV) calling performed by LoFreq and translated to amino acid mutations using Ensembl VEP - COVID-19. The list of found mutations (including synonymous and non-synonymous mutations) is matched against lists of signature mutations characterising variants of concern (VOC) of SARS-CoV-2, provided by outbreak.info and CoVariants.org.
All outputs with their locations.
Given locations are relative to the output directory. SAMPLE indicates a variable part of a path that will be replaced with a sample name; VIRUS will be replaced with the virus being investigated.
- Overview reports:
  - Summary and visualization of the development of SARS-CoV-2 variants and mutations over time and locations from all samples provided (report/index.html).
  - Quality Control reports per sample:
    - Overall QC report with the number of covered amplicons, read coverage, etc. (report/SAMPLE.qc_report_per_sample.html).
    - MultiQC report per sample for raw and trimmed reads, along with several supplementary files (report/multiqc/SAMPLE/*).
    - FASTQC reports per sample and read for raw reads, adapter-trimmed reads, and aligned reads, along with an archive of supplementary files (report/fastqc/SAMPLE/*).
    - FASTP reports per sample on adapter trimming statistics (fastp/SAMPLE/*).
  - Per sample variant report: variant analysis of SARS-CoV-2 from each wastewater sample and identification of variants of concern (report/SAMPLE.qc_report_per_sample.html).
  - Taxonomic classification of unaligned reads: overview of taxa inferred from all sequencing reads not aligning to the reference genome (report/SAMPLE.taxonomic_classification.html).
- Alignments:
  - Per sample aligned reads: reads aligned against the reference genome, at various stages of trimming and sorting, along with their index files (mapped_reads/SAMPLE_aligned.(bam|sam|bai)).
  - Per sample unaligned reads: reads not aligning to the reference genome (mapped_reads/SAMPLE_unalingned.(bam|fastq)).
- SNV files:
  - Per sample files listing all detected single nucleotide variants (SNVs) from the aligned reads, also parsed to CSV (variants/SAMPLE_snv.(vcf|csv)).
- VEP reports:
  - Per sample report files constituting the VEP output, including the variants uploaded to the VEP database, resulting amino acid changes, and consequences for the corresponding protein:
    - Raw VEP output (variants/SAMPLE_vep_VIRUS.txt).
    - Report with detailed run statistics (variants/SAMPLE_vep_VIRUS.txt_summary.html).
    - Warnings generated by VEP during the run (variants/SAMPLE_vep_VIRUS.txt_warnings.txt).
    - File with most of the raw output parsed into a comma separated table for downstream processing (variants/SAMPLE_vep_VIRUS_parsed.csv).
- Deconvoluted variant abundances:
  - Per sample abundance of each of the VOCs as
    - a regular table (variants/SAMPLE_variant_abundance.csv),
    - a single row together with metadata, used later to construct the summary (variants/SAMPLE_variants_with_meta.csv).
  - Overall summary table of per sample variant abundances with sample metadata (variants/data_variant_plot.csv).
- Mutation abundances:
  - Per sample abundance of each signature mutation found and meeting the read depth threshold criterion, as a single row together with metadata, used later to construct the summary (mutations/SAMPLE_mutations.csv).
  - Overall summary table (CSV) of per sample mutation abundances with sample metadata.
  - Mutation data of non-signature mutations meeting the read depth threshold (mutations/SAMPLE_non_sigmuts.csv).
- Sample quality:
  - Per sample tables of read depth at each signature mutation locus, derived from the untrimmed alignment (coverage/SAMPLE_genome_cov.tsv).
  - Per sample tables giving coverage statistics of the untrimmed alignment (coverage/SAMPLE_mut_cov.tsv).
  - Per sample tables aggregating the statistics of the genome and mutation coverage files into a sample quality table (single row for downstream use, coverage/SAMPLE_quality.csv).
  - Overall table concatenating the per sample quality rows into one table (coverage/sample_quality_table.csv).
- Adapter-trimmed reads (trimmed_reads/SAMPLE_trimmed.fastq.gz).
- Kraken2 raw output: overview of all species found in the unaligned reads together with the NCBI taxonomy ID (kraken/SAMPLE_classified_unaligned_reads.txt).
- Per sample Krona diagram of taxa proportions in unaligned reads (report/SAMPLE.Krona_report.html).
- Mutation count summary table containing count statistics for mutations overall and per sample (mutation_counts.csv).
- Raw table of per mutation linear model coefficients and their p-values, generated by the mutation_regression.R script (unfiltered_mutations_sig.csv).
- Sample summary table containing various statistics about each sample (overview_QC.csv).
- Log files for all major analysis steps performed by the pipeline (logs/*.log).
Troubleshooting
If you have any questions, please e-mail pigx@googlegroups.com or ask them via the web form at https://groups.google.com/forum/#!forum/pigx/.
If you run into any bugs, please open an issue here: https://github.com/BIMSBbioinfo/pigx_sars-cov-2/issues.