Input parameters
AsaruSim requires input parameters that describe the macro characteristics of the desired synthetic reads.
Count matrix
--matrix
The --matrix
parameter is a feature-by-cell (gene/cell or isoform/cell) count table (.CSV) file
where rows represent the features of interest (genes or transcripts) and columns represent cells (or spatial barcodes). The input matrix may be derived from an existing preprocessed single-cell short- or long-read run. This parameter is required unless --bc_counts is provided.
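As a minimal sketch of the expected table shape, the snippet below writes a tiny feature-by-cell CSV and reads it back. The exact layout (a first column of feature IDs followed by one column per cell barcode), the header name `feature`, the second barcode, and the transcript IDs are assumptions made for illustration; they are not taken from AsaruSim itself.

```python
import csv
import io

# Hypothetical toy data: rows are features, columns are cell barcodes.
cells = ["ACGGCGATCGCGAGCC", "TTGGCGATCGCGAGAA"]
counts = {
    "ENST00000000001": [3, 0],
    "ENST00000000002": [1, 5],
}


def write_matrix(handle, cells, counts):
    """Write a feature-by-cell table: header of barcodes, one row per feature."""
    writer = csv.writer(handle)
    writer.writerow(["feature"] + cells)
    for feature, row in counts.items():
        writer.writerow([feature] + row)


def read_matrix(handle):
    """Read the table back into (list of barcodes, feature -> counts)."""
    reader = csv.reader(handle)
    header = next(reader)
    matrix = {row[0]: [int(v) for v in row[1:]] for row in reader}
    return header[1:], matrix


buffer = io.StringIO(newline="")
write_matrix(buffer, cells, counts)
buffer.seek(0)
loaded_cells, loaded_counts = read_matrix(buffer)

# Column sums give the total count per cell.
per_cell_totals = [sum(col) for col in zip(*loaded_counts.values())]
```

Checking the per-cell totals before running a simulation is a cheap way to confirm that rows and columns were not transposed.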
AsaruSim requires a feature name and a GTF annotation for a gene-per-cell matrix
Since the sequence names in the reference transcriptome correspond to transcript IDs, users must specify the feature name used in their matrix when a gene-per-cell matrix is provided. Set the feature name using the --features
parameter.
Available options include:
transcript_id (default)
gene_id
gene_name
Additionally, users are required to supply a gene annotation (.gtf format)
file using the --gtf
parameter.
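To illustrate why the annotation is needed, here is a minimal sketch of how a GTF file can link transcript IDs (the names in the reference transcriptome) to the feature used in a gene-level matrix. The function name `transcript_to_feature` and the toy GTF lines are hypothetical, and this is not AsaruSim's own parser; it only assumes the standard nine-column GTF layout with `key "value";` attributes.

```python
import re


def transcript_to_feature(gtf_lines, feature="gene_id"):
    """Map each transcript_id to the chosen feature (gene_id or gene_name)."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # keep only transcript records
        # Column 9 holds attributes written as: key "value";
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        if "transcript_id" in attrs and feature in attrs:
            mapping[attrs["transcript_id"]] = attrs[feature]
    return mapping


# Hypothetical two-transcript annotation for a single gene.
toy_gtf = [
    "#!genome-build toy\n",
    'chr1\tsrc\tgene\t1\t900\t.\t+\t.\tgene_id "G1"; gene_name "ABC";\n',
    'chr1\tsrc\ttranscript\t1\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "ABC";\n',
    'chr1\tsrc\ttranscript\t1\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; gene_name "ABC";\n',
]

by_gene_id = transcript_to_feature(toy_gtf, feature="gene_id")
by_gene_name = transcript_to_feature(toy_gtf, feature="gene_name")
```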
--bc_counts
To simulate specific UMI counts per cell barcode, set the --bc_counts
parameter to the path of a UMI counts .CSV file
. This parameter eliminates the need for an input matrix: UMI counts are simulated with transcripts chosen at random.
CB | counts |
---|---|
ACGGCGATCGCGAGCC | 1260 |
ACGGCGATCGCGAGCC | 1104 |
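The idea can be sketched as follows: for every barcode, draw as many transcripts as its UMI count. Uniform sampling with replacement, the function name `sample_transcripts`, the second barcode, and the transcript IDs are assumptions for illustration; the text above only states that transcripts are chosen randomly.

```python
import random


def sample_transcripts(bc_counts, transcript_ids, seed=0):
    """For each cell barcode, draw `count` transcripts uniformly at random."""
    rng = random.Random(seed)
    return {
        barcode: rng.choices(transcript_ids, k=count)
        for barcode, count in bc_counts.items()
    }


# Hypothetical inputs mirroring the CB / counts table layout.
bc_counts = {"ACGGCGATCGCGAGCC": 1260, "TTGGCGATCGCGAGAA": 1104}
transcript_ids = ["T1", "T2", "T3"]

assigned = sample_transcripts(bc_counts, transcript_ids)
```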
--cell_types_annotation
AsaruSim can generate synthetic count tables with varying cell types, differentially expressed genes, or isoforms. This capability is particularly useful for simulating count matrices that mimic the characteristics of existing cell populations.
To simulate cell groups, AsaruSim requires supplementary parameters describing the characteristics of the desired cell groups.
Requirements for simulating cell groups:
AsaruSim uses the SPARSim R package to simulate the synthetic count table. For each cell group to simulate, SPARSim needs three pieces of information as input:
- expression level intensities.
- expression level variabilities.
- cell group library sizes.
For more information, see the SPARSim vignettes.
AsaruSim allows users to estimate these characteristics from an existing count table. To do so, the user needs to set the --sim_celltypes
parameter to true
and to provide the list of cell barcodes of each group (.CSV file)
using --cell_types_annotation
parameter:
CB | cell_type |
---|---|
ACGGCGATCGCGAGCC | type 1 |
ACGGCGATCGCGAGCC | type 2 |
AsaruSim will then use the provided matrix to estimate the characteristics of each cell group and generate a synthetic count matrix.
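To give a rough feel for the three per-group quantities, the sketch below computes simple raw-count summaries (mean, variance, and library size) for each annotated group. This is only an illustration: SPARSim performs its own estimation in R, and the function name `group_summaries` and the toy data are hypothetical.

```python
from statistics import mean, pvariance


def group_summaries(counts, cells, cell_types):
    """Summarise a feature-by-cell matrix per annotated cell group.

    counts: feature -> list of counts aligned with `cells`
    cell_types: barcode -> group label
    """
    groups = {}
    for idx, barcode in enumerate(cells):
        groups.setdefault(cell_types[barcode], []).append(idx)

    summaries = {}
    for group, idxs in groups.items():
        summaries[group] = {
            # Mean count per feature, a crude stand-in for intensity.
            "intensity": {f: mean([row[i] for i in idxs]) for f, row in counts.items()},
            # Population variance per feature, a crude stand-in for variability.
            "variability": {f: pvariance([row[i] for i in idxs]) for f, row in counts.items()},
            # Total counts per cell in the group.
            "library_sizes": [sum(row[i] for row in counts.values()) for i in idxs],
        }
    return summaries


# Hypothetical toy matrix with three cells in two groups.
cells = ["CB1", "CB2", "CB3"]
counts = {"T1": [2, 4, 0], "T2": [1, 1, 9]}
cell_types = {"CB1": "type 1", "CB2": "type 1", "CB3": "type 2"}

summaries = group_summaries(counts, cells, cell_types)
```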
Template
AsaruSim generates reads that correspond to a 10X Genomics library construction coupled with Nanopore sequencing. The final construct consists of: an adapter sequence composed of the 10X and Nanopore adapters, a cellular barcode (CB), UMI sequences at the same frequencies as in the synthetic count matrix, a 20 bp oligo(dT), the feature-corresponding cDNA sequence from the reference transcriptome, and a template switch oligo (TSO) at the end.
--dT_LENGTH
Specifies the length of the oligo(dT) tail sequence. Default: 20
bp.
--ADAPTER_SEQ
Defines the sequence of the adapter used in the ONT and 10X Genomics libraries. By default, the 10X 3' solution V3 adapter sequence is used.
Default: ACTAAAGGCCATTACGGCCTACACGACGCTCTTCCGATCT
.
--TSO_SEQ
Specifies the sequence of the template switching oligo (TSO). Default: TGTACTCTGCGTTGATACCACTGCTT
.
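The template layout described in this section can be sketched as a plain concatenation of its parts, using the default values listed above. The function name `build_template`, the 12-nt UMI length, and the example cDNA are assumptions for illustration; strand orientation and any reverse-complementing done by the actual tool are deliberately ignored here.

```python
import random

# Defaults quoted from the parameter descriptions.
ADAPTER_SEQ = "ACTAAAGGCCATTACGGCCTACACGACGCTCTTCCGATCT"
TSO_SEQ = "TGTACTCTGCGTTGATACCACTGCTT"
DT_LENGTH = 20


def build_template(cell_barcode, umi, cdna,
                   adapter=ADAPTER_SEQ, tso=TSO_SEQ, dt_length=DT_LENGTH):
    """Concatenate adapter, CB, UMI, oligo(dT), cDNA, and TSO in that order."""
    return adapter + cell_barcode + umi + "T" * dt_length + cdna + tso


rng = random.Random(0)
umi = "".join(rng.choice("ACGT") for _ in range(12))  # assumed UMI length
cdna = "ATGGCCAAGCTGACC"  # hypothetical cDNA fragment

template = build_template("ACGGCGATCGCGAGCC", umi, cdna)
```

Keeping the parts as separate arguments makes it easy to swap in a different adapter or TSO, which is what --ADAPTER_SEQ and --TSO_SEQ allow on the command line.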
Reference transcriptome
The feature-corresponding cDNA sequence is sampled from the reference transcriptome.
--transcriptome
A reference transcriptome file in .fasta
format can be downloaded from Ensembl.
--length_dist
When a gene expression matrix is provided, AsaruSim mimics a realistic read length distribution by selecting a random cDNA of the corresponding gene, with a prior probability favoring short cDNAs.
AsaruSim models read length with a log-normal distribution. Users may provide their own parameters to personalize the distribution, as three comma-delimited values (shape, location, scale), with the parameter --length_dist
.
(default: 0.37,0.0,825)
Shape (σ): The standard deviation of the log values.
Location (μ): The location parameter using the basic form of the log-normal distribution.
Scale: The scale factor (the median of your distribution).
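As a minimal sketch of what these three values mean, the snippet below draws read lengths from a log-normal distribution using only the standard library. It assumes the (shape, location, scale) triplet follows the same convention as SciPy's `scipy.stats.lognorm`, where a sample is `loc + scale * exp(shape * Z)` with `Z` standard normal; whether AsaruSim uses exactly this parametrization is an assumption, and `sample_read_lengths` is a hypothetical helper.

```python
import math
import random
import statistics


def sample_read_lengths(n, shape=0.37, loc=0.0, scale=825.0, seed=0):
    """Draw n lengths from a shifted log-normal (shape, loc, scale)."""
    rng = random.Random(seed)
    mu = math.log(scale)  # log of the scale is the mean of the underlying normal
    return [loc + rng.lognormvariate(mu, shape) for _ in range(n)]


lengths = sample_read_lengths(20000)
median_length = statistics.median(lengths)
```

With the default values and a location of 0, the sample median should sit close to the scale (825), which matches the description of the scale as the median of the distribution.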
Fit read distribution of real reads
Users may also fit the read length distribution to their own data with this approach by providing a subset of real reads (in .FASTQ
format) using the --model_fastq
parameter. (See also build model in the Error model section)
PCR amplification
AsaruSim takes into account the bias of PCR amplification introduced during the library construction process. PCR amplification is simulated by replicating the synthetic reads at each cycle, with a capture probability and an error rate.
--PCR_cycles
The number of PCR cycles to simulate. During each cycle, the reads are duplicated, so the number of reads grows exponentially following the formula: $$N = N_0 \times (1 + E)^{C}$$
where N is the final number of reads, N_0 is the initial number of reads, E is the efficiency rate, and C is the number of cycles.
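A short worked example of the formula, written as a small helper. The function name `expected_reads` and the choice of 1,000 starting reads and 5 cycles are illustrative assumptions; only the efficiency of 0.9 corresponds to a documented default.

```python
def expected_reads(n0, efficiency, cycles):
    """Expected number of reads after PCR: N = N0 * (1 + E) ** C."""
    return n0 * (1 + efficiency) ** cycles


# 1,000 initial reads, efficiency 0.9, 5 cycles: 1000 * 1.9 ** 5
n_final = expected_reads(1000, 0.9, 5)
```

Here `n_final` is about 24,761 reads, roughly a 25-fold increase, compared with the 32-fold increase that perfect efficiency (E = 1) would give over the same five cycles.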
--PCR_efficiency
The duplication efficiency rate is set by the user (default: --PCR_efficiency 0.9
)
--PCR_error_rate
The probability that each nucleotide in a duplicated read is mutated during amplification. The error rate is also set by the user (default: --PCR_error_rate 3.5e-05
)
--total_reads
Total number of reads to randomly subsample from the resulting artificial PCR product, to mimic the experimental protocol where only a subset of the sample is used for the sequencing step.
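The PCR parameters fit together roughly as in the sketch below: at each cycle every read is copied with probability equal to the efficiency, each copied base may be substituted with the given error rate, and the final pool is subsampled. This is a simplified illustration under those assumptions, not AsaruSim's implementation; `simulate_pcr` and `mutate` are hypothetical names, and only substitutions are modelled.

```python
import random


def mutate(read, error_rate, rng):
    """Substitute each base with probability `error_rate`."""
    out = []
    for base in read:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_pcr(reads, cycles, efficiency=0.9, error_rate=3.5e-05,
                 total_reads=None, seed=0):
    """Duplicate reads over several cycles, then optionally subsample."""
    rng = random.Random(seed)
    pool = list(reads)
    for _ in range(cycles):
        copies = []
        for read in pool:
            # Each read is copied with probability `efficiency`.
            if rng.random() < efficiency:
                copies.append(mutate(read, error_rate, rng))
        pool.extend(copies)
    if total_reads is not None and total_reads < len(pool):
        # Keep a random subset, as only part of the product is sequenced.
        pool = rng.sample(pool, total_reads)
    return pool


full = simulate_pcr(["ACGTACGT", "TTTTGGGG"], cycles=3, efficiency=1.0, error_rate=0.0)
subset = simulate_pcr(["ACGTACGT", "TTTTGGGG"], cycles=3, efficiency=1.0,
                      error_rate=0.0, total_reads=5)
```

With perfect efficiency and no errors, two reads become 16 identical copies after three cycles, which agrees with the closed-form growth of (1 + E) per cycle.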
Users can use an amplification rate instead of PCR amplification
Inspired by SLSim, the amplification rate allows users to repeat each template read a number of times. This is a simpler way to simulate amplification, where the number of copies x of each read is drawn as: $$x \sim \text{Poi}(\text{amp\_rate})$$ The rate is set by the user using --amp_rate
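A minimal sketch of this Poisson-based amplification using only the standard library. The Poisson sampler uses Knuth's multiplication method, which is adequate for small rates; the helper names and the rate of 3 are illustrative assumptions, and whether reads that draw zero copies are dropped in the actual tool is not stated in the text above.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def amplify(reads, amp_rate, seed=0):
    """Repeat each read x times, with x drawn from Poisson(amp_rate)."""
    rng = random.Random(seed)
    amplified = []
    for read in reads:
        amplified.extend([read] * poisson(amp_rate, rng))
    return amplified


templates = ["read_%d" % i for i in range(20000)]  # hypothetical template names
amplified = amplify(templates, amp_rate=3)
mean_copies = len(amplified) / len(templates)
```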
Error model
AsaruSim uses the Badread Python library to simulate nanopore sequencing errors and assign per-base quality scores based on pre-trained error models (see Badread documentation for more information). To do so, AsaruSim requires:
--trained_model
This allows the user to choose one of the built-in error models within the Badread database. The possible values are:
nanopore2023
: a model trained on ONT R10.4.1 reads from 2023 (the default).
nanopore2020
: a model trained on ONT R9.4.1 reads from 2020.
nanopore2018
: a model trained on ONT R9.4/R9.4.1 reads from 2018.
random
: a random error model with a 1/3 chance each of insertion, deletion, and substitution.
a file path for a trained model.
--badread_identity
Badread uses the Beta distribution to sample read identities. The distribution is defined with three parameters: mean
, standard deviation
, and maximum
value.
To pass these parameters to AsaruSim, use three comma-delimited values in the order identity mean, max, stdev. Default: --badread_identity 95,99,2.5
.
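To show how a mean, a maximum, and a standard deviation can define a Beta distribution, the sketch below rescales the mean and standard deviation by the maximum, converts them to Beta shape parameters by moment matching, and scales the samples back to percentages. This is an illustration of the idea under those assumptions rather than a copy of Badread's code, and `sample_identities` is a hypothetical name.

```python
import random
import statistics


def sample_identities(n, mean=95.0, max_identity=99.0, stdev=2.5, seed=0):
    """Sample read identities (in percent) from a Beta scaled to [0, max]."""
    rng = random.Random(seed)
    m = mean / max_identity   # mean on the unit interval
    s = stdev / max_identity  # standard deviation on the unit interval
    # Moment matching: Beta(a, b) has mean m and variance s**2.
    a = (((1 - m) / s ** 2) - 1 / m) * m ** 2
    b = a * (1 / m - 1)
    return [max_identity * rng.betavariate(a, b) for _ in range(n)]


identities = sample_identities(20000)
observed_mean = statistics.fmean(identities)
observed_stdev = statistics.pstdev(identities)
```

Because the Beta distribution is bounded, no sampled identity exceeds the maximum, while the long left tail still allows an occasional low-identity read.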
--build_model
To train personalized read identity, Q-score, and error models internally, AsaruSim requires a real FASTQ read file that can be provided using --model_fastq
and a reference genome (.FASTA) file using --ref_genome
.
AsaruSim also accepts pre-built model files
Users can use --error_model
and --qscore_model
to provide pre-built Badread models as files.