Bakta CLI
Installation
Bakta can be installed via BioConda, Docker, Singularity and Pip. However, we encourage using Conda or Docker/Singularity, which automatically install all required 3rd party dependencies.
In all cases a mandatory database must be downloaded.
BioConda
$ conda install -c conda-forge -c bioconda bakta
Docker
$ sudo docker pull oschwengers/bakta
$ sudo docker run oschwengers/bakta --help
Installation instructions and get-started guides: Docker docs
For further convenience, we provide a shell script (bakta-docker.sh) that handles Docker-related parameters (volume mounting, user IDs, etc.):
$ bakta-docker.sh --db <db-path> --output <output-path> <input>
Singularity
$ singularity build bakta.sif docker://oschwengers/bakta:latest
$ singularity run bakta.sif --help
Installation instructions and get-started guides: Singularity docs
Pip
$ python3 -m pip install --user bakta
Bakta requires the following 3rd party executables, which must be installed & executable:
tRNAscan-SE (2.0.8) https://doi.org/10.1101/614032 http://lowelab.ucsc.edu/tRNAscan-SE
Aragorn (1.2.38) http://dx.doi.org/10.1093/nar/gkh152 http://130.235.244.92/ARAGORN
INFERNAL (1.1.4) https://dx.doi.org/10.1093%2Fbioinformatics%2Fbtt509 http://eddylab.org/infernal
PILER-CR (1.06) https://doi.org/10.1186/1471-2105-8-18 http://www.drive5.com/pilercr
Prodigal (2.6.3) https://dx.doi.org/10.1186%2F1471-2105-11-119 https://github.com/hyattpd/Prodigal
Hmmer (3.3.2) https://doi.org/10.1093/nar/gkt263 http://hmmer.org
Diamond (2.0.14) https://doi.org/10.1038/nmeth.3176 https://github.com/bbuchfink/diamond
Blast+ (2.12.0) https://www.ncbi.nlm.nih.gov/pubmed/2231712 https://blast.ncbi.nlm.nih.gov
AMRFinderPlus (3.10.23) https://github.com/ncbi/amr
DeepSig (1.2.5) https://doi.org/10.1093/bioinformatics/btx818
Database download
Bakta requires a mandatory database which is publicly hosted at Zenodo. Further information is provided in the database section below.
List available DB versions:
$ bakta_db list
...
To download the most recent compatible database version, we recommend using the internal database download & setup tool:
$ bakta_db download --output <output-path>
Of course, the database can also be downloaded manually:
$ wget https://zenodo.org/record/5215743/files/db.tar.gz
$ tar -xzf db.tar.gz
$ rm db.tar.gz
$ amrfinder_update --force_update --database db/amrfinderplus-db/
In this case, please also download the AMRFinderPlus database as indicated above.
Update an existing database:
$ bakta_db update --db <existing-db-path> [--tmp-dir <tmp-directory>]
The database path can be provided either via parameter (--db) or environment variable (BAKTA_DB):
$ bakta --db <db-path> genome.fasta
$ export BAKTA_DB=<db-path>
$ bakta genome.fasta
For system-wide setups, the database can also be copied to the Bakta base directory:
$ cp -r db/ <bakta-installation-dir>
As Bakta takes advantage of AMRFinderPlus for the annotation of AMR genes, AMRFinder is required to set up its own internal databases in an <amrfinderplus-db> subfolder within the Bakta database <db-path>, once via amrfinder_update --force_update --database <db-path>/amrfinderplus-db/. To ease this process, we recommend using Bakta's internal download procedure.
Examples
Simple:
$ bakta --db <db-path> genome.fasta
Expert: verbose output, writing results to a results directory with the ecoli123 file prefix and eco634 locus tag, using an existing Prodigal training file, additional replicon information and 8 threads:
$ bakta --db <db-path> --verbose --output results/ --prefix ecoli123 --locus-tag eco634 --prodigal-tf eco.tf --replicons replicon.tsv --threads 8 genome.fasta
Input and Output
Input
Bakta accepts bacterial genomes and plasmids (complete / draft assemblies) in (zipped) fasta format. For a full description of how further genome information can be provided and workflow customizations can be set, please have a look at the Usage section.
Replicon meta data table:
To fine-tune the very details of each sequence in the input fasta file, Bakta accepts a replicon meta data table provided in csv or tsv file format: --replicons <file.tsv>. Thus, complete replicons within partially completed draft assemblies can be marked & handled as such, e.g. for the detection & annotation of features spanning sequence edges.
Table format:
| original sequence id | new sequence id | type | topology | name |
|---|---|---|---|---|
| <id> | [<new-id> / <empty>] | [chromosome / plasmid / contig / <empty>] | [circular / linear / <empty>] | [<name> / <empty>] |
For each input sequence recognized via the original locus id, a new locus id, the replicon type and the topology, as well as a name, can be explicitly set.
Shortcuts:
- chromosome: c
- plasmid: p
- circular: c
- linear: l

<empty> values (- / ``) will be replaced by defaults. If the new locus id is <empty>, a new contig name will be autogenerated.
Defaults:
- type: contig
- topology: linear
Example:
| original locus id | new locus id | type | topology | name |
|---|---|---|---|---|
| NODE_1 | chrom | | | |
| NODE_2 | p1 | | | |
| NODE_3 | p2 | | | |
| NODE_4 | special-contig-name-xyz | | - | |
| NODE_5 | `` | | - | |
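As a concrete sketch, such a replicon meta data table could be generated programmatically. The file name, the chosen shortcut values and the use of - for empty cells below are assumptions for illustration only:

```python
import csv

# Illustrative replicon meta data rows following the table format above;
# '-' marks an <empty> cell that Bakta replaces with its defaults.
rows = [
    # original id, new id, type, topology, name
    ("NODE_1", "chrom", "c", "c", "-"),   # shortcuts: c = chromosome, c = circular
    ("NODE_2", "p1", "p", "c", "-"),      # shortcut:  p = plasmid
    ("NODE_4", "special-contig-name-xyz", "-", "-", "-"),
]

with open("replicon.tsv", "w", newline="") as fh:
    csv.writer(fh, delimiter="\t").writerows(rows)
```

The resulting file is then passed via --replicons replicon.tsv.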
User provided protein sequences
Bakta accepts user provided trusted protein sequences via --proteins in either GenBank (CDS features) or Fasta format. Using the Fasta format, each reference sequence can be provided in a short or long format:
# short:
>id gene~~~product~~~dbxrefs
MAQ...
# long:
>id min_identity~~~min_query_cov~~~min_subject_cov~~~gene~~~product~~~dbxrefs
MAQ...
Allowed values:
| field | value(s) | example |
|---|---|---|
| min_identity | | 80, 90.3 |
| min_query_cov | | 80, 90.3 |
| min_subject_cov | | 80, 90.3 |
| gene | | msp |
| product | | my special protein |
| dbxrefs | | |
Protein sequences provided in short Fasta or GenBank format are searched with default thresholds of 90%, 80% and 80% for minimal identity, query and subject coverage, respectively.
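A long-format record can be written and parsed with a few lines of code. The sequence id, threshold values, gene symbol, product and dbxref below are invented for this sketch:

```python
# A trusted-protein Fasta record in the long header format described above;
# all concrete values (prot_1, 90.0, msp, VFDB:VF0511, ...) are hypothetical.
record = ">prot_1 90.0~~~80.0~~~80.0~~~msp~~~my special protein~~~VFDB:VF0511\nMAQKLVT\n"

with open("trusted.faa", "w") as fh:
    fh.write(record)

# Parsing the header back into its '~~~'-separated fields:
header = record.splitlines()[0][1:]          # drop the leading '>'
seq_id, fields = header.split(" ", 1)        # id, then the annotation fields
min_id, min_qcov, min_scov, gene, product, dbxrefs = fields.split("~~~")
```

Such a file would then be passed to Bakta via --proteins trusted.faa.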
Output
Annotation results are provided in standard bioinformatics file formats:
- <prefix>.tsv: annotations as simple human readable TSV
- <prefix>.gff3: annotations & sequences in GFF3 format
- <prefix>.gbff: annotations & sequences in (multi) GenBank format
- <prefix>.embl: annotations & sequences in (multi) EMBL format
- <prefix>.fna: replicon/contig DNA sequences as FASTA
- <prefix>.ffn: feature nucleotide sequences as FASTA
- <prefix>.faa: CDS/sORF amino acid sequences as FASTA
- <prefix>.hypotheticals.tsv: further information on hypothetical protein CDS as simple human readable TSV
- <prefix>.hypotheticals.faa: hypothetical protein CDS amino acid sequences as FASTA
- <prefix>.txt: summary as TXT

The <prefix> can be set via --prefix <prefix>. If no prefix is set, Bakta uses the input file prefix.
Additionally, Bakta provides detailed information on each annotated feature in a standardized machine-readable JSON file <prefix>.json:
{
"genome": {
"genus": "Escherichia",
"species": "coli",
...
},
"stats": {
"size": 5594605,
"gc": 0.497,
...
},
"features": [
{
"type": "cds",
"contig": "contig_1",
"start": 971,
"stop": 1351,
"strand": "-",
"gene": "lsoB",
"product": "type II toxin-antitoxin system antitoxin LsoB",
...
},
...
],
"sequences": [
{
"id": "c1",
"description": "[organism=Escherichia coli] [completeness=complete] [topology=circular]",
"sequence": "AGCTTT...",
"length": 5498578,
"complete": true,
"type": "chromosome",
"topology": "circular"
...
},
...
]
}
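The JSON output is straightforward to post-process with standard tools. As a minimal sketch, the following counts annotated features per type over an excerpt mirroring the structure shown above; with a real run, the file name follows the --prefix setting:

```python
import json
from collections import Counter

# Minimal excerpt mirroring the JSON structure above; with a real result file
# you would use: data = json.load(open("<prefix>.json"))
data = json.loads("""
{
  "features": [
    {"type": "cds",  "contig": "contig_1", "gene": "lsoB"},
    {"type": "cds",  "contig": "contig_1", "gene": null},
    {"type": "tRNA", "contig": "contig_1", "gene": null}
  ]
}
""")

# Count annotated features per feature type:
counts = Counter(feature["type"] for feature in data["features"])
```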
Exemplary annotation result files for several genomes (mostly ESKAPE species) are hosted at Zenodo.
Usage
Usage:
usage: bakta [--db DB] [--min-contig-length MIN_CONTIG_LENGTH] [--prefix PREFIX] [--output OUTPUT]
[--genus GENUS] [--species SPECIES] [--strain STRAIN] [--plasmid PLASMID]
[--complete] [--prodigal-tf PRODIGAL_TF] [--translation-table {11,4}] [--gram {+,-,?}] [--locus LOCUS]
[--locus-tag LOCUS_TAG] [--keep-contig-headers] [--replicons REPLICONS] [--compliant] [--proteins PROTEINS]
[--skip-trna] [--skip-tmrna] [--skip-rrna] [--skip-ncrna] [--skip-ncrna-region]
[--skip-crispr] [--skip-cds] [--skip-sorf] [--skip-gap] [--skip-ori]
[--help] [--verbose] [--threads THREADS] [--tmp-dir TMP_DIR] [--version]
<genome>
Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
positional arguments:
<genome> Genome sequences in (zipped) fasta format
Input / Output:
--db DB, -d DB Database path (default = <bakta_path>/db). Can also be provided as BAKTA_DB environment variable.
--min-contig-length MIN_CONTIG_LENGTH, -m MIN_CONTIG_LENGTH
Minimum contig size (default = 1; 200 in compliant mode)
--prefix PREFIX, -p PREFIX
Prefix for output files
--output OUTPUT, -o OUTPUT
Output directory (default = current working directory)
Organism:
--genus GENUS Genus name
--species SPECIES Species name
--strain STRAIN Strain name
--plasmid PLASMID Plasmid name
Annotation:
--complete All sequences are complete replicons (chromosome/plasmid[s])
--prodigal-tf PRODIGAL_TF
Path to existing Prodigal training file to use for CDS prediction
--translation-table {11,4}
Translation table: 11/4 (default = 11)
--gram {+,-,?} Gram type for signal peptide predictions: +/-/? (default = '?')
--locus LOCUS Locus prefix (default = 'contig')
--locus-tag LOCUS_TAG
Locus tag prefix (default = autogenerated)
--keep-contig-headers
Keep original contig headers
--replicons REPLICONS, -r REPLICONS
Replicon information table (tsv/csv)
--compliant Force GenBank/ENA/DDBJ compliance
--proteins PROTEINS Fasta file of trusted protein sequences for CDS annotation
Workflow:
--skip-trna Skip tRNA detection & annotation
--skip-tmrna Skip tmRNA detection & annotation
--skip-rrna Skip rRNA detection & annotation
--skip-ncrna Skip ncRNA detection & annotation
--skip-ncrna-region Skip ncRNA region detection & annotation
--skip-crispr Skip CRISPR array detection & annotation
--skip-cds Skip CDS detection & annotation
--skip-pseudo Skip pseudogene detection & annotation
--skip-sorf Skip sORF detection & annotation
--skip-gap Skip gap detection & annotation
--skip-ori Skip oriC/oriT detection & annotation
General:
--help, -h Show this help message and exit
--verbose, -v Print verbose information
--threads THREADS, -t THREADS
Number of threads to use (default = number of available CPUs)
--tmp-dir TMP_DIR Location for temporary files (default = system dependent auto detection)
--version show program's version number and exit
Annotation Workflow
RNAs
tRNA genes: tRNAscan-SE 2.0
tmRNA genes: Aragorn
rRNA genes: Infernal vs. Rfam rRNA covariance models
ncRNA genes: Infernal vs. Rfam ncRNA covariance models
ncRNA cis-regulatory regions: Infernal vs. Rfam ncRNA covariance models
CRISPR arrays: PILER-CR
Bakta distinguishes ncRNA genes and (cis-regulatory) regions in order to enable the distinct handling thereof during the annotation process, i.e. feature overlap detection.
ncRNA gene types:
sRNA
antisense
ribozyme
antitoxin
ncRNA (cis-regulatory) region types:
riboswitch
thermoregulator
leader
frameshift element
Coding sequences
The structural prediction is conducted via Prodigal and complemented by a custom detection of sORF < 30 aa.
To rapidly identify known protein sequences with exact sequence matches and to conduct comprehensive annotations, Bakta utilizes a compact read-only SQLite database comprising protein sequence digests and pre-assigned annotations for millions of known protein sequences and clusters.
Conceptual terms:
UPS: unique protein sequences identified via length and MD5 hash digests (100% coverage & 100% sequence identity)
IPS: identical protein sequences comprising seeds of UniProt’s UniRef100 protein sequence clusters
PSC: protein sequences clusters comprising seeds of UniProt’s UniRef90 protein sequence clusters
PSCC: protein sequences clusters of clusters comprising annotations of UniProt’s UniRef50 protein sequence clusters
CDS:
Prediction via Prodigal respecting sequences’ completeness (distinct prediction for complete replicons and uncompleted contigs)
Discard spurious CDS via AntiFam
Detect translational exceptions (selenocysteines)
Detection of UPSs via MD5 digests and lookup of related IPS and PSC
Sequence alignments of remainder via Diamond vs. PSC (query/subject coverage=0.8, identity=0.5)
Assignment to UniRef90 or UniRef50 clusters if alignment hits achieve identities larger than 0.9 or 0.5, respectively
Execution of expert systems:
AMR: AMRFinderPlus
Expert proteins: NCBI BlastRules, VFDB
User proteins (optionally via --proteins <Fasta/GenBank>)
Prediction of signal peptides (optionally via --gram <+/->)
Combination of IPS, PSC, PSCC and expert system information favouring more specific annotations and avoiding redundancy
CDS without IPS or PSC hits, as well as those lacking gene symbols or product descriptions other than hypothetical, will be marked as hypothetical.
Such hypothetical CDS are further analyzed:
Detection of Pfam domains, repeats & motifs
Calculation of protein sequence statistics, i.e. molecular weight, isoelectric point
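The UPS detection step above (MD5 digests plus sequence lengths as exact-lookup keys) can be illustrated as follows. This is a conceptual sketch, not Bakta's actual schema or normalization rules, and the example db entry is hypothetical:

```python
import hashlib

def ups_key(aa_seq: str):
    """Reduce a protein sequence to (length, MD5 digest) for exact lookup."""
    seq = aa_seq.upper().rstrip("*")   # normalization details are assumptions
    return len(seq), hashlib.md5(seq.encode()).hexdigest()

# Hypothetical pre-assigned annotations keyed by (length, digest):
ups_db = {ups_key("MAQKLVT"): {"gene": "msp", "product": "my special protein"}}

# A lookup hits iff length and digest match exactly (100% coverage & identity):
hit = ups_db.get(ups_key("maqklvt*"))   # same sequence after normalization
```

Keying on both length and digest is what allows hash collisions to be detected rather than silently merged.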
sORFs:
Custom sORF detection & extraction with amino acid lengths < 30 aa
Apply strict feature type-dependent overlap filters
Discard spurious sORF via AntiFam
Detection of UPS via MD5 hashes and lookup of related IPS
Sequence alignments of remainder via Diamond vs. an sORF subset of PSCs (coverage=0.9, identity=0.9)
Exclude sORF without sufficient annotation information
Prediction of signal peptides (optionally via --gram <+/->)
sORFs not identified via IPS or PSC will be discarded. Due to the uncertain nature of sORF prediction, only those identified via IPS/PSC hits exhibiting proper gene symbols or product descriptions different from hypothetical will be included in the final annotation.
Miscellaneous
Gaps: in-memory detection & annotation of sequence gaps
oriC/oriV/oriT: Blast+ (cov=0.8, id=0.8) vs. MOB-suite oriT & DoriC oriC/oriV sequences. Annotations of ori regions take into account overlapping Blast+ hits and are conducted based on a majority vote heuristic. Region edges are fuzzy - use with caution!
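The majority-vote idea for ori regions can be sketched as: collect the overlapping Blast+ hits for a candidate region and assign the most frequent subject type. This is a simplification of the actual heuristic, and the hit labels are invented:

```python
from collections import Counter

# Hypothetical overlapping Blast+ hits for one candidate ori region, each
# labelled with the type of its matched subject sequence.
hits = ["oriC", "oriC", "oriV", "oriC"]

def vote(hit_types):
    """Annotate a region with the most frequent ori type among its hits."""
    return Counter(hit_types).most_common(1)[0][0]

region_type = vote(hits)
```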
Database
The Bakta database comprises a set of AA & DNA sequence databases as well as HMM & covariance models. At its core Bakta utilizes a compact read-only SQLite db storing protein sequence digests, lengths, pre-assigned annotations and dbxrefs of UPS, IPS and PSC from:
UPS: UniParc / UniProtKB (213,429,161)
IPS: UniProt UniRef100 (198,494,206)
PSC: UniProt UniRef90 (91,455,850)
PSCC: UniProt UniRef50 (12,323,024)
This allows the exact identification of protein sequences via MD5 digests & sequence lengths as well as the rapid subsequent lookup of related information. Protein sequence digests are checked for hash collisions during the db creation process. IPS & PSC have been comprehensively pre-annotated, integrating annotations & database dbxrefs from:
NCBI nonredundant proteins (IPS: 153,166,049)
NCBI COG db (PSC: 3,383,871)
SwissProt EC/GO terms (PSC: 335,068)
NCBI AMRFinderPlus (IPS: 6,534, PSC: 48,931)
ISFinder db (IPS: 45,820, PSC: 10,351)
Pfam families (PSC: 3,917,555)
To provide high quality annotations for distinct protein sequences of high importance (AMR, VF, etc) which cannot sufficiently be covered by the IPS/PSC approach, Bakta provides additional expert systems. For instance, AMR genes are annotated via NCBI's AMRFinderPlus. An expandable alignment-based expert system supports the incorporation of high quality annotations from multiple sources. This currently comprises NCBI's BlastRules as well as VFDB and will be complemented with more expert annotation sources over time. Internally, this expert system is based on a Diamond DB comprising the following information in a standardized format:
source: e.g. BlastRules
rank: a precedence rank
min identity
min query coverage
min model coverage
gene label
product description
dbxrefs
Rfam covariance models:
ncRNA: 798
ncRNA cis-regulatory regions: 270
ori sequences:
oriC/V: 10,878
oriT: 502
To provide FAIR annotations, the database releases are SemVer versioned (w/o patch level), i.e. <major>.<minor>. For each version we provide a comprehensive log file tracking all imported sequences as well as annotations thereof. The db schema is represented by the <major> digit and automatically checked at runtime by Bakta in order to ensure compatibility. Content updates are tracked by the <minor> digit.
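The runtime compatibility check therefore boils down to comparing the <major> digit of the installed db against the one Bakta expects; a minimal sketch (version strings are illustrative):

```python
def compatible(db_version: str, required_major: int) -> bool:
    """A database is schema-compatible iff its <major> digit matches."""
    major, _minor = (int(part) for part in db_version.split("."))
    return major == required_major

ok = compatible("3.1", 3)    # content update (minor bump) stays compatible
bad = compatible("2.9", 3)   # schema change (major bump) breaks compatibility
```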
All database releases (latest 3.1, 28 Gb zipped, 53 Gb unzipped) are hosted at Zenodo.
Genome Submission
Most genomes annotated with Bakta should be ready to submit to the INSDC member databases GenBank and ENA. As a first step, please register your BioProject (e.g. PRJNA123456) and your locus_tag prefix (e.g. ESAKAI).
# annotate your genome in `--compliant` mode:
$ bakta --db <db-path> -v --genus Escherichia --species "coli O157:H7" --strain Sakai --complete --compliant --locus-tag ESAKAI test/data/GCF_000008865.2.fna.gz
GenBank
Genomes are submitted to GenBank via Fasta (.fna) and SQN files. The .sqn files can be created from .gff3 files via NCBI's new table2asn_GFF tool.
Please have all additional files (template.txt) prepared:
# download table2asn_GFF for Linux
$ wget https://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/table2asn_GFF/linux64.table2asn_GFF.gz
$ gunzip linux64.table2asn_GFF.gz
# or MacOS
$ wget https://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/table2asn_GFF/mac.table2asn_GFF.gz
$ gunzip mac.table2asn_GFF.gz
$ chmod 755 linux64.table2asn_GFF mac.table2asn_GFF
# create the SQN file:
$ linux64.table2asn_GFF -M n -J -c w -t template.txt -V vbt -l paired-ends -i GCF_000008865.2.fna -f GCF_000008865.2.gff3 -o GCF_000008865.2.sqn -Z
ENA
Genomes are submitted to ENA as EMBL (.embl) files via EBI's Webin-CLI tool.
Please have all additional files (manifest.tsv, chrom-list.tsv) prepared as described here.
# download ENA Webin-CLI
$ wget https://github.com/enasequence/webin-cli/releases/download/v4.0.0/webin-cli-4.0.0.jar
$ gzip -k GCF_000008865.2.embl
$ gzip -k chrom-list.tsv
$ java -jar webin-cli-4.0.0.jar -submit -userName=<EMAIL> -password <PWD> -context genome -manifest manifest.tsv
Exemplary manifest.tsv and chrom-list.tsv files might look like:
$ cat manifest.tsv
STUDY PRJEB44484
SAMPLE ERS6291240
ASSEMBLYNAME GCF
ASSEMBLY_TYPE isolate
COVERAGE 100
PROGRAM SPAdes
PLATFORM Illumina
MOLECULETYPE genomic DNA
FLATFILE GCF_000008865.2.embl.gz
CHROMOSOME_LIST chrom-list.tsv.gz
$ cat chrom-list.tsv
contig_1 contig_1 circular-chromosome
contig_2 contig_2 circular-plasmid
contig_3 contig_3 circular-plasmid
Protein bulk annotation
For the direct bulk annotation of protein sequences aside from the genome, Bakta provides a dedicated CLI entry point bakta_proteins:
Examples:
$ bakta_proteins --db <db-path> input.fasta
$ bakta_proteins --db <db-path> --prefix test --output test --proteins special.faa --threads 8 input.fasta
Output
Annotation results are provided in standard bioinformatics file formats:
- <prefix>.tsv: annotations as simple human readable TSV
- <prefix>.faa: protein sequences as FASTA
- <prefix>.hypotheticals.tsv: further information on hypothetical proteins as simple human readable TSV

The <prefix> can be set via --prefix <prefix>. If no prefix is set, Bakta uses the input file prefix.
Usage
usage: bakta_proteins [--db DB] [--output OUTPUT] [--prefix PREFIX] [--proteins PROTEINS] [--help] [--threads THREADS] [--tmp-dir TMP_DIR] [--version] <input>
Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
positional arguments:
<input> Protein sequences in (zipped) fasta format
Input / Output:
--db DB, -d DB Database path (default = <bakta_path>/db). Can also be provided as BAKTA_DB environment variable.
--output OUTPUT, -o OUTPUT
Output directory (default = current working directory)
--prefix PREFIX, -p PREFIX
Prefix for output files
Annotation:
--proteins PROTEINS Fasta file of trusted protein sequences for annotation
Runtime & auxiliary options:
--help, -h Show this help message and exit
--threads THREADS, -t THREADS
Number of threads to use (default = number of available CPUs)
--tmp-dir TMP_DIR Location for temporary files (default = system dependent auto detection)
--version, -V show program's version number and exit
FAQ
AMRFinder fails
If AMRFinder constantly crashes even on fresh setups and Bakta's database was downloaded manually, then AMRFinder needs to set up its own internal database. This is required only once: amrfinder_update --force_update --database <bakta-db>/amrfinderplus-db. You could also try Bakta's internal database download logic, which automatically takes care of this: bakta_db download --output <bakta-db>

DeepSig not found in Conda environment
For the prediction of signal peptides, Bakta uses DeepSig, which is currently not available for MacOS. Therefore, we decided to exclude DeepSig from Bakta's default Conda dependencies, because otherwise it would not be installable on MacOS systems. On Linux systems it can be installed via conda install -c conda-forge -c bioconda python=3.8 deepsig.

Nice, but I'm missing XYZ…
Bakta is quite new and we're keen to constantly improve it and further expand its feature set. In case there's anything missing, please do not hesitate to open an issue and ask for it!

Bakta is running too long without CPU load… why?
Bakta takes advantage of an SQLite DB which results in high storage IO loads. If this DB is stored on a remote / network volume, the lookup of IPS/PSC annotations might take a long time. In these cases, please consider moving the DB to a local volume or hard drive.
Issues and Feature Requests
Bakta is brand new and, as with every software, some bugs may be lurking around. So, if you run into any issues with Bakta, we'd be happy to hear about it.
Please execute bakta in verbose mode (-v) and do not hesitate to file an issue including as much information as possible:
- a detailed description of the issue
- command line output
- log file (<prefix>.log)
- result file (<prefix>.json) if possible
- a reproducible example of the issue with an input file that you can share if possible