Input and Output
Input
Bakta accepts bacterial genomes and plasmids (complete or draft assemblies) in (zipped) Fasta format. For a full description of how to provide further genome information and customize the workflow, please see the Usage section or this manual.
Replicon metadata table
To fine-tune the details of each sequence in the input Fasta file, Bakta accepts a replicon metadata table in CSV or TSV format: --replicons <file.tsv>. This allows complete replicons within partially completed draft assemblies to be marked and handled as such, e.g. for the detection and annotation of features spanning sequence edges.
Table format:
| original sequence id | new sequence id | type | topology | name |
|---|---|---|---|---|
| old id | new id, <empty> | chromosome, plasmid, contig, <empty> | circular, linear, <empty> | name, <empty> |
For each input sequence, recognized via its original locus id, a new locus id, the replicon type, and the topology as well as a name can be set explicitly.
Shortcuts:
- chromosome: c
- plasmid: p
- circular: c
- linear: l
<empty> values (- / ``) will be replaced by defaults. If the new locus id is empty, a new contig name will be autogenerated.
Defaults:
- type: contig
- topology: linear
Example:
| original locus id | new locus id | type | topology | name |
|---|---|---|---|---|
| NODE_1 | chrom | chromosome | circular | - |
| NODE_2 | p1 | plasmid | c | pXYZ1 |
| NODE_3 | p2 | p | c | pXYZ2 |
| NODE_4 | special-contig-name-xyz | - | - | - |
| NODE_5 | `` | - | - | - |
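The shortcut and default rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described in this section, not Bakta's own parser; the helper name `normalize_replicon_row` and the returned dictionary keys are hypothetical. Because the format of the autogenerated contig name is not specified here, an empty new id is simply returned as `None`.

```python
# Shortcuts and defaults as listed in this section.
TYPE_SHORTCUTS = {"c": "chromosome", "p": "plasmid"}
TOPOLOGY_SHORTCUTS = {"c": "circular", "l": "linear"}
# Assumption: "-", an empty string and the literal "``" all mean <empty>.
EMPTY_VALUES = {"", "-", "``"}
DEFAULT_TYPE = "contig"
DEFAULT_TOPOLOGY = "linear"


def normalize_replicon_row(row):
    """Expand shortcuts and fill defaults for one 5-column row (hypothetical helper)."""
    if len(row) != 5:
        raise ValueError(f"expected 5 columns, got {len(row)}")
    old_id, new_id, rep_type, topology, name = (field.strip() for field in row)

    def clean(value):
        # Map every <empty> marker to None.
        return None if value in EMPTY_VALUES else value

    rep_type = clean(rep_type)
    topology = clean(topology)
    return {
        "original_id": old_id,
        "new_id": clean(new_id),  # None: Bakta autogenerates a contig name
        "type": TYPE_SHORTCUTS.get(rep_type, rep_type) if rep_type else DEFAULT_TYPE,
        "topology": TOPOLOGY_SHORTCUTS.get(topology, topology) if topology else DEFAULT_TOPOLOGY,
        "name": clean(name),
    }


# Rows taken from the example table above.
rows = [
    ["NODE_2", "p1", "plasmid", "c", "pXYZ1"],
    ["NODE_3", "p2", "p", "c", "pXYZ2"],
    ["NODE_5", "``", "-", "-", "-"],
]
for row in rows:
    print(normalize_replicon_row(row))
```

Note that the shortcut `c` is resolved per column: it means chromosome in the type column and circular in the topology column.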
User-provided regions
Bakta accepts pre-annotated (a priori), user-provided feature regions via --regions in either GFF3 or GenBank format. These regions supersede all de novo-predicted regions, but are equally subject to the internal functional annotation process. Currently, only CDS are supported. A maximum overlap with de novo-predicted CDS of 30 bp is allowed. If you would like to provide custom functional annotations, you can provide these via --proteins which is described in the following section.
User-provided protein sequences
Bakta accepts user-provided trusted protein sequences via --proteins in either GenBank (CDS features) or Fasta format which are used in the functional annotation process. Using the Fasta format, each reference sequence can be provided in a short or long format:
# short:
>id gene~~~product~~~dbxrefs
MAQ...
# long:
>id min_identity~~~min_query_cov~~~min_subject_cov~~~gene~~~product~~~dbxrefs
MAQ...
Allowed values:
| field | value(s) | example |
|---|---|---|
| min_identity | int, float | 80, 90.3 |
| min_query_cov | int, float | 80, 90.3 |
| min_subject_cov | int, float | 80, 90.3 |
| gene | <empty>, string | msp |
| product | string | my special protein |
| dbxrefs | <empty>, db:id, comma-separated list | VFDB:VF0511 |
Protein sequences provided in short Fasta or GenBank format are searched with default thresholds of 90%, 80% and 80% for minimal identity, query and subject coverage, respectively.
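As a rough illustration of the two header layouts, the following Python sketch splits a Fasta header on `~~~` and applies the default thresholds stated above when the short format is used. The helper name `parse_protein_header` is hypothetical, and telling the two formats apart by counting fields (three versus six) is an assumption of this sketch rather than a description of Bakta's internals.

```python
# Default thresholds for short-format headers, as stated in this section.
DEFAULTS = {"min_identity": 90.0, "min_query_cov": 80.0, "min_subject_cov": 80.0}
SEPARATOR = "~~~"


def parse_protein_header(header):
    """Parse a short or long Fasta header line (hypothetical helper)."""
    # The id is everything up to the first space; the rest is the description.
    seq_id, _, description = header.lstrip(">").strip().partition(" ")
    fields = description.split(SEPARATOR)
    if len(fields) == 3:  # short: gene~~~product~~~dbxrefs
        thresholds = dict(DEFAULTS)
        gene, product, dbxrefs = fields
    elif len(fields) == 6:  # long: three thresholds come first
        thresholds = {key: float(value) for key, value in zip(DEFAULTS, fields[:3])}
        gene, product, dbxrefs = fields[3:]
    else:
        raise ValueError(f"expected 3 or 6 fields, got {len(fields)}: {header!r}")
    return {
        "id": seq_id,
        **thresholds,
        "gene": gene or None,  # gene may be <empty>
        "product": product,
        "dbxrefs": [ref for ref in dbxrefs.split(",") if ref],
    }


print(parse_protein_header(">p1 msp~~~my special protein~~~VFDB:VF0511"))
print(parse_protein_header(">p2 80~~~90.3~~~85~~~msp~~~my special protein~~~VFDB:VF0511"))
```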
User-provided HMMs
Bakta accepts user-provided trusted HMMs via --hmms in HMMER's text format. If set, Bakta will adhere to the trusted cutoff specified in the HMM header. In addition, a max. evalue threshold of 1e-6 is applied. By default, Bakta uses the HMM description line as a product description. Further information can be provided via the HMM description line using the short format as explained above in the User-provided protein sequences section.
# default
HMMER3/f [3.1b2 | February 2015]
NAME id
ACC id
DESC product
LENG 435
TC 600 600
# short
NAME id
ACC id
DESC gene~~~product~~~dbxrefs
LENG 435
TC 600 600
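To check how a custom HMM header would be read under the rules above, a small Python sketch can collect the relevant tags and split the DESC line. It is only a sketch under stated assumptions: the helper name `parse_hmm_header` is hypothetical, only the tags shown in the examples are read, and the TC value is kept as the raw string rather than interpreted.

```python
HEADER_KEYS = {"NAME", "ACC", "DESC", "LENG", "TC"}


def parse_hmm_header(text):
    """Collect selected header tags and split the DESC line (hypothetical helper)."""
    header = {}
    for line in text.splitlines():
        # Each header line is a tag followed by whitespace and its value.
        key, _, value = line.strip().partition(" ")
        if key in HEADER_KEYS:
            header[key] = value.strip()

    description = header.get("DESC", "")
    fields = description.split("~~~")
    if len(fields) == 3:  # short format: gene~~~product~~~dbxrefs
        gene, product, dbxrefs = fields
    else:  # default: the whole description is the product
        gene, product, dbxrefs = "", description, ""
    return {
        "id": header.get("NAME"),
        "gene": gene or None,
        "product": product,
        "dbxrefs": [ref for ref in dbxrefs.split(",") if ref],
        "trusted_cutoff": header.get("TC"),  # raw string, not interpreted here
    }


default_header = "HMMER3/f [3.1b2 | February 2015]\nNAME id\nACC id\nDESC product\nLENG 435\nTC 600 600"
short_header = "NAME id\nACC id\nDESC msp~~~my special protein~~~VFDB:VF0511\nLENG 435\nTC 600 600"
print(parse_hmm_header(default_header))
print(parse_hmm_header(short_header))
```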
Output
Annotation results are provided in standard bioinformatics file formats:
- <prefix>.tsv: annotations as simple human-readable TSV
- <prefix>.gff3: annotations & sequences in GFF3 format
- <prefix>.gbff: annotations & sequences in (multi) GenBank format
- <prefix>.embl: annotations & sequences in (multi) EMBL format
- <prefix>.fna: replicon/contig DNA sequences as FASTA
- <prefix>.ffn: feature nucleotide sequences as FASTA
- <prefix>.faa: CDS/sORF amino acid sequences as FASTA
- <prefix>.inference.tsv: inference metrics (score, evalue, coverage, identity) for annotated accessions as TSV
- <prefix>.hypotheticals.tsv: further information on hypothetical protein CDS as simple human-readable TSV
- <prefix>.hypotheticals.faa: hypothetical protein CDS amino acid sequences as FASTA
- <prefix>.txt: summary as TXT
- <prefix>.png: circular genome annotation plot as PNG
- <prefix>.svg: circular genome annotation plot as SVG
- <prefix>.json: all (internal) annotation & sequence information as JSON
The <prefix> can be set via --prefix <prefix>. If no prefix is set, Bakta uses the input file prefix.
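In a larger pipeline it can be handy to derive the expected file names from the prefix, for instance to check that a run has finished. The following Python sketch does this using the suffixes listed above; the helper names `expected_outputs` and `missing_outputs` are hypothetical, and a missing file is not necessarily an error, since which files are written may depend on the options used.

```python
from pathlib import Path

# Output suffixes as listed in this section.
OUTPUT_SUFFIXES = [
    ".tsv", ".gff3", ".gbff", ".embl", ".fna", ".ffn", ".faa",
    ".inference.tsv", ".hypotheticals.tsv", ".hypotheticals.faa",
    ".txt", ".png", ".svg", ".json",
]


def expected_outputs(output_dir, prefix):
    """Return the result file paths expected for a given prefix (hypothetical helper)."""
    return [Path(output_dir) / f"{prefix}{suffix}" for suffix in OUTPUT_SUFFIXES]


def missing_outputs(output_dir, prefix):
    """Return the expected result files that do not exist on disk."""
    return [path for path in expected_outputs(output_dir, prefix) if not path.is_file()]


for path in expected_outputs("results", "genome"):
    print(path)
```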
Of note, Bakta provides all detailed (internal) information on each annotated feature in a standardized machine-readable JSON file <prefix>.json:
{
    "genome": {
        "genus": "Escherichia",
        "species": "coli",
        ...
    },
    "stats": {
        "size": 5594605,
        "gc": 0.497,
        ...
    },
    "features": [
        {
            "type": "cds",
            "contig": "contig_1",
            "start": 971,
            "stop": 1351,
            "strand": "-",
            "gene": "lsoB",
            "product": "type II toxin-antitoxin system antitoxin LsoB",
            ...
        },
        ...
    ],
    "sequences": [
        {
            "id": "c1",
            "description": "[organism=Escherichia coli] [completeness=complete] [topology=circular]",
            "nt": "AGCTTT...",
            "length": 5498578,
            "complete": true,
            "type": "chromosome",
            "topology": "circular",
            ...
        },
        ...
    ]
}
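Because the JSON file is plain, machine-readable data, it can be processed with standard tooling. The following Python sketch loads a result file (plain or gzip-compressed) and summarizes its features. It relies only on the keys visible in the excerpt above (`features`, `type`, `contig`, `start`, `stop`, `strand`, `gene`); the helper names are hypothetical, and the demo runs on a small in-memory stand-in rather than a real result file.

```python
import gzip
import json
from collections import Counter


def load_result(path):
    """Load a result file; a '.gz' suffix is read through gzip."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def summarize_features(result):
    """Count features per type."""
    return Counter(feature["type"] for feature in result["features"])


def named_cds(result):
    """Yield location and gene symbol of every CDS that has a gene symbol."""
    for feature in result["features"]:
        if feature["type"] == "cds" and feature.get("gene"):
            yield (feature["contig"], feature["start"], feature["stop"],
                   feature["strand"], feature["gene"])


# In-memory stand-in mirroring the excerpt above; a real run would call
# load_result() on the <prefix>.json file instead.
example = {
    "features": [
        {"type": "cds", "contig": "contig_1", "start": 971, "stop": 1351,
         "strand": "-", "gene": "lsoB",
         "product": "type II toxin-antitoxin system antitoxin LsoB"},
    ]
}
print(summarize_features(example))
for record in named_cds(example):
    print(record)
```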
Bakta provides a helper command, bakta_io, to create the above-mentioned output files from the (gzip-compressed) JSON result file. Long-term or large-scale annotation projects can thus reduce their overall storage requirements by keeping only the compressed JSON file.
bakta_io --output <output-path> --prefix <prefix> result.json.gz
bakta_io --help
Exemplary annotation result files for several genomes (mostly ESKAPE species) are hosted at Zenodo: