License: GPL v3 PyPI - Python Version GitHub release PyPI PyPI - Status Conda DOI

Welcome to Bakta’s documentation!

Bakta is a tool for the rapid & standardized annotation of bacterial genomes & plasmids. It provides dbxref-rich and sORF-including annotations in machine-readable JSON & bioinformatics standard file formats for automatic downstream analysis.

This documentations provides information how to install and use Bakta via CLI, web and REST-API.

Features

  • Bacteria & plasmids only Bakta was designed to annotate bacteria and plasmids, only. This decision by design has been made in order to tweak the annotation process regarding tools, preferences & databases and to streamline further development & maintenance of the software.

  • FAIR annotations To provide standardized annotations adhearing to FAIR principles, Bakta utilizes a comprehensive & versioned custom annotation database based on UniProt’s UniRef100 & UniRef90 protein clusters (FAIR -> DOI/DOI) enriched with dbxrefs (GO, COG, EC) and annotated by specialized niche databases. For each db version we provide a comprehensive log file of all imported sequences and annotations.

  • Protein sequence identification Fostering the FAIR aspect, Bakta identifies identical protein sequences (IPS) via MD5 hash digests which are annotated with database cross-references (dbxref) to RefSeq (WP_*), UniRef100 (UniRef100_*) and UniParc (UPI*). By doing so, IPS allow the surveillance of distinct gene alleles and streamlining comparative analysis as well as posterior (external) annotations of putative & hypothetical protein sequences which can be mapped back to existing CDS via these exact & stable identifiers (E. coli gene ymiA …more). Currently, Bakta identifies ~198 mio, ~185 mio and ~150 mio distinct protein sequences from UniParc, UniRef100 and RefSeq, respectively. Hence, for certain genomes, up to 99 % of all CDS can be identified this way, skipping computationally expensive sequence alignments.

  • Small proteins / short open reading frames Bakta detects and annotates small proteins/short open reading frames (sORF) which are not predicted by tools like Prodigal.

  • Fast Bakta can annotate a typical bacterial genome in 10 ±5 min on a laptop, plasmids in a couple of seconds/minutes.

  • Expert annotation systems To provide high quality annotations for certain proteins of higher interest, e.g. AMR & VF genes, Bakta includes & merges different expert annotation systems. Currently, Bakta uses NCBI’s AMRFinderPlus for AMR gene annotations as well as a generalized protein sequence expert system with distinct coverage, identity and priority values for each sequence, currenlty comprising the VFDB as well as NCBI’s BlastRules.

  • Comprehensive workflow Bakta annotates ncRNA cis-regulatory regions, oriC/oriV/oriT and assembly gaps as well as standard feature types: tRNA, tmRNA, rRNA, ncRNA genes, CRISPR, CDS.

  • GFF3 & INSDC conform annotations Bakta writes GFF3 and INSDC-compliant (Genbank & EMBL) annotation files ready for submission (checked via GenomeTools GFF3Validator and ENA Webin-CLI for GFF3 and EMBL file formats, respectively for representative genomes of all ESKAPE species).

  • Reasoning By annotating bacterial genomes in a standardized, taxon-independent, high-throughput and local manner, Bakta aims at a well-balanced tradeoff between fully-featured but computationally demanding pipelines like PGAP and rapid highly-customizable offline tools like Prokka. Indeed, Bakta is heavily inspired by Prokka (kudos to Torsten Seemann) and many command line options are compatible for the sake of interoperability and user convenience. Hence, if Bakta does not fit your needs, please try Prokka.

Citation

Schwengers O., Jelonek L., Dieckmann M. A., Beyvers S., Blom J., Goesmann A. (2021). Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11). https://doi.org/10.1099/mgen.0.000685

Bakta is standing on the shoulder of giants taking advantage of many publicly available databases. If you find any of those used within Bakta useful, please credit these primary sources, as well: