nf-core/createtaxdb
Edit

Parallelised and automated construction of metagenomic classifier databases of different tools

databasedatabase-buildermetagenomic-profilingmetagenomicsprofilingtaxonomic-profiling

This is the development version of the pipeline.

Launch development version https://github.com/nf-core/createtaxdb

Tutorials

Convert NCBI assembly_summary file to nf-core/createtaxdb samplesheet

A common source of reference genomes to build taxonomic classification databases is the NCBI suite of databases.

Conveniently, NCBI provides ‘assembly summary’ tables for different taxonomic groups that contain all the information that is needed for a nf-core/createtaxdb samplesheet. Using this file as a source of reference FASTAs can provide two primary benefits:

The genomes will be automatically compatible with NCBI taxonomy files
They provide URLs that can be directly used by Nextflow to download the genome FASTA files for you

The goals of this tutorial are:

Use standard terminal commands to convert an NCBI assembly_summary.txt file to a nf-core/createtaxdb compatible samplesheet
Build DNA-based Kraken2 and an Amino Acid-based Kaiju databases with the pipeline using the generated samplesheet

Info

This tutorial is tested with NCBI assembly_summary files from January 2026 using nf-core/createtaxdb v2.0.0.

You may need to modify commands if NCBI changes the format of these files in the future.

Prerequisites

Internet connection
A Unix terminal (Linux or macOS)
Software installed:
1. curl (tested version: 8.5.0)
2. awk (tested version: mawk 1.3.4 20240123)
3. sed (tested version: GNU sed 4.9)
4. nextflow (tested version: 25.10.2)
5. A Nextflow compatible environment system (for example conda, singularity, docker) (tested version: docker 27.2.1, build 9e34c9b)

Download, filter, and convert the assembly_summary file

Download the assembly_summary file for your taxonomic group of interest.

As an example, we will use the Genome RefSeq database’s fungi assembly summary:
```
curl -O https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/assembly_summary.txt
```
Optionally filter the assembly_summary file to only include certain genomes of interest.

For example, you might want to only include assemblies built to a “Complete” or “Chromosome” level. You could do this with command line tools, or in a spreadsheet program.

Here is an example with awk to filter for:
- Only “Complete Genome”-level assemblies
- First three genomes
```
awk -F '\t' 'NF>2; $12 == "Complete Genome" {print}' assembly_summary.txt | head -n 4 > assembly_summary_filtered.txt
```

Simplify the assembly_summary file to only include the columns we need for the nf-core/createtaxdb samplesheet, namely, # assembly_accession, taxid, ftp_path. Additionally, replace the first line to have the expected nf-core/createtaxdb samplesheet headers: id, taxid, fasta_dna, fasta_aa.

cut -f 1,7,20 assembly_summary_filtered.txt | sed 's/#assembly_accession.*/id\ttaxid\tfasta_dna\tfasta_aa/' > assembly_summary_simplified.txt

This results in:

id	taxid	fasta_dna	fasta_aa
GCF_041956525.1	4840	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/041/956/525/GCF_041956525.1_Rhipu1
GCF_000002945.2	4896	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/945/GCF_000002945.2_ASM294v3
GCF_003054445.1	4909	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/054/445/GCF_003054445.1_ASM305444v1

Reconstruct the complete URLs of the relevant FASTA files to make them downloadable.

awk 'BEGIN { FS="\t"; OFS="," }
NR>1 {
base=$3; sub(".*/","",base)
$4=$3 "/" base "_protein.faa.gz"
$3=$3 "/" base "_genomic.fna.gz"
}
{ print $1,$2,$3,$4 }' assembly_summary_simplified.txt > samplesheet.csv

Explanation of `awk` command

This awk command works as follows:

Specify tab as the delimiter
Print the header line
Extract the base URL column and in the variable n, split the elements on / into an array called p
Construct a new protein FASTA URL column based on base URL, but append the last element of the array p (called by the length of n) plus _protein.faa.gz
Replace the existing base URL column with a new DNA FASTA URL constructed in the same way as the protein FASTA URL, but instead append _genomic.fna.gz
Print the four columns, separated by commas to create the expected nf-core/createtaxdb CSV file

This results in:

id,taxid,fasta_dna,fasta_aa
GCF_041956525.1,4840,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/041/956/525/GCF_041956525.1_Rhipu1/GCF_041956525.1_Rhipu1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/041/956/525/GCF_041956525.1_Rhipu1/GCF_041956525.1_Rhipu1_protein.faa.gz
GCF_000002945.2,4896,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/945/GCF_000002945.2_ASM294v3/GCF_000002945.2_ASM294v3_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/945/GCF_000002945.2_ASM294v3/GCF_000002945.2_ASM294v3_protein.faa.gz
GCF_003054445.1,4909,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/054/445/GCF_003054445.1_ASM305444v1/GCF_003054445.1_ASM305444v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/054/445/GCF_003054445.1_ASM305444v1/GCF_003054445.1_ASM305444v1_protein.faa.gz

Tip

If you only want to build DNA-based databases (for example, Kraken2), omit the $4 variable definition and printing.

awk 'BEGIN { FS="\t"; OFS="," }
NR>1 {
base=$3; sub(".*/","",base)
$3=$3 "/" base "_genomic.fna.gz"
}
{ print $1,$2,$3 }' assembly_summary_simplified.txt > samplesheet.csv

This results in:

id,taxid,fasta_dna
GCF_041956525.1,4840,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/041/956/525/GCF_041956525.1_Rhipu1/GCF_041956525.1_Rhipu1_genomic.fna.gz
GCF_000002945.2,4896,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/945/GCF_000002945.2_ASM294v3/GCF_000002945.2_ASM294v3_genomic.fna.gz
GCF_003054445.1,4909,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/054/445/GCF_003054445.1_ASM305444v1/GCF_003054445.1_ASM305444v1_genomic.fna.gz

If you only want to build amino acid-based databases (for example, Kaiju), omit the $4 variable definition and printing, and replace the $3 field with _protein.faa.gz. You will also need to replace the header:

awk 'BEGIN { FS="\t"; OFS="," }
NR>1 {
base=$3; sub(".*/","",base)
$3=$3 "/" base "_protein.faa.gz"
}
{ print $1,$2,$3 }' assembly_summary_simplified.txt > samplesheet.csv

This results in:

id,taxid,fasta_aa
GCF_041956525.1,4840,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/041/956/525/GCF_041956525.1_Rhipu1/GCF_041956525.1_Rhipu1_protein.faa.gz
GCF_000002945.2,4896,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/945/GCF_000002945.2_ASM294v3/GCF_000002945.2_ASM294v3_protein.faa.gz
GCF_003054445.1,4909,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/054/445/GCF_003054445.1_ASM305444v1/GCF_003054445.1_ASM305444v1_protein.faa.gz

Download taxonomy files

Download the necessary NCBI taxonomy files required by Kraken2 with:

curl -O https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip
rm taxdmp.zip
curl -O https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
gunzip nucl_gb.accession2taxid.gz

Kaiju does not require taxonomy files for database construction.

Run the pipeline

Run the nf-core/createtaxdb pipeline to build our databases as normal. Specify the samplesheet and taxonomy files we created and downloaded with their respective parameters. Here we use docker as our environment manager:

nextflow run nf-core/createtaxdb \
-r 2.0.0 \
-profile docker \
--input samplesheet.csv \
--outdir ./results \
--dbname ncbi_fungi \
--nodesdmp nodes.dmp \
--namesdmp names.dmp \
--accession2taxid nucl_gb.accession2taxid \
--build_kraken2 --build_kaiju

Note

By default the pipeline assumes you need 72 GB of RAM to build a Kraken2 database. However, the test run can fit within approximately 8 GB of RAM due to the small number of genomes.

If running on a smaller machine, you may get an error such as Process requirement exceeds available memory -- req: 36 GB; avail: 31 GB. To create a custom config file (for example, custom_config.config) with the following contents to reduce the memory requirement. In this case, my machine has 16GB RAM:

process {
    resourceLimits = [
        cpus: 4,
        memory: '15.GB',
        time: '1.h',
    ]
}

Append to the end of the Nextflow command the parameter -c custom_config.config.

Once completed successfully, you can check the database files in the results/ directory with

ls results/{kaiju,kraken2}/*

And we can see the Kaiju .fmi file and the Kraken2 database directory:

results/kaiju/ncbi_fungi-kaiju.fmi
 
results/kraken2/ncbi_fungi-kraken2:
hash.k2d  opts.k2d  taxo.k2d

Bonus: NCBI assembly_summary to samplesheet one-liner

In fact, we can execute all the commands to generate the samplesheet described above in one go as single UNIX one-liner command:

curl --silent https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/assembly_summary.txt |
awk -F '\t' 'NF>2; $12 == "Complete Genome" {print}' | \
head -n 4 | cut -f 1,7,20 | \
sed 's/#assembly_accession.*/id\ttaxid\tfasta_dna\tfasta_aa/' | \
awk 'BEGIN { FS="\t"; OFS="," } NR>1 { base=$3; sub(".*/","",base); $4=$3 "/" base "_protein.faa.gz"; $3=$3 "/" base "_genomic.fna.gz" } { print $1,$2,$3,$4 }' > samplesheet.csv

We have to set curl to ‘silent’ to prevent false-positive error messages of Failure writing output to destination. Note that this also has the side-effect of hiding of other potentially valid errors!

Summary

In this tutorial we went through how to convert an NCBI assembly_summary file to a nf-core/createtaxdb samplesheet.

We used standard command line tools to download, filter, and reformat the assembly_summary file in a reproducible manner and use this file to generate databases for two different taxonomic classification tools with nf-core/createtaxdb.

Use these steps to quickly build custom taxonomic classification databases for your metagenomic analyses from one of the most popular source of reference genomes.

Note: The awk command in step 4 was partly written with the assistance of AI (Claude Haiku 4.5) and improved by @dialvarezs. Documentation style review with GPT-5.1-Codex-Max

On this page

nf-core/createtaxdb Edit

Tutorials

Convert NCBI assembly_summary file to nf-core/createtaxdb samplesheet

Prerequisites

Download, filter, and convert the assembly_summary file

Download taxonomy files

Run the pipeline

Bonus: NCBI assembly_summary to samplesheet one-liner

Summary

nf-core/createtaxdb
Edit