VEME_2024_NGS_Denovo_Assembly

VEME 2024 NGS De novo Assembly Tutorial

Open with Gitpod

Taylor K. Paisie

CDC - Bacterial Special Pathogens Branch (BSPB)

Email: ltj8@cdc.gov

https://taylorpaisie.github.io/VEME_2024_NGS_Denovo_Assembly/

Directory visualization

How the structure of our directories should look

Command line cheat sheet

1. Introduction to sequence assembly

What does “sequence assembly” mean?

Assembly is a “catch-all” term used to describe methods where we combine shorter individual reads into longer contiguous sequences called contigs

Because the sequencing process works by breaking the original DNA into smaller fragments, the assembly process is conceptually similar to putting together an image puzzle from its many pieces

The software that performs the assembly is called the assembler

We will learn how to de novo assemble reads sequenced by the Illumina sequencing platform using SPAdes, an assembly toolkit containing various assembly pipelines

De novo assembly usually includes the following steps:

  1. Improving of the reads quality (remove adapters, trim reads, etc..)
  2. De novo assembly of the overlapping reads into contigs
  3. Joining contigs into scaffolds
  4. Comparison with other known genomes
  5. Filling the gaps
  6. Annotation of the assembled genome
  7. Visualize genome annotation

Challenges of de novo assembly

Sequence assembly is perhaps the application domain of bioinformatics where skill and expertise are the most difficult to identify and define

Assemblers are quite unlike any other software tool you will ever use

Most come with a bewildering array of parameters - the purpose of which are not explained, yet many will have profound effects on the results that they produce

Trial and error are one of the most commonly used strategies - you will have to keep tuning the parameters and rerun the entire process hoping that the results improve

Assembling a large genome may take weeks and substantial computational resources

Thus any expertise built on trial and error will have to be accumulated over a much more extended period

Finally, even when assembly appears to work, almost always it will contain several severe and substantial errors. That is where, in our opinion, bioinformatics expertise matters more

The ability to understand, visualize and correct the mistakes of an assembly has a utility that will outlast the present and is more valuable than knowing the exact invocation of a tool by heart

N50: length for which the collection of all contigs of that length or longer covers at least 50% of assembly length

Overlapping reads are assembled into contigs. Based on the info about paired-end reads, contigs may be further assembled into scaffolds

What is the N50 statistic:

Contig         Length           Sum
XXXXXXXXXX       10              10
XXXXXXXXX         9              19
XXXXXXXX          8              27
XXXXXXX           7              34
XXXXXX            6              40
XXXXX             5              45
XXXX              4              49
XXX               3              52
XX                2              54
X                 1              55

Problems with N50 statistic:

What are k-mers?

Multidrug resistant bacteria have become a major public health threat

Phage therapy may to be used as an alternative to antibiotics or, as a supplementary approach to treat some bacterial infections

Bacteriophages have been applied in clinical practice for the treatment of localized infections in wounds, burns, and trophic ulcers, including diabetic foot ulcers (PMC6083058)

In this study, bacteria were collected from trophic ulcers of the patients

Bacteriophages that were successful in treating diabetic foot disease were sequenced using NGS technology

The sample we will be using is a 2x250 Illumina sequenced bacteriophage

The goal of this exercise is to assemble the genome of a sequenced bacteriophage

Making directories for lesson:

$ mkdir -p ~/denovo_assembly/data/untrimmed_fastq ~/denovo_assembly/data/trimmed_fastq

2. Trimming Fastq files

Download the fastq and adapter files in the untrimmed fastq directory:

$ cd ~/denovo_assembly/data/untrimmed_fastq
$ wget -nv https://figshare.com/ndownloader/files/45571629 -O 169_S7_L001_R1_001.fastq.gz --no-check-certificate
$ wget -nv https://figshare.com/ndownloader/files/45571626 -O 169_S7_L001_R2_001.fastq.gz --no-check-certificate
$ cp /home/gitpod/miniconda/envs/denovo_assembly/share/trimmomatic-0.39-2/adapters/TruSeq3-PE-2.fa .

Running FastQC on the raw fastq files:

$ fastqc *.fastq.gz

Now run Trimmomatic on the raw fastq files:

$ trimmomatic PE \
169_S7_L001_R1_001.fastq.gz 169_S7_L001_R2_001.fastq.gz \
169_S7_L001_R1_001.trim.fastq.gz 169_S7_L001_R1_001un.trim.fastq.gz \
169_S7_L001_R2_001.trim.fastq.gz 169_S7_L001_R2_001un.trim.fastq.gz \
SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:TruSeq3-PE-2.fa:2:40:15

Run FastQC on newly trimmed fastq files:

$ fastqc *trim.fastq.gz

FastQC graph output for trimmed forward reads
FastQC graph output for trimmed reverse reads

Move trimmed fastq files to the trimmed fastq directory:

$ mv *trim* ../trimmed_fastq

Change to the trimmed_fastq directory:

$ cd ../trimmed_fastq

3. Sequence Assembly

We will be using the program SPades for de novo assembly

Spades will automatically make the final scaffolds:

$ spades.py --isolate -1 169_S7_L001_R1_001.trim.fastq.gz -2 169_S7_L001_R2_001.trim.fastq.gz -o spades_output

$ ls -l spades_output

Notice in our spades_output directory we have both a contigs.fasta and a scaffolds.fasta

Use QUAST to analyze the SPades output scaffolds fasta file:

$ quast.py -o quast_output spades_output/scaffolds.fasta

Result from running Quast on our scaffolds

QUAST output contains:

SPades makes both files, but we will be using the scaffolds.fasta for this exercise

Create and move scaffolds from SPades to results directory:

$ mkdir -p ~/denovo_assembly/results/scaffolds
$ mv ~/denovo_assembly/data/trimmed_fastq/spades_output/scaffolds.fasta ~/denovo_assembly/results/scaffolds
$ cd ~/denovo_assembly/results/scaffolds

We now want to be at the denovo_assembly directory

4. Comparing the scaffolds to other known genomes

We will know take our scaffolds and use NCBI BLAST to compare our newly assembled genome to other genomes

BLAST results from our scaffolds

BLAST found similar genomes

The closest is Pseudomonas phage CMS1, complete genome (OM937766.1), with coverage of 99% and identity of 98.53%

Examination of the GenBank record for OM937766 finds that organism is “Pseudomonas phage CMS1” and the taxon ID is 2926659

5. Filling the gaps

Now we will take our scaffolds and use it as a reference as a

Map the reads back to the scaffold as reference

Set up BWA reference mapping with the scaffold scaffolds.fasta as reference and add the trimmed fastq files

Make sure you are in the denovo_assembly directory and make results directories (helps with file organization, super important!!)

$ cd ~/denovo_assembly
$ mkdir -p results/sam results/bam

Index our scaffolds.fasta file we made with SPades:

$ bwa index results/scaffolds/scaffolds.fasta

Run BWA-MEM reference mapping with the indexed scaffolds.fasta as the reference and the original trimmed fastq files as the reads:

$ bwa mem results/scaffolds/scaffolds.fasta data/trimmed_fastq/169_S7_L001_R1_001.trim.fastq.gz data/trimmed_fastq/169_S7_L001_R2_001.trim.fastq.gz > results/sam/169.aligned.sam

Convert SAM file to BAM format:

$ samtools view -S -b results/sam/169.aligned.sam > results/bam/169.aligned.bam

Sort BAM file by coordinates:

$ samtools sort -o results/bam/169.aligned.sorted.bam results/bam/169.aligned.bam

Index new sorted BAM file:

$ samtools index results/bam/169.aligned.sorted.bam

Visualizing our new BAM file with IGV

We will use our scaffolds.fasta as the reference genome in IGV and the 169.aligned.sorted.bam BAM file

IGV visualization of our genome assembly

Now we run the program Pilon

Pilon is a software tool which can be used to automatically improve draft assemblies

It attempts to make improvements to the input genome, including:

Pilon outputs a FASTA file containing an improved representation of the genome from the read data

$ pilon --genome results/scaffolds/scaffolds.fasta --frags results/bam/169.aligned.sorted.bam --output results/scaffolds/169_improved

This command will give us the file 169_improved.fasta

After running this commmand, each fasta input in 169_improved.fasta has _pilon

We want to remove this _pilon after each fasta input

Open 169_improved.fasta in a text editor

$ code results/scaffolds/169_improved.fasta

We want to “replace all” _pilon with nothing

How to edit the improved fasta file (output from pilon)

6. Annotation of the assembled genome

We will use PROKKA on the improved sequence assembly

After you have de novo assembled your genome sequencing reads into scaffolds, it is useful to know what genomic features are on those contigs

The process of identifying and labelling those features is called genome annotation

Prokka is a “wrapper”; it collects together several pieces of software (from various authors), and so avoids “re-inventing the wheel”

Prokka finds and annotates features (both protein coding regions and RNA genes, i.e. tRNA, rRNA) present on on a sequence

Prokka uses a two-step process for the annotation of protein coding regions:

  1. Protein coding regions on the genome are identified using Prodigal
  2. The function of the encoded protein is predicted by similarity to proteins in one of many protein or protein domain databases

    Prokka is a software tool that can be used to annotate bacterial, archaeal and viral genomes quickly, generating standard output files in GenBank, EMBL and gff formats

Once Prokka has finished, examine each of its output files:

First we want to download the protein coding regions of the Pseudomonas phage PEV2 (NC_031063.1) genome, we can do this from NCBI

How to download a set of proteins from NCBI

Running prokka on the improved alignment with our downloaded protein set for annotation:

$ mkdir -p results/annotation
$ wget -nv https://raw.githubusercontent.com/taylorpaisie/VEME_2024_NGS_Denovo_Assembly/main/NC_031063.1.faa -O ~/denovo_assembly/results/annotation/NC_031063.1.faa

For this tutorial we will copy the protein set we will use for annotation

Need to activate the Prokka conda environment (we can talk about this later if we have time):

$ prokka --outdir results/annotation/prokka_output --kingdom Viruses --proteins results/annotation/NC_031063.1.faa results/scaffolds/169_improved.fasta

7. Visualize genome annotation

We will use the program Artemis to visualize the genome annotation we made with PROKKA using Artemis

Artemis is a free genome browser and annotation tool that allows visualisation of sequence features, next generation data and the results of analyses within the context of the sequence, and also its six-frame translation

Artemis is written in Java, and is available for UNIX, Macintosh and Windows systems

It can read EMBL and GENBANK database entries or sequence in FASTA, indexed FASTA or raw format

Using the GFF file made from PROKKA, we will open it with Artemis:

$ /usr/local/share/artemis/art results/annotation/prokka_output/PROKKA_08222023.gff

Visualizaing the genome annotation with Artemis

Some tips for using Artemis:

  1. There are 3 panels: feature map (top), sequence (middle), feature list (bottom)
  2. Click right-mouse-button on bottom panel and select Show products
  3. Zooming is done via the verrtical scroll bars in the two top panels