https://taylorpaisie.github.io/VEME_2024_NGS_Variant_Calling/
$ mkdir -p ~/variant_calling/data/untrimmed_fastq
$ cd ~/variant_calling/data/untrimmed_fastq
$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/007/SRR1972917/SRR1972917_1.fastq.gz
$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/007/SRR1972917/SRR1972917_2.fastq.gz
$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/008/SRR1972918/SRR1972918_1.fastq.gz
$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/008/SRR1972918/SRR1972918_2.fastq.gz
$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/009/SRR1972919/SRR1972919_1.fastq.gz
$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/009/SRR1972919/SRR1972919_2.fastq.gz
$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/000/SRR1972920/SRR1972920_1.fastq.gz
$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/000/SRR1972920/SRR1972920_2.fastq.gz
$ gunzip SRR1972917_1.fastq.gz
We can view the first complete read in one of the files our dataset by using head to look at the first four lines:
$ head -n 4 SRR1972917_1.fastq
@SRR1972917.1 1/1
TCCGTGGGGCTGGTACGACAGTATCGATGAGGGTGGACGCTTCAAGGTCAAGCGTATACAGGTCAACCCCAAAGCTAGCCTGAGCCTTCAGAAACACCACC
+
@CCFDDFFHGHHHGIJJIIJJJJGJJJJGIJIJJFIJJJIIJJJJHHFHHFFFFAADEFEDDDDDDDDDDDD??CCCDDDDDDCCCDDCCCDD:?CABDDB
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
| | | | |
Quality score: 01........11........21........31........41
$ gzip SRR1972917_1.fastq
$ fastqc -h
$ fastqc *.fastq*
application/gzip
Started analysis of SRR1972917_1.fastq.gz
Approx 5% complete for SRR1972917_1.fastq.gz
Approx 10% complete for SRR1972917_1.fastq.gz
Approx 15% complete for SRR1972917_1.fastq.gz
Approx 20% complete for SRR1972917_1.fastq.gz
Approx 25% complete for SRR1972917_1.fastq.gz
Approx 30% complete for SRR1972917_1.fastq.gz
Approx 35% complete for SRR1972917_1.fastq.gz
Approx 40% complete for SRR1972917_1.fastq.gz
Approx 45% complete for SRR1972917_1.fastq.gz
Approx 50% complete for SRR1972917_1.fastq.gz
Approx 55% complete for SRR1972917_1.fastq.gz
Approx 60% complete for SRR1972917_1.fastq.gz
Approx 65% complete for SRR1972917_1.fastq.gz
Approx 70% complete for SRR1972917_1.fastq.gz
Approx 75% complete for SRR1972917_1.fastq.gz
Approx 80% complete for SRR1972917_1.fastq.gz
Approx 85% complete for SRR1972917_1.fastq.gz
Approx 90% complete for SRR1972917_1.fastq.gz
Approx 95% complete for SRR1972917_1.fastq.gz
Analysis complete for SRR1972917_1.fastq.gz
Lets now look at the files created by FastQC:
$ ls -al
$ cd ~/variant_calling
$ mkdir -p results/fastqc_untrimmed_reads
$ mv ~/variant_calling/data/untrimmed_fastq/*.zip ~/variant_calling/results/fastqc_untrimmed_reads
$ mv ~/variant_calling/data/untrimmed_fastq/*.html ~/variant_calling/results/fastqc_untrimmed_reads
$ cd ~/variant_calling/results/fastqc_untrimmed_reads
Usually not.
When FastQC runs, it generates “stoplight” icons for each analysis having “pass,” “warning,” and “error” symbols. Most of the time, these symbols are not meaningful. They were developed for only a particular class of samples and library preparation methods and just for certain types of instruments.
Although if most or all your stop-light icons are red, then you probably have a data problem.
The simplest way to snapshot the quality of a sequencing run is via a chart that plots the error likelihood at each position averaged over all measurements.
The vertical axis is the FASTQ scores that represent error probabilities:
The three-colored bands illustrate the typical labels that we assign to these measures: reliable (30-40, green), less reliable (20-30, yellow) and error-prone (1-20, red). The yellow boxes contain 50% of the data, the whiskers indicate the 75% outliers.
The sequence length distribution shows how many sequences of each length the data contains. For fixed read length instruments, like the Illumina sequencer, all read lengths are the same. For long read technologies like the PacBio and MinION, the distribution can be a lot more varied.
Another way to visualize data quality is to generate histograms of the average qualities. The horizontal scales are the quality scores; the vertical axis indicates the number of reads of that quality.
$ trimmomatic
Usage:
PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...
or:
SE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] <inputFile> <outputFile> <trimmer1>...
or:
-version
While, in single end mode, Trimmomatic will expect 1 file as input, after which you can enter the optional settings and lastly the name of the output file
Running Trimmomatic:
Move to the correct directory with untrimmed fastq files we downloaded:
$ cd ~/variant_calling/data/untrimmed_fastq
Copy Illumina adapters from Trimmomatic into working directory:
$ cp /home/gitpod/miniconda/envs/variant_calling/share/trimmomatic-0.39-2/adapters/NexteraPE-PE.fa .
Run Trimmomatic:
$ trimmomatic PE SRR1972917_1.fastq.gz SRR1972917_2.fastq.gz SRR1972917_1.trim.fastq.gz SRR1972917_1un.trim.fastq.gz SRR1972917_2.trim.fastq.gz SRR1972917_2un.trim.fastq.gz SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
TrimmomaticPE: Started with arguments:
SRR1972917_1.fastq.gz SRR1972917_2.fastq.gz SRR1972917_1.trim.fastq.gz SRR1972917_1un.trim.fastq.gz SRR1972917_2.trim.fastq.gz SRR1972917_2un.trim.fastq.gz SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
Multiple cores found: Using 4 threads
Using PrefixPair: 'AGATGTGTATAAGAGACAG' and 'AGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTGACGCTGCCGACGA'
ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 4377867 Both Surviving: 1241328 (28.35%) Forward Only Surviving: 2133670 (48.74%) Reverse Only Surviving: 18680 (0.43%) Dropped: 984189 (22.48%)
TrimmomaticPE: Completed successfully
List out files created by Trimmomatic:
$ ls SRR1972917*
Trimmed files should be smaller in size than our untrimmed fastq files
Running a for loop on all fastq files
$ for infile in *_1.fastq.gz
do
base=$(basename ${infile} _1.fastq.gz)
trimmomatic PE ${infile} ${base}_2.fastq.gz \
${base}_1.trim.fastq.gz ${base}_1un.trim.fastq.gz \
${base}_2.trim.fastq.gz ${base}_2un.trim.fastq.gz \
SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
done
$ cd ~/variant_calling/data/untrimmed_fastq
$ mkdir -p ~/variant_calling/data/trimmed_fastq
$ mv *trim* ~/variant_calling/data/trimmed_fastq
$ cd ~/variant_calling/data/trimmed_fastq
$ ls -al
$ fastqc *trim.fastq.gz
KJ660346.2
and download fasta file$ mkdir -p ~/variant_calling/data/ref_genome
$ cd ~/variant_calling/data/ref_genome
$ wget -nv https://raw.githubusercontent.com/taylorpaisie/VEME_2024_NGS_Variant_Calling/main/KJ660346.2.fasta -O KJ660346.2.fasta
$ cd ~/variant_calling/
$ mkdir -p results/sam results/bam results/bcf results/vcf
$ bwa index data/ref_genome/KJ660346.2.fasta
We will use the BWA-MEM algorithm, which is the latest and is generally recommended for high-quality queries as it is faster and more accurate
$ bwa mem data/ref_genome/KJ660346.2.fasta data/trimmed_fastq/SRR1972917_1.trim.fastq.gz data/trimmed_fastq/SRR1972917_2.trim.fastq.gz > results/sam/SRR1972917.aligned.sam
Convert SAM file to BAM format
$ samtools view -S -b results/sam/SRR1972917.aligned.sam > results/bam/SRR1972917.aligned.bam
Sort BAM file by coordinates
$ samtools sort -o results/bam/SRR1972917.aligned.sorted.bam results/bam/SRR1972917.aligned.bam
$ samtools flagstat results/bam/SRR1972917.aligned.sorted.bam
Now lets view the header of our BAM file:
$ samtools view -H results/bam/SRR1972917.aligned.sorted.bam
@HD VN:1.5 SO:coordinate
@SQ SN:KJ660346.2 LN:18959
@PG ID:bwa PN:bwa VN:0.7.18-r1243-dirty CL:bwa mem /Users/tpaisie/variant_calling/data/ref_genome/KJ660346.2.fasta /Users/tpaisie/variant_calling/data/trimmed_fastq/SRR1972917_1.trim.fastq.gz /Users/tpaisie/variant_calling/data/trimmed_fastq/SRR1972917_2.trim.fastq.gz
@PG ID:samtools PN:samtools PP:bwa VN:1.20 CL:samtools view -S -b /Users/tpaisie/variant_calling/results/sam/SRR1972917.aligned.sam
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.20 CL:samtools sort -o /Users/tpaisie/variant_calling/results/bam/SRR1972917.aligned.sorted.bam /Users/tpaisie/variant_calling/results/bam/SRR1972917.aligned.bam
@PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.20 CL:samtools view -H results/bam/SRR1972917.aligned.sorted.bam
Now lets view a small part of our BAM file:
$ samtools view results/bam/SRR1972917.aligned.sorted.bam | head -1
Mapping Quality (MAPQ) and Compact Idiosyncratic Gapped Alignment Representation (CIGAR):
$ samtools view results/bam/SRR1972917.aligned.sorted.bam | cut -f 5,6 | grep -v "101" | head -5
60 32M
60 33M
49 31M
60 2S98M
60 3S98M
The values in the MAPQ (Mapping Quality) column here are all 60. This column was designed to indicate the likelihood of the alignment being placed incorrectly. It is the same Phred score that we encountered in the FASTQ files. And we read it the same way, 60/10 = 6 so the chance of seeing this alignment being wrong is 10^-6 or 1/1,000,000 one in a million.
The numbers that an aligner puts into the MAPQ field are typically estimates. It is not possible to mathematically compute this value. What this field does is to inform us on a guess by the aligner’s algorithm. This guess is more of a hunch that should not be treated as a continuous, numerical value. Instead, it should be thought of as an ordered label, like “not so good” or “pretty good”. Aligner developers even generate unique MAPQ qualities to mark individual cases. For example, bwa will create a MAPQ=0 if a read maps equally well to more than one location.
The CIGAR string is a different beast altogether. It is meant to represent the alignment via numbers followed by letters:
M
match of mismatchI
insertionD
deletionS
soft clipH
hard clipN
skippingThese are also meant to be “readable”; the 2S98M says that 2 bases are soft clipped and the next 98 are a match or mismatch.
The CIGAR representation is a neat concept, alas it was developed for short and well-matching reads. As soon as the reads show substantial differences the CIGAR representations are much more difficult to read.
There are also different variants of CIGAR. The “default” CIGAR encoding describes both the match and mismatch in the same way, with an M. There are some unfortunate rationales and explanations for having adopted this choice - one that complicates analysis even more.
The extended CIGAR encoding adopted by others uses the symbols of = and X to indicate matches and mismatches.
Calculate the read coverage of positions in the genome
$ bcftools mpileup -O b -o results/bcf/SRR1972917_raw.bcf -f data/ref_genome/KJ660346.2.fasta results/bam/SRR1972917.aligned.sorted.bam
--ploidy
, which is one for our Ebola genome. -m
allows for multiallelic and rare-variant calling, -v
tells the program to output variant sites only (not every site in the genome), and -o
specifies where to write the output file:$ bcftools call --ploidy 1 -m -v -o results/vcf/SRR1972917_variants.vcf results/bcf/SRR1972917_raw.bcf
$ vcfutils.pl varFilter results/vcf/SRR1972917_variants.vcf > results/vcf/SRR1972917_final_variants.vcf
Explore the VCF format:
$ less -S results/vcf/SRR1972917_final_variants.vcf
Or you can use the code
command to open the vcf file with VS code:
$ code results/vcf/SRR1972917_final_variants.vcf
$ samtools index results/bam/SRR1972917.aligned.sorted.bam
tview
In order to visualize our mapped reads, we use tview, giving it the sorted bam file and the reference file:
$ samtools tview results/bam/SRR1972917.aligned.sorted.bam data/ref_genome/KJ660346.2.fasta
.
indicates a match to the reference sequence, so we can see that the consensus from our sample matches the reference in most locations?
for a help menug
Ctrl^C
or q
to exit tviewViewing with IGV
#### IGV is a stand-alone browser, which has the advantage of being installed locally and providing fast access. Web-based genome browsers, like Ensembl or the UCSC browser, are slower, but provide more functionality
#### They not only allow for more polished and flexible visualization, but also provide easy access to a wealth of annotations and external data sources
#### This makes it straightforward to relate your data with information about repeat regions, known genes, epigenetic features or areas of cross-species conservation, to name just a few
$ for infile in *_1.fastq.gz
do
base=$(basename ${infile} _1.fastq.gz)
trimmomatic PE ${infile} ${base}_2.fastq.gz
${base}_1.trim.fastq.gz ${base}_1un.trim.fastq.gz
${base}_2.trim.fastq.gz ${base}_2un.trim.fastq.gz
SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
done
for
loop lesson)=
$variable_name
infile
, which was defined in the for statement, and base
, which was created from the filename during each iteration of the loop$ mkdir scripts
$ cd scripts/
$ touch run_variant_calling.sh
Instead of using the touch
command you can use code
to open in VS code:
$ code run_variant_calling.sh
#! bin/bash/
set -e
cd ~/variant_calling/results
genome=~/variant_calling/data/ref_genome/KJ660346.2.fasta
bwa index $genome
# makes directories (should already be made)
# mkdir -p sam bam bcf vcf
for fq1 in ~/variant_calling/data/trimmed_fastq/*_1.trim.fastq.gz
do
echo "working with file $fq1"
base=$(basename $fq1 _1.trim.fastq.gz)
echo "base name is $base"
fq1=~/variant_calling/data/trimmed_fastq/${base}_1.trim.fastq.gz
fq2=~/variant_calling/data/trimmed_fastq/${base}_2.trim.fastq.gz
sam=~/variant_calling/results/sam/${base}.aligned.sam
bam=~/variant_calling/results/bam/${base}.aligned.bam
sorted_bam=~/variant_calling/results/bam/${base}.aligned.sorted.bam
raw_bcf=~/variant_calling/results/bcf/${base}_raw.bcf
variants=~/variant_calling/results/vcf/${base}_variants.vcf
final_variants=~/variant_calling/results/vcf/${base}_final_variants.vcf
bwa mem $genome $fq1 $fq2 > $sam
samtools view -S -b $sam > $bam
samtools sort -o $sorted_bam $bam
samtools index $sorted_bam
bcftools mpileup -O b -o $raw_bcf -f $genome $sorted_bam
bcftools call --ploidy 1 -m -v -o $variants $raw_bcf
vcfutils.pl varFilter $variants > $final_variants
done
$ cd ~/variant_calling/results
$ genome=~/variant_calling/data/ref_genome/KJ660346.2.fasta
$ bwa index $genome
results
that are neededfor fq1 in ~/variant_calling/data/trimmed_fastq/*_1.trim.fastq.gz
do
echo "working with file $fq1"
base=$(basename $fq1 _1.trim.fastq.gz)
echo "base name is $base"
.# input fastq files
fq1=~/variant_calling/data/trimmed_fastq/${base}_1.trim.fastq.gz
fq2=~/variant_calling/data/trimmed_fastq/${base}_2.trim.fastq.gz
# output files
sam=~/variant_calling/results/sam/${base}.aligned.sam
bam=~/variant_calling/results/bam/${base}.aligned.bam
sorted_bam=~/variant_calling/results/bam/${base}.aligned.sorted.bam
raw_bcf=~/variant_calling/results/bcf/${base}_raw.bcf
variants=~/variant_calling/results/vcf/${base}_variants.vcf
final_variants=~/variant_calling/results/vcf/${base}_final_variants.vcf
$ bwa mem $genome $fq1 $fq2 > $sam
$ samtools view -S -b $sam > $bam
$ samtools sort -o $sorted_bam $bam
$ samtools index $sorted_bam
$ bcftools mpileup -O b -o $raw_bcf -f $genome $sorted_bam
$ bcftools call --ploidy 1 -m -v -o $variants $raw_bcf
$ vcfutils.pl varFilter $variants > $final_variants
$ bash run_variant_calling.sh