 
https://taylorpaisie.github.io/VEME_2024_NGS_Variant_Calling/ 
     
    $ mkdir -p ~/variant_calling/data/untrimmed_fastq$ cd ~/variant_calling/data/untrimmed_fastq$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/007/SRR1972917/SRR1972917_1.fastq.gz $ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/007/SRR1972917/SRR1972917_2.fastq.gz$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/008/SRR1972918/SRR1972918_1.fastq.gz$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/008/SRR1972918/SRR1972918_2.fastq.gz$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/009/SRR1972919/SRR1972919_1.fastq.gz$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/009/SRR1972919/SRR1972919_2.fastq.gz$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/000/SRR1972920/SRR1972920_1.fastq.gz$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/000/SRR1972920/SRR1972920_2.fastq.gz$ gunzip SRR1972917_1.fastq.gzWe can view the first complete read in one of the files our dataset by using head to look at the first four lines:
  $ head -n 4 SRR1972917_1.fastq
@SRR1972917.1 1/1
TCCGTGGGGCTGGTACGACAGTATCGATGAGGGTGGACGCTTCAAGGTCAAGCGTATACAGGTCAACCCCAAAGCTAGCCTGAGCCTTCAGAAACACCACC  
+  
@CCFDDFFHGHHHGIJJIIJJJJGJJJJGIJIJJFIJJJIIJJJJHHFHHFFFFAADEFEDDDDDDDDDDDD??CCCDDDDDDCCCDDCCCDD:?CABDDB
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
             |         |         |         |         |  
Quality score: 01........11........21........31........41
$ gzip SRR1972917_1.fastq
$ fastqc -h
 $ fastqc *.fastq*
application/gzip   
 Started analysis of SRR1972917_1.fastq.gz  
 Approx 5% complete for SRR1972917_1.fastq.gz  
 Approx 10% complete for SRR1972917_1.fastq.gz  
 Approx 15% complete for SRR1972917_1.fastq.gz  
 Approx 20% complete for SRR1972917_1.fastq.gz  
 Approx 25% complete for SRR1972917_1.fastq.gz  
 Approx 30% complete for SRR1972917_1.fastq.gz  
 Approx 35% complete for SRR1972917_1.fastq.gz  
 Approx 40% complete for SRR1972917_1.fastq.gz  
 Approx 45% complete for SRR1972917_1.fastq.gz  
 Approx 50% complete for SRR1972917_1.fastq.gz  
 Approx 55% complete for SRR1972917_1.fastq.gz  
 Approx 60% complete for SRR1972917_1.fastq.gz  
 Approx 65% complete for SRR1972917_1.fastq.gz  
 Approx 70% complete for SRR1972917_1.fastq.gz  
 Approx 75% complete for SRR1972917_1.fastq.gz  
 Approx 80% complete for SRR1972917_1.fastq.gz  
 Approx 85% complete for SRR1972917_1.fastq.gz  
 Approx 90% complete for SRR1972917_1.fastq.gz  
 Approx 95% complete for SRR1972917_1.fastq.gz  
 Analysis complete for SRR1972917_1.fastq.gz
Lets now look at the files created by FastQC:
 $ ls -al
$ cd ~/variant_calling
$ mkdir -p results/fastqc_untrimmed_reads
$ mv ~/variant_calling/data/untrimmed_fastq/*.zip ~/variant_calling/results/fastqc_untrimmed_reads
$ mv ~/variant_calling/data/untrimmed_fastq/*.html ~/variant_calling/results/fastqc_untrimmed_reads
$ cd ~/variant_calling/results/fastqc_untrimmed_reads
 
    Usually not.
When FastQC runs, it generates “stoplight” icons for each analysis having “pass,” “warning,” and “error” symbols. Most of the time, these symbols are not meaningful. They were developed for only a particular class of samples and library preparation methods and just for certain types of instruments.
Although if most or all your stop-light icons are red, then you probably have a data problem.
 
    The simplest way to snapshot the quality of a sequencing run is via a chart that plots the error likelihood at each position averaged over all measurements.
The vertical axis is the FASTQ scores that represent error probabilities:
The three-colored bands illustrate the typical labels that we assign to these measures: reliable (30-40, green), less reliable (20-30, yellow) and error-prone (1-20, red). The yellow boxes contain 50% of the data, the whiskers indicate the 75% outliers.
 
    The sequence length distribution shows how many sequences of each length the data contains. For fixed read length instruments, like the Illumina sequencer, all read lengths are the same. For long read technologies like the PacBio and MinION, the distribution can be a lot more varied.
 
    Another way to visualize data quality is to generate histograms of the average qualities. The horizontal scales are the quality scores; the vertical axis indicates the number of reads of that quality.
$ trimmomatic  
    Usage:  
       PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...  
   or:  
       SE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] <inputFile> <outputFile> <trimmer1>...  
   or:  
       -version 
While, in single end mode, Trimmomatic will expect 1 file as input, after which you can enter the optional settings and lastly the name of the output file
Running Trimmomatic:
Move to the correct directory with untrimmed fastq files we downloaded:
 $ cd ~/variant_calling/data/untrimmed_fastq
Copy Illumina adapters from Trimmomatic into working directory:
 $ cp /home/gitpod/miniconda/envs/variant_calling/share/trimmomatic-0.39-2/adapters/NexteraPE-PE.fa .
Run Trimmomatic:
$ trimmomatic PE SRR1972917_1.fastq.gz SRR1972917_2.fastq.gz SRR1972917_1.trim.fastq.gz SRR1972917_1un.trim.fastq.gz SRR1972917_2.trim.fastq.gz SRR1972917_2un.trim.fastq.gz SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
 TrimmomaticPE: Started with arguments:
 SRR1972917_1.fastq.gz SRR1972917_2.fastq.gz SRR1972917_1.trim.fastq.gz SRR1972917_1un.trim.fastq.gz SRR1972917_2.trim.fastq.gz SRR1972917_2un.trim.fastq.gz SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
 Multiple cores found: Using 4 threads
 Using PrefixPair: 'AGATGTGTATAAGAGACAG' and 'AGATGTGTATAAGAGACAG'
 Using Long Clipping Sequence: 'GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG'
 Using Long Clipping Sequence: 'TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG'
 Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
 Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTGACGCTGCCGACGA'
 ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
 Quality encoding detected as phred33
 Input Read Pairs: 4377867 Both Surviving: 1241328 (28.35%) Forward Only Surviving: 2133670 (48.74%) Reverse Only Surviving: 18680 (0.43%) Dropped: 984189 (22.48%)
 TrimmomaticPE: Completed successfully
List out files created by Trimmomatic:
 $ ls SRR1972917*
Trimmed files should be smaller in size than our untrimmed fastq files
Running a for loop on all fastq files
$ for infile in *_1.fastq.gz
    do
        base=$(basename ${infile} _1.fastq.gz)
        trimmomatic PE ${infile} ${base}_2.fastq.gz \
        ${base}_1.trim.fastq.gz ${base}_1un.trim.fastq.gz \
        ${base}_2.trim.fastq.gz ${base}_2un.trim.fastq.gz \
        SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
    done
$ cd ~/variant_calling/data/untrimmed_fastq
 $ mkdir -p ~/variant_calling/data/trimmed_fastq
 $ mv *trim* ~/variant_calling/data/trimmed_fastq
 $ cd ~/variant_calling/data/trimmed_fastq
 $ ls -al
$ fastqc *trim.fastq.gzKJ660346.2 and download fasta file$ mkdir -p ~/variant_calling/data/ref_genome
 $ cd ~/variant_calling/data/ref_genome
 $ wget -nv https://raw.githubusercontent.com/taylorpaisie/VEME_2024_NGS_Variant_Calling/main/KJ660346.2.fasta -O KJ660346.2.fasta
$ cd ~/variant_calling/$ mkdir -p results/sam results/bam results/bcf results/vcf$ bwa index data/ref_genome/KJ660346.2.fastaWe will use the BWA-MEM algorithm, which is the latest and is generally recommended for high-quality queries as it is faster and more accurate
 $ bwa mem data/ref_genome/KJ660346.2.fasta data/trimmed_fastq/SRR1972917_1.trim.fastq.gz data/trimmed_fastq/SRR1972917_2.trim.fastq.gz > results/sam/SRR1972917.aligned.sam
Convert SAM file to BAM format
 $ samtools view -S -b results/sam/SRR1972917.aligned.sam > results/bam/SRR1972917.aligned.bam
Sort BAM file by coordinates
 $ samtools sort -o results/bam/SRR1972917.aligned.sorted.bam results/bam/SRR1972917.aligned.bam
$ samtools flagstat results/bam/SRR1972917.aligned.sorted.bamNow lets view the header of our BAM file:
$ samtools view -H results/bam/SRR1972917.aligned.sorted.bam
@HD	VN:1.5	SO:coordinate
@SQ	SN:KJ660346.2	LN:18959
@PG	ID:bwa	PN:bwa	VN:0.7.18-r1243-dirty	CL:bwa mem /Users/tpaisie/variant_calling/data/ref_genome/KJ660346.2.fasta /Users/tpaisie/variant_calling/data/trimmed_fastq/SRR1972917_1.trim.fastq.gz /Users/tpaisie/variant_calling/data/trimmed_fastq/SRR1972917_2.trim.fastq.gz
@PG	ID:samtools	PN:samtools	PP:bwa	VN:1.20	CL:samtools view -S -b /Users/tpaisie/variant_calling/results/sam/SRR1972917.aligned.sam
@PG	ID:samtools.1	PN:samtools	PP:samtools	VN:1.20	CL:samtools sort -o /Users/tpaisie/variant_calling/results/bam/SRR1972917.aligned.sorted.bam /Users/tpaisie/variant_calling/results/bam/SRR1972917.aligned.bam
@PG	ID:samtools.2	PN:samtools	PP:samtools.1	VN:1.20	CL:samtools view -H results/bam/SRR1972917.aligned.sorted.bam
Now lets view a small part of our BAM file:
$ samtools view results/bam/SRR1972917.aligned.sorted.bam | head -1
Mapping Quality (MAPQ) and Compact Idiosyncratic Gapped Alignment Representation (CIGAR):
$ samtools view results/bam/SRR1972917.aligned.sorted.bam | cut -f 5,6 | grep -v "101" | head -5
60	32M
60	33M
49	31M
60	2S98M
60	3S98M
The values in the MAPQ (Mapping Quality) column here are all 60. This column was designed to indicate the likelihood of the alignment being placed incorrectly. It is the same Phred score that we encountered in the FASTQ files. And we read it the same way, 60/10 = 6 so the chance of seeing this alignment being wrong is 10^-6 or 1/1,000,000 one in a million.
The numbers that an aligner puts into the MAPQ field are typically estimates. It is not possible to mathematically compute this value. What this field does is to inform us on a guess by the aligner’s algorithm. This guess is more of a hunch that should not be treated as a continuous, numerical value. Instead, it should be thought of as an ordered label, like “not so good” or “pretty good”. Aligner developers even generate unique MAPQ qualities to mark individual cases. For example, bwa will create a MAPQ=0 if a read maps equally well to more than one location.
The CIGAR string is a different beast altogether. It is meant to represent the alignment via numbers followed by letters:
M match of mismatchI insertionD deletionS soft clipH hard clipN skippingThese are also meant to be “readable”; the 2S98M says that 2 bases are soft clipped and the next 98 are a match or mismatch.
The CIGAR representation is a neat concept, alas it was developed for short and well-matching reads. As soon as the reads show substantial differences the CIGAR representations are much more difficult to read.
There are also different variants of CIGAR. The “default” CIGAR encoding describes both the match and mismatch in the same way, with an M. There are some unfortunate rationales and explanations for having adopted this choice - one that complicates analysis even more.
The extended CIGAR encoding adopted by others uses the symbols of = and X to indicate matches and mismatches.
Calculate the read coverage of positions in the genome
$ bcftools mpileup -O b -o results/bcf/SRR1972917_raw.bcf -f data/ref_genome/KJ660346.2.fasta results/bam/SRR1972917.aligned.sorted.bam 
--ploidy, which is one for our Ebola genome. -m allows for multiallelic and rare-variant calling, -v tells the program to output variant sites only (not every site in the genome), and -o specifies where to write the output file:$ bcftools call --ploidy 1 -m -v -o results/vcf/SRR1972917_variants.vcf results/bcf/SRR1972917_raw.bcf$ vcfutils.pl varFilter results/vcf/SRR1972917_variants.vcf  > results/vcf/SRR1972917_final_variants.vcfExplore the VCF format:
 $ less -S results/vcf/SRR1972917_final_variants.vcf
Or you can use the code command to open the vcf file with VS code:
 $ code results/vcf/SRR1972917_final_variants.vcf
 
 
 
$ samtools index results/bam/SRR1972917.aligned.sorted.bamtview
    In order to visualize our mapped reads, we use tview, giving it the sorted bam file and the reference file:
 $ samtools tview results/bam/SRR1972917.aligned.sorted.bam data/ref_genome/KJ660346.2.fasta
. indicates a match to the reference sequence, so we can see that the consensus from our sample matches the reference in most locations? for a help menugCtrl^C or q to exit tviewViewing with IGV
 #### IGV is a stand-alone browser, which has the advantage of being installed locally and providing fast access. Web-based genome browsers, like Ensembl or the UCSC browser, are slower, but provide more functionality
 #### They not only allow for more polished and flexible visualization, but also provide easy access to a wealth of annotations and external data sources
 #### This makes it straightforward to relate your data with information about repeat regions, known genes, epigenetic features or areas of cross-species conservation, to name just a few
 
 $ for infile in *_1.fastq.gz
    do
        base=$(basename ${infile} _1.fastq.gz)  
        trimmomatic PE ${infile} ${base}_2.fastq.gz  
        ${base}_1.trim.fastq.gz ${base}_1un.trim.fastq.gz  
        ${base}_2.trim.fastq.gz ${base}_2un.trim.fastq.gz  
        SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15   
    done
for loop lesson)=$variable_nameinfile, which was defined in the for statement, and base, which was created from the filename during each iteration of the loop$ mkdir scripts
$ cd scripts/
$ touch run_variant_calling.sh
Instead of using the touch command you can use code to open in VS code:
$ code run_variant_calling.sh
#! bin/bash/
set -e
cd ~/variant_calling/results
genome=~/variant_calling/data/ref_genome/KJ660346.2.fasta
bwa index $genome
# makes directories (should already be made)
# mkdir -p sam bam bcf vcf
for fq1 in ~/variant_calling/data/trimmed_fastq/*_1.trim.fastq.gz
    do
    echo "working with file $fq1"
    base=$(basename $fq1 _1.trim.fastq.gz)
    echo "base name is $base"
    fq1=~/variant_calling/data/trimmed_fastq/${base}_1.trim.fastq.gz
    fq2=~/variant_calling/data/trimmed_fastq/${base}_2.trim.fastq.gz
    sam=~/variant_calling/results/sam/${base}.aligned.sam
    bam=~/variant_calling/results/bam/${base}.aligned.bam
    sorted_bam=~/variant_calling/results/bam/${base}.aligned.sorted.bam
    raw_bcf=~/variant_calling/results/bcf/${base}_raw.bcf
    variants=~/variant_calling/results/vcf/${base}_variants.vcf
    final_variants=~/variant_calling/results/vcf/${base}_final_variants.vcf 
    bwa mem $genome $fq1 $fq2 > $sam
    samtools view -S -b $sam > $bam
    samtools sort -o $sorted_bam $bam 
    samtools index $sorted_bam
    bcftools mpileup -O b -o $raw_bcf -f $genome $sorted_bam
    bcftools call --ploidy 1 -m -v -o $variants $raw_bcf 
    vcfutils.pl varFilter $variants > $final_variants
   
    done
$ cd ~/variant_calling/results
$ genome=~/variant_calling/data/ref_genome/KJ660346.2.fasta
$ bwa index $genome
results that are neededfor fq1 in ~/variant_calling/data/trimmed_fastq/*_1.trim.fastq.gz
    do
    echo "working with file $fq1"
    base=$(basename $fq1 _1.trim.fastq.gz)
    echo "base name is $base"
    .# input fastq files
    fq1=~/variant_calling/data/trimmed_fastq/${base}_1.trim.fastq.gz
    fq2=~/variant_calling/data/trimmed_fastq/${base}_2.trim.fastq.gz
        
    # output files
    sam=~/variant_calling/results/sam/${base}.aligned.sam
    bam=~/variant_calling/results/bam/${base}.aligned.bam
    sorted_bam=~/variant_calling/results/bam/${base}.aligned.sorted.bam
    raw_bcf=~/variant_calling/results/bcf/${base}_raw.bcf
    variants=~/variant_calling/results/vcf/${base}_variants.vcf
    final_variants=~/variant_calling/results/vcf/${base}_final_variants.vcf   
$ bwa mem $genome $fq1 $fq2 > $sam
$ samtools view -S -b $sam > $bam
$ samtools sort -o $sorted_bam $bam
$ samtools index $sorted_bam
$ bcftools mpileup -O b -o $raw_bcf -f $genome $sorted_bam
$ bcftools call --ploidy 1 -m -v -o $variants $raw_bcf
$ vcfutils.pl varFilter $variants > $final_variants
$ bash run_variant_calling.sh