In the mid-1990s, Shankar Balasubramanian and David Klenerman, scientists at Cambridge, developed methods to produce high-quality reads at a much greater data volume and a reduced cost. In this method, single DNA molecules are attached to a flat surface, amplified in situ, and sequenced using fluorescent reversible terminator deoxyribonucleotides.
The fluorescent signals generated during the reaction are recorded as images. Finally, the images of the surface are analyzed and processed to generate high-quality sequence data (Bentley et al. 2008). The researchers later founded the company Solexa, which was acquired by Illumina in 2007, and the technology has since been referred to as Illumina sequencing technology.
These methods were commercialized by Solexa as the Illumina/Solexa Genome Analyzer (GA) (Balasubramanian 2015; Shendure and Ji 2008). Presently, the company offers the MiSeq, NextSeq 500, and HiSeq 2500 platforms, which produce 15 Gb, 120 Gb, and 1000 Gb of sequencing data per run with maximum read lengths of 2 × 300 bp, 2 × 150 bp, and 2 × 125 bp, respectively.
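The output and read-length figures quoted above can be turned into a rough number of read pairs per run. The following is a minimal sketch, assuming that every read reaches the maximum length and that the quoted output is reached in full; the function and dictionary names are illustrative only.

```python
# Rough estimate of read pairs per run from the output and read-length
# figures quoted in the text. Assumption: every read in a pair reaches
# the maximum read length and the full quoted output is obtained.

PLATFORMS = {
    # name: (output per run in gigabases, read length in bp per end)
    "MiSeq": (15, 300),
    "NextSeq 500": (120, 150),
    "HiSeq 2500": (1000, 125),
}


def read_pairs_per_run(output_gb: float, read_length_bp: int) -> float:
    """Return the approximate number of paired-end read pairs in one run."""
    bases_per_pair = 2 * read_length_bp  # forward read plus reverse read
    return output_gb * 1e9 / bases_per_pair


for name, (output_gb, read_length_bp) in PLATFORMS.items():
    pairs = read_pairs_per_run(output_gb, read_length_bp)
    print(f"{name}: ~{pairs / 1e6:.0f} million read pairs")
```

Under these assumptions the MiSeq figure corresponds to roughly 25 million read pairs and the HiSeq 2500 figure to roughly 4 billion.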
Illumina sequencing is currently the most popular technology in the NGS market and accounts for more than 90% of the world’s sequencing data (Illumina 2017). The Illumina method uses sequencing-by-synthesis chemistry, combining bridge amplification on a solid surface (Adessi et al. 2000), developed by Manteia Predictive Medicine, with the reversible terminator chemistry and engineered polymerases (Bennett 2004) established by Solexa.
The general workflow of Illumina sequencing has four basic steps: library preparation, cluster generation, sequencing, and data analysis. During library preparation, the DNA or cDNA sample is randomly fragmented, and adapters and index sequences are ligated to the 5′ and 3′ ends of the fragments. The adapter-ligated fragments are PCR amplified and gel purified. For cluster generation, the library of adapter-ligated fragments is loaded onto a flow cell, where the fragments are captured on a surface coated with oligos complementary to the adapters.
Each captured fragment is then amplified by bridge amplification into many identical copies, which together form a “clonal cluster” representing the same original sequence. Cluster generation ensures that sufficient signal is produced during the imaging process.
Next, the templates are sequenced by sequencing by synthesis, a reversible terminator method that detects single bases as they are incorporated into the strand growing on each DNA template. Here, DNA polymerase adds one of four different fluorescently labeled nucleotides to a growing DNA chain (Bentley et al. 2008). The 3′ hydroxyl group of each modified nucleotide is capped with a blocking group, which ensures that only one nucleotide is incorporated into the growing DNA chain per cycle.
The clusters are excited by a laser and emit a characteristic light signal specific to each incorporated nucleotide. The optical signals are detected by a charge-coupled device (CCD) camera, and computer programs translate these signals into a nucleotide sequence. Subsequently, the 3′ blocking group and the fluorescent dye are removed from the nucleotide to enable the addition of nucleotides in the next cycle. The number of cycles determines the length of the read.
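The cycle described above (incorporate one blocked nucleotide, image, cleave, repeat) can be illustrated with a toy model. This is a simplified sketch of a single, error-free cluster, not a description of the instrument software; the function name and the convention that the template is written 3'→5' are assumptions made for the example.

```python
# Toy model of sequencing by synthesis on one error-free cluster.
# Assumption: the template is given in 3'->5' orientation, so the read
# is simply the complement of the template, base by base.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template: str, cycles: int) -> str:
    """Return the read obtained after `cycles` cycles on `template`."""
    read = []
    for position in range(min(cycles, len(template))):
        # The 3' block guarantees exactly one incorporation per cycle.
        incorporated = COMPLEMENT[template[position]]
        # Imaging records which of the four dyes lights up.
        read.append(incorporated)
        # Cleaving the dye and the 3' block lets the next cycle proceed;
        # in this toy model that step is just the loop advancing.
    return "".join(read)


print(sequence_by_synthesis("TACGGA", cycles=4))
```

The read can never be longer than the number of cycles run, which is the sense in which the cycle count sets the read length.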
The emission wavelength together with the signal intensity determines the base call. For a given cluster, all identical strands are read simultaneously, and millions of clusters are sequenced in a massively parallel process. The entire process generates millions of reads representing all the fragments (Reuter et al. 2015; Heo 2015).
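How wavelength and intensity combine into a base call can be sketched as follows. The example assumes one intensity value per nucleotide channel for each cycle and calls the brightest channel; the `purity` measure and the 0.6 cut-off are hypothetical stand-ins, not the filter used by Illumina software.

```python
# Toy base caller: one intensity per channel (A, C, G, T) per cycle.
# Assumption: the brightest channel identifies the incorporated base.
# `purity` and `min_purity` are hypothetical, for illustration only.

CHANNELS = "ACGT"


def call_base(intensities):
    """Return (base, purity) for one cluster in one cycle."""
    total = sum(intensities)
    if total <= 0:
        return "N", 0.0
    brightest = max(range(len(CHANNELS)), key=lambda i: intensities[i])
    return CHANNELS[brightest], intensities[brightest] / total


def call_read(intensity_cycles, min_purity=0.6):
    """Call one base per cycle; emit 'N' when the signal is too mixed."""
    bases = []
    for intensities in intensity_cycles:
        base, purity = call_base(intensities)
        bases.append(base if purity >= min_purity else "N")
    return "".join(bases)


cycles = [
    [0.90, 0.05, 0.03, 0.02],  # clean A signal
    [0.10, 0.10, 0.70, 0.10],  # clean G signal
    [0.30, 0.30, 0.20, 0.20],  # mixed signal, no confident call
]
print(call_read(cycles))
```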
For data analysis, sequences from the pooled sample libraries are separated based on the unique indices introduced during library preparation. For each sample, reads with similar stretches of base calls are locally clustered, and forward and reverse reads are paired to produce contiguous sequences. These contiguous sequences are aligned to the reference genome for species or variant identification.
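The separation of reads by index can be illustrated with a short sketch. It assumes that each read arrives together with its index sequence and that only exact index matches are accepted; the index sequences and sample names are hypothetical.

```python
# Minimal demultiplexing sketch: group reads by their index sequence.
# Assumptions: exact index matching only; indices and sample names
# below are made up for the example.

from collections import defaultdict


def demultiplex(reads, index_to_sample):
    """Group (index_sequence, read_sequence) pairs by sample name."""
    by_sample = defaultdict(list)
    for index_seq, read_seq in reads:
        # Reads whose index is not in the sample sheet are set aside.
        sample = index_to_sample.get(index_seq, "undetermined")
        by_sample[sample].append(read_seq)
    return dict(by_sample)


index_to_sample = {"ATCACG": "sample_1", "CGATGT": "sample_2"}
reads = [
    ("ATCACG", "GGCTTAAC"),
    ("CGATGT", "TTAGGCAT"),
    ("ATCACG", "ACGTACGT"),
    ("NNNNNN", "CCCCAAAA"),
]
for sample, sequences in demultiplex(reads, index_to_sample).items():
    print(sample, sequences)
```

Production tools usually also tolerate a small number of index mismatches; that refinement is omitted here to keep the example short.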
The pioneering Illumina/Solexa GA sequencers produced very short reads of ~35 bp and had the distinct advantage of generating paired-end (PE) short reads, in which the sequences at both ends of each DNA cluster are recorded. Further refinement and optimization led to the latest generation of Illumina SBS-based instruments, which can generate multiple terabases (Tb) of data per run.
The latest Illumina sequencers produce more than 600 Gb of output data with short reads of about 125 bp. Illumina platforms are reported to have 99.9% accuracy, and with standard reagents, 96 samples can be barcoded per run (Morey et al. 2013).
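Per-base accuracy is commonly expressed on the Phred scale, Q = −10 log10(P), where P is the probability that a base call is wrong. The short sketch below converts between the two; the function names are illustrative.

```python
# Conversion between per-base error probability and Phred quality,
# using the standard definition Q = -10 * log10(P).

import math


def phred_quality(error_probability: float) -> float:
    """Return the Phred quality score for a per-base error probability."""
    return -10 * math.log10(error_probability)


def error_probability(phred: float) -> float:
    """Return the per-base error probability for a Phred quality score."""
    return 10 ** (-phred / 10)


accuracy = 0.999  # the 99.9% accuracy figure quoted in the text
print(f"Q{phred_quality(1 - accuracy):.0f}")
```

On this scale the 99.9% accuracy quoted above corresponds to a quality score of about Q30.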
Illumina sequencing technology has its own advantages and disadvantages. Library preparation takes less than 90 min, which is shorter than on earlier platforms (Illumina 2014). The technology has significantly increased data throughput while reducing the cost and time of each run (Buermans and Den Dunnen 2014).
The error rates of the Illumina method are very low, which is attributed to the competition among all four reversible terminator-bound dNTPs present during each sequencing cycle. The highly accurate base-by-base sequencing is made possible by the blocking groups, which also minimize errors even in homopolymer regions (Ross et al. 2013; Bentley et al. 2008).
Therefore, Illumina sequencing platforms are better suited to sequencing homopolymeric regions than other platforms (Mardis 2013). Despite its superior performance, the Illumina/Solexa platform also has some limitations.
One of the major problems with the Illumina/Solexa platform is controlling sample loading, as overloading may produce overlapping clusters and hence poor sequencing quality. The error rate of the technology is about 1%, and nucleotide substitutions are the most frequent type of error (Dohm et al. 2008; Hutchison 2007).
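To give a feel for what a 1% per-base error rate means for a whole read, the following sketch computes the expected number of errors and the fraction of error-free reads. It assumes that errors are independent and uniformly distributed along the read, which is a simplification; the function names are illustrative.

```python
# What a per-base error rate implies for a whole read.
# Assumption: errors are independent and uniform along the read.


def expected_errors(read_length: int, error_rate: float) -> float:
    """Return the expected number of miscalled bases in one read."""
    return read_length * error_rate


def error_free_fraction(read_length: int, error_rate: float) -> float:
    """Return the probability that a read contains no miscalled base."""
    return (1 - error_rate) ** read_length


read_length = 125  # bp, the short read length mentioned in the text
for rate in (0.01, 0.001):
    print(
        f"error rate {rate:.1%}: "
        f"{expected_errors(read_length, rate):.2f} expected errors, "
        f"{error_free_fraction(read_length, rate):.0%} error-free reads"
    )
```

Under this simplified model, a 125 bp read at a 1% error rate carries 1.25 expected errors, and fewer than a third of such reads are entirely error-free.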
The efficiency of the sequencing reactions can be reduced by protein contamination and by altered nucleotide structures resulting from errors in the cleavage of the blocking group (Chen et al. 2013). Bridge amplification is also sensitive to variation in the GC content of the DNA. This is evident from the fact that GC-rich regions of heterogeneous genomes are underrepresented in sequences obtained with the Illumina method (Tilak et al. 2018).
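GC content, the quantity behind this bias, is simple to compute, and a windowed profile can help flag regions where underrepresentation might be expected. The sketch below is illustrative only; the function names and the window size are assumptions.

```python
# GC content of a sequence and of consecutive windows along it.
# Assumption: only unambiguous bases (A, C, G, T) are counted.


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among the unambiguous bases."""
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def windowed_gc(sequence: str, window: int):
    """Return the GC content of consecutive non-overlapping windows."""
    last_start = len(sequence) - window
    return [
        gc_content(sequence[start:start + window])
        for start in range(0, last_start + 1, window)
    ]


print(gc_content("GGCCAATT"))
print(windowed_gc("GGCCAATT", window=4))
```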
The major source of error in Illumina sequencing is known as phasing. Briefly, phasing occurs when the blocking group of a nucleotide is not properly removed after signal detection. The remaining block prevents a new nucleotide from being added to the DNA fragment in the next cycle, so the old nucleotide is detected again, and its fluorescence signal differs from the synchronous signal of the other strands in the cluster. The affected DNA fragment is then one cycle behind the rest (out of phase) and generates asynchronous light signals that are read by the camera.
Since signal intensity is the measure used to calculate quality scores, the “out of sync” signal lowers the sequence quality score. This creates the major limitations of Illumina sequencing, i.e., restricted read length and compromised quality, which present considerable hurdles in various applications, especially de novo sequencing (Chen et al. 2013).
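The link between phasing and read length can be made concrete with a simple model. The sketch assumes that each strand in a cluster independently falls one cycle behind with a fixed probability per cycle and never catches up; the 0.5% phasing rate and the 50% threshold are hypothetical values chosen for illustration, not measured instrument parameters.

```python
# Toy phasing model: fraction of strands in a cluster still in phase.
# Assumptions: a fixed, independent per-cycle probability of lagging,
# and a lagging strand never returns to phase. The rate and threshold
# used below are hypothetical.

import math


def in_phase_fraction(cycle: int, phasing_rate: float) -> float:
    """Return the fraction of strands still in phase after `cycle` cycles."""
    return (1.0 - phasing_rate) ** cycle


def cycles_above_threshold(threshold: float, phasing_rate: float) -> int:
    """Return the last cycle at which the in-phase fraction >= threshold."""
    return math.floor(math.log(threshold) / math.log(1.0 - phasing_rate))


phasing_rate = 0.005  # hypothetical: 0.5% of strands lag in each cycle
for cycle in (35, 125, 300):
    fraction = in_phase_fraction(cycle, phasing_rate)
    print(f"cycle {cycle}: {fraction:.0%} of strands in phase")
print(cycles_above_threshold(0.5, phasing_rate))
```

Because the in-phase fraction decays geometrically with the cycle number, the usable signal shrinks steadily as the read gets longer, which is consistent with the read length limitation described above.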