Genomics Data: A Never-Ending Treasure
Digital information has dominated the first two decades of the 21st century, and genomics is now joining the race and may soon take the lead.
The Beginning
Since the launch of the Human Genome Project (HGP) in 1990, public interest in the human genome has been immense, and it peaked at the beginning of this century. A decade after the launch, the International Human Genome Sequencing Consortium announced and published the first draft of the human genome, the first significant scientific achievement of the 21st century. Since then, scientists have accessed, distributed, manipulated, and analyzed the human genome in countless ways, generating massive amounts of data every year. Now, at the beginning of the third decade, genomics data is on its way to leading the “Big Data” world!
A Few Housekeeping Terms
DNA: Deoxyribonucleic acid (DNA) is a molecule that contains the biological instructions that make each species unique. It’s the substance of heredity.
Gene: A DNA segment containing the code for making a specific protein or RNA molecule. It’s a unit of heredity.
Genome: A genome is the complete set of genetic information in an organism. It provides all of the information the organism requires to function.
Big Data: A Modern-Age Gold Mine
Big data are massive, complex data sets acquired from new sources that traditional data processing software can’t manage or analyze.
Colloquially, we say that data is the new gold; however, mining that gold is becoming a cumbersome challenge for many scientific communities and industries because of a lack of adequate resources, expertise, and scalability.
According to SAS, a US-based analytics software company, there are three defining dimensions (the three V’s) of big data:
Volume: The amount of data collected from different sources.
Velocity: The speed at which data arrives, often in real time and at an unprecedented rate.
Variety: The data is multifaceted and comes in many formats, e.g., audio, video, text, and images, and may be structured, semi-structured, or unstructured.
The Big Data Generators: Astronomy, YouTube, Twitter vs. Genomics
Traditionally, astronomy was the world’s most extensive data generator until the advent of social media platforms. A quick glance at Internet Live Stats shows, in real time, how social media dominates the internet.
Undoubtedly, these platforms produce the largest share of data globally. Nonetheless, across the life cycle of a data set (data acquisition, storage, distribution, and analysis), genomics is emerging as what some call a ‘four-headed beast.’
Data Acquisition: Scientists have predicted that by 2025 astronomy, YouTube, and Twitter will generate about 1 zettabyte, 1-2 exabytes, and ~1.5 petabytes of data per year, respectively. In contrast, human genome sequencing alone is projected to produce ~20 zettabytes of data by 2025.
Mega: 10^6 (million)
Giga: 10^9 (billion)
Tera: 10^12 (trillion)
Peta: 10^15 (quadrillion)
Exa: 10^18 (quintillion)
Zetta: 10^21 (sextillion)
Data Storage: By 2025, astronomy, YouTube, and Twitter will need to store about 1 exabyte, 1-2 exabytes, and 1-17 petabytes, respectively, whereas genomics will require 2-40 exabytes of storage capacity for the human genome alone.
Data Distribution: The primary bandwidth requirements for astronomy, YouTube, and Twitter are about 600 terabytes/second, 204 petabytes/day, and ~0.5 gigabytes/hour, respectively. Genomics data, on the other hand, are distributed in movements ranging from many small transfers at about 10 MB/s up to massive ones at 10 TB/s.
Data Analysis: The non-genomic platforms mainly analyze data in real time, in parallel across thousands of cores. For genomics, however, a forecast shows that comparing the genomes of about 2.5 million species could amount to 50-100 trillion sequence alignments by 2025.
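To get a feel for these scales, the unit prefixes and the storage and distribution figures quoted above can be put side by side in a few lines of Python. This is only a rough sketch: the names PREFIX, STORAGE_2025, to_bytes, storage_ratio, and transfer_seconds are made up for illustration, and the numbers are simply the projections cited in this article, not new data.

```python
# Decimal (SI) prefixes from the list above, as powers of ten.
PREFIX = {
    "mega": 10**6,
    "giga": 10**9,
    "tera": 10**12,
    "peta": 10**15,
    "exa": 10**18,
    "zetta": 10**21,
}


def to_bytes(value, prefix):
    """Convert a value such as (17, "peta") into a plain number of bytes."""
    return value * PREFIX[prefix]


# Projected 2025 storage needs as (low, high) ranges in bytes.
# These are the figures quoted in this article, not independent estimates.
STORAGE_2025 = {
    "Astronomy": (to_bytes(1, "exa"), to_bytes(1, "exa")),
    "YouTube": (to_bytes(1, "exa"), to_bytes(2, "exa")),
    "Twitter": (to_bytes(1, "peta"), to_bytes(17, "peta")),
    "Genomics": (to_bytes(2, "exa"), to_bytes(40, "exa")),
}


def storage_ratio(source, reference="Genomics"):
    """How many times larger the reference's upper bound is than the source's."""
    return STORAGE_2025[reference][1] / STORAGE_2025[source][1]


def transfer_seconds(size_bytes, rate_bytes_per_second):
    """Time needed to move size_bytes at a constant transfer rate."""
    return size_bytes / rate_bytes_per_second


if __name__ == "__main__":
    for name, (low, high) in STORAGE_2025.items():
        print(f"{name:10s} {low:.1e} to {high:.1e} bytes")

    for name in ("Astronomy", "YouTube", "Twitter"):
        print(f"Genomics upper bound vs {name}: {storage_ratio(name):,.0f}x")

    # Moving one petabyte at the two genomics transfer rates mentioned above.
    one_petabyte = to_bytes(1, "peta")
    slow = transfer_seconds(one_petabyte, to_bytes(10, "mega"))
    fast = transfer_seconds(one_petabyte, to_bytes(10, "tera"))
    print(f"1 PB at 10 MB/s: {slow / 86400:,.0f} days")
    print(f"1 PB at 10 TB/s: {fast:,.0f} seconds")
```

Comparing only the upper bounds is a deliberate simplification. The ranges are wide, so the ratios are best read as orders of magnitude rather than precise multiples.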
A Promising Future Ahead
Astronomy, YouTube, and Twitter are three of the dominant big data sources.
Yet, thanks to remarkable advances in sequencing technologies, genomics has already generated enormous amounts of data over the past decades. And, given the recent completion of the gapless human genome sequence, genomics is emerging as the next leader of “Big Data” faster than ever.