Video
Accelerating Next-Generation Genomics Pipelines with Modern Flash

13:21 Video

Watch this session to see how organisations can speed up their genomics pipelines and reduce turnaround time for clinical sequencing by up to 24x.

00:01

Hello and welcome to my Accelerate session on accelerating next-generation genomics pipelines with modern flash. My name is Lucas Bobrov and I am a principal data architect at Pure Storage. Whether it is genomics sequencing or next-generation sequencing, they are all targeting similar workloads.

00:31

An individual's genome is roughly 100 gigabytes in raw data size. That grows to a total data footprint of over 225 GB after analysis, which is a lot larger than one would assume. The main supplier of sequencers, Illumina, has many models that customers use. The NovaSeq 6000 can generate roughly 300 terabytes of data every

00:57

year, and customers sometimes have multiple sequencers running at once. Secondary analysis requires hundreds of core hours and high throughput from storage for processing. Next-generation sequencing is getting faster and cheaper. It requires a modern architecture for these massively parallel DNA sequencing methods.
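
As a rough back-of-the-envelope sketch of the figures just quoted (about 100 GB raw per genome, over 225 GB after analysis, roughly 300 TB per sequencer per year), the following Python estimates how storage needs add up when several sequencers run at once. The function name, the decimal unit conversion, and the assumption that the 300 TB figure is raw output are illustrative assumptions, not statements from the session.

```python
RAW_GB_PER_GENOME = 100        # raw data per genome, as quoted in the session
ANALYZED_GB_PER_GENOME = 225   # footprint after analysis (quoted as "over 225 GB")
SEQUENCER_TB_PER_YEAR = 300    # approximate yearly output of one sequencer

GB_PER_TB = 1000  # decimal units; an assumption made for this sketch


def yearly_footprint_tb(num_sequencers: int) -> dict:
    """Estimate yearly raw and post-analysis storage for a group of sequencers.

    Assumes (hypothetically) that the 300 TB/year figure is raw output and that
    every genome grows by the same 225/100 ratio during analysis.
    """
    raw_tb = num_sequencers * SEQUENCER_TB_PER_YEAR
    genomes = raw_tb * GB_PER_TB / RAW_GB_PER_GENOME
    analyzed_tb = genomes * ANALYZED_GB_PER_GENOME / GB_PER_TB
    return {"raw_tb": raw_tb, "genomes": genomes, "analyzed_tb": analyzed_tb}


if __name__ == "__main__":
    for n in (1, 2, 4):
        est = yearly_footprint_tb(n)
        print(f"{n} sequencer(s): {est['raw_tb']} TB raw, "
              f"~{est['genomes']:.0f} genomes, "
              f"~{est['analyzed_tb']:.0f} TB after analysis")
```

Under these assumptions the post-analysis footprint is a little more than double the raw output, which is why capacity planning based only on sequencer output tends to fall short.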

01:24

And given the nature of more data being processed, they require a higher throughput payload. The workflow of genomics allows researchers to put powerful computational queries to work to analyze and decode information in DNA and RNA. If you think about the data sets and their ever-growing complexity and size,

01:49

we are now capable of sequencing that material faster than we can actually decipher the information we're looking for in it. Looking at the next decade, they're estimating that researchers will generate around 40 exabytes of data. That's an astonishing number because if you think about it,

02:08

a genome sequence is only 100 gigabytes in size. You may be curious where the large uptick in capacity has come from. A large portion in recent times has been around coronavirus testing and the results from that. According to some analysts, genomics data is doubling every seven months. Now, let's look at the basic genomics workflow that

02:35

you can find as part of an analysis pipeline. There are a couple of stages throughout the process. The primary analysis, which analyzes a sample biochemically and produces raw data, generates millions of small files over a period of time. The secondary analysis, which takes the raw data, aligns it to a reference, and identifies

02:57

variants from the reference, is very CPU- and memory-intensive, with lots of sequential reads and writes, usually through distributed applications. In most cases there are homegrown computational queries used in this phase that are specific to the researcher's environment. The tertiary analysis analyzes the variants.

03:21

It adds annotations to help interpret their biological and clinical implications. It can utilize complex algorithms and applications such as deep learning and machine learning, for customers such as Paige.AI, to get insights into basic biology, potential causes of genetic diseases, and possible treatment or prevention options. Now, what does a normal customer environment look like when they're working their way

03:51

through their genomics pipelines? Well, it has a few different stages and components that support the workflow. The primary analysis stage is usually what is coming off of the sequencer in the environment, being written to an SMB share on a device, sometimes storage, sometimes not. The secondary analysis stage is supported with storage and other

04:12

infrastructure components such as high-performance computing clusters. The tertiary analysis stage again has different components in some environments. This is a lot to maintain and scale with this ever-growing environment of data. Think of your genome as a magazine: the only way to read the lettering on the magazine is by tearing up the magazine pages into millions of small pieces, because the reading machine can

04:36

only read small portions of the alphabet at a time. The raw files that come out of the sequencing instrument are called BCL files and these contain the lettering written on one small shred of paper per file. There are millions of these small files corresponding to the millions of shreds from the big magazine itself.

04:56

The sequencing instrument generates BCL files, and the process up until this point is called the primary analysis. These BCL files get converted into FASTQ files, which are more organized and can be used for further analysis. BAM or SAM files are stitched-together versions of the magazine and are larger in size. BAM and SAM files are aligned to a reference genome, or a reference

05:24

to the magazine, to then spot the differences. The true information in a genomic study is the difference between your genome and the standard reference genome. Keep in mind that every human genome is 99.9% the same. The whole science of genomics is about finding out that 0.1% difference.
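
To make the idea of spotting differences against a reference concrete, here is a toy Python sketch. It assumes the sample has already been aligned to the reference so that both strings have equal length, and it only reports single-letter substitutions. Real pipelines use dedicated aligners and variant callers; the function name and the sequences below are made up purely for illustration.

```python
def find_substitutions(reference: str, sample: str) -> list:
    """Return (1-based position, reference base, sample base) for each mismatch.

    Toy assumption: the two sequences are already aligned and equal in length,
    so only single-base substitutions are detected (no insertions or deletions).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


if __name__ == "__main__":
    reference = "ACGTACGTAC"  # hypothetical reference fragment
    sample = "ACGTTCGTAA"     # hypothetical aligned sample
    for pos, ref_base, alt_base in find_substitutions(reference, sample):
        print(f"position {pos}: {ref_base} -> {alt_base}")
```

At the 99.9% similarity quoted here, roughly one position in a thousand would show up in such a list, which is why the resulting difference files are small compared with the aligned data.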

05:51

The process of finding out the differences is called variant calling. This leads to small files called VCF files that essentially mark what's different. The process up until this point is called the secondary analysis. Once a scientist has this information, they can then go do tertiary analysis, where they can put the genetic variations into context with other clinically meaningful information such as

06:17

imaging records, patient history, other similar cases, et cetera, to make sense of that information. With all this data transformation, managing this all on multiple platforms can be a really big challenge. Finding one single platform that can handle the variety of different I/O sizes and file sizes can also be challenging.

06:49

Now imagine a world where you can simplify this complex storage paradigm that's involved with the pipeline analysis stages. FlashBlade from Pure Storage enables you to easily deploy high-performance file storage to meet the demands of the different stages of the analysis. FlashBlade has proven deployments in the field supporting critical genomics deployments at

07:12

customers worldwide. Some of the key value propositions of using FlashBlade in these pipeline analysis stages are consolidation of NFS and SMB storage on the same platform, multi-protocol shares of data allowing different components of the pipeline to access files, scalable storage to meet growing performance and capacity requirements,

07:38

and improved speeds through the pipeline, due to the performance of FlashBlade but also because data movement between platforms and shares is eliminated as a requirement. Now ask yourself, have you ever thought about deploying a scalable solution such as FlashBlade to meet the demanding, changing workload of your genomics pipeline? Now, don't take my word for it. See what FlashBlade has done to improve a

08:03

couple of customers' genomics pipelines. McMaster University's McArthur Lab oversees a global database that curates data, models, and algorithms associated with superbugs and other threats to human health. FlashBlade is the underlying infrastructure for a new gene sequencing system that helps researchers generate insights that lead to faster identification of global health threats.

08:29

FlashBlade speeds drug discovery by analyzing select datasets 24 times faster than their prior solution. Their faster pipeline analysis allows team members to monitor global health threats more closely. FlashBlade scales to support additional research and clinical partnerships. Their analysis pipeline generates results in three hours or less for insights that save

08:54

lives. Another customer pipeline that FlashBlade helped improve is that of the Centre for Research in Agricultural Genomics. CRAG was able to update its HPC environment to achieve the speed, scalability, and stability required by the research projects of its users by leveraging the Pure Storage FlashBlade solution. Now, you may be scratching your head about agricultural genomics and how it uses the same

09:22

pipeline analysis as the human genome, but that is the case. The installation of Pure FlashBlade has satisfied the expectations of the CRAG IT department by providing more speed, less latency, new features, and above all reliability. The write speed of the FlashBlade in the CRAG environment has been able to achieve numbers

09:44

3-4 times faster than the prior solution they utilized. This performance has made it possible to eliminate the bottlenecks that appeared when the cluster was under heavy load. It has also made it possible to streamline other tasks, such as backup copies or the occupancy sampling that takes place every two hours to invoice

10:06

IT services to the internal users. Pure Storage FlashBlade really does add a lot of value to customers' genomics pipeline analysis. You can manage your capacity calculations to ensure you won't overrun capacity, using the great tooling built into the platform and Pure1, which is Pure Storage's software-as-a-service management portal.

10:36

If you could run your sequencing analytics up to 24 times faster compared to the solution you have today, what would that yield for your business? Can that open doors to other revenue the business hasn't been able to pursue before? Evergreen, from a business model perspective at Pure Storage, is how we enable you as a customer to eliminate forklift upgrades

10:58

that cause pain to your end users, where we can offer disruption-free upgrades for the ever-growing genomics datasets as your business needs change. Having the peace of mind of not having to migrate the data ever again is valuable. Today, your genomics pipeline may be CPU-heavy in the typical deployment, but as software innovation happens, maybe you start to consume GPU-accelerated workloads.

11:26

Most existing deployments that customers have are spinning-disk based, and they would not survive the demands GPUs place on them. Having a simple deployment model where you can scale and modernize FlashBlade without too much effort can pay dividends to the business, all while delivering speed where it matters most to users in the business itself.

11:57

So why would you choose Pure Storage for genomics in your environment? Simplicity, where you can capture the data from the sequencers directly to the storage that you do your analysis on. Consistent performance as your storage needs change and your datasets grow. Scale from as little as 50 terabytes all the way up to petabytes of capacity in the same namespace. Improve your operational efficiency,

12:27

where you're not moving data between platforms and you're avoiding copy errors that can hinder analysis results. High availability, where you can maximize your sequencer runtime and have a resilient storage platform. And of course, the Evergreen business model, enabling you to never have to rebuy capacity, because as you've learned throughout this presentation,

12:56

genomics data only grows and is usually retained indefinitely. Thank you for attending my session on accelerating next-generation genomics pipelines with modern flash. I hope you enjoy the rest of the Accelerate sessions you attend.

Healthcare
Video
Pure//Accelerate

Genomics data is growing exponentially, with data storage demands doubling every seven months. As life sciences and healthcare organisations move to precision medicine, they need hyper-scalable, flexible, cost-effective data storage to match their growing needs. Near-real time genomics analytics requires infrastructure that is both low latency and high bandwidth.
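
The "doubling every seven months" figure implies simple exponential growth. As a hedged illustration, the Python sketch below projects a starting footprint forward and estimates how long a given amount of storage would last. The starting size, the installed capacity, and the function names are hypothetical values chosen only for the example, and steady doubling is itself an assumption.

```python
import math

DOUBLING_MONTHS = 7  # doubling period quoted above


def projected_size(initial_tb: float, months: float) -> float:
    """Data size after `months`, assuming steady doubling every DOUBLING_MONTHS."""
    return initial_tb * 2 ** (months / DOUBLING_MONTHS)


def months_until_full(initial_tb: float, capacity_tb: float) -> float:
    """Months until data growing at the doubling rate reaches `capacity_tb`."""
    if initial_tb <= 0 or capacity_tb < initial_tb:
        raise ValueError("need 0 < initial_tb <= capacity_tb")
    return DOUBLING_MONTHS * math.log2(capacity_tb / initial_tb)


if __name__ == "__main__":
    start_tb = 50.0       # hypothetical starting footprint
    capacity_tb = 1600.0  # hypothetical installed capacity
    print(f"after 14 months: {projected_size(start_tb, 14):.0f} TB")
    print(f"months until full: {months_until_full(start_tb, capacity_tb):.0f}")
```

With these example numbers, a 32-fold capacity headroom corresponds to five doublings, or about 35 months of growth at the quoted rate.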

By choosing a modern flash storage architecture from Pure Storage, organisations can speed up their genomics pipelines and reduce turnaround time for clinical sequencing by up to 24x. With multi-protocol support, Pure FlashBlade can capture data directly from sequencers to conduct primary, secondary and tertiary analyses with a single, scale-out dynamic storage solution that handles small and large files equally well. The scalability, high availability and resiliency of Pure helps maximise sequencer runtime, and combined with the best of on-premises and the cloud, organisations can truly focus on delivering the promise of precision medicine, drive a faster time to science, and improve patient outcomes.
