00:03
So, hey folks, thanks for joining. This session is about building and copiloting the AI Center of Excellence for research. If this is not the session you'd like to attend, then you're in the wrong place, but I hope you stick around, because this is really excellent stuff. Let me introduce our distinguished panel of speakers here,
00:33
Andy Siegel, Global Alliance Manager from NVIDIA, our partner on the NVIDIA side here at Pure Storage; and Andy Lin, VP of Strategy and CTO with Mark III Systems, a Pure Elite partner and a company that does excellent consulting around AI and full-stack development. My name is Mila Sky.
00:59
I'm the Global Practice Lead for Analytics and AI. We're going to talk about AI Centers of Excellence and how to get started and rolling. So to make that happen, let me bring Andy Siegel up here and hand over the handy mic. Thank you. Thanks, Andy. Great.
01:22
So, before the recent explosion of generative AI and large language models that NVIDIA is certainly enjoying right now, Accenture did a survey where they found that 84% of executives were concerned that not having an AI strategy would impact their ability to meet business and research goals.
01:46
In that same survey, 74%, or three quarters, of the executives also mentioned that they were struggling to bring an AI strategy into their organization. In fact, AI was seen as a bigger challenge than even budget or attracting talent. Oh, I didn't even introduce myself; I'm sorry about that. So what we see in many organizations is that when on-prem AI
02:15
resources are not readily available, many organizations, or many units within an organization, will go to the cloud for quick and easy access to compute, storage, and tools. We call that shadow AI: redundant instances of cloud AI being brought up within the same organization. And when that happens, oftentimes IT, finance,
02:42
and operations have no idea that's going on. Independently, each of these research units doesn't see the justification to request capex to bring resources on-prem, but cumulatively, the opex impact of having these redundant silos running at the same time can be significant. So why don't organizations have an on-prem AI strategy?
03:11
There are three main challenges. First, unless your organization already has extensive AI experience, designing a predictable and scalable architecture with the right balance of compute, storage, networking, and tools can be very challenging. Second, once you have an on-prem AI design, it can take months to procure, install, and optimize that design with the multi-vendor
03:42
technology that you have to deal with. And third, once you have your environment on-prem, now it's your responsibility to maintain it, update it, service it, and troubleshoot when things go wrong. And who do you call when something breaks? So this is an opportunity for the CIO to be strategic and impact the company's strategy by integrating cost-effective and
04:09
performant on-prem AI infrastructure. IT leadership can be on the leading edge of business transformation, and not just a cost center. Allowing shadow AI to continue and silos to sprawl will only delay the organization's ability to develop its AI expertise and make it part of its culture. So to solve this, NVIDIA, in partnership with Pure Storage and Mark III, has co-developed an
04:37
IT-proven infrastructure that can scale, called the AI Center of Excellence. So an AI Center of Excellence is a shared, centralized, and IT-proven infrastructure that consolidates hardware, expertise, and tools, and can scale. It dramatically reduces the timeline from development to deployment and drives down total cost of ownership with far more efficient utilization of shared, centralized resources.
05:09
This is why NVIDIA partnered with Pure Storage to develop AIRI, which you'll hear more about shortly. Companies that implement AI Centers of Excellence are far more successful in incorporating AI into their culture in a meaningful and cost-effective way, resulting in improved business growth and research outcomes. At the same time, these organizations attract the world's best data scientists, because it
05:34
gives them the resources to accelerate their work and their life's most important research. So it's my privilege now to introduce Andy Lin, VP of Strategy and CTO of Mark III, who has worked with enterprises and research institutions on their journeys to build AI Centers of Excellence: another
06:01
Andy. So, anyway, appreciate you guys joining me. I'm Andy Lin, VP of Strategy and Innovation, CTO-type guy at Mark III. And I'm gonna talk a little bit about what we're seeing out in the market today. This is maybe one way to look at things, but I'm gonna talk about it more from an
06:21
industry-observation perspective. So what is an AI Center of Excellence, right? It's a really generically used term, but I'm gonna walk through exactly what that means, at least from our perspective, and also one way you could, you know, attack this. This isn't the only way.
06:38
But we've seen this work well. And the core of my section is going to be the shared learnings: what we've seen out there, what we've seen work over a period of time. So what is an AI Center of Excellence? It's basically a centralized compute resource managed by one team at one point, but one that can serve 5,
06:56
10, 100, even 1,000-plus data scientists, developers, researchers, et cetera, who all have separate needs on the IDEs, frameworks, and models they deploy, the libraries they use, the apps they deploy, and who can all vote with their feet on where to build, right? Just because you spin this up does not actually mean researchers have to use it, interestingly enough,
07:13
right? So it really is all about what I call half people, process, and culture, and half technology. And that's what the Center of Excellence is fundamentally all about, right? Obviously, I'm a technology guy, and most of you guys in here are too, right? But if you don't focus on the community aspect, around education,
07:30
around developing a process, around building a hub strategy, you can't attract researchers to use it, and then maintain it and grow it like a product manager would. Long term, it probably won't go well. So a lot of the things I'm gonna talk about are actually fundamentally in that
07:47
bucket. Just out of curiosity, show of hands: how many of you guys are IT-operations focused in here? I would think most. OK. How many of you guys are researchers, data scientists building models? OK. One guy raised his hand twice. You're very busy. Awesome. OK.
08:09
So the rest of you guys are somewhere in between, right? Just learning. So, all good. So I've sort of painted the picture of what this journey looks like. And to get there: we got on this journey essentially because we incubated an innovation
08:24
unit around 2015, around the time TensorFlow, PyTorch, et cetera were emerging, and a lot of our shared learnings come from the fact that we've been building every day over the last seven or eight years, right? This community is moving so fast, and everything comes more or less broken out of the box, right? A lot of being able to stand these stacks up is about living that,
08:42
right? And what we found is that in order for organizations to build a Center of Excellence, right, the centralized resource that serves everybody, there are three main areas, or three main groups, within any organization that you need to be able to serve, relative to what they need, right? And they're denoted by the three rows
08:58
that you see on the screen, right? The top is all around builders: researchers, R&D folks, developers, people building the models. In the middle, you have DevOps and MLOps folks in charge of taking those models and scaling them to production at various levels of scale, depending on where they run. At the
09:13
bottom, you have platforms, right? Great-performing technology runs on top of this; this is where, you know, AIRI comes into play, FlashBlade comes into play, DGX comes into play, et cetera. But a lot of what we focus on is really the top two layers, right? Around being able to first educate the communities. You know, we have a concept called the Education Series,
09:33
but that's one way to do it; that's not the only way. Then you have to, you know, help grow a set of power users, enabling them and giving them tool sets and IDEs and things that they can't set up themselves, right, in order to get them to fundamentally use it. And then scale them out over time, or grow the
09:49
platform organically by attracting users, just like you would operate a platform, right? And it's those three steps that we've seen work, and then repeating that, really doing it continuously. What we find is that a Center of Excellence is a living, breathing concept that's all around
10:04
just continuing to iterate. So we've developed basically a five-step framework around what I call the five E's. The idea is that you educate, experiment, evaluate, enable, and engage, right? At each level, or each phase, there's a different thing that
10:27
you do, not only on the technology-stack side, but also, fundamentally, on the people, process, and culture side. This doesn't mean that everyone will fall into this really neatly. But the point is really to show: think about technology equally as you're enabling the user, right? Because there are so many different ways you
10:44
can build models that you have to be able to incent builders, from their point of view, to be able to do it. So this really goes into four key learnings that we've found over the last five, six, seven years, having the privilege to work with and support so many institutions, right? The first is really all around education,
11:05
even before you have a technology stack, right? People are already building stuff, right, regardless of whether you know it or not. A lot of times you already have many pockets, you know, departments or groups or research areas, whatever that might be, that are already building models on their own.
11:22
Right? I would say the first step is really to build a hub around education, right? You might think this is obvious, but build a process, build a workshop series, do it regularly, and enable a forum for the community, for you to enable research education. A lot of institutions
11:42
we work with already have that in place, but that's just something I encourage everyone to think about, because all good things come from that. That's where you really learn what the most important things are that the community needs to build, and there are organically, you know, power users who drive the center and organically float
11:59
to the top. I think another thing that we learned is to really focus on the idea of practical AI education, right? You can go to a class, you can go to a semester-long thing, you can read about the theory of it, right? But most researchers that we've worked
12:15
with, right? They're experts in their area of science, or the product they're building if they're in the commercial space, right? Machine learning and deep learning are really just techniques to get new insights or build a new experience that they couldn't before, right? I know a lot of us like to use buzzwords in the
12:29
space, right? Me included. But fundamentally, that's what it is, right? They'll learn how these things work over time, but how do you get them started? Right? How do you get them a Jupyter notebook that they can just plug and play and iterate on, you know, as an example, right, to classify tumors, or predict an anomaly in,
12:47
you know, a series of numbers, things like different telemetry? That's the key: how do you focus on really practical education? How do you maximize the number of iterations that somebody's doing? You know, I get asked a lot: what are the key KPIs, right?
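As a concrete illustration of the plug-and-play notebook he's describing, here is a minimal sketch, in plain Python, that flags anomalies in a series of telemetry numbers with a simple z-score test. The telemetry values and threshold are hypothetical, chosen for illustration; a researcher would iterate on this before reaching for anything deeper.

```python
from statistics import mean, stdev

def find_anomalies(values, z_threshold=3.0):
    """Flag points whose z-score exceeds the threshold.

    A deliberately simple baseline that can be pasted into a Jupyter
    notebook cell and iterated on quickly.
    """
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > z_threshold]

# Hypothetical telemetry: a steady signal with one obvious spike.
telemetry = [10.1, 9.9, 10.0, 10.2, 9.8, 55.0, 10.1, 10.0, 9.9, 10.1]
print(find_anomalies(telemetry, z_threshold=2.0))  # flags the spike at index 5
```

The point is not the statistics; it is that the loop from "here are my numbers" to "here is a result I can question" takes minutes, not weeks.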
13:02
When you're building an innovation organization, whether it be AI or anything else, right? A lot of folks talk about the ROI of these things, right, their end goals. But what can you control? You cannot control how many iterations it takes for somebody to get to a
13:15
model that's, like, 95% accurate at predicting so-and-so, right? You know, take, for instance, those of you who are into 3D protein folding and stuff like that with AlphaFold, right? That's a huge thing now, right? How many years did it take to get to 90%
13:32
accuracy? Right. It's hard to predict, right? So the one thing that you can really control with education is: how do you minimize the period of an iteration? How do you enable your team and your process to start getting data, to train a model, and then iterate as quickly as possible?
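One way to make that KPI concrete: a small sketch that breaks an iteration into stages, totals the cycle time, and points at the bottleneck, since shortening the longest stage shrinks the loop the most. The stage names and durations here are entirely hypothetical.

```python
# Hypothetical stages of one model iteration, in hours.
stages = {
    "get data": 16.0,
    "preprocess": 4.0,
    "train": 8.0,
    "evaluate": 2.0,
}

def cycle_time(stages):
    """Total hours for one full pass through the loop."""
    return sum(stages.values())

def bottleneck(stages):
    """The stage to attack first: the one consuming the most time."""
    return max(stages, key=stages.get)

hours = cycle_time(stages)       # 30.0 hours per iteration
per_week = (7 * 24) / hours      # ~5.6 iterations per week
print(f"{hours}h per iteration, {per_week:.1f}/week, bottleneck: {bottleneck(stages)}")
```

You can't control when the 95%-accurate model shows up, but you can measure and attack the "get data" row.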
13:49
That's really the one KPI, in my opinion, that you can control as an innovation leader: with a combination of people, process, culture, technology, and tools, how do you get that mix to enable that? That's the one thing that you can fundamentally control. And then, you know, also think about mainstream frameworks, right? Obviously, we know your PyTorches and Tensor-
14:08
Flows are really your big two, right? The market is probably moving more toward PyTorch, just because of the size of transformer models, led by large language models and generative models, et cetera. But you can kind of see that it maps onto what community you're in, what type of research you're doing. Always do due diligence ahead of time to make sure
14:24
that you start with tool sets and communities that are very mainstream for what you're doing, right? Because that will help you later on, a year or two down the line, when you're trying to get help, when you're trying to build community with folks, right? If you started off with something because it seemed easier, or on paper seemed,
14:39
you know, better, right? And you didn't think about the ramifications of not going mainstream in your particular area of research, it can be more difficult for you later, right? These are things that we've learned, that we've seen people do; I don't know if they lucked out, and it's hard to predict the future,
14:52
but that's one thing to think about. And then also, build all-star teams, definitely in the machine learning and deep learning space, right? Obviously, you need folks that understand the data, you need technology people to pull the data together, you need data scientists: there's a collection of people that in the past
15:10
have not worked together. Get comfortable with that, you know. Another sub-area that I've been working a lot with NVIDIA on is 3D simulation, digital twins, Omniverse; that's even more disparate. You've got 10 different types of people working together. And these megatrends feed on each other.
15:25
So get comfortable with that idea. I mean, I've called more people from my personal Rolodex than I ever have before. It's kind of weird, honestly, when you call somebody from high school and you're like, hey, I need your help on this thing, right? That's what I'm noticing is happening out there right now.
15:41
So just get comfortable with working with a lot of different types of people. Key learning number two, right? MLOps is key. I'll define that: machine learning operations. Just like DevOps, it's kind of an overused word, right? I have kind of mixed feelings about it at
15:59
this point in time, but it's the idea of basically automating and connecting together the processes that are used to build an iteration of a model over time, right? There are a lot of software platforms out there that do different things, right, that talk about this. But this is really the next step, right? Once you have education,
16:15
once you assemble the people together, once they start organically iterating and building models, and you have a process to get the community on the same page and have people talk to each other from a people perspective, this is really where the technology starts coming into play, right? You probably don't want to do this too early. If you skip step one, right, and you go straight to this,
16:34
I think you'll find that you're optimizing something that nobody is using, you know. And it really is an organic pace at which this happens. So, you know, it's really around: what's the one thing you control, right? How quickly iteration happens. You really can't control the research you're doing, the discovery you're making, the accuracy of the
16:52
model you're building to make that discovery, right? These are things that just happen over time and have varying lengths. It's sort of like the innovator's dilemma: when you build something that you've never built before, how can you predict how long it will take and how it's going to happen? You can't, right? So the one thing you can control is how you
17:06
actually address the iteration. How quickly does your team work together to figure out: is this a good idea? OK, repeat. And the faster you can do that, I think, as an innovator, as a leader, the better you do, right? So, you know, velocity is important. So, some ideas about how you're connecting
17:21
to other things. I've found that if you observe what your teams are naturally doing, and then you let that lead your strategy on how you connect processes together when it's ready, maybe not the whole thing at one time, that's probably the better way to go about it. You know, one
17:43
approach that's obviously significant is Kubernetes. You know, we have one person that does nothing but GPU-centric scheduling on Kubernetes, right? It's its own art in and of itself, because there are just so many nuances in how you can do that, that, you know, you really want to focus on
18:01
that. But most of the platforms out there obviously focus on deployment on that side of things, right? Outside of that, you have your Bright and Slurm, your HPC-centric ways of doing it as well. But this is definitely what we're seeing grow. And then,
18:15
obviously, I think user experience always comes first. I think there's always a temptation to want to optimize, you know, focus on performance, focus on these things, right? But it really comes down to the human side, right? Can the person using it feel like
18:32
they're running locally? You know, can you serve them just like they were spinning up any, you know, IDE or any model on their own laptop or workstation in their own office? Can you give them the same feeling? Right? If it feels like you're changing their experience for the benefit of optimization,
18:51
there's a really good chance that you're gonna have an unsatisfied user, and they're gonna leave, right? So you always want to make sure that you engage your users as part of that process. I think it's common sense, you know, in the sense that if they're going to eat the meal that you're going to serve up to them, you probably want to involve them in the shopping
19:08
trip, right? And that's kind of the one rule and the one principle that I've always found, right? There are so many people involved here, and sometimes it's easy to get past it because we're all so busy, but you always want to involve everyone in the overall strategy as much as you can;
19:24
you want everyone to be bought in. Key tips and learnings number three: engage and activate. I mentioned this earlier, but have technology, and have people, process, and culture, right? Technology is one way to shorten the iteration cycle. I know, obviously, you've got all these things we're gonna talk about,
19:44
right? And Mila and Andy obviously represent two amazing companies in this space. But technology is just one means, just one part of the innovation cycle; there's people, there's data, there's a lot of other things, right? So you want to be able to make sure that you
19:58
always think about that equally, in my opinion. Once you have the platform up, right, assuming that you've got all the users already queued up, and you did a good job of education ahead of time and involved them in the process, think like a product manager, right? How do I frictionlessly onboard them?
20:15
How do I give them great knowledge bases? How do I give them great workshops? How do I amplify the work they're doing? Right? These are all key parts of growing these centers to the next level. You know, obviously you've got the technology stuff: make sure it stays up,
20:31
make sure it works well, make sure the latencies are good, all that stuff, right? Of course. But this is another key aspect of what we've seen out there. There are a lot of different approaches that I've found as far as activation, right? Hackathons, datathons,
20:48
any way you can build communities to get folks automatically building, and funnel them into the platform that you offer, just like a platform company would, right? You know, drive usage and onboard people with growth hacking, as crazy as that sounds, right? You never think to use it in this context.
21:05
But that's absolutely, in my opinion, how you should think. And then also think about partnerships, right? Obviously, you have your community of builders and researchers, but oftentimes these ways of hub education can be used to build partnerships organically, whether it be public-
21:23
private partnerships, public-public partnerships, or partnerships with federal agencies, right? And these can obviously affect things like funding, whether you are applying for grants or collaborating with industry. If you're more focused on the higher-education research side, there are a lot of ways that you can use this to your advantage if you
21:41
just keep an open mind about it. And that's a lot of times what we think about around the idea of what we call engage and activate, later on. And then lastly, I wanted to talk a little bit about the fact that infrastructure does matter, right? I've spent all my time here
21:59
really talking about the soft side of things, just because, especially at a technology conference, right, it's often overlooked. People want to cut to the chase, so on and so forth, but it's extremely important to really talk about. And that's sort of
22:15
my representation up here, from a panelist's perspective. But at the end of the day, right, obviously, a tech stack does matter. You've got to train that model and use that model to solve the problem, or see how many more insights you can get, or whatever the discovery might be, right? You've got to be able to use it to do that.
22:37
So this is where the idea of DGX and FlashBlade and AIRI really comes into play. Remember what I said: what's the key, what's the metric, what's the KPI? It's around iterations, right? Whether it be people or technology, right? Obviously, if you've got infrastructure that will allow you to train models faster and feed data pipelines faster,
22:55
it's going to shorten that loop length, right? So you want to really have an idea about what that is. So what does that mean? Obviously, if you think about it from a storage perspective, right, you want to have something that's obviously all
23:08
flash, right? Just because these types of workloads are extremely violent, right? Random and sequential, all the worst types of workloads you could possibly think of, all in one, right, when you're trying to train a model and do inference on a model. But also think about the other things that are important in the technology stack;
23:23
it's not just the idea of the media, right? It's also the idea of modern integrations, right? I talked a little bit earlier about Kubernetes, and how that's one of the modern ways of doing it, right? Obviously, you have the traditional HPC way, but this is something to think about, right?
23:40
Like, does the platform have a way you can easily integrate from a CSI-driver perspective, right? And obviously, this is where the idea of Portworx and what Pure is doing in the space really comes into play. This is never really talked about, I've found, but it's extremely important when you're copiloting a scenario like this, because if something is not easy to use and simple and
23:58
doesn't integrate, I mean, it can be difficult later. And then at the end of the day, right, if it runs faster, that's one part of it. If it takes you a day to train a model versus half a day or a quarter of a day, that means you can, in theory, do that many more
24:12
iterations. So it's all about iterations. Obviously, you've got to get your people and your process down, because if people aren't ready to take advantage of it, it may not make a difference, but that's just something to think about. Then, obviously, drive up your GPU utilization, right?
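The day versus quarter-day arithmetic above is worth making explicit. A tiny sketch, with hypothetical numbers, that combines training speed and GPU utilization into the "iterations per day" figure he keeps coming back to:

```python
def iterations_per_day(hours_per_run, gpu_utilization=1.0):
    """Train-evaluate loops that fit in a 24-hour day,
    discounted by how busy the GPUs actually are."""
    return (24 / hours_per_run) * gpu_utilization

# A day, half a day, and a quarter of a day per run,
# at 50% versus 90% GPU utilization (both numbers hypothetical):
for hours in (24, 12, 6):
    low = iterations_per_day(hours, gpu_utilization=0.5)
    high = iterations_per_day(hours, gpu_utilization=0.9)
    print(f"{hours}h/run: {low:.1f} -> {high:.1f} iterations/day")
```

Halving the run time and raising utilization compound: either one helps, but together they multiply the number of iterations the team gets.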
24:25
NVIDIA is doing some amazing things in the marketplace, not only around the new Hopper generation, to do this for machine learning and deep learning workloads, but you're also seeing NVIDIA innovating with things like Grace Hopper on the CPU side. They're taking the GPU memory amount,
24:42
and they're expanding it, right? They're thinking ahead. They're opening up the networking side, right, with what they're doing with Mellanox over the last few years, and the next generation. So there are a lot of different things that are happening, right, higher up the infrastructure
24:57
stack, but just know that this makes a huge difference. I want to emphasize: like I said, this is just one way to do this, right? You know, start an AI education series, or a hub way of seeding education, right? I think it's obvious to do education, but what's harder is to do it repeatedly, in a hub-based,
25:16
all-inclusive manner across your community, right? A lot of times it takes a person or two to be focused on that; this is something that we've figured out over the last few years. But, you know, like I mentioned, this really isn't about us; it's more the idea of some of the learnings that we've found, and this is one of the
25:32
best ways we've found to connect together all those people. And then this is an example of how it might work, right, in a lot of scenarios where you have a large institution, where you bring it all together, right? So, you know, whether you sprinkle in hackathons or run an education series,
25:49
right? And then use that to invite the community out, get everyone educated, and provide tooling and real tools to solve the problems. And then obviously a subset of folks will naturally rise to the front, especially if you're giving them what they want, right? One of the things I always remember is,
26:04
you know, what I've found is I'm always intimidated by the researchers out there doing amazing things, right? I always feel like I have imposter syndrome, because we have the good fortune of working with some of the smartest people in the world who are just doing amazing things. I'm thinking, like, why am I talking to this person? Right?
26:21
But, you know, everyone needs some degree of help, right? And if you can help them a little bit, my perspective is, if we can get folks started and just save them time and de-risk their starting process, on the tools they're using and the approaches they're taking from a machine learning and deep learning perspective,
26:38
we've done our job. And what I've found is that everyone's been really great at teaching us what they're working on. And this is what this is showing, right? The idea that if you combine all these together and throw them all together, you have the ability to organically build up what the Center of Excellence fundamentally
26:52
looks like. So, with that, I want to thank you all, and I want to hand over to Mila for the last portion of the day. Obviously, infrastructure matters, so there's no one better to talk you through what Pure is doing in the space than Mila. Thanks, thanks, Andy. So that was really great context in terms of the value that an AI Center of Excellence can
27:15
bring and how you build one and put it together. Let me talk a little bit about the infrastructure side of it. How many of you are familiar with our AI-ready infrastructure? Cool. So I see a number of folks that are familiar; hopefully this isn't going to be too redundant, but it looks like it'll be new for many of you. Pure was one of the first companies to collaborate with NVIDIA, back in
27:42
2018, to create the first instance of AIRI, which stands for AI-ready infrastructure. We worked with NVIDIA to launch this, and over the years it's evolved and continues to evolve. Right now, it's AIRI//S, which combines NVIDIA DGX A100 and H100 servers with Mellanox InfiniBand and Ethernet switches
28:10
and FlashBlade//S. We're working to certify AIRI as a BasePOD, using the latest BasePOD iteration and evolution of the stack, which can take advantage of Bright Cluster Manager, or the Base Command Manager, for orchestrating the servers. But essentially, what AIRI lets you do
28:39
is deploy a turnkey infrastructure that can be the infrastructure seed for your AI Center of Excellence. You can start small and get rolling right away, because it deploys and it's pretty much ready to go, and then you can scale all the way up to the BasePOD limit of 64 nodes, and FlashBlade scales with you, right? So this is really great for IT and
29:07
infrastructure professionals, because you get to eliminate a lot of the complexity. You get started right away; you can start small and then scale to massive sizes. And it gives you this agile platform that can meet a broad range of needs as your AI Center of Excellence matures, and as you get further into that development stage that Andy was talking about. And for the data
29:35
engineers and data scientists at your organization, this is really great because it simplifies AI at scale. It gives them the foundational infrastructure they need to do everything they want, without having to worry about managing that infrastructure, since it's turnkey and easy to use. So you reduce your training time, you increase the utilization
30:01
of the infrastructure, and you increase the number of training iterations you can complete on your model, as Andy was talking about. You basically get to deploy more GPUs in the same space and at the same cost, and you keep those GPUs more efficiently utilized. So in terms of AI infrastructure requirements, there are a couple of
30:28
dimensions that we see our customers talking about and finding meaningful. One is just ease of use and management. Especially for smaller organizations and smaller teams, managing infrastructure is something data scientists sometimes have to do on their own, and it's nice to minimize that work; AIRI makes that easy.
30:55
Another is environmental efficiency. More and more people are talking about essentially running out of power in their data centers, or the cost and density of power and access to it, or just the ESG side of things, making sure your company is environmentally efficient. Both NVIDIA DGX servers and Pure FlashBlade have some of the best performance per watt
31:25
in terms of what you get for the power you put in, so it's a great solution from an environmental and power efficiency perspective. It can also perform at any scale: as you get more traction and more value from the AI work you're doing, you can scale and gain performance as well as capacity and the ability to handle more work.
31:52
It's an agile infrastructure and platform, so it supports whatever you want to do. Right now, large language models are really hot, there's a lot of generative AI at work, and that has sort of turned into the killer app for AI from a practical perspective, but there might be something else that comes along.
32:12
It might be that, now that you've built this AI Center of Excellence, you want to go back and revisit some of your more traditional analytics and HPC and see if you can make them better with deep learning models instead of more traditional machine learning algorithms. And then, of course, this comes from trusted industry leaders that have your back,
32:35
giving you the support you need, along with consulting and advice. NVIDIA has some of the smartest people in the industry available as part of your NVIDIA support contract to help you craft those models and make things happen. The other aspect of infrastructure that's important, and that AIRI can
33:27
deliver, is elasticity. When you're building out this infrastructure and AI Center of Excellence, it's not just about training the models; it's about being able to prove those models out and use them. With traditional deployments, what tends to happen is you have one cluster for your AI model training,
33:27
another cluster for more traditional analytics, data cleaning, and processing, and then a cluster where you deploy your AI models for inference on new data coming in, where you're actually making sure your models are doing what they're supposed to do and delivering value for your business. Well, with AIRI and DGX H100 and A100, the GPUs
33:56
themselves have a capability called Multi-Instance GPU, or MIG. Essentially, you can take a single GPU and carve it up into as many as seven sub-GPUs that can be allocated and scheduled to your different jobs, and a single DGX server has eight of these GPUs inside it. So this gives you a lot of flexibility in terms of
34:20
how you schedule those jobs and which part of the workflow (inference, analytics, or training) you want to emphasize at any point in time. You can reschedule and re-carve the GPUs flexibly, more or less on the fly, and adjust how your infrastructure is allocated across those different parts of the workflow.
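To make that capacity math concrete, here's a small sketch. This is a hypothetical helper, not an NVIDIA API; in practice you carve GPUs with `nvidia-smi mig` and profiles such as 1g.10gb on H100. It just illustrates the scheduling flexibility of up to seven slices per GPU across a DGX server's eight GPUs:

```python
# Illustrative capacity math for MIG-style scheduling on one DGX server.
# Hypothetical helper for this writeup, not an NVIDIA tool: real slicing
# is done with `nvidia-smi mig`, and a slice never spans physical GPUs.

DGX_GPUS = 8            # GPUs per DGX A100/H100 server
MIG_SLICES_PER_GPU = 7  # maximum 1g MIG instances per GPU

def allocate(jobs, gpus=DGX_GPUS, slices_per_gpu=MIG_SLICES_PER_GPU):
    """Greedy first-fit of jobs (each a count of 1g slices) onto GPUs.

    Returns a list of (job_index, gpu_index) placements; raises if a
    job is malformed or does not fit anywhere.
    """
    free = [slices_per_gpu] * gpus
    placements = []
    for i, need in enumerate(jobs):
        if not 1 <= need <= slices_per_gpu:
            raise ValueError(f"job {i}: needs {need} slices, max is {slices_per_gpu}")
        for g in range(gpus):
            if free[g] >= need:
                free[g] -= need
                placements.append((i, g))
                break
        else:
            raise RuntimeError(f"job {i} does not fit on any GPU")
    return placements

# One DGX server offers up to 8 x 7 = 56 independently schedulable slices,
# e.g. one whole-GPU training job alongside several small inference jobs.
print(allocate([7, 4, 3, 2, 2]))
```

The point of the sketch is simply that training, analytics, and inference jobs of different sizes can share the same eight GPUs, and re-carving the slice layout changes the mix without new hardware.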
34:47
For the gearheads in the audience, this is roughly what the network diagram looks like. It's pretty straightforward: there's a cluster made up of NVIDIA DGX servers, interconnected on a compute network fabric; here
35:10
we're just showing the storage fabric. We build it out of NVIDIA Mellanox Spectrum switches, which provide RoCE, a high-speed, low-latency Ethernet fabric, for interconnecting to your FlashBlade. You deploy it per the reference architecture with redundant network switching, so everything is highly available and resilient, and it's connected up to the
35:39
FlashBlade, which will automatically load balance across all the blades. The FlashBlade itself takes care of all of the storage networking inside the chassis and the cluster, so you don't have to build out a separate network to interconnect the nodes of
36:00
your storage system, which simplifies things even further. And if you're working with a relatively small cluster, one option is to use RoCE Ethernet networking for both the storage fabric and the cluster interconnect fabric. NVIDIA announced Spectrum-X very recently,
36:25
which is their next-generation networking technology with even lower-latency RoCE capabilities, pushing it closer and closer to InfiniBand. So that's a direction to consider in the future. One way of looking at this is that it gives you a certain level of flexibility in starting small: if you're only deploying a couple of DGX servers to start,
36:53
you can take advantage of Ethernet connectivity on those Spectrum Ethernet switches for multiple fabrics. One thing that we've found and developed for our customers that is very useful in analytics and AI is the RapidFile Toolkit. How many of you have heard of or seen the RapidFile Toolkit before? OK, a couple of folks.
37:21
It's free, it's available on the Pure support site, and you just download it as a tarball. What it lets you do is significantly speed up the file-based operations in your AI and analytics workflows: all of those scripts and all of the tooling that might enumerate a bunch of different files, which happens a lot in training. For AI training,
37:48
you often have a data set with hundreds of thousands or even millions of files in it, and as you're doing the training, you want to permute the order in which you process those files to make sure you don't overfit the model. Essentially, what these tools let you do is make it very easy to list a million files, and
38:12
then you can randomize their order, which can speed up your workflows. We've implemented a number of drop-in replacements for common UNIX utilities, the tools people build their scripts around, either in shell scripts or often called from inside a Python program. So any time your code is
38:38
calling ls, find, cp, or rm, you can replace those calls, pretty much plug and play, with the parallel version of each tool. Essentially, the RapidFile Toolkit implements a lot of user-space parallelization: instead of a single IO stream going to storage and back to access your files, you have up to 64 streams of IO
39:10
looking at and processing things in parallel, and that can significantly reduce the completion time of those jobs. And they really are pretty much plug and play. If you were familiar with the RapidFile Toolkit before, when we first launched it, it was essentially a bunch of compiled Python. For RapidFile Toolkit version 2.0, and we're now at 2.1, we've reimplemented all
39:34
of it in Go, so it's a lot faster and more capable, and it works fairly well; you can just substitute your calls to these programs with the RapidFile Toolkit versions. Cool. So, just to wrap up, let me touch on a couple of success stories, places where we've deployed AIRI,
40:01
where we've deployed FlashBlade alongside DGX. There are a number of companies where that's true. Probably the most visible and famous one is Meta, where Meta chose Pure Storage to be the storage foundation for their AI Research SuperCluster, and they're using a mix of FlashBlade and FlashArray to provide that underlying storage.
40:29
And if you want to learn more about building out that AI Center of Excellence, here's the link and QR code, and we'll have recordings of this session as well to refer back to and share with all of your friends, colleagues, and high school friends who are doing AI these days. So with that, let me wrap up and say thank you.