
Retrieval Augmented Generation (RAG) with Kubernetes and Portworx

In this lightboard video, we will learn how organizations can run open-source Large Language Models from Hugging Face on a Kubernetes cluster on-premises using Portworx. We also talk about how organizations can avoid hallucinations by building a Retrieval Augmented Generation (RAG) architecture backed by a vector database.
Transcript
00:00
Hello, my name is Fain Shah, and I'm a Senior Technical Marketing Manager with Portworx by Pure Storage. In this lightboard video, we're going to talk about how organisations can build Retrieval Augmented Generation models, or RAG models, to make sure that their LLMs don't hallucinate. I know that was a lot of jargon; that's what we are going to break down in this
00:19
lightboard video. So let's start by talking about large language models, or LLMs. There are a lot of open-source LLMs, and there are enterprise or closed-source LLMs. If you ask them questions, they do a good job of answering based on
00:41
the training data that they were trained on. But if they were not trained on domain-specific information, they can give you false or incorrect answers while portraying themselves as experts in those domains. We definitely don't want that. Let's say you are a lawyer, you are looking at case law, and you ask the model to cite a
01:01
specific case. You don't want it to make up citations; you want to make sure that the LLM-based app you're using relies on actual historical case data to give you answers. One way organisations are addressing this is by fine-tuning: they take an existing open-source model and
01:26
fine-tune it for a specific domain. This can take a lot of time and a lot of GPU resources. And once the model is published to a production setup, its accuracy starts deteriorating, because it won't have information that
01:44
becomes available after the model has hit production. So fine-tuning is definitely one way, but it's expensive, not 100% accurate, and not 100% up to date. The other way organisations are trying to solve this hallucination problem is by using the RAG model, or Retrieval Augmented Generation model. In a RAG
02:06
model, organisations take an open-source LLM that's available on repositories like Hugging Face or GitHub, or in their own repos, download it, and then feed it domain-specific information from a vector database. So in this video, we are going to talk about how you can do all of these operations, or build this architecture, on a Kubernetes cluster.
02:28
For this video, we have already deployed a Kubernetes cluster. We have a set of compute nodes that are responsible for running our non-GPU workloads, and then we have a set of GPU nodes that will be responsible for serving our large language model, because it needs a lot of GPU resources, or GPU horsepower, on the compute side. And then we also have Portworx installed
02:50
right on the compute side. We have installed the NVIDIA device plugin, and this plugin allows users to share GPU resources. They can do this in one of two ways, the first one being time slicing. Instead of breaking down your GPU into multiple smaller chunks, you give access to the entire GPU to a specific data scientist, a specific
03:17
application, or an LLM model, and they share it for a specific interval of time; then other users can do the same in future time slots. Or, if you want to go down the second route, you break down the entire GPU and share those smaller partitions across data scientists. These are some of the capabilities the NVIDIA device plugin brings to Kubernetes: it makes GPU usage resource-efficient and allows multi-tenant applications or multi-tenant models to run on Kubernetes.
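To make the time-slicing option concrete, here is a minimal sketch, assuming the NVIDIA device plugin has been installed with support for reading a sharing config from a ConfigMap. The ConfigMap name, namespace, and replica count are illustrative assumptions, not values from this video.

```python
# Minimal sketch: publish a time-slicing config for the NVIDIA device plugin.
# The ConfigMap name, namespace, and replica count are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()  # assumes kubectl access to the cluster

TIME_SLICING_CONFIG = """\
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4   # advertise each physical GPU as 4 time-shared slices
"""

cm = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(
        name="nvidia-device-plugin-config",   # hypothetical name
        namespace="nvidia-device-plugin",     # hypothetical namespace
    ),
    data={"config.yaml": TIME_SLICING_CONFIG},
)
client.CoreV1Api().create_namespaced_config_map(
    namespace="nvidia-device-plugin", body=cm
)
```

With a config like this, pods still request a GPU with an ordinary nvidia.com/gpu resource limit, and several of them can be scheduled onto the same physical GPU in turn.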
03:35
We also deploy the KubeRay operator. KubeRay is an open-source project that allows organisations or administrators to offer Ray clusters, Ray jobs, or Ray Serve services as a service to their data scientists.
04:02
Ray clusters allow you to curate your data, Ray jobs allow you to train or fine-tune an LLM, and Ray Serve gives you the ability to download a model from a repository like Hugging Face and run it inside your Kubernetes cluster.
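As a rough illustration of that serving piece, here is a minimal Ray Serve sketch that pulls a model from the Hugging Face Hub and answers prompts over HTTP. The model name and resource request are assumptions for illustration, not the exact setup drawn on the lightboard.

```python
# Minimal Ray Serve sketch: download a Hugging Face model and serve it over HTTP.
# The model name and GPU request are illustrative assumptions.
from ray import serve
from starlette.requests import Request
from transformers import pipeline


@serve.deployment(ray_actor_options={"num_gpus": 1})  # schedule onto a GPU node
class LLMServer:
    def __init__(self):
        # Pulls the model weights from the Hugging Face Hub on startup.
        self.generator = pipeline(
            "text-generation", model="tiiuae/falcon-7b-instruct"
        )

    async def __call__(self, request: Request) -> dict:
        prompt = (await request.json())["prompt"]
        output = self.generator(prompt, max_new_tokens=128)
        return {"response": output[0]["generated_text"]}


app = LLMServer.bind()
# serve.run(app)  # with KubeRay this is typically wrapped in a RayService custom resource
```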
04:27
So this can be your LLM, deployed by the KubeRay operator as a custom resource, and you're actually pulling down that model from a Hugging Face repository to build the retrieval augmented generation piece of this architecture. We'll also deploy a vector database, and the vector database provides semantic search capabilities for your large language model. So instead of storing relational information or
04:50
document-based information, you are storing information as vector embeddings, and then you can perform nearest-neighbour searches to give you the closest approximation. So let's take an example and go back to the same law firm. If I'm building a RAG model for my law firm, I'm downloading a chat-based LLM model from Hugging Face, and then, inside my
05:14
vector database, I'm storing my domain-specific case data. And again, vector databases can be existing databases like PostgreSQL with the pgvector add-on, which gives it vector semantic search capabilities, or the latest version of Cassandra, which already has this capability built in without having to install additional extensions or components. You can run these vector databases on top of your cluster,
05:34
using existing operators, and get all the features that Portworx brings to the table for running those databases on Kubernetes as well. There are also newer database vendors that have come into the ecosystem, offering services like Pinecone and Weaviate, that are getting popular for retrieval augmented generation models.
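For example, with PostgreSQL and pgvector, the nearest-neighbour lookup described above boils down to a SQL query ordered by vector distance. This is a hedged sketch; the table, column, embedding model, and connection details are assumptions.

```python
# Minimal sketch: embed a question and run a nearest-neighbour search in pgvector.
# The table, column, embedding model, and connection details are illustrative assumptions.
import psycopg2
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional embeddings

conn = psycopg2.connect("dbname=caselaw user=postgres host=postgres-pgvector")
cur = conn.cursor()

question = "Which precedent covers liability for defective software updates?"
query_vec = "[" + ",".join(str(x) for x in embedder.encode(question)) + "]"

# "<->" is pgvector's distance operator; the closest embeddings come back first.
cur.execute(
    "SELECT case_name, excerpt FROM case_law "
    "ORDER BY embedding <-> %s::vector LIMIT 5",
    (query_vec,),
)
for case_name, excerpt in cur.fetchall():
    print(case_name, "-", excerpt[:80])
```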
05:52
So once you have deployed your vector database on Portworx on Kubernetes, you can connect it to your large language model. Whenever your LLM application (let's draw that out as well) wants to fetch some data or ask the large language model to generate an output, it sends in a prompt.
06:18
And again, people who are already familiar with how OpenAI's ChatGPT works know what prompts are: a prompt is just a set of input text that you provide to the large language model, which it uses to give you an output text. This can be tokenized and broken down into separate tokens as well.
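As a quick illustration of tokenization, here is a small sketch using a Hugging Face tokenizer; the tokenizer choice and prompt are assumptions, not part of the demo.

```python
# Minimal sketch: how a prompt is broken into tokens before the model sees it.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any Hugging Face tokenizer works here

prompt = "Cite the leading case on software liability."
tokens = tokenizer.tokenize(prompt)    # subword pieces, e.g. ['C', 'ite', 'Ġthe', ...]
token_ids = tokenizer.encode(prompt)   # the integer IDs the model actually consumes

print(tokens)
print(token_ids)
```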
06:36
Once your LLM receives this prompt, instead of generating output based only on the training data it had, it goes and looks at the vector database, treating it as a more domain-specific knowledge set. Before returning the output, it finds the closest answer relevant to this specific question, and that's what it returns to my LLM application.
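Putting the flow together, a hedged sketch of that retrieve-then-generate loop might look like the following. The retrieval function and the LLM endpoint URL are hypothetical placeholders standing in for the pgvector query and the Ray Serve deployment described above.

```python
# Minimal RAG sketch: retrieve domain context first, then ask the LLM.
# retrieve_similar() and LLM_ENDPOINT are hypothetical stand-ins for the
# vector database query and the Ray Serve deployment discussed in this video.
import requests

LLM_ENDPOINT = "http://llm-serve.ray.svc:8000/"   # assumed Ray Serve address


def retrieve_similar(question: str, k: int = 5) -> list[str]:
    # Stand-in for the pgvector nearest-neighbour query shown earlier.
    corpus = [
        "Case A v. B (2019): vendor held liable for a defective software update.",
        "Case C v. D (2021): liability limited where the customer skipped patches.",
    ]
    return corpus[:k]


def answer(question: str) -> str:
    context = "\n".join(retrieve_similar(question))
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = requests.post(LLM_ENDPOINT, json={"prompt": prompt}, timeout=60)
    return resp.json()["response"]


print(answer("Which precedent covers liability for defective software updates?"))
```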
07:03
So by building this Retrieval Augmented Generation model, organisations can avoid hallucinations, and there are a few benefits of doing this on Kubernetes and on Portworx. Let's talk about those benefits. The first one is portability: we all know that GPUs are hard to find, and even if you have the budget and the resources
07:26
to acquire them, you might face a backlog before those GPUs actually get delivered. The way organisations are working around this issue is by renting GPUs from major cloud service providers like AWS, CoreWeave, or Azure, and using them as infrastructure-as-a-service resources.
07:46
But that's only what you're doing right now; maybe six months down the line, you get those GPU resources inside your own data centre and you want to run everything on-prem. By building your RAG models on top of your Kubernetes cluster, you make this portable and avoid any sort of vendor lock-in. You can take the same architecture, bring it on-prem, and continue using your application
08:07
inside your own data centre. The next one is flexibility and scale. Because of the way the device plugin from NVIDIA works and how it allows you to share your GPU resources, and because of the way open-source projects like Karpenter, or InstaScale from Red Hat and IBM Research, work, they allow you to add GPU-based nodes and scale
08:28
your compute capacity on demand when you actually need those resources for your LLM models. When you don't need those GPU resources, your infrastructure can scale down on demand, thus saving you costs as well. So cost becomes an important benefit, especially when you are doing this in the cloud.
08:46
You don't want to have GPU-based nodes that you are paying for but only using a couple of hours a day; that's not the best resource utilisation, and you'll end up burning through your budget. Because of benefits like portability, flexibility, scalability, and cost savings, organisations are choosing Kubernetes and Portworx to build Retrieval Augmented Generation
09:08
models where they can avoid hallucinations by using vector databases. That's it for this video. Thank you for watching.