February 7, 2025 · 4 min read

Speed up RAG Experiments on AWS SageMaker with DeepSeek-R1 & FloTorch

Authors
Kiran George
Senior Software Engineer
Nanda Rajasekharuni
Solution Architect

DeepSeek R1: What is all the noise about?

The AI world just got a game-changing breakthrough with DeepSeek-R1, a model that dramatically reimagines machine reasoning. DeepSeek unveiled a novel approach to AI problem-solving: using pure reinforcement learning to train reasoning in language models, rather than relying on supervised fine-tuning on human-labeled examples. DeepSeek's own evaluations claim that the model matches or outperforms established reasoning models across key benchmarks, demonstrating the potential of this approach.

That’s great! What are DeepSeek R1 Distilled models?

Distilled models offer an accessible solution for common use cases by lowering the entry barrier, providing advanced AI reasoning without complex deployment challenges. This is in contrast to the full DeepSeek-R1, which requires significant infrastructure to set up and run.

Imagine getting 80% of a supercomputer's performance with a smartphone's processing power. DeepSeek-R1 distilled models do exactly that—delivering near-original reasoning capabilities with dramatically reduced computational overhead.

Awesome! How do I run a DeepSeek R1 Distilled model and validate the claim?

We'll guide you through setting up DeepSeek-R1-Distill-Llama-8B on AWS SageMaker, with a spotlight on FloTorch - the no-code GenAI platform transforming AI experimentation.

Exploring DeepSeek-R1-Distill-Llama-8B capabilities without FloTorch

Prerequisites

1. AWS Requirements

  • SageMaker and S3 access
  • IAM execution role with permissions
  • AWS CLI configured with credentials

2. Model Access: deepseek-ai/DeepSeek-R1-Distill-Llama-8B on the Hugging Face Hub

3. Python version: 3.10 (recommended)

4. Required packages: SageMaker, boto3, huggingface_hub
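Assuming a fresh Python 3.10 virtual environment, the packages above can be installed with pip (versions left unpinned here; pin them as needed for reproducibility):

pip install sagemaker boto3 huggingface_hub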

Key Configuration Steps

1. Model Initialization

We suggest the ml.g5.2xlarge instance for its balance of performance and cost-effectiveness for a model of this size. We start by importing the required packages and initializing the model ID, which can be found on the Hugging Face Hub.

import json
import boto3
from sagemaker import Session
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
instance_type = "ml.g5.2xlarge"
sagemaker_role_arn = "<your-sagemaker-arn>"
region = "<preferred-aws-region>"

2. Create SageMaker Session

A SageMaker session is necessary to connect with AWS services and interact with SageMaker models.

session = Session(boto_session=boto3.Session(region_name=region))

3. Model Deployment

First, we create a dictionary named 'hub' with the model ID and the number of GPUs to use, which is required when creating the Hugging Face model.

hub = {
    'HF_MODEL_ID': model_id,
    'SM_NUM_GPUS': json.dumps(1)
}

The SageMaker SDK's HuggingFaceModel class can be utilized to create a Hugging Face model and deploy it on a SageMaker endpoint.

huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="2.3.1"),
    env=hub,
    role=sagemaker_role_arn,
    sagemaker_session=session
)

The HuggingFace model is deployed to an endpoint. The deployment specifies the instance type, instance count, and a health check timeout to ensure proper initialization.

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=300,
)

By default, the inference endpoint is created with a randomly generated name. Alternatively, you can pass a custom endpoint_name to huggingface_model.deploy. Ensure that the endpoint_name you provide is unique and does not match any existing endpoint in your AWS account.

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=300,
    endpoint_name='deepseek-ai-DeepSeek-R1-Distill-Llama-8B-inferencing-endpoint'
)
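If you later want to run inference from a separate notebook or script without redeploying, you can attach to the existing endpoint. A minimal sketch, assuming the endpoint name used above and the session created earlier:

from sagemaker.huggingface.model import HuggingFacePredictor

# Attach to an already-deployed endpoint instead of creating a new one
predictor = HuggingFacePredictor(
    endpoint_name='deepseek-ai-DeepSeek-R1-Distill-Llama-8B-inferencing-endpoint',
    sagemaker_session=session
)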

4. Inference using the model

To prepare for model inference, create a payload dictionary containing two elements: 'inputs' for your query and 'parameters' for model configuration. The parameters let you adjust key generation settings such as temperature, max new tokens, and the sampling strategy, giving you granular control over the model's response characteristics and output quality.

default_params = {
    "max_new_tokens": 256,
    "temperature": 0.5,
    "top_p": 0.9,
    "do_sample": True
}


payload = {
    "inputs": "How do DeepSeek R1 distilled models differ from DeepSeek R1?",
    "parameters": default_params
}

response = predictor.predict(payload)
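With the JSON deserializer used by the SageMaker Hugging Face predictor, the TGI container typically returns a list of dictionaries with a generated_text field; a quick sketch for reading the answer (field name assumed from the standard TGI response format):

# TGI usually responds with: [{"generated_text": "..."}]
print(response[0]["generated_text"])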

So, we've got DeepSeek-R1-Distill-Llama-8B up and running on AWS SageMaker – ready to answer questions!
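Before moving on, one housekeeping note: the endpoint bills for as long as it is running, so once you are done experimenting it is worth tearing it down. A minimal cleanup sketch using the same predictor object:

# Delete the model and endpoint to stop incurring charges
predictor.delete_model()
predictor.delete_endpoint()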

Now, we need to figure out the best way to get answers. Suppose we have a dataset and want to run RAG over it, which leaves us with a pile of hyperparameters to tune. Which chunking strategy and chunk size should we pick? Which vector DB should we use? Is the best k-NN value 5 or 10? Does a temperature of 0.3 work better than 0.5? And hey, maybe a re-ranker could help organize the results? Usually, figuring all this out takes a ton of time and effort. But what if we could just click a few buttons, grab a coffee, and come back to all the answers? That's where FloTorch swoops in to save the day!

Exploring DeepSeek-R1-Distill-Llama-8B capabilities with FloTorch

FloTorch's latest release supports DeepSeek-R1-Distill-Llama-8B, and all you need to try it is an AWS account and the required permissions! It's open-source and runs entirely on your AWS account. You can find it on AWS Marketplace, and the installation steps are outlined on GitHub.

After starting FloTorch, select the desired hyperparameters and combinations for experimentation. FloTorch will evaluate these combinations based on accuracy, latency, and cost.

We carried out RAG experiments on the AWS Bedrock User Guide dataset using FloTorch and DeepSeek-R1-Distill-Llama-8B. Our indexing strategy used a fixed chunk size of 256 tokens, the Cohere v3 embedding model with a vector dimension of 1024, and a provisioned OpenSearch cluster as the vector store. We evaluated the results with k-NN values of 5 and 10, both with and without the Cohere reranker, on FloTorch against the following metrics:

  • Context Precision: Evaluates how relevant the retrieved information is to a given user query.
  • Answer Relevancy: Evaluates how relevant the generated response is to a given user query.
  • Faithfulness: Evaluates the factual consistency of a response with its retrieved context on a scale from 0 to 1.
  • Cost: The total cost incurred for running the experiment.
  • Time: The total time elapsed for running the experiment.
Experiment Comparison

The best part is that these experiments are run at a minimal cost—often under $5. Additionally, support for other powerful DeepSeek-R1 Distilled models will be added soon.

In conclusion, the combination of DeepSeek-R1-Distill-Llama-8B and FloTorch offers a powerful and accessible solution for streamlined AI reasoning. The distilled model brings advanced AI capabilities to a wider audience, and FloTorch simplifies the experimentation process, making it easier to find the optimal configuration for your specific needs.

Hey, if FloTorch got you interested, we'd love your help! We appreciate any contributions - code, ideas, feedback, or even just starring our repo. Let's build this together!

