February 7, 2025 · 4 min read

Speed up RAG Experiments on AWS SageMaker with DeepSeek-R1 & FloTorch

Authors
Kiran George
Senior Software Engineer
Nanda Rajasekharuni
Solution Architect

DeepSeek R1: What is all the noise about?

The AI world just got a game-changing breakthrough with DeepSeek-R1, a model that dramatically reimagines machine reasoning. DeepSeek unveiled a novel approach to AI problem-solving: using pure reinforcement learning to train reasoning in language models, rather than relying on supervised fine-tuning on human-labeled examples. DeepSeek's own evaluations claim that the model matches or outperforms established reasoning models across key benchmarks, demonstrating the potential of this approach.

That’s great! What are DeepSeek R1 Distilled models?

Distilled models offer an accessible solution for common use cases by lowering the entry barrier, providing advanced AI reasoning without complex deployment challenges. This is in contrast to the full DeepSeek-R1, which requires significant infrastructure to set up and run.

Imagine getting 80% of a supercomputer's performance with a smartphone's processing power. DeepSeek-R1 distilled models do exactly that—delivering near-original reasoning capabilities with dramatically reduced computational overhead.

Awesome! How do I run a DeepSeek R1 Distilled model and validate the claim?

We'll guide you through setting up DeepSeek-R1-Distill-Llama-8B on AWS SageMaker, with a spotlight on FloTorch - the no-code GenAI platform transforming AI experimentation.

Exploring DeepSeek-R1-Distill-Llama-8B capabilities without FloTorch

Prerequisites

1. AWS Requirements

  • SageMaker and S3 access
  • IAM execution role with permissions
  • AWS CLI configured with credentials

2. Model Access: deepseek-ai/DeepSeek-R1-Distill-Llama-8B on the Hugging Face Hub

3. Python version: 3.10 (recommended)

4. Required packages: SageMaker, boto3, huggingface_hub
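Assuming a fresh Python 3.10 virtual environment, the packages above can be installed with pip (versions left unpinned here; pin them as needed for reproducibility):

pip install sagemaker boto3 huggingface_hub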

Key Configuration Steps

1. Model Initialization

We suggest the ml.g5.2xlarge instance for its balance of performance and cost-effectiveness for a model of this size. We start by importing the required packages and initializing the model ID, which can be found on the Hugging Face Hub.

import json
import boto3
from sagemaker import Session
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
instance_type = "ml.g5.2xlarge"
sagemaker_role_arn = "<your-sagemaker-arn>"
region = "<preferred-aws-region>"

2. Create SageMaker Session

A SageMaker session is necessary to connect with AWS services and interact with SageMaker models.

session = Session(boto_session=boto3.Session(region_name=region))

3. Model Deployment

First, we create a dictionary named 'hub' with the model ID and the number of GPUs to use, which is required when creating the Hugging Face model.

hub = {
    'HF_MODEL_ID': model_id,
    'SM_NUM_GPUS': json.dumps(1)
}

The SageMaker SDK's HuggingFaceModel class can be utilized to create a Hugging Face model and deploy it on a SageMaker endpoint.

huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="2.3.1"),
    env=hub,
    role=sagemaker_role_arn,
    sagemaker_session=session
)

The HuggingFace model is deployed to an endpoint. The deployment specifies the instance type, instance count, and a health check timeout to ensure proper initialization.

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=300,
)

By default, the inference endpoint is created with a randomly generated name. Alternatively, you can pass a custom endpoint_name to huggingface_model.deploy. Ensure that the endpoint_name you provide is unique and does not match any existing endpoint in your AWS account.

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=300,
    endpoint_name='deepseek-ai-DeepSeek-R1-Distill-Llama-8B-inferencing-endpoint'
)
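If you later want to run inference from a separate notebook or script without redeploying, you can attach to the existing endpoint. A minimal sketch, assuming the endpoint name used above and the session created earlier:

from sagemaker.huggingface.model import HuggingFacePredictor

# Attach to an already-deployed endpoint instead of creating a new one
predictor = HuggingFacePredictor(
    endpoint_name='deepseek-ai-DeepSeek-R1-Distill-Llama-8B-inferencing-endpoint',
    sagemaker_session=session
)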

4. Inference using the model

To prepare for model inference, create a payload dictionary containing two elements: 'inputs' for your query and 'parameters' for model configuration. The parameters let you adjust key generation settings such as temperature, max new tokens, and the sampling strategy, giving you granular control over the model's response characteristics and output quality.

default_params = {
    "max_new_tokens": 256,
    "temperature": 0.5,
    "top_p": 0.9,
    "do_sample": True
}


payload = {
    "inputs": "How do DeepSeek R1 distilled models differ from DeepSeek R1?",
    "parameters": default_params
}

response = predictor.predict(payload)
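With the JSON deserializer used by the SageMaker Hugging Face predictor, the TGI container typically returns a list of dictionaries with a generated_text field; a quick sketch for reading the answer (field name assumed from the standard TGI response format):

# TGI usually responds with: [{"generated_text": "..."}]
print(response[0]["generated_text"])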

So, we've got DeepSeek-R1-Distill-Llama-8B up and running on AWS SageMaker – ready to answer questions!
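Before moving on, one housekeeping note: the endpoint bills for as long as it is running, so once you are done experimenting it is worth tearing it down. A minimal cleanup sketch using the same predictor object:

# Delete the model and endpoint to stop incurring charges
predictor.delete_model()
predictor.delete_endpoint()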

Now, we need to figure out the best way to get answers. Suppose we have a dataset and want to run RAG over it, which leaves us with a pile of hyperparameters to tune. Which chunking strategy and chunk size should we pick? Which vector DB should we use? Is the best k-NN value 5 or 10? Does a temperature of 0.3 work better than 0.5? And hey, maybe a re-ranker could help organize the results? Usually, figuring all this out takes a ton of time and effort. But what if we could just click a few buttons, grab a coffee, and come back to all the answers? That's where FloTorch swoops in to save the day!

Exploring DeepSeek-R1-Distill-Llama-8B capabilities with FloTorch

FloTorch's latest release supports DeepSeek-R1-Distill-Llama-8B, and all you need to try it is an AWS account and the required permissions! It's open-source and runs entirely on your AWS account. You can find it on AWS Marketplace, and the installation steps are outlined on GitHub.

After starting FloTorch, select the desired hyperparameters and combinations for experimentation. FloTorch will evaluate these combinations based on accuracy, latency, and cost.

We carried out RAG experiments on the AWS Bedrock User Guide dataset using FloTorch and DeepSeek-R1-Distill-Llama-8B. Our indexing strategy used a fixed chunk size of 256 tokens, the Cohere v3 embedding model with a vector dimension of 1024, and a provisioned OpenSearch cluster as the vector store. We evaluated the results with k-NN values of 5 and 10, both with and without the Cohere reranker, on FloTorch against the following metrics:

  • Context Precision: Evaluates how relevant the retrieved information is to a given user query.
  • Answer Relevancy: Evaluates how relevant the generated response is to a given user query.
  • Faithfulness: Evaluates the factual consistency of a response with its retrieved context on a scale from 0 to 1.
  • Cost: The total cost incurred for running the experiment.
  • Time: The total time elapsed for running the experiment.
Experiment Comparison

The best part is that these experiments are run at a minimal cost—often under $5. Additionally, support for other powerful DeepSeek-R1 Distilled models will be added soon.

In conclusion, the combination of DeepSeek-R1-Distill-Llama-8B and FloTorch offers a powerful and accessible solution for streamlined AI reasoning. The distilled model brings advanced AI capabilities to a wider audience, and FloTorch simplifies the experimentation process, making it easier to find the optimal configuration for your specific needs.

Hey, if FloTorch got you interested, we'd love your help! We appreciate any contributions - code, ideas, feedback, or even just starring our repo. Let's build this together!

