March 3, 2025

RAG Benchmarking of Amazon Nova and GPT-4o models

Authors
Dr. Hemant Joshi
CTO, FloTorch

Summary

We used FloTorch Enterprise software to compare Amazon Nova models against OpenAI GPT-4o models on the Comprehensive Retrieval Augmented Generation (CRAG) benchmark dataset, measuring latency, accuracy, and cost for these two Large Language Model (LLM) providers across five different topics. We find the accuracy of GPT-4o slightly higher than that of Amazon Nova Pro. However, Amazon Nova Pro responds 21.97% faster than GPT-4o and is 65.26% cheaper. Similarly, we benchmarked the Amazon Nova Micro and Amazon Nova Lite models against GPT-4o mini: they outperform GPT-4o mini on accuracy by 4 and 2 percentage points respectively, are cheaper than GPT-4o mini (by 73.10% and 56.59% respectively), and are faster in response time (by 20.48% and 26.60% respectively).

Motivation

The GPT-4o model was released by OpenAI in May 2024. Amazon Nova models were launched by Amazon at the AWS re:Invent event in December 2024. While GPT-4o is fairly popular, Amazon Nova models are interesting to many enterprises for the cost savings and lower latency they offer over OpenAI models. While foundation language models can answer most common user questions, they sometimes hallucinate or over-generalize their answers. As publicized incidents have demonstrated, LLMs can also answer questions without limiting their scope to the company they represent. Retrieval Augmented Generation (RAG) is a common use case across the many industries and enterprises we work with: companies with a knowledge base (source data) of PDFs or other documents want their bots to answer all user questions strictly within the scope of the data provided. We wanted to answer three questions for enterprise customers:

  • How does Amazon Nova Pro compare with GPT-4o on latency, cost, and accuracy?
  • How do the Amazon Nova Micro and Lite models compare against GPT-4o mini on latency, cost, and accuracy?
  • How do these models perform for RAG use cases across diverse topics?

What is the CRAG benchmark dataset?

The Comprehensive Retrieval Augmented Generation (CRAG) dataset was released by Meta for testing with factual queries across 5 domains and 8 question types, with a large number of question-answer pairs. The five domains in the CRAG dataset are Finance, Sports, Music, Movie, and Open (Miscellaneous). The eight question types are simple, simple_w_condition, comparison, aggregation, set, false_premise, post-processing, and multi-hop. Here are some example questions with their domain and question type:

Domain | Question | Question Type
Sports | Can you carry less than the maximum number of clubs during a round of golf? | simple
Music | Can you tell me how many Grammies were won by Arlo Guthrie until the 60th Grammy (2017)? | simple_w_condition
Open | Can I make cookies in an air fryer? | simple
Finance | Did Meta have any mergers or acquisitions in 2022? | simple_w_condition
Movie | In 2016, which movie was distinguished for its visual effects at the Oscars? | simple_w_condition


We considered 200 queries from this dataset representing the five domains and two question types, simple and simple_w_condition. Both question types are common from users, and a typical Google search for a query such as “Can you tell me how many grammies were won by arlo guthrie until 60th grammy (2017)?” will not give you the correct answer (one Grammy). We used these queries and their ground truth answers to create a subset benchmark dataset. The CRAG dataset also provides the top 5 search result pages for each query. These five web pages act as the knowledge base (source data) that limits the RAG model's responses. The goal is to index these five web pages dynamically using a common embedding algorithm, and then use a retrieval (and reranking) strategy to retrieve chunks of data from the indexed knowledge base to infer the final answer.

Evaluation Setup

Let’s take a look at a typical RAG-pipeline-based evaluation.

Knowledge base:

We used the top 5 HTML web pages provided with the CRAG dataset for each query as the knowledge base source data. The HTML pages were parsed to extract text for the embedding stage.

Chunking strategy:

We used a fixed chunking strategy with a chunk size of 512 tokens (roughly 4 characters per token) and a 10% overlap between chunks. We plan to experiment with different chunking strategies, chunk sizes, and overlap percentages in the coming weeks and will update this blog.
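
To make the chunking step concrete, here is a minimal sketch of fixed-size, token-based chunking with overlap. It is illustrative only: the tokenizer choice (tiktoken's cl100k_base encoding) and the helper name fixed_chunk are assumptions, not the exact FloTorch implementation.

# Illustrative fixed-size chunking with overlap; the tokenizer choice is an assumption.
import tiktoken

def fixed_chunk(text: str, chunk_size: int = 512, overlap_pct: float = 0.10):
    """Split text into ~chunk_size-token chunks with overlap_pct overlap between consecutive chunks."""
    encoder = tiktoken.get_encoding("cl100k_base")   # assumed tokenizer, for illustration
    tokens = encoder.encode(text)
    step = int(chunk_size * (1 - overlap_pct))       # 512 tokens with 10% overlap -> ~461-token stride
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(encoder.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks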

Embedding strategy:

We used the Amazon Bedrock Titan-Embed-Text-V2 embedding model with an output vector size of 1024. With its maximum input limit of 8,192 tokens, Titan-Embed-Text-V2 easily handled both the chunks from the knowledge base source data and the short queries from the CRAG dataset. The Bedrock APIs make it easy to use Titan-Embed-Text-V2 for embedding data.
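
The retrieval code below calls a create_embeddings_with_titan_bedrock helper that is not shown in this post. Here is a minimal sketch of what such a helper could look like using boto3 and the Bedrock InvokeModel API; the three-element return value (text, input token count, embedding) and the client configuration are assumptions based on how the caller unpacks the result.

# A minimal sketch (not FloTorch's exact helper) of embedding text with Titan-Embed-Text-V2 on Bedrock.
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")  # region is an assumption

def create_embeddings_with_titan_bedrock(text: str, max_length: int, normalize: bool):
    """Return (text, input_token_count, embedding); the tuple layout is assumed from the caller."""
    body = json.dumps({
        "inputText": text[:max_length],   # simplified character-based truncation
        "dimensions": 1024,               # output vector size used in our experiments
        "normalize": normalize,
    })
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=body,
        accept="application/json",
        contentType="application/json",
    )
    result = json.loads(response["body"].read())
    return text, result.get("inputTextTokenCount"), result["embedding"]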

Vector database:

We chose Amazon OpenSearch Service as the vector database for its high performance. We provisioned a 3-node sharded OpenSearch cluster of r7g.4xlarge nodes, which were readily available, sufficient for our needs, and met our performance requirements. We used HNSW indexing in OpenSearch.
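
For reference, a k-NN index using HNSW can be created in OpenSearch with a mapping along the following lines. The index name, vector field name, similarity metric, and engine shown here are illustrative assumptions rather than our exact configuration.

# Illustrative OpenSearch k-NN index creation with HNSW; names and parameters are assumptions.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "your-opensearch-endpoint", "port": 443}], use_ssl=True)

index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "text": {"type": "text"},            # raw chunk text returned at query time
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,               # matches the Titan-Embed-Text-V2 output size
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil", # similarity metric (assumed)
                    "engine": "nmslib",
                },
            },
        }
    },
}
client.indices.create(index="example_interaction_id_titan", body=index_body)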

Retrieval (and reranking) strategy:

We used a retrieval strategy of k-nearest neighbors (KNN) with k = 5 for retrieved chunks. We did not use any reranking algorithm for these experiments, to make sure the retrieved chunks are the same for both models when inferring the answer to the provided query. Here is the code snippet that embeds the given query and passes the embeddings to the search function:

import os
import logging
from typing import List

logger = logging.getLogger(__name__)

def search_results(interaction_ids: List[str], queries: List[str], k: int):
    """Retrieve search results for queries."""
    results = []
    embedding_max_length = int(os.getenv("EMBEDDING_MAX_LENGTH", 1024))
    normalize_embeddings = os.getenv("NORMALIZE_EMBEDDINGS", "True").lower() == "true"

    for interaction_id, query in zip(interaction_ids, queries):
        try:
            # Embed the query with Titan-Embed-Text-V2 and run a k-NN search
            # against the per-interaction OpenSearch index.
            _, _, embedding = create_embeddings_with_titan_bedrock(query, embedding_max_length, normalize_embeddings)
            results.append(search(interaction_id + '_titan', embedding, k))
        except Exception as e:
            logger.error(f"Error processing query {query}: {e}")
            results.append(None)
    return results
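
The search function called above queries the per-interaction OpenSearch index with the query embedding. Here is a minimal sketch of such a helper, assuming the vector field is named embedding and each document stores the chunk text in a text field; it is not FloTorch's exact implementation.

# A minimal sketch of the `search` helper assumed above; field names and client setup are assumptions.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "your-opensearch-endpoint", "port": 443}], use_ssl=True)

def search(index_name: str, embedding: list, k: int):
    """Run a k-NN query against the given index and return the top-k chunks with scores."""
    query = {
        "size": k,
        "query": {
            "knn": {
                "embedding": {      # vector field name (assumed)
                    "vector": embedding,
                    "k": k,
                }
            }
        },
    }
    response = client.search(index=index_name, body=query)
    return [
        {"text": hit["_source"]["text"], "score": hit["_score"]}
        for hit in response["hits"]["hits"]
    ]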

Inferencing:

We accessed the GPT-4o model through the OpenAI API with an available API key, and the Amazon Nova Pro model through Amazon Bedrock's conversation APIs. GPT-4o supports a context window of 128,000 tokens, compared with Amazon Nova Pro's context window of 300,000 tokens. The maximum output token limit of GPT-4o is 16,384, versus 5,000 for Amazon Nova Pro. We did not use Amazon Bedrock Guardrails functionality for our benchmarking experiments. We used the universal gateway provided by the FloTorch Enterprise version so that both API calls go through the exact same function and token counts and latency metrics are tracked consistently. Let’s look at the inference code:

from tqdm import tqdm

def generate_responses(dataset_path: str, model_name: str, batch_size: int, api_endpoint: str, auth_header: str,
                       max_tokens: int, search_k: int, system_prompt: str):
    """Generate responses for queries."""
    results = []

    for batch in tqdm(load_data_in_batches(dataset_path, batch_size), desc="Generating responses"):
        interaction_ids = [item["interaction_id"] for item in batch]
        queries = [item["query"] for item in batch]
        # Retrieve the top-k chunks for every query in the batch.
        search_results_list = search_results(interaction_ids, queries, search_k)

        for i, item in enumerate(batch):
            item["search_results"] = search_results_list[i]

        # Send the batch (queries plus retrieved references) to the inference endpoint.
        responses = send_batch_request(batch, model_name, api_endpoint, auth_header, max_tokens, system_prompt)

        for i, response in enumerate(responses):
            results.append({
                "interaction_id": interaction_ids[i],
                "query": queries[i],
                "prediction": response.get("choices", [{}])[0].get("message", {}).get("content") if response else None,
                "response_time": response.get("response_time") if response else None,
                "response": response,
            })

    return results
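
The function above also relies on a load_data_in_batches helper that is not shown in this post. Assuming the CRAG subset is stored as a JSON Lines file (one query record per line), a minimal version could look like the sketch below; the file format and field handling are assumptions for illustration.

# A minimal sketch of load_data_in_batches, assuming a JSON Lines dataset file.
import json
from typing import Dict, Iterator, List

def load_data_in_batches(dataset_path: str, batch_size: int) -> Iterator[List[Dict]]:
    """Yield lists of at most batch_size records read from a JSON Lines file."""
    batch: List[Dict] = []
    with open(dataset_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # yield any final partial batch
        yield batch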

Evaluation:

Both models were evaluated by running queries in batches. A batch size of 8 was selected to comply with Bedrock quota limits as well as GPT-4o rate limits. The query function code is shown below:

import logging
import time
from typing import Dict, List

import requests

logger = logging.getLogger(__name__)

def send_batch_request(batch: List[Dict], model_name: str, api_endpoint: str, auth_header: str, max_tokens: int,
                       system_prompt: str):
    """Send batch queries to the API."""
    headers = {"Authorization": auth_header, "Content-Type": "application/json"}
    responses = []

    for item in batch:
        query = item["query"]
        query_time = item["query_time"]
        retrieval_results = item.get("search_results", [])

        # Build the grounding context from the retrieved chunks.
        references = "# References \n" + "\n".join(
            [f"Reference {_idx + 1}:\n{res['text']}\n" for _idx, res in enumerate(retrieval_results)])
        user_message = f"{references}\n------\n\nUsing only the references listed above, answer the following question:\nQuestion: {query}\n"

        payload = {
            "model": model_name,
            "messages": [{"role": "system", "content": system_prompt},
                         {"role": "user", "content": user_message}],
            "max_tokens": max_tokens,
        }

        try:
            # Measure end-to-end latency around the HTTP call (requests timeouts are in seconds).
            start_time = time.time()
            response = requests.post(api_endpoint, headers=headers, json=payload, timeout=25000)
            response.raise_for_status()
            response_json = response.json()
            response_json['response_time'] = time.time() - start_time
            responses.append(response_json)
        except requests.RequestException as e:
            logger.error(f"API request failed for query: {query}. Error: {e}")
            responses.append(None)

    return responses

Benchmarking on CRAG dataset

Latency:

We measure latency for each query response as the difference between two timestamps: the timestamp when the API call is made to the inference LLM, and the timestamp when the entire response is received from the inference endpoint. Lower latency means a faster LLM, making it suitable for applications with fast response requirements. We believe latency can be further reduced for both models with optimizations and caching techniques, but we wanted to measure out-of-the-box latency for both models.

Accuracy:

We used a modified version of the local_evaluation.py script provided with the CRAG benchmark for accuracy evaluations. We modified the script so that it categorizes correct, incorrect, and missing responses appropriately. We replaced the default GPT-4o evaluation LLM in the script with the mixtral-8x7b-instruct-v0:1 model API. We also modified the script to track input/output tokens and latency as described above.
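
To illustrate the judging step, here is a minimal sketch of an LLM-as-judge call that labels a model response as correct, incorrect, or missing against the ground truth answer. It is not the modified local_evaluation.py script itself; the prompt wording and the Bedrock invocation details for Mixtral are assumptions for illustration.

# Illustrative LLM-as-judge grading with Mixtral 8x7B Instruct on Amazon Bedrock (details assumed).
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")  # region is an assumption

JUDGE_PROMPT = (
    "<s>[INST] You are grading a RAG system's answer.\n"
    "Question: {question}\nGround truth: {ground_truth}\nModel answer: {prediction}\n"
    "Reply with exactly one word: correct, incorrect, or missing. [/INST]"
)

def judge_response(question: str, ground_truth: str, prediction: str) -> str:
    """Return 'correct', 'incorrect', or 'missing' for a single prediction."""
    if not prediction or not prediction.strip():
        return "missing"   # the model under test produced no answer
    body = json.dumps({
        "prompt": JUDGE_PROMPT.format(question=question, ground_truth=ground_truth, prediction=prediction),
        "max_tokens": 10,
        "temperature": 0.0,
    })
    response = bedrock_runtime.invoke_model(
        modelId="mistral.mixtral-8x7b-instruct-v0:1",
        body=body,
        accept="application/json",
        contentType="application/json",
    )
    label = json.loads(response["body"].read())["outputs"][0]["text"].strip().lower()
    return label if label in {"correct", "incorrect", "missing"} else "incorrect"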

Cost:

Cost calculations were straightforward, as both Amazon Nova Pro and GPT-4o publish separate prices per million input and output tokens. We multiply the input tokens by the corresponding rate, do the same for the output tokens, and add the two costs to get the final cost of running the 200 queries. We do not include the OpenSearch provisioned cluster cost here, as the comparison is only at the inference level between the Amazon Nova Pro and GPT-4o LLMs.
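
As a concrete illustration of this calculation, the sketch below computes inference cost from token counts and per-million-token rates. The function name and the example rates are illustrative assumptions; actual rates should come from the providers' current price lists and are not the exact figures behind the tables below.

# Illustrative inference-cost calculation from token counts and per-million-token rates.
def inference_cost(input_tokens: int, output_tokens: int,
                   input_rate_per_million: float, output_rate_per_million: float) -> float:
    """Cost in USD = input_tokens * input_rate + output_tokens * output_rate (rates are per 1M tokens)."""
    return (input_tokens / 1_000_000) * input_rate_per_million \
        + (output_tokens / 1_000_000) * output_rate_per_million

# Hypothetical usage: 250,000 input and 40,000 output tokens at $0.80 / $3.20 per 1M tokens.
print(inference_cost(250_000, 40_000, 0.80, 3.20))  # -> 0.328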

Here are the results: 

Parameters | Amazon Nova Pro | GPT-4o | Observation
Accuracy on subset of CRAG dataset | 51.50% (103 correct responses out of 200) | 53.00% (106 correct responses out of 200) | GPT-4o outperforms Amazon Nova Pro by 1.5 percentage points on accuracy.
Cost of running inference for 200 queries | $0.00030205 | $0.000869537 | Amazon Nova Pro saves 65.26% in cost compared to GPT-4o.
Average latency (seconds) | 1.682539835 | 2.15615045 | Amazon Nova Pro is 21.97% faster than GPT-4o.
Average of input and output tokens | 1946.621359 | 1782.707547 | Typical GPT-4o responses are shorter than Amazon Nova Pro responses.

For simple queries, Amazon Nova Pro and GPT-4o have very similar accuracy (55 and 56 correct responses, respectively), but for simple queries with a condition, GPT-4o performs slightly better than Amazon Nova Pro (50 versus 48 correct answers). Imagine you are part of an organization running a chatbot service that handles 1,000 questions per month from each of 10,000 users, i.e. 10,000,000 queries per month. Compared to GPT-4o, Amazon Nova Pro would save your organization $5,674.88 per month, i.e. $68,098 per year.

Let’s look at similar results for Amazon Nova Micro, Amazon Nova Lite and GPT-4o mini models on the same dataset.

Amazon Nova vs GPT-4o Mini
Parameters | Amazon Nova Lite | Amazon Nova Micro | GPT-4o mini | Observation
Accuracy on subset of CRAG dataset | 52.00% (104 correct responses out of 200) | 54.00% (108 correct responses out of 200) | 50.00% (100 correct responses out of 200) | Amazon Nova Lite and Micro outperform GPT-4o mini by 2 and 4 percentage points, respectively.
Cost of running inference for 200 queries | $0.00002247 (56.59% cheaper than GPT-4o mini) | $0.000013924 (73.10% cheaper than GPT-4o mini) | $0.000051768 | Amazon Nova Lite and Micro are 56.59% and 73.10% cheaper than GPT-4o mini, respectively.
Average latency (seconds) | 1.553371465 (26.60% faster than GPT-4o mini) | 1.6828564 (20.48% faster than GPT-4o mini) | 2.116291895 | Both Amazon Nova models are at least 20% faster than GPT-4o mini.
Average of input and output tokens | 1930.980769 | 1940.166667 | 1789.54 | GPT-4o mini returns shorter answers.

Amazon Nova Micro is significantly faster and less expensive than GPT-4o mini while providing more accurate answers. If you are running a service that handles about 10 million queries each month, it would save you on average 73% of what you would otherwise pay for slightly less accurate results from the GPT-4o mini model.

Conclusion

Based on these tests for RAG use cases, Amazon Nova models produce comparable or higher accuracy at significantly lower cost and latency than the GPT-4o and GPT-4o mini models. We are continuing experiments with other relevant LLMs for comparison. We would also like to run more experiments with the other CRAG question types, such as comparison, aggregation, set, false_premise, post-processing, and multi-hop queries.

About FloTorch

FloTorch.ai helps enterprise customers design and manage agentic workflows in a secure and scalable manner. Our mission is to help enterprises make data-driven decisions across the end-to-end GenAI pipeline, including but not limited to model selection, vector database selection, and evaluation strategies. FloTorch offers an open source version for customers who want scalable experimentation with different chunking, embedding, retrieval, and inference strategies. The open source version runs in the customer's AWS account, so they can experiment with their proprietary data in their own environment. We encourage readers to try FloTorch from AWS Marketplace or from GitHub. FloTorch also offers an enterprise version of the product for scalable experimentation with any LLM and any vector database on any cloud platform. The enterprise version also includes a universal gateway with a model registry for defining new LLMs, and a recommendation engine that suggests new LLMs and agent workflows. For more information, contact us at aws@flotorch.ai.

Tags
RAG Benchmarking
Amazon Nova
AI Experimentation
GPT-4o models
RAG
RAG Optimization
GenAI RAG
