Optimize Your AI Prompts: A Deep Dive into Amazon Bedrock's New Tool

Amazon Bedrock just launched Advanced Prompt Optimization, a new feature designed to help you refine your prompts for any model on the platform. Whether you're migrating to a new model or simply want better performance from your current one, this tool lets you compare original and optimized prompts across up to five models simultaneously. It uses a metric-driven feedback loop to refine prompts based on your evaluation criteria, making it easier to avoid regressions and improve results on underperforming tasks. Below, we break down everything you need to know in Q&A form.

What is Amazon Bedrock Advanced Prompt Optimization?

Amazon Bedrock Advanced Prompt Optimization is a new tool that automatically improves your prompts for any model available on the Bedrock platform. It takes your original prompt template, sample inputs, ground truth answers, and a chosen evaluation metric, then iteratively refines the template to maximize that metric. You can compare the original prompt against optimized versions across up to five models at once, which is especially useful when you're switching models or want to squeeze extra performance from an existing one. The tool also supports multimodal inputs such as PNG, JPG, and PDF files, allowing you to optimize prompts for document and image analysis tasks. By providing a clear before-and-after comparison that includes evaluation scores, cost estimates, and latency, it helps you make data-driven decisions about your prompt engineering.

[Image. Source: aws.amazon.com]

How does the prompt optimization process work?

The process begins when you upload a prompt template in JSONL format, along with example user inputs, ground truth answers, and an evaluation metric or rewriting guidance. The optimizer then enters a metric-driven feedback loop: it generates candidate prompts, tests them against the provided inputs, measures performance using your chosen metric, and iterates to improve. You can guide this process in three ways: by supplying an AWS Lambda function for custom evaluation, using an LLM-as-a-judge rubric, or providing a short natural language description of what you want. The tool runs all iterations and finally outputs both the original and the optimized prompt templates, complete with evaluation scores, cost estimates, and latency measurements. This transparent workflow lets you see exactly how the optimization improved your results and helps you decide which model and prompt combination best suits your use case.
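The loop described above can be sketched in a few lines of Python. This is a toy illustration of a metric-driven feedback loop in general, not Bedrock's actual implementation: the model call, the metric, and the candidate generator (`fake_model`, `exact_match`, `propose`) are all hypothetical stand-ins, and the `{input}` placeholder syntax is an assumption made for the example.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (user input, ground truth answer)


def score_template(template: str, samples: List[Sample],
                   run_model: Callable[[str], str],
                   metric: Callable[[str, str], float]) -> float:
    """Average the metric over every sample for one prompt template."""
    total = 0.0
    for user_input, truth in samples:
        response = run_model(template.format(input=user_input))
        total += metric(response, truth)
    return total / len(samples)


def optimize(template, samples, run_model, metric, propose, rounds=3):
    """Greedy loop: keep a candidate only if it scores strictly higher."""
    best = template
    best_score = score_template(best, samples, run_model, metric)
    for _ in range(rounds):
        for candidate in propose(best):
            candidate_score = score_template(candidate, samples, run_model, metric)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, best_score


# --- Toy stand-ins (all hypothetical) ---
ANSWERS = {"2+2?": "4", "Capital of France?": "Paris"}


def fake_model(prompt: str) -> str:
    # Pretend model: answers tersely only when the prompt asks for it.
    question = prompt.rsplit("Q: ", 1)[-1]
    answer = ANSWERS[question]
    return answer if "one word" in prompt else f"The answer is {answer}."


def exact_match(response: str, truth: str) -> float:
    return 1.0 if response.strip() == truth else 0.0


def propose(template: str) -> List[str]:
    # A real optimizer would generate rewrites; here we try two fixed prefixes.
    return ["Reply in one word. " + template, "Be polite. " + template]


samples = [("2+2?", "4"), ("Capital of France?", "Paris")]
baseline = score_template("Q: {input}", samples, fake_model, exact_match)
best, best_score = optimize("Q: {input}", samples, fake_model, exact_match, propose)
print(baseline, best_score, best)
```

In this toy run the original template scores poorly because the pretend model answers in full sentences, and the loop keeps the rewrite that asks for a one-word reply. The same structure applies whatever metric you plug in: generate candidates, score them on your samples, keep the winner.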

What inputs do I need to provide for optimization?

To use Advanced Prompt Optimization, you need to prepare a JSONL file where each line is a JSON object containing several required fields: a fixed version string (bedrock-2026-05-14), a templateId, the promptTemplate itself, and an evaluationSamples array with example inputs and their ground truth answers. You also need to specify an evaluation method in one of three ways: a custom LLM judge configuration, an AWS Lambda function ARN, or a natural language description in customEvaluationMetricLabel. If you use a custom LLM judge or Lambda, you must also supply a customEvaluationMetricLabel. Optional fields like steeringCriteria let you add extra guidance. The tool supports multimodal user inputs: PNG, JPG, and PDF files can be included in your prompt templates for tasks like document analysis or image description. The file must be valid JSONL, with each JSON object on a single line.
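As a rough illustration, here is how one such line could be produced in Python. The field names templateId, promptTemplate, evaluationSamples, customEvaluationMetricLabel, and steeringCriteria come from the description above; everything else is an assumption for the sake of the example, including the `version` key name, the keys inside each sample (`input`, `groundTruth`), the `{{ticket}}` placeholder syntax, and the sample values. Check the official schema before relying on any of it.

```python
import json

# One record per prompt template. Keys marked "assumed" are hypothetical.
record = {
    "version": "bedrock-2026-05-14",            # value from the article; key name assumed
    "templateId": "support-triage-v1",          # hypothetical ID
    "promptTemplate": "Classify this support ticket: {{ticket}}",  # placeholder syntax assumed
    "evaluationSamples": [
        # Inner keys "input" and "groundTruth" are assumed, not documented here.
        {"input": {"ticket": "My card was charged twice."}, "groundTruth": "billing"},
        {"input": {"ticket": "The app crashes on login."}, "groundTruth": "technical"},
    ],
    "customEvaluationMetricLabel": "accuracy",
    "steeringCriteria": "Answer with a single lowercase category.",  # value type assumed
}


def to_jsonl(records) -> str:
    """Serialize records as JSONL: one compact JSON object per line."""
    lines = []
    for rec in records:
        # json.dumps without `indent` never emits raw newlines,
        # so each object stays on a single line.
        lines.append(json.dumps(rec, ensure_ascii=False))
    return "\n".join(lines) + "\n"


jsonl_text = to_jsonl([record])
print(jsonl_text, end="")
```

Serializing with `json.dumps` rather than hand-writing the file avoids the most common JSONL mistakes: pretty-printed objects that span several lines and unescaped quotes or newlines inside prompt text.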

Can I optimize prompts for multimodal inputs like images and PDFs?

Yes! Amazon Bedrock Advanced Prompt Optimization fully supports multimodal inputs. You can include PNG, JPG, and PDF files as part of your prompt templates. This means you can optimize prompts for tasks such as document analysis, image description, or any other vision-language task that requires understanding visual information. Simply encode the images or PDFs as part of your evaluation samples in the JSONL file. The tool will treat these as inputs and optimize the prompt text around them, ensuring that the final prompt works well with the selected model's multimodal capabilities. This is particularly useful if you're migrating from a text-only model to one that supports vision, or if you want to improve the accuracy of your current model on visual tasks. The optimizer will test and refine the prompt using these multimodal examples, giving you confidence that the optimized prompt performs well across different input types.

How can I compare performance across multiple models?

One of the standout features of this tool is the ability to compare original and optimized prompts across up to five models at once. On the Advanced Prompt Optimization page of the Amazon Bedrock console, you select the models you want to test. If you're migrating from one model to another, you can set your current model as a baseline and then choose up to four other candidate models. If you're not switching models, just select your current model alone to see the before-and-after optimization comparison. The tool runs the optimization process for each selected model and shows side-by-side evaluation scores, cost estimates, and latency figures. This makes it easy to identify which model benefits most from prompt optimization, and whether the optimized prompts maintain performance (no regressions) on known use cases while improving underperforming tasks. You can then make an informed decision about which model and prompt combination to deploy.

What evaluation metrics can I use, and how do I provide them?

You can provide evaluation metrics in three flexible ways. First, you can define a custom LLM-as-a-judge by supplying a rubric prompt and a model ID inside the customLLMJConfig field. Second, you can provide an AWS Lambda function ARN via evaluationMetricLambdaArn that computes a custom metric based on the model's response and the ground truth. Third, you can give a short natural language description of what you want to optimize for (e.g., “accuracy”, “helpfulness”, “conciseness”) using the customEvaluationMetricLabel. If you choose the LLM-as-a-judge or Lambda approach, you must also set the customEvaluationMetricLabel. The tool uses your chosen metric to drive the optimization loop, ensuring the final prompt maximizes that specific quality. This allows you to tailor the optimization to your own business requirements, whether you prioritize factual correctness, low latency, or style adherence.
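For the Lambda route, a custom metric can be as small as a normalized exact-match check. The sketch below uses the standard Python Lambda handler signature, but the event and return shapes are assumptions: the keys `modelResponse`, `groundTruth`, and `score` are hypothetical, and the real payload contract should be taken from the Bedrock documentation.

```python
def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing sentence punctuation."""
    return " ".join(text.lower().split()).rstrip(".!?")


def lambda_handler(event, context):
    # Assumed event shape (hypothetical keys):
    #   {"modelResponse": "<model output>", "groundTruth": "<expected answer>"}
    response = normalize(event["modelResponse"])
    truth = normalize(event["groundTruth"])
    # Assumed return shape: a single numeric score, higher is better.
    return {"score": 1.0 if response == truth else 0.0}


# Local smoke test, no AWS resources needed.
print(lambda_handler({"modelResponse": " Paris. ", "groundTruth": "paris"}, None))
```

Keeping the scoring logic in a plain function like `normalize` makes it easy to unit-test locally before deploying, and to swap in a softer metric later if exact match proves too strict.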

What does the tool output after optimization?

After the optimization process completes, the tool provides a clear comparison between your original prompt template and the optimized one. For each model you selected, you'll see evaluation scores (based on your metric), estimated cost per inference, and latency measurements. These are shown side by side so you can instantly see the improvement—or check for regressions. The output includes both the prompt templates themselves and the metadata that explains why changes were made. This transparency helps you understand the optimization logic and adapt it further if needed. You can then export the optimized prompt and the results for integration into your own workflows. The tool effectively gives you a data-driven report that justifies the new prompt, making it easier to adopt in production or share with your team.
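Once you have the per-model figures, a short script can flag regressions and shortlist a candidate. The sketch below assumes you have copied the results into your own structure; the `Result` fields, the model names, and all numbers are made-up placeholders for illustration, not measured output from the tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    model: str
    original_score: float
    optimized_score: float
    cost_per_call: float   # estimated cost, arbitrary currency unit
    latency_ms: float


def regressions(results: List[Result], tolerance: float = 0.0) -> List[str]:
    """Models whose optimized prompt scores worse than the original."""
    return [r.model for r in results
            if r.optimized_score < r.original_score - tolerance]


def best_candidate(results: List[Result]) -> Optional[Result]:
    """Highest optimized score among non-regressing models; cheaper wins ties."""
    ok = [r for r in results if r.optimized_score >= r.original_score]
    return max(ok, key=lambda r: (r.optimized_score, -r.cost_per_call), default=None)


# Placeholder numbers purely for illustration.
results = [
    Result("model-a", 0.70, 0.85, 0.004, 900.0),
    Result("model-b", 0.80, 0.78, 0.002, 600.0),
    Result("model-c", 0.60, 0.85, 0.001, 1200.0),
]

print("Regressions:", regressions(results))
print("Best candidate:", best_candidate(results).model)
```

The tie-breaking rule here (score first, then cost) is one possible policy. If latency matters more for your workload, add it to the sort key.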

How do I get started with the tool?

Getting started is straightforward. Navigate to the Advanced Prompt Optimization page in the Amazon Bedrock console and click Create prompt optimization. You'll be prompted to select up to five inference models: include your current model as a baseline if you're migrating, or select just one if you're optimizing without switching. Next, prepare your prompt templates in JSONL format (as described in the inputs section) with example user data, ground truth answers, and an evaluation metric or rewriting guidance. Upload the file, configure your evaluation method, and launch the optimization. The tool will run and present results within minutes, depending on the complexity of your prompts and the number of models. No deep machine learning expertise is required, just clear data and a defined evaluation goal. Start small with one model and a simple evaluation metric.
