Get to know Claude’s new caching feature

The Claude API has introduced prompt caching, a feature that streamlines the processing of lengthy prompts. It’s particularly useful for users who frequently reuse specific segments of a long prompt, such as large documents or extensive datasets.

When sections are marked for caching, the API stores them temporarily, making sure subsequent requests within a set time frame do not require reprocessing of the same data.

Users can mark these reusable portions to keep operations involving complex or large-scale data inputs efficient. That makes the feature ideal for applications that repeatedly reference the same data, such as legal document analysis, financial reporting, or ongoing project management.

By reducing the need to reprocess large chunks of data, Claude’s prompt caching saves both time and computational resources.

Five minutes to faster, cheaper prompts

Claude stores cached prompts for up to five minutes. During this window, any prompt that reuses the cached data is processed far faster than it would be without caching. That speed translates directly into operational efficiency, particularly in environments where real-time or near-real-time data processing matters.

There’s also a financial upside. Prompts that leverage cached data are billed at approximately 10% of the cost of sending uncached tokens—a large reduction that’s especially useful for organizations processing large volumes of data.

How to enable Claude’s prompt caching

Activating Claude’s prompt caching feature is straightforward but requires a specific HTTP header to be passed during API calls. Users need to include the header “anthropic-beta: prompt-caching-2024-07-31” to enable the caching functionality.
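For context, here’s a minimal sketch of what such a call can look like, using Python’s requests library against the Messages API. The file name, model choice, and prompt text are placeholders; the reusable segment is marked with a cache_control block of type "ephemeral", which is how the beta identifies the portions to store.

```python
import os
import requests

# Minimal sketch: call the Messages API with the prompt-caching beta header.
# Assumes ANTHROPIC_API_KEY is set; contract.txt stands in for a large,
# reusable document (e.g. a lengthy legal contract).
reference_text = open("contract.txt").read()

response = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "anthropic-beta": "prompt-caching-2024-07-31",  # enables the beta
        "content-type": "application/json",
    },
    json={
        "model": "claude-3-5-sonnet-20240620",
        "max_tokens": 1024,
        # System blocks marked with cache_control are the reusable segments
        # the API stores for the five-minute window.
        "system": [
            {"type": "text", "text": "You are a contract-analysis assistant."},
            {
                "type": "text",
                "text": reference_text,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [
            {"role": "user", "content": "Summarize the termination clauses."}
        ],
    },
)
print(response.json()["usage"])
```

The first request writes the marked segment to the cache; subsequent requests with an identical prefix read from it instead, which is reflected in the usage figures the API returns.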

Implementing this header correctly is essential; it gives businesses full access to the caching feature from the moment they start using Claude.

Maximizing savings or risking costs? Understand prompt caching expenses

What it costs to use Claude’s cache

While prompt caching offers clear advantages in speed and cost, those benefits come with expenses of their own. Writing data to the cache is not free: cache writes are billed at a premium over standard input tokens, a cost that needs to be factored into budgeting and operational planning and weighed against the savings from reduced processing times and cheaper cache reads.

Organizations must assess the frequency and scale of their prompt usage to determine whether caching will deliver a net benefit. For high-frequency use cases, the investment in caching can quickly pay off through lower operational costs.

In scenarios where prompts are infrequent, however, the cost of writing to the cache might outweigh the benefits, leading to increased overall expenses.
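To see where that line falls, the sketch below estimates the break-even point for a reusable 100K-token context. It assumes roughly $3 per million uncached input tokens (in line with Claude 3.5 Sonnet’s pricing at launch), cache writes at a 25% premium, and cache reads at 10% of the base rate; treat the exact figures as assumptions to verify against current pricing.

```python
# Rough break-even sketch for a repeatedly used 100K-token context.
# Assumed rates in USD per million tokens; verify before relying on them.
BASE_INPUT = 3.00                  # uncached input tokens
CACHE_WRITE = BASE_INPUT * 1.25    # writing the prefix to the cache
CACHE_READ = BASE_INPUT * 0.10     # reading the cached prefix

CONTEXT_TOKENS = 100_000           # size of the reusable prefix

def cost_without_cache(calls: int) -> float:
    """Every call resends and reprocesses the full context at the base rate."""
    return calls * CONTEXT_TOKENS / 1e6 * BASE_INPUT

def cost_with_cache(calls: int) -> float:
    """First call writes the cache; later calls within the TTL read it."""
    if calls == 0:
        return 0.0
    write = CONTEXT_TOKENS / 1e6 * CACHE_WRITE
    reads = (calls - 1) * CONTEXT_TOKENS / 1e6 * CACHE_READ
    return write + reads

for calls in (1, 2, 5, 20):
    print(calls, round(cost_without_cache(calls), 3), round(cost_with_cache(calls), 3))
```

Under these assumptions a single call is more expensive with caching, but the balance flips from the second reuse onward.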

How frequent prompts can save you big

The Time to Live (TTL) of a cache is reset each time a cached prompt is reused within the five-minute window. As long as a prompt hits the cache within this period, the TTL extends, keeping the cached data alive and ready for reuse.

Applications that prompt more than once every five minutes can see major cost savings, as they repeatedly benefit from the speed and cost efficiency of cached data.

On the other hand, if an application prompts less frequently, the cached data expires and the benefits of caching are lost. That can actually raise costs, because the application keeps paying to write new data to the cache without capitalizing on the cheaper reads that frequent cache hits would provide.
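One practical way to see which side of that line an application falls on is to check the usage block returned with each response: a cache_creation_input_tokens figure indicates a write, while cache_read_input_tokens indicates a hit, which also refreshes the TTL. The sketch below interprets those fields; the example usage values are illustrative.

```python
# Sketch: tell cache writes from cache hits using the usage block returned
# by the Messages API. The example usage dicts below are illustrative.
def describe_cache_activity(usage: dict) -> str:
    wrote = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    if read:
        # A hit also resets the five-minute TTL, keeping the prefix warm.
        return f"cache hit: {read} tokens read from cache (TTL refreshed)"
    if wrote:
        return f"cache write: {wrote} tokens written (billed at the write rate)"
    return "no caching activity on this request"

first_call = {"input_tokens": 120, "cache_creation_input_tokens": 100_000,
              "cache_read_input_tokens": 0, "output_tokens": 350}
second_call = {"input_tokens": 95, "cache_creation_input_tokens": 0,
               "cache_read_input_tokens": 100_000, "output_tokens": 280}

print(describe_cache_activity(first_call))
print(describe_cache_activity(second_call))
```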

Businesses must carefully evaluate their usage patterns if they’re to optimize their caching strategy and avoid unnecessary expenses.

How caching costs differ between Claude and Google Gemini

Google Gemini offers a similar context caching feature, but the two systems have key differences that could influence a business’s choice between them. Both systems provide a way to cache reused data segments, helping to speed up processing and reduce costs.

Google Gemini’s caching pricing is tiered based on the version used. For Gemini 1.5 Pro, cache storage costs $4.50 per million tokens per hour, while Gemini 1.5 Flash offers a lower rate of $1 per million tokens per hour. In both tiers, cached input tokens are billed at roughly a quarter of the standard input rate, providing a clear financial incentive for high-volume users to leverage caching.

Claude’s approach to caching, by contrast, charges approximately 10% of the cost of uncached tokens for cached prompts within the five-minute window. Companies need to consider their specific use cases, data volumes, and prompt frequencies when deciding which system offers the most cost-effective solution.
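For a rough feel of how the two pricing models interact with usage patterns, the sketch below compares one hour of repeated queries against a one-million-token cached context. All rates are illustrative assumptions: $3 per million base input tokens for Claude (writes at a 25% premium, reads at 10%) and $3.50 per million base input tokens for Gemini 1.5 Pro (cached tokens at a quarter of that rate, plus the $4.50 per million tokens per hour storage fee quoted above). It also assumes queries arrive at least every five minutes, so a single Claude cache write stays warm, and it ignores Gemini’s one-time cache-creation cost. Check current price lists before drawing conclusions.

```python
# Illustrative comparison: a 1M-token context queried repeatedly for one hour.
# All prices are assumptions for the sketch, in USD per million tokens.
CONTEXT_MTOK = 1.0

# Claude (assumed): $3 base input, cache write +25%, cache read at 10%.
CLAUDE_WRITE, CLAUDE_READ = 3.00 * 1.25, 3.00 * 0.10

# Gemini 1.5 Pro (assumed): $3.50 base input, cached tokens at 1/4 of that,
# plus $4.50 per million tokens per hour of cache storage.
GEMINI_CACHED, GEMINI_STORAGE_HR = 3.50 * 0.25, 4.50

def claude_hourly_cost(queries_per_hour: int) -> float:
    # One write keeps the cache warm if hits arrive within every 5 minutes;
    # each query then reads the cached context at the reduced rate.
    return CONTEXT_MTOK * (CLAUDE_WRITE + queries_per_hour * CLAUDE_READ)

def gemini_hourly_cost(queries_per_hour: int) -> float:
    # Storage is billed per hour; each query reads the cached context at the
    # discounted cached-token rate.
    return CONTEXT_MTOK * (GEMINI_STORAGE_HR + queries_per_hour * GEMINI_CACHED)

for q in (12, 30, 60):
    print(q, round(claude_hourly_cost(q), 2), round(gemini_hourly_cost(q), 2))
```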

Claude’s caching won’t cut down on your HTTP traffic

Despite the efficiencies introduced by prompt caching, one area where Claude’s implementation does not reduce overhead is HTTP traffic. Even with cached prompts, the full context must still be transmitted with each API call. For example, if a 1MB context is cached, the application still needs to send a 1MB HTTP request every time the prompt is called.

While prompt caching reduces computational costs, it does not alleviate network load, and any considerations related to bandwidth or data transfer limits should remain part of the planning process.

The impact of this is minimal when compared to the processing overhead saved by using cached prompts. Businesses with heavy API usage, however, should be aware that their HTTP traffic volume will stay the same, even as processing times decrease.
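A quick way to verify this is to measure the serialized request body before it is sent: the size is the same whether or not the prefix has been cached. A minimal sketch, with a placeholder string standing in for the large document:

```python
import json

# Sketch: the request body carries the full context on every call, cached or not.
reference_text = "x" * 1_000_000  # stand-in for a ~1MB document

payload = {
    "model": "claude-3-5-sonnet-20240620",
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": reference_text,
         "cache_control": {"type": "ephemeral"}},
    ],
    "messages": [{"role": "user", "content": "Summarize the key points."}],
}

body = json.dumps(payload).encode("utf-8")
# Roughly 1MB is transmitted on this request and on every subsequent one,
# even when the server answers from the cache.
print(f"request body size: {len(body) / 1_000_000:.2f} MB")
```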

What fine-tuning really means for Claude

One of the more confusing aspects of the announcement surrounding Claude’s prompt caching is the terminology used, particularly the term “fine-tune.” In machine learning, fine-tuning usually refers to the process of adjusting a model’s parameters to improve its performance on a specific task. Claude offers this feature through AWS Bedrock, where users can fine-tune the model based on their unique datasets and requirements.

Prompt caching, on the other hand, is something entirely different: it stores and reuses specific prompt data to save on processing time and costs, without altering the model itself. Using “fine-tune” in the context of prompt caching has led to some confusion, as it may mislead users into thinking they are directly adjusting the model’s behavior.

Tim Boesen

August 22, 2024
