Caching

Overview

Caching stores model output so that repeated API calls can be served from a local cache, saving both time and expense. Caching is also often useful during development: for example, when you are iterating on a scorer you may want model outputs served from the cache, both to save time and for increased determinism.

There are two types of caching available: Inspect local caching and provider-level caching. We’ll first describe local caching (which works for all models), then cover provider caching, which currently works only with Anthropic models.

Caching Basics

Use the cache parameter on calls to generate() to activate the use of the cache. The keys for caching (what determines if a request can be fulfilled from the cache) are as follows:

  • Model name and base URL (e.g. openai/gpt-4-turbo)
  • Model prompt (i.e. message history)
  • Epoch number (for ensuring distinct generations per epoch)
  • Generate configuration (e.g. temperature, top_p, etc.)
  • Active tools and tool_choice

If all of these inputs are identical, then the model response will be served from the cache. By default, model responses are cached for 1 week (see Cache Policy below for details on customising this).

For example, here we are iterating on our self-critique template, so we cache the main call to generate():

@task
def theory_of_mind():
    return Task(
        dataset=example_dataset("theory_of_mind"),
        plan=[
            chain_of_thought(),
            generate(cache=True),
            self_critique(CRITIQUE_TEMPLATE)
        ],
        scorer=model_graded_fact(),
    )

You can similarly do this with the generate function passed into a Solver:

@solver
def custom_solver(cache):

  async def solve(state, generate):

    # (custom solver logic prior to generate)

    return await generate(state, cache=cache)

  return solve

You don’t strictly need to provide a cache argument for a custom solver that uses caching, but it’s generally good practice to enable users of the function to control caching behaviour.
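
For instance, callers of the custom_solver() defined above (an illustrative solver, not part of the Inspect API) can then decide per use whether its generations are cached:

solver = custom_solver(cache=True)    # serve repeated generations from the cache
# solver = custom_solver(cache=False) # always call the model (e.g. while debugging the solver)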

You can also use caching with lower-level generate() calls (e.g. on a model instance you have obtained with get_model()). For example:

model = get_model("anthropic/claude-3-opus-20240229")
output = await model.generate(input, cache=True)

Model Versions

The model name (e.g. openai/gpt-4-turbo) is used as part of the cache key. Note though that many model names are aliases to specific model versions. For example, gpt-4 and gpt-4-turbo may resolve to different model versions over time as updates are released.

Because an alias stays the same even when the underlying model is updated, a cached response may have been generated by an older model version. If you want caches to be invalidated when model versions change, use an explicitly versioned model name. For example:

$ inspect eval ctf.py --model openai/gpt-4-turbo-2024-04-09

If you do this, then when a new version of gpt-4-turbo is deployed, switching to its versioned name will result in fresh calls to the model rather than responses being served from the older version’s cache.
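
The same applies if you are specifying the model via the eval() function; for example (mirroring the CLI command above, and assuming the model argument of eval()):

eval("ctf.py", model="openai/gpt-4-turbo-2024-04-09")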

Cache Policy

By default, if you specify cache=True then cached entries will expire after 1 week. You can customise this by passing a CachePolicy rather than a boolean. For example:

cache = CachePolicy(expiry="3h")
cache = CachePolicy(expiry="4D")
cache = CachePolicy(expiry="2W")
cache = CachePolicy(expiry="3M")

You can use s (seconds), m (minutes), h (hours), D (days), W (weeks), M (months), and Y (years) as abbreviations for expiry values.

If you want the cache to never expire, specify None. For example:

cache = CachePolicy(expiry = None)

You can also define scopes for cache expiration (e.g. cache for a specific task or usage pattern). Use the scopes parameter to add named scopes to the cache key:

cache = CachePolicy(
    expiry="1M",
    scopes={"role": "attacker", "team": "red"}
)

As noted above, caching is by default done per epoch (i.e. each epoch has its own cache scope). You can disable the default behaviour by setting per_epoch=False. For example:

cache = CachePolicy(per_epoch=False)
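
A CachePolicy can be passed wherever a boolean cache argument is accepted; for example, a sketch combining the options above with the lower-level model interface shown earlier:

model = get_model("anthropic/claude-3-opus-20240229")
output = await model.generate(
    input,
    cache=CachePolicy(expiry=None, per_epoch=False),
)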

Management

Use the inspect cache command to view the current contents of the cache, prune expired entries, or clear entries entirely. For example:

# list the current contents of the cache
$ inspect cache list

# clear the cache (globally or by model)
$ inspect cache clear
$ inspect cache clear --model openai/gpt-4-turbo-2024-04-09

# prune expired entries from the cache
$ inspect cache list --pruneable
$ inspect cache prune
$ inspect cache prune --model openai/gpt-4-turbo-2024-04-09

See inspect cache --help for further details on management commands.

Cache Directory

By default the model generation cache is stored in the system default location for user cache files (e.g. XDG_CACHE_HOME on Linux). You can override this and specify a different directory for cache files using the INSPECT_CACHE_DIR environment variable. For example:

$ export INSPECT_CACHE_DIR=/tmp/inspect-cache

Provider Caching

Model providers may also offer prompt caching features to optimise cost and performance for multi-turn conversations. Currently, Inspect includes support for Anthropic Prompt Caching and will extend this support to other providers over time as they add caching to their APIs.

Provider prompt caching is controlled by the cache-prompt generation config option. The default value for cache-prompt is "auto", which enables prompt caching automatically if tool definitions are included in the request. Use true and false to force caching on or off. For example:

inspect eval ctf.py --cache-prompt=auto  # enable if tools defined
inspect eval ctf.py --cache-prompt=true  # force caching on
inspect eval ctf.py --cache-prompt=false # force caching off

Or with the eval() function:

eval("ctf.py", cache_prompt=True)

Cache Scope

Providers typically offer various ways to customise the scope of cache usage. By default, the Inspect cache-prompt option attempts to make maximum use of provider caches (in the Anthropic implementation, system messages, tool definitions, and all messages up to the last user message are included in the cache).

Currently there is no way to customise the Anthropic cache lifetime (it defaults to 5 minutes); once this becomes possible, it will also be exposed in the Inspect API.

Usage Reporting

When using provider caching, model token usage is reported with four distinct values rather than the usual input and output counts. For example:

13,684 tokens [I: 22, CW: 1,711, CR: 11,442, O: 509]

Where the prefixes on reported token counts stand for:

  • I: Input tokens
  • CW: Input token cache writes
  • CR: Input token cache reads
  • O: Output tokens

Input token cache writes typically cost somewhat more than regular input tokens (in the case of Anthropic, roughly 25% more), but cache reads cost substantially less (for Anthropic, 90% less), so for the example above there would have been substantial savings in both cost and execution time. See the Anthropic Documentation for additional details.
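
To make the saving concrete, here is a back-of-the-envelope calculation for the example report above, assuming the multipliers just quoted (cache writes billed at 1.25x the regular input token price, cache reads at 0.1x); the exact figures depend on the provider's current pricing:

# token counts from the example usage report above
input_tokens = 22
cache_writes = 1_711
cache_reads = 11_442

# relative cost in "regular input token" equivalents
cached = input_tokens + 1.25 * cache_writes + 0.10 * cache_reads
uncached = input_tokens + cache_writes + cache_reads  # everything billed as normal input

print(f"cached:   {cached:,.0f} token-equivalents")    # ~3,305
print(f"uncached: {uncached:,.0f} token-equivalents")  # 13,175
print(f"saving:   {1 - cached / uncached:.0%}")        # ~75%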