Caching
Overview
Caching enables you to cache model output to reduce the number of API calls made, saving both time and expense. Caching is also often useful during development: for example, when you are iterating on a scorer you may want model outputs served from the cache to save time and increase determinism.
There are two types of caching available: Inspect local caching and provider-level caching. We'll first describe local caching (which works for all models) and then cover provider caching, which currently works only for Anthropic models.
Caching Basics
Use the cache parameter on calls to generate() to activate the use of the cache. The keys for caching (what determines if a request can be fulfilled from the cache) are as follows:

- Model name and base URL (e.g. openai/gpt-4-turbo)
- Model prompt (i.e. message history)
- Epoch number (for ensuring distinct generations per epoch)
- Generate configuration (e.g. temperature, top_p, etc.)
- Active tools and tool_choice
If all of these inputs are identical, then the model response will be served from the cache. By default, model responses are cached for 1 week (see Cache Policy below for details on customising this).
For example, here we are iterating on our self critique template, so we cache the main call to generate():
@task
def theory_of_mind():
    return Task(
        dataset=example_dataset("theory_of_mind"),
        solver=[
            chain_of_thought(),
            generate(cache = True),
            self_critique(CRITIQUE_TEMPLATE)
        ],
        scorer=model_graded_fact(),
    )
You can similarly do this with the generate function passed into a Solver:
@solver
def custom_solver(cache):

    async def solve(state, generate):
        # (custom solver logic prior to generate)
        return await generate(state, cache)

    return solve
You don’t strictly need to provide a cache
argument for a custom solver that uses caching, but it’s generally good practice to enable users of the function to control caching behaviour.
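For instance, a user of the custom_solver() above could then pass either a boolean or a CachePolicy when constructing a task's solver list (a sketch; the surrounding task wiring is illustrative):

solver = [custom_solver(cache=True)]                       # default 1 week expiry
solver = [custom_solver(cache=CachePolicy(expiry="1D"))]   # custom 1 day expiry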
You can also use caching with lower-level generate()
calls (e.g. a model instance you have obtained with get_model()
. For example:
model = get_model("anthropic/claude-3-opus-20240229")
output = await model.generate(input, cache = True)
Model Versions
The model name (e.g. openai/gpt-4-turbo) is used as part of the cache key. Note though that many model names are aliases to specific model versions. For example, gpt-4 and gpt-4-turbo may resolve to different versions over time as updates are released.
If you want to invalidate caches for updated model versions, it’s much better to use an explicitly versioned model name. For example:
$ inspect eval ctf.py --model openai/gpt-4-turbo-2024-04-09
If you do this, then when a new version of gpt-4-turbo is deployed (and you update the model name accordingly) calls will go to the model rather than being resolved from the cache.
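The same consideration applies when running evaluations from Python; for example (a sketch using the eval() function, which also accepts a model argument):

eval("ctf.py", model="openai/gpt-4-turbo-2024-04-09")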
Cache Policy
By default, if you specify cache = True
then the cache will expire in 1 week. You can customise this by passing a CachePolicy
rather than a boolean. For example:
= CachePolicy(expiry="3h")
cache = CachePolicy(expiry="4D")
cache = CachePolicy(expiry="2W")
cache = CachePolicy(expiry="3M") cache
You can use s, m, h, D, W, M, and Y as abbreviations for expiry values.
If you want the cache to never expire, specify None
. For example:
cache = CachePolicy(expiry = None)
You can also define scopes for cache expiration (e.g. cache for a specific task or usage pattern). Use the scopes
parameter to add named scopes to the cache key:
cache = CachePolicy(
    expiry="1M",
    scopes={"role": "attacker", "team": "red"}
)
As noted above, caching is by default done per epoch (i.e. each epoch has its own cache scope). You can disable the default behaviour by setting per_epoch=False
. For example:
cache = CachePolicy(per_epoch=False)
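Whichever options you choose, the resulting policy is passed wherever a cache argument is accepted. A brief sketch, reusing the generate() solver and model generation calls shown earlier:

policy = CachePolicy(expiry="2W", per_epoch=False)

# as a solver step in a task's solver list
solver = [chain_of_thought(), generate(cache=policy)]

# or on a direct model generation call
output = await model.generate(input, cache=policy)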
Management
Use the inspect cache command to view the current contents of the cache, prune expired entries, or clear entries entirely. For example:
# list the current contents of the cache
$ inspect cache list
# clear the cache (globally or by model)
$ inspect cache clear
$ inspect cache clear --model openai/gpt-4-turbo-2024-04-09
# prune expired entries from the cache
$ inspect cache list --pruneable
$ inspect cache prune
$ inspect cache prune --model openai/gpt-4-turbo-2024-04-09
See inspect cache --help
for further details on management commands.
Cache Directory
By default the model generation cache is stored in the system default location for user cache files (e.g. XDG_CACHE_HOME
on Linux). You can override this and specify a different directory for cache files using the INSPECT_CACHE_DIR
environment variable. For example:
$ export INSPECT_CACHE_DIR=/tmp/inspect-cache
Provider Caching
Model providers may also provide prompt caching features to optimise cost and performance for multi-turn conversations. Currently, Inspect includes support for Anthropic Prompt Caching and will extend this support to other providers over time as they add caching to their APIs.
Provider prompt caching is controlled by the cache-prompt
generation config option. The default value for cache-prompt
is "auto"
, which enables prompt caching automatically if tool definitions are included in the request. Use true
and false
to force caching on or off. For example:
inspect eval ctf.py --cache-prompt=auto # enable if tools defined
inspect eval ctf.py --cache-prompt=true # force caching on
inspect eval ctf.py --cache-prompt=false # force caching off
Or with the eval()
function:
eval("ctf.py", cache_prompt=True)
Cache Scope
Providers will typically provide various means of customising the scope of cache usage. The Inspect cache-prompt
option will by default attempt to make maximum use of provider caches (in the Anthropic implementation system messages, tool definitions, and all messages up to the last user message are included in the cache).
Currently there is no way to customise the Anthropic cache lifetime (it defaults to 5 minutes)—once this becomes possible this will also be exposed in the Inspect API.
Usage Reporting
When using provider caching, model token usage will be reported with 4 distinct values rather than the normal input and output. For example:
13,684 tokens [I: 22, CW: 1,711, CR: 11,442, O: 509]
Where the prefixes on reported token counts stand for:
| Prefix | Meaning                  |
|--------|--------------------------|
| I      | Input tokens             |
| CW     | Input token cache writes |
| CR     | Input token cache reads  |
| O      | Output tokens            |
Input token cache writes will typically cost more (in the case of Anthropic roughly 25% more) but cache reads cost substantially less (for Anthropic 90% less), so for the example above there would have been substantial savings in cost and execution time. See the Anthropic Documentation for additional details.
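As a rough illustration of the example above using those multipliers (an approximation, not an exact billing calculation):

# input-side cost expressed in base input-token equivalents
uncached = 22 + 1_711 + 11_442                 # everything billed at the base input rate
cached   = 22 + 1.25 * 1_711 + 0.10 * 11_442   # cache writes at 1.25x, cache reads at 0.1x
# cached ≈ 3,305 vs. uncached = 13,175, i.e. roughly a 75% reduction in input token cost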