Cache Tuning

Cache behavior and similarity tuning

SemanticCache uses a two-stage retrieval model so you can trade off recall and precision without changing application code.

On cache misses handled by SemanticCacheMiddleware, the embedding computed during get() is now reused by put() for the same request. This removes one embedder call per miss, reducing latency and external embedding API cost.

When you pass a duck-typed cache (not a real SemanticCache), implement put(..., *, query_embedding=...) when you want that reuse. At startup the middleware inspects cache.put and only omits query_embedding if the signature does not accept it, so storage does not rely on fragile runtime TypeError string checks.
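
The signature check described above can be approximated with the standard library. This is a minimal sketch, not the middleware's actual code: accepts_query_embedding, PlainCache, and ReusingCache are hypothetical names used only for illustration.

```python
import inspect
from typing import Any, Callable


def accepts_query_embedding(put: Callable[..., Any]) -> bool:
    """Return True if `put` can be called with a query_embedding keyword."""
    try:
        params = inspect.signature(put).parameters
    except (TypeError, ValueError):
        # Some callables expose no signature; assume no support.
        return False
    for name, param in params.items():
        # A **kwargs catch-all also accepts the keyword.
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            return True
        if name == "query_embedding" and param.kind in (
            inspect.Parameter.KEYWORD_ONLY,
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
        ):
            return True
    return False


class PlainCache:  # hypothetical duck-typed cache without embedding reuse
    async def put(self, query: str, response: dict) -> None:
        ...


class ReusingCache:  # hypothetical duck-typed cache that opts in to reuse
    async def put(self, query: str, response: dict, *, query_embedding=None) -> None:
        ...
```

Inspecting once at startup means the decision is made from the declared signature rather than from the text of an exception raised during a request.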

Model-scoped storage

SemanticCache.get and SemanticCache.put accept an optional model string (for example an LLM id from JSON or a header). The value is normalized (stripped; None or blank becomes the default bucket, model_key=""). Lookup and writes are scoped:

Postgres: Rows carry a model_key column; ANN search only considers rows for that bucket.
Redis: Response keys include short hashes of the scope and model buckets plus the row id so payloads never collide across tenants or models for the same embedder row. Enable Redis by setting SEMANTIC_CACHE_REDIS_URI and installing the redis extra: pip install "fastapi-semcache[redis]" (provides redis>=7.4.0; omitted from the core wheel).

Pass the same model on get and put for a given upstream route.
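
The normalization rule can be pictured as follows. This is an illustrative sketch only; normalize_model is a hypothetical helper name, not a documented library function.

```python
from typing import Optional


def normalize_model(model: Optional[str]) -> str:
    """Map a caller-supplied model value to its storage bucket (model_key)."""
    if model is None:
        return ""  # default bucket
    # Stripping makes " gpt-5.4-mini " and "gpt-5.4-mini" share one bucket;
    # a blank string collapses to the default bucket as well.
    return model.strip()
```

Because get and put both go through the same normalization, passing the same model value on both calls lands reads and writes in the same bucket.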

Tenant and namespace scope (isolation)

Semantic matches are keyed by middleware lookup text and model. In middleware mode, lookup text includes HTTP method, normalized path, model value, and extracted semantic query, then tenant scope is applied separately. This avoids accidental cross-endpoint reuse for semantically similar prompts.

Default (single-tenant): SEMANTIC_CACHE_REQUIRE_CACHE_SCOPE (CacheSettings.require_cache_scope) is false. Lookups and writes use one shared bucket (scope_key = '') without requiring a tenant scope on each request.

Multi-tenant (require_cache_scope=true): Set SEMANTIC_CACHE_REQUIRE_CACHE_SCOPE=true when you partition cache rows per tenant or namespace. Then:

SemanticCache.get(..., scope=...) / put(..., scope=...) require a non-empty normalized scope string. Missing scope yields a cache miss and skips put (no cross-tenant writes).
Pass the same scope on get and put as you use for tenant, org, or user partition (opaque string from your auth layer).
Use resolve_cache_scope to mirror middleware rules in custom integrations.
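
A rough model of the scope gate, for readers wiring a custom integration. The helper names below (normalize_scope, gate_scope) are hypothetical, and the sketch assumes that single-tenant mode always maps to the shared bucket; check resolve_cache_scope for the library's actual behavior.

```python
from typing import Optional, Union


def normalize_scope(scope: Union[str, int, None]) -> str:
    """Strip string scopes and stringify integer tenant ids."""
    if scope is None:
        return ""
    return str(scope).strip()


def gate_scope(scope: Union[str, int, None], *, require_cache_scope: bool) -> Optional[str]:
    """Return the scope_key to use, or None when the cache must be bypassed."""
    if not require_cache_scope:
        return ""  # assumption: single-tenant mode uses the one shared bucket
    normalized = normalize_scope(scope)
    # A missing or blank scope means: report a miss and skip the write.
    return normalized or None
```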

Middleware: When require_cache_scope is true, pass extract_scope ((request, body) -> str | None) that resolves scope from authenticated identity. Do not use default_extract_scope_from_request_context (header X-Semantic-Cache-Scope and JSON cache_scope / tenant_id) for production multi-tenant APIs; clients can forge those values. A concrete helper is trusted_extract_scope_from_server_side (semanticcache.middleware.core.extractors), which reads only request.state.cache_scope or request.state.tenant_id after your auth middleware populates them:

from semanticcache.middleware.core.extractors import trusted_extract_scope_from_server_side

async def extract_scope(request, body: bytes) -> str | None:
    return await trusted_extract_scope_from_server_side(request)

For privacy and HTTP cache-safety alignment, middleware also skips cache writes when upstream responds with Cache-Control: no-store, Cache-Control: private, or any Set-Cookie header.
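
As a sketch of that write-skip rule (not the middleware's own implementation; should_skip_cache_write is a hypothetical name), the check amounts to scanning the upstream response headers:

```python
from typing import Iterable, Tuple


def should_skip_cache_write(headers: Iterable[Tuple[str, str]]) -> bool:
    """Return True when upstream headers mark the response as uncacheable."""
    for name, value in headers:
        lowered = name.strip().lower()
        if lowered == "set-cookie":
            return True
        if lowered == "cache-control":
            # Directives are comma-separated; any argument follows "=".
            directives = {
                part.split("=", 1)[0].strip().lower()
                for part in value.split(",")
            }
            if directives & {"no-store", "private"}:
                return True
    return False
```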

Middleware also bypasses cache reads and writes for requests that include an Authorization header unless you explicitly opt in with SEMANTIC_CACHE_CACHE_AUTHORIZED_REQUESTS=true (CacheSettings.cache_authorized_requests). This default reduces accidental reuse of per-user responses across authenticated callers. The opt-in matters most for reverse-proxy deployments, where upstream APIs often require Authorization: with the default in place, those requests skip the cache entirely and never write cache entries.

Prompt-safe failure logs

When cache reads fail, SemanticCacheMiddleware logs route, scope, request_id, the exception metadata, and a keyed digest of the composed lookup text instead of logging any prompt-derived text directly.

Set SEMANTIC_CACHE_LOG_DIGEST_KEY (CacheSettings.log_digest_key) when you want those digests to remain stable across process restarts. When unset, the library falls back to a per-process random key, which still avoids prompt leakage but limits digest correlation to the current process lifetime.
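
One common way to build such a keyed digest is HMAC-SHA256. The sketch below assumes that construction and a 32-byte random fallback key; the library's exact algorithm and key handling are not specified here, and lookup_digest is a hypothetical name.

```python
import hashlib
import hmac
import secrets
from typing import Optional

# Fallback key generated once per process; digests made with it cannot be
# correlated across restarts.
_PROCESS_KEY = secrets.token_bytes(32)


def lookup_digest(lookup_text: str, configured_key: Optional[str] = None) -> str:
    """Return a hex digest that identifies lookup_text without revealing it."""
    key = configured_key.encode("utf-8") if configured_key else _PROCESS_KEY
    return hmac.new(key, lookup_text.encode("utf-8"), hashlib.sha256).hexdigest()
```

With a configured key, the same lookup text produces the same digest in every process, which is what lets you correlate failures across restarts without logging the prompt.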

Request and response body size limits

SemanticCacheMiddleware buffers the full request body and the full downstream response. To cap memory use and reduce abuse from huge payloads, use max_request_body_bytes and max_response_body_bytes. Each defaults to DEFAULT_MAX_BODY_BYTES (10 MiB). When a client request exceeds the request cap, the middleware answers with HTTP 413 before the route runs. When the upstream response would exceed the response cap, the client receives HTTP 502 (the handler may still have run; the middleware does not forward an oversized body). Set either argument to None to disable that limit (not recommended in untrusted or high-concurrency production setups). The same options are accepted by create_semantic_cache_proxy_app via keyword arguments.
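
The capping idea can be sketched independently of ASGI. BodyTooLarge and read_capped are hypothetical names; the real middleware maps the overflow to HTTP 413 (request) or HTTP 502 (response) as described above.

```python
from typing import Iterable, Optional

DEFAULT_MAX_BODY_BYTES = 10 * 1024 * 1024  # 10 MiB, as documented above


class BodyTooLarge(Exception):
    """Raised when buffered data would exceed the configured cap."""


def read_capped(chunks: Iterable[bytes], max_bytes: Optional[int] = DEFAULT_MAX_BODY_BYTES) -> bytes:
    """Buffer chunks into one body, refusing to grow past max_bytes."""
    buffered = bytearray()
    for chunk in chunks:
        # Check before appending so memory never holds more than the cap.
        if max_bytes is not None and len(buffered) + len(chunk) > max_bytes:
            raise BodyTooLarge(f"body exceeds {max_bytes} bytes")
        buffered.extend(chunk)
    return bytes(buffered)
```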

Response delivery on miss: SEMANTIC_CACHE_RESPONSE_MODE (CacheSettings.response_mode) is buffered by default (full body buffered before the client sees the response). Set to tee to stream chunks to the client on cache misses while still accumulating the body for a post-stream cache write (when within size limits and validation passes). When the middleware uses SemanticCache.settings for the scope gate, response_mode is read from that same object; otherwise it comes from the middleware cache_settings source (see middleware constructor docs).

Response delivery on hit: SEMANTIC_CACHE_HIT_RESPONSE_MODE (CacheSettings.hit_response_mode) controls how cached responses are emitted to the client. When response_mode="tee" and this field is not explicitly set, it automatically defaults to "stream" so that hit and miss delivery are symmetric. Set to "single" to return the cached body as a plain JSONResponse regardless of response_mode.

When hit_response_mode="stream" (including the automatic tee default):

No content-length header is set (matching streaming miss framing).
The body is serialised to UTF-8 JSON bytes and emitted in one or more chunks.
SEMANTIC_CACHE_HIT_STREAM_CHUNK_SIZE (CacheSettings.hit_stream_chunk_size, default 0) controls splitting. 0 sends the entire body as a single chunk (sufficient for most clients). Set a positive integer to split the body into sequential chunks of at most that many bytes, giving clients that process tokens incrementally a progressive delivery experience.
Security-sensitive headers (set-cookie, authorization, www-authenticate, proxy-authenticate) and content-length are always stripped from the replayed headers, regardless of what was stored.
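
The chunking and header-stripping rules above can be sketched as two small helpers. The names iter_hit_chunks and replay_headers are hypothetical, and the sketch uses json.dumps defaults for serialization, which may differ from the library's encoder.

```python
import json
from typing import Dict, Iterator

# Headers that are never replayed from a cached entry.
STRIPPED_HEADERS = frozenset(
    {"set-cookie", "authorization", "www-authenticate", "proxy-authenticate", "content-length"}
)


def iter_hit_chunks(payload: dict, chunk_size: int = 0) -> Iterator[bytes]:
    """Yield the cached body as UTF-8 JSON bytes, optionally split into chunks."""
    body = json.dumps(payload).encode("utf-8")
    if chunk_size <= 0:
        yield body  # 0 means one chunk containing the whole body
        return
    for start in range(0, len(body), chunk_size):
        yield body[start:start + chunk_size]


def replay_headers(stored: Dict[str, str]) -> Dict[str, str]:
    """Drop security-sensitive headers and content-length before replay."""
    return {name: value for name, value in stored.items() if name.lower() not in STRIPPED_HEADERS}
```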

Minimal symmetric streaming configuration (hit mode is inferred automatically):

SEMANTIC_CACHE_RESPONSE_MODE=tee
# optional: split into small chunks (e.g. 64 bytes) for token-by-token clients
# SEMANTIC_CACHE_HIT_STREAM_CHUNK_SIZE=64

To force single-response hits even in tee mode:

SEMANTIC_CACHE_RESPONSE_MODE=tee
SEMANTIC_CACHE_HIT_RESPONSE_MODE=single

Response shape validation

Middleware stores successful responses only when the body parses as a JSON object. For provider-specific APIs, add validate_response to reject malformed or mismatched objects before they can become cache entries. The validator can be sync or async and receives ResponseValidationContext with the route request, raw request body, upstream response, parsed payload, model, and scope.

from semanticcache import ResponseValidationContext, SemanticCacheMiddleware


def validate_response(context: ResponseValidationContext) -> bool:
    if context.request.url.path == "/v1/chat/completions":
        return (
            context.model == "gpt-5.4-mini"
            and isinstance(context.payload.get("choices"), list)
        )
    return True


app.add_middleware(
    SemanticCacheMiddleware,
    cache=cache,
    validate_response=validate_response,
)

Returning False, or raising from the validator, skips the cache write while still returning the upstream response to the caller.

Unreplayable similarity hits

When ANN search returns a row but the stored JSON is not a replayable response (for example the replay marker is set but body is not a JSON object, or the marker and body / meta envelope are missing), SemanticCacheMiddleware logs a warning, treats the lookup as a miss, and calls downstream. If the cache backend is SemanticCache and CacheResult includes cache_entry_id, the middleware also deletes that Postgres row (and the matching Redis key when Redis is enabled) so one bad row cannot force repeated misses.

Trust boundary: Header and JSON scope values are only safe isolation boundaries when your deployment sets them (for example from verified JWT claims at the edge) or overwrites untrusted client fields before they reach this middleware. Otherwise a client can pick another tenant id and probe for cache hits; always derive scope from authenticated identity in multi-tenant systems.

Settings alignment: SemanticCacheMiddleware applies require_cache_scope, response_mode, log_digest_key, and the gate for “missing scope” using SemanticCache.settings when the cache argument is a real SemanticCache instance. cache_settings still controls circuit breaker, flight-lock limits, and the cache_authorized_requests gate. When both cache_settings and cache.settings are supplied and disagree on require_cache_scope, cache_authorized_requests, or log_digest_key, the middleware raises ValueError at startup. Pass a single aligned CacheSettings object to both SemanticCache and the middleware, or omit cache_settings from the middleware to let cache.settings take full effect.

Integer tenant_id (JSON number) is accepted and normalized to a string for storage keys.

Single-tenant (default): Leave SEMANTIC_CACHE_REQUIRE_CACHE_SCOPE=false (default) when one customer owns the process and dedicated cache storage, or when you intentionally share one global cache bucket.

Middleware in-flight lock keys also include the resolved scope string so concurrent misses for different tenants are not serialized together.

Scope key and Redis layout

scope_key affects Postgres matching and Redis key segments (an extra scope bucket hash appears before the model segment). Rows with scope_key = '' are looked up only when require_cache_scope is false (one shared bucket). When require_cache_scope is true, each normalized scope string is its own partition.

Exact-match fast path (Redis only)

When Redis is enabled, SemanticCache.get checks for an exact text match in Redis before generating an embedding. A hit on this path avoids the expensive steps: no embedder call, no pgvector query.

On every put, the library stores a small entry under a separate key:

semanticcache:resp:<embedder_prefix>:exact:<scope_bucket>:<model_bucket>:<sha256_of_query>
{"id": <postgres_row_id>}

On get, if that key is found, the row id is used to look up the response blob directly:

Redis hit: The response blob (semanticcache:resp:<prefix>:<scope>:<model>:<id>) is returned immediately. Embedding is skipped entirely.
Redis blob expired/evicted: Falls back to a single Postgres id lookup (no vector scan).
Both Redis and Postgres miss (row deleted or expired): Proceeds to the normal embedding and ANN search path.
Key absent: No entry was stored for this exact text; proceeds to embedding and ANN search as normal.

The query text is SHA-256 hashed before being used in the key name, so raw prompt text never appears in Redis key strings.
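
A sketch of how such a key could be assembled, following the layout shown earlier in this section. It assumes UTF-8 encoding and a hex digest, and it takes the scope and model bucket hashes as ready-made inputs because their derivation is not documented here; exact_match_key is a hypothetical name.

```python
import hashlib


def exact_match_key(embedder_prefix: str, scope_bucket: str, model_bucket: str, query: str) -> str:
    """Build the exact-match pointer key for a composed query text."""
    # Only the digest of the query goes into the key, never the text itself.
    query_hash = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return f"semanticcache:resp:{embedder_prefix}:exact:{scope_bucket}:{model_bucket}:{query_hash}"
```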

This fast path is only active when SEMANTIC_CACHE_REDIS_URI is set. In Postgres-only mode the exact-match check is skipped and the flow proceeds directly to embedding.

Similarity reported: Exact-match hits report similarity=1.0 since the match is on the precise composed query text.

Stage 1: nearest-neighbor search (top-k)

The first stage embeds the query and runs a pgvector similarity search:

SEMANTIC_CACHE_THRESHOLD (CacheSettings.threshold):
Primary similarity gate in [0.0, 1.0].
Candidates below this value are discarded in Postgres before LIMIT so the top-k cap applies only among rows that meet the threshold.
SEMANTIC_CACHE_TOP_K_CANDIDATES (CacheSettings.top_k_candidates):
Maximum number of nearest neighbors returned from pgvector after applying the primary threshold.
Defaults to 1 for single-hit behavior.

After this stage you get up to top_k_candidates CacheEntry rows ordered from highest to lowest similarity, all with similarity >= threshold.
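
The library performs this stage in Postgres; the in-memory model below only illustrates the ordering of operations (threshold filter first, then the top-k cap). stage_one and the (row_id, similarity) tuples are hypothetical stand-ins for the real query and CacheEntry rows.

```python
from typing import List, Tuple

Candidate = Tuple[str, float]  # (row id, similarity)


def stage_one(scored_rows: List[Candidate], threshold: float, top_k_candidates: int) -> List[Candidate]:
    """Keep rows at or above threshold, best first, capped at top-k."""
    top_k = max(1, top_k_candidates)  # values below 1 are treated as 1
    passing = [row for row in scored_rows if row[1] >= threshold]
    passing.sort(key=lambda row: row[1], reverse=True)
    return passing[:top_k]
```

For example, with rows scored 0.95, 0.70, 0.85, and 0.82, a threshold of 0.80 and top_k_candidates=2 keeps the 0.95 and 0.85 rows; the 0.82 row passes the threshold but falls outside the cap.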

pgvector HNSW tuning

SemanticCache now exposes the main pgvector HNSW knobs in two places:

Global defaults through CacheSettings / environment variables.
Per-call query override for hnsw.ef_search through SemanticCache.get(..., hnsw_ef_search=...).

The settings are:

SEMANTIC_CACHE_PGVECTOR_HNSW_M (CacheSettings.pgvector_hnsw_m):
Default 16.
Used when creating a new HNSW index.
Higher values generally improve recall, but increase memory use, index size, and build cost.
SEMANTIC_CACHE_PGVECTOR_HNSW_EF_CONSTRUCTION (CacheSettings.pgvector_hnsw_ef_construction):
Default 64.
Used when creating a new HNSW index.
Higher values generally improve recall, but slow index builds.
SEMANTIC_CACHE_PGVECTOR_HNSW_EF_SEARCH (CacheSettings.pgvector_hnsw_ef_search):
Default unset.
Applied at query time as a transaction-local pgvector setting for similarity search.
Higher values generally improve recall, but increase CPU and latency.

Important operational note:

m and ef_construction only affect newly created HNSW indexes. Changing these settings does not rebuild an existing index automatically.
If you change m or ef_construction after the index already exists, the running application keeps using that existing on-disk index with its old build parameters. Restarting the app or changing environment variables alone does not update the index structure.
To make new m or ef_construction values take effect, rebuild or recreate the index outside the library.
ef_search affects query-time behavior and can be changed without rebuilding indexes.
Changing ef_search does not invalidate the HNSW index. It is the safe runtime tuning knob for recall versus latency. Lower values usually reduce latency and recall; higher values usually increase recall and query cost.

Example global configuration:

SEMANTIC_CACHE_PGVECTOR_HNSW_M=16
SEMANTIC_CACHE_PGVECTOR_HNSW_EF_CONSTRUCTION=64
SEMANTIC_CACHE_PGVECTOR_HNSW_EF_SEARCH=80

Example per-query override:

result = await cache.get(
    "what is the refund policy?",
    model="gpt-5.4-mini",
    hnsw_ef_search=120,
)

This override applies only to that lookup. Other requests keep using the configured default or the database default.
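
Since rebuilding with new m or ef_construction values happens outside the library, the helper below sketches the pgvector statements involved. The table, column, and index names are placeholders, and the vector_cosine_ops operator class is an assumption; match whatever your existing index uses.

```python
import re
from typing import List

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def hnsw_rebuild_sql(table: str, column: str, index_name: str, m: int, ef_construction: int) -> List[str]:
    """Return statements that recreate an HNSW index with new build parameters."""
    for identifier in (table, column, index_name):
        # Identifiers are interpolated into SQL, so accept only plain names.
        if not _IDENTIFIER.match(identifier):
            raise ValueError(f"unsafe identifier: {identifier!r}")
    return [
        f"DROP INDEX IF EXISTS {index_name};",
        f"CREATE INDEX {index_name} ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {int(m)}, ef_construction = {int(ef_construction)});",
    ]


def ef_search_sql(ef_search: int) -> str:
    """Return the transaction-local setting that tunes query-time recall."""
    return f"SET LOCAL hnsw.ef_search = {int(ef_search)};"
```

Dropping the index first leaves queries without ANN support until the new build finishes; on a busy table you may prefer to build the replacement under a different name with CREATE INDEX CONCURRENTLY and drop the old index afterwards.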

Stage 2: optional rejection threshold

The second stage can apply a stricter similarity cutoff on the in-memory candidates:

SEMANTIC_CACHE_REJECTION_THRESHOLD (CacheSettings.rejection_threshold):
When unset (None or empty env var), behavior matches the original single-threshold model: the best remaining candidate is accepted.
When set, it must be greater than or equal to SEMANTIC_CACHE_THRESHOLD; otherwise settings validation fails (a lower value would make the second stage unable to reject anything that passed the primary gate).
If it equals SEMANTIC_CACHE_THRESHOLD, validation still succeeds, but you get a warning at startup: the second stage cannot filter out any candidate that passed the primary gate (same cutoff). Use a strictly higher rejection threshold if you want stage 2 to matter, or leave rejection unset.
When set, the cache scans the candidates in order and selects the first entry whose similarity >= rejection_threshold.
If no candidate passes this stricter bar, the cache returns a miss (is_hit=False).
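
The selection rule can be modeled in a few lines. This is an illustrative sketch with hypothetical names (stage_two, (row_id, similarity) tuples); it assumes the candidates arrive already ordered from highest to lowest similarity, as stage 1 produces them.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[str, float]  # (row id, similarity)


def stage_two(candidates: List[Candidate], rejection_threshold: Optional[float]) -> Optional[Candidate]:
    """Pick the entry to serve, or None for a miss."""
    if not candidates:
        return None
    if rejection_threshold is None:
        # Single-threshold behavior: accept the best remaining candidate.
        return candidates[0]
    for candidate in candidates:
        if candidate[1] >= rejection_threshold:
            return candidate
    return None  # nothing cleared the stricter bar
```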

This is useful when you want to:

Keep SEMANTIC_CACHE_THRESHOLD lower (for example 0.80) to allow more candidates into the first stage.
Enforce a higher bar (for example 0.90) for actually serving a cached response.

Example configurations

Strict, precision-first cache:
SEMANTIC_CACHE_THRESHOLD=0.90
SEMANTIC_CACHE_TOP_K_CANDIDATES=1
# SEMANTIC_CACHE_REJECTION_THRESHOLD is left unset

Only very similar neighbors are considered, and the single best one is accepted if it passes 0.90.

More recall with a second-stage guard:
SEMANTIC_CACHE_THRESHOLD=0.80
SEMANTIC_CACHE_TOP_K_CANDIDATES=5
SEMANTIC_CACHE_REJECTION_THRESHOLD=0.90

Up to five neighbors with similarity at least 0.80 are fetched; the cache only serves a hit if at least one of them has similarity >= 0.90, otherwise it falls back to a miss.

Notes

If SEMANTIC_CACHE_TOP_K_CANDIDATES is less than 1, it is treated as 1 internally.
All thresholds are validated to the inclusive range [0.0, 1.0] by CacheSettings (out-of-range values raise a validation error).
When SEMANTIC_CACHE_REJECTION_THRESHOLD is set, it must satisfy rejection_threshold >= threshold. Equality issues a warning because the second stage has no effect (see above).
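
These validation rules can be summarized as a small check. The function below is a standalone sketch, not CacheSettings itself; validate_thresholds is a hypothetical name, and the real settings class may raise a different exception type.

```python
import warnings
from typing import Optional


def validate_thresholds(threshold: float, rejection_threshold: Optional[float]) -> None:
    """Apply the range and ordering rules for the two similarity thresholds."""
    for name, value in (("threshold", threshold), ("rejection_threshold", rejection_threshold)):
        if value is not None and not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within [0.0, 1.0], got {value}")
    if rejection_threshold is None:
        return  # stage 2 disabled
    if rejection_threshold < threshold:
        raise ValueError("rejection_threshold must be >= threshold")
    if rejection_threshold == threshold:
        # Valid, but stage 2 cannot reject anything that passed stage 1.
        warnings.warn("rejection_threshold equals threshold; stage 2 has no effect")
```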

Postgres row expiry

Each cache table contains an expires_at TIMESTAMPTZ column (nullable). By default it is NULL and rows never expire.

Set SEMANTIC_CACHE_PG_TTL_DAYS (CacheSettings.pg_ttl_days, a float, so fractional days are allowed) to enable expiry. When set, every inserted or updated row receives expires_at = NOW() + <ttl_days> days. On conflict the column is overwritten with the new expiry so upserts always refresh the deadline.
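
The expiry arithmetic, including fractional days, looks like this in Python terms. The library computes the value in SQL; compute_expires_at is a hypothetical helper used only to show the calculation.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def compute_expires_at(pg_ttl_days: Optional[float], now: Optional[datetime] = None) -> Optional[datetime]:
    """Return the expiry deadline for a row written at `now`, or None if rows never expire."""
    if pg_ttl_days is None:
        return None  # expires_at stays NULL
    now = now or datetime.now(timezone.utc)
    # timedelta accepts fractional days, so 0.5 means twelve hours.
    return now + timedelta(days=pg_ttl_days)
```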

Because each embedder configuration writes to its own table (scoped by cache_namespace and vector dimension), autovacuum scheduling is isolated per embedder configuration. A high-write embedder accumulates dead tuples and triggers autovacuum independently of quieter ones, so it does not delay dead-tuple cleanup the way it could if all embedders shared one relation.

Row removal is out of scope for the library. Schedule cleanup externally, for example with pg_cron:

SELECT cron.schedule(
    'purge-expired-cache',
    '0 * * * *',
    $$DELETE FROM sc_<table_hash> WHERE expires_at IS NOT NULL AND expires_at < NOW()$$
);

Replace sc_<table_hash> with the actual table name (derived from cache_namespace and embedding_dim; visible in Postgres \dt sc_*).

Timeout tuning

Slow embedder providers or storage dependencies can increase request latency and tie up worker capacity. SemanticCache supports fail-fast timeout controls:

SEMANTIC_CACHE_EMBED_TIMEOUT_SECONDS (CacheSettings.embed_timeout_seconds): timeout budget for embedder calls used by get() and put().
SEMANTIC_CACHE_STORE_TIMEOUT_SECONDS (CacheSettings.store_timeout_seconds): timeout budget for Postgres and Redis operations, including initial pool open and schema checks. When this value is set (non-null), the Redis response store also passes it as socket_timeout and socket_connect_timeout to redis.asyncio.from_url so stalled TCP or Redis reads cannot block the event loop beyond that budget at the socket layer. If you disable the store timeout (null/empty env), Redis uses library defaults for those socket options.
SEMANTIC_CACHE_UPSTREAM_TIMEOUT_SECONDS (CacheSettings.upstream_timeout_seconds): timeout budget in seconds for the upstream ASGI call. Applies in both response_mode='buffered' and response_mode='tee'. A slow or hung upstream holds the per-key flight lock open for its full duration, blocking all waiters for that key until the holder releases it. Setting this cap bounds how long the holder runs upstream work: when the budget expires, the middleware cancels the upstream call, releases the flight lock, logs a warning, and returns HTTP 504 to the client. Waiters still block on lock acquisition for the full holder duration unless middleware_flight_lock_acquire_timeout_seconds is also set. Defaults to None (no cap).

In tee mode the timeout is handled in two ways depending on how far the stream has progressed. If the upstream has not yet sent http.response.start, the middleware sends a complete 504 response to the client before releasing the lock, so the client never hangs. If http.response.start has already been forwarded (mid-stream), the HTTP status line is already committed; the middleware closes the connection by sending a terminal empty body chunk, then releases the lock. Because the original (non-504) status line has already been sent, the client receives a truncated stream in this case. To avoid mid-stream timeouts, set upstream_timeout_seconds to a value larger than the expected time-to-first-byte so that the timeout only fires for fully hung upstreams that have not yet responded at all.

When embed_timeout_seconds or store_timeout_seconds is exceeded, the cache raises a timeout exception with operation metadata, emits a warning log entry, and increments an in-process operation timeout counter (SemanticCache.timeout_counts) for observability. Middleware continues to fail open, so requests still execute against upstream handlers.
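
The fail-fast and fail-open pattern can be sketched with asyncio.wait_for. Everything named here (OperationTimeout, run_with_budget, lookup_or_fail_open, the module-level timeout_counts) is a hypothetical stand-in for the library's internals, and logging is omitted for brevity.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Optional, TypeVar

T = TypeVar("T")

timeout_counts: Counter = Counter()  # per-operation timeout tally


class OperationTimeout(Exception):
    """Timeout that records which operation ran out of budget."""

    def __init__(self, operation: str, budget: float) -> None:
        super().__init__(f"{operation} exceeded {budget}s")
        self.operation = operation
        self.budget = budget


async def run_with_budget(operation: str, awaitable: Awaitable[T], budget: Optional[float]) -> T:
    """Await the operation, enforcing the budget when one is configured."""
    if budget is None:
        return await awaitable
    try:
        return await asyncio.wait_for(awaitable, timeout=budget)
    except asyncio.TimeoutError:
        timeout_counts[operation] += 1
        raise OperationTimeout(operation, budget) from None


async def lookup_or_fail_open(embed_seconds: float, budget: float) -> str:
    """Simulate a lookup whose embed step takes embed_seconds."""

    async def fake_embed() -> str:
        await asyncio.sleep(embed_seconds)
        return "embedding"

    try:
        await run_with_budget("embed", fake_embed(), budget)
    except OperationTimeout:
        return "miss (fail open)"  # the request proceeds to the upstream handler
    return "lookup continues"
```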

Middleware in-flight lock registry

SemanticCacheMiddleware keeps an in-memory lock table to serialize concurrent cache misses for the same (composed_query, model, scope_storage) key, where composed_query is the string produced by combining HTTP method, normalized path, model value, and extracted semantic query.

Each registry entry is removed automatically when the flight completes (i.e. when the async with flight: block exits), so the registry only contains genuinely in-flight keys at any point in time. Under normal operation the registry stays small regardless of key cardinality.

To guard against pathological cases (e.g. many concurrent in-flight misses that never complete), the registry is bounded:

SEMANTIC_CACHE_MIDDLEWARE_FLIGHT_LOCK_MAX_ENTRIES (CacheSettings.middleware_flight_lock_max_entries): maximum number of distinct in-flight lock keys retained simultaneously. When the limit is exceeded, the middleware evicts least-recently-used unlocked lock entries. Locks currently coordinating active requests are never evicted.
SEMANTIC_CACHE_MIDDLEWARE_FLIGHT_LOCK_ACQUIRE_TIMEOUT_SECONDS (CacheSettings.middleware_flight_lock_acquire_timeout_seconds): maximum seconds a request may block waiting to acquire the per-key flight lock. When exceeded, the middleware logs a warning and proceeds without deduplication (fail open), so waiters are not held indefinitely while another flight runs a slow embed, store, or upstream call. Unset (null/empty) waits indefinitely. Set this above the expected duration of one coordinated miss (embed + store + upstream budgets combined).

Default is 4096 for max entries. Saturated registry: when every retained lock is held and a new distinct key is inserted, LRU eviction drops that new key’s table entry immediately (the new lock is the last unlocked slot in traversal order, since it was just appended and all older entries are still held). The caller still holds the same lock object, but it is no longer tracked, so concurrent identical keys are not deduplicated until capacity frees. A warning-level log is emitted when this happens.
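
The saturated case is easier to see in a toy model. The registry below is a simplified sketch under stated assumptions (an OrderedDict as the LRU table, asyncio.Lock per key, no removal on flight completion, no logging); FlightLockRegistry and saturated_demo are hypothetical names, not the middleware's classes.

```python
import asyncio
from collections import OrderedDict
from typing import Hashable


class FlightLockRegistry:
    """Bounded lock table that evicts least-recently-used unlocked entries."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self._locks: "OrderedDict[Hashable, asyncio.Lock]" = OrderedDict()

    def get(self, key: Hashable) -> asyncio.Lock:
        lock = self._locks.get(key)
        if lock is None:
            lock = asyncio.Lock()
            self._locks[key] = lock  # new keys land at the most-recent end
        else:
            self._locks.move_to_end(key)
        self._evict()
        return lock

    def _evict(self) -> None:
        # Walk from least to most recently used; held locks are never dropped.
        for key in list(self._locks):
            if len(self._locks) <= self.max_entries:
                break
            if not self._locks[key].locked():
                del self._locks[key]

    def __contains__(self, key: Hashable) -> bool:
        return key in self._locks


async def saturated_demo() -> bool:
    """Return whether a new key stays tracked when every slot is held."""
    registry = FlightLockRegistry(max_entries=2)
    lock_a = registry.get("a")
    await lock_a.acquire()
    lock_b = registry.get("b")
    await lock_b.acquire()
    registry.get("c")  # both older locks are held, so "c" is the only evictable entry
    tracked = "c" in registry
    lock_a.release()
    lock_b.release()
    return tracked
```

In this model saturated_demo() returns False: the caller still receives a usable lock for "c", but a second request for "c" would get a different lock object, which is why deduplication pauses until capacity frees.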