fastapi-semcache

Ultra-lightweight semantic caching middleware for FastAPI apps and LLM endpoints.

fastapi-semcache adds semantic response caching as a thin async middleware layer. Vector similarity search runs inside Postgres via pgvector. Python never owns the heavy computation. It works as FastAPI middleware today and can also run as a reverse proxy in front of an upstream API or LLM service.

How it works

When a request arrives, the middleware:

Extracts the semantic query text from the request body (query, prompt, input, messages or your own extractor callable).
Embeds the query using a configurable embedder (OpenAI, Cohere, Voyage, Hugging Face, Ollama, or your own).
Runs a nearest-neighbor cosine similarity search in Postgres via pgvector.
Returns a cached response if a match passes the similarity threshold, or calls your route handler and stores the new response.

Python is the glue, not the bottleneck. Every expensive operation is offloaded:

What	Where it runs
Cosine / ANN vector similarity	Postgres + pgvector (C, indexed)
Embedding generation	Your provider's API (I/O, not CPU); the `huggingface` embedder instead runs PyTorch in-process
Response blob storage and retrieval	Postgres rows or Redis (C clients)
HTTP proxying	`aiohttp.ClientSession` (async I/O, optional `proxy` extra)

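Because the expensive work is either I/O-bound (the GIL is released while waiting) or runs inside a C extension, Python is rarely the bottleneck when you use a hosted embedder, even under high concurrency with a single uvicorn worker. The exception is the in-process `huggingface` embedder, which runs PyTorch inference inside the worker.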
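
To make the Postgres side concrete, here is a rough sketch of what a cosine nearest-neighbor query looks like with pgvector. pgvector's `<=>` operator returns cosine distance, so similarity is `1 - distance`. The table name `semantic_cache_entries` and the column names are hypothetical; the library creates and names its own tables, and its actual SQL may differ.

```python
from collections.abc import Sequence

# Hypothetical table and column names, for illustration only.
# Ordering by the distance expression with a LIMIT is the query shape
# that lets pgvector use an HNSW index built with vector_cosine_ops.
NEAREST_NEIGHBOR_SQL = """
SELECT id,
       response,
       1 - (embedding <=> %(query)s::vector) AS similarity
FROM semantic_cache_entries
ORDER BY embedding <=> %(query)s::vector
LIMIT %(top_k)s
"""


def to_vector_literal(values: Sequence[float]) -> str:
    """Format floats in pgvector's text form, e.g. '[0.1,0.2,1]'."""
    return "[" + ",".join(f"{v:.8g}" for v in values) + "]"


def build_query_params(embedding: Sequence[float], top_k: int = 1) -> dict[str, object]:
    """Build the named parameters for NEAREST_NEIGHBOR_SQL."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    return {"query": to_vector_literal(embedding), "top_k": top_k}
```

With psycopg 3, the query would be run as `await cur.execute(NEAREST_NEIGHBOR_SQL, build_query_params(vector))`, and the similarity threshold would then be applied to the returned `similarity` column.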

Install

pip install fastapi-semcache

SEMANTIC_CACHE_PG_URI, a connection string for a PostgreSQL database with the pgvector extension, is the only required environment variable. Every other setting has a default.

Optional extras

Extra	Installs	Use when
`proxy`	`fastapi`, `aiohttp`	`create_semantic_cache_proxy_app`
`embed-openai`	`openai`, `tiktoken`	`embedder_type="openai"`
`embed-voyage`	`voyageai`, `aiohttp`	`embedder_type="voyage"`
`embed-cohere`	`cohere`	`embedder_type="cohere"`
`embed-huggingface`	`sentence-transformers`, `torch`	`embedder_type="huggingface"`
`embed-ollama`	`openai`	`embedder_type="ollama"`
`redis`	`redis`	`SEMANTIC_CACHE_REDIS_URI` is set

The core wheel installs only Starlette and psycopg. The middleware is plain Starlette/ASGI middleware, so declare fastapi as a dependency of your own project if you build a FastAPI() app. The proxy extra pulls in fastapi for reverse proxy mode.

Extras can be combined:

pip install "fastapi-semcache[redis,embed-openai]"

For GPU (CUDA) PyTorch with the Hugging Face extra, pass PyTorch's wheel index:

pip install "fastapi-semcache[embed-huggingface]" \
  --extra-index-url https://download.pytorch.org/whl/cu124

Quickstart

FastAPI middleware

from typing import Any

from fastapi import FastAPI

from semanticcache import SemanticCache, SemanticCacheMiddleware

app = FastAPI()
cache = SemanticCache()
app.add_middleware(SemanticCacheMiddleware, cache=cache)


@app.post("/v1/chat/completions")
async def chat_completions(body: dict[str, Any]) -> dict[str, Any]:
    return {"choices": [{"message": {"role": "assistant", "content": "Hello"}}]}

Run the app with uvicorn, assuming the code above is saved as mymodule.py:

uvicorn mymodule:app --host 0.0.0.0 --port 8000

By default only POST requests are intercepted. Successful responses whose body parses as a JSON object are stored. Cache hits replay the original HTTP status and response headers.
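
The default storage rule above can be expressed as a small predicate. This is a sketch, not the middleware's code: `is_cacheable_exchange` is a hypothetical name, and treating "successful" as any 2xx status is an assumption.

```python
import json


def is_cacheable_exchange(
    method: str,
    status: int,
    response_body: bytes,
    intercepted_methods: frozenset[str] = frozenset({"POST"}),
) -> bool:
    """Decide whether a request/response pair qualifies for storage."""
    if method.upper() not in intercepted_methods:
        return False
    # Assumption: "successful" means a 2xx status code.
    if not 200 <= status < 300:
        return False
    try:
        parsed = json.loads(response_body)
    except ValueError:
        return False
    # Only JSON objects are stored; arrays, strings, and numbers are skipped.
    return isinstance(parsed, dict)
```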

Reverse proxy

Use create_semantic_cache_proxy_app when you want a standalone hop in front of another service rather than importing routes into your FastAPI app. Install the proxy extra first (installs fastapi and aiohttp):

pip install "fastapi-semcache[proxy]"

from semanticcache import SemanticCache, create_semantic_cache_proxy_app

cache = SemanticCache()
app = create_semantic_cache_proxy_app(
    upstream="http://127.0.0.1:11434",
    cache=cache,
)

Run the proxy with uvicorn, assuming the code above is saved as mymodule.py:

uvicorn mymodule:app --host 0.0.0.0 --port 8080

Key concepts

Similarity thresholds

SemanticCache uses a two-stage retrieval pipeline:

Stage 1 (SEMANTIC_CACHE_THRESHOLD, SEMANTIC_CACHE_TOP_K_CANDIDATES): fetches the top-k nearest neighbors from pgvector that meet the primary similarity gate.
Stage 2 (SEMANTIC_CACHE_REJECTION_THRESHOLD): optionally applies a stricter cutoff on those candidates before serving a hit.

See Cache Tuning for concrete configuration examples.

Embedders

The default factory (get_embedder) reads SEMANTIC_CACHE_EMBEDDER_TYPE and constructs a built-in embedder. You can also subclass BaseEmbedder and pass any custom embedder directly:

cache = SemanticCache(embedder=MyEmbedder(...), settings=get_cache_settings())

See Embedders for the full contract and built-in options.

Cache scope and tenant isolation

By default (SEMANTIC_CACHE_REQUIRE_CACHE_SCOPE=false), the cache uses one shared bucket (single-tenant). For multi-tenant isolation, set SEMANTIC_CACHE_REQUIRE_CACHE_SCOPE=true and supply a server-side extract_scope that derives scope from authenticated identity. Do not rely on client-controlled X-Semantic-Cache-Scope or JSON cache_scope / tenant_id alone; clients can forge those values.

Middleware warning logs never contain prompt text. On a cache read failure, the middleware logs the route, scope, request_id, and a keyed digest of the composed lookup text. Set SEMANTIC_CACHE_LOG_DIGEST_KEY to keep those digests stable across process restarts.

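from semanticcache.middleware.core.extractors import trusted_extract_scope_from_server_side

async def extract_scope(request, body: bytes) -> str | None:
    # Derive the scope from server-side authenticated identity, never from client input.
    return await trusted_extract_scope_from_server_side(request)

# Starlette runs the most recently added middleware first, so
# YourAuthMiddleware (added last) handles the request before the cache middleware.
app.add_middleware(SemanticCacheMiddleware, cache=cache, extract_scope=extract_scope)
app.add_middleware(YourAuthMiddleware)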
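
The keyed log digest described earlier in this section can be pictured as an HMAC over the lookup text. The sketch below uses only the standard library. HMAC-SHA256, the 16-character truncation, and the helper names `load_digest_key` and `log_digest` are assumptions made for illustration; only the environment variable name and the per-process random default come from this README.

```python
import hashlib
import hmac
import os
import secrets
from collections.abc import Mapping


def load_digest_key(env: Mapping[str, str] = os.environ) -> bytes:
    """Use the configured key if present, otherwise a per-process random one."""
    configured = env.get("SEMANTIC_CACHE_LOG_DIGEST_KEY")
    if configured:
        return configured.encode("utf-8")
    # A random key means digests change on every restart.
    return secrets.token_bytes(32)


def log_digest(key: bytes, lookup_text: str, length: int = 16) -> str:
    """Return a short keyed digest that is safe to log in place of the text."""
    mac = hmac.new(key, lookup_text.encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()[:length]
```

Because the digest is keyed, someone who reads the logs cannot confirm a guessed prompt without the key, while operators who set the key can still correlate repeated failures for the same lookup text.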

Storage

Postgres + pgvector: always required. Each embedder configuration gets its own table (scoped by model id and vector dimension) created automatically on first use.
Redis (optional): TTL-backed response blob cache. Install the redis extra and set SEMANTIC_CACHE_REDIS_URI. If unset, responses are stored in Postgres only.

Environment variables

Variable	Default	Description
`SEMANTIC_CACHE_PG_URI`	(required)	PostgreSQL connection string
`SEMANTIC_CACHE_EMBEDDER_TYPE`	`huggingface`	Embedder backend (`openai`, `cohere`, `voyage`, `huggingface`, `ollama`). `huggingface` loads PyTorch in-process; use hosted backends in production.
`SEMANTIC_CACHE_THRESHOLD`	`0.95`	Primary cosine similarity gate [0.0, 1.0]
`SEMANTIC_CACHE_TOP_K_CANDIDATES`	`1`	Max nearest-neighbor candidates from pgvector
`SEMANTIC_CACHE_REJECTION_THRESHOLD`	(unset)	Optional stricter second-stage cutoff
`SEMANTIC_CACHE_PGVECTOR_HNSW_M`	`16`	HNSW graph connectivity for new pgvector indexes; existing indexes keep the old value until rebuilt
`SEMANTIC_CACHE_PGVECTOR_HNSW_EF_CONSTRUCTION`	`64`	HNSW build candidate list size for new pgvector indexes; existing indexes keep the old value until rebuilt
`SEMANTIC_CACHE_PGVECTOR_HNSW_EF_SEARCH`	(unset)	Optional default query-time HNSW search breadth; safe to change without rebuilding the index
`SEMANTIC_CACHE_REDIS_URI`	(empty)	Redis URI; omit for Postgres-only mode
`SEMANTIC_CACHE_REQUIRE_CACHE_SCOPE`	`false`	Require a non-empty scope on every request (multi-tenant)
`SEMANTIC_CACHE_CACHE_AUTHORIZED_REQUESTS`	`false`	Cache requests that include an `Authorization` header
`SEMANTIC_CACHE_LOG_DIGEST_KEY`	(per-process random)	Secret used to derive HMAC digests for prompt-derived log fields; set explicitly for stable correlation across restarts
`SEMANTIC_CACHE_RESPONSE_MODE`	`buffered`	Miss delivery mode (`buffered` or `tee`)
`SEMANTIC_CACHE_HIT_RESPONSE_MODE`	(auto)	Hit delivery mode (`single` or `stream`)
`SEMANTIC_CACHE_PG_TTL_DAYS`	(unset)	Fractional days before Postgres rows expire
`SEMANTIC_CACHE_EMBED_TIMEOUT_SECONDS`	(unset)	Fail-fast budget for embedder calls
`SEMANTIC_CACHE_STORE_TIMEOUT_SECONDS`	(unset)	Fail-fast budget for Postgres / Redis operations
`SEMANTIC_CACHE_UPSTREAM_TIMEOUT_SECONDS`	(unset)	Fail-fast budget for upstream ASGI calls
`SEMANTIC_CACHE_MAX_BODY_BYTES`	`10485760`	Request and response body size cap (10 MiB)

Package names

The PyPI distribution and GitHub repository are fastapi-semcache. The import package is semanticcache (fastapi_semcache is available as an alias).

Requirements

Python 3.12+. Postgres with the pgvector extension.

License

Apache-2.0.