Custom embedders and minimal installs
The PyPI package fastapi-semcache installs core runtime dependencies only (Starlette and Postgres via psycopg). Optional extras such as proxy, embed-openai, embed-cohere, embed-voyage, embed-ollama, embed-huggingface and redis pull in vendor-specific stacks. FastAPI apps should already declare fastapi in their own project; the proxy extra installs fastapi for create_semantic_cache_proxy_app.
If you want to avoid those stacks, or you already host embeddings elsewhere, implement a small class against BaseEmbedder and pass it into SemanticCache(embedder=...). No embedding extra is required for that path.
Built-in embedders: get_embedder vs constructor arguments
get_embedder(settings) (used automatically when you omit embedder= on SemanticCache) reads CacheSettings.embedder_type (environment SEMANTIC_CACHE_EMBEDDER_TYPE) and the matching settings fields. It constructs SBERTEmbedder or OpenAIEmbedder with no model_name, dimensions, base_url, or other constructor overrides for those two backends (they use class defaults, for example text-embedding-3-small / 1536 on OpenAIEmbedder, or sentence-transformers/all-MiniLM-L6-v2 on SBERTEmbedder).
When embedder_type is ollama, settings must include ollama_embedding_model and ollama_embedding_dimensions (environment SEMANTIC_CACHE_OLLAMA_EMBEDDING_MODEL and SEMANTIC_CACHE_OLLAMA_EMBEDDING_DIMENSIONS). There is no safe library default across Qwen, Nomic, and other models; the declared width must match the running model and your pgvector column.
There is no separate environment variable today for “which embedding model id” on the stock factory path for OpenAIEmbedder / SBERTEmbedder. Changing the embedding model or dimensions for those backends is normal configuration: import OpenAIEmbedder or SBERTEmbedder, pass the constructor arguments you need, and wire SemanticCache(embedder=..., settings=...) as below.
from semanticcache import SemanticCache, get_cache_settings
from semanticcache.embedders import OpenAIEmbedder

cache = SemanticCache(
    embedder=OpenAIEmbedder(
        model_name="text-embedding-3-large",
        dimensions=3072,
        api_key=get_cache_settings().openai_api_key,
    ),
    settings=get_cache_settings(),
)
Use the same pattern for SBERTEmbedder(model_name="...", normalize_embeddings=..., api_key=...). cache_namespace (and thus pgvector table routing) incorporates model id and dimensions, so a different model_name or width does not collide with another setup.
When you bypass get_embedder, SEMANTIC_CACHE_EMBEDDER_TYPE no longer selects the implementation; keep embedder_type aligned with reality if you rely on CacheResult.source (see CacheResult.source and settings below).
Install (core only)
pip install fastapi-semcache
You still need Postgres with pgvector and, if you use it, Redis. You do not need embed-openai, embed-huggingface, or similar extras for a custom embedder.
Contract: BaseEmbedder
Import the abstract base from the embedders package:
from semanticcache.embedders import BaseEmbedder
Your subclass must provide:
| Member | Role |
|---|---|
| embedding_dim (property) | Length of each dense vector returned by embed. Must stay fixed for a given cache_namespace so the pgvector table dimension matches. |
| cache_namespace (property) | Stable string that identifies this embedding setup (model id, version, dimension, anything that should not share storage with a different setup). Used to derive the pgvector table name and Redis key prefix. |
| embed(texts) (async) | Given list[str], return list[list[float]]: one vector per string, same order and length as texts. For an empty input list, return an empty list. |
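As an illustration of the three members in the table, here is a minimal sketch of a deterministic embedder that could back unit tests without any network or model. The BaseEmbedder class in the snippet is a local stand-in written only so the example runs on its own; in real code you would import the abstract base from semanticcache.embedders. HashEmbedder and its namespace string are hypothetical names, and the vectors it produces carry no semantic meaning.

```python
import asyncio
import hashlib
from abc import ABC, abstractmethod


class BaseEmbedder(ABC):
    """Local stand-in mirroring the contract table (assumption, not the library class)."""

    @property
    @abstractmethod
    def embedding_dim(self) -> int: ...

    @property
    @abstractmethod
    def cache_namespace(self) -> str: ...

    @abstractmethod
    async def embed(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbedder(BaseEmbedder):
    """Deterministic toy embedder for tests; not suitable for real similarity search."""

    def __init__(self, *, embedding_dim: int = 8) -> None:
        self._embedding_dim = embedding_dim

    @property
    def embedding_dim(self) -> int:
        return self._embedding_dim

    @property
    def cache_namespace(self) -> str:
        # The width is part of the namespace, so a different width
        # never shares storage with this setup.
        return f"test-hash-embedder:d{self._embedding_dim}"

    async def embed(self, texts: list[str]) -> list[list[float]]:
        vectors: list[list[float]] = []
        for text in texts:
            # shake_256 can emit exactly embedding_dim bytes per text.
            digest = hashlib.shake_256(text.encode("utf-8")).digest(self._embedding_dim)
            vectors.append([byte / 255.0 for byte in digest])
        # An empty input list falls through the loop and returns [].
        return vectors
```

The same text always maps to the same vector, which keeps cache tests repeatable.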
Batching, retries, and timeouts are your responsibility inside embed.
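One possible shape for that responsibility is sketched below. The helper embed_with_retries, its parameters, and the choice of exceptions to retry on are all assumptions made for illustration; nothing here is part of the library. The call argument stands for whatever coroutine function actually talks to your embedding service.

```python
import asyncio
from collections.abc import Awaitable, Callable

# Hypothetical type alias: any coroutine function that embeds one batch.
EmbedCall = Callable[[list[str]], Awaitable[list[list[float]]]]


async def embed_with_retries(
    texts: list[str],
    call: EmbedCall,
    *,
    batch_size: int = 64,
    attempts: int = 3,
    timeout_s: float = 30.0,
    backoff_s: float = 0.5,
) -> list[list[float]]:
    """Send texts batch by batch, in order, retrying each failed batch."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        for attempt in range(1, attempts + 1):
            try:
                # wait_for cancels the call and raises a timeout error on expiry.
                result = await asyncio.wait_for(call(batch), timeout=timeout_s)
                break
            except (asyncio.TimeoutError, ConnectionError):
                if attempt == attempts:
                    raise
                # Exponential backoff: backoff_s, 2 * backoff_s, 4 * backoff_s, ...
                await asyncio.sleep(backoff_s * 2 ** (attempt - 1))
        vectors.extend(result)
    return vectors
```

Inside a custom embed method you would pass the texts and a small coroutine that performs one HTTP request. Which exceptions are safe to retry depends on your client library and service.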
Wiring SemanticCache
Pass any BaseEmbedder instance (built-in OpenAIEmbedder, SBERTEmbedder, OllamaEmbedder, or your own subclass) as the keyword-only argument embedder. The factory get_embedder(settings) is not used when embedder is set, so SEMANTIC_CACHE_EMBEDDER_TYPE does not choose that instance.
from semanticcache import SemanticCache, get_cache_settings
cache = SemanticCache(embedder=MyEmbedder(...), settings=get_cache_settings())
Optional: pass embedding_dim= to assert it matches embedder.embedding_dim (a mismatch raises ValueError).
When you use the built-in Hugging Face backend through get_embedder(settings), SBERTEmbedder receives the token from settings.hugging_face_api_key. SBERTEmbedder does not re-read global settings on its own.
Production note: SBERTEmbedder loads sentence-transformers and PyTorch inside your app process. That adds significant memory and CPU/GPU overhead compared with hosted APIs. It is intended for local development and tests. On first construction, the library emits a one-time UserWarning recommending openai, cohere, voyage, ollama, or a custom BaseEmbedder for deployed workloads.
CacheResult.source and settings
CacheResult.source is still derived from CacheSettings.embedder_type (environment SEMANTIC_CACHE_EMBEDDER_TYPE) for hits and misses, not from the concrete embedder class or model_name. If you rely on that field for metrics, either align embedder_type with the backend you instantiated (huggingface vs openai) or treat source as configuration metadata only when embedder was passed explicitly.
Example: HTTP embedding API
Delegate embeddings to your own HTTP service. Both snippets below assume POST /embed with body {"texts": ["...", ...]} and JSON {"vectors": [[float, ...], ...]}. Rename paths and keys to match your API.
HTTPX
Install httpx in your app (or add it to your project dependencies).
# typing.override requires Python 3.12+; on older versions use typing_extensions.override.
from typing import override

import httpx

from semanticcache.embedders import BaseEmbedder


class HttpExampleHttpxEmbedder(BaseEmbedder):
    """Minimal example: delegate embeddings to your own HTTP service."""

    def __init__(
        self,
        *,
        base_url: str,
        embedding_dim: int,
        cache_namespace: str,
    ) -> None:
        self._base_url = base_url.rstrip("/")
        self._embedding_dim = embedding_dim
        self._cache_namespace = cache_namespace

    @property
    @override
    def embedding_dim(self) -> int:
        return self._embedding_dim

    @property
    @override
    def cache_namespace(self) -> str:
        return self._cache_namespace

    @override
    async def embed(self, texts: list[str]) -> list[list[float]]:
        if not texts:
            return []
        # A new client per call keeps the example short; see the section on
        # reusing a long-lived HTTP client for production use.
        async with httpx.AsyncClient(base_url=self._base_url, timeout=60.0) as client:
            response = await client.post("/embed", json={"texts": texts})
            response.raise_for_status()
            payload = response.json()
        vectors: list[list[float]] = payload["vectors"]
        # Fail fast if the service breaks the one-vector-per-text contract.
        if len(vectors) != len(texts):
            msg = "embedding API returned wrong number of vectors"
            raise RuntimeError(msg)
        for row in vectors:
            if len(row) != self._embedding_dim:
                msg = "embedding vector length does not match embedding_dim"
                raise RuntimeError(msg)
        return vectors
aiohttp
Install aiohttp in your app, or use an extra that already includes it (for example fastapi-semcache[proxy] or embed-voyage).
# typing.override requires Python 3.12+; on older versions use typing_extensions.override.
from typing import Any, override

import aiohttp

from semanticcache.embedders import BaseEmbedder


class HttpExampleAiohttpEmbedder(BaseEmbedder):
    """Minimal example: delegate embeddings to your own HTTP service (aiohttp)."""

    def __init__(
        self,
        *,
        base_url: str,
        embedding_dim: int,
        cache_namespace: str,
    ) -> None:
        self._base_url = base_url.rstrip("/")
        self._embedding_dim = embedding_dim
        self._cache_namespace = cache_namespace

    @property
    @override
    def embedding_dim(self) -> int:
        return self._embedding_dim

    @property
    @override
    def cache_namespace(self) -> str:
        return self._cache_namespace

    @override
    async def embed(self, texts: list[str]) -> list[list[float]]:
        if not texts:
            return []
        timeout = aiohttp.ClientTimeout(total=60.0)
        async with aiohttp.ClientSession(
            base_url=self._base_url,
            timeout=timeout,
        ) as session:
            async with session.post("/embed", json={"texts": texts}) as response:
                response.raise_for_status()
                # The body must be read while the response is still open.
                payload: dict[str, Any] = await response.json()
        vectors: list[list[float]] = payload["vectors"]
        # Fail fast if the service breaks the one-vector-per-text contract.
        if len(vectors) != len(texts):
            msg = "embedding API returned wrong number of vectors"
            raise RuntimeError(msg)
        for row in vectors:
            if len(row) != self._embedding_dim:
                msg = "embedding vector length does not match embedding_dim"
                raise RuntimeError(msg)
        return vectors
Usage
from semanticcache import SemanticCache, get_cache_settings

embedder = HttpExampleAiohttpEmbedder(
    base_url="http://127.0.0.1:9000",
    embedding_dim=768,
    cache_namespace="my-team-embed-v1-d768",
)
# Or: HttpExampleHttpxEmbedder(...)
cache = SemanticCache(embedder=embedder, settings=get_cache_settings())
Pick cache_namespace so it changes whenever model, pooling, or vector width changes; otherwise you risk reading incompatible rows from an old table. If storage is shared, include an application or environment prefix so your namespace cannot match another service’s built-in vendor:model:dimensions string by accident.
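One way to keep that discipline is to derive the namespace from its ingredients rather than typing it by hand. The helper below is a hypothetical sketch, not a library function: build_cache_namespace, its parameter names, and the idea of bumping a revision string when pooling or preprocessing changes are all assumptions for illustration.

```python
import re


def build_cache_namespace(
    prefix: str,
    model_id: str,
    dimensions: int,
    revision: str = "v1",
) -> str:
    """Join prefix, model id, revision, and width into one stable namespace."""
    if dimensions <= 0:
        raise ValueError("dimensions must be positive")
    parts = [prefix, model_id, revision, f"d{dimensions}"]
    # Lowercase each part and collapse runs of other characters into one hyphen.
    cleaned = [re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-") for part in parts]
    if not all(cleaned):
        raise ValueError("every namespace part must contain a letter or digit")
    return "-".join(cleaned)


# Reproduces the namespace used in the usage snippet above.
namespace = build_cache_namespace("my-team", "embed", 768)
```

Changing the model id, the width, or the revision yields a different string, so rows written under an older setup are not read back by accident.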
OpenAIEmbedder and send_dimensions_to_api
OpenAIEmbedder (optional extra embed-openai) maps the dimensions constructor argument to both storage width and, by default, the OpenAI embeddings.create request body. Some models (for example text-embedding-ada-002) return a fixed vector size, and the API may reject a dimensions parameter. In that case, set send_dimensions_to_api=False. The library still uses dimensions for embedding_dim, validation, and cache_namespace; it only omits the field from the API call.
from semanticcache.embedders import OpenAIEmbedder

# Fixed-size model: do not send "dimensions" to the API, but keep local width 1536.
ada = OpenAIEmbedder(
    model_name="text-embedding-ada-002",
    dimensions=1536,
    send_dimensions_to_api=False,
)
For text-embedding-3-small / text-embedding-3-large, the default send_dimensions_to_api=True is appropriate when you want a reduced output width (as supported by that model family).
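The effect of the flag on the outgoing request can be pictured with the sketch below. It is an illustration of the behavior described in this section, not the library's source: build_embeddings_request is a hypothetical helper, and the only assumption about the OpenAI API is that embeddings.create accepts model, input, and an optional dimensions field.

```python
from typing import Any


def build_embeddings_request(
    model_name: str,
    texts: list[str],
    dimensions: int,
    send_dimensions_to_api: bool = True,
) -> dict[str, Any]:
    """Build keyword arguments for an embeddings.create-style call."""
    request: dict[str, Any] = {"model": model_name, "input": texts}
    if send_dimensions_to_api:
        # Only models that support reduced widths should receive this field.
        request["dimensions"] = dimensions
    return request


# Fixed-size model: the width stays a local setting and is not sent.
ada_request = build_embeddings_request(
    "text-embedding-ada-002", ["hello"], 1536, send_dimensions_to_api=False
)
# Reduced width on a model family that supports it.
small_request = build_embeddings_request("text-embedding-3-small", ["hello"], 512)
```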
VoyageEmbedder (embed-voyage)
VoyageEmbedder (optional extra embed-voyage) sends embedding requests asynchronously to https://api.voyageai.com/v1/embeddings using aiohttp, as recommended by Voyage for async workloads. Before each batch, it validates inputs locally via voyageai.Client.tokenize, which uses Voyage's Hugging Face tokenizer and makes no network call.
Install with:
pip install 'fastapi-semcache[embed-voyage]'
Constructor
VoyageEmbedder(
    model_name="voyage-4",
    *,
    dimensions=1024,
    output_dimension=None,
    input_type=None,
    api_key=None,
)
- model_name: Voyage model id. Recommended: voyage-4-large, voyage-4, voyage-4-lite, voyage-3, voyage-3.5, voyage-code-3. Defaults to voyage-4.
- dimensions: Storage and validation width. Must match the model's actual output (or output_dimension when set).
- output_dimension: When set, passed as output_dimension in the API request. Only supported by voyage-4-*, voyage-3-large, voyage-3.5*, and voyage-code-3 (valid values: 256, 512, 1024, 2048). When used, set dimensions to the same value.
- input_type: Optional hint: None, "query", or "document". Use "document" when indexing content, "query" for lookup to improve retrieval accuracy. Passed directly to the API.
- api_key: Voyage API key. Defaults to the VOYAGE_API_KEY environment variable when omitted.
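If you compute these arguments from your own configuration, a small pre-flight check can catch a mismatched pair before any request is made. The sketch below is hypothetical and not part of the library; check_voyage_widths and the constant name are illustrative, and the allowed values are simply the ones listed for output_dimension above.

```python
from typing import Optional

# Values listed for output_dimension in the parameter description above.
VOYAGE_OUTPUT_DIMENSIONS = frozenset({256, 512, 1024, 2048})


def check_voyage_widths(dimensions: int, output_dimension: Optional[int] = None) -> int:
    """Return the width to store, or raise ValueError on an inconsistent pair."""
    if output_dimension is None:
        # Nothing is sent to the API; dimensions must match the model's native width.
        return dimensions
    if output_dimension not in VOYAGE_OUTPUT_DIMENSIONS:
        raise ValueError(f"unsupported output_dimension: {output_dimension}")
    if output_dimension != dimensions:
        raise ValueError(
            f"dimensions ({dimensions}) must equal output_dimension ({output_dimension})"
        )
    return dimensions
```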
Example
from semanticcache import SemanticCache, get_cache_settings
from semanticcache.embedders import VoyageEmbedder

cache = SemanticCache(
    embedder=VoyageEmbedder(
        model_name="voyage-4-large",
        dimensions=1024,
        input_type="document",
        api_key=get_cache_settings().voyage_api_key,
    ),
    settings=get_cache_settings(),
)
Through get_embedder
Set SEMANTIC_CACHE_EMBEDDER_TYPE=voyage. The following environment variables configure the factory path:
| Variable | Default | Description |
|---|---|---|
| VOYAGE_API_KEY / SEMANTIC_CACHE_VOYAGE_API_KEY | None | API key |
| SEMANTIC_CACHE_VOYAGE_EMBEDDING_MODEL | voyage-4 | Model id |
| SEMANTIC_CACHE_VOYAGE_EMBEDDING_DIMENSIONS | 1024 | Vector width |
| SEMANTIC_CACHE_VOYAGE_INPUT_TYPE | None | query, document, or unset |
Notes
- The aiohttp.ClientSession is created lazily on the first embed() call and shared across all requests. Call await embedder.aclose() on shutdown, or await cache.close(), which invokes aclose() when the embedder implements it.
- Token validation via voyageai.Client.tokenize is a local CPU operation - it loads the model's Hugging Face tokenizer on first call.
- Batches are capped at 1,000 texts per request (Voyage's documented hard limit).
CohereEmbedder (embed-cohere)
CohereEmbedder (optional extra embed-cohere) calls Cohere's embed API through the official cohere.AsyncClient. By default it uses AsyncClient.embed, which batches requests at 96 texts per call. When output_dimension is set (supported on embed-v4 and newer), it uses AsyncClient.v2.embed with manual batching so the reduced width is sent to the API.
Install with:
pip install 'fastapi-semcache[embed-cohere]'
Constructor
CohereEmbedder(
    model_name="embed-v4.0",
    *,
    dimensions=1536,
    input_type="search_document",
    output_dimension=None,
    truncate=None,
    api_key=None,
    base_url=None,
)
- model_name: Cohere embed model id (for example embed-v4.0, embed-english-v3.0, embed-multilingual-v3.0). Defaults to embed-v4.0.
- dimensions: Storage and validation width. Must match the model's actual output (1536 for embed-v4.0 by default, or output_dimension when set).
- input_type: Required hint for embed v3+: search_document, search_query, classification, or clustering. Use search_document when indexing content and search_query for lookup. Defaults to search_document.
- output_dimension: When set, passed to the v2 API. Only supported by embed-v4 and newer (256, 512, 1024, 1536). Set dimensions to the same value.
- truncate: Optional NONE, START, or END for over-length inputs. When omitted, the API default applies.
- api_key: Cohere API key. Defaults to COHERE_API_KEY when omitted.
- base_url: Optional API base URL for enterprise or proxy deployments.
Example
from semanticcache import SemanticCache, get_cache_settings
from semanticcache.embedders import CohereEmbedder

cache = SemanticCache(
    embedder=CohereEmbedder(
        model_name="embed-v4.0",
        dimensions=1536,
        input_type="search_document",
        api_key=get_cache_settings().cohere_api_key,
    ),
    settings=get_cache_settings(),
)
Through get_embedder
Set SEMANTIC_CACHE_EMBEDDER_TYPE=cohere. The following environment variables configure the factory path:
| Variable | Default | Description |
|---|---|---|
| COHERE_API_KEY / SEMANTIC_CACHE_COHERE_API_KEY | None | API key |
| SEMANTIC_CACHE_COHERE_EMBEDDING_MODEL | embed-v4.0 | Model id |
| SEMANTIC_CACHE_COHERE_EMBEDDING_DIMENSIONS | 1536 | Vector width |
| SEMANTIC_CACHE_COHERE_INPUT_TYPE | search_document | search_document, search_query, etc. |
Notes
- Call await embedder.aclose() on shutdown, or await cache.close(), which invokes aclose() when the embedder implements it.
- Without output_dimension, the SDK's built-in batching on AsyncClient.embed applies (96 texts per request).
OllamaEmbedder (embed-ollama)
OllamaEmbedder uses the official openai AsyncOpenAI client against Ollama’s OpenAI-compatible embeddings endpoint (configure base_url with the /v1 suffix, for example http://127.0.0.1:11434/v1). Install fastapi-semcache[embed-ollama].
You must pass model_name and dimensions explicitly. After each response, the backend checks that every vector length equals dimensions so misconfiguration fails fast.
Optional api_key: when omitted, a placeholder Bearer value is sent so the SDK does not pick up OPENAI_API_KEY from the environment; set api_key when your Ollama deployment requires auth.
Through get_embedder(settings), SEMANTIC_CACHE_OLLAMA_EMBEDDING_MODEL, SEMANTIC_CACHE_OLLAMA_EMBEDDING_DIMENSIONS, SEMANTIC_CACHE_OLLAMA_BASE_URL, and OLLAMA_API_KEY / SEMANTIC_CACHE_OLLAMA_API_KEY populate those fields.
Reusing a long-lived HTTP client
Opening a client per embed call is simple but not ideal under load. You can hold an httpx.AsyncClient or aiohttp.ClientSession on the embedder and close it when your app shuts down (for example in a FastAPI lifespan handler). Implement aclose() on custom embedders with long-lived clients; SemanticCache.close() awaits it when present.
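A minimal sketch of that pattern follows. LazyClient is a hypothetical helper, not something the library ships; it takes a factory and a closer so the same code could wrap an httpx.AsyncClient (closed with aclose()) or an aiohttp.ClientSession (closed with close()). The snippet itself depends only on the standard library.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any, Optional


class LazyClient:
    """Create a client on first use; aclose() closes it and is safe to call twice."""

    def __init__(
        self,
        factory: Callable[[], Any],
        closer: Callable[[Any], Awaitable[None]],
    ) -> None:
        self._factory = factory
        self._closer = closer
        self._client: Optional[Any] = None

    def get(self) -> Any:
        # There is no await between the check and the assignment, so tasks
        # on one event loop cannot end up creating two clients.
        if self._client is None:
            self._client = self._factory()
        return self._client

    async def aclose(self) -> None:
        # Detach first so a repeated aclose() call becomes a no-op.
        client, self._client = self._client, None
        if client is not None:
            await self._closer(client)
```

A custom embedder could build one LazyClient in its constructor, call get() inside embed (so an aiohttp session is created while the event loop is running), and delegate its own aclose() to the holder.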
See also
Built-in embedders and optional extras are described in the repository README.md (Install section).