memvector/ext-memvector

High-performance Local Vector Storage & Embedding Engine for PHP

Package info

github.com/memvector/ext-memvector

Language: C++

Type: php-ext

Ext name: ext-memvector

pkg:composer/memvector/ext-memvector

Statistics

Installs: 0

Dependents: 0

Suggesters: 0

Stars: 4

Open Issues: 0

1.0.0 2026-03-01 13:57 UTC

This package is not auto-updated.

Last update: 2026-03-02 12:05:40 UTC


README

A high-performance PHP extension that brings local AI infrastructure to PHP developers, optimized for AI workloads. Vector similarity search, text embedding, and cross-encoder reranking — all running directly in your PHP process. Built in C++17 with AVX2 SIMD acceleration. No external vector database, no embedding API, no network round-trips. Embed, store, search, and rerank in under 10 ms.

Works best with small, specialized local GGUF models (embedding + reranker, 24-636 MB) that run efficiently on CPU, and with long-lived PHP runtimes (OpenSwoole, ReactPHP, RoadRunner, FrankenPHP) where persistent workers keep models loaded in memory across requests — giving you single-digit millisecond latency for the entire embed → search → rerank pipeline with zero external dependencies.

Quick Start

// Semantic search with local embeddings (requires --with-llama)
$emb = new MemVectorEmbedding('/models/all-MiniLM-L6-v2.Q8_0.gguf');
$store = new MemVectorStore(null, ['dimensions' => $emb->dimensions()]);

$store->set('php',     $emb->embed('PHP is a server-side scripting language'),     'lang');
$store->set('python',  $emb->embed('Python is used for machine learning'),         'lang');
$store->set('gravity', $emb->embed('Gravity pulls objects toward the earth'),      'science');
$store->set('dna',     $emb->embed('DNA encodes genetic information'),             'science');

$results = $store->search($emb->embed('programming languages'), 2);
// [['key' => 'php', 'score' => 0.82, 'metadata' => 'lang'],
//  ['key' => 'python', 'score' => 0.79, 'metadata' => 'lang']]

// Two-stage retrieval: vector search + cross-encoder rerank (requires --with-llama)
$rr = new MemVectorReranker('/models/bge-reranker-v2-m3-Q8_0.gguf');
$candidates = $store->search($emb->embed('programming languages'), 50);
$results = $rr->rerank('programming languages', $candidates, 5);

// Or bring your own vectors (no llama.cpp needed)
$store = new MemVectorStore('/data/vectors', ['dimensions' => 1536]);
$store->set('doc_1', $openai_embedding, '{"title": "Introduction"}');
$results = $store->search($query_embedding, 10);

Features

  • Key-value API — set(key, vector), get(key), delete(key), batchSet() with upsert semantics
  • Text embeddings — optional llama.cpp integration for local GGUF model inference
  • Cross-encoder reranking — two-stage retrieval: fast vector search → accurate cross-encoder rerank
  • Three storage modes — memory, disk (mmap), shared memory (cross-process)
  • HNSW index — auto-builds transparently during search for fast approximate nearest neighbor
  • Vector quantization — F16, Int8 (scalar), binary, and product quantization (PQ)
  • Four distance metrics — cosine, dot product, euclidean, manhattan
  • Lock-free concurrency — reads/writes via std::atomic
  • AVX2 SIMD — accelerated cosine/dot distance computation
  • Multi-segment architecture — auto-grows with fixed-capacity segments
  • JSONL import/export — dump() and load() for backup, migration, and interop
  • Memory limits — cap total mmap usage across segments

Performance

MemVector runs entirely in-process — no network calls, no serialization, no external services. This eliminates the latency and cost of external APIs and vector databases.

Embedding generation

| Approach | Latency per call | Cost |
|----------|------------------|------|
| OpenAI API (text-embedding-3-small) | 50-200 ms (network round-trip) | $0.02 / 1M tokens |
| Cohere API (embed-english-v3.0) | 50-200 ms (network round-trip) | $0.10 / 1M tokens |
| MemVector + local GGUF model | 5-15 ms (in-process) | Free |

Local embedding with MemVector is 10-40x faster than API calls and has zero per-token cost. The model runs inside the PHP process via llama.cpp — no HTTP overhead, no API keys, no rate limits.

Vector search

| Approach | Latency per query | Notes |
|----------|-------------------|-------|
| Pinecone / Qdrant / Weaviate (cloud) | 10-50 ms (network) | Managed service, per-vector pricing |
| Pinecone / Qdrant / Weaviate (self-hosted) | 5-20 ms (network) | Separate process, TCP/gRPC overhead |
| PostgreSQL + pgvector | 5-50 ms (query + network) | Shared DB, connection pooling overhead |
| MemVector (in-process) | 0.1-5 ms | No network, no serialization |

MemVector searches vectors directly in the PHP process memory (or mmap'd files). There is no serialization, no socket, no protocol parsing. HNSW index + AVX2 SIMD keeps search under 1 ms for most workloads up to millions of vectors.

Full RAG pipeline (embed + search + rerank)

| Approach | Total latency | Components |
|----------|---------------|------------|
| OpenAI embed + Pinecone search | 100-400 ms | 2 network round-trips |
| OpenAI embed + Pinecone search + Cohere rerank | 200-600 ms | 3 network round-trips |
| MemVector (all in-process) | 10-30 ms | 0 network round-trips |

A complete two-stage retrieval pipeline (embed query → vector search → cross-encoder rerank) runs in a single PHP process with no external dependencies.

Memory footprint

| Component | RSS |
|-----------|-----|
| MemVector extension (no model) | ~1 MB |
| + embedding model (all-MiniLM-L6-v2, 24 MB GGUF) | ~33 MB |
| + reranker model (bge-reranker-v2-m3, 636 MB GGUF) | ~200 MB |
| + 100K vectors (384 dim, f32) | ~150 MB |
| + 100K vectors (384 dim, int8 quantized) | ~40 MB |

Model weights are mmap()'d and shared across php-fpm/OpenSwoole workers by the OS page cache.
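The vector rows in the table follow directly from count × dimensions × bytes per dimension. A plain-PHP sketch of that arithmetic (the table's MB figures include segment and allocator overhead, so they are rounded):

```php
<?php
// Raw storage cost of N vectors at a given dimensionality and precision.
function vector_bytes(int $count, int $dims, int $bytesPerDim): int
{
    return $count * $dims * $bytesPerDim;
}

$f32  = vector_bytes(100_000, 384, 4); // float32: 4 bytes per dimension
$int8 = vector_bytes(100_000, 384, 1); // int8: 1 byte per dimension

printf("f32:  %.1f MB\n", $f32 / 1_000_000);  // ~153.6 MB -> table's ~150 MB
printf("int8: %.1f MB\n", $int8 / 1_000_000); // ~38.4 MB -> table's ~40 MB
```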

Why small models + OpenSwoole

MemVector is designed for small, task-specific models — not large generative LLMs. Embedding models (24-138 MB) and reranker models (366-636 MB) are compact enough to run on CPU with low latency and fit comfortably in server memory. These models do one thing well: convert text to vectors or score relevance — tasks where small specialized models match or exceed large API-hosted models in quality.

Combined with OpenSwoole, where each worker process loads the model once and reuses it across thousands of requests, you get a fully self-contained semantic search stack:

  • No API costs — models run locally, no per-token billing
  • No network latency — embed, search, and rerank happen in-process
  • No external services — no vector database, no embedding API, no reranker API to deploy and maintain
  • Predictable performance — no cold starts, no rate limits, no variance from shared infrastructure

A single OpenSwoole worker with a 24 MB embedding model + MemVector store handles embed + search in under 10 ms per request using ~50 MB of RAM.

Requirements

  • PHP 8.1+
  • C++17 compiler (GCC 7+ or Clang 5+)
  • Optional: llama.cpp for text embeddings and cross-encoder reranking

Build

phpize
./configure --enable-memvector
make
make test

Install llama.cpp (optional, for embeddings & reranking)

macOS (Homebrew):

brew install llama.cpp

Linux (build from source):

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=OFF
cmake --build build --config Release -j$(nproc)
sudo cmake --install build --prefix /usr/local
sudo ldconfig  # refresh shared library cache

For GPU acceleration, replace -DGGML_CUDA=OFF with -DGGML_CUDA=ON (requires CUDA toolkit).

Build with llama.cpp support (embeddings & reranking)

phpize
./configure --enable-memvector --with-llama=/usr/local
make
make test

Pass --with-llama=DIR if llama.cpp is installed in a non-standard location.

Download models

Embedding models — any GGUF embedding model works. Recommended starter (24 MB, 384 dimensions):

curl -L -o all-MiniLM-L6-v2-Q8_0.gguf \
  https://huggingface.co/leliuga/all-MiniLM-L6-v2-GGUF/resolve/main/all-MiniLM-L6-v2.Q8_0.gguf
| Model | Dimensions | Size | Download |
|-------|------------|------|----------|
| all-MiniLM-L6-v2 (Q8) | 384 | 24 MB | HuggingFace |
| nomic-embed-text-v1.5 (Q8) | 768 | 138 MB | HuggingFace |
| bge-small-en-v1.5 (F16) | 384 | 67 MB | HuggingFace |

Reranker models — GGUF cross-encoder models for two-stage retrieval:

curl -L -o bge-reranker-v2-m3-Q8_0.gguf \
  https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF/resolve/main/bge-reranker-v2-m3-Q8_0.gguf
| Model | Size | Download |
|-------|------|----------|
| bge-reranker-v2-m3 (Q8) | 636 MB | HuggingFace |
| bge-reranker-v2-m3 (Q2) | 366 MB | HuggingFace |

Configure options

| Flag | Description |
|------|-------------|
| --enable-memvector | Enable the extension (required) |
| --enable-memvector-avx2 | Enable AVX2 SIMD (auto-detected by default) |
| --with-llama[=DIR] | Enable llama.cpp support (embeddings & reranking) |

Feature Detection

Optional features can be detected at runtime via constants:

if (defined('MEMVECTOR_SHM')) {
    // Shared memory mode available
    $store = new MemVectorStore('mystore', ['storage' => 'shm', 'dimensions' => 128]);
}

if (defined('MEMVECTOR_LLAMA')) {
    // llama.cpp support compiled in (embeddings & reranking)
    $emb = new MemVectorEmbedding('/path/to/embedding-model.gguf');
    $rr  = new MemVectorReranker('/path/to/reranker-model.gguf');
}

PHP API

Class: MemVectorStore

__construct(?string $dir = null, ?array $options = null)

Create or open a vector store.

// Memory mode (ephemeral, single-process)
$store = new MemVectorStore();
$store = new MemVectorStore(null, ['dimensions' => 128]);

// Disk mode (persistent, file-backed mmap)
$store = new MemVectorStore('/path/to/dir', ['dimensions' => 128]);

// Shared memory mode (persistent, cross-process)
$store = new MemVectorStore('mystore', ['storage' => 'shm', 'dimensions' => 128]);

Options (all string values are case-insensitive):

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| storage | string | auto | 'disk', 'memory', or 'shm' |
| dimensions | int | 1536 | Vector dimensionality (1-4096) |
| distance | string | 'cosine' | 'cosine', 'dot', 'euclidean', 'manhattan' |
| quantization | string | 'none' | 'none', 'f16', 'int8', 'binary', 'pq' |
| segment_size | int | 10M (disk/shm), 4096 (memory) | Max records per segment |
| writable | bool | true | Set false to open read-only (shm mode) |
| memory_limit | int\|string | 0 | Max total mmap bytes ('1G', '512M', '128K'); 0 = unlimited |
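Size strings like '1G' and '512M' follow the usual binary-suffix shorthand. A plain-PHP sketch of how such strings expand to bytes (an illustration of the notation, not the extension's actual parser):

```php
<?php
// Expand '128K' / '512M' / '1G' into a byte count (binary units).
function parse_size(string $s): int
{
    $units = ['K' => 1024, 'M' => 1024 ** 2, 'G' => 1024 ** 3];
    $suffix = strtoupper(substr($s, -1));
    if (isset($units[$suffix])) {
        return (int) substr($s, 0, -1) * $units[$suffix];
    }
    return (int) $s; // plain byte count
}

echo parse_size('1G'), "\n";   // 1073741824
echo parse_size('512M'), "\n"; // 536870912
```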

set(string $key, array $vector, ?string $metadata = null): bool

Insert or replace a vector by key (upsert semantics).

$store->set('doc_42', $vector);
$store->set('doc_42', $new_vector, '{"category": "science"}'); // replaces previous
  • $key — Unique string key (max 63 chars)
  • $vector — Float array matching the store's dimensions
  • $metadata — Optional metadata string (max 4096 bytes)

batchSet(array $batch): int

Insert or replace multiple vectors in one call. Returns the number of records written.

$count = $store->batchSet([
    ['key' => 'doc_1', 'vector' => $vec1],
    ['key' => 'doc_2', 'vector' => $vec2, 'metadata' => '{"tag": "a"}'],
]);

search(array $vector, int $topK = 10, float $precision = 50): array

Search for the nearest neighbors of a query vector.

$results = $store->search($query_vector, 5);
$results = $store->search($query_vector, 5, 100); // higher precision
$results = $store->search($query_vector, 5, 0);   // force brute-force
  • $topK — Number of results (max 1000)
  • $precision — HNSW beam width. Higher = better recall, slower. 0 = brute-force.

The HNSW index auto-builds when precision > 0 and vector count reaches auto_index_threshold.

Returns results sorted by score (highest first):

[
    ['key' => 'doc_42', 'score' => 0.95, 'metadata' => '{"category": "science"}'],
    ['key' => 'doc_7',  'score' => 0.89, 'metadata' => null],
]

Score meaning by distance metric:

  • Cosine: similarity (0 to 1, higher = more similar)
  • Dot: raw dot product (higher = more similar)
  • Euclidean/Manhattan: negated distance (higher = closer)
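The sign conventions above can be checked in plain PHP. A sketch of the score each metric would produce for two small vectors (MemVector computes these in C++ with SIMD; this only illustrates the conventions):

```php
<?php
function dot(array $a, array $b): float {
    $s = 0.0;
    foreach ($a as $i => $v) { $s += $v * $b[$i]; }
    return $s;
}
function norm(array $a): float { return sqrt(dot($a, $a)); }

$a = [1.0, 0.0];
$b = [1.0, 1.0];

$cosine   = dot($a, $b) / (norm($a) * norm($b)); // similarity, higher = more similar
$dotScore = dot($a, $b);                         // raw dot product

$diff = array_map(fn($x, $y) => $x - $y, $a, $b);
$euclidean = -norm($diff);                       // negated distance, higher = closer

printf("cosine: %.4f  dot: %.1f  euclidean: %.4f\n", $cosine, $dotScore, $euclidean);
// cosine ~0.7071, dot 1.0, euclidean ~-1.0 — higher means closer in all three
```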

get(string $key): ?array

Retrieve a record by key. Returns null if not found.

$record = $store->get('doc_42');
// ['key' => 'doc_42', 'vector' => [0.1, ...], 'metadata' => '...']

delete(string $key): bool

Soft-delete a record by key. Returns true if found and deleted.

count(): int

Return the total number of live records.

stats(): array

Return store statistics.

| Key | Type | Description |
|-----|------|-------------|
| storage | string | "memory", "disk", or "shm" |
| total_count | int | Total records across all segments |
| segment_count | int | Number of segments |
| dimensions | int | Vector dimensionality |
| distance | string | "cosine", "dot", "euclidean", or "manhattan" |
| quantization | string | "none", "f16", "int8", "binary", or "pq" |
| bytes_per_vector | int | Storage bytes per vector |
| index_status | string | "none" or "ready" |
| unindexed_count | int | Records not yet in the HNSW index |
| memory_mb | float | Total memory/mmap usage in MB |

dump(?string $path = null): string|int

Export all records as JSONL. Each line: {"key":"...","vector":[...],"metadata":"..."}

$jsonl = $store->dump();                        // returns string
$count = $store->dump('/backup/vectors.jsonl');  // streams to file, returns count

load(string $path): int

Import records from a JSONL file (inverse of dump()). Returns the number of records loaded. Uses upsert semantics.

$count = $store->load('/backup/vectors.jsonl');
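Because every JSONL line is independent, the dump format is easy to produce or consume from any tool. A plain-PHP sketch writing and reading the same line format (the extension streams this natively; the temp path here is illustrative):

```php
<?php
// Write one record per line in dump()'s format:
// {"key":"...","vector":[...],"metadata":"..."}
$path = sys_get_temp_dir() . '/vectors-demo.jsonl';
$records = [
    ['key' => 'doc_1', 'vector' => [0.1, 0.2], 'metadata' => '{"title":"Intro"}'],
    ['key' => 'doc_2', 'vector' => [0.3, 0.4], 'metadata' => null],
];

$fh = fopen($path, 'w');
foreach ($records as $r) {
    fwrite($fh, json_encode($r) . "\n");
}
fclose($fh);

// Read it back line by line — the inverse of the loop above.
$loaded = [];
foreach (file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $loaded[] = json_decode($line, true);
}
echo count($loaded), " records\n"; // 2 records
```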

clear(): bool

Remove all records. Memory and shm modes only.

snapshot(string $dir): bool

Persist a memory or shm store to disk. Can be reloaded with 'storage' => 'memory':

$store->snapshot('/data/snapshot');
// Later:
$store = new MemVectorStore('/data/snapshot', ['storage' => 'memory', 'dimensions' => 128]);

close(): void

Close the store and release resources. Called automatically on GC. Safe to call multiple times.

static destroy(string $name): bool

Unlink a shared memory store. Requires MEMVECTOR_SHM.

Class: MemVectorEmbedding

Requires --with-llama at build time. Detect with defined('MEMVECTOR_LLAMA').

__construct(string $model_path, ?array $options = null)

Load a GGUF embedding model.

$emb = new MemVectorEmbedding('/models/all-MiniLM-L6-v2.Q8_0.gguf', [
    'context_size' => 512,
    'gpu_layers'   => 0,
    'normalize'    => true,
]);

Options:

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| context_size | int | 512 | Max tokens per text input |
| gpu_layers | int | 0 | Number of layers to offload to GPU |
| normalize | bool | true | L2-normalize output vectors |

Throws MemVectorException if the model file cannot be loaded.
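The normalize option scales each output vector to unit Euclidean length, which makes cosine and dot-product scores coincide. A plain-PHP illustration of what L2 normalization does:

```php
<?php
// L2 normalization: scale a vector so its Euclidean length is 1.
function l2_normalize(array $v): array
{
    $norm = sqrt(array_sum(array_map(fn($x) => $x * $x, $v)));
    return array_map(fn($x) => $x / $norm, $v);
}

$v = l2_normalize([3.0, 4.0]); // length 5 -> [0.6, 0.8]
$len = sqrt($v[0] ** 2 + $v[1] ** 2);
printf("[%.1f, %.1f] length %.1f\n", $v[0], $v[1], $len); // [0.6, 0.8] length 1.0
```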

embed(string $text): array

Generate an embedding vector for a single text. Returns float[] of size dimensions().

$vector = $emb->embed("The quick brown fox");

embedBatch(array $texts): array

Generate embeddings for multiple texts. Returns float[][].

$vectors = $emb->embedBatch(["Hello", "World", "Test"]);
// count($vectors) === 3, count($vectors[0]) === $emb->dimensions()

dimensions(): int

Return the model's embedding dimensionality.

close(): void

Free the model and context. Called automatically on GC. Safe to call multiple times.

Class: MemVectorReranker

Cross-encoder reranker for two-stage retrieval. Requires --with-llama at build time. Detect with defined('MEMVECTOR_LLAMA').

__construct(string $model_path, ?array $options = null)

Load a GGUF cross-encoder reranker model.

$rr = new MemVectorReranker('/models/bge-reranker-v2-m3-Q8_0.gguf', [
    'context_size' => 512,
    'gpu_layers'   => 0,
]);

Options:

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| context_size | int | 512 | Max tokens per query+document pair |
| gpu_layers | int | 0 | Number of layers to offload to GPU |

Throws MemVectorException if the model file cannot be loaded.

Recommended model: bge-reranker-v2-m3 (Q8_0, 636 MB)

score(string $query, array $texts): array

Score each text against the query using the cross-encoder. Returns [['index' => int, 'score' => float], ...] sorted descending by score.

$scores = $rr->score("What is PHP?", [
    "PHP is a scripting language.",
    "The weather is sunny.",
]);
// → [['index' => 0, 'score' => 0.87], ['index' => 1, 'score' => -3.2]]

rerank(string $query, array $candidates, int $topK = 0): array

Rerank search results from MemVectorStore::search(). Scores each candidate's metadata field against the query. Returns [['key' => string, 'score' => float, 'metadata' => string], ...] sorted descending. $topK = 0 returns all candidates.

$candidates = $store->search($emb->embed($query), 50);
$reranked = $rr->rerank($query, $candidates, 5);
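The reshaping rerank() performs — re-score each candidate's metadata against the query, sort descending, truncate to $topK — can be sketched in plain PHP with a toy word-overlap scorer standing in for the cross-encoder (hypothetical scoring; the real model produces far better relevance estimates):

```php
<?php
// Hypothetical stand-in scorer: count distinct query words appearing in the text.
function toy_score(string $query, string $text): float
{
    $q = array_unique(str_word_count(strtolower($query), 1));
    $t = str_word_count(strtolower($text), 1);
    return (float) count(array_intersect($q, $t));
}

function toy_rerank(string $query, array $candidates, int $topK = 0): array
{
    foreach ($candidates as &$c) {
        $c['score'] = toy_score($query, $c['metadata'] ?? '');
    }
    unset($c);
    usort($candidates, fn($a, $b) => $b['score'] <=> $a['score']);
    return $topK > 0 ? array_slice($candidates, 0, $topK) : $candidates;
}

$candidates = [
    ['key' => 'weather', 'score' => 0.41, 'metadata' => 'Rain forecast for the weekend.'],
    ['key' => 'php-web', 'score' => 0.39, 'metadata' => 'PHP powers dynamic web applications.'],
];
$top = toy_rerank('what is PHP', $candidates, 1);
echo $top[0]['key'], "\n"; // php-web
```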

close(): void

Free the model and context. Called automatically on GC. Safe to call multiple times.

Two-Stage Retrieval Example

$emb = new MemVectorEmbedding('/models/all-MiniLM-L6-v2.Q8_0.gguf');
$rr  = new MemVectorReranker('/models/bge-reranker-v2-m3-Q8_0.gguf');
$store = new MemVectorStore(null, ['dimensions' => $emb->dimensions()]);

// Index documents
foreach ($docs as $key => $text) {
    $store->set($key, $emb->embed($text), $text);
}

// Stage 1: Fast vector search (broad recall)
$candidates = $store->search($emb->embed($query), 50, 0);

// Stage 2: Accurate cross-encoder rerank (high precision)
$results = $rr->rerank($query, $candidates, 5);

Exception Class: MemVectorException

All errors throw MemVectorException (extends \Exception).

try {
    $store->set('x', [0.1]); // wrong dimension
} catch (MemVectorException $e) {
    echo $e->getMessage(); // "Vector dim mismatch: expected 128"
}

INI Settings

| Directive | Default | Description |
|-----------|---------|-------------|
| memvector.default_dir | /tmp/memvector | Default directory for disk stores |
| memvector.default_dim | 1536 | Default vector dimension |
| memvector.default_storage | 0 | Default storage mode |
| memvector.mem_initial_capacity | 4096 | Initial memory-mode capacity |
| memvector.hnsw_max_connections | 16 | HNSW max connections per node |
| memvector.hnsw_construction_ef | 200 | HNSW build beam width |
| memvector.hnsw_ef_search | 50 | HNSW search beam width |
| memvector.auto_index_threshold | 100 | Min vectors before HNSW auto-builds (0 = disable) |
| memvector.pq_subvectors | 0 | PQ sub-vector count (required for quantization = 'pq') |
| memvector.pq_training_size | 256 | Vectors to buffer before PQ codebook auto-trains |
| memvector.use_avx2 | 1 | Enable AVX2 SIMD (if compiled with support) |
| memvector.model_cache | 1 | Cache llama.cpp models per process (PHP_INI_SYSTEM; requires --with-llama) |

Usage Examples

Semantic Search with Embeddings

$emb = new MemVectorEmbedding('/models/all-MiniLM-L6-v2.Q8_0.gguf');
$store = new MemVectorStore(null, ['dimensions' => $emb->dimensions()]);

// Index a knowledge base
$articles = [
    'relativity'  => 'Einstein published the theory of general relativity in 1915',
    'evolution'   => 'Darwin proposed natural selection as the mechanism of evolution',
    'computing'   => 'Turing described a theoretical machine that could compute anything',
    'genetics'    => 'Watson and Crick discovered the double helix structure of DNA',
    'quantum'     => 'Heisenberg formulated the uncertainty principle in quantum mechanics',
];

foreach ($articles as $key => $text) {
    $store->set($key, $emb->embed($text), json_encode(['text' => $text]));
}

// Semantic search — finds related articles, not just keyword matches
$results = $store->search($emb->embed('biology and heredity'), 3);
foreach ($results as $r) {
    $meta = json_decode($r['metadata'], true);
    printf("  [%.2f] %s: %s\n", $r['score'], $r['key'], $meta['text']);
}
// Output:
//   [0.67] genetics: Watson and Crick discovered the double helix...
//   [0.51] evolution: Darwin proposed natural selection...
//   [0.24] quantum: Heisenberg formulated the uncertainty...

RAG Pipeline (Retrieval-Augmented Generation)

// 1. Build index from documents
$emb = new MemVectorEmbedding('/models/nomic-embed-text-v1.5.Q8_0.gguf');
$store = new MemVectorStore('/data/knowledge', ['dimensions' => $emb->dimensions()]);

foreach ($documents as $doc) {
    // Chunk long documents and index each chunk
    foreach (chunk_text($doc->content, 512) as $i => $chunk) {
        $store->set("{$doc->id}_chunk_{$i}", $emb->embed($chunk), json_encode([
            'doc_id' => $doc->id,
            'title'  => $doc->title,
            'chunk'  => $chunk,
        ]));
    }
}

// 2. Retrieve relevant context for a user query
$query = "How does photosynthesis work?";
$results = $store->search($emb->embed($query), 5, 100);

$context = implode("\n\n", array_map(function ($r) {
    return json_decode($r['metadata'], true)['chunk'];
}, $results));

// 3. Send to LLM with retrieved context
$prompt = "Context:\n{$context}\n\nQuestion: {$query}\nAnswer:";
// $answer = $llm->complete($prompt);
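chunk_text() above is not provided by the extension — use whatever chunking strategy fits your documents. A minimal word-window sketch with overlap (the 512 in the example is a token budget; here it is approximated in words):

```php
<?php
// Split text into windows of $size words, overlapping by $overlap words.
function chunk_text(string $text, int $size = 512, int $overlap = 32): array
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $chunks = [];
    $step = max(1, $size - $overlap);
    for ($i = 0; $i < count($words); $i += $step) {
        $chunks[] = implode(' ', array_slice($words, $i, $size));
        if ($i + $size >= count($words)) break; // last window reached
    }
    return $chunks;
}

$chunks = chunk_text(implode(' ', array_fill(0, 1000, 'word')), 512, 32);
echo count($chunks), "\n"; // 3
```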

Two-Stage Retrieval with Reranking

$emb = new MemVectorEmbedding('/models/all-MiniLM-L6-v2.Q8_0.gguf');
$rr  = new MemVectorReranker('/models/bge-reranker-v2-m3-Q8_0.gguf');
$store = new MemVectorStore(null, ['dimensions' => $emb->dimensions()]);

// Index documents (store text as metadata for reranking)
$docs = [
    'php-web'     => 'PHP is widely used for building dynamic web applications and APIs.',
    'python-ml'   => 'Python is the dominant language for machine learning and data science.',
    'php-history' => 'PHP was originally created by Rasmus Lerdorf in 1994.',
    'weather'     => 'The weather forecast predicts rain for the upcoming weekend.',
];

foreach ($docs as $key => $text) {
    $store->set($key, $emb->embed($text), $text);
}

// Stage 1: Fast vector search — broad recall (top 50)
$query = "Tell me about PHP";
$candidates = $store->search($emb->embed($query), 50, 0);

// Stage 2: Cross-encoder rerank — high precision (top 3)
$results = $rr->rerank($query, $candidates, 3);

foreach ($results as $r) {
    printf("  [%.4f] %s: %s\n", $r['score'], $r['key'], $r['metadata']);
}
// PHP-related documents rank at the top, weather at the bottom

Persistent Disk Store with HNSW

$store = new MemVectorStore('/var/data/vectors', [
    'dimensions'   => 768,
    'distance'     => 'cosine',
    'segment_size' => 100000,
]);

// Index vectors — HNSW auto-builds on first search
foreach ($documents as $doc) {
    $store->set($doc->id, $doc->embedding, json_encode($doc->metadata));
}

// Fast approximate search
$results = $store->search($query, 10, 100);
$store->close();

// Data and index persist — reopen anytime
$store = new MemVectorStore('/var/data/vectors', ['dimensions' => 768]);
$results = $store->search($query, 10, 100);

Quantized Store (F16 — half memory)

$store = new MemVectorStore(null, [
    'dimensions'   => 768,
    'quantization' => 'f16',  // 2 bytes/dim instead of 4
]);

$store->set('doc_1', $embedding);
$results = $store->search($query, 10);

Product Quantization (PQ — extreme compression)

ini_set('memvector.pq_subvectors', 96);      // 96 sub-vectors of 8 dims each
ini_set('memvector.pq_training_size', 1000); // auto-train after 1000 vectors

$store = new MemVectorStore(null, [
    'dimensions'   => 768,
    'quantization' => 'pq',
]);

// First 1000 vectors are buffered for codebook training
for ($i = 0; $i < 5000; $i++) {
    $store->set("vec_$i", $embeddings[$i]);
}
// After training: 96 bytes/vector instead of 3072 (32x compression)
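The compression figures follow from bytes per vector at each quantization level. A plain-PHP sketch of the storage arithmetic (code sizes per the settings above; not the extension's internal layout — int8 also stores a small per-vector scale):

```php
<?php
// Bytes per 768-dim vector under each quantization mode.
$dims = 768;
$bytes = [
    'none (f32)' => $dims * 4,        // 3072 bytes
    'f16'        => $dims * 2,        // 1536 bytes
    'int8'       => $dims * 1,        // 768 bytes (plus per-vector scale)
    'binary'     => intdiv($dims, 8), // 96 bytes, 1 bit per dimension
    'pq'         => 96,               // 96 sub-vectors x 1-byte code each
];

foreach ($bytes as $mode => $b) {
    printf("%-10s %4d bytes (%.0fx vs f32)\n", $mode, $b, $bytes['none (f32)'] / $b);
}
// pq: 3072 / 96 = 32x compression, as quoted above
```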

JSONL Backup & Restore

$store = new MemVectorStore('/data/vectors', ['dimensions' => 768]);

// Export
$count = $store->dump('/backup/vectors.jsonl');
echo "Exported $count records\n";

// Import into a new store
$newStore = new MemVectorStore('/data/vectors_v2', ['dimensions' => 768]);
$loaded = $newStore->load('/backup/vectors.jsonl');
echo "Imported $loaded records\n";

Snapshot & Reload in Memory Mode

$store = new MemVectorStore(null, ['dimensions' => 128]);
// ... add vectors ...

$store->snapshot('/data/snapshot');
$store->close();

// Reload entirely in memory (fast startup, no disk I/O after load)
$store = new MemVectorStore('/data/snapshot', ['storage' => 'memory', 'dimensions' => 128]);

Shared Memory with OpenSwoole Workers

// Master: create and populate
$store = new MemVectorStore('search_index', [
    'storage'    => 'shm',
    'dimensions' => 1536,
]);
foreach ($documents as $doc) {
    $store->set($doc->id, $doc->embedding, $doc->title);
}
$store->close();

// Workers: attach read-only (zero-copy, cross-process)
$store = new MemVectorStore('search_index', [
    'storage'    => 'shm',
    'dimensions' => 1536,
    'writable'   => false,
]);
$results = $store->search($query_vector, 10, 100);
$store->close();

// Cleanup
MemVectorStore::destroy('search_index');

Memory-Limited Store

$store = new MemVectorStore(null, [
    'dimensions'   => 128,
    'segment_size' => 10000,
    'memory_limit' => '256M',
]);
// Throws MemVectorException when adding would exceed the limit

Model Cache (php-fpm)

When memvector.model_cache=1 (default), MemVectorEmbedding and MemVectorReranker cache loaded GGUF models per worker process. The first construction loads the model from disk; subsequent constructions with the same file path reuse the cached model and only create a lightweight context. This avoids re-parsing weights on every request. Disable with memvector.model_cache=0 in php.ini (PHP_INI_SYSTEM — cannot be changed at runtime).

Latency impact — construction time per request:

| Model | Size | Without cache | With cache | Speedup |
|-------|------|---------------|------------|---------|
| all-MiniLM-L6-v2 (Q8) | 24 MB | ~60 ms | ~1.5 ms | ~40x |
| nomic-embed-text-v1.5 (Q8) | 138 MB | ~300 ms | ~1.5 ms | ~200x |
| bge-reranker-v2-m3 (Q8) | 636 MB | ~1500 ms | ~1.5 ms | ~1000x |

Embed/score inference time is the same with or without the cache (~5-15 ms per call depending on model and text length).

Memory impact — per worker process:

| Model | Size | Cache RSS per worker | Context per request |
|-------|------|----------------------|---------------------|
| all-MiniLM-L6-v2 (Q8) | 24 MB | ~22 MB | ~1 MB |
| nomic-embed-text-v1.5 (Q8) | 138 MB | ~60 MB | ~2 MB |
| bge-reranker-v2-m3 (Q8) | 636 MB | ~200 MB | ~5 MB |

Model weights are loaded via mmap(), so the OS shares physical pages across all workers through the page cache. The per-worker private cost is mainly model metadata (vocab, tensor descriptors). For example, 10 php-fpm workers with a 24 MB model use ~24 MB shared + 10 x 8 MB private = ~104 MB total physical RAM, not 10 x 48 MB.

Long-Lived PHP Runtimes

The model cache is designed for php-fpm's per-request lifecycle. For long-lived PHP runtimes — where worker processes persist across requests — the cache is unnecessary. Load the model once at startup and reuse the same object across all requests. This avoids both model loading and context creation overhead, giving the best possible performance.

This applies to any long-lived PHP framework or library:

  • OpenSwoole — async event-driven server
  • ReactPHP — event loop for non-blocking I/O
  • AMPHP — async PHP framework
  • Workerman — high-performance PHP socket server
  • RoadRunner — Go-powered PHP application server
  • FrankenPHP — modern PHP app server (worker mode)
  • Any long-running CLI script or daemon

OpenSwoole example:

$server->on('workerStart', function ($server, $workerId) {
    $server->emb = new MemVectorEmbedding('/models/all-MiniLM-L6-v2.Q8_0.gguf');
    $server->store = new MemVectorStore('/data/vectors', [
        'dimensions' => $server->emb->dimensions(),
    ]);
});
$server->on('request', function ($request, $response) use ($server) {
    $vec = $server->emb->embed($request->post['text']);
    $results = $server->store->search($vec, 5);
    $response->end(json_encode($results));
});

ReactPHP example:

$emb = new MemVectorEmbedding('/models/all-MiniLM-L6-v2.Q8_0.gguf');
$store = new MemVectorStore('/data/vectors', ['dimensions' => $emb->dimensions()]);

$http = new React\Http\HttpServer(function (Psr\Http\Message\ServerRequestInterface $request) use ($emb, $store) {
    $body = json_decode((string) $request->getBody(), true);
    $vec = $emb->embed($body['text']);
    $results = $store->search($vec, 5);
    return React\Http\Message\Response::json($results);
});

$http->listen(new React\Socket\SocketServer('0.0.0.0:8080'));

In all cases, the MemVectorEmbedding, MemVectorReranker, and MemVectorStore objects stay in memory for the lifetime of the worker — zero overhead per request beyond the actual embed/search computation.