LAUNCH ETA: 2026 May

Embedding Models on Affordable Cloud VMs and Apple Silicon

13 min read

Embedding model inference converts input into fixed-size vectors for similarity and search, while LLM inference generates or scores text token by token. Embedding inference is much cheaper because it needs a single forward pass to produce a vector, whereas LLM inference reruns the model for each output token. In a vector database pipeline, for example, embeddings are both the ingestion primitive (documents, chunks, updates) and the query primitive (search, rerank candidates, analytics). The operational question is therefore which hardware and runtime sustain predictable embedding throughput under latency constraints.

This post reports on a benchmark that compares CPU-only embedding generation on small cloud VMs (DigitalOcean) with Apple Silicon, using the standard OpenAI-compatible /v1/embeddings surface. Our focus is deliberately practical: end-to-end request latency, throughput (requests/sec), and stability under concurrency and batching, measured at the API boundary rather than via token-level microbenchmarks.
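For readers who have not used this surface, the following is a minimal sketch of the request and response shapes that an OpenAI-compatible /v1/embeddings endpoint exchanges. The function names, the base URL, and the timeout value are illustrative assumptions, not part of the benchmark harness.

```python
import json
import urllib.request


def build_embeddings_request(base_url, model, inputs):
    """Build (but do not send) a POST request for an OpenAI-compatible embeddings endpoint."""
    body = json.dumps({"model": model, "input": inputs}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings_response(raw_bytes):
    """Return the embedding vectors in input order from a JSON response body."""
    payload = json.loads(raw_bytes)
    rows = sorted(payload["data"], key=lambda row: row["index"])
    return [row["embedding"] for row in rows]


def embed(base_url, model, inputs, timeout_s=120.0):
    """Send one request and return the vectors; raises on HTTP or network errors."""
    request = build_embeddings_request(base_url, model, inputs)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return parse_embeddings_response(response.read())
```

A call such as `embed("http://localhost:8080", "bge-m3", ["hello world"])` would return one vector per input string, assuming a server is listening at that (hypothetical) address.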

Apple Silicon, and in particular our M4 test machine, operates in a different performance class than 1–2 vCPU cloud instances for this workload. Separately, runtime overhead (Ollama vs a minimal llama.cpp server) can dominate outcomes on constrained VMs, sometimes exceeding the performance differences between CPU tiers.

What exactly is being benchmarked?

Embedding generation is a forward pass through a transformer encoder: tokenization, repeated attention/MLP blocks, and a pooling/projection head to a fixed-size vector. There is no autoregressive decode loop. The computation is therefore dominated by dense linear algebra (GEMM-heavy), vectorized kernels (SIMD-sensitive), cache behavior, and memory bandwidth. Put differently: this workload is dominated by CPU quality and runtime efficiency rather than bound by IO (disk/network).
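To make the final "pooling to a fixed-size vector" step concrete, here is a small sketch in plain Python. It assumes masked mean pooling followed by L2 normalization, which is one common choice; the models benchmarked here may pool differently (for example, from a single special token), so treat this as an illustration of the idea rather than a description of either model.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Three token positions with 2-dimensional hidden states; the last is padding.
tokens = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)     # [2.0, 2.0]
embedding = l2_normalize(pooled)     # unit-length vector
```

Whatever the input length, the output has the same dimensionality, which is why the cost sits in the encoder blocks rather than in this last step.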

For each target environment (Ollama or llama.cpp on a particular machine type), our setup provisions a low-cost cloud VM, loads a specific embedding model, sends a standardized suite of embedding requests with varied input sizes and batch shapes, and measures end-to-end timing.

The hardware tier question on DigitalOcean

Embedding inference is strongly CPU-bound. DigitalOcean documents that Regular droplets come from a mixed/older CPU pool, while Premium droplets (the -amd / -intel SKUs) are guaranteed to run on one of their latest two CPU generations and use NVMe SSDs.1 For CPU-bound inference, “Premium vs Regular” is the factor that sparked our curiosity and motivated this evaluation.

To state the result upfront: in our measurements, Premium AMD consistently beat Premium Intel, and, unsurprisingly, both beat Regular. The gap is large enough that the small savings from a cheaper tier can be the wrong tradeoff for embeddings, because we pay them back in lost throughput, worse tail latency, and reduced operational headroom.

The model question: bge-m3 vs the E5 family

The benchmark data below covers bge-m3 and several multilingual E5 variants. Even when we hold hardware constant, these models have different engineering profiles; it is useful to understand why.

bge-m3 is explicitly designed for multi-mode retrieval: a single checkpoint can produce dense embeddings, sparse lexical weights, and multi-vector (ColBERT-style) representations, which is a direct enabler for hybrid retrieval pipelines without stitching together separate models or a separate BM25-like component.2 3 Practically, it also supports long inputs (up to 8192 tokens) and emits 1024-dimensional vectors.2 The combination of longer context, larger vector dimensionality, and larger parameter count tends to increase compute and memory demands versus smaller dense-only encoders.

By contrast, multilingual-e5-small is a compact, widely used dense embedding model. It emits 384-dimensional vectors and is commonly treated as a good multilingual default when we want speed and small vectors.4 Its usage also comes with a gotcha that becomes operationally important: semantic search quality depends on prefix discipline (query: vs passage:), and dropping the prefixes can noticeably degrade retrieval behavior.5 Its maximum input length is typically treated as 512 tokens in hosted documentation.6
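The prefix discipline is easy to enforce in code. The helper below is a minimal sketch; the function name and the strictness of the role check are our own choices, while the `query: ` and `passage: ` prefixes are the ones described above.

```python
def with_e5_prefix(text, role):
    """Prepend the E5 role prefix; role must be 'query' or 'passage'."""
    if role not in ("query", "passage"):
        raise ValueError(f"unknown role: {role!r}")
    return f"{role}: {text}"


def prepare_batch(texts, role):
    """Apply the same role prefix to every text in a batch."""
    return [with_e5_prefix(text, role) for text in texts]


queries = prepare_batch(["how fast is embedding inference?"], "query")
passages = prepare_batch(["Embedding inference is a single forward pass."], "passage")
```

Centralizing the prefix in one helper keeps ingestion and query paths from drifting apart, which is where missing prefixes tend to creep in.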

A useful mental model is that bge-m3 buys retrieval flexibility (hybrid signals, longer context, multi-vector options) at higher inference cost, while small E5 variants buy cheap dense vectors at the cost of truncation and reduced retrieval expressivity. What the benchmark shows, however, is that on small cloud CPUs the runtime and CPU class can swamp these model-level considerations.

Methodology and Workload

All measurements here are from API-level embedding requests against /v1/embeddings. The harness runs multiple cases per environment (short, medium, long, and batched/mixed patterns), performs warmups, then executes timed runs.
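The warmup-then-timed-runs pattern can be sketched as follows. This is not the harness itself: `send` is a hypothetical callable that performs one embedding request, the warmup count is an assumption, and the nearest-rank percentile is one of several reasonable definitions.

```python
import math
import time


def percentile(sorted_values, fraction):
    """Nearest-rank percentile on an already sorted, non-empty list."""
    rank = max(1, math.ceil(fraction * len(sorted_values)))
    return sorted_values[rank - 1]


def run_case(send, payload, warmups=2, runs=10):
    """Time `runs` sequential calls to send(payload) after `warmups` untimed calls."""
    for _ in range(warmups):
        send(payload)
    latencies_ms = []
    started = time.perf_counter()
    for _ in range(runs):
        t0 = time.perf_counter()
        send(payload)
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    elapsed_s = time.perf_counter() - started
    ordered = sorted(latencies_ms)
    return {
        "req_per_s": runs / elapsed_s,
        "p50_ms": percentile(ordered, 0.50),
        "p95_ms": percentile(ordered, 0.95),
        "avg_ms": sum(ordered) / len(ordered),
    }
```

Measuring around the whole `send` call is what keeps the numbers at the API boundary: tokenization, server queuing, and response marshaling are all included in each latency sample.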

From the benchmark output:

  • runs per case: 10
  • total requests per benchmark: 50
  • concurrency: varies (1 or 2 in the data)

Raw results table

To avoid cherry-picking, the table below is constructed directly from the benchmark's raw-summary rows. It includes both Apple Silicon and DigitalOcean environments and shows the runtime, model, concurrency, throughput, and latency distribution (p50, p95, and average).

Table 1 — End-to-end embedding benchmark results (raw-summary rows)

| Environment | CPU tier | Model | Runtime | Concurrency | req/s | p50 (ms) | p95 (ms) | Avg (ms) | Notes |
|---|---|---|---|---|---|---|---|---|---|
| MacBook Air | Apple Silicon (Mac14,2) | bge-m3 | Ollama | 1 | 0.92 | 300.36 | 3942.82 | 1088.76 | Laptop baseline showing reasonable single-request latency but limited throughput |
| Mac mini | Apple Silicon (Mac16,11) | bge-m3 | Ollama | 1 | 5.86 | 159.48 | 294.52 | 168.53 | High-performance Apple Silicon system delivering the best overall latency and throughput |
| DO Regular | DO-Regular | bge-m3 | Ollama | 1 | 0.05 | 2458.27 | 75408.70 | 19708.01 | Regular droplet tier shows extremely poor performance and extreme tail latency under Ollama |
| DO Regular | DO-Regular | bge-m3 | llama.cpp | 1 | 0.31 | 1740.34 | 12875.07 | 4053.66 | Switching to llama.cpp dramatically improves throughput but remains slow on shared CPU hardware |
| DO Regular | DO-Regular | multilingual-e5-large-instruct (Q8_0 via Ollama) | Ollama | 1 | 0.13 | 2621.14 | 17975.11 | 7786.20 | Smaller encoder reduces compute pressure but performance remains constrained by the droplet tier |
| DO Regular | DO-Regular | multilingual-e5-large | llama.cpp | 1 | 0.29 | 1768.52 | 13527.87 | 4318.93 | Performance similar to bge-m3 when using the more efficient runtime |
| DO Premium AMD | DO-Premium-AMD | bge-m3 | Ollama | 1 | 0.08 | 1646.30 | 43418.29 | 11950.67 | Premium tier improves latency somewhat but runtime overhead still dominates |
| DO Premium AMD | DO-Premium-AMD | bge-m3 | llama.cpp | 1 | 0.62 | 799.24 | 6266.47 | 2011.72 | Best single-vCPU droplet result; runtime efficiency is the dominant improvement |
| DO Premium AMD | DO-Premium-AMD | multilingual-e5-large-instruct (Q8_0 via Ollama) | Ollama | 1 | 0.28 | 1143.85 | 7801.89 | 3516.50 | Smaller encoder substantially improves Ollama performance on the same hardware |
| DO Premium AMD | DO-Premium-AMD | multilingual-e5-large | llama.cpp | 1 | 0.61 | 837.96 | 6190.71 | 2041.89 | Throughput nearly identical to bge-m3 under llama.cpp |
| DO Premium Intel | DO-Premium-Intel | bge-m3 | Ollama | 1 | 0.07 | 1790.81 | 51977.95 | 13903.33 | Premium Intel tier trails AMD and exhibits very large tail latency |
| DO Premium Intel | DO-Premium-Intel | bge-m3 | llama.cpp | 1 | 0.45 | 1095.19 | 8718.96 | 2796.43 | Faster than Ollama but still meaningfully slower than Premium AMD |
| DO Premium Intel | DO-Premium-Intel | multilingual-e5-large-instruct (Q8_0 via Ollama) | Ollama | 1 | 0.20 | 1612.43 | 11329.01 | 5121.15 | Encoder size helps but does not close the gap with AMD hardware |
| DO Premium Intel | DO-Premium-Intel | multilingual-e5-large | llama.cpp | 1 | 0.44 | 1141.74 | 8745.03 | 2858.49 | Roughly equivalent to bge-m3 when using the same runtime |
| DO Premium AMD | DO-Premium-AMD | bge-m3 | Ollama | 1 | 0.10 | 1434.44 | 36078.67 | 9534.71 | Increasing CPU cores slightly improves latency but Ollama remains the bottleneck |
| DO Premium AMD | DO-Premium-AMD | bge-m3 | llama.cpp | 1 | 0.76 | 683.32 | 5176.37 | 1637.43 | Best overall cloud result observed in the benchmark dataset |
| DO Premium AMD | DO-Premium-AMD | multilingual-e5-large-instruct (Q8_0 via Ollama) | Ollama | 1 | 0.27 | 1353.11 | 8557.17 | 3728.66 | Encoder size again helps Ollama but still trails llama.cpp performance |
| DO Premium AMD | DO-Premium-AMD | multilingual-e5-large | llama.cpp | 1 | 0.68 | 832.73 | 5593.80 | 1835.51 | Similar performance characteristics to bge-m3 on the same instance |
| DO Premium AMD 2vCPU | DO-Premium-AMD | bge-m3 | Ollama | 2 | 0.11 | 10550.79 | 37281.87 | 18662.06 | Increasing concurrency causes large latency inflation with minimal throughput gain |
| DO Premium AMD 2vCPU | DO-Premium-AMD | bge-m3 | llama.cpp | 2 | 0.73 | 933.38 | 6122.43 | 3134.38 | Throughput nearly unchanged from concurrency 1, indicating CPU saturation |
| DO Premium AMD 2vCPU | DO-Premium-AMD | multilingual-e5-large-instruct (Q8_0 via Ollama) | Ollama | 2 | 0.28 | 8388.28 | 9325.05 | 7012.58 | Parallel requests significantly increase median latency without improving throughput |
| DO Premium AMD 2vCPU | DO-Premium-AMD | multilingual-e5-large | llama.cpp | 2 | 0.67 | 1044.10 | 7037.84 | 3405.81 | Similar throughput to concurrency 1 but with higher latency |
| DO Premium AMD 2vCPU | DO-Premium-AMD | bge-m3 | Ollama | 2 | 0.11 | 9997.59 | 35627.70 | 17746.53 | Large batch processing increases latency but does not improve throughput |
| DO Premium AMD 2vCPU | DO-Premium-AMD | bge-m3 | llama.cpp | 2 | 0.21 | 6265.27 | 19429.63 | 9630.11 | Excessive batching causes a severe collapse in throughput and latency stability |
| DO Premium AMD 2vCPU | DO-Premium-AMD | multilingual-e5-large-instruct (Q8_0 via Ollama) | Ollama | 2 | 0.28 | 8220.62 | 10873.18 | 7077.59 | Similar behavior to other Ollama runs under heavy batching |
| DO Premium AMD 2vCPU | DO-Premium-AMD | multilingual-e5-small | llama.cpp | 2 | 0.70 | 994.39 | 6669.16 | 3290.87 | Smaller embedding model maintains stable throughput under the same workload |

The headline comparisons are immediate:

  1. Apple Silicon is, unsurprisingly, in a different throughput class. The Mac mini run reports 5.86 req/s at p50 ≈ 159 ms for bge-m3, while the best cloud configuration in our dataset (Premium AMD 2vCPU + llama.cpp) reaches 0.76 req/s at p50 ≈ 683 ms. That is a throughput gap of roughly 7.7× and a median latency gap of roughly 4.3×.

  2. Regular droplets are not merely slower but also have pathological tails under Ollama. The bge-m3 + Ollama run on DO Regular is 0.05 req/s with p95 ≈ 75 seconds, which changes how we must design timeouts, retries, and queue backpressure.

  3. Runtime choice dominates on small CPUs. Across the droplets, switching from Ollama to llama.cpp raises bge-m3 throughput by roughly 6–8× and the E5-large variants by roughly 2.2–2.5×, and collapses p95 by large factors (though with the success-rate caveat recorded in our data).
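The runtime effect can be checked directly from Table 1. The short calculation below copies the bge-m3 req/s values at concurrency 1 for each droplet and divides llama.cpp throughput by Ollama throughput; the dictionary labels are ours, and the vCPU annotations follow the 1vCPU and 2vCPU figures quoted later in the post.

```python
# req/s for bge-m3 at concurrency 1, copied from Table 1: (Ollama, llama.cpp).
bge_m3_req_per_s = {
    "DO Regular": (0.05, 0.31),
    "DO Premium AMD (1 vCPU)": (0.08, 0.62),
    "DO Premium Intel": (0.07, 0.45),
    "DO Premium AMD (2 vCPU)": (0.10, 0.76),
}

# Ratio of llama.cpp throughput to Ollama throughput on the same droplet.
speedups = {
    env: llama_cpp / ollama
    for env, (ollama, llama_cpp) in bge_m3_req_per_s.items()
}

for env, ratio in speedups.items():
    print(f"{env}: {ratio:.1f}x")
```

Every ratio lands between roughly 6 and 8, which is larger than the throughput difference between any two droplet tiers under the same runtime.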

Interpreting the two big effects: CPU tier and runtime overhead

CPU tier: Premium AMD > Premium Intel > Regular

Across the 1vCPU cases, Premium AMD consistently outperforms Premium Intel for both bge-m3 and E5-large under llama.cpp, and both Premium tiers substantially outperform Regular. This aligns with DigitalOcean’s own plan description: Premium droplets are pinned to newer CPU generations, whereas Regular can land on older silicon.1 With embedding inference being dominated by vectorized math and memory behavior, per-core quality matters far more than “number of cores” at the low end.

The implication is that if we are cost-optimizing CPU-only embeddings on small DigitalOcean droplets, we should start from Premium AMD as the default and treat Regular as a last resort for lightly loaded dev/test.

Runtime overhead: llama.cpp vs Ollama

The raw data show a consistent pattern: on droplets, llama.cpp throughput is dramatically higher than Ollama for the same nominal workload, even when both are ultimately driving similar kernels underneath. The plausible technical causes include (a) more direct control over batching and thread scheduling, (b) reduced service-layer overhead, (c) differences in memory allocation and copy behavior around request handling, and (d) model packaging/serving decisions (e.g., how input is tokenized and how outputs are marshaled). The benchmark does not attempt to attribute the gap to a single internal mechanism; it records the operational reality that in constrained CPU environments, server overhead is not amortized away and can become the bottleneck.

Scaling: 2 vCPU is not 2×

The dataset includes a comparison between Premium AMD 1vCPU and 2vCPU under llama.cpp at concurrency 1:

  • 1vCPU: 0.62 req/s, p50 ~799 ms
  • 2vCPU: 0.76 req/s, p50 ~683 ms

That is a ~23% throughput uplift (with a ~15% lower median latency), not a doubling. This is expected when a workload is bounded by shared resources that do not scale linearly with vCPU count: memory bandwidth, cache contention, and hypervisor scheduling effects. Embedding inference in particular mixes compute-heavy layers with memory-traffic-heavy layers; adding a second vCPU helps, but quickly runs into the next shared constraint.

It would be a mistake to treat embeddings like an embarrassingly parallel web workload where throughput scales linearly with cores on the same machine. The benchmark data argues strongly against that assumption for small VMs with the same memory constraints.
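The arithmetic behind that comparison, using the two data points above, looks like this. "Scaling efficiency" here is simply the measured speedup divided by the ideal 2× for two vCPUs; that definition is our framing, not a number the harness reports.

```python
one_vcpu_rps = 0.62   # Premium AMD, 1 vCPU, llama.cpp, concurrency 1
two_vcpu_rps = 0.76   # Premium AMD, 2 vCPU, llama.cpp, concurrency 1

speedup = two_vcpu_rps / one_vcpu_rps   # about 1.23
efficiency = speedup / 2                # about 0.61 of ideal linear scaling
p50_reduction = 1 - 683 / 799           # about 0.15

print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}, p50 down {p50_reduction:.0%}")
```

In other words, the second vCPU delivers a bit under a quarter more throughput for double the core count.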

Concurrency and batching experiments

The dataset also includes three Premium AMD 2vCPU runs at concurrency 2 with different batching choices. The key results:

Concurrency=2, ubatch=512, batch=2048 (bge-m3, llama.cpp):

  • 0.73 req/s, p50 ~933 ms

Compared to concurrency=1 on the same class of machine:

  • 0.76 req/s, p50 ~683 ms

Throughput does not increase; latency worsens. This is consistent with CPU saturation: concurrency becomes a queue that inflates response times rather than a lever that increases utilization. Put plainly, the instance is already “busy enough” at concurrency 1.

A more extreme configuration (ubatch=2048, batch=2048) shows an even stronger cautionary tale:

Concurrency=2, ubatch=2048, batch=2048 (bge-m3, llama.cpp):

  • 0.21 req/s, p50 ~6265 ms

This is a collapse in throughput paired with a large latency inflation. On a 2GB machine, aggressive batching pushes the system into a regime where memory pressure and kernel inefficiencies dominate. We see that embedding inference is memory hungry. Even if the model fits, a batching strategy can move us from compute-bound to memory-bound to thrash-bound.
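For orientation, the batch and ubatch knobs correspond to command-line options on llama.cpp's server. The sketch below builds an argument list for launching it in embeddings mode; the binary name `llama-server`, the long flag spellings, the port, and the model path are assumptions that should be checked against the `--help` output of the installed build, and the defaults simply mirror the first configuration described above.

```python
def llama_server_args(model_path, batch=2048, ubatch=512, threads=2, port=8080):
    """Build an argv list for a llama.cpp server in embeddings mode.

    Flag names are assumed from recent llama.cpp builds; verify them locally.
    """
    return [
        "llama-server",
        "--model", model_path,
        "--embeddings",
        "--batch-size", str(batch),
        "--ubatch-size", str(ubatch),
        "--threads", str(threads),
        "--port", str(port),
    ]


# Hypothetical model file; the two configurations differ only in ubatch.
moderate = llama_server_args("models/bge-m3.gguf", batch=2048, ubatch=512)
aggressive = llama_server_args("models/bge-m3.gguf", batch=2048, ubatch=2048)
```

Keeping the launch arguments in one function makes it straightforward to sweep batch settings per machine and record which configuration each benchmark row came from.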

Interestingly, in that same run set, multilingual-e5-small under llama.cpp at concurrency 2 reports 0.70 req/s with p50 ~994 ms, i.e., it remains in the reasonable range where bge-m3 collapsed. That difference is consistent with model footprint and per-request compute, but the overarching point remains: on small machines, batching and concurrency knobs can easily make things worse, and the best settings are hardware- and model-dependent.

Apple Silicon is obviously far ahead

Apple’s high-performance cores deliver high single-thread and per-watt throughput, supported by strong memory bandwidth and a system design tuned for sustained vectorized compute. Embedding inference rewards exactly those properties. When we compare that against a small cloud VM where our vCPU is a slice of shared server silicon (often with lower sustained clocks, weaker per-core performance, and noisy neighbors), we should expect large gaps.

The most actionable interpretation is that a single high-performance local embedding worker can replace multiple tiny cloud instances for steady-state embedding throughput, especially when we include the runtime overhead costs and the tail latency stability costs.
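As a rough sizing exercise using only throughput figures from Table 1, we can ask how many droplets it would take to match the Mac mini's measured request rate. This ignores load-balancing overhead, tail latency, and the fact that the runs used different runtimes, so it is a back-of-the-envelope estimate rather than a capacity plan.

```python
import math

mac_mini_rps = 5.86       # Mac mini, bge-m3, Ollama
two_vcpu_rps = 0.76       # Premium AMD 2 vCPU, bge-m3, llama.cpp
one_vcpu_rps = 0.62       # Premium AMD 1 vCPU, bge-m3, llama.cpp

# Whole droplets needed to reach the Mac mini's throughput.
droplets_2vcpu = math.ceil(mac_mini_rps / two_vcpu_rps)   # 8
droplets_1vcpu = math.ceil(mac_mini_rps / one_vcpu_rps)   # 10

print(droplets_2vcpu, droplets_1vcpu)
```

Even with the faster runtime on the cloud side, matching one local worker takes on the order of eight to ten small instances.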

Practical implications for vector store operators

The benchmark data supports a few concrete engineering conclusions.

  1. 1vCPU droplets can be a precarious baseline for embeddings. Even when the median latency looks merely slow, the tails can be huge (especially with Ollama), and the system can saturate before we can scale it with concurrency.
  2. If we must run embeddings on DigitalOcean, Premium AMD is a rational default tier. The Regular tier is both slower and more prone to extreme tail behavior under the same serving stack.1
  3. On constrained VMs, llama.cpp is the efficiency play, not a marginal optimization. In our results it is frequently the difference between unusable and barely viable, and sometimes the difference between barely viable and operationally acceptable. The important caveat is that the llama.cpp runs here show a lower recorded success rate and therefore require investigation (timeouts, error modes, or harness thresholds) before one should declare victory.
  4. Embeddings capacity planning should be framed around saturation points, not around theoretical scaling. The concurrency experiments show a pattern typical of CPU-bound inference: once the machine is saturated, we can trade throughput for latency (or vice versa), but we do not get more total work done.
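Framing capacity around saturation points can be made mechanical. The helper below is a sketch that takes (concurrency, req/s) measurements and returns the lowest concurrency beyond which throughput stops improving by a meaningful margin; the function name and the 10% threshold are assumptions chosen for illustration.

```python
def saturation_point(measurements, min_gain=0.10):
    """Return the lowest concurrency after which throughput stops improving.

    measurements: list of (concurrency, req_per_s) pairs.
    min_gain: smallest relative throughput gain that still counts as scaling.
    """
    ordered = sorted(measurements)
    for (c_prev, rps_prev), (_, rps_next) in zip(ordered, ordered[1:]):
        if rps_next < rps_prev * (1 + min_gain):
            return c_prev
    # Throughput kept improving across the whole sweep.
    return ordered[-1][0]


# Premium AMD 2 vCPU, bge-m3, llama.cpp (Table 1): concurrency 1 and 2.
print(saturation_point([(1, 0.76), (2, 0.73)]))  # prints 1
```

For the measured droplet the saturation point is concurrency 1, which matches the observation that adding a second in-flight request only lengthens the queue.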

An architecture that matches the data

A design that fits these results separates the public-facing service responsibilities from the embedding compute responsibilities. The cloud side handles what clouds are good at: API ingress, auth, rate limits, queues, retries, and observability. The embedding worker side runs on hardware that is predictably good at dense math (Apple Silicon in this dataset), exposes an internal endpoint, and keeps the model resident in memory for sustained throughput. In short, the cloud calls home for embedding inference. If our workload is steady (ingestion pipelines, reindex jobs, continuous updates), this can be economically and operationally superior to running many tiny cloud instances. If our workload is bursty or geographically distributed, cloud still makes sense, but we should treat the minimum viable droplet as Premium AMD with multiple vCPUs and enough RAM to avoid batching collapse, and assume that runtime choice is a first-order decision.
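One way to picture the cloud-side half of that split is a bounded queue in front of the embedding worker, so that overload turns into an immediate rejection instead of an unbounded backlog. The sketch below is illustrative only: the class name, the job dictionary, and `embed_fn` (a stand-in for a call to the internal embedding endpoint) are all hypothetical.

```python
import queue
import threading


class EmbeddingDispatcher:
    """Cloud-side front: a bounded queue for backpressure, drained by one worker thread."""

    def __init__(self, embed_fn, max_pending=32):
        self._embed_fn = embed_fn
        self._jobs = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, texts):
        """Enqueue a job; raises queue.Full immediately when the backlog is at capacity."""
        job = {"texts": texts, "done": threading.Event(), "result": None, "error": None}
        self._jobs.put_nowait(job)
        return job

    def _drain(self):
        while True:
            job = self._jobs.get()
            try:
                job["result"] = self._embed_fn(job["texts"])
            except Exception as exc:
                job["error"] = exc
            finally:
                job["done"].set()


def wait_for(job, timeout_s):
    """Block until the job finishes, re-raising worker errors and enforcing a timeout."""
    if not job["done"].wait(timeout_s):
        raise TimeoutError("embedding job timed out")
    if job["error"] is not None:
        raise job["error"]
    return job["result"]
```

A caller would do something like `job = dispatcher.submit(["some text"])` followed by `wait_for(job, timeout_s=30)`, with the timeout chosen from the measured tail latency of the worker rather than from its median.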

Takeaways

The empirical story is:

  • embedding inference is a CPU- and memory-behavior problem;
  • cloud tiering (Premium vs Regular) is decisive for CPU-bound inference on DigitalOcean;
  • runtime overhead can dominate on small machines;
  • scaling cores and increasing concurrency are not free multipliers at low resource levels;
  • Apple Silicon delivers an order-of-magnitude class improvement in embedding throughput relative to small cloud VMs.

If our system’s critical path depends on embeddings, as most vector-store-backed systems do, then treating embeddings as a generic web workload is likely a poor design choice.


  1. DigitalOcean, “Choosing the Right CPU Droplet Plan”: https://docs.digitalocean.com/products/droplets/concepts/choosing-a-plan/

  2. Hugging Face model card, BAAI/bge-m3: https://huggingface.co/BAAI/bge-m3

  3. BGE documentation, “BGE-M3”: https://bge-model.com/bge/bge_m3.html

  4. Hugging Face model card, intfloat/multilingual-e5-small: https://huggingface.co/intfloat/multilingual-e5-small

  5. Elastic Search Labs, “Multilingual vector search: Elasticsearch with E5 embedding model”: https://www.elastic.co/search-labs/blog/multilingual-vector-search-e5-embedding-model

  6. Google Cloud Vertex AI documentation, “Multilingual E5 Small”: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/maas/e5/multilingual-e5-small