
Local AI Capture: Ollama, Open WebUI, and llama.cpp


We have seen examples such as Red Hat placing RHEL sources behind customer portals and contracts, and Canonical combining GPL code with contributor license agreements, trademark conditions, and a transition from GNU coreutils to MIT-licensed uutils. In these cases, the starting point is an open license and an accompanying narrative about shared ownership (GPL in the Linux ecosystem, MIT/BSD and “self-hosted” for many local AI offerings). Over time, vendors can introduce leverage points at subscription gates, binary distribution channels, graphical interfaces, or hosted tiers. Formal rights remain available, but contracts, branding policies, and custom “open” licenses can make redistribution, forking, or white-labeling significantly more difficult in practice.

Local inference in a narrow sense means “run a model on your own hardware so no one else gets a copy of your data”. In practice, the party providing the interface to local inference can package that capability as a product and charge for it, and Ollama, Open WebUI, and llama.cpp illustrate different approaches to this space. Ollama offers a one-command install, a bundled model catalog, an always-running localhost API on port 11434, and a desktop application for macOS and Windows, and the company markets this as private, local, and user-controlled. 1 2 ( GitHub ) Open WebUI distributes an “extensible, self-hosted AI interface” that wraps Ollama and other backends, presents itself as an offline-first control panel, and now ships under a custom license that requires retention of Open WebUI branding and reserves full white-label deployment for paying customers. 3 4 ( Open WebUI ) llama.cpp describes itself in different terms, stating its “main goal” is to enable large language model inference with minimal setup and high performance across a wide range of hardware, locally and in the cloud, implemented in C/C++ on top of a lightweight tensor library, with aggressive quantization options and support for diverse backends including consumer GPUs and Apple Silicon. 5 6 ( GitHub )

These projects compete in a similar functional space but adopt different organizational and licensing models. Ollama operates as a venture-backed product with published pricing, a polished user interface, and an optional remote execution tier; Open WebUI functions as a web console with a custom, branding-preserving license; and llama.cpp functions as an MIT-licensed runtime that users typically compile, tune, and embed. In this ecosystem, Ollama is promoted as a default on-ramp for “just run an LLM locally,” Open WebUI as a browser-based interface for that stack, and llama.cpp as a baseline engine providing the underlying inference capabilities. 1 2 3 4 5 ( GitHub 10 )

Ollama’s stated promises vs behavior

Ollama claims to be a local inference platform where you download models, run them on your own machine, and talk to them through a simple REST API, which is marketed as privacy-preserving and developer-friendly. 1 2 ( GitHub ) However, Ollama now also offers a hosted tier, priced at around $20 per month, in which inference runs on Ollama’s datacenter hardware instead of the local machine, and it advertises this as faster, able to load larger models, and still “privacy first”, while presenting the hosted option as a way to obtain higher usage and throughput. 2 ( Ollama )
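Concretely, the localhost API described above can be exercised with nothing but the standard library. The `/api/generate` route and default port 11434 are documented by Ollama; the model name `llama3` below is an illustrative placeholder for whatever model you have pulled.

```python
import json
import urllib.request

# Default address of the always-running Ollama daemon.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Construct a non-streaming generate request for the local daemon."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object back instead of a token stream
    }).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )

req = build_request("llama3", "Summarize MIT license obligations in one line.")
body = json.loads(req.data)
print(body["model"], body["stream"])  # → llama3 False

# Actually sending it requires the daemon to be running:
#   with urllib.request.urlopen(req) as resp:
#       print(json.loads(resp.read())["response"])
```

The same request shape is what scripts and third-party tools use when they target Ollama, which is why the daemon's network exposure matters later in this piece.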

Ollama frames itself as “open source” and points to an MIT-licensed CLI/server on GitHub, written in Go, that exposes a localhost HTTP API for chat, embeddings, and model management, and that can be scripted or called from other tools. 1 ( GitHub ) Ollama also ships and promotes a polished desktop application with a graphical chat interface, model selector, drag-and-drop file upload, and adjustable context-length settings. That application, however, is not clearly licensed under the same MIT terms: contributors have filed GitHub issue 11634 stating that the desktop app has “no clear / non-existent license” and asking Ollama either to publish it under the repository license or to include an explicit EULA in the installer, arguing that the GitHub link placed next to the download button may lead users to assume the same openness as the CLI. 7 ( GitHub )

Ollama markets itself as a local tool you can trust offline. 1 2 ( GitHub ) However, Issue 11632 reports that the official desktop app on macOS refused to respond when the machine was offline, even with the model fully downloaded, while the CLI continued operating normally against local weights. In practice the GUI depended on network access, creating a discrepancy between the offline-first marketing and the behavior observed in that issue report. 8 ( GitHub ) UPDATE: This was resolved in September 2025.

It also presents itself as a community project and leans on the language of open tooling. 1 ( GitHub ) Hacker News discussions around the “Turbo” launch describe Ollama as a YC-style venture-backed company with a small internal team that controls nearly all core development. This implies a corporate governance structure in which relicensing or introduction of additional proprietary components is possible, and community commenters highlight this as a risk area given the common preference of investors for defensible intellectual property and recurring revenue. 9 ( Hacker News )

Taken together, these observations highlight differences between Ollama’s local-control positioning and several design and business choices: a networked daemon that contacts a remote registry and exposes a paid remote execution tier, a desktop application whose licensing is less clearly documented than the CLI, marketing that highlights the GUI, and a governance model centered on a corporate team while the project is described in open-tooling terms. 1 2 9 7 8 ( GitHub )

llama.cpp’s scope and incentives

llama.cpp is blunt about scope: it calls itself an inference runtime designed to make large language models run efficiently on a wide range of hardware, and it targets Apple Silicon, consumer GPUs, and CPU+GPU hybrids, while supporting multiple quantization schemes down to very low bit rates so that models that normally demand high-end GPUs can still load and generate on constrained systems. 5 6 ( GitHub ) The project builds in C/C++ on top of GGML, exposes direct flags for memory usage and offload configuration, and explicitly prioritizes the ability to fit a given model on a given machine, which is reflected in guides showing llama.cpp running quantized models on low-end hardware, including devices without large VRAM budgets. 5 6 ( GitHub )

llama.cpp also gives operators direct control over decoding. The server mode and bindings expose logit biasing, grammar-constrained sampling, token-level penalties, streaming, seed control, batch control, and continuous batching for multi-user concurrency, and these capabilities are documented in the server README and the Python bindings. These controls allow operators to enforce behavior at decode time, for example by suppressing specific phrases, requiring structured JSON output, or constraining output to a grammar, without retraining the model. 10 11 ( GitHub )
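As a sketch of what those decode-time controls look like on the wire, the following builds a request body for the llama.cpp server’s `/completion` route. The field names follow the server README; the token id and the GBNF grammar here are illustrative placeholders.

```python
import json

# Illustrative GBNF grammar: constrain the model to a literal yes/no answer.
yes_no_grammar = r'root ::= "yes" | "no"'

# Hedged sketch of a /completion request body; knobs as documented in the
# llama.cpp server README. Token id 12345 is a placeholder.
request_body = {
    "prompt": "Is the cache warm? Answer yes or no.",
    "n_predict": 4,                   # cap the number of generated tokens
    "temperature": 0.0,               # greedy decoding
    "seed": 42,                       # fixed seed for reproducibility
    "logit_bias": [[12345, -100.0]],  # effectively ban one token id
    "grammar": yes_no_grammar,        # constrain output to the grammar
}

wire = json.dumps(request_body)
print(json.loads(wire)["seed"])  # → 42
```

Because every one of these knobs lives in the request, a single loaded model can serve multiple output policies simply by varying the body per caller, which is the point made in the control-surface section below.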

The incentive structure around llama.cpp can be inferred from public maintainer statements and repository design. The maintainer, Georgi Gerganov, states repeatedly that the goal is to keep the code free to use, modify, and redistribute under MIT, with no intention to add restrictions, and the project’s governance sits in that repo, not in a separate corporate shell that can relicense for revenue. 5 ( GitHub ) This model tends to align the project’s success with broad technical adoption as a portable, controllable inference runtime rather than with subscription revenue from proprietary user interfaces.

Attribution and dependency visibility

Ollama’s codebase embeds llama.cpp and related MIT-licensed components as statically linked dependencies, which is permitted under MIT as long as the required copyright notices ship in source and binary form. Users opened Issue #3185 in March 2024 stating that they could not find Georgi Gerganov’s notice in the packaged Ollama binaries on Linux or Windows, calling this a straightforward license violation. 12 ( GitHub ) Ollama did not immediately fix this, and the issue stayed active and visible for months. Some observers interpret this as supporting a broader claim that Ollama prefers to present itself as an independent engine rather than an integration layer on top of llama.cpp, because that framing strengthens Ollama’s positioning as a standalone product and emphasizes its paid tiers and graphical client as distinct offerings. 12 9 7 ( GitHub )

Hacker News commenters describe Ollama as VC-funded and raise governance concerns about the small internal team’s control, warning that the same small internal team can change terms, relicense, or close features at any time, and stating that this is “not proper governance.” 9 ( Hacker News ) For operators, this concern has operational implications: if Ollama’s differentiation depends in part on minimizing the visible role of llama.cpp, then Ollama has an understandable incentive to invest in additional proprietary layers (such as GUI and hosted inference) and to manage attribution carefully. Public tickets requesting clearer GUI licensing and more explicit attribution are consistent with that reading, even though alternative explanations such as prioritization or oversight are also possible. 7 ( GitHub )

llama.cpp is positioned simply as “the runtime,” and that claim is supported by the codebase, performance targets, and direct hardware support, so its value proposition does not depend on changing how upstream components are presented. 5 6 ( GitHub )

Security cost and blast radius

Running a local model is often described as private and safe, and Ollama’s marketing emphasizes these properties. At the same time, Ollama is a long-running network service that listens on localhost:11434 and can be exposed on a LAN or the public internet, and that service has carried a high-severity remote code execution vulnerability in past versions. 1 13 ( GitHub ) Wiz Research disclosed CVE-2024-37032 (“Probllama”) in June 2024, describing it as an easy-to-exploit remote code execution bug in Ollama that allowed an attacker to get code execution through a malicious model registry interaction, and the NVD entry states that all Ollama versions before 0.1.34 mishandled model digest validation, enabling path traversal and arbitrary file writes. 13 7 ( wiz.io ) Follow-up writeups describe this as “high severity,” because a remote unauthenticated attacker could compromise the host, steal data, or install ransomware, which turns a supposedly local inference node into an entry point. 13 7 ( wiz.io )
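To make the failure mode concrete, here is a generic validator of the kind the Probllama writeups describe as missing: treating a model digest as an opaque identifier rather than as something that can influence a filesystem path. This is an illustrative sketch, not Ollama’s actual patch.

```python
import re

# Canonical form: "sha256:" followed by exactly 64 lowercase hex characters.
# Anything else (path separators, "..", URL-encoded traversal) is rejected
# before the digest ever touches path construction.
DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")

def safe_digest(digest: str) -> bool:
    """Accept only canonical sha256 digests; reject traversal payloads."""
    return bool(DIGEST_RE.fullmatch(digest))

assert safe_digest("sha256:" + "a" * 64)
assert not safe_digest("sha256:../../../../etc/ld.so.preload")
assert not safe_digest("sha256:" + "A" * 64)  # enforce one canonical casing
```

The NVD description of CVE-2024-37032 (“mishandled model digest validation, enabling path traversal and arbitrary file writes”) is precisely the gap a whitelist check like this closes: the server never derives a path from attacker-controlled registry data without first forcing it through a strict canonical form.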

The Ollama desktop app adds another exposure vector: it runs on top of that daemon and, according to user reports, refused to answer offline in some builds, which suggests network dependencies and background calls beyond what is strictly required for local inference. In those versions, the marketed “private, offline AI on your machine” experience depended on successful online interactions for the GUI to function. 8 ( GitHub ) This contrasts with a deployment model where the inference loop runs as a function in an application process and does not accept inbound connections, a common pattern when llama.cpp is embedded directly. llama.cpp ships as a C/C++ runtime that can be called in-process, with an optional HTTP server mode that exposes documented routes and concurrency controls, and it does not require a background daemon that contacts an external registry. 5 10 ( GitHub )

This yields a clear trade-off between convenience and security posture. Ollama provides a fast install, an HTTP API, and a GUI, but it also operates as a long-running daemon that was affected by an unauthenticated RCE in earlier releases (patched in 0.1.34 and later) and is paired with a GUI client that has previously exhibited offline degradation in user reports, while the project simultaneously develops a paid cloud tier. 2 9 13 7 8 ( Ollama ) llama.cpp gives a bare runtime, direct control of sampling, and documented server behavior, and expects you to build the rest yourself, which means more integration work up front, but a smaller attack surface and no inherent network requirement. 5 6 10 ( GitHub )

Control surface and determinism

For operators who care about output control, llama.cpp exposes token-level levers that let you enforce behavior at generation time, including logit biasing to suppress or force specific tokens, grammar-based constrained sampling for structured JSON or tool arguments, seed control for reproducibility, and continuous batching for predictable concurrency. 10 11 ( GitHub ) These controls enable policy enforcement at generation time: operators can remove specific phrases, require strict JSON, or enforce stylistic or structural constraints without modifying model weights, allowing a single model to support multiple output policies by varying decode parameters.

Ollama gives you a higher level API that wraps generation, streaming, and model management into a single daemon, which lowers entry cost for app teams, but it also means Ollama controls the sampler defaults, context handling, and concurrency model, and those defaults can change across releases. From a downstream perspective, such changes can affect reproducibility and may require additional validation when upgrading. 1 2 9 7 ( GitHub ) llama.cpp, by contrast, exposes a low-level completion loop with access to raw logits and decoding controls, which enables deterministic routing and enforcement for policy-sensitive output when decode parameters and seeds are fixed.

Benchmarking the runtimes

At this point the comparison becomes primarily technical. Either Ollama’s newer backend delivers superior throughput, startup latency, memory footprint, and concurrency to llama.cpp on the same weights, or it does not. Ollama supporters on Hacker News already claim Ollama is no longer “just a wrapper,” asserting that newer models run on Ollama’s own backend and that llama.cpp is only used for legacy paths, which is convenient messaging for a company now selling a $20/month datacenter tier and a proprietary desktop client. 2 9 7 ( Ollama ) llama.cpp maintainers and users reject that framing, pointing out that Ollama still tracks llama.cpp features and performance work while under-crediting it, and that Ollama’s claims of independence mainly exist to justify enclosure and monetization rather than to prove a real technical fork that outperforms llama.cpp head to head. 12 9 ( GitHub )

We can settle this without guesswork, and we should, because otherwise we risk building on top of a daemon whose incentives do not match ours. The benchmark spec is simple.

– Use the same model weights and the same quantization level in both runtimes, converting formats if needed so we are testing inference engines and not model differences.
– Fix sampler parameters, temperature, top-k/top-p, and seed, then generate from the same prompt.
– Measure cold start to first token, which captures model load time and graph init.
– Measure steady-state tokens per second during the streaming response once the model is hot.
– Record resident RAM and VRAM usage at steady state under that run.
– Push near the context ceiling, observe slowdown or eviction behavior, and log crashes or stalls.
– Fire concurrent requests and measure degradation, not just single-stream latency.
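The timing steps in this spec reduce to a small measurement core. This sketch is runtime-agnostic: it assumes you supply a token stream from whichever client you are benchmarking, and the `fake_stream` generator below stands in for a real one.

```python
import time
from typing import Iterable, Tuple

def measure_stream(tokens: Iterable[str]) -> Tuple[float, float]:
    """Return (time to first token in seconds, steady-state tokens/sec)."""
    start = time.perf_counter()
    first_token_latency = None
    count = 0
    for _ in tokens:
        now = time.perf_counter()
        if first_token_latency is None:
            first_token_latency = now - start  # cold start + first decode
        count += 1
    total = time.perf_counter() - start
    steady = total - (first_token_latency or 0.0)
    # Steady-state rate excludes the first token's load/init cost.
    tps = (count - 1) / steady if steady > 0 and count > 1 else 0.0
    return first_token_latency or 0.0, tps

# Smoke test with a fake stream yielding 5 tokens roughly 10 ms apart.
def fake_stream():
    for _ in range(5):
        time.sleep(0.01)
        yield "tok"

ttft, tps = measure_stream(fake_stream())
print(f"ttft={ttft:.3f}s tokens/sec={tps:.0f}")
```

RAM/VRAM sampling, context-ceiling probing, and concurrency load would wrap around this core; the essential discipline is that both runtimes are fed through the identical measurement path.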

Any engine that fails these checks or cannot reproduce the same output under fixed seeds and fixed sampler parameters will be excluded from production use for policy-bound generation or routing, because it does not meet our requirements for stable, controlled output.

Requirement

Before adoption, we will run the above benchmark on Ollama and llama.cpp using identical model weights and aligned sampler parameters, and we will collect runtime metrics for comparison. Any runtime will be excluded if it fails to meet reproducible performance requirements, if it requires dependence on a proprietary desktop client or an always-on daemon with a documented history of unauthenticated RCE in earlier releases, or if its primary hosted scaling option is a $20/month remote execution tier. 2 9 13 7 8 ( Ollama )

For interface layers that sit on top of those runtimes, including Open WebUI, we will treat license terms and governance as part of the threat model: no custom “open” license that bans meaningful rebranding, no requirement to keep upstream logos in front of end users while reserving white-labeling for an enterprise plan, no CLA that hands unilateral relicensing power to a single company, and no default path that quietly steers a supposedly local UI toward hosted tiers. Components that fail those checks are treated as proprietary vendor integrations rather than as core infrastructure for control, privacy, or long-term operability. 3 4 14 ( Open WebUI )


  1. https://github.com/ollama/ollama “Ollama GitHub README, local model serving, localhost:11434 API, Go implementation”

  2. https://ollama.com/cloud “Ollama Cloud / Turbo marketing page, $20/mo preview tier, datacenter hardware, privacy-first messaging”

  3. https://github.com/open-webui/open-webui “Open WebUI GitHub README, self-hosted interface, Ollama/OpenAI integration, offline positioning”

  4. https://openwebui.com/license “Open WebUI License page, BSD-3-derived license with branding preservation and enterprise exceptions”

  5. https://github.com/ggml-org/llama.cpp “llama.cpp GitHub README, project goals and scope”

  6. https://rocm.blogs.amd.com/ecosystems-and-partners/llama-cpp/README.html “AMD ROCm partner writeup on llama.cpp hardware targets, quantization, CPU/GPU portability”

  7. https://nvd.nist.gov/vuln/detail/CVE-2024-37032 “NVD entry for CVE-2024-37032, Ollama <0.1.34 path traversal leading to RCE”

  8. https://github.com/ollama/ollama/issues/11632 “Ollama Issue #11632: official desktop app fails offline while CLI works offline”

  9. https://news.ycombinator.com/item?id=44802414 “Hacker News: Ollama Turbo discussion, governance warnings about VC-funded control and relicensing risk”

  10. https://github.com/ggerganov/llama.cpp/tree/master/examples/server “llama.cpp server README, HTTP server, batching, grammar-constrained decoding, logit bias”

  11. https://llama-cpp-python.readthedocs.io/en/latest/api-reference/ “llama-cpp-python API reference: grammar, logit_bias, sampling controls, deterministic seeds”

  12. https://github.com/ollama/ollama/issues/3185 “Ollama Issue #3185: missing MIT license notices for llama.cpp and other dependencies”

  13. https://www.wiz.io/blog/probllama-ollama-vulnerability-cve-2024-37032 “Wiz Research: CVE-2024-37032 (Probllama) remote code execution in Ollama”

  14. https://isitreallyfoss.com/projects/open-webui/ “Is It Really FOSS? notes on Open WebUI’s license history, custom terms, and branding limits”