Local and Open Source AI Systems

Problem

Hosted AI services can expose sensitive material, create recurring dependency and hide implementation details. Local deployment offers control but is often described without distinguishing location, model openness, licensing and operational responsibility.

Introduction

Local AI means running a model on hardware controlled by the operator rather than sending every request to a third-party inference service. The hardware may be a laptop, workstation, studio server, edge device or private cloud. The model may be fully open source, open-weight under a restrictive license, or privately trained. These categories overlap but are not interchangeable.

The appeal is clear. Private documents can remain on local infrastructure. A creative tool can work offline. Costs can become predictable after hardware investment. Models can be versioned, adapted and integrated without depending on one hosted product interface. For archives and research studios, local execution can protect unreleased material and preserve a reproducible workflow.

Local AI is not automatically private, sustainable, secure or inexpensive. Applications may still send telemetry. Downloaded models may have unclear provenance. A local service can expose an unauthenticated network endpoint. Hardware and energy costs move from provider to operator. The decision should be based on requirements, not independence as an aesthetic.

Local, open-weight and open source

Local describes where inference happens. Open-weight describes access to learned parameters. Open source describes freedoms and access across the system.

The Open Source Initiative’s Open Source AI Definition states that an open-source AI system should grant freedoms to use, study, modify and share. It also requires access to a preferred form for modification, including sufficient data information, code and parameters. This is stricter than making a model file downloadable.

Many widely distributed models are open-weight. Users can run and sometimes adapt them, but training datasets may not be reproducible, training code may be incomplete, or the license may restrict uses. Open-weight releases still create substantial public value, but precise language matters. A studio evaluating long-term dependence should know which rights and materials are actually available.

The inference stack

A local language-model stack includes model architecture, weights, tokenizer, inference runtime, model format, hardware backend and application interface. Each layer affects compatibility.

The model architecture defines the computation. The weights contain learned parameters. The tokenizer maps text to model vocabulary. The runtime executes model operations. A file format packages parameters and metadata. Hardware backends accelerate computation on CPU, GPU or specialized processors. An application adds prompting, retrieval, tools, memory and user interface.

llama.cpp is a prominent open-source runtime for C and C++ inference. It supports portable execution and quantized model formats across a broad range of hardware. Its importance is not that every local system must use it, but that it demonstrates how inference can be separated from a hosted vendor and embedded into other tools.

Architecture

A local stack contains model weights, tokenizer, inference runtime, hardware backend, storage, application interface, optional retrieval index and security boundary. Each layer can be replaced or governed independently when contracts are explicit.

Quantization

Model weights are commonly represented with floating-point numbers. Quantization reduces the precision used to store and compute those values. A lower-precision model requires less memory and can run faster, making larger models practical on consumer hardware.

Quantization is a tradeoff. More aggressive compression can reduce quality, especially on sensitive tasks. The effect depends on architecture, quantization method, model size and workload. File-size labels alone do not predict performance.

The operator should evaluate a quantized model on representative tasks. A configuration adequate for summarizing studio notes may fail on code generation or multilingual extraction. Throughput, first-token latency, context length and memory usage should be measured together.

QLoRA showed that quantization can also support efficient model adaptation by combining a frozen quantized base model with trainable low-rank adapters. The broader lesson is that local experimentation does not always require retraining every parameter.

Hardware

Model inference is constrained primarily by memory capacity, memory bandwidth and compute. A model must fit in available memory together with runtime overhead and the key-value cache used for context. Longer contexts consume additional memory. Batch size and concurrent users increase requirements.

Consumer laptops with unified memory can run useful small and medium models. Dedicated GPUs can provide higher throughput but introduce power, cooling and driver concerns. CPU-only inference is portable and can be sufficient for background tasks, although latency may be higher.

Hardware choice should follow workload. An individual knowledge assistant has different needs from a shared transcription service or real-time installation. A model that produces one research draft per minute may be acceptable; an interactive audio tool may require much tighter latency.

Privacy and security

Keeping prompts local reduces one class of data exposure, but privacy depends on the complete system. Applications can write conversation logs, caches and embeddings to disk. Model files can be replaced with malicious packages. Plugins can access the network. A browser interface can expose the service to other devices.

A local deployment should define network binding, authentication, encryption, log retention and update procedure. Sensitive projects should use isolated environments and explicit data directories. Models and runtime binaries should come from verified sources when possible.

Prompt injection remains relevant. A local RAG system can retrieve a document containing instructions designed to manipulate tool use. Local execution changes infrastructure ownership; it does not remove adversarial content.

Model selection

The best local model is not necessarily the largest model that fits. Selection should consider task quality, license, language coverage, context behavior, tool support, structured output, hardware efficiency and update stability.

Benchmark rankings provide orientation but rarely match a particular archive or creative practice. Build a local evaluation set from real tasks: extract metadata from project notes, summarize technical sources, generate code against the repository style, classify audio annotations or answer questions from the Knowledge Hub. Include difficult and negative examples.

Smaller specialized models can outperform larger general models on bounded tasks when paired with retrieval and validation. They also reduce latency and operational cost.

Adaptation and fine-tuning

Prompting and retrieval should be tested before fine-tuning. Many failures attributed to model knowledge are actually context or interface failures. A clear schema, relevant examples and authoritative retrieval may be enough.

Fine-tuning is appropriate when a stable behavior, vocabulary or output form must be learned across many requests. Low-rank adaptation can reduce training cost. The training set still requires provenance, rights, quality control and separation between training and evaluation.

A model trained on private material becomes another sensitive artefact. Checkpoints, adapters and logs need retention policy. Fine-tuning can also degrade general capabilities or make undesired patterns more persistent.

Local RAG

Retrieval-augmented generation is one of the strongest local use cases. Documents remain on controlled storage, an embedding or lexical index selects relevant passages, and a local model synthesizes an answer. This can support research archives, private documentation and project memory.

The weakest implementation simply splits files into arbitrary chunks and sends nearest matches to the model. A stronger system preserves titles, authors, dates, entity IDs, access levels and section boundaries. It combines lexical and semantic retrieval and returns citations.

Electronic Artefacts already has a suitable corpus structure: canonical entities, sources, frontmatter, graph relations and generated pages. A local assistant could retrieve public and private projections according to identity. VASTE could enforce context boundaries, while the Knowledge Hub supplies source-rich records.

Creative applications

Local models can support private ideation, draft transformation, metadata extraction, code assistance, visual tagging and installation control. For ORETH, local inference could protect unreleased audio annotations or support offline machine-listening interfaces. For Palimpsests, it could help connect textual and sonic fragments without uploading the archive.

Creative use benefits from model replaceability. A workflow should preserve prompts, model identifier, quantization, adapters and source material so that outputs can be understood later. The generated artefact is only one part of the record.

Models should augment selection and experimentation rather than silently replacing authorship. The artist or editor needs controls for variation, rejection and revision.

Operations

Self-hosting creates maintenance work. Runtimes change, model formats evolve, security issues appear and hardware fails. An operator needs version pinning, health checks, backups for configuration, model inventories and rollback procedures.

Observability should capture request duration, token counts, errors and resource use without retaining sensitive content by default. Concurrency limits protect interactive performance. A queue may be better than allowing every process to load its own model copy.

Model updates should be treated like software releases. Evaluate the new version before replacing the old one. Preserve enough metadata to reproduce important outputs. A local system becomes durable through operations, not only through downloadable files.

Economics and sustainability

Hosted inference converts infrastructure into usage fees. Local inference converts it into hardware, energy and maintenance. The cheaper option depends on volume, latency, model size and staffing. Sporadic use may favor hosted services. Continuous predictable workloads may favor local execution.

Energy comparisons require workload-level measurement. A small local model can avoid network and data-center overhead, but an underused high-end workstation is not automatically efficient. Right-sizing matters.

Independence has value beyond direct cost. Local systems can continue when an API changes, a service is unavailable or a provider withdraws a model. That resilience may justify additional operational effort for archives and long-lived tools.

Governance and licensing

Before deployment, record the model source, version, license, allowed uses, training-data disclosures, runtime license and third-party dependencies. A model license may restrict commercial use or particular domains. Runtime code and model weights may use different licenses.

The system should also record what data it processes and whether outputs can be published. Local execution does not grant rights over source material or generated content.

For Electronic Artefacts, the distinction between proprietary runtime and open research is important. VASTE can remain proprietary while integrating open-source infrastructure and open-weight models under compatible terms. Transparency should state which layer is open and which is controlled.

When local AI is the right choice

Choose local inference when privacy, offline operation, latency, customization or independence outweigh operational cost. It is especially strong for stable internal workloads and installations that cannot depend on external connectivity.

Choose hosted inference when access to frontier capability, elastic scale or managed operations matters more than data locality. Hybrid systems can route sensitive or routine tasks locally and exceptional tasks to a hosted model under explicit policy.

The architecture should make that routing visible. Users need to know where their material is processed.

Implementation

Begin with a representative task set and hardware budget. Pin model, runtime and quantization versions, bind services to protected interfaces, measure latency and memory, then add retrieval or fine-tuning only when evaluation shows a need.

Evidence

The OSI definition establishes criteria for open-source AI beyond downloadable weights. llama.cpp demonstrates portable local inference, while QLoRA documents efficient adaptation of quantized models.

Electronic Artefacts implications

Local AI fits Electronic Artefacts when it strengthens control over knowledge and cultural material. It can support a private research assistant over graph records, an ORETH analysis service, a studio coding tool or a multimodal archive interface.

The durable principle is composability. Models should be replaceable components behind clear contracts. Sources, entity identities and permissions should remain independent of one model vendor. Provenance should capture how AI contributed to an artefact.

Limitations

Local performance depends heavily on current hardware, drivers, model formats and runtime support. Product-level comparisons can become obsolete quickly, so this article focuses on architectural principles rather than benchmark rankings.

Related concepts

Read Open-Weight Model, Open Source, Large Language Model, Provenance and Retrieval-Augmented Generation.

Related technology

See llama.cpp for a local inference runtime.

Glossary

Open-weight: distributing learned parameters under stated terms.

Quantization: reducing numerical precision to lower memory and compute requirements.

Inference runtime: software that executes a trained model.

Adapter: a smaller set of trainable parameters added to a base model.

Unified memory: memory shared by CPU and accelerator in some hardware architectures.

References

Open Source Initiative. Open Source AI Definition 1.0.
ggml-org. llama.cpp.
Dettmers et al. QLoRA. 2023.