Making Sense of Apple’s AI Ecosystem: MLX, HuggingFace, CoreAI

A hype-free explanation of how Apple’s AI ecosystem shares a single Swift interface.

Introduction

In my recent blog post, I discussed the Foundation Models framework and how it helps you build intelligent, agentic features on Apple platforms. If you haven’t read that post yet, I’d recommend starting there first before continuing here.

At its core, the Foundation Models framework will act as a frontend orchestrator, thanks to this year’s LanguageModel protocol update 😉. It receives the user’s prompt, decides which instructions to apply, which tools to invoke, and how to stitch the results together to get the best possible output.

But that naturally raises the next question. What exactly is powering the engine underneath?

That’s what this post is about. We’re going one layer deeper, into the models themselves. You’ll learn:

how to use LLMs locally with MLX
how MLX helps LLM providers like Anthropic bring their models to the Foundation Models APIs
how to integrate open-source LLMs into the Foundation Models framework via Hugging Face’s AnyLanguageModel
how to fine-tune via Core AI

If the Foundation Models framework is the frontend orchestrator, what we’re covering today is the backend, the actual engine that makes your features run.

Hardware Requirements

MLX is purpose-built for Apple Silicon, so the first requirement is simple: you need a Mac with an M-series chip. Intel Macs are not supported.

Beyond that, the real constraint is unified memory. Unlike a traditional GPU setup where you’d need to worry about VRAM separately, Apple Silicon shares one memory pool between the CPU, GPU, and Neural Engine. That’s great news for running large models, because MLX can draw on that shared pool rather than a separate, smaller VRAM budget. The practical implication is straightforward: the bigger the model, the more RAM you need.

Here’s a rough guide based on 4-bit quantised model variants (approximate figures, actual requirements vary by model architecture though)

Unified Memory	Models you can run
8 GB	Up to ~3B parameter models (e.g. Llama 3.2 3B 4-bit)
16 GB	Up to ~8B parameter models (e.g. Llama 3.1 8B 4-bit)
32 GB	Up to ~34B parameter models
64 GB+	70B models and beyond

If you’re on M5, there’s a bonus. All Apple Silicon Macs have a Neural Engine, but M5 also adds dedicated Neural Accelerators inside its GPU cores. MLX targets these automatically for matrix multiplication, the core operation during prompt processing, and Apple reports up to a 4x speedup over M4 for that phase, with zero code changes on your part. Cool, right? In fact, MLX selects the best kernel for whatever hardware you’re running on, so there’s no need to worry.

For most developers following this blog post, a 16 GB M-series Mac is a comfortable starting point. If you want to experiment with larger frontier-class models, 32 GB or more gives you meaningful headroom without needing aggressive quantisation.

Hugging Face

If you haven’t come across Hugging Face before, think of it as the npm registry for machine learning models. Researchers and teams publish model weights there openly, and anyone can download and use them.

What makes it relevant here is the mlx-community organisation on Hugging Face. This is a community-maintained collection of popular open-source models that have already been converted and quantised for MLX. Models like Llama, Gemma, Qwen, and Mistral are all there in various sizes and quantisation levels, ready to drop straight into mlx-swift-lm by repo ID.

That’s where the model IDs in the code examples throughout this post come from. Whenever you see something like "mlx-community/Llama-3.2-3B-Instruct-4bit", that’s a direct reference to a repository on Hugging Face.

MLX Models Vs SystemLanguageModel

Before we dive into swapping models, it helps to understand what each one actually is 😊.

SystemLanguageModel is Apple’s on-device model, the one Apple Intelligence is built on. It ships with the Foundation Models framework, so you never download weights or manage a model file. Apple rebuilt it this year with stronger reasoning and improved tool calling, added image understanding via prompt attachments, and continues to refine its guardrails. The tradeoff is that you get exactly what Apple ships: a locked appliance optimised to run privately within Apple’s defined safety boundaries.

MLX, on the other hand, started life as a numerical computing framework for Apple Silicon. You can imagine this is why I haven’t touched on this framework before, as I’ve never come close to the AI/ML research side of things in my career. Apple designed it for researchers and engineers who wanted to run, explore, and fine-tune arbitrary models on Mac, taking full advantage of the unified memory architecture.

Take note that MLX still falls into the on-device AI category. It lets you download a pre-trained model and do the work on your own hardware. This is why I added the hardware requirements section above.

Before a model can even think about talking to a chat session, it has to know how to do matrix math on Apple Silicon. Open-source models on Hugging Face (like Llama 3 or Qwen) are usually written for Nvidia GPUs using CUDA or standard PyTorch. If you run them raw on a Mac, they don’t know how to talk efficiently to Apple’s unified memory or Metal GPU shaders. The core MLX framework is basically Apple’s native counterpart to PyTorch. It implements those deep learning operations specifically for M-series chips. It allows us to take an open model, quantise it down (e.g., to 4-bit), and run it natively on Apple’s GPU.

Covering the full MLX library is out of scope here, and frankly, the low-level numerical computing side is well beyond what most app developers need to touch. What matters for us is the layer built on top: mlx-swift-lm.

MLX from Swift

mlx-swift-lm is a Swift package built on top of MLX that handles the language model layer for you. It ships four library products:

MLXLLM: text-only language models
MLXVLM: vision-language models (for image and text inputs)
MLXEmbedders: text embedding and encoding models
MLXLMCommon: the shared core, covering generation logic, KV caching, and the base protocols the three above rely on

MLXLLM handles the model layer: loading weights, running inference, and managing the KV cache. But notice that none of this involves Foundation Models yet. mlx-swift-lm has its own high-level chat API called ChatSession (from MLXLMCommon), which works entirely independently.

If you’ve used LanguageModelSession before, ChatSession will feel familiar. The mental model is the same: download a model, create a session, send prompts, get responses. Where it differs is in the lower-level control it exposes. GenerateParameters lets you tune things like temperature and maxTokens directly, which Foundation Models abstracts away behind higher-level APIs.

The following code is taken from the mlx-swift-examples repository. You should definitely check out the example projects. They are quite interesting.

Configuring an Open-Source Model

The first step is picking a model. LLMRegistry ships with a set of pre-defined configurations for popular models, analogous to how Foundation Models gives you SystemLanguageModel.default as a ready-to-use starting point.

let modelConfiguration = LLMRegistry.gemma3_1B_qat_4bit

Each entry in LLMRegistry is a ModelConfiguration that wraps a Hugging Face repo ID under the hood. gemma3_1B_qat_4bit, for example, points to mlx-community/gemma-3-1b-it-qat-4bit on Hugging Face. If the model you want isn’t already in the registry, you can define your own configuration directly.

let modelConfiguration = ModelConfiguration(id: "mlx-community/Llama-3.2-3B-Instruct-4bit")

Either way, the rest of the setup stays identical.

Downloading the Model

Once you have a configuration, you load and download the model with #huggingFaceLoadModelContainer

let modelContainer = try await #huggingFaceLoadModelContainer(
    configuration: LLMRegistry.gemma3_1B_qat_4bit
) { progress in
    // update your UI for loading progress
}

#huggingFaceLoadModelContainer is a Swift macro from MLXHuggingFace that handles the full download and caching flow for you. On the first call, it fetches the model weights from Hugging Face. Every subsequent call loads them straight from the local cache without hitting the network. If you want to configure the downloader, tokeniser, and so on, I recommend checking out their examples repository again.

Downloaded models land in the Hugging Face hub cache on your Mac

~/.cache/huggingface/hub/

Each model gets its own folder named after the repo, for example:

~/.cache/huggingface/hub/models--mlx-community--gemma-3-1b-it-qat-4bit/

You can open this in Finder with:

open ~/.cache/huggingface/hub

If you want to free up disk space, delete the model’s folder directly

rm -rf ~/.cache/huggingface/hub/models--mlx-community--gemma-3-1b-it-qat-4bit

Replace the folder name with whichever model you want to remove. The next time your app runs and requests that model, it will re-download it automatically.

Starting a Chat Session

Once you have the model container, wrap it in a ChatSession to start conversing

let session = ChatSession(modelContainer)

let reply = try await session.respond(to: "What is lazy evaluation?")
print(reply)

ChatSession automatically maintains conversation history across turns, the same way LanguageModelSession does in Foundation Models. A follow-up question like "Give me a Swift example" will have full context from the previous exchange without you doing anything extra.

You can pass instructions to your chat session directly as a String.

let session = ChatSession(
    modelContainer,
    instructions: "You are a concise technical writer."
)

You can also change it at any point by mutating session.instructions directly.

Streaming responses

Both APIs support streaming, but the return type differs slightly. Foundation Models streams partial Response values, while ChatSession.streamResponse(to:) yields raw String chunks via AsyncThrowingStream:

for try await chunk in session.streamResponse(to: "Explain value types in Swift.") {
    print(chunk, terminator: "")
}

Each chunk is a fragment of the response as the model generates it, so you can display tokens progressively in your UI.

Tool calling

Foundation Models uses a Tool protocol with structured Swift types. ChatSession takes a lower-level approach. You pass a list of ToolSpec descriptions and a toolDispatch closure that receives the raw ToolCall and returns a String result.

let session = ChatSession(
    modelContainer,
    instructions: "Use tools when needed.",
    tools: [myToolSpec],
    toolDispatch: { call in
        // inspect call.name and call.arguments, run your logic, return a result
        return "Tool result as a plain string"
    }
)

This is more manual than Foundation Models’ structured approach, but it gives you full control over how tools are resolved, including async network calls, database lookups, or anything else you can express in a closure.

MLX from Terminal

Everything we’ve covered so far involves writing Swift code. But mlx-lm, the Python counterpart to mlx-swift-lm, ships a CLI that lets you download models, chat with them, and run a local server, all without a single line of application code. It’s a great way to test a model before committing it to your app. Apple walked through this exact workflow in Run local agentic AI on the Mac using MLX (WWDC 2026, session 232), which is worth watching if you want to go deeper. It is quite impressive that an on-device model created a SwiftUI drawing app from scratch, though I don’t know the exact spec of the presenter’s machine or how long it actually took.

Install mlx-lm command

pip install mlx-lm

That one command gets you the full CLI toolkit. If your machine doesn’t have Python, install it first so that you have access to the pip command.

Chat directly in the terminal

To have a quick one-shot conversation with any mlx-community model

mlx_lm.generate \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --prompt "Explain value semantics in Swift in two sentences."

For an interactive multi-turn session, use mlx_lm.chat instead

mlx_lm.chat --model mlx-community/Llama-3.2-3B-Instruct-4bit

The model downloads automatically on first run into ~/.cache/huggingface/hub/, the same cache location as the Swift package, so switching between the CLI and your app costs nothing extra.

Run a local server

The more powerful option is spinning up a local server that any agent or tool can talk to.

mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit

By default, the server binds to 127.0.0.1 on port 8080. You can change the port with --port:

mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 11434

The server exposes an OpenAI-compatible HTTP API. Take note that the /v1/chat/completions path is not something you configure; it is hardcoded because it mirrors the endpoint path OpenAI defined for its own chat API, which became a de facto standard that the whole AI tooling ecosystem adopted. When a tool says it is “OpenAI-compatible”, it means it speaks that same protocol, so any client, agent, or IDE integration built to talk to OpenAI can be pointed at a different base URL and work without modification. In this case, that base URL is your own Mac. From that point, anything that speaks the protocol can use your local model, including OpenCode, Xcode, custom scripts, or even a plain curl:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default_model",
    "messages": [{ "role": "user", "content": "Hello!" }]
  }'
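
If you’d rather make the same call from Swift, here is a minimal sketch of the request and response shapes. The type and function names (ChatMessage, ChatRequest, ChatResponse, makeChatRequest, firstReply) are my own and not part of any library. I only model the fields used in the curl example plus the choices array that OpenAI-style responses return, so treat it as a starting point rather than a complete client.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives here on Linux
#endif

// Hypothetical request/response types covering only the fields we need.
struct ChatMessage: Codable, Equatable {
    let role: String
    let content: String
}

struct ChatRequest: Codable {
    let model: String
    let messages: [ChatMessage]
}

struct ChatResponse: Codable {
    struct Choice: Codable {
        let message: ChatMessage
    }
    let choices: [Choice]
}

/// Builds the same POST request as the curl command above.
func makeChatRequest(baseURL: URL, model: String, prompt: String) throws -> URLRequest {
    var request = URLRequest(url: baseURL.appendingPathComponent("v1/chat/completions"))
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    let body = ChatRequest(
        model: model,
        messages: [ChatMessage(role: "user", content: prompt)]
    )
    request.httpBody = try JSONEncoder().encode(body)
    return request
}

/// Pulls the first assistant reply out of a chat completions response body.
/// Unknown JSON keys (id, usage, and so on) are ignored by the decoder.
func firstReply(in data: Data) throws -> String? {
    let response = try JSONDecoder().decode(ChatResponse.self, from: data)
    return response.choices.first?.message.content
}
```

In an app you would hand the request to URLSession and pass the returned data to firstReply(in:). I left the network call out so the sketch stays runnable without a server.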

Connecting with OpenCode

To make OpenCode use your local model for everything, add a provider entry pointing at the local server:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "mlx": {
      "name": "MLX Local",
      "baseURL": "http://127.0.0.1:8080/v1",
      "models": { "default_model": {} }
    }
  },
  "model": "mlx/default_model"
}

OpenCode has no idea the model is running on your Mac rather than a cloud endpoint. Every request goes through your local server, so there is no API key, no usage cost, and your prompts never leave your machine.

Connecting with Xcode

If you want Xcode 27’s built-in Intelligence features to use your local model, open Settings, Intelligence, Add Chat Provider, select “Locally Hosted”, and set the port to 8080. Xcode will route its model requests to your running MLX server from that point on.

Going beyond localhost with MacProvider

mlx_lm.server only listens on your Mac. If you want your MLX endpoint reachable from other machines or devices without managing tunnels yourself, MacProvider is a third-party thin layer over mlx-lm that fills that gap. I haven’t tried it myself yet, but it looks quite nice, so I wanted to share it here.

If I understand the idea correctly, you install a small CLI agent on your Mac, which connects outbound over WebSocket to a coordinator. No port-forwarding or firewall changes are needed on your end. On the consuming (buyer) side, you get a standard OpenAI-compatible endpoint at api.streamvc.live/v1 and can swap it into any existing client.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.streamvc.live/v1",
    api_key="<your-api-key>",
)

response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello"}],
)

One thing worth knowing is that prompts and responses pass through the gateway for routing and billing. Model weights stay on the provider’s Mac and never leave, but this is not a private-inference guarantee. The README is upfront about this, so go in with eyes open if your use case is sensitive.

MLX for LLM Providers

Everything in this section is theoretical on my part. I am not a model provider, and I have not implemented any of this myself. What follows is my reading of Apple’s WWDC 2026 session Bring an LLM provider to the Foundation Models framework, which walks through exactly how providers are expected to integrate. If you are building a provider package, that session is the authoritative source, not this post.

This year, Apple opened the Foundation Models abstraction layer to third-party providers. Anthropic and Google are both shipping Swift packages that bring Claude and Gemini into LanguageModelSession through the same LanguageModel protocol that SystemLanguageModel uses. From your app code, the swap looks identical to switching between any other model.

What makes this interesting is what happens underneath. A provider package is no longer an HTTP client wrapper. It is a full integration layer that has to bridge the provider’s own model behaviour, safety guardrails, and capabilities into Apple’s framework contracts. Here is how that works at a high level.

The two protocols

Every provider implementation revolves around two protocols: LanguageModel and LanguageModelExecutor.

LanguageModel is the lightweight, value-typed description of the model. It declares what the model can do by listing its capabilities, such as tool calling, guided generation, and reasoning. It also provides an executorConfiguration, which acts as the cache key for the executor instance inside a session.

LanguageModelExecutor is where the actual work happens. It initialises with the configuration, optionally prewarms (loading weights or opening a connection), and then handles each generation request by translating the Foundation Models Transcript into the provider’s native wire format, making the call, and streaming the response back through a typed channel.

public protocol LanguageModel: Sendable {
    var capabilities: LanguageModelCapabilities { get }
    var executorConfiguration: Executor.Configuration { get }
}

public protocol LanguageModelExecutor: Sendable {
    init(configuration: Configuration) throws
    func prewarm(model: Model, transcript: Transcript)
    func respond(
        to request: LanguageModelExecutorGenerationRequest,
        model: Model,
        streamingInto channel: LanguageModelExecutorGenerationChannel
    ) async throws
}

Translating the transcript

The executor receives the full conversation as a Transcript, which contains typed entries: instructions, user prompts, tool calls, tool outputs, model responses, and reasoning. The provider’s job is to map these entries to whatever role format their API expects (system, user, assistant, tool, and so on) and then apply any context and generation options the developer passed.

This is where a provider’s safety layer lives. Anthropic’s executor, for example, would forward the assembled prompt to the Claude API, which applies Anthropic’s own constitutional AI guardrails server-side. The response comes back and the executor streams it into the channel. Apple’s framework never sees or touches the safety logic directly. It just receives the streamed tokens.
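
To make the mapping step a bit more concrete, here is a rough sketch using stand-in types. SketchEntry and WireMessage are hypothetical and only mirror the entry kinds listed above; the real Transcript type is richer, and each provider’s wire format will differ. The point is just the shape of the translation: one typed entry in, one role-tagged message out.

```swift
// Stand-in for the framework's transcript entries (hypothetical, not the real type).
enum SketchEntry {
    case instructions(String)
    case prompt(String)
    case response(String)
    case toolCall(name: String, arguments: String)
    case toolOutput(name: String, result: String)
}

// A generic role/content pair, as used by most chat-style wire formats.
struct WireMessage: Equatable {
    let role: String
    let content: String
}

/// Maps each transcript entry to the role a chat-style API would expect.
func wireMessages(from entries: [SketchEntry]) -> [WireMessage] {
    var messages: [WireMessage] = []
    for entry in entries {
        switch entry {
        case .instructions(let text):
            messages.append(WireMessage(role: "system", content: text))
        case .prompt(let text):
            messages.append(WireMessage(role: "user", content: text))
        case .response(let text):
            messages.append(WireMessage(role: "assistant", content: text))
        case .toolCall(let name, let arguments):
            // Tool calls originate from the model, so they go out as assistant turns.
            messages.append(WireMessage(role: "assistant", content: "call \(name)(\(arguments))"))
        case .toolOutput(let name, let result):
            messages.append(WireMessage(role: "tool", content: "\(name): \(result)"))
        }
    }
    return messages
}
```

A real executor would also carry tool call identifiers and structured arguments rather than flattening them into strings, but that detail depends on the provider.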

Streaming the response

Responses are always streamed, even when the developer calls the one-shot respond(to:) API. The recommended order is metadata first, then token usage, then text deltas:

// 1. Send identifying metadata early for logging
await channel.send(.response(action: .updateMetadata([
    "modelID": "claude-opus-4-8",
    "requestID": request.id.uuidString
])))

// 2. Report prompt token usage before generation
await channel.send(.response(action: .updateUsage(
    input: .init(totalTokenCount: promptTokens, cachedTokenCount: cachedTokens),
    output: .init(totalTokenCount: 0, reasoningTokenCount: 0)
)))

// 3. Stream generated tokens as they arrive
for try await token in responseStream {
    await channel.send(.response(action: .appendText(token)))
}

Authentication and security

Because these are server-backed models, the provider package has to handle authentication. Apple’s guidance is to use OAuth flows rather than raw API keys, store tokens in Keychain, and consider App Attest for device verification. The upside of this design is that app developers never manage API keys directly; the provider package handles the full credential lifecycle.

This is fundamentally different from the on-device models. With SystemLanguageModel or open-source MLX models, nothing leaves your device. With Claude or Gemini behind this protocol, every request goes to Anthropic’s or Google’s servers, with the same privacy characteristics as calling their APIs directly. The LanguageModel protocol makes the call site identical, which is what makes model swapping effortless, but the data flow under the hood is not.

Use Hugging Face Models via AnyLanguageModel

mlx-swift-lm is great for raw inference and experimentation. But it operates independently of Foundation Models, which means you lose all the agentic primitives: DynamicProfile, structured output via @Generable, the instruction system, tool orchestration, and everything else built on top. Building those yourself from scratch in MLX is sort of re-inventing the wheel.

If you want open-source models inside your existing Foundation Models code, AnyLanguageModel from Hugging Face is the best option. Of course, it still uses MLX under the hood. Thanks to the LanguageModel abstraction and wrapper ecosystems like AnyLanguageModel, MLX becomes an invisible implementation detail for consumers like yourself.

AnyLanguageModel is a drop-in replacement for the Foundation Models framework. It supports a wide range of backends: Apple’s own models, MLX, Core ML, llama.cpp (GGUF), Ollama, and even cloud providers like Anthropic, Google Gemini, and OpenAI. The entire switch only takes one line

--- a/AwesomeFeature.swift
+++ b/AwesomeFeature.swift
@@ -1 +1 @@
-import FoundationModels
+import AnyLanguageModel

That’s it. Every LanguageModelSession, every Tool, every Instructions block you already wrote continues to work. You just gain the ability to back it with a different model.

Installation

You can install the package via SPM

// Package.swift
dependencies: [
    .package(
        url: "https://github.com/huggingface/AnyLanguageModel",
        exact: "0.8.0",
        traits: ["MLX"] // opt into the MLX backend
    ),
    // Required: add the underlying dep directly (SPM trait bug workaround)
    .package(url: "https://github.com/ml-explore/mlx-swift-lm", exact: "2.25.5"),
]

AnyLanguageModel uses Swift 6.1 package traits to keep binary size small. You opt in with "MLX", "CoreML", or "Llama" as needed, and add the underlying dependency alongside it due to a known SPM bug. For this article, we will pass MLX as a trait.

This tells the package to quietly install mlx-swift-lm under the hood. If you changed that trait to “Llama”, the wrapper would swap out MLX completely and use llama.cpp (GGUF format) to execute the model instead, without changing a single line of your actual generation code.

If you want to know more about Swift Package Trait, I have another blog post for you 😉.

Xcode projects: Xcode doesn’t support package traits directly yet. The workaround is to create a local Swift package shim that re-exports AnyLanguageModel with the traits enabled, then add that local package to your Xcode project. The AnyLanguageModel README has a step-by-step guide for this.

Using an MLX model inside Foundation Models

Once the package is added, swap SystemLanguageModel.default for an MLX model and the rest of your code is unchanged:

import AnyLanguageModel

// Use any mlx-community model by Hugging Face repo ID
let model = MLXLanguageModel(modelId: "mlx-community/Llama-3.2-3B-Instruct-4bit")
let session = LanguageModelSession(model: model)

// Tools, instructions, streaming — all the same Foundation Models APIs
let response = try await session.respond(to: "What is lazy evaluation?")
print(response.content)

This is all it takes to bridge between the two worlds. You get the full model flexibility of the mlx-community on Hugging Face, and you keep every Foundation Models primitive you’ve already built on top.

Decoding the Model Name

If you’re not from an AI/ML background, model identifiers like mlx-community/Llama-3.2-3B-Instruct-4bit look like random noise at first glance. They’re not. Each segment tells you something specific, and once you know the pattern, picking a model becomes straightforward.

Llama  -  3.2  -  3B  -  Instruct  -  4bit
  (1)    (2)    (3)      (4)          (5)

(1) Model family: The architecture name, like Llama, Gemma, Qwen, or Mistral. Think of this like a framework name, such as SwiftUI vs UIKit vs Core Graphics. Different teams, different design decisions, different trade-offs.

(2) Version: The release iteration of that family. 3.2 is a newer, generally better version than 3.1 from the same family, much like a library version.

(3) Parameter count: 3B means 3 billion parameters. Parameters are the learned numerical weights that define how the model thinks. More parameters generally means a smarter, more capable model, but also more memory and slower inference.

(4) Fine-tuning flavour: This is the most important segment for picking the right model. There are two main flavours.

Base (or Pre-trained): Trained purely to predict the next token in a sequence. Ask it “How do I bake a cake?” and it might respond with more questions like “How do I bake bread? How do I make cookies?” It’s not trying to help you; it’s just completing a pattern. Useful for further fine-tuning, not for shipping to users.
Instruct (or Chat): Further trained with human feedback to behave like an assistant. It knows it should give you a direct, helpful answer. Always use Instruct or Chat variants for app features.

(5) Quantisation: The compression level applied to the weights. 4bit means each parameter is stored using 4 bits instead of the full 16 or 32 bits. This drastically reduces the model’s memory footprint so it can run on-device, at the cost of a small loss in accuracy. In my experience, 4bit is the sweet spot for most use cases: it fits in reasonable RAM and the quality degradation is barely noticeable in practice. 8bit gives higher quality but needs more memory. 2bit is very compressed and noticeably worse.
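
If it helps to see the five segments as data, here is a small sketch that splits an identifier of exactly this shape. ModelName and parseModelID are names I made up for illustration. Plenty of repositories don’t follow the five-segment pattern (for example, mlx-community/gemma-3-1b-it-qat-4bit has six), so the function returns nil instead of guessing.

```swift
// Hypothetical container for the segments described above.
struct ModelName: Equatable {
    let organisation: String
    let family: String
    let version: String
    let parameters: String
    let variant: String
    let quantisation: String
}

/// Parses IDs of the exact form "org/Family-Version-Size-Variant-Quant".
/// Anything that doesn't match this shape yields nil.
func parseModelID(_ id: String) -> ModelName? {
    let parts = id.split(separator: "/")
    guard parts.count == 2 else { return nil }

    let segments = parts[1].split(separator: "-").map { String($0) }
    guard segments.count == 5 else { return nil }

    return ModelName(
        organisation: String(parts[0]),
        family: segments[0],
        version: segments[1],
        parameters: segments[2],
        variant: segments[3],
        quantisation: segments[4]
    )
}
```

This is only a reading aid. For anything real, the model card on Hugging Face is the source of truth, not the name.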

Choosing the Right Model

Honestly, this is an area I’m still experimenting with myself, so take this as a working guide rather than a definitive rulebook. That said, there are three things worth checking before you commit to a model: memory fit, task complexity, and the context window.

Memory fit first

This one is non-negotiable. Apple Silicon uses unified memory, so the model, your app, and the OS all share the same pool. If a model is too large to fit, macOS starts paging weights to the SSD, and inference slows to a crawl. Not a graceful degradation, just very slow.

A rough formula to estimate how much memory a model needs:

$\text{Memory (GB)} \approx \frac{\text{Parameters (B)} \times \text{Bit-depth}}{8} + 1$

With the formula above in mind, let’s plug in some real examples:

Model	Parameters	Quantisation	Estimated RAM
Gemma 3 1B	1B	4-bit	~1.5 GB
Llama 3.2 3B	3B	4-bit	~2.5 GB
Llama 3.1 8B	8B	4-bit	~5 GB
Generic 34B model	34B	4-bit	~18 GB

Add another ~1-2 GB of headroom for the KV cache during generation, especially for longer conversations. So on a 16 GB machine, an 8B model quantised to 4-bit is comfortable. A 34B model needs 32 GB or more to avoid paging.
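
Here is the same formula as a couple of Swift functions, in case you want to sanity-check a model before downloading it. The function names are mine. The 2 GB KV cache headroom comes from the paragraph above, while the 4 GB reserved for the OS and your app is purely my own assumption, so adjust both to taste.

```swift
/// Rough weight memory in GB: parameters (in billions) x bit-depth / 8, plus 1 GB overhead.
func estimatedMemoryGB(parametersInBillions: Double, bitDepth: Double) -> Double {
    parametersInBillions * bitDepth / 8 + 1
}

/// Checks whether a model is likely to fit without paging.
/// `kvCacheHeadroomGB` follows the ~1-2 GB guidance above;
/// `systemReserveGB` is an assumed allowance for macOS and the app itself.
func likelyFits(
    parametersInBillions: Double,
    bitDepth: Double,
    unifiedMemoryGB: Double,
    kvCacheHeadroomGB: Double = 2,
    systemReserveGB: Double = 4
) -> Bool {
    let needed = estimatedMemoryGB(
        parametersInBillions: parametersInBillions,
        bitDepth: bitDepth
    ) + kvCacheHeadroomGB
    return needed <= unifiedMemoryGB - systemReserveGB
}

// Example: an 8B 4-bit model on a 16 GB Mac, and a 34B 4-bit model on the same machine.
let eightBFits = likelyFits(parametersInBillions: 8, bitDepth: 4, unifiedMemoryGB: 16)
let thirtyFourBFits = likelyFits(parametersInBillions: 34, bitDepth: 4, unifiedMemoryGB: 16)
print("8B fits on 16 GB:", eightBFits, "| 34B fits on 16 GB:", thirtyFourBFits)
```

With these defaults the 8B model passes on 16 GB and the 34B model does not, which matches the rule of thumb above.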

Task complexity

Bigger is not always better, especially when load time and memory cost are real constraints on a device. A 1B-3B model handles summarisation, classification, short Q&A, and simple text transformations surprisingly well. For multi-step reasoning, code generation, or anything that requires holding a lot of context in mind at once, you start to feel the ceiling around 3B and a 7B-8B model becomes noticeably better. Beyond 8B, you’re mostly trading memory for marginal quality gains on typical app tasks.

Here is a rough starting point:

1B-3B — lightweight tasks: classification, short Q&A, simple rewrites
7B-8B — general-purpose assistant tasks, code, structured reasoning
34B+ — research-grade tasks where quality matters more than resource cost

Context window

The context window is the maximum number of tokens a model can hold in one go, covering your system prompt, conversation history, and the pending response all at once. Most MLX models in the 1B-8B range support 4K-8K tokens. Newer releases of Llama and Qwen push to 32K or 128K.

For most app features this doesn’t matter much. Short chat turns, single-pass generation, and tool calling fit comfortably in 4K. Where it starts to matter is long document summarisation or multi-turn conversations that carry a lot of history. In those cases, check the model card on Hugging Face before committing. If your use case involves processing large inputs, a smaller model with a 32K window can outperform a larger model with a 4K window.
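
If you want a quick pre-flight check, a sketch like the one below can tell you roughly whether an input will fit. It leans on the common rule of thumb of about 4 characters per token for English text, which is an assumption on my part and not how any real tokenizer works. The function names are also mine. For accurate numbers, count tokens with the model’s own tokenizer.

```swift
/// Very rough estimate: about 4 characters per token, rounded up.
/// This is a heuristic for English text, not a real tokenizer.
func roughTokenCount(_ text: String) -> Int {
    (text.count + 3) / 4
}

/// Checks whether instructions, history, and the new prompt leave enough room
/// in the context window for the response we expect to generate.
func fitsContextWindow(
    instructions: String,
    history: [String],
    prompt: String,
    contextWindow: Int,
    reservedForResponse: Int
) -> Bool {
    let used = ([instructions, prompt] + history)
        .map(roughTokenCount)
        .reduce(0, +)
    return used + reservedForResponse <= contextWindow
}
```

The reservedForResponse parameter matters because the pending response shares the same window as everything you send in.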

When in doubt, start with a 3B or 7B Instruct model, measure quality on your actual task, and only go larger if the output noticeably falls short. At least, that is my plan right now.

Core AI

When I first saw “CoreAI” mentioned alongside Foundation Models, I assumed it was purely a fine-tuning story. It is much broader than that. Core AI is Apple’s on-device inference stack, the same engine that powers Apple Intelligence internally, now opened up to developers. I bet it is going to be a successor to Core ML, in a way, as it has been rebuilt from the ground up for the generative AI era. Fine-tuning and model compression are part of the toolchain, but for most app developers the relevant story is inference: bringing your own model onto the device and running it efficiently.

Where Core ML spoke in .mlmodel files and focused on classification and traditional ML workloads, Core AI speaks in .aimodel files and is designed for modern transformer-based models. It can run across the CPU, GPU, and Neural Engine, and has first-class support for stateful workloads like autoregressive text generation, where the model needs to carry forward a KV cache between each output token. The framework covers the full lifecycle: Python tooling for conversion and compression, Xcode integration for inspection and profiling, and Swift APIs for running inference on-device.

Converting a Model

Core AI’s Python tooling (coreai_torch) handles the trip from a PyTorch model to a .aimodel asset. There is a separate tool, coreai_opt, for model compression and quantisation if you need to shrink the model further before deployment. If you’re converting your own PyTorch model from scratch, coreai_torch is the library to use. I won’t cover that in this article, as it isn’t my main interest at the moment.

If you want to try it out, Apple publishes the Core AI Models repository, which has ready-made export recipes for popular models like SAM3 and Qwen. Instead of figuring out the export flags yourself, you run the provided recipe and get an optimised .aimodel out the other side. This is the practical path for me.

Once you have the .aimodel file, you can open it directly in Xcode to inspect its function signatures, tensor shapes, and memory footprint before touching a line of Swift.

Loading and Running in Swift

The Swift side of Core AI is built around three types: AIModel, InferenceFunction, and NDArray.

import CoreAI

let model = try await AIModel(contentsOf: modelURL)
let mainFunction = try model.loadFunction(named: "main")! // "main" is the function name you assigned during export

let input = NDArray(shape: [1, 128], scalarType: .float32)
var outputs = try await mainFunction.run(inputs: ["input": input])
let result = outputs.remove("output")?.ndArray

This is the low-level API, and it mirrors Core ML’s MLModel pattern fairly closely. You’re responsible for preparing your input tensors and interpreting the output tensors. For custom models where you control the architecture, this is exactly the right level of abstraction.

But for language models specifically, the Core AI Models Swift package wraps all the tokenizer setup and tensor handling for you, and gives you a Foundation Models integration out of the box.

Plugging into Foundation Models

This is the part that makes Core AI genuinely interesting from an app developer’s perspective. The Core AI Models package ships a CoreAILanguageModel type that conforms to Apple’s LanguageModel protocol. You load a Qwen model or any other Core AI language model and pass it straight into LanguageModelSession.

import FoundationModels
import CoreAILanguageModels

let model = try await CoreAILanguageModel(resourcesAt: modelURL)
let session = LanguageModelSession(model: model)

let response = try await session.respond(to: "Translate 'apple' to French.")
print(response.content)

Because it conforms to LanguageModel, everything you’ve already built with Foundation Models still works. Structured generation with @Generable, tool calling, streaming, instructions, all of it carries over unchanged. The model backing the session just happens to be your custom .aimodel running entirely on-device instead of Apple’s system model.

@Generable
struct VocabCard {
    let word: String
    let translation: String
    let exampleSentences: [String]
}

let response = try await session.respond(
    to: "Create a vocab card for the word 'bloom'.",
    generating: VocabCard.self
)
let card: VocabCard = response.content

This is the same @Generable pattern you’d use with SystemLanguageModel.default. Nothing changes in how you write the feature code.

The Specialisation Delay

The first time a Core AI model runs on a user’s device, it has to be specialised. Specialisation is a compilation step that tailors the model for the exact chip and OS version on that specific device. It can take a noticeable amount of time for larger models, and if it happens in the middle of an interactive flow, the user just sees your app stall.

Apple’s recommended approach is to treat this step like a download: do it deliberately, outside the user’s primary action.

Gate the feature behind an explicit opt-in
Download the .aimodel asset via Background Assets after the user opts in
Trigger specialisation during a first-run experience with clear progress feedback
On subsequent launches, the device cache makes loading fast

Apple provides AIModelCache and explicit specialisation APIs so you can check whether a model is ready and request preparation ahead of time. For the full API details, the Managing model specialisation and caching documentation page is the right starting point.
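
One way I think about the opt-in, download, and specialise flow is as a tiny state machine. The sketch below is plain Swift and deliberately uses no Core AI API; ModelReadiness, ModelEvent, and advance are hypothetical names for illustration. In a real app, the events would be driven by Background Assets callbacks and whatever the specialisation APIs report.

```swift
// Hypothetical states a model-backed feature moves through on first run.
enum ModelReadiness: Equatable {
    case notOptedIn
    case downloading
    case specialising
    case ready

    /// The feature should only be offered once the model is fully prepared.
    var isFeatureAvailable: Bool { self == .ready }
}

// Hypothetical events that move the feature forward.
enum ModelEvent {
    case userOptedIn
    case downloadFinished
    case specialisationFinished
}

/// Returns the next state, ignoring events that don't apply to the current one.
func advance(_ state: ModelReadiness, on event: ModelEvent) -> ModelReadiness {
    switch (state, event) {
    case (.notOptedIn, .userOptedIn):
        return .downloading
    case (.downloading, .downloadFinished):
        return .specialising
    case (.specialising, .specialisationFinished):
        return .ready
    default:
        return state
    }
}
```

Keeping the state explicit makes it easy to show the right progress UI at each step, and to make sure the interactive feature never triggers specialisation by accident.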

If you want to go deeper on any of this, Apple’s WWDC 2026 sessions are the best resource: Meet Core AI gives the full framework overview, and Integrate on-device AI models into your app using Core AI walks through a complete real-world integration with Qwen and SAM3.

Conclusion

Apple’s AI ecosystem has a lot of moving parts, and I hope this post made it a little less overwhelming. The good news is that once you see how LanguageModel acts as the common thread, everything else starts to fall into place naturally.

Thanks for sticking around to the end. It’s a long one, I know. If you found it useful, or if something here is wrong, I’d love to hear from you 😊.