Training Provenance
Training provenance starts with a simple question: can someone prove that a model's current MLflow record matches what was recorded when training finished?
ar-io-mlflow answers that by writing a canonical payload to MLflow, signing a compact commitment envelope, and anchoring only that envelope to Arweave. The plugin can also anchor MLflow dataset records so downstream trainers can refer to an immutable dataset proof without publishing rows or raw files.
Prerequisites
- Python 3.10 or newer
- MLflow 2.14 or newer
- A local or remote MLflow tracking store
- Optional Arweave JWK wallet for production identity
If no wallet is configured, the plugin generates one at ~/.ario-mlflow/wallet.json and reuses it. That is convenient for evaluation. In production, set ARIO_MLFLOW_ARWEAVE_WALLET to a dedicated wallet file from your secrets manager.
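The fallback described above can be pictured as a small lookup: use the wallet named by ARIO_MLFLOW_ARWEAVE_WALLET when it is set, otherwise fall back to the default path. The sketch below is only an illustration of that rule, not the plugin's own code; the function name `resolve_wallet_path` and the mode labels are hypothetical.

```python
import os
from pathlib import Path

# Default location named in this guide; the environment variable overrides it.
DEFAULT_WALLET = Path.home() / ".ario-mlflow" / "wallet.json"


def resolve_wallet_path(env=None):
    """Return (path, mode) for the wallet file that would be used.

    Hypothetical helper for illustration; not part of ar-io-mlflow.
    """
    env = os.environ if env is None else env
    configured = env.get("ARIO_MLFLOW_ARWEAVE_WALLET", "").strip()
    if configured:
        # An explicitly configured wallet always wins.
        return Path(configured).expanduser(), "configured"
    # Nothing configured: fall back to the reusable default wallet path.
    return DEFAULT_WALLET, "generated-default"


if __name__ == "__main__":
    path, mode = resolve_wallet_path()
    print(f"wallet: {path} ({mode})")
```

A check like this can run at deploy time to confirm that a production environment is not silently relying on the default wallet.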
Install
Install ar-io-mlflow from source.
git clone https://github.com/ar-io/ar-io-mlflow.git
cd ar-io-mlflow
pip install -e .
For the quickstart example, install scikit-learn as well:
pip install scikit-learn
Anchor a training run
Configure MLflow
Point MLflow at the tracking store you want to use. This can be a local folder for development or your normal remote tracking URI.
from pathlib import Path
import mlflow
tracking_dir = Path("./mlruns").resolve()
mlflow.set_tracking_uri(tracking_dir.as_uri())
mlflow.set_experiment("verifiable-ai")
Log a dataset and model
Log the dataset through MLflow so the plugin can include dataset provenance in the training proof. The dataset proof commits to the dataset name, source, digest, and schema hash, not to the row contents.
import mlflow.data
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)
with mlflow.start_run() as run:
    dataset = mlflow.data.from_numpy(
        X_train,
        targets=y_train,
        source="https://archive.ics.uci.edu/dataset/53/iris",
        name="iris-train",
    )
    mlflow.log_input(dataset, context="training")
    model = LogisticRegression(max_iter=200).fit(X_train, y_train)
    mlflow.log_params({"max_iter": 200, "random_state": 42})
    mlflow.log_metric("accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, name="model")
Anchor the proof
Call ario_mlflow.anchor() before the run exits. The call hashes model artifacts, writes ario/payload.json, signs the envelope, uploads it through Turbo, and writes ario.* tags back to MLflow.
import ario_mlflow
with mlflow.start_run() as run:
    # Fit model, log params, log metrics, log dataset, log model...
    result = ario_mlflow.anchor(
        metadata={"service_name": "credit-risk-training"},
    )
print("run_id:", run.info.run_id)
print("payload_hash:", result["payload_hash"])
print("training_tx:", result["tags"].get("ario.training_tx"))
print("verify_status:", result["tags"]["ario.verify_status"])
If the Arweave upload fails, the run still succeeds and the envelope remains signed locally. In that case, ario.verify_status is signed and ario.training_tx is absent.
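Because a failed upload does not fail the run, callers may want to check the returned tags before treating a run as anchored. The sketch below classifies a tag dictionary using only the two tags described above; the function name and the returned labels are hypothetical, and the status value written for a successfully anchored run is not assumed here.

```python
def anchoring_state(tags):
    """Classify a run from its ario.* tags.

    Hypothetical helper for illustration; not part of ar-io-mlflow.
    """
    if tags.get("ario.training_tx"):
        # A transaction ID is only present once the envelope was uploaded.
        return "anchored"
    if tags.get("ario.verify_status") == "signed":
        # Signed locally, but no transaction ID: the upload did not complete.
        return "signed-only"
    return "unknown"


if __name__ == "__main__":
    # Shape of result["tags"] after a failed upload, per the text above.
    print(anchoring_state({"ario.verify_status": "signed"}))
```

In a training script, `anchoring_state(result["tags"])` could decide whether to log a warning or queue the run for a later retry.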
Verify the run later
Run the CLI against the same MLflow tracking store.
MLFLOW_TRACKING_URI=file:///absolute/path/to/mlruns \
ar-io-mlflow verify run <run_id>
Verification checks that the envelope exists on ar.io, the MLflow payload still hashes to the anchored commitment, the live MLflow run still re-derives the same canonical bytes, and the Ed25519 signature is valid.
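The hash comparison at the center of these checks can be illustrated without the plugin. The sketch below assumes a canonical form of UTF-8 JSON with sorted keys and compact separators, hashed with SHA-256; the plugin's actual canonicalization and hash function are not specified in this guide, so treat both as assumptions. Signature verification and the Arweave lookup are left out.

```python
import hashlib
import json


def canonical_bytes(payload):
    # Assumed canonical form: sorted keys, no insignificant whitespace, UTF-8.
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def payload_hash(payload):
    # Assumed hash function: SHA-256 over the canonical bytes.
    return hashlib.sha256(canonical_bytes(payload)).hexdigest()


def matches_commitment(payload, anchored_hash):
    """True if the payload still hashes to the value committed at training time."""
    return payload_hash(payload) == anchored_hash


if __name__ == "__main__":
    recorded = {"params": {"max_iter": "200"}, "metrics": {"accuracy": 0.97}}
    commitment = payload_hash(recorded)

    # Key order does not matter, because canonicalization sorts keys.
    reordered = {"metrics": {"accuracy": 0.97}, "params": {"max_iter": "200"}}
    print(matches_commitment(reordered, commitment))

    # Any edited value changes the hash and breaks the match.
    tampered = {"params": {"max_iter": "200"}, "metrics": {"accuracy": 0.99}}
    print(matches_commitment(tampered, commitment))
```

The point of canonicalization is that two records with the same content always produce the same bytes, so a mismatch signals a real change rather than a formatting difference.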
Standalone dataset proofs
Dataset publishers can anchor a dataset proof without an active training run and hand the transaction ID to downstream teams.
import mlflow.data
import ario_mlflow
# df is an existing pandas DataFrame that holds the dataset to describe.
dataset = mlflow.data.from_pandas(
    df,
    source="s3://example-bucket/training/q1.parquet",
    name="credit-risk-q1",
)
result = ario_mlflow.anchor(dataset=dataset)
print(result["tx_id"])
This pattern is useful when a data platform team publishes approved datasets and model teams consume them later. The proof records an immutable commitment to the dataset descriptor, while the source data remains in S3, a lakehouse, or another controlled system.
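To make the idea of a descriptor-only commitment concrete, the sketch below builds a descriptor from the four fields named earlier (name, source, digest, and schema hash) and shows that no row values enter it. The field names, the schema encoding, and the use of SHA-256 are assumptions for illustration; the plugin's real descriptor layout may differ.

```python
import hashlib
import json


def schema_hash(columns):
    """Hash column names and dtypes only; row values are never read.

    `columns` maps column name to a dtype string. The encoding is an assumption.
    """
    encoded = json.dumps(sorted(columns.items()), separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def dataset_descriptor(name, source, digest, columns):
    # Hypothetical descriptor layout with the four fields the proof commits to.
    return {
        "name": name,
        "source": source,
        "digest": digest,
        "schema_hash": schema_hash(columns),
    }


if __name__ == "__main__":
    descriptor = dataset_descriptor(
        name="credit-risk-q1",
        source="s3://example-bucket/training/q1.parquet",
        digest="placeholder-digest",  # In practice, the digest MLflow computed for the dataset.
        columns={"age": "int64", "income": "float64"},
    )
    print(json.dumps(descriptor, indent=2))
```

A downstream team that receives the transaction ID can compare its own descriptor for the same dataset against the anchored one without ever seeing the publisher's rows.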
What gets written
On the MLflow run, the plugin writes tags such as:
- ario.enabled
- ario.version
- ario.public_key
- ario.verify_status
- ario.artifact_hash
- ario.payload_hash
- ario.training_tx
- ario.arweave_url
- ario.wallet_mode
It also writes ario/payload.json as the canonical payload artifact. Arweave receives only the compact signed envelope, usually hundreds of bytes rather than the source data or model artifact.
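When auditing a run, it can help to separate the plugin's tags from everything else MLflow stores and to see which of the tags listed above are missing. The helpers below are a minimal sketch that works on a plain tag dictionary; the function names are hypothetical and the tag list is copied from this section.

```python
# Tag names listed in this section.
EXPECTED_TAGS = (
    "ario.enabled",
    "ario.version",
    "ario.public_key",
    "ario.verify_status",
    "ario.artifact_hash",
    "ario.payload_hash",
    "ario.training_tx",
    "ario.arweave_url",
    "ario.wallet_mode",
)


def ario_tags(tags):
    """Keep only the tags in the ario. namespace."""
    return {key: value for key, value in tags.items() if key.startswith("ario.")}


def missing_tags(tags, expected=EXPECTED_TAGS):
    """List expected tag names that are not present on the run."""
    return [name for name in expected if name not in tags]


if __name__ == "__main__":
    run_tags = {
        "mlflow.user": "trainer",
        "ario.enabled": "true",
        "ario.verify_status": "signed",
    }
    print(ario_tags(run_tags))
    print(missing_tags(run_tags))
```

With MLflow installed, the input dictionary can come from `mlflow.tracking.MlflowClient().get_run(run_id).data.tags`. A missing ario.training_tx is expected when the upload failed, so an absent tag is a prompt to investigate rather than an error in itself.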
Production notes
- Use a dedicated wallet per environment through ARIO_MLFLOW_ARWEAVE_WALLET.
- Set ARIO_MLFLOW_SIGNING_KEY explicitly if you need controlled key rotation.
- Configure ARIO_MLFLOW_GATEWAYS with at least two gateways for fetch fallback.
- Monitor runs where ario.verify_status = signed, because those were signed but not anchored.
- Keep MLflow artifacts backed up independently. Arweave preserves the envelope, but MLflow still holds the canonical payload used for full record matching.
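The reason for configuring at least two gateways is that a fetch can move on to the next one when the first is unavailable. The sketch below shows that fallback loop in isolation. It assumes ARIO_MLFLOW_GATEWAYS holds a comma-separated list, which this guide does not confirm, and it takes the fetch function as a parameter so the logic can be exercised without network access; the function names and the example hosts are hypothetical.

```python
def parse_gateways(raw):
    # Assumption: a comma-separated list; the plugin's real format may differ.
    return [item.strip().rstrip("/") for item in raw.split(",") if item.strip()]


def fetch_with_fallback(gateways, tx_id, fetch):
    """Try each gateway in order and return (gateway, body) from the first success.

    `fetch` is any callable that takes a URL and returns the response body,
    raising an exception on failure.
    """
    errors = []
    for gateway in gateways:
        try:
            return gateway, fetch(f"{gateway}/{tx_id}")
        except Exception as exc:
            # Record the failure and move on to the next gateway.
            errors.append(f"{gateway}: {exc}")
    raise RuntimeError("all gateways failed: " + "; ".join(errors))


if __name__ == "__main__":
    def fake_fetch(url):
        # Stand-in for an HTTP client: the first gateway is treated as down.
        if "gateway-a" in url:
            raise ConnectionError("unreachable")
        return b"signed-envelope"

    gateways = parse_gateways("https://gateway-a.example/, https://gateway-b.example")
    print(fetch_with_fallback(gateways, "TX_ID", fake_fetch))
```

With a single gateway configured, the loop has nothing to fall back to, which is why the note above recommends two or more.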
Next steps
After training proofs are anchored, continue to Model and Decision Proofs to anchor registry events and enforce artifact integrity before inference.