Versioning Data and Models (DVC, MLflow)

Machine learning systems are not defined by code alone. They are defined by the combination of code, data, features, hyperparameters, environment, model artifacts, evaluation results, and deployment lineage. Versioning these components is essential for reproducibility, auditability, experimentation, rollback, and operational trust. This whitepaper explains the technical foundations of data and model versioning, with a special focus on DVC and MLflow.

Abstract

In traditional software engineering, source code version control systems such as Git make it possible to track changes, reproduce releases, collaborate safely, and revert mistakes. Machine learning systems, however, introduce additional stateful assets that are often much larger and more dynamic than code: datasets, feature tables, model checkpoints, experiment metrics, training configurations, and deployment metadata. This paper explains why code-only versioning is insufficient for ML, and how modern MLOps practices extend version control to data and models. It covers concepts such as lineage, reproducibility, experiment tracking, artifact registries, immutable references, staged promotion, and governance. It then explains how DVC supports data and pipeline versioning, and how MLflow supports experiment tracking, model packaging, and model registry workflows.

1. Introduction

Let a trained model artifact be denoted by M. In a realistic ML system, M is a function not only of source code, but of:

training dataset D
feature transformation logic φ
hyperparameters λ
random seed s
training code version C
runtime environment E

Conceptually, the resulting model may be written as: M = Train(D, φ, λ, s, C, E).

If any of these inputs change, the resulting model may differ. This is why robust ML systems require versioning beyond code alone.
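The dependence of M on all six inputs can be made concrete by fingerprinting the full training specification. The following is a minimal sketch, not a feature of DVC or MLflow; the `TrainSpec` class, its field names, and the example values are hypothetical and chosen only for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class TrainSpec:
    """Hypothetical record of the inputs (D, phi, lambda, s, C, E)."""
    dataset_hash: str       # D: content hash of the training data snapshot
    feature_version: str    # phi: version of the feature transformation logic
    hyperparameters: tuple  # lambda: (name, value) pairs
    seed: int               # s: random seed
    code_commit: str        # C: training code version
    environment: str        # E: for example a lock-file hash or image tag

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys) so that equal specs hash equally.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Illustrative values only.
base = TrainSpec("d41d8c", "features-v3", (("depth", 6), ("lr", 0.01)),
                 42, "a1b2c3d", "py311-lock-9f8e")
same = TrainSpec("d41d8c", "features-v3", (("depth", 6), ("lr", 0.01)),
                 42, "a1b2c3d", "py311-lock-9f8e")
reseeded = replace(base, seed=43)  # change exactly one input

print(base.fingerprint() == same.fingerprint())      # identical specs match
print(base.fingerprint() == reseeded.fingerprint())  # one changed input does not
```

Storing such a fingerprint next to the model artifact gives a single value that changes whenever any tracked input changes.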

2. Why Code Versioning Alone Is Not Enough

Git works extremely well for source code and small configuration files, but machine learning introduces additional challenges:

datasets are large and change over time
model files can be large binary artifacts
metrics and parameters from experiments must be compared systematically
the same code can produce different models under different data or hyperparameters
deployed models need stage control and rollback logic

Therefore, ML versioning requires artifacts and metadata that Git alone does not manage conveniently.

3. Reproducibility as a Versioning Objective

One of the primary goals of data and model versioning is reproducibility. Ideally, given a reference to a prior run, one should be able to reconstruct:

which code version was used
which exact dataset snapshot was used
which hyperparameters were used
which model artifact was produced
which metrics were observed

If two independent runs use the same full specification (D, φ, λ, s, C, E) and produce models M_1 and M_2, then reproducibility seeks: M_1 ≈ M_2, where M_1 = Train(D, φ, λ, s, C, E) and M_2 = Train(D, φ, λ, s, C, E).

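In practice, exact equivalence is not always achievable, because hardware differences and numerical nondeterminism (for example, the order of parallel floating-point reductions) can perturb results. Strong versioning nevertheless greatly reduces uncertainty about what produced a given model.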
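One controllable source of variation is the random seed s. The toy sketch below assumes a one-parameter model fitted by stochastic gradient descent; the `train` function and the data are hypothetical. It routes all randomness through a local seeded generator, so two runs with the same inputs agree exactly on the same interpreter and platform.

```python
import random


def train(data, learning_rate, epochs, seed):
    """Toy SGD fit of y = w * x; all randomness flows from `seed`."""
    rng = random.Random(seed)   # local generator, no hidden global state
    w = rng.uniform(-1.0, 1.0)  # seeded initialisation
    examples = list(data)
    for _ in range(epochs):
        rng.shuffle(examples)   # seeded shuffling of the training order
        for x, y in examples:
            gradient = 2.0 * (w * x - y) * x
            w -= learning_rate * gradient
    return w


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
run_a = train(data, learning_rate=0.01, epochs=20, seed=7)
run_b = train(data, learning_rate=0.01, epochs=20, seed=7)
print(run_a == run_b)  # same specification, same result
```

Real training stacks have more sources of randomness than this sketch (data loaders, GPU kernels, library defaults), which is why the seed is only one element of the full specification.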

4. Lineage

Lineage is the traceable chain connecting inputs, transformations, outputs, and downstream usage. For a model artifact M, lineage should answer:

which dataset version produced it
which feature pipeline produced the training table
which training run and configuration produced the artifact
which evaluation metrics justified its promotion
which deployment endpoint or batch job uses it

Strong lineage supports debugging, audits, rollback, and regulatory accountability.
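The questions above amount to walking a graph from an artifact back to everything that produced it. The sketch below assumes lineage is stored as a simple mapping from each artifact to its direct inputs; the identifiers in `LINEAGE` are hypothetical.

```python
# Hypothetical lineage records: each artifact maps to the inputs that produced it.
LINEAGE = {
    "endpoint:churn-api": ["model:churn:v17"],
    "model:churn:v17": ["run:8f3a", "table:train:v5"],
    "run:8f3a": ["code:a1b2c3d", "config:lr=0.01"],
    "table:train:v5": ["dataset:raw:v12", "pipeline:features:v3"],
}


def upstream(artifact, lineage):
    """Return every ancestor of `artifact`, in the order it is discovered."""
    seen = []
    stack = list(lineage.get(artifact, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(lineage.get(node, []))
    return seen


for ancestor in upstream("endpoint:churn-api", LINEAGE):
    print(ancestor)
```

Reversing the mapping answers the opposite question: which endpoints or batch jobs are affected when a given dataset version turns out to be faulty.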

5. Data Versioning Basics

Data versioning means maintaining identifiable snapshots or references to datasets over time. Let the evolving dataset be: D^(1), D^(2), ..., D^(t).

A versioned ML pipeline should know exactly which dataset snapshot D^(t) was used in a given experiment.

Dataset changes may include:

new rows
corrected values
schema changes
new labels
feature recalculations

6. Model Versioning Basics

Model versioning means assigning persistent identifiers to trained model artifacts and their associated metadata. A model version should not be treated as just a filename. It should be tied to:

training run ID
dataset version
code commit
hyperparameters
environment specification
evaluation metrics
deployment stage

One may think of a model version as: V_M = (artifact, metadata, lineage, stage).

7. Immutable vs Mutable References

Good versioning systems distinguish immutable version identifiers from mutable aliases. For example:

immutable: exact model version 17, exact dataset hash, exact run ID
mutable: “production”, “staging”, “latest”, “candidate”

Immutable references are necessary for exact reproducibility. Mutable aliases are convenient for operational routing.
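The distinction can be sketched with a small in-memory registry in which version records are write-once and aliases can be re-pointed. This is an illustration under assumed names (`ModelVersion`, `Registry`), not the API of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    """Immutable record: fields cannot be reassigned after creation."""
    name: str
    version: int
    run_id: str
    dataset_hash: str


class Registry:
    def __init__(self):
        self._versions = {}  # (name, version) -> ModelVersion, write-once
        self._aliases = {}   # (name, alias) -> version number, re-pointable

    def register(self, mv):
        key = (mv.name, mv.version)
        if key in self._versions:
            raise ValueError(f"{mv.name} v{mv.version} already exists and is immutable")
        self._versions[key] = mv

    def set_alias(self, name, alias, version):
        if (name, version) not in self._versions:
            raise KeyError(f"unknown version {version} of {name}")
        self._aliases[(name, alias)] = version

    def resolve(self, name, alias):
        return self._versions[(name, self._aliases[(name, alias)])]


registry = Registry()
registry.register(ModelVersion("churn", 16, "run-7c21", "hash-aaa"))
registry.register(ModelVersion("churn", 17, "run-8f3a", "hash-bbb"))
registry.set_alias("churn", "production", 16)
registry.set_alias("churn", "production", 17)  # the alias moves, the versions do not
print(registry.resolve("churn", "production").version)  # 17
```

An experiment report should cite version 17 and its run ID; a serving system can follow the "production" alias.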

8. Experiment Tracking

Experiment tracking is the structured recording of training runs, including:

parameters
metrics
artifacts
notes or tags
source commit references

If run r produces metrics m_1, m_2, ..., m_k, experiment tracking stores: Run(r) = (λ, metrics, artifacts, lineage).

This enables comparison, selection, and reproducibility across many trials.

9. Artifact Storage

Model artifacts are often large binary objects such as:

serialized tree ensembles
deep learning checkpoints
tokenizers and preprocessors
feature encoders
evaluation reports

Versioning systems therefore typically store metadata in one place and large artifacts in dedicated remote storage such as object stores or artifact repositories.

10. Data Lineage and Feature Lineage

If the processed training table is: X = φ(D), then both the dataset version and the transformation version matter. Two training runs using the same raw data but different feature logic are not equivalent.

Therefore, versioning should include:

raw data snapshot reference
processing pipeline version
feature schema version
materialized feature table version if applicable
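The point that X = φ(D) depends on both D and φ can be shown with a small sketch. The two feature functions, the field names, and the `training_table_ref` layout are hypothetical; SHA-256 over canonical JSON stands in for whatever content-addressing scheme a real system uses.

```python
import hashlib
import json


def table_hash(rows):
    """Content hash of a small in-memory table (a list of dicts)."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def features_v1(raw):
    return [{"id": r["id"], "spend": r["spend"]} for r in raw]


def features_v2(raw):
    # Same raw data, different logic: spend is bucketed instead of kept raw.
    return [{"id": r["id"], "spend_bucket": int(r["spend"] // 100)} for r in raw]


raw = [{"id": 1, "spend": 250.0}, {"id": 2, "spend": 80.0}]
x_v1 = features_v1(raw)
x_v2 = features_v2(raw)

# One reference that pins all four items listed above.
training_table_ref = {
    "raw_data": table_hash(raw),
    "pipeline_version": "features_v2",
    "feature_schema": sorted(x_v2[0].keys()),
    "materialized_table": table_hash(x_v2),
}

print(table_hash(x_v1) == table_hash(x_v2))  # False: same D, different phi
```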

11. DVC: Data Version Control

DVC is a tool designed to bring Git-like principles to data, pipelines, and ML artifacts. It does not replace Git; rather, it complements Git by versioning large files and pipeline dependencies through lightweight metadata stored in the repository, while actual data is stored in local or remote storage.

11.1 Core Idea of DVC

DVC tracks data artifacts using metadata files and content hashes. Instead of storing large files directly in Git, DVC stores references to them. A file or directory version can therefore be identified by a content hash h(D).

Conceptually: Version(D) = hash(contents(D)).

If the underlying data changes, the hash changes, and the version reference changes.
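The idea Version(D) = hash(contents(D)) can be sketched with the standard library. The example below uses SHA-256 and a hypothetical file name as assumptions for illustration; it does not reproduce DVC's own hash algorithm or metadata file format.

```python
import hashlib
import os
import tempfile


def file_hash(path, chunk_size=1 << 20):
    """Hash file contents in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "train.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write("id,label\n1,0\n2,1\n")
    version_before = file_hash(path)

    with open(path, "a", encoding="utf-8") as f:
        f.write("3,1\n")             # a new row changes the contents...
    version_after = file_hash(path)  # ...and therefore the version reference

print(version_before != version_after)
```

Because the reference is derived from content rather than from a file name or timestamp, two copies of the same data resolve to the same version, and any silent modification is detectable.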

11.2 DVC and Remote Storage

Actual data artifacts are usually stored in remote locations such as:

S3-compatible object stores
Azure Blob Storage
Google Cloud Storage
shared filesystems

Git stores the lightweight metadata pointers; DVC uploads the large artifacts to remote storage and retrieves them from it on demand.

11.3 DVC Pipeline Stages

DVC can also version pipelines as stages with explicit dependencies and outputs. A simplified stage representation includes:

command to run
input dependencies
output artifacts

If stage S transforms input A into output B, DVC records: S : A → B.

When dependencies change, downstream stages can be recomputed systematically.
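In DVC, stages are declared in a `dvc.yaml` file with `cmd`, `deps`, and `outs` entries and re-executed with `dvc repro`. The recomputation rule itself can be sketched independently of the tool. The stage names and file names below are hypothetical, and the function is an illustration of the idea rather than DVC's implementation.

```python
# Hypothetical three-stage pipeline: each stage lists what it reads and writes.
STAGES = {
    "prepare": {"deps": ["raw.csv", "prepare.py"], "outs": ["clean.csv"]},
    "featurize": {"deps": ["clean.csv", "features.py"], "outs": ["table.csv"]},
    "train": {"deps": ["table.csv", "train.py"], "outs": ["model.pkl"]},
}


def stale_stages(stages, changed):
    """Return the stages that must rerun, given the set of changed files.

    A stage is stale if any dependency changed, or if a dependency is
    produced by another stale stage.
    """
    dirty = set(changed)
    stale = []
    progress = True
    while progress:  # repeat until no new stage is marked stale
        progress = False
        for name, stage in stages.items():
            if name not in stale and dirty.intersection(stage["deps"]):
                stale.append(name)
                dirty.update(stage["outs"])  # its outputs are now suspect too
                progress = True
    return stale


print(stale_stages(STAGES, {"features.py"}))  # ['featurize', 'train']
```

Editing only the feature code leaves the `prepare` stage untouched, so its cached output can be reused.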

11.4 Reproducibility with DVC

Because DVC tracks both the data artifact references and the pipeline structure, it becomes possible to recreate previous training or processing states as long as the referenced data and code remain available.

12. MLflow

MLflow is a platform for managing the machine learning lifecycle, especially:

experiment tracking
artifact logging
model packaging
model registry workflows

While DVC is particularly strong in data and pipeline versioning, MLflow is particularly strong in run tracking and model lifecycle management.

13. MLflow Tracking

MLflow Tracking records each experiment run as a structured entity containing:

parameters
metrics
tags
artifacts

If run r uses hyperparameters λ and produces metric vector m, then MLflow conceptually stores: Run(r) = (λ, m, artifacts, tags).

This makes it easy to compare experiments across many runs.

14. Parameter and Metric Logging

During training, one may log:

learning rate
regularization strength
tree depth
batch size
accuracy, loss, F1, AUC, RMSE, or other metrics

If validation F1 for run r is F1(r), a selection step may look for: r^* = argmax_r F1(r).
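With MLflow, values like these are typically recorded through calls such as `mlflow.log_param` and `mlflow.log_metric` inside an active run. The selection step r^* = argmax_r F1(r) can be sketched without the library; the run records below are hypothetical and use plain dictionaries.

```python
# Hypothetical run records, shaped like Run(r) = (params, metrics).
runs = [
    {"run_id": "r1", "params": {"lr": 0.10, "depth": 4}, "metrics": {"f1": 0.81}},
    {"run_id": "r2", "params": {"lr": 0.05, "depth": 6}, "metrics": {"f1": 0.86}},
    {"run_id": "r3", "params": {"lr": 0.01, "depth": 8}, "metrics": {"f1": 0.84}},
]


def best_run(runs, metric):
    """Return the run with the highest value of `metric` (argmax over runs)."""
    scored = [r for r in runs if metric in r["metrics"]]
    if not scored:
        raise ValueError(f"no run logged metric {metric!r}")
    return max(scored, key=lambda r: r["metrics"][metric])


print(best_run(runs, "f1")["run_id"])  # r2
```

Selecting on a validation metric, not a test metric, keeps the final evaluation unbiased.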

15. MLflow Artifacts

MLflow also stores artifacts such as:

trained model files
plots
confusion matrices
feature importance reports
prediction samples
evaluation JSON files

These are associated with the run so that outputs remain organized and auditable.

16. MLflow Models

MLflow defines a packaging format for models that includes both the model artifact and metadata describing how it can be served or loaded. This helps standardize deployment across different model flavors and runtimes.

A packaged model may be thought of as: M = (weights, flavor, environment, signature, metadata).

17. MLflow Model Registry

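The MLflow Model Registry provides lifecycle management for versioned models. A registered model may have multiple versions, and each version can be assigned a stage:

None (the default for a newly registered version, often treated as a candidate)
Staging
Production
Archived

This helps teams manage approval flows, rollback, and operational state transitions. Recent MLflow releases deprecate these fixed stages in favor of model version aliases and tags, so the mechanism available depends on the MLflow version in use.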

18. DVC vs MLflow

DVC and MLflow solve overlapping but distinct problems.

18.1 DVC Strengths

data versioning
pipeline dependency tracking
Git-integrated reproducibility
large artifact references via remote storage

18.2 MLflow Strengths

experiment tracking
metric comparison across runs
artifact logging
model packaging and registry workflows

18.3 Complementary Usage

In practice, many teams use them together. DVC can track datasets and processing pipeline versions, while MLflow can track training runs, metrics, and model promotion history.

19. Versioning Granularity

Effective ML versioning often requires separate but connected identifiers for:

raw dataset snapshot
processed dataset snapshot
feature definition version
training code commit
experiment run ID
model version ID
deployment version

Granularity matters because different failure modes require rollback at different layers.

20. Model Promotion and Governance

Not every trained model should become a deployed model. A robust workflow uses promotion gates such as:

metric thresholds
bias or fairness checks
latency tests
drift simulation
human approval

If candidate model M_new must outperform current production model M_prod, one may require: Score(M_new) > Score(M_prod) + δ, where δ is a meaningful margin.
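A promotion gate of this kind can be sketched as a function that returns both a decision and the reasons for it. The function name, the dictionary keys, and the threshold values below are hypothetical; a real gate would add the fairness, drift, and approval checks listed above.

```python
def should_promote(candidate, production, delta, max_latency_ms):
    """Apply Score(M_new) > Score(M_prod) + delta, plus a latency budget."""
    reasons = []
    if not candidate["score"] > production["score"] + delta:
        reasons.append("score margin not met")
    if candidate["latency_ms"] > max_latency_ms:
        reasons.append("latency budget exceeded")
    return (len(reasons) == 0, reasons)


production = {"score": 0.860, "latency_ms": 40}
marginal = {"score": 0.865, "latency_ms": 35}  # better, but within the margin
strong = {"score": 0.880, "latency_ms": 35}

print(should_promote(marginal, production, delta=0.01, max_latency_ms=50))
print(should_promote(strong, production, delta=0.01, max_latency_ms=50))
```

Returning the reasons alongside the decision gives the audit trail something concrete to record when a candidate is rejected.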

21. Rollback

Versioning is only truly valuable if rollback is possible. When a deployed model underperforms or causes operational issues, teams should be able to return quickly to a previous trusted version: M_prod := M_old.

This requires that old artifacts, metadata, and compatible preprocessing logic all remain accessible.
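The assignment M_prod := M_old presupposes that the system remembers which version was trusted before. A minimal sketch, assuming a hypothetical `ProductionPointer` class that keeps the promotion history in memory:

```python
class ProductionPointer:
    """Tracks which model version serves production and keeps the history."""

    def __init__(self, initial_version):
        self._history = [initial_version]

    @property
    def current(self):
        return self._history[-1]

    def promote(self, version):
        self._history.append(version)

    def rollback(self):
        """Return to the previously trusted version (M_prod := M_old)."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self.current


pointer = ProductionPointer(16)
pointer.promote(17)
print(pointer.current)     # 17
print(pointer.rollback())  # 16
```

The pointer only records which version to serve. The rollback succeeds in practice only if the artifact for version 16 and its matching preprocessing code were never overwritten.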

22. Environment Versioning

Reproducing a model requires more than code and data. It also requires environment consistency:

Python version
library versions
CUDA or hardware dependencies
OS-level dependencies

If the runtime environment is denoted by E, then model reproducibility depends on versioning E alongside code and data.
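Part of E can be captured automatically at training time. The sketch below uses only the Python standard library; the `capture_environment` function and its dictionary layout are assumptions for illustration, and it does not record CUDA or other hardware-level details, which need tool-specific queries.

```python
import json
import platform
from importlib import metadata


def capture_environment():
    """Collect a snapshot of the runtime environment E as a plain dict."""
    packages = {}
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
            version = dist.version
        except Exception:
            continue  # skip distributions whose metadata cannot be read
        if name:
            packages[name] = version
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


env = capture_environment()
summary = {key: value for key, value in env.items() if key != "packages"}
print(json.dumps(summary, indent=2))
print(len(env["packages"]), "installed distributions recorded")
```

Logging this dictionary as a run artifact makes it possible to see later whether two runs differed in a library version rather than in code or data.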

23. Auditability

In regulated or high-stakes environments, teams may need to answer questions such as:

which data trained the deployed model on a given date
what metrics justified promotion
who approved the promotion
which feature transformations were active

Good versioning systems make such audit trails possible.

24. Common Failure Modes Without Proper Versioning

unable to reproduce a model that is already in production
unclear difference between two seemingly similar runs
data drift unnoticed because dataset snapshots were not tracked
inference errors because preprocessing version mismatched training
rollback impossible because artifacts were overwritten
team confusion about which model is authoritative

25. Practical MLOps Pattern

A practical versioning workflow often looks like this:

Git tracks code and lightweight config
DVC tracks datasets and pipeline outputs
MLflow tracks training runs and model artifacts
Model registry manages promotion and deployment stages

This creates a layered, traceable lifecycle from raw data to production model.

26. Evaluation Metrics and Versioned Comparison

Model versioning is only useful if versions can be compared objectively. Standard metrics include: Accuracy = (TP + TN)/(TP + TN + FP + FN), Precision = TP/(TP + FP), Recall = TP/(TP + FN), F1 = 2(Precision × Recall)/(Precision + Recall), and regression metrics such as RMSE.
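The formulas above translate directly into code. The sketch below computes them from confusion-matrix counts; returning 0.0 when a denominator is zero is a convention assumed here, not something the formulas define, and the example counts are made up.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # assumed convention
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # assumed convention
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def rmse(y_true, y_pred):
    """Root mean squared error for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(round(metrics["accuracy"], 3), round(metrics["f1"], 3))  # 0.85 0.842
print(round(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), 4))        # 1.1547
```

Computing every version's metrics with the same function on the same evaluation snapshot is what makes the comparison between versions meaningful.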

Comparison should also include:

latency
model size
resource cost
fairness metrics
robustness metrics

27. Strengths of DVC

strong Git-compatible data versioning workflow
pipeline dependency tracking
reproducible artifact retrieval
suitable for teams already using Git-centric workflows

28. Strengths of MLflow

excellent experiment tracking
strong run comparison and metric logging
model packaging standardization
clear model registry workflow for staging and production

29. Limitations and Trade-Offs

DVC is not a full model registry by itself
MLflow does not replace full data versioning systems
versioning discipline still depends on process design and team practice
large-scale lineage across many systems may require broader platform integration

30. Best Practices

Version data, code, features, environment, and model artifacts together, not in isolation.
Use immutable identifiers for reproducibility and mutable aliases only for operational convenience.
Track experiment parameters and metrics for every serious run.
Use a model registry for promotion workflows and rollback readiness.
Ensure feature logic used in inference is version-aligned with training.
Preserve lineage from source data to deployed endpoint.

31. Conclusion

Versioning data and models is foundational to modern machine learning engineering because an ML system is defined by much more than source code. Training data, feature logic, hyperparameters, environments, artifacts, and evaluation outputs all shape the final model and must be tracked if the system is to be reproducible, auditable, and maintainable.

DVC and MLflow address different but complementary parts of this challenge. DVC extends version control principles to large data artifacts and pipeline dependencies. MLflow supports experiment tracking, model packaging, and registry workflows for controlled promotion. Together, they help transform ML development from an ad hoc experimental activity into a disciplined lifecycle with lineage, comparison, rollback, and governance. In modern MLOps, versioning is not optional infrastructure; it is one of the core conditions for trustworthy machine learning.