Versioning Data and Models (DVC, MLflow)

Machine learning systems are not defined by code alone. They are defined by the combination of code, data, features, hyperparameters, environment, model artifacts, evaluation results, and deployment lineage. Versioning these components is essential for reproducibility, auditability, experimentation, rollback, and operational trust. This whitepaper explains the technical foundations of data and model versioning, with a special focus on DVC and MLflow.

Abstract

In traditional software engineering, source code version control systems such as Git make it possible to track changes, reproduce releases, collaborate safely, and revert mistakes. Machine learning systems, however, introduce additional stateful assets that are often much larger and more dynamic than code: datasets, feature tables, model checkpoints, experiment metrics, training configurations, and deployment metadata. This paper explains why code-only versioning is insufficient for ML, and how modern MLOps practices extend version control to data and models. It covers concepts such as lineage, reproducibility, experiment tracking, artifact registries, immutable references, staged promotion, and governance. It then explains how DVC supports data and pipeline versioning, and how MLflow supports experiment tracking, model packaging, and model registry workflows.

1. Introduction

Let a trained model artifact be denoted by M. In a realistic ML system, M is a function not only of source code, but of:

  • training dataset D
  • feature transformation logic φ
  • hyperparameters λ
  • random seed s
  • training code version C
  • runtime environment E

Conceptually, the resulting model may be written as: M = Train(D, φ, λ, s, C, E).

If any of these inputs change, the resulting model may differ. This is why robust ML systems require versioning beyond code alone.
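One way to make the dependence on (D, φ, λ, s, C, E) concrete is to reduce the whole training specification to a single fingerprint. The following is a minimal sketch, not a feature of DVC or MLflow: the function name `spec_fingerprint` and every value in `spec` are hypothetical placeholders chosen for illustration.

```python
import hashlib
import json


def spec_fingerprint(spec: dict) -> str:
    """Return a SHA-256 digest of a training specification."""
    # sort_keys yields a canonical serialization, so key order does not matter.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical references standing in for D, phi, lambda, s, C, and E.
spec = {
    "data": "sha256:placeholder-dataset-digest",
    "features": "features-v3",
    "hyperparameters": {"learning_rate": 0.1, "max_depth": 6},
    "seed": 42,
    "code": "git:placeholder-commit",
    "environment": "python-3.11/requirements-lock-v5",
}

fp_a = spec_fingerprint(spec)
# Same content in a different key order gives the same fingerprint.
fp_b = spec_fingerprint(dict(reversed(list(spec.items()))))
# Changing any single input (here the seed) gives a different fingerprint.
fp_c = spec_fingerprint({**spec, "seed": 43})
```

Two runs with equal fingerprints were requested with the same inputs; a differing fingerprint signals that at least one input changed, although it does not say which one.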

2. Why Code Versioning Alone Is Not Enough

Git works extremely well for source code and small configuration files, but machine learning introduces additional challenges:

  • datasets are large and change over time
  • model files can be large binary artifacts
  • metrics and parameters from experiments must be compared systematically
  • the same code can produce different models under different data or hyperparameters
  • deployed models need stage control and rollback logic

Therefore, ML versioning requires artifacts and metadata that Git alone does not manage conveniently.

3. Reproducibility as a Versioning Objective

One of the primary goals of data and model versioning is reproducibility. Ideally, given a reference to a prior run, one should be able to reconstruct:

  • which code version was used
  • which exact dataset snapshot was used
  • which hyperparameters were used
  • which model artifact was produced
  • which metrics were observed

If two runs use the same full specification (D, φ, λ, s, C, E) and produce models M_1 and M_2, then reproducibility seeks: M_1 ≈ M_2, where both M_1 and M_2 are outputs of Train(D, φ, λ, s, C, E).

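In practice, exact equality can be broken by hardware differences and nondeterministic numerical operations such as parallel floating-point reductions, which is why the relation is approximate. Strong versioning removes most of the remaining sources of variation.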
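The role of the seed s can be illustrated with a toy training loop. This is a sketch under simplifying assumptions: the `train` function, the data, and the learning rate are all invented for the example, and it runs on a single CPU thread, where pure-Python arithmetic is deterministic.

```python
import random


def train(data, learning_rate, seed):
    """Toy 'training': fit y = w * x by stochastic gradient descent."""
    rng = random.Random(seed)  # local generator, isolated from global state
    w = rng.uniform(-1.0, 1.0)  # seed-dependent initialization
    order = list(range(len(data)))
    for _ in range(20):
        rng.shuffle(order)  # seed-dependent example order
        for i in order:
            x, y = data[i]
            w -= learning_rate * 2 * (w * x - y) * x
    return w


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]

w_a = train(data, learning_rate=0.01, seed=7)
w_b = train(data, learning_rate=0.01, seed=7)  # identical specification
w_c = train(data, learning_rate=0.01, seed=8)  # only the seed differs
```

Here `w_a` and `w_b` are identical because every input, including the seed, is fixed. `w_c` lands near the same solution but is not expected to match digit for digit, which is why an unrecorded seed is enough to make a run hard to reproduce exactly.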

4. Lineage

Lineage is the traceable chain connecting inputs, transformations, outputs, and downstream usage. For a model artifact M, lineage should answer:

  • which dataset version produced it
  • which feature pipeline produced the training table
  • which training run and configuration produced the artifact
  • which evaluation metrics justified its promotion
  • which deployment endpoint or batch job uses it

Strong lineage supports debugging, audits, rollback, and regulatory accountability.
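Lineage questions of this kind amount to walking a dependency graph backwards from an artifact. The sketch below assumes lineage is available as a simple mapping from each artifact to its direct inputs; the identifiers such as `model:churn:17` are hypothetical.

```python
# Each artifact maps to the artifacts it was directly derived from.
# All identifiers are hypothetical placeholders.
lineage = {
    "endpoint:churn-api": ["model:churn:17"],
    "model:churn:17": ["run:abc123"],
    "run:abc123": ["table:train-v5", "code:commit-placeholder"],
    "table:train-v5": ["dataset:raw-v9", "pipeline:features-v3"],
}


def upstream(node, graph):
    """Return every ancestor of node, depth-first, without duplicates."""
    seen = []
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(graph.get(current, []))
    return seen


# Everything the deployed endpoint ultimately depends on.
ancestors = upstream("endpoint:churn-api", lineage)
```

Starting from the endpoint, the traversal reaches the model version, the run, the training table, the raw dataset, the feature pipeline, and the code reference, which is the information an audit or a debugging session would need.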

5. Data Versioning Basics

Data versioning means maintaining identifiable snapshots or references to datasets over time. Let the evolving dataset be: D(1), D(2), ..., D(t).

A versioned ML pipeline should know exactly which dataset snapshot D(t) was used in a given experiment.

Dataset changes may include:

  • new rows
  • corrected values
  • schema changes
  • new labels
  • feature recalculations

6. Model Versioning Basics

Model versioning means assigning persistent identifiers to trained model artifacts and their associated metadata. A model version should not be treated as just a filename. It should be tied to:

  • training run ID
  • dataset version
  • code commit
  • hyperparameters
  • environment specification
  • evaluation metrics
  • deployment stage

One may think of a model version as: V_M = (artifact, metadata, lineage, stage).

7. Immutable vs Mutable References

Good versioning systems distinguish immutable version identifiers from mutable aliases. For example:

  • immutable: exact model version 17, exact dataset hash, exact run ID
  • mutable: “production”, “staging”, “latest”, “candidate”

Immutable references are necessary for exact reproducibility. Mutable aliases are convenient for operational routing.
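The distinction can be sketched with a frozen record for each version and an ordinary dictionary for aliases. This is an illustrative data model only; the class `ModelVersion`, the `resolve` helper, and all values are assumptions made for the example.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class ModelVersion:
    """Immutable record: once created, its fields cannot be reassigned."""
    version: int
    run_id: str
    dataset_hash: str


# Immutable versions, keyed by version number (hypothetical values).
versions = {
    17: ModelVersion(17, "run-a", "hash-a"),
    18: ModelVersion(18, "run-b", "hash-b"),
}

# Mutable aliases: names that can be repointed at any time.
aliases = {"production": 17, "staging": 18}


def resolve(ref):
    """Resolve an alias or a version number to a ModelVersion."""
    return versions[aliases.get(ref, ref)]


before = resolve("production")
aliases["production"] = 18  # repointing the alias changes what it resolves to
after = resolve("production")

try:
    before.run_id = "tampered"  # immutable records reject modification
    mutated = True
except FrozenInstanceError:
    mutated = False
```

A reproducibility report should therefore record version 17 rather than the name "production", because the alias may refer to a different model by the time the report is read.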

8. Experiment Tracking

Experiment tracking is the structured recording of training runs, including:

  • parameters
  • metrics
  • artifacts
  • notes or tags
  • source commit references

If run r produces metrics m_1, m_2, ..., m_k, experiment tracking stores: Run(r) = (λ, metrics, artifacts, lineage).

This enables comparison, selection, and reproducibility across many trials.

9. Artifact Storage

Model artifacts are often large binary objects such as:

  • serialized tree ensembles
  • deep learning checkpoints
  • tokenizers and preprocessors
  • feature encoders
  • evaluation reports

Versioning systems therefore typically store metadata in one place and large artifacts in dedicated remote storage such as object stores or artifact repositories.

10. Data Lineage and Feature Lineage

If the processed training table is: X = φ(D), then both the dataset version and the transformation version matter. Two training runs using the same raw data but different feature logic are not equivalent.

Therefore, versioning should include:

  • raw data snapshot reference
  • processing pipeline version
  • feature schema version
  • materialized feature table version if applicable

11. DVC: Data Version Control

DVC is a tool designed to bring Git-like principles to data, pipelines, and ML artifacts. It does not replace Git; rather, it complements Git by versioning large files and pipeline dependencies through lightweight metadata stored in the repository, while actual data is stored in local or remote storage.

11.1 Core Idea of DVC

DVC tracks data artifacts using metadata files and content hashes. Instead of storing large files directly in Git, DVC stores references to them. A file or directory version can therefore be identified by a content hash h(D).

Conceptually: Version(D) = hash(contents(D)).

If the underlying data changes, the hash changes, and the version reference changes.
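The idea Version(D) = hash(contents(D)) can be sketched in a few lines. This is not DVC's implementation: DVC's metafiles have historically recorded an MD5 digest, whereas SHA-256 is used here as an arbitrary choice, and the file name and contents are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"

    data.write_text("id,label\n1,0\n2,1\n")
    v1 = content_hash(data)
    v1_again = content_hash(data)  # unchanged content, same version

    data.write_text("id,label\n1,0\n2,1\n3,1\n")  # one new row
    v2 = content_hash(data)  # changed content, new version
```

Because the identifier is derived from the bytes themselves, renaming or copying the file does not change its version, while appending a single row does.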

11.2 DVC and Remote Storage

Actual data artifacts are usually stored in remote locations such as:

  • S3-compatible object stores
  • Azure Blob Storage
  • Google Cloud Storage
  • shared filesystems

Git stores the metadata pointers; DVC stores or retrieves the large artifacts from remote storage.
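The split between a small pointer in Git and large content in remote storage can be sketched as a content-addressed store. The `push` and `pull` helpers, the directory layout, and the pointer dictionary below are illustrative assumptions, and a local temporary directory stands in for the remote.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def push(path: Path, remote: Path) -> str:
    """Copy a file into the store under its own digest and return the digest."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    target = remote / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        shutil.copyfile(path, target)
    return digest


def pull(digest: str, remote: Path, dest: Path) -> Path:
    """Restore the content identified by digest to dest."""
    shutil.copyfile(remote / digest[:2] / digest[2:], dest)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    workspace, remote = Path(tmp) / "workspace", Path(tmp) / "remote"
    workspace.mkdir()
    remote.mkdir()

    original = workspace / "train.csv"
    original.write_bytes(b"id,label\n1,0\n2,1\n")

    # The pointer is the only thing that would be committed to Git.
    pointer = {"path": "train.csv", "sha256": push(original, remote)}

    original.unlink()  # simulate a fresh clone that has only the pointer
    restored = pull(pointer["sha256"], remote, workspace / pointer["path"])
    restored_bytes = restored.read_bytes()
```

Checking out an older commit yields an older pointer, and pulling by that pointer's digest restores the matching data, provided the remote still holds it.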

11.3 DVC Pipeline Stages

DVC can also version pipelines as stages with explicit dependencies and outputs. A simplified stage representation includes:

  • command to run
  • input dependencies
  • output artifacts

If stage S transforms input A into output B, DVC records: S : A → B.

When dependencies change, downstream stages can be recomputed systematically.
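The recomputation rule can be sketched by having each stage remember the digest of its last input and rerun only when that digest changes. In DVC itself, stages are declared in `dvc.yaml` and re-executed with `dvc repro`; the `Stage` class and the two toy transformations below are hypothetical stand-ins written only to show the logic.

```python
import hashlib


class Stage:
    """A pipeline stage S : A -> B that reruns only when its input changes."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.last_input_digest = None
        self.output = None
        self.executions = 0

    def run(self, data: bytes) -> bytes:
        digest = hashlib.sha256(data).hexdigest()
        if digest != self.last_input_digest:  # dependency changed
            self.output = self.func(data)
            self.last_input_digest = digest
            self.executions += 1
        return self.output  # otherwise reuse the cached output


clean = Stage("clean", lambda raw: raw.strip().lower())
featurize = Stage("featurize", lambda text: text.replace(b",", b"|"))


def reproduce(raw: bytes) -> bytes:
    """Run the two-stage pipeline, skipping stages whose inputs are unchanged."""
    return featurize.run(clean.run(raw))


out1 = reproduce(b" A,B,C \n")  # first run: both stages execute
out2 = reproduce(b" A,B,C \n")  # nothing changed: both stages are skipped
out3 = reproduce(b" A,B,D \n")  # raw data changed: both stages rerun
out4 = reproduce(b"A,B,D")      # raw differs, cleaned output does not
```

In the last call the raw input differs only in whitespace, so `clean` reruns but produces the same output, and `featurize` is skipped. Tracking dependencies per stage is what allows this kind of partial recomputation.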

11.4 Reproducibility with DVC

Because DVC tracks both the data artifact references and the pipeline structure, it becomes possible to recreate previous training or processing states as long as the referenced data and code remain available.

12. MLflow

MLflow is a platform for managing the machine learning lifecycle, especially:

  • experiment tracking
  • artifact logging
  • model packaging
  • model registry workflows

While DVC is particularly strong in data and pipeline versioning, MLflow is particularly strong in run tracking and model lifecycle management.

13. MLflow Tracking

MLflow Tracking records each experiment run as a structured entity containing:

  • parameters
  • metrics
  • tags
  • artifacts

If run r uses hyperparameters λ and produces metric vector m, then MLflow conceptually stores: Run(r) = (λ, m, artifacts, tags).

This makes it easy to compare experiments across many runs.
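The structure Run(r) = (λ, m, artifacts, tags) can be sketched as a small in-memory tracker. This is a stand-in written for illustration, not MLflow code: the `Run` and `Tracker` classes are hypothetical. In MLflow itself the corresponding calls are `mlflow.start_run`, `mlflow.log_param`, `mlflow.log_metric`, `mlflow.set_tag`, and `mlflow.log_artifact`.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Run:
    """One training run: Run(r) = (params, metrics, artifacts, tags)."""
    run_id: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)  # name -> list of (step, value)
    artifacts: list = field(default_factory=list)
    tags: dict = field(default_factory=dict)


class Tracker:
    """Hypothetical in-memory stand-in for a tracking server."""

    def __init__(self):
        self.runs = {}

    def start_run(self, **tags):
        run = Run(run_id=uuid.uuid4().hex, tags=dict(tags))
        self.runs[run.run_id] = run
        return run

    def log_param(self, run, name, value):
        run.params[name] = value

    def log_metric(self, run, name, value, step=0):
        # Keep the full history so learning curves can be compared later.
        run.metrics.setdefault(name, []).append((step, value))


tracker = Tracker()
run = tracker.start_run(git_commit="placeholder-commit")
tracker.log_param(run, "learning_rate", 0.1)
for step, loss in enumerate([0.9, 0.6, 0.4]):
    tracker.log_metric(run, "loss", loss, step=step)
```

Storing each metric as a history of (step, value) pairs rather than a single number is a design choice that makes it possible to compare how runs converged, not only where they ended.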

14. Parameter and Metric Logging

During training, one may log:

  • learning rate
  • regularization strength
  • tree depth
  • batch size
  • accuracy, loss, F1, AUC, RMSE, or other metrics

If validation F1 for run r is F1(r), a selection step may look for: r* = argmax_r F1(r).
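Once runs are logged as structured records, the selection step is a one-line maximization. The run identifiers, parameters, and scores below are invented for the example.

```python
# Hypothetical logged runs: each has an ID, its parameters, and validation F1.
runs = [
    {"run_id": "r1", "params": {"max_depth": 4}, "f1": 0.81},
    {"run_id": "r2", "params": {"max_depth": 6}, "f1": 0.86},
    {"run_id": "r3", "params": {"max_depth": 8}, "f1": 0.84},
]

# r* = argmax_r F1(r); on ties, max() keeps the earliest run in the list.
best = max(runs, key=lambda r: r["f1"])
best_run_id = best["run_id"]
best_params = best["params"]
```

Selecting by a single metric is the simplest policy; in practice the key function could combine several logged metrics or filter out runs that violate a constraint first.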

15. MLflow Artifacts

MLflow also stores artifacts such as:

  • trained model files
  • plots
  • confusion matrices
  • feature importance reports
  • prediction samples
  • evaluation JSON files

These are associated with the run so that outputs remain organized and auditable.

16. MLflow Models

MLflow defines a packaging format for models that includes both the model artifact and metadata describing how it can be served or loaded. This helps standardize deployment across different model flavors and runtimes.

A packaged model may be thought of as: M = (weights, flavor, environment, signature, metadata).

17. MLflow Model Registry

The MLflow Model Registry provides lifecycle management for versioned models. A registered model may have multiple versions and promotion stages such as:

  • None (the default for a new version, often treated as a candidate)
  • Staging
  • Production
  • Archived

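This helps teams manage approval flows, rollback, and operational state transitions. Recent MLflow releases deprecate these fixed stages in favor of model version aliases and tags, which serve the same routing purpose with user-defined names.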

18. DVC vs MLflow

DVC and MLflow solve overlapping but distinct problems.

18.1 DVC Strengths

  • data versioning
  • pipeline dependency tracking
  • Git-integrated reproducibility
  • large artifact references via remote storage

18.2 MLflow Strengths

  • experiment tracking
  • metric comparison across runs
  • artifact logging
  • model packaging and registry workflows

18.3 Complementary Usage

In practice, many teams use them together. DVC can track datasets and processing pipeline versions, while MLflow can track training runs, metrics, and model promotion history.

19. Versioning Granularity

Effective ML versioning often requires separate but connected identifiers for:

  • raw dataset snapshot
  • processed dataset snapshot
  • feature definition version
  • training code commit
  • experiment run ID
  • model version ID
  • deployment version

Granularity matters because different failure modes require rollback at different layers.

20. Model Promotion and Governance

Not every trained model should become a deployed model. A robust workflow uses promotion gates such as:

  • metric thresholds
  • bias or fairness checks
  • latency tests
  • drift simulation
  • human approval

If candidate model M_new must outperform current production model M_prod, one may require: Score(M_new) > Score(M_prod) + δ, where δ ≥ 0 is a margin chosen to be larger than the run-to-run noise in the score.
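A promotion gate of this form can be sketched as a single predicate. The function name `should_promote`, the default margin of 0.01, and the scores are assumptions chosen for illustration; the `checks` argument stands for the outcomes of the other gates, such as fairness or latency tests.

```python
def should_promote(candidate_score, production_score, delta=0.01, checks=()):
    """Return True only if the candidate beats production by more than delta
    and every additional gate (passed in as booleans) has succeeded."""
    return candidate_score > production_score + delta and all(checks)


# Clear improvement, no extra gates: promote.
clear_win = should_promote(0.86, 0.84)

# Improvement smaller than the margin: do not promote.
within_noise = should_promote(0.845, 0.84)

# Large improvement, but one gate (for example a latency test) failed.
failed_gate = should_promote(0.90, 0.84, checks=(True, False))
```

Requiring the margin as well as the extra checks means a model that is better on the headline metric can still be held back, which is the intended behavior of a gate.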

21. Rollback

Versioning is only truly valuable if rollback is possible. When a deployed model underperforms or causes operational issues, teams should be able to return quickly to a previous trusted version: M_prod := M_old.

This requires that old artifacts, metadata, and compatible preprocessing logic all remain accessible.
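Rollback can be sketched as an append-only event log in which returning to an older version is itself a recorded event. The `Deployment` class and the version numbers are hypothetical; a real registry would also have to keep the referenced artifacts and preprocessing code retrievable.

```python
class Deployment:
    """Tracks which model version is in production, with an append-only log."""

    def __init__(self, initial_version):
        self.events = [("promote", initial_version)]

    @property
    def production(self):
        return self.events[-1][1]

    def promote(self, version):
        self.events.append(("promote", version))

    def rollback(self):
        """Repoint production at the most recent different version."""
        current = self.production
        for _, version in reversed(self.events):
            if version != current:
                # Record the rollback instead of deleting history.
                self.events.append(("rollback", version))
                return version
        raise RuntimeError("no earlier version to roll back to")


deployment = Deployment(17)
deployment.promote(18)           # version 18 goes live
restored = deployment.rollback()  # version 18 misbehaves, return to 17
```

Keeping the log append-only means the rollback remains visible afterwards, so an audit can see that version 18 was live for a time.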

22. Environment Versioning

Reproducing a model requires more than code and data. It also requires environment consistency:

  • Python version
  • library versions
  • CUDA or hardware dependencies
  • OS-level dependencies

If the runtime environment is denoted by E, then model reproducibility depends on versioning E alongside code and data.
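Part of E can be captured programmatically at training time. The sketch below records the interpreter, the platform, and the installed versions of a chosen set of packages using only the standard library; the function name `capture_environment` and the package names passed to it are assumptions, and OS-level or CUDA dependencies would need a separate mechanism such as a container image reference.

```python
import platform
from importlib import metadata


def capture_environment(packages=()):
    """Return a dictionary describing the current Python runtime."""
    environment = {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            environment["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly rather than failing.
            environment["packages"][name] = None
    return environment


# The second name is deliberately one that should not be installed.
environment = capture_environment(["numpy", "a-package-that-is-not-installed"])
```

Logging this dictionary alongside the run parameters ties a model to the library versions that produced it, which is often the missing piece when a retrained model behaves differently.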

23. Auditability

In regulated or high-stakes environments, teams may need to answer questions such as:

  • which data trained the deployed model on a given date
  • what metrics justified promotion
  • who approved the promotion
  • which feature transformations were active

Good versioning systems make such audit trails possible.

24. Common Failure Modes Without Proper Versioning

  • unable to reproduce a model that is already in production
  • unclear difference between two seemingly similar runs
  • data drift unnoticed because dataset snapshots were not tracked
  • inference errors because preprocessing version mismatched training
  • rollback impossible because artifacts were overwritten
  • team confusion about which model is authoritative

25. Practical MLOps Pattern

A practical versioning workflow often looks like this:

  • Git tracks code and lightweight config
  • DVC tracks datasets and pipeline outputs
  • MLflow tracks training runs and model artifacts
  • Model registry manages promotion and deployment stages

This creates a layered, traceable lifecycle from raw data to production model.

26. Evaluation Metrics and Versioned Comparison

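Model versioning is only useful if versions can be compared objectively. For classification, with TP, TN, FP, and FN denoting the counts of true positives, true negatives, false positives, and false negatives, standard metrics include: Accuracy = (TP + TN)/(TP + TN + FP + FN), Precision = TP/(TP + FP), Recall = TP/(TP + FN), and F1 = 2(Precision × Recall)/(Precision + Recall). For regression, a common choice is RMSE = sqrt((1/n) Σ (y_i − ŷ_i)²).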
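These formulas translate directly into code. The sketch below computes them from confusion-matrix counts and from paired predictions; the counts and values are invented, and returning 0.0 when a denominator is zero is one possible convention, not the only one.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def rmse(y_true, y_pred):
    """Root mean squared error over paired true and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical confusion counts for one model version.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
# accuracy = 85/100, precision = 40/45, recall = 40/50, f1 = 80/95

error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # sqrt(4/3)
```

Computing every version's metrics with the same function on the same evaluation set is what makes the comparison between versions meaningful.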

Comparison should also include:

  • latency
  • model size
  • resource cost
  • fairness metrics
  • robustness metrics

27. Strengths of DVC

  • strong Git-compatible data versioning workflow
  • pipeline dependency tracking
  • reproducible artifact retrieval
  • suitable for teams already using Git-centric workflows

28. Strengths of MLflow

  • excellent experiment tracking
  • strong run comparison and metric logging
  • model packaging standardization
  • clear model registry workflow for staging and production

29. Limitations and Trade-Offs

  • DVC is not a full model registry by itself
  • MLflow does not replace full data versioning systems
  • versioning discipline still depends on process design and team practice
  • large-scale lineage across many systems may require broader platform integration

30. Best Practices

  • Version data, code, features, environment, and model artifacts together, not in isolation.
  • Use immutable identifiers for reproducibility and mutable aliases only for operational convenience.
  • Track experiment parameters and metrics for every serious run.
  • Use a model registry for promotion workflows and rollback readiness.
  • Ensure feature logic used in inference is version-aligned with training.
  • Preserve lineage from source data to deployed endpoint.

31. Conclusion

Versioning data and models is foundational to modern machine learning engineering because an ML system is defined by much more than source code. Training data, feature logic, hyperparameters, environments, artifacts, and evaluation outputs all shape the final model and must be tracked if the system is to be reproducible, auditable, and maintainable.

DVC and MLflow address different but complementary parts of this challenge. DVC extends version control principles to large data artifacts and pipeline dependencies. MLflow supports experiment tracking, model packaging, and registry workflows for controlled promotion. Together, they help transform ML development from an ad hoc experimental activity into a disciplined lifecycle with lineage, comparison, rollback, and governance. In modern MLOps, versioning is not optional infrastructure—it is one of the core conditions for trustworthy machine learning.

Uma Mahesh

The author works as an architect at a software company and has more than 21 years of experience in web development using Microsoft technologies.
