
DVC: Git for Your ML Training Data

You version code with Git. DVC does the same for ML training data. Here is your weekend starter guide to data version control.

You version code with Git. What about your model training data?

If you’ve ever asked “Which dataset trained this model?” or “Can we reproduce last month’s model exactly?”, you need DVC.



What DVC Solves

| Problem | Without DVC | With DVC |
|---|---|---|
| Which dataset trained this model? | “Check the shared drive, maybe?” | git log shows exact data version |
| Someone changed the training data | No history, no diff | dvc diff shows exactly what changed |
| Reproduce last month’s model | Impossible | git checkout + dvc checkout |

Your Weekend Starter

Seven commands. That’s all you need. (Full DVC docs)

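# Install DVC with S3 support (the remote below is an S3 bucket)
pip install "dvc[s3]"
# Run inside an existing Git repository
dvc init
dvc remote add -d myremote s3://mybucket/data
dvc add data/training.csv
# Stage the pointer file, the .gitignore entry DVC created, and the remote config
git add data/training.csv.dvc data/.gitignore .dvc/config
git commit -m "Track data with DVC"
dvc push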

Your data is now versioned.


What Just Happened

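  • DVC hashed your data file with MD5 and stored it in its local cache (content-addressable storage)
  • Created a small .dvc pointer file (a few lines of YAML holding the hash, size, and path)
  • Added the data file to data/.gitignore so Git never sees the raw data
  • Git tracks the pointer file (small, fast)
  • dvc push sends actual data to remote storage (S3, GCS, Azure Blob)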

Data stays out of Git. Versioning stays in Git.
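
To make the idea concrete, here is a rough shell sketch of content-addressed tracking. It is not DVC's actual implementation: the cache/ directory, the .ptr suffix, and the track function are hypothetical names chosen for illustration, and it assumes md5sum from GNU coreutils is available.

```shell
#!/usr/bin/env bash
# Simplified illustration of content-addressed tracking.
# NOT DVC's real code: cache/, .ptr, and track() are hypothetical names.
set -eu

workdir="$(mktemp -d)"
cd "$workdir"
mkdir -p data cache

# A tiny stand-in for a training dataset.
printf 'id,label\n1,cat\n2,dog\n' > data/training.csv

# track FILE: hash the file, store a copy under its hash, write a pointer.
track() {
  local file="$1" digest
  digest="$(md5sum "$file" | cut -d' ' -f1)"   # content hash doubles as storage key
  cp "$file" "cache/$digest"                   # the actual data lives in the cache
  printf 'md5: %s\npath: %s\n' "$digest" "$(basename "$file")" > "$file.ptr"
  echo "$digest"
}

digest="$(track data/training.csv)"

echo "pointer file contents:"
cat data/training.csv.ptr
```

Only the pointer file would go into Git in this sketch. The cache entry is what a push step would upload to remote storage.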


The Workflow

Change the data? Run dvc add again. New hash, new pointer, new commit.

Want last week’s data? git checkout <commit> + dvc checkout. Done.

Same Git workflow. New file types.
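
The round trip can be simulated without DVC or Git to show why it works. In this sketch, copying a pointer into commits/ stands in for a Git commit, and restore stands in for dvc checkout. All names (cache/, commits/, .ptr, track, restore) are hypothetical, and md5sum from GNU coreutils is assumed.

```shell
#!/usr/bin/env bash
# Simulated "change data, then go back" workflow.
# NOT DVC's real code: every name here is a hypothetical stand-in.
set -eu

workdir="$(mktemp -d)"
cd "$workdir"
mkdir -p data cache commits

# track FILE: store the file in the cache under its hash, write a pointer.
track() {
  local file="$1" digest
  digest="$(md5sum "$file" | cut -d' ' -f1)"
  cp "$file" "cache/$digest"
  printf 'md5: %s\n' "$digest" > "$file.ptr"
}

# restore FILE: rewrite FILE from the cache entry its pointer names.
restore() {
  local file="$1" digest
  digest="$(sed -n 's/^md5: //p' "$file.ptr")"
  cp "cache/$digest" "$file"
}

# Version 1 of the data.
printf 'id,label\n1,cat\n' > data/training.csv
track data/training.csv
cp data/training.csv.ptr commits/v1.ptr   # stand-in for "git commit"

# Version 2: the data changed, so the hash and pointer change too.
printf 'id,label\n1,cat\n2,dog\n' > data/training.csv
track data/training.csv
cp data/training.csv.ptr commits/v2.ptr   # stand-in for "git commit"

# Go back to version 1.
cp commits/v1.ptr data/training.csv.ptr   # stand-in for "git checkout <commit>"
restore data/training.csv                 # stand-in for "dvc checkout"

cat data/training.csv
```

Both versions stay in the cache under different hashes, so switching versions only requires swapping the small pointer and copying the matching cache entry back into the workspace.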


The DevOps Parallel

| Git | DVC |
|---|---|
| git add | dvc add |
| git push | dvc push |
| git pull | dvc pull |
| git checkout | dvc checkout |
| GitHub/GitLab | S3/GCS/Azure Blob |

If you know Git, you already know most of DVC. The commands intentionally mirror Git’s.


Your ML models are only as reproducible as your data. Start versioning it.

This is Part 4 of the MLOps for DevOps Engineers series. Next up: 5 Levels of ML Model Deployment.

For weekly MLOps tips, join the newsletter.

Kalyan Reddy Daida

Instructor with 383,000+ students across 21 courses on AWS, Azure, GCP, Terraform, Kubernetes & DevOps. Sharing real-world patterns from production environments.
