Replicate Version control for machine learning

A machine learning model is the combination of the code and the training data, so knowing what data you trained a model with is essential.

There are two ways to do this:

Store training data in Replicate. This is recommended if your training data is small (<100MB).
Point at data in another system. This is recommended if your training data is large or you already have somewhere to store it.

Store training data in Replicate

If your training data is small, then we recommend storing it with Replicate in each experiment.

For example, if your training data is in a directory training-data/ alongside your training code, then you might write this in your code:

experiment = replicate.init(
    path="training-data/",
    params={...}
)

If you want to store both your training script and training data, you can just save everything:

experiment = replicate.init(
    path=".",
    params={...}
)

Then, to copy this data back to the current directory, you use replicate checkout:

$ replicate checkout <experiment ID>

The downside of this approach is that Replicate makes a complete copy of your training data on each experiment. So, this approach only works if your training data is small.

How small is "small" depends on your storage costs and bandwidth, but typically we'd recommend doing this if your data is less than 100MB.

Point at data in another system

If your training data is large, or you already have a system for storing your training data, then we recommend putting a pointer to your training data in the params dictionary.

For example, if your training data is on S3, you might put the URL to your training data in params:

training_data_url = "s3://hooli-training-data/hotdogs-2020-05-03.tar.gz"
experiment = replicate.init(
    path=".",
    params={
        "training_data_url": training_data_url
    }
)
# ... download training_data_url and run training

This assumes you are disciplined about versioning your data and the contents of that URL never change. If the data at this URL might change, then you might want to calculate the shasum and record this in params.

Then, if the data changes, you will see a different shasum in replicate diff, and you will know an experiment was trained on different data.

Note: This documentation is incomplete. We'd love to hear about ways that you are versioning data. See this GitHub issue or chat to us in Discord.

```