Updated at 2018-07-19 23:02

Reproducibility is the basis of scientific method.

Building machine learning solutions without reproducibility is like programming without version control. In addition to that, you need to keep track of the training data and experiment results.

... one of the places I interviewed at just had a wall of post-it notes, one for each file in the tree, and coders would take them down when they were modifying files, and return them when they were done!

Benefits of reproducible machine learning:

  1. You can build your own model again from scratch with controlled changes.
  2. You can build on top of your colleague's work and via versa.
  3. You can share your model to the scientific community and use models from papers.
  4. You can share your model as open-source and use models made by somebody online.
  5. You can share your model with your clients.
  6. If key data scientist leaves, the company can still work on the model.

Requirements to reproduce a machine learning model:

  1. Recording each step it takes to build the model.
  2. Recording what raw data and which version was used.
  3. Recording what pre-processing was done for the raw data.
  4. Recording system-level dependencies such as OS and GPU drivers.
  5. Recording code-level dependencies such as NumPy and TensorFlow versions.
  6. Recording the actual code that was used for each step.

Requirements for machine learning teamwork:

  1. Sharing what experiments has been tried e.g. new features or hyperparameters.
  2. Sharing code changes made.
  3. Sharing what other team members are currently working on.

Main objection to reproducibility is how much extra process it introduces. If the workflow becomes a lot more complex, data scientists won't use the service/tool.