Showing posts with the label Version Control

Reproducibility and Data

TL;DR → Version control for your data doesn’t mean s**t if you’re not versioning right up to the point you use it.

/via https://www.digital-science.com/blog/guest/digital-science-doodles-lab-life-and-experimental-reproducibility/

I’ve written about the Reproducibility Crisis in Machine Learning before. Without going into gory details (just go read the post instead), the field is a perfect storm that comes from combining huge (nay, massive) data sets, jobs that take forever to run, pipelined workflows within these jobs, workflows running in parallel, and poor version control.

“Poor Version Control → This one is the killer here. While you might (might! Admit it, you don’t!) have version control for your individual models and datasets, what you inevitably end up not doing is versioning all your tweaks in flight. It is, after all, easier to just tweak some of the data on the servers rather than copying all your assets up again...
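To make "versioning right up to the point you use it" a bit more concrete, here is a minimal sketch, not anything from the original post: it fingerprints a data directory immediately before a job reads it, so an in-place tweak on the server shows up as a changed hash. The function names `file_digest` and `fingerprint_dataset` and the directory layout are assumptions made up for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprint_dataset(root: str) -> str:
    """Return one SHA-256 hex digest covering every file under `root`.

    Hypothetical helper: files are visited in sorted order so the result
    is stable between runs, and both the relative path and the per-file
    digest feed the hash. A renamed file, an added file, or a single
    edited record therefore all change the fingerprint.
    """
    root_path = Path(root)
    overall = hashlib.sha256()
    for path in sorted(p for p in root_path.rglob("*") if p.is_file()):
        relative = path.relative_to(root_path).as_posix()
        overall.update(relative.encode("utf-8"))
        overall.update(b"\0")
        overall.update(file_digest(path).encode("ascii"))
        overall.update(b"\n")
    return overall.hexdigest()
```

Calling `fingerprint_dataset` as the first step of a job and writing the result next to the model and code versions in the run log is one way to notice that the data was tweaked in flight. It only detects the change; it does not keep the old copy around, so it complements proper data versioning rather than replacing it.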

Reproducibility and Machine Learning

“I can say with my hand on my heart, that machine learning is by far the worst environment I’ve ever found for collaborating and keeping track of changes.” (Pete Warden)

I’d actually quite agree. Mind you, it’s not because of something fundamentally bad about the world of Deep Learning (•); it’s more about a collection of things that add up to a lot of pain. To summarize a much longer writeup about this:

1. Size: The source data is large. This is a problem in and of itself, since the state of the art in managing large amounts of data is still, well, sucky. Think “where do you put it”, “how do you get at it”, “how do others get at it”, etc. And yeah, Dropbox and its ilk are where most of this stuff lives.

2. Canonical Data: The data used isn’t canonical (actually it’s worse, it is near-canonical). Your version of the data may be different by just a few records, or you tweaked just a few records. And this may be before y...