By nature, I am curious. I am not only interested in the why-of-things but also in the “how” — in being able to document it and reproduce it later. And that, most of the time, can be a time-consuming affair. Pleasurable, rewarding, but time consuming.
Add data science and deep learning to that, and the combinations grow exponentially.
I own an 8-core, 32GB RAM, 3TB SSD, Quadro K2100M GPU laptop that I originally acquired with the intention of running several virtual machines with Windows, Linux and macOS, as part of my work as an atypical petroleum engineer.
Let me explain: I love to experiment with engineering software (Prosper, PipeSim, GAP, Eclipse, MBAL, OFM, WinGLUE, OpenServer, PI, etc.) — making them do things they are not supposed to do, extending their capabilities. And that, inevitably, gets you in touch with digital interfaces to exchange data from one application to another; creating connections with well tests or oil/gas production databases (Oracle, MS-SQL, and what-not); learning programming and scripting languages you were not expecting to (SQL, VB, C#, bash, Python); tweaking hardware to get the best out of it (like finding a 64-bit PC with enough RAM to run a network model of 100+ wells); trying to automate optimization tasks with scripts (when doing the well modeling manually will not meet a deadline); and, once in a while, doing computational physics and enjoying real-time simulation (transient modeling).
When you are working with production or reservoir engineering software, if you don’t want to mix environments in a physical machine (here I mean the core PC or laptop), you end up using virtual machines. And if you are a daring Linux soul, Docker containers.
Virtualization, Data Science and Machine Learning
As you might have guessed at this point, if you are doing Data Science or Machine Learning, virtualization and containerization — I will refer to them as V/C — make perfect sense as companions for reproducible work. Why?
Because virtualization (either with VirtualBox or VMware) and containerization (with Docker, in my case) let you:
(1) roll back to a previous state of the virtual operating system if you find software or driver conflicts;
(2) switch virtual environments at any moment, if needed;
(3) snapshot data experiments;
(4) prevent “damage” to the physical machine when installing an untested application;
(5) combine Python and R deep learning frameworks;
(6) isolate the side effects of one application or package from another;
(7) keep a Jupyter or RStudio test server up and running without adding clutter to your physical machine;
(8) download a ready-made Python or R virtual environment from the web without going through a painful and time-consuming installation;
(9) move from one deep learning platform to another with relative ease; and
(10) test attached hardware, such as verifying that your GPU can be seen from a virtual environment. I will come back to this in a later article.
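As a quick illustration of point (10), here is a minimal Python sketch — the function name is my own invention — that asks `nvidia-smi` whether a GPU is visible from inside the current environment. Run inside a VM or container without GPU passthrough, it typically reports False:

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is present and reports at least one GPU.

    Inside a VM or container without GPU passthrough, nvidia-smi is
    usually missing or errors out, so this returns False.
    """
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        out = subprocess.run(
            ["nvidia-smi", "--list-gpus"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return out.returncode == 0 and "GPU" in out.stdout


if __name__ == "__main__":
    print("GPU visible from this environment:", gpu_visible())
```

The same check works on the physical machine, so you can compare host versus guest before committing to a long training run.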
The Meaning and Value of Reproducibility
Reproducibility may not be the main purpose of V/C at first, but it becomes obvious — or automatic? — as you keep adding and working on more projects. The more dissimilar the projects, the more productivity you gain.
I might not be using “reproducible” in the same context or with the same rigor as in scientific research, but close to it. What I mean by reproducible work is:
You could open a project and be able to run it a year after you delivered it.
You are able to share your work with a colleague or team, and they should be able to obtain the same results given the same data.
The raw data is separated from the processed data and remains unaltered for the whole life of the project, so others can build their own analysis from that raw data.
The processed data can be regenerated by a sequence of scripted steps, without the help of a mouse or clicks.
A new run of the project should produce the same final output — calculations, plots, tables, figures, and report — as in the original document.
Text, calculations and plots should all be updated from the same project run.
The project data (raw, processed and utility), metadata, calculation scripts, figures, project book, all reside in a repository that could withstand the test of time.
You are able to trace back changes or modifications to the data or the analysis by means of version control.
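To make the “scripted, mouse-free regeneration” idea concrete, here is a minimal Python sketch. The file names, columns, and functions are hypothetical, not from any real project: the raw file is fingerprinted so you can prove it was never altered, and the processed file is rebuilt deterministically on every run.

```python
import csv
import hashlib
from pathlib import Path

RAW = Path("data/raw/well_tests.csv")         # captured once, never modified
PROCESSED = Path("data/processed/rates.csv")  # regenerated on every run


def checksum(path: Path) -> str:
    """SHA-256 fingerprint of the raw data, to prove it stayed unaltered."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def process(raw: Path, processed: Path) -> None:
    """Deterministic raw -> processed step: same input, same output."""
    processed.parent.mkdir(parents=True, exist_ok=True)
    with raw.open(newline="") as src, processed.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst)
        writer.writerow(["well", "oil_rate_bopd"])
        # Sort by well name so the output order never depends on input order.
        for row in sorted(reader, key=lambda r: r["well"]):
            writer.writerow([row["well"], float(row["oil_rate_bopd"])])


if __name__ == "__main__" and RAW.exists():
    print("raw sha256:", checksum(RAW))
    process(RAW, PROCESSED)
```

Anyone who clones the repository and runs this script on the same raw data gets byte-identical processed data — which is exactly the property the plots and report downstream depend on.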
As you can see, reproducibility brings many advantages. Yes, it is a little bit harder but completely doable if you use the right tools.
Reproducibility is not only good for business but keeps you honest.
Points of Discussion in Oil & Gas
What is your interpretation of reproducibility?
Is reproducibility given importance in business continuity?
“Reproducibility is nothing new. We have been doing that for ages in my company.”
“Engineers are not provided with the right tools for reproducible work.”
“Nah. Reproducibility is just for scientific nerds. No need to apply in petroleum engineering.”
“Reproducibility works great for other industries; we should make an effort to adopt it.”
“Reproducibility is already applied in the major O&G companies. That’s why they are economically sound.”
- You can run containers on Windows as well, using its own hypervisor, or more easily with Kitematic on top of VirtualBox.
- When I say framework, I am referring to any of the machine learning tools available, such as PyTorch, TensorFlow, Keras, mxnet, H2O, Caffe, Theano, scikit-learn, and a few others.
- GPU: Graphics Processing Unit. The same card that gamers adore.