Transforming Petroleum Engineers into Data Science Wizards. Update 2019

Note. This is an update of the original article I published in 2017. Many, many things have changed or progressed so fast that the article needed some rewriting.

Very often I receive questions from colleagues asking for tips on Data Science and Machine Learning applied to Petroleum Engineering. The answers below address some of the questions I have collected over time. You may call this some advice on becoming a Petroleum Engineer and Data Science wizard:

Complete any of the Python or R online courses on Data Science. My favorites are the ones from Johns Hopkins and the University of Michigan on Coursera (Data Science Specializations in R or Python). Make no mistake: the data science specialization in R is a high-quality course, and at times it will make you feel as if you were going through a PhD program. You will need a firm commitment, and to set aside time for lectures, quizzes and project assignments. You could complement it with DataCamp short workshops. These are just two names that quickly come to mind, but there are many others: edX, Udacity, Udemy, etc., or courses from reputable universities such as Stanford, MIT, or Harvard. If you don’t have previous programming experience, start with Python; if you feel confident about your programming skills and would like to break the barrier between engineering and science, go full throttle with R. You will not regret it. *Note.* For those who have asked me whether I recommend a formal data science degree at a university, what I tell them is: try online courses first and see if it is for you.
Start using Git as much as possible in all your projects. It is useful for sharing and maintaining code, working in teams, synchronizing projects, working on different computers, making your data science projects reproducible, etc. To host Git repositories in the cloud you may use GitHub, Bitbucket, or GitLab. Don’t be frustrated if you don’t understand it at first; everybody struggles with Git. Even PhDs, who, by the way, have written the best tutorials on Git. So, you are not alone.
Learn the basics of the Unix terminal. It is useful for many things that Windows doesn’t do, and might never do; it is just as useful on Linux and Macs, which are Unix based as well. You can write scripts in the Unix terminal to automate many data-oriented activities, such as operations on huge datasets, deployment, backups, file transfer, managing remote computers, secure transfer, low-level settings, version control with Git, etc. If you are a Windows guy, get familiar with Unix through hybrid Windows applications such as Git-Bash, MSYS2, or Cygwin. There is no question that you have to know the Unix terminal. It makes your data science much more powerful and reproducible, and also gives you avenues for deployment. More and more often I find articles whose authors have managed to read and transform terabyte-size datasets on laptops, using a combination of Unix utilities like grep, awk, sed, etc., along with data.frame and data.table structures. No need for big-data computer clusters with Hadoop or Spark, which are more difficult to handle.
As soon as you have installed R, Rtools and RStudio on your computer, start using Markdown. In R it is called Rmarkdown, which is widely used in science for generating documentation, papers, citations, booklets, manuals, tutorials, schematics, diagrams, web pages, blogs, slides, etc. Make a habit of using Markdown. If possible, avoid Word during engineering work, since it generates mostly binary files. Working with Markdown makes revision control easier and the work reproducible, both of which are key to reliable, testable, traceable, repeatable data science. With Markdown you can also embed LaTeX equations alongside text, code and calculations. Besides, you gain an additional ecosystem for running code and tools from the LaTeX universe, which is enormous.
Strive to publish your engineering results using Markdown. It will complement your efforts in batch automation, data science and machine learning. Combine calculations with code and text using the Rmarkdown notebooks in R. Essentially, any document can be written mixing text, graphics and calculations with R or Python. Even though I am originally a Python guy (10+ years), I do not strongly recommend Python notebooks, or Jupyter, because they are not stored as 100% human-readable text (the file format is JSON), which makes it difficult to apply version control and reproducible practices, or to use them with Git. I have possibly built more than a thousand Jupyter notebooks, but when I learned Rmarkdown it was like stepping into another dimension.
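
To make the point about JSON concrete, here is a minimal sketch in Python. It assumes the usual nbformat 4 notebook layout; the notebook content is made up, and `code_only` is a hypothetical helper of mine, not part of Jupyter. It shows how a two-line calculation is stored on disk, and how little of that is the code you would actually want to review in Git.

```python
import json

# Hand-written minimal notebook in the nbformat 4 layout (illustrative content).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Rate check\n"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["500.0\n"]}
            ],
            "source": ["qi = 1000.0\n", "print(qi / 2)"],
        },
    ],
}


def code_only(nb):
    """Join the source of every code cell, dropping outputs and metadata."""
    chunks = [
        "".join(cell["source"])
        for cell in nb["cells"]
        if cell["cell_type"] == "code"
    ]
    return "\n\n".join(chunks)


# What ends up on disk (and in every Git diff) is the JSON text.
as_stored = json.dumps(notebook, indent=1)
print(len(as_stored.splitlines()), "lines of JSON on disk")

# What a reviewer actually cares about is just this:
print(code_only(notebook))
```

Outputs, execution counters and metadata all live in the same file as the code, so re-running a cell changes the file even when the code did not change. A plain-text format such as Rmarkdown keeps only the text and the code.
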
Start bringing your data into datasets with the assistance of R or Python. Build your favorite collections of datasets. Share them with colleagues in the office and discuss the challenges of making raw data tidy. Generate tables, plots and statistical reports to come up with discoveries. Use Markdown to document the variables or features (columns). If you want to share the data while keeping it confidential, learn how to anonymize it with R or Python cryptographic or scrambling packages.
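
As an illustration of the anonymization tip, here is a minimal sketch using only the Python standard library (`hmac` and `hashlib`). The well names, the rates, the `SECRET_KEY` value and the `pseudonymize` helper are all made up for the example; keyed hashing is just one possible approach, and whether it is enough depends on your company's confidentiality rules.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice keep it private and out of version control.
SECRET_KEY = b"replace-with-a-private-key"


def pseudonymize(name, key=SECRET_KEY, length=10):
    """Return a stable alias for a well name using a keyed hash (HMAC-SHA256)."""
    normalized = name.strip().upper().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).hexdigest()
    return "WELL-" + digest[:length]


# Made-up records, for illustration only.
records = [
    {"well": "Alpha-01", "oil_rate_bpd": 1250.0},
    {"well": "Alpha-02", "oil_rate_bpd": 980.5},
    {"well": "alpha-01 ", "oil_rate_bpd": 1190.0},  # same well, messy spelling
]

# Replace the real name with its alias and keep the measurements untouched.
anonymized = [{**row, "well": pseudonymize(row["well"])} for row in records]

for row in anonymized:
    print(row)
```

The same well always maps to the same alias, so grouping and joins still work on the shared dataset, while someone without the key cannot simply hash a list of known well names to recover the originals.
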
Start solving daily engineering problems with R or Python, incorporating them into your workflow. If you can, avoid Excel and Excel-VBA. VBA was not designed for version control, reproducibility or data science, much less machine learning. Sticking to Office tools may keep you stuck in outdated practices, unable to perform much richer and more productive data science. There is one more thing you may have noticed: Excel plots are very simplistic, relying on techniques from 30 years ago, and you run the risk of dumbing down your analysis, or of failing to make discoveries from your data or to tell a compelling story, which is the purpose of data science anyway.
Learn and apply statistics everywhere, every time you can, in all the petroleum engineering activities you perform. Find what no other person can by using math, physics and statistics. Data Science is about making discoveries and answering questions with data. Data Science was invented by statisticians, who at the time called it “data analysis”. An article I never get tired of reading and re-reading is “50 Years of Data Science” by David Donoho. Please, read it. It explains statistics and its tempestuous, albeit tight, relationship with data science. And remember: the additional oil and gas that you can find with data science and machine learning will be the cheapest hydrocarbons to produce.
Read what other disciplines outside petroleum engineering are doing in data science and machine learning. You will see how insignificant we are in the scheme of things. It’s up to us to change that. Look at bioscience, biostatistics, genetics, robotics, medicine, cancer research, psychology, biology, ecology, automotive, finance, etc. They are light years ahead of the oil and gas industry.
Read articles on the net about data science. It doesn’t matter if they use Python or R. You just have to learn what data science is about and how it could bring value to your everyday workflow. They may give you ideas for applications involving data in your petroleum engineering area of expertise. They may not be data science per se now, but they could well be the next stepping stone. Additionally, most of the articles are free, as are hundreds of books, booklets, tutorials and papers. We have never had the chance to learn so much for so little. Somebody has called this the era of democratization of knowledge and information. What you have to invest is time.
Start inquiring into what machine learning is about. Same with artificial intelligence. There is nothing better than knowing at least the fundamentals of what others are trying to sell you. There is so much noise and snake-oil marketing nowadays surrounding the words “machine learning” and “artificial intelligence”. Two books on artificial intelligence I would recommend, off the top of my head: “Computational Intelligence: A Logical Approach” by David Poole, Alan Mackworth and Randy Goebel; and “Artificial Intelligence: A Modern Approach” by Russell and Norvig. You will find that AI is not what you read in newspapers or articles. What’s more, I have a tip for you: if you see an article with the figure of a humanoid, or a human-faced robot, or mechanical arms with some brain on it, skip it. That is not what AI is about.
Review C++ and Fortran scientific code. I don’t mean to say that you need to learn another programming language, but knowing what they can do will add power to your toolbox. Sooner or later you will need Fortran, C or C++ for reasons of efficiency and speed. It is not for nothing that today’s best-in-class simulators and optimizers have plenty of Fortran routines under the hood.
Learn how to read from different file formats. The variety of file formats in which you may find raw data is enormous. There is a lot of value you can bring to your daily activities by automating your data analysis workflow with R or Python. Also, ask which data formats your company uses for storing data. Get familiar with them. Try reading some chunks of that data: logs, seismic, well tests, buildups, drilling reports, deviation surveys, geological data, process data, simulation output, etc. Create tidy datasets out of them. Explore the data.
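
To give an idea of what “creating a tidy dataset” can look like, here is a minimal sketch in Python using only the standard library. The wide table below (one column per well) is made up, and `wide_to_tidy` is a hypothetical helper; real exports will need their own handling of units, headers and missing values.

```python
import csv
import io

# Made-up wide-format export: one column per well (illustrative only).
RAW = """date,W-1,W-2,W-3
2019-01-01,1200,950,
2019-01-02,1180,,400
"""


def wide_to_tidy(text):
    """Reshape a wide table into one observation per row: date, well, oil_rate."""
    tidy = []
    for row in csv.DictReader(io.StringIO(text)):
        date = row.pop("date")
        for well, value in row.items():
            if value is None or value.strip() == "":
                continue  # skip missing measurements rather than inventing zeros
            tidy.append({"date": date, "well": well, "oil_rate": float(value)})
    return tidy


tidy_rows = wide_to_tidy(RAW)
for r in tidy_rows:
    print(r)
```

In the tidy form each row is one measurement of one well on one date, which is the shape that most plotting and statistics tools in R or Python expect.
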
Something more challenging is learning how to read and transform unstructured data, meaning data that is not in row-column (rectangular) format. The typical cases closest to us are the text output from simulators, optimizers, stimulation or well design software, etc. This is some of the most difficult data to work with, and it is where knowing “regex” (regular expressions) really pays off. Consider also how much data comes as video and images! Today there are plenty of algorithms available that deal with that kind of data in Matlab, Python or R.
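
Here is a minimal sketch of the regex idea in Python, using the standard `re` module. The report fragment and its layout are invented for the example (they do not come from any particular simulator), and `parse_report` is a hypothetical helper; every real output format needs its own pattern.

```python
import re

# Made-up fragment of a simulator text report (the layout is an assumption).
REPORT = """\
 TIME =   30.0 DAYS   WELL P-101   OIL RATE =  1250.5 STB/D   BHP = 2150.0 PSIA
 *** WARNING: time step chopped ***
 TIME =   60.0 DAYS   WELL P-101   OIL RATE =  1198.2 STB/D   BHP = 2101.7 PSIA
 TIME =   60.0 DAYS   WELL P-102   OIL RATE =   880.0 STB/D   BHP = 1950.3 PSIA
"""

# A number with optional sign, decimals and exponent.
NUM = r"[-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?"

# Named groups make the pattern self-documenting.
LINE = re.compile(
    rf"TIME\s*=\s*(?P<time>{NUM})\s+DAYS\s+"
    rf"WELL\s+(?P<well>\S+)\s+"
    rf"OIL RATE\s*=\s*(?P<oil_rate>{NUM})\s+STB/D\s+"
    rf"BHP\s*=\s*(?P<bhp>{NUM})\s+PSIA"
)


def parse_report(text):
    """Turn matching report lines into rows; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        m = LINE.search(line)
        if m is None:
            continue  # warnings, banners, blank lines
        rows.append({
            "time_days": float(m["time"]),
            "well": m["well"],
            "oil_rate_stbd": float(m["oil_rate"]),
            "bhp_psia": float(m["bhp"]),
        })
    return rows


rows = parse_report(REPORT)
for row in rows:
    print(row)
```

Once the text is in rows like these, it is rectangular data again and can be handled with the same tools as any other tidy dataset.
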
Learn something about virtual machines with VirtualBox or VMware. It is very useful to have several operating systems working at the same time on your PC: Windows, Linux, macOS. There is a lot of good data science and machine learning software for Linux packaged as VMs, which can be run under Windows very easily. These are applications that are ready to run without the need to install anything on the physical machine. A few months ago, I was able to download a couple of Linux VMs with a whole bunch of machine learning and artificial intelligence applications, and test them with minimum effort. I have other VMs from Cloudera and Hortonworks that run big-data applications such as Hadoop, Spark, etc. Another virtualization tool that you may want to learn is Docker containers. The concept is similar to that of virtual machines, but lighter and less resource intensive. These tools will make your data science even more reproducible and help it stand the test of time.

Alfonso R. Reyes

Houston, Texas. 2019

References

Article published in SPE Data Science and Digital Engineering: link
My GitHub repository with open source code: link
My Rmarkdown blog: link