Transforming Petroleum Engineers into Data Science Wizards

Once in a while I get messages from colleagues asking for tips on Data Science applied to Petroleum Engineering. Below are the responses I have collected over time, as advice to follow if you want to become both a Petroleum Engineer and a Data Science wizard:

Complete any of the Python or R online courses on Data Science. My favorites are the ones from Johns Hopkins on Coursera, complemented with the short DataCamp workshops. Those are just two that come quickly to mind; there are many others: edX, Stanford University, etc. If you don’t have previous programming experience, start with Python; if you feel confident about your programming skills, go full throttle with R. You will not regret it.
Start using Git as much as possible in all your projects. It is useful for sharing and maintaining code, working in teams, synchronizing projects, working on different computers, making your data science projects reproducible, etc. To bring Git to the cloud you may use GitHub, Bitbucket or GitLab.
Learn the basics of the Unix or Linux terminal. It is useful for many things that Windows doesn’t do, and might never do; it is even useful on Macs, since macOS is Unix-based as well. Scripting in the Unix terminal can serve you in many data-oriented activities, such as backups, file transfer, managing remote computers, secure transfer, low-level settings, FTP, version control with Git, etc. Get familiar with Unix or Linux, or with a Windows cousin such as MSYS2 or Cygwin. There is no question that you have to know the Unix terminal. It makes your data science much more powerful and reproducible.
As soon as you have installed R, Rtools and RStudio on your computer, start using Markdown. In R it is called Rmarkdown and is used for generating documentation, papers, citations, booklets, manuals, schematics, diagrams, etc. Make a habit of using Markdown. If possible, avoid Word, which generates mostly binary files. Work with plain ASCII text, applying the codes that Markdown allows. This makes revision control easier and your work reproducible, both key to reliable data science. You can also embed equations and running code with Rmarkdown and tools from the LaTeX universe.
Combine calculations with code and text using the Rmarkdown notebooks in R. Try not to use Word or Excel too much. You will know you are transforming into a data scientist when you use Excel only once every two or three months. Essentially, everything can be done with R or Python. Even though I am originally a Python guy, I am not recommending the Python notebooks or Jupyter because they are not 100% human-readable text, and you may find it difficult to apply version control and reproducible practices to them.
Learn something about Virtual Machines with VirtualBox or VMware. It is very useful to have several operating systems working at the same time on your PC: Windows, Linux, OS X. There is a lot of good data science stuff in Linux packaged as VMs which can be run under Windows very easily. These are applications that are ready to go without the need to install anything. The other day I downloaded a Linux VM with a whole bunch of machine learning and artificial intelligence apps. I have others from Cloudera and Hortonworks that run big-data applications such as Hadoop, Spark, etc. Once you have learned Virtual Machines, move on and start working with Docker containers. That will make your data science even more reproducible and help it stand the test of time.
Start bringing your data into datasets that can be used in R or Python. Build your favorite collections of datasets. Generate tables, plots and statistical reports to come up with discoveries. Use Markdown to document the variables (columns). If you can, store your datasets using Git. If you want to share the data while keeping it confidential, learn how to anonymize it with R or Python packages.
Start doing engineering with R or Python. Avoid Excel or Excel-VBA if possible, since they were not designed with version control or reproducibility in mind. They may keep you stuck on the ground, unable to perform much richer and more productive data science. Publish your engineering results using Markdown. There is one more thing you have possibly noticed: Excel plots are so simplistic that they look like they come from 30 years ago. With them you run the risk of dumbing down your analysis, or of missing discoveries in your data.
Learn and apply statistics everywhere, every time you can, in all the petroleum engineering activities you perform. Find what no other person can by using math, physics and statistics. Data Science is about making discoveries and answering questions with the data.
Read what other disciplines outside petroleum engineering are doing in data science and machine learning. You will see how insignificant we are in the scheme of things. It’s up to us to change that. Look at bioscience, biostatistics, genetics, robotics, medicine, automotive. They are light years ahead of the oil and gas industry.
Read the articles on the net about data science. It doesn’t matter if they use Python or R. You just have to learn what data science is about. The articles are free, and so are hundreds of books, booklets and papers.
Start inquiring about what machine learning is. Do the same with artificial intelligence. Learn first. Then, think of applications in your petroleum engineering area of expertise. They may not be data science per se, but they are the next stepping stone.
Review C++ and Fortran scientific code. I don’t mean to say that you need to learn another programming language, but knowing what they can do will add power to your toolbox. Sooner or later you will need Fortran, C or C++ for reasons of efficiency and speed. Trust me.
Learn how to read from different file formats, not only Excel. You can do this with R or Python. Ask which data formats your company uses for storing data. Try reading some chunks of that data: try with logs, seismic, well tests, buildups, drilling reports, deviation surveys, geological data, process data, etc. Create tidy datasets out of them.
Learn how to read and format unstructured data, meaning data that is not in row-column (rectangular) format. This is the most challenging data. This is where knowing regular expressions (“regex”) pays off.
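
To make the last two tips a bit more concrete, here is a minimal Python sketch of how a regex can turn free text into a tidy, rectangular dataset. Everything in it is an assumption made up for illustration: the report lines, the well names, the field layout and the column names (`well`, `date`, `depth_m`, `rop_m_per_h`) do not come from any real file format, so your own pattern will look different.

```python
import re

# Hypothetical free-text report lines, invented for this example only.
REPORT = """\
Well PX-101 | 2019-03-14 | depth 2450.5 m | ROP 12.3 m/h
Well PX-101 | 2019-03-15 | depth 2610 m | ROP 9.8 m/h
note: waiting on weather
Well PX-207 | 2019-03-15 | depth 1180.0 m | ROP 15 m/h
"""

# One named group per column of the tidy dataset we want to build.
PATTERN = re.compile(
    r"Well\s+(?P<well>[A-Z]+-\d+)\s*\|\s*"
    r"(?P<date>\d{4}-\d{2}-\d{2})\s*\|\s*"
    r"depth\s+(?P<depth_m>\d+(?:\.\d+)?)\s*m\s*\|\s*"
    r"ROP\s+(?P<rop_m_per_h>\d+(?:\.\d+)?)\s*m/h"
)


def parse_report(text):
    """Return one dict per line that matches PATTERN; skip the others."""
    rows = []
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match is None:
            continue  # free-text notes and blank lines are ignored
        row = match.groupdict()
        # Convert the numeric columns from text to numbers.
        row["depth_m"] = float(row["depth_m"])
        row["rop_m_per_h"] = float(row["rop_m_per_h"])
        rows.append(row)
    return rows


rows = parse_report(REPORT)
for row in rows:
    print(row)
```

Each resulting row has the same columns, so the list can be written to CSV or loaded into a data frame for the tables, plots and statistics mentioned earlier. Lines that do not match are silently skipped here; with real data you would probably want to log them instead, so nothing gets lost without you knowing.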

If I remember the other ones, I will add them to the list.

Alfonso