At the request of colleagues, here is what I have to say on this: I believe that data science doesn’t carry the word “science” just to look sexy. It really means it. Data science as a discipline is not new. It has been living among us for 50 years. It was invented by scientists with a deeply ingrained love for statistics.
Statisticians have been the inventors and guardians of data science. They still are.
Article Predict Core Properties with Machine Learning by Amrita Sen.
_“The OAG platform allows subject matter experts to then quickly build and assess machine learning models without having to learn Python or R.”_
<- ARR. But that is the whole point of the data science revolution: getting rid of black boxes and bringing in more open science to increase discoveries and tap hidden oil. Petroleum engineers should strive to make reproducible examples.
No, I am not travelling or anything like that. I am actually half done writing the tutorial for the multiwell-stats application in Python. I will be writing it using the magnificent tools of data science, so we have a fully reproducible document. My pick for writing tutorials, booklets and books is #bookdown. It is an #rstats package that lets you combine math, code and text in the same document.
Remember, data science is about reproducibility, as in reproducible research.
This post was inspired by my response a few months ago in the SPE forums. The question, if I remember correctly, was how to sell predictive analytics to a conservative manager.
I have made some changes to my original answer to make it more current and independent of the original post.
So, the question is:
How do you sell a petroleum engineering data science project to your skeptical manager?
As I announced last week, my blog is now online at http://blog.oilgainsanalytics.com. LinkedIn may obfuscate the link so I am also providing it as an image below. Clicking on the image will bring you to the blog:
I believe in the sharing philosophy of data science as I learned it from my biostatistician instructors at Johns Hopkins University (Peng, Leek, Caffo, et al).
One of the most challenging things in dealing with data is nested structures. In a perfect world, data would be tables (rectangular) and tidy. If physicists are finding the right format, it should also work for petroleum engineering folks.
The image is a screenshot of Fig.1 from the paper “Machine Learning in High Energy Physics Community White Paper”
Link to post on LinkedIn
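To make the nested versus rectangular contrast concrete, here is a minimal sketch in plain Python. The well names, dates, rates and the `flatten` helper are all made up for illustration; they do not come from the white paper.

```python
# Hypothetical nested records: one entry per well, each holding a list of tests.
wells = [
    {"well": "W-01", "tests": [{"date": "2019-01-05", "oil_rate": 410.0},
                               {"date": "2019-02-03", "oil_rate": 395.0}]},
    {"well": "W-02", "tests": [{"date": "2019-01-07", "oil_rate": 120.0}]},
]


def flatten(records):
    """Return one flat row (dict) per nested test, repeating the well name."""
    rows = []
    for record in records:
        for test in record["tests"]:
            # Copy the parent key into each child row so the row stands alone.
            rows.append({"well": record["well"], **test})
    return rows


flat_rows = flatten(wells)
for row in flat_rows:
    print(row)
```

Each printed row now has the same three keys, so the result is rectangular and can go straight into a table or a data frame.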
I read in an article yesterday that learning the Cloud is a must if you are in data science and machine learning. I made some annotations:
Although the article is a discussion on DevOps, there are parts that go beyond and touch data science and machine learning interests.
That was just in case you thought you had it hard learning Python or R.
The article is named How To Become a DevOps Engineer In Six Months or Less.
I would start by identifying acute problems in your area of expertise (domain): production, reservoir, drilling, completions, geophysics, chemistry, seismic, etc., that you feel could be resolved by applying data science.
They may be big problems or small ones. Start with the small ones, or break the big ones into manageable pieces that you can address one step at a time.
Once you have two or three data science “project” candidates, start applying the basics to solve the problem.
A few days ago I watched Professor Andrew Ng's interview with one of the luminaries of deep learning and artificial intelligence, Dr. Yoshua Bengio. He has written books and dozens of papers on deep learning and neural networks. I liked the style. Pretty down-to-earth stuff. Just the way Professor Andrew likes to do it: bringing machine learning and deep learning to the masses.
So the question remains: do petroleum engineers need to learn data science, computer science, statistics, machine learning, neural networks, virtualization and GPU based engineering?
One of the first concepts that one learns when working with data is rearranging raw data into tidy datasets. A tidy dataset not only means having the data in a row-column format but in such a way that a row corresponds to an observation and a column to a variable. This makes the analysis enormously easier. I know this could sound a little bit confusing, so I will show what raw data and tidy data look like with an example.
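As a rough sketch of the idea (not the example from the original post), the snippet below reshapes a small "wide" table, where each month is its own column, into a tidy one with one row per well and month. The data, the column names and the `tidy` helper are assumptions made up for illustration.

```python
# Made-up raw table: one row per well, one column per month (wide format).
raw = [
    {"well": "W-01", "2019-01": 410.0, "2019-02": 395.0},
    {"well": "W-02", "2019-01": 120.0, "2019-02": 118.0},
]


def tidy(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    out = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the id is repeated on every output row instead
            out.append({id_column: row[id_column],
                        variable_name: column,
                        value_name: value})
    return out


tidy_rows = tidy(raw, "well", "month", "oil_rate")
for row in tidy_rows:
    print(row)
```

In the tidy version every row is a single observation (one well in one month) and every column is a single variable (`well`, `month`, `oil_rate`). In the raw version the month, which is really a variable, was hiding in the column headers.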