Data Science for Petroleum Engineering - Part 5.3 Finding and filling missing well data in alphanumerics

This is what we will be reviewing in this lecture.

NOTE. The PDF version of the R markdown notebook is available on GitHub, along with the reproducible R markdown notebook itself. Both are full versions of this LinkedIn article. For the time being, LinkedIn publishing does not support markdown, which would make sharing scientific and engineering documents much easier.

Load the raw data file

# code

We will see that some well names can be fixed manually and others should be done automatically with a script.

In our particular case we only have 100 wells, but what if we had 1000 or 5000? Fixing them manually would not be an option. Some names are quickly fixable; others are more challenging. Let’s start with the easier ones.

When correcting data, always go from the more general to the more particular.

convert lowercase to uppercase

Let’s convert the well names to uppercase and verify how many were corrected.

# code

Two were corrected.
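As a minimal sketch of this step, assuming hypothetical well names and a naming convention of field, platform letter, three-digit number and completion type (the real data set is not reproduced here), the count of corrected names could be obtained like this:

```r
# Hypothetical well names; not the article's actual data.
wells <- c("PSCO-M005-TS", "psco-M016-LS", "PSCO-r009-SS", "PSCO-Q001-SS")

# Assumed convention: PSCO, platform letter, 3-digit number, completion type.
pattern <- "^PSCO-[MQRS][0-9]{3}-[LTS]S$"

# Names that fail the pattern before the fix.
bad_before <- length(grep(pattern, wells, invert = TRUE))

# Convert to uppercase and count the failures again.
wells_upper <- toupper(wells)
bad_after <- length(grep(pattern, wells_upper, invert = TRUE))

# Number of names fixed by the conversion.
corrected <- bad_before - bad_after
```

Counting the non-matching names before and after each step is a cheap way to confirm that a correction did what we expected.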

removing spaces

# code

One well name was corrected.
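A possible way to remove the spaces, sketched with hypothetical well names, is gsub with a whitespace character class:

```r
# Hypothetical well names; one of them contains a stray space.
wells <- c("PSCO-M005-TS", "PSCO-M0 17-LS", "PSCO-Q001-SS")

# Remove every whitespace character anywhere in the name.
wells_clean <- gsub("[[:space:]]+", "", wells)
```

gsub is used rather than sub because sub would only remove the first run of whitespace in each name.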

correct the completion type

The completion type, at the end of the well name, should have two characters: LS, TS or SS.

# code

Those were the easy ones. We had three corrections. There are 5 more to go.
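One way the completion type could be repaired is sketched below. It assumes the faulty names simply lost the trailing "S"; that is a guess about the data, and the names are hypothetical.

```r
# Hypothetical well names; some end in a single-letter completion type.
wells <- c("PSCO-M005-T", "PSCO-M006-LS", "PSCO-Q001-S", "PSCO-R009-L")

# If the name ends in a hyphen plus exactly one of L, T or S,
# append the missing "S". Names with two letters are left untouched.
wells_fixed <- sub("-([LTS])$", "-\\1S", wells)
```

The `$` anchor matters: without it, a valid ending such as "-LS" would also match on its first letter.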

correcting the field abbreviation in the well name

Two wells do not have the field properly identified: there is an additional “I” in the field name abbreviation, and we have to remove it. At this point we have two choices: (1) change the first four characters of every well name to PSCO, or (2) fix only the two well names with the issue by removing the “I”.

# code

In the example we used invert=TRUE to negate the correct pattern, so grep returns the names that do not match it. If we wanted a regex that includes the negation itself, we would have to use:

# figure
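The regex from the original figure is not reproduced here. As a hedged illustration, a negative lookahead is one way to put the negation inside the pattern itself. The well names are hypothetical, and perl = TRUE is needed because lookaheads are a PCRE feature.

```r
# Hypothetical well names; two carry the wrong field abbreviation.
wells <- c("PSCO-M005-TS", "PISCO-R009-SS", "PSCO-Q001-SS", "PISCO-R027-LS")

# Approach A: match the correct prefix and let grep invert the result.
by_invert <- grep("^PSCO-", wells, invert = TRUE, value = TRUE)

# Approach B: negate inside the regex with a negative lookahead.
by_negation <- grep("^(?!PSCO-).*", wells, perl = TRUE, value = TRUE)

# Both approaches select the same names.
same_result <- identical(by_invert, by_negation)
```

The inverted form is usually easier to read; the negated form is handy when the pattern has to be pasted into a tool that has no invert switch.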

option (1): change the first four characters of every well name to PSCO

# code

# figure

# code

option (2): replace only those two well names with the issue.

# code

# figure

# code
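The two options can be sketched side by side. The well names are hypothetical, and option (1) is read here as "overwrite whatever comes before the first hyphen", which is an assumption about the data rather than the original code.

```r
# Hypothetical well names; two have an extra "I" in the field abbreviation.
wells <- c("PSCO-M005-TS", "PISCO-R009-SS", "PISCO-R027-LS")

# Option (1): overwrite the field prefix of every name.
# The prefix is assumed to be everything before the first hyphen.
option1 <- sub("^[^-]+", "PSCO", wells)

# Option (2): touch only the names that have the issue.
option2 <- wells
has_issue <- grepl("^PISCO-", option2)
option2[has_issue] <- sub("^PISCO", "PSCO", option2[has_issue])
```

Option (1) is shorter, but it rewrites every row. Option (2) leaves correct names alone, which makes it easier to audit what was changed.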

correct the length of the well number

The field identifier in the well names has been corrected. Next is correcting the length of the well number.

# code

Alright. So far, we have corrected the field name in the well name. There are still three more wells to go, whose problems are:

# text

The well number should go from 000 to 999, right after the field identifier (one character).

# code

Replacing:

# code

Very good. Now we have one well left.

# code
# figure

If we had longer numbers we would modify the regex to:

# figure

Notice in this example that when extra zeros show up in the number (last line), they are removed from the string to fit the three-digit limit.
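A sketch of how surplus leading zeros could be dropped is shown below. The names are hypothetical and the regex is one possible form, not necessarily the one in the original figure. It assumes the platform letter is already present.

```r
# Hypothetical well names; two have too many leading zeros in the number.
wells <- c("PSCO-M005-TS", "PSCO-M0007-TS", "PSCO-M00026-TS")

# Capture the platform letter and the last three digits, and drop the
# surplus zeros in between. A correct 3-digit number does not match,
# because the pattern needs at least four digits.
wells_fixed <- sub("-([MQRS])0+([0-9]{3})-", "-\\1\\2-", wells)
```

This only removes extra zeros. A number with four significant digits, such as 1000, would be left as it is and would need a separate decision.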

Add the one-letter platform identifier to the well name

# code
# figure

# code

# code
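The following sketch shows one way the missing platform letter could be inserted. It assumes the letter can be taken from a Platform column on the same row; the data frame and its values are hypothetical.

```r
# Hypothetical data; the second well name lacks the platform letter.
wells <- data.frame(
  Wellname = c("PSCO-M005-TS", "PSCO-027-LS", "PSCO-Q001-SS"),
  Platform = c("M", "R", "Q"),
  stringsAsFactors = FALSE
)

# Names where the digits follow the field prefix directly.
no_letter <- grepl("^PSCO-[0-9]{3}-", wells$Wellname)

# Rebuild those names: prefix, platform letter, then the rest of the name.
rest <- substr(wells$Wellname[no_letter], 6, nchar(wells$Wellname[no_letter]))
wells$Wellname[no_letter] <- paste0("PSCO-", wells$Platform[no_letter], rest)
```

If the Platform value is itself missing for such a row, the name would need to be cross-referenced from another source first.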

Check if Company is correct

# code

Nothing is returned: all the company names are the same. Cool!
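A check of this kind could look like the sketch below. The company name and the data frame are hypothetical placeholders.

```r
# Hypothetical data; the company name is a placeholder.
wells <- data.frame(
  Wellname = c("PSCO-M005-TS", "PSCO-R027-LS", "PSCO-Q001-SS"),
  Company = c("Oil Gains Co.", "Oil Gains Co.", "Oil Gains Co."),
  stringsAsFactors = FALSE
)

expected_company <- "Oil Gains Co."

# Rows whose company is missing or differs from the expected name.
different <- wells[is.na(wells$Company) | wells$Company != expected_company, ]

# Zero rows means every company name is the same.
n_different <- nrow(different)
```

The explicit is.na test keeps missing values from slipping through, since comparing NA with a string yields NA rather than TRUE.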

Detect incorrect names and synonyms in Analyst

# code

We can correct these manually. In this example we will make use of the operator %in%, which is pretty handy for checking whether elements belong to a particular group.

# code

There is only one observation left, the one with NA. We will have to cross-reference it.
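The use of %in% can be sketched as follows. The analyst names and their misspellings are hypothetical.

```r
# Hypothetical analyst names with misspellings and one missing value.
analyst <- c("Aida", "Ibironke", "Ibironk", "Vivek", "Viveck", NA, "Aida")

# The names we consider valid.
valid <- c("Aida", "Ibironke", "Vivek")

# Distinct values that are not in the valid group.
unknown <- unique(analyst[!analyst %in% valid])

# Manual corrections: map each known misspelling to the right name.
analyst[analyst %in% c("Ibironk")] <- "Ibironke"
analyst[analyst %in% c("Viveck")] <- "Vivek"

# What is still outside the valid group after the corrections.
remaining <- analyst[!analyst %in% valid]
```

Unlike ==, the %in% operator never returns NA, so it is safe to use directly as a logical index even when the column has missing values.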

Find and replace incorrect and missing values in Field

# code

It has been fixed now.
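A sketch of this kind of replacement is shown below. It assumes all wells belong to a single field; the field name and the values are hypothetical.

```r
# Hypothetical Field column with a lowercase entry and a missing value.
field <- c("PISCO", "pisco", NA, "PISCO")

# Assumed single valid field name for every well.
valid_field <- "PISCO"

# Flag entries that are missing or differ from the valid name.
bad <- is.na(field) | field != valid_field

# Overwrite only the flagged entries.
field[bad] <- valid_field
```

With more than one field in the data set, the replacement value would have to be derived per row, for example from the well name.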

Add a column for the Completion type

To close this chapter, let’s add a new variable (column) that holds only the Completion Type. We can take advantage of the fact that the last two characters of the well name are the completion type.

Here we introduce two functions: nchar, which returns the number of characters in a string, and substr, which extracts part of a string between a start and a stop position.

# code

Let’s apply these two functions:

# code
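As a small illustration of nchar and substr working together, using a hypothetical data frame:

```r
# Hypothetical data with already corrected well names.
wells <- data.frame(
  Wellname = c("PSCO-M005-TS", "PSCO-R027-LS", "PSCO-Q001-SS"),
  stringsAsFactors = FALSE
)

# Length of each well name.
n <- nchar(wells$Wellname)

# The last two characters are the completion type.
wells$Completion <- substr(wells$Wellname, n - 1, n)
```

Using nchar instead of hard-coded positions keeps the extraction correct even if the names do not all have the same length.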

Replace values in Location

# code

Observe that in this example we are using the pattern [MQRS][0-9]{3}-[LTS]S together with the parameter invert=TRUE in grep. With invert=TRUE, grep returns the elements that do not match the pattern.

# code

If we wanted a regex for the negated pattern instead, it would have to look like this:

# figure

You see that the words matched are those which do not match the correct pattern.
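The replacement of invalid locations could be sketched as below. It assumes the Location is the well name without its field prefix, which is a guess based on the pattern; the data are hypothetical. The sketch uses grepl with ! in place of invert = TRUE so that missing values are handled explicitly.

```r
# Hypothetical data; two locations are invalid or missing.
wells <- data.frame(
  Wellname = c("PSCO-M005-TS", "PSCO-R027-LS", "PSCO-Q001-SS"),
  Location = c("M005-TS", "R027", NA),
  stringsAsFactors = FALSE
)

# A valid location: platform letter, 3 digits, hyphen, completion type.
pattern <- "^[MQRS][0-9]{3}-[LTS]S$"

# Rows where the location is missing or does not fit the pattern.
bad <- is.na(wells$Location) | !grepl(pattern, wells$Location)

# Assumption: the location is the well name from the 6th character onward.
wells$Location[bad] <- substr(wells$Wellname[bad], 6, nchar(wells$Wellname[bad]))
```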

Replace NA values in Platform

# code

Again, instead of using invert=TRUE in grep we could have used the negation of the pattern, which is:

# figure

What this regex does is match those words that do not contain a valid platform character.

# code
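One way to fill the missing platforms is sketched here. It assumes the platform letter is the sixth character of a corrected well name; the data are hypothetical.

```r
# Hypothetical data; one platform is missing.
wells <- data.frame(
  Wellname = c("PSCO-M005-TS", "PSCO-R027-LS", "PSCO-Q001-SS"),
  Platform = c("M", NA, "Q"),
  stringsAsFactors = FALSE
)

# Rows where the platform is missing or is not a single valid letter.
bad <- is.na(wells$Platform) | !grepl("^[MQRS]$", wells$Platform)

# Assumption: the 6th character of the well name is the platform letter.
wells$Platform[bad] <- substr(wells$Wellname[bad], 6, 6)
```

This depends on the well names having been corrected first, which is why the well name fixes come before the other columns.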

References

For the regex generation and testing I used these two useful websites: http://www.regextester.com/ and http://regexr.com

Links

R markdown notebook for part 5.3
PDF from the R markdown part 5.3
Previous article part 5.2
Follow me on Twitter fonzie@oilgains