It happens fairly often: we need to share some well data, but for confidentiality reasons we can’t. It could be for a paper, for colleagues, a conference, a lecture, an article; there are 1001 reasons. How can we share some well data without giving away our well names, platform, or field?
Note: this is a recurrent topic in medicine, genetics, bio-sciences, health care and other industries where privacy is paramount.
Fortunately, in the world of data science there are powerful tools that let us do just that. I will show here how to do it using R.
We start with an example. I have some well data that provides the inputs required for building well models: data such as API, specific gravity of gas, GOR, watercut, reservoir pressure, etc.
Read the well data

If we have that well data in a CSV file, we read it like this:
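A minimal sketch of the reading step. The real CSV file is not reproduced here, so the example writes a tiny made-up table to a temporary file first; the well names, field names, and values are hypothetical, while the column names Wellname, Field, and IPR_RESPRES come from the text.

```r
# Hypothetical stand-in for the real well file: three made-up wells.
sample_wells <- data.frame(
  Wellname    = c("WELL-A01", "WELL-A02", "WELL-B01"),
  Field       = c("FIELD-X", "FIELD-X", "FIELD-Y"),
  IPR_RESPRES = c(2500, 2450, 3100),
  stringsAsFactors = FALSE
)

# Write it to a temporary CSV so the example runs anywhere.
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_wells, csv_path, row.names = FALSE)

# Read the CSV into a dataframe; keep text columns as character, not factor.
well_data <- read.csv(csv_path, stringsAsFactors = FALSE)
str(well_data)
```

With your own data, only the read.csv line is needed, pointed at your file.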
And we get a table like this (on the left):
And on the right:
The data will be loaded into a dataframe called well_data.
Practice anonymizing a few wells

We start by reading one well first from the dataframe:
Which yields the well at the top or first observation:
Now, since we want to get a feel for how the anonymization works, let’s use the R package digest and apply it to the first well:
This is the result:
There you go: your first incognito well!
Take a look at digest. We pass it three arguments: the well name, the hashing algorithm and the type of serialization. Don’t pay too much attention to these terms. They will be clarified later with more examples.
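The call described above can be sketched as follows. The well names are made up, and serialize = FALSE is an assumption about the serialization setting: it makes digest hash the characters of the name itself rather than R's internal representation of the object.

```r
library(digest)

# Made-up well names standing in for well_data$Wellname.
well_names <- c("WELL-A01", "WELL-A02", "WELL-B01")

# Pick the first well, as in the text.
first_well <- well_names[1]

# Three arguments: the well name, the hash algorithm, the serialization flag.
first_id <- digest(first_well, algo = "crc32", serialize = FALSE)
first_id

# Hashing is deterministic: the same name always yields the same id.
identical(first_id, digest("WELL-A01", algo = "crc32", serialize = FALSE))
```

Because the mapping is deterministic, the same well gets the same anonymized name every time the script is run.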
Let’s do a couple more wells:
And, this is the last well in the dataframe:
So far, what we are seeing is that each of our well names maps to a hash name generated by the crc32 algorithm. In this example we are using a short hash, but digest offers many more hash algorithms that you can use to make your anonymization stronger.
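To illustrate the choice of algorithm, here is a small sketch that hashes one made-up well name with three of the algorithms digest supports. crc32 is a 32-bit checksum rather than a cryptographic hash, so it is short and readable but collides more easily; md5 and sha256 produce much longer identifiers.

```r
library(digest)

well <- "WELL-A01"  # made-up well name

# Hash the same name with three different algorithms.
algos <- c("crc32", "md5", "sha256")
hashes <- vapply(
  algos,
  function(a) digest(well, algo = a, serialize = FALSE),
  FUN.VALUE = character(1)
)

# Compare the length of each identifier.
nchar(hashes)
```

Longer hashes make accidental collisions between well names far less likely, at the cost of less convenient labels in tables and plots.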
Anonymizing entire objects

If you are anonymizing a few items, it is not a problem to do it manually, one by one. But what about when there are 10, or 20, or 100, or thousands of observations? Then we need something to automate or generalize the task. For that purpose, we use functions.
Our function anonymize is shown below. It takes an object x and applies digest to it using a hash algorithm (crc32).
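One way such a function could be written is sketched here; this is an illustration under stated assumptions, not necessarily the exact code used for the results in this article. It hashes each distinct value once with vapply and then looks the hashes up by name. The serialize = FALSE setting, the default algo argument, and the well names are assumptions.

```r
library(digest)

# Sketch of an anonymize function: one hash per element of x.
anonymize <- function(x, algo = "crc32") {
  x <- as.character(x)  # factors would otherwise index by integer code
  # Hash each distinct value once; the result is named by the original value.
  unique_hashes <- vapply(
    unique(x),
    function(value) digest(value, algo = algo, serialize = FALSE),
    FUN.VALUE = character(1),
    USE.NAMES = TRUE
  )
  # Look up the hash of every element, then drop the revealing names.
  unname(unique_hashes[x])
}

# Made-up well names; the first one appears twice.
well_names <- c("WELL-A01", "WELL-A02", "WELL-B01", "WELL-A01")
ids <- anonymize(well_names)
ids
```

Repeated names map to the same id, so the anonymized column can still be used to group or join records for the same well.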
Let’s make use of it.
So, now you have anonymized the 25 well names in one shot using a function.
Let’s look at the original well name column along with the anonymized name. We will add it to the well dataframe using the function apply:
What we see is that each of the original well names now has a corresponding anonymized well name, for all 25 wells.
Explaining the R code

Let’s explain a little bit of the R code before we go any further.

The object well_data is the dataframe, which is, in essence, a table of rows and columns, just like a spreadsheet. In our case, the rows are the wells or observations, and the columns correspond to the well variables.

The dollar sign $ that you see right after well_data is a separator between the dataframe object and the name of the variable or column. For instance, well_data$Wellname means "the variable Wellname that belongs to the table well_data". Another example: well_data$IPR_RESPRES is the reservoir pressure column of the table; and so on.

The assignment symbol <- works in the same way as the equal sign.

The apply function is one of the more powerful functions in R, like the other members of its family: sapply, lapply, vapply, tapply, mapply. If you look again at what we did above, we already used vapply and sapply.

The next part of the expression is well_data["Wellname"], which selects the same variable as well_data$Wellname, the variable Wellname of the dataframe well_data, but returns it as a one-column dataframe rather than a plain vector. The number 1 after the comma means work on the row dimension. And the last part of the expression, anonymize, is calling the function we built above. In other words, what we are telling R is “apply the function anonymize on the rows of the column Wellname and assign the result to the new column well_id of the dataframe well_data.”
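The pieces explained above can be tried on a toy table. In this sketch the data is made up and anonymize is a simplified, hypothetical stand-in that hashes a single value with crc32; the apply call follows the description in the text.

```r
library(digest)

# Toy dataframe with made-up wells.
well_data <- data.frame(
  Wellname    = c("WELL-A01", "WELL-A02", "WELL-B01"),
  IPR_RESPRES = c(2500, 2450, 3100),
  stringsAsFactors = FALSE
)

# Simplified stand-in: hash one value (the only element of a one-column row).
anonymize <- function(x) digest(x[[1]], algo = "crc32", serialize = FALSE)

# $ returns the column as a vector; [ ] with a name returns a dataframe.
is.character(well_data$Wellname)      # TRUE
is.data.frame(well_data["Wellname"])  # TRUE

# Apply anonymize over the rows (margin 1) and store the result
# in the new column well_id.
well_data$well_id <- apply(well_data["Wellname"], 1, anonymize)

# Show only the original and the anonymized names.
well_data[, c("Wellname", "well_id")]
```

For a single column, sapply(well_data$Wellname, anonymize) would give the same hashes; apply becomes more useful when a function needs several columns of each row.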
The next statement:
shows only two variables in the table: Wellname and well_id, the original and anonymized well names.
The next thing to do is to anonymize the other key columns in the table, those we don’t want to be associated with the numerical data to be shared: Company, Analyst, Field, Location and Platform. We will anonymize these variables too.
Anonymizing other key columns

Let’s start by moving well_id to be the first column. There are 17 columns after we added well_id.
The anonymized column we added before is at the end. Do you see it? We will move it to the start of the table like this:
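A sketch of the reordering on a toy table with made-up values. It moves the column by name rather than by position, which is an illustrative choice so that it keeps working if the number of columns changes.

```r
# Toy dataframe where well_id was appended last (made-up values).
well_data <- data.frame(
  Wellname = c("WELL-A01", "WELL-A02", "WELL-B01"),
  Field    = c("FIELD-X", "FIELD-X", "FIELD-Y"),
  well_id  = c("id-001", "id-002", "id-003"),
  stringsAsFactors = FALSE
)

# Put well_id first, followed by every other column in its current order.
other_cols <- setdiff(names(well_data), "well_id")
well_data <- well_data[, c("well_id", other_cols)]

names(well_data)
```

With the 17-column table from the text, the positional equivalent would be well_data[, c(17, 1:16)].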
Now, let’s anonymize the columns we don’t want anybody to see:
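A sketch of anonymizing several columns in one pass. The column names come from the text; the values are made up, and anonymize is a compact hypothetical version that hashes each element with crc32.

```r
library(digest)

# Compact stand-in for anonymize: one crc32 hash per element.
anonymize <- function(x, algo = "crc32") {
  vapply(
    as.character(x),
    function(value) digest(value, algo = algo, serialize = FALSE),
    FUN.VALUE = character(1),
    USE.NAMES = FALSE
  )
}

# Toy dataframe with made-up values for the sensitive columns.
well_data <- data.frame(
  Company     = c("COMPANY-1", "COMPANY-1", "COMPANY-1"),
  Analyst     = c("ANALYST-1", "ANALYST-2", "ANALYST-1"),
  Field       = c("FIELD-X", "FIELD-X", "FIELD-Y"),
  Location    = c("LOC-1", "LOC-2", "LOC-3"),
  Platform    = c("PLAT-1", "PLAT-1", "PLAT-2"),
  IPR_RESPRES = c(2500, 2450, 3100),
  stringsAsFactors = FALSE
)
original <- well_data  # kept only to compare before and after

# Replace each sensitive column with its hashed version.
cols_to_mask <- c("Company", "Analyst", "Field", "Location", "Platform")
well_data[cols_to_mask] <- lapply(well_data[cols_to_mask], anonymize)

well_data
```

The numerical columns are left untouched, and wells that shared a field or platform before still share the same hashed label afterwards.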
There it is! Our well data anonymized.
There is one last detail though. We have to remove the original Wellname column. Like this:
And as a last step, we save the anonymized well data in a new CSV file with:
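These last two steps can be sketched as follows. The table values and the output file name are hypothetical, and the file is written to a temporary directory so the example runs anywhere.

```r
# Toy anonymized table that still carries the original names (made-up values).
well_data <- data.frame(
  well_id     = c("a1b2c3d4", "b2c3d4e5", "c3d4e5f6"),
  Wellname    = c("WELL-A01", "WELL-A02", "WELL-B01"),
  IPR_RESPRES = c(2500, 2450, 3100),
  stringsAsFactors = FALSE
)

# Remove the original well names before sharing.
well_data$Wellname <- NULL

# Save the anonymized table; the file name is a hypothetical example.
out_path <- file.path(tempdir(), "well_data_anonymized.csv")
write.csv(well_data, out_path, row.names = FALSE)

# Read it back to confirm what a recipient would see.
# colClasses keeps well_id as text even if a hash happens to look numeric.
check <- read.csv(out_path, stringsAsFactors = FALSE,
                  colClasses = c(well_id = "character"))
names(check)
```

Forcing well_id to character on import matters because a hexadecimal hash made only of digits, or of digits around a single "e", would otherwise be read as a number.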
This is how the final table looks after being imported:
The notebooks, code and datasets can be found at this location in GitHub: