Text Mining for Petroleum Engineering: Modern Techniques for Paper Research

Alfonso R. Reyes
(24 November 2017)



Last week I was talking with a colleague about some projects that could be deployed using R, and then the topic of well technical potential came up. I got hooked by the term and then decided to explore it a little bit more.

What better way than going to the OnePetro website and searching for papers on “technical potential”? I did, and found 132 papers matching the term. Well, I was not going to purchase all 132 papers, you know. I had to reduce that number to an essential minimum. How?

That is the topic of today’s article. And for that purpose, we will be using text mining.

The toolbox

I will be using the following ingredients for this recipe. :)

  • The R package for text mining tm to convert the papers to a corpus and apply text mining functions.

  • The plotting package ggplot2 for showing a quick term comparison from the papers.

  • My R package petro.One to connect to OnePetro and retrieve the metadata generated by the query and put that in a dataframe.

  • RStudio as an Integrated Development Environment for R.

All of these tools are open source and free. The saying “free ride” has never been more true than it is today for studying, learning, and using state-of-the-art software. You will just have to invest some of your time.

How many papers are there?

Let’s find out how many papers contain the term “technical potential”. Here is the code in R:

# insert code

What the package petro.One does is connect to OnePetro and submit the text “technical potential” as a query to the website, specifying with the parameter how that we only want papers containing those two words together. We can read the number of papers from the object num_papers.
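A minimal sketch of that step; the helpers make_search_url() and get_papers_count() and their exact arguments are assumptions here, so check the petro.One documentation for the real signatures:

library(petro.One)

# build the OnePetro query URL; `how = "all"` is assumed here to mean
# that all the words of the query must appear together
my_url <- make_search_url(query = "technical potential", how = "all")

# number of papers OnePetro reports for this query (132 at the time of writing)
num_papers <- get_papers_count(my_url)
num_papers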

We read the titles of the papers and the rest of the metadata and put that in a dataframe df. So, we get 132 papers, as shown in the figure below:

# insert dataframe

Since the paper titles are too long I had to split the table in two to be able to show it here. The dataframe structure is essentially a table, very much looking like a spreadsheet with its rows and columns.
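Building that dataframe might look roughly like this, reusing my_url from the sketch above; the helper name onepetro_page_to_dataframe() is an assumption, so verify it against the package documentation:

# convert the query results (title and the rest of the metadata)
# into a dataframe, one row per paper
df <- onepetro_page_to_dataframe(my_url)

dim(df)          # expect 132 rows
head(df$title)   # a first look at the paper titles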

Selecting the papers

We will assume that the papers with the strongest “technical potential” content are those that have these words in the title. So, we will look first at the paper titles.

At this point, we could do two things:

  • retrieve the papers that have technical potential in the title.
  • perform an additional 1-word and 2-word term analysis on the papers that match our query. These are called n-grams.

We will first do some text mining on the papers that matched on the title.

# insert code

What grep does is find a word pattern in the column title of the dataframe df. grep is a very powerful function for matching any kind of text pattern, and you will find an equivalent in every modern programming language.
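As a sketch, the title filter could be written like this with base R, assuming df has a column named title:

# rows whose title contains the phrase "technical potential",
# ignoring upper/lower case differences
rows <- grep("technical potential", df$title, ignore.case = TRUE)

length(rows)        # how many titles match
df[rows, "title"]   # the matching titles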

There are 3 papers that match the pattern “technical potential”:

# insert dataframe

These three papers are our best candidates from the 132 papers that matched our initial query. Next is finding which of the three papers is the richest in “technical potential” content and focusing our attention on it.

What we will do is read inside the papers, the PDF files themselves. Let’s retrieve those 3 papers.

Data Mining the PDF files

Once the papers have been downloaded we will verify that they are in our working directory:

# insert code

That gives us three PDF files. Pay attention to the object files. We will use it to create the mining corpus in a moment. A corpus is a term used in text mining that is equivalent to saying “a body of text”, or “a body containing multiple bodies of text”. In our case, the corpus will contain three PDF files.
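A quick sketch of that check, assuming the three PDFs were saved in the current working directory:

# names of the PDF files in the working directory; this vector
# will feed the corpus in the next step
files <- list.files(pattern = "\\.pdf$")
files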

Read the PDF files and inspect the corpus

The following operation reads the papers, which are in PDF format. R can read PDF files through the function readPDF of the package tm.

The object papers is the corpus. The object papers.tdm is the term document matrix.
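A sketch of how those two objects could be built; readPDF() relies on an external pdftotext engine (xpdf or poppler), so that utility needs to be installed on the system, and the cleaning options below are typical choices, not necessarily the original ones:

library(tm)

# reader that extracts the text layer of a PDF file
pdf_reader <- readPDF(control = list(text = "-layout"))

# corpus: one document per PDF file found in the previous step
papers <- Corpus(URISource(files),
                 readerControl = list(reader = pdf_reader))

# term document matrix: rows are terms, columns are the three papers
papers.tdm <- TermDocumentMatrix(papers,
                                 control = list(removePunctuation = TRUE,
                                                removeNumbers    = TRUE,
                                                tolower          = TRUE,
                                                stopwords        = TRUE))
inspect(papers.tdm)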

This is the result:

# insert code

The table at the bottom is the term document matrix: a matrix where the rows are the terms, the columns are the papers, and each cell holds the count of a term in a paper.

Observe that each of the PDF files has an identifier like this [[1]], [[2]] and [[3]]. Take a look above at the number of characters each document contains.
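As a sketch, the identifiers and the character counts can be pulled out like this:

# inspecting the corpus prints one entry per document, including
# its number of characters
inspect(papers)

# double brackets select a single document; its content is a
# character vector with the lines of text extracted from the PDF
sum(nchar(content(papers[[1]])))   # total characters of the first paper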

Now, let’s get the number of words or terms:

# insert code

which gives us:

# insert code
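One way to get those counts from the term document matrix, as a sketch:

# number of distinct terms and number of documents in the matrix
nTerms(papers.tdm)
nDocs(papers.tdm)

# word count per paper: the sum of all term frequencies in each column
colSums(as.matrix(papers.tdm))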

A summary table of our initial findings:

# insert code

We can see that the document with the most content is gupta2017, the second is quong1982, and the third ruslan2014. But that’s not all; we will find something interesting later.

Find the most frequent terms in the papers

Now that the papers corpus has been converted to a term document matrix, we could continue with finding the most frequent terms:

# insert code

This is the result:

# insert code
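A sketch of that search with tm’s findFreqTerms(); the lowfreq threshold of 50 is only an example value, not necessarily the one used originally:

# terms that appear at least 50 times across the three papers
freq_terms <- findFreqTerms(papers.tdm, lowfreq = 50)

# frequency of those terms in each of the three papers
as.matrix(papers.tdm)[freq_terms, ]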

Observe the frequency of the terms. What is happening here is that even though “technical potential” is found in the title of the paper quong1982, the body of the paper is not rich in technical potential terms. The other two papers are stronger. We will put quong1982 aside and analyze only gupta2017 and ruslan2014.

Just to be sure, let’s check one more time: find the score of all the papers for the terms “technical” and “potential”.

This is the result:

Well, that confirms our initial analysis; quong1982 is not the best candidate paper for studying “technical potential”. Look at its counts: 4 for “technical” and 7 for “potential”.
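A sketch of how that score can be read straight off the term document matrix, assuming both terms survived the cleaning step:

# counts of the two terms of interest, one column per paper
as.matrix(papers.tdm)[c("technical", "potential"), ]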

Term frequency analysis for the first paper

What we do here is build a dataframe of terms and the frequency at which each occurs.

This is the result:

# dataframe
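A sketch of how that terms-versus-frequency dataframe could be built for the first paper; the column index 1 assumes the first column of the matrix corresponds to this paper:

# term frequencies of the first paper, sorted from most to least frequent
freq1 <- sort(as.matrix(papers.tdm)[, 1], decreasing = TRUE)

df_paper1 <- data.frame(term = names(freq1), freq = freq1,
                        row.names = NULL, stringsAsFactors = FALSE)
head(df_paper1, 10)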

Now, for the second paper:

# code

And this is the result:

# dataframe

We can see some differences between the two, such as the total number of terms or words and the frequency of each term; both the terms and their frequencies differ between the documents.

To finalize, let’s plot terms vs frequency for both papers:

# code
# plot
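One way such a comparison could be drawn with ggplot2, as a sketch; df_paper2 mirrors df_paper1 but is built from the second column of the matrix, and the generic panel labels stand in for the actual paper names:

library(ggplot2)

# same construction as df_paper1, but for the second paper
freq2 <- sort(as.matrix(papers.tdm)[, 2], decreasing = TRUE)
df_paper2 <- data.frame(term = names(freq2), freq = freq2,
                        row.names = NULL, stringsAsFactors = FALSE)

# keep the 20 most frequent terms of each paper and label the source
top_terms <- rbind(cbind(head(df_paper1, 20), paper = "paper 1"),
                   cbind(head(df_paper2, 20), paper = "paper 2"))

# horizontal bars, one panel per paper, each with its own set of terms
ggplot(top_terms, aes(x = freq, y = reorder(term, freq))) +
  geom_col() +
  facet_wrap(~ paper, scales = "free_y") +
  labs(x = "frequency", y = "term")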

In your opinion, which paper would be the best selection to start reading?

Conclusion

  • We rapidly determined, in our study of the term “technical potential”, which papers are the best candidates.

  • We used text mining, composing a document corpus and a term document matrix, to make that decision and narrow the search down from 132 papers to the acquisition of just 3.

  • From the three selected papers, we were able to put one aside because the other two were even richer in the content we were looking for.

  • The application of text mining techniques helps us narrow down our research to selected papers, saving time and other resources.

This doesn’t mean that we should discard reading or analyzing the other 130 papers, but doing a deeper analysis of all of them would require purchasing and downloading every one. If that option is viable, we may find that other papers also contain interesting content for the subject of our research. In the case of this study, for reasons of time and budget, we rapidly narrowed it down to just those two papers.

References

  • This is the website for this article. You can find the original files, code, datasets and figures at the GitHub repository.

  • Post also available through SPE-Connect.