R and Python commingled: Creating a PyTorch project with RPyStats. Season 1, Episode 5

Alfonso R. Reyes
(17 July 2019)

Online


In the previous episode we finished by calculating the accuracy of the MNIST digits model using PyTorch libraries called from an Rmarkdown notebook written in Python. In this episode, we will run the same example in another notebook, written in R, named *mnist_digits_rstats.Rmd*, which was saved in the folder ./work/notebooks in the previous session. With RStudio open, let’s click on the file to open it.

You may notice some similarity between the code in this notebook, *mnist_digits_rstats.Rmd*, written in R:

versus the code in the first notebook, *mnist_digits_python.Rmd*, written in Python, which we have already run:

Both look familiar, but they are not the same. Each language uses its own syntax and conventions, which make it unique and powerful. We will explain the details later. For now, we are interested in running the notebook in R.

Running the notebook works much the same way in R as in Python. Since we are still exploring, let’s continue running individual code chunks, starting by clicking on the green arrow of the first block:

Then we run the second chunk, which loads the PyTorch libraries, NumPy, and some Python built-in functions:

We don’t get any output in response because we haven’t explicitly asked the script to print anything. Then, we run the next chunk of code, where we indicate the size of the training and testing batches and the location of the training and testing datasets on disk. Note that the datasets are located in the user’s home directory, in the folder *~/mnist_png_full/*.
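
As a rough sketch of what such a configuration chunk amounts to on the Python side, the snippet below sets two batch sizes and builds the dataset paths. The batch size values and the sub-folder names `training` and `testing` are assumptions for illustration; only the root folder *~/mnist_png_full/* comes from the text above.

```python
from pathlib import Path

# Batch sizes are placeholder values; the notebook's real values are not shown here.
train_batch_size = 100
test_batch_size = 100

# "~" is expanded to the current user's home directory.
data_root = Path("~/mnist_png_full").expanduser()

# Sub-folder names are assumptions made for this sketch.
train_data_path = data_root / "training"
test_data_path = data_root / "testing"

# Plain assignments print nothing, so we print explicitly to see the result.
print(train_data_path)
print(test_data_path)
```

Without the two `print` calls this snippet would be silent, which is the same behaviour described above for the chunk in the notebook.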

After running the chunk, we still get no output because we are just assigning values to objects.

Now it is the turn of the fourth block, which loads the raw images from the training and testing datasets and assigns them to two objects: train_dataset and test_dataset.

This time we get some output: the number of images in the training and testing datasets, the class and type of the objects being loaded, and a description of the object *train_dataset*.

Continue running the individual chunks while observing the output and the comments of what that code does. Stop at line 325. We will pause to take a look at a piece of code.

The following chunk will serve us very well because it represents the confluence of Python and R. The class being defined here is named *LogisticRegressionModel*. This is the algorithm that will be used to train the model. The class inherits from another class, Linear, which is part of the torch.nn library. We just add one method to the class, forward(), and we are done. What is interesting here is that the machine learning class is written entirely in Python, not in R. I will not explain in this episode why it has to be that way. The class (code in green) is then assigned to the *main* object in R and immediately extracted from main into an R object named *LogisticRegressionModel*, the same name as its relative in Python. Note that we used the dollar sign *$* to extract an object from main. The *$* sign is how you extract objects in R, while in Python we use the dot, as in *nn.Linear*.
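
The pattern of "run a piece of Python source, then pull the class out by name" can be illustrated in Python alone. This is only a sketch under loud assumptions: the dictionary `main` stands in for the *main* object that R receives, the lookup `main["LogisticRegressionModel"]` plays the role that *$* plays in the R notebook, and the class itself is a torch-free stand-in that does not inherit from anything in torch.nn. It is not the code from the notebook.

```python
# The class lives in a plain string, as it would when handed to an
# embedded Python session from another language.
class_source = '''
class LogisticRegressionModel:
    """Torch-free stand-in: a single linear layer with zero weights."""

    def __init__(self, input_dim, output_dim):
        self.input_dim = input_dim
        self.output_dim = output_dim
        # Zero-initialised parameters keep the sketch deterministic.
        self.weights = [[0.0] * input_dim for _ in range(output_dim)]
        self.bias = [0.0] * output_dim

    def forward(self, x):
        # One score per output class: dot(weights_row, x) + bias
        return [
            sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(self.weights, self.bias)
        ]
'''

# Run the string in a fresh namespace that plays the role of `main`.
main = {}
exec(class_source, main)

# Extract the class from the namespace by name.
LogisticRegressionModel = main["LogisticRegressionModel"]

# Use it like any other class; attributes are reached with the dot.
model = LogisticRegressionModel(input_dim=784, output_dim=10)
scores = model.forward([0.0] * 784)
print(len(scores))
```

The point of the sketch is the two-step hand-off (define in a namespace, then extract by name), not the model itself.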

This block doesn’t print anything, but the next one will. Move your cursor to the next chunk and click on the green arrow, and you will get some output.

Besides indicating the size of the input and output, we are printing the settings of the class *LogisticRegressionModel*, which confirms that the number of features in the input is 784 (the number of pixels in a 28 by 28 matrix) and the number of features in the output is 10 (the digits 0 to 9).
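
The arithmetic behind those two numbers can be checked with a few lines of plain Python. The blank image below is a made-up stand-in for a real MNIST digit; the variable names are chosen for this sketch only.

```python
# A 28 by 28 image, represented as a list of rows of pixel values.
image_height, image_width = 28, 28
image = [[0] * image_width for _ in range(image_height)]

# Flattening the matrix row by row gives one long feature vector.
flat = [pixel for row in image for pixel in row]

input_dim = len(flat)         # one input feature per pixel
output_dim = len(range(10))   # one output per digit, 0 to 9

print(input_dim, output_dim)  # prints: 784 10
```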

In fact, we are only scratching the surface of the internals of Python and its interaction with R. We are skipping a lot of material for now, but you can already learn a lot just by comparing the notebooks in Python and R, chunk by chunk.

Our next stop, before closing this episode, is the last chunk, at line 409. Everything that has been prepared, defined, and set so far has had the purpose of making the following loops work: the epoch loop, the training dataset loop, and the testing dataset loop.
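
A bare skeleton of how three such loops can fit together is sketched below. It assumes the testing loop runs once per epoch, after the training loop, and it uses short lists of made-up batch names instead of real data loaders; no model is trained here.

```python
num_epochs = 5  # the notebook runs five epochs

# Tiny stand-ins for the real data loaders: each item represents one batch.
train_batches = ["train_batch_1", "train_batch_2", "train_batch_3"]
test_batches = ["test_batch_1", "test_batch_2"]

log = []
for epoch in range(1, num_epochs + 1):       # 1. the epoch loop
    for batch in train_batches:              # 2. the training dataset loop
        log.append(("train", epoch, batch))  # the model would learn here
    for batch in test_batches:               # 3. the testing dataset loop
        log.append(("test", epoch, batch))   # predictions would be scored here

print(len(log))
```

Each epoch visits every training batch first and every testing batch afterwards, so the log grows by the same amount in every epoch.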

The loop we are most interested in is the one in the middle because it iterates through all 60,000 handwritten digit images to learn from them.

You will find that the R code in this chunk is quite similar to its counterpart in Python. The difference is primarily in how we address the objects, such as the iterator train_loader, iter_train_dataset, and train_obj. Everything else is pure PyTorch with R notation.
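
In Python, a `for` loop hides the iterator, whereas a caller that does not use that syntax has to create the iterator and ask for the next item itself. The sketch below shows both forms side by side. Here `train_loader` is a plain list of hypothetical (images, labels) pairs, not a real PyTorch data loader, and how the R notebook actually drives its iterator is not shown in the text, so this only illustrates the idea.

```python
# Stand-in for a data loader: three batches of (images, labels).
train_loader = [
    (["img_a", "img_b"], [3, 7]),
    (["img_c", "img_d"], [1, 0]),
    (["img_e"], [9]),
]

# Form 1: the usual Python loop; the iterator is implicit.
labels_seen_for = []
for images, labels in train_loader:
    labels_seen_for.extend(labels)

# Form 2: an explicit iterator, advanced one batch at a time.
labels_seen_iter = []
iter_train_dataset = iter(train_loader)
while True:
    train_obj = next(iter_train_dataset, None)  # None signals exhaustion
    if train_obj is None:
        break
    images, labels = train_obj
    labels_seen_iter.extend(labels)

# Both forms visit the same batches in the same order.
print(labels_seen_for == labels_seen_iter)
```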

The other interesting loop is the inner loop, which calculates the accuracy after the training dataset has gone through an epoch.

The iteration over the testing dataset and the way we address the objects are similar to the loop above. But this time we are counting how many of the predictions from the trained model match the unseen digit images in the test dataset, and then calculating the percentage of correct predictions with respect to the total number of images in the test dataset. This calculation is performed at the end of every epoch. There are five epochs.
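
The accuracy calculation itself is simple counting. The following sketch uses invented predictions and labels for two small test batches; in the notebook the predictions come from the trained model and the labels from the test dataset.

```python
# Hypothetical (predicted, actual) digit labels for two test batches.
test_batches = [
    ([7, 2, 1, 0], [7, 2, 1, 4]),  # 3 of 4 correct
    ([4, 1, 4, 9], [4, 1, 4, 9]),  # 4 of 4 correct
]

correct = 0
total = 0
for predicted, actual in test_batches:
    total += len(actual)
    # Count the positions where the prediction equals the true label.
    correct += sum(p == a for p, a in zip(predicted, actual))

# Percentage of correct predictions over all test images.
accuracy = 100 * correct / total
print(f"{accuracy:.1f}%")  # prints: 87.5%
```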

If we run the chunk, it will take a few minutes and give us this output, which is very close to the results given by the Python Rmarkdown notebook.

One more thing we are going to do before closing this episode is print the notebook to an HTML file, a web page that you can view in a browser. To do this, we move our attention to the top part of the pane. Just below and to the right of the name of the notebook, there is a little blue icon with the label Knit.

Click on the little down arrow and you will see this menu pop up. Select the first option, “Knit to HTML”. Let’s give R a few minutes to perform the calculations and compile the notebook.

After about 3 minutes, roughly the time it takes to run the algorithm, you will get this window, which is the RStudio built-in browser. Click on the button that says “Open in Browser”.

Now, the page with the results will open in a new window in the default browser, giving you the possibility of immediately sharing your work via the web.

There are many more possibilities for publication, which is one of the strongest points of R and RStudio. You could print to a PDF, to a LaTeX or TeX file, to a Word file, or to a set of slides in a choice of formats, including PowerPoint. The Rmarkdown notebook has extraordinary powers, which do not end with the calculations in cells or chunks but extend to plenty of output formats. That is one of the greatest advantages of using RPyStats.

Example repository

https://github.com/f0nzie/rpystats-apollo11