In the previous episode I described what we can achieve with RPyStats. In this episode we will walk through a ready-to-run example that I have prepared so you can experiment at the deployment level, that is, one level above R and RStudio, including calls to Python libraries.
A master project
One thing that is different from the way we have been doing R or Python projects is that we add an extra layer of control above any other type of project: an R project, a Python project, an R package or a Python package. This layer is what we will call a master project. If you have a better name, please feel free to suggest one.
You create a master project with this rsuite command:
rsuite proj start -n rpystats-apollo11
The name of the project I chose is *rpystats-apollo11*.
How does a master project look? It will look like this:
Don’t worry. I will explain in detail how we get here.
This master layer is not an entirely new concept; advanced R users often do something similar, together with Unix *make*, to customize the build of a complex project. For years, though, it stayed out of reach of the average data scientist, until the rsuite package showed up in the R ecosystem.
If you are already familiar with RStudio projects, this new master project is nothing like them. Well, sort of. These are the items it has in common with an RStudio project:
- .gitignore
- .Rhistory
- the R folder
- the project itself, rpystats-apollo11.Rproj
- the tests folder
There is a .git folder as well. You don’t see it here because it is hidden. Every time you create a master project with rsuite, a Git version control folder, .git, is also created.
What is new in an RPyStats master project is this:
- A *config_temp.txt* file
- The deployment folder
- The logs folder
- The *packages* folder
- The *PARAMETERS* file
The deployment and logs folders are not meant to be modified by the user; the rsuite engine manages them once you send the commands to install the dependencies.
What is amazing, though, is seeing in real time how R packages are added to the deployment folder to satisfy the dependencies. None of these packages is installed in the global R environment of the physical machine; they belong only to the master project. The global environment remains unaltered and, conversely, any change in the global environment does not affect the master project or any component under its umbrella.
Everything mentioned above is enough to make a complex R project more manageable and put it within reach of all data scientists and engineers.
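The layout described above can be checked with a few lines of Python. This is only a sketch: the entry names are copied from the two lists in this section, the helper `missing_entries` is a hypothetical name of mine, and the demo runs against a throwaway directory rather than a real rsuite project.

```python
import tempfile
from pathlib import Path

# Names copied from the two lists in this section; adjust them if your project differs.
SHARED_WITH_RSTUDIO = [".gitignore", ".Rhistory", "R", "rpystats-apollo11.Rproj", "tests"]
MASTER_ONLY = ["config_temp.txt", "deployment", "logs", "packages", "PARAMETERS"]


def missing_entries(root, expected=SHARED_WITH_RSTUDIO + MASTER_ONLY):
    """Return the expected names that do not exist directly under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


# Demo: a temporary directory that imitates only the folders of the layout,
# so every expected file is reported as missing.
with tempfile.TemporaryDirectory() as tmp:
    for folder in ["R", "tests", "deployment", "logs", "packages"]:
        (Path(tmp) / folder).mkdir()
    demo_missing = missing_entries(tmp)

print(demo_missing)
```

Pointing `missing_entries` at the real *rpystats-apollo11* folder should return an empty list once the project has been created and the dependencies installed.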
Adding a taste of Python
When we send this command from the terminal, RSuite will start taking all the files it needs from your current Anaconda installation:
rsuite sysreqs install
After a successful completion, you will see a new folder: *conda*.
This *conda* folder is not a symbolic link or shortcut to a conda environment; it is a fully independent, standalone Python installation. This is how the master project gets to be 100% reproducible. The closest thing you will find is a Docker container.
This is how it looks when you start creating this standalone Python installation:
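If you want to convince yourself that a folder is a real directory and not a symbolic link, a sketch like the following will do. It checks only the "not a symbolic link" part of the claim, not whether the Python inside is complete. The helper name `is_real_directory` is hypothetical, and the demo uses a temporary folder, so it does not touch your project (creating symbolic links may need extra privileges on Windows).

```python
import os
import tempfile
from pathlib import Path


def is_real_directory(path):
    """True when path is a directory on disk and not a symbolic link to one."""
    p = Path(path)
    return p.is_dir() and not p.is_symlink()


# Demo: one real folder named "conda" and one symbolic link pointing to it.
with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "conda"
    real.mkdir()
    link = Path(tmp) / "conda_link"
    os.symlink(real, link, target_is_directory=True)
    real_ok = is_real_directory(real)
    link_ok = is_real_directory(link)

print(real_ok, link_ok)
```

In the master project you would call `is_real_directory("conda")` from the project root.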
R package dependencies
To make a master project, with its own projects and packages, fully independent of the global environment, we have to declare the required packages in one of the packages under the *packages* folder. To add this master package, or package provider, you first create it with:
rsuite proj pkgadd -n apollo11.pkg
The package *apollo11.pkg* is added under the packages folder:
To pull the R package dependencies, you have to be inside the master project folder and run this rsuite command:
rsuite proj depsinst
Here is a screenshot after issuing the command and its output:
Observe that rsuite installed 49 package dependencies. Where? Under the folder *deployment/libs*. This is a view of the folder.
The DESCRIPTION file under the package *apollo11.pkg* that makes this possible looks like this:
The keyword *Imports* takes care of the R packages, while *SystemRequirements* takes care of Python and its own packages.
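To inspect these two fields programmatically, here is a minimal sketch of a parser for the "field: value" layout that DESCRIPTION files use. The sample text is illustrative and is not the actual file from the repository: the package names are the ones used in this project, while the *SystemRequirements* value is a plain-text placeholder, since the exact syntax rsuite expects is not reproduced here. The function name `parse_description` is hypothetical.

```python
# Illustrative DESCRIPTION-like text. The SystemRequirements line is a
# placeholder, NOT the syntax rsuite requires; check the real file in the repository.
SAMPLE = """\
Package: apollo11.pkg
Imports:
    logging,
    abind,
    reticulate,
    testthat,
    dplyr,
    data.table,
    ggplot2
SystemRequirements: python=3.6.6 pytorch-cpu torchvision-cpu matplotlib pandas
"""


def parse_description(text):
    """Parse 'Field: value' lines; indented lines continue the previous field."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace() and current is not None:
            fields[current] += " " + line.strip()
        else:
            key, _, value = line.partition(":")
            current = key.strip()
            fields[current] = value.strip()
    return fields


fields = parse_description(SAMPLE)
imports = [name.strip() for name in fields["Imports"].split(",") if name.strip()]

print(imports)
print(fields["SystemRequirements"])
```

The same function can be pointed at the real file with `parse_description(open("packages/apollo11.pkg/DESCRIPTION").read())`, assuming that path matches your project.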
There are seven R packages (logging, abind, reticulate, testthat, dplyr, data.table, ggplot2) that we will use in this project, while in Python we go full machine learning with PyTorch and declare these requirements:
- Python 3.6.6
- pytorch-cpu
- torchvision-cpu
- matplotlib
- pandas
In this example we chose Python 3.6.6, but we could have specified another version, 3.7 or lower. If your computer has a GPU, you could change the keyword cpu accordingly. Notice also that numpy has not been declared explicitly because it is installed automatically along with PyTorch.
Warning: if you omit the keyword cpu, the GPU version of the packages will be installed. You may then get an error when you run Python or R scripts that call the GPU library on a computer that has no GPU.
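One way to guard against that error in your own scripts is to ask PyTorch whether a GPU is usable before selecting a device. The sketch below is an assumption about how you might structure that check, not something taken from the repository; `choose_device` and `detect_cuda` are hypothetical helper names, and the code falls back to the CPU when PyTorch is not installed at all.

```python
def choose_device(cuda_available):
    """Map the availability flag to a PyTorch device name."""
    return "cuda" if cuda_available else "cpu"


def detect_cuda():
    """Return True only when PyTorch is importable and reports a usable GPU."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter, so there is no GPU support.
        return False
    return torch.cuda.is_available()


device = choose_device(detect_cuda())
print(device)
```

With the cpu packages declared above, `torch.cuda.is_available()` returns `False`, so the script selects `"cpu"`.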
Building the child projects and packages
So, at this point we have installed the R dependencies and the Python dependencies, in addition to a fully standalone Python installation. The next step is to build the projects and packages under the master project’s control. We do this by issuing this command:
rsuite proj build
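The terminal commands used in this episode can be collected in a small driver script. This is a sketch under a few assumptions: the helper names are hypothetical, every command after the first is run from inside the project folder, and the package is created before the install steps because the requirements live in its DESCRIPTION file. Editing that DESCRIPTION file remains a manual step between `pkgadd` and the installs. The script only prints the commands unless you pass `dry_run=False`.

```python
import subprocess

PROJECT = "rpystats-apollo11"
PACKAGE = "apollo11.pkg"


def rsuite_steps(project=PROJECT, package=PACKAGE):
    """Return (command, working_directory) pairs for the steps in this episode."""
    return [
        (["rsuite", "proj", "start", "-n", project], "."),
        (["rsuite", "proj", "pkgadd", "-n", package], project),
        (["rsuite", "sysreqs", "install"], project),
        (["rsuite", "proj", "depsinst"], project),
        (["rsuite", "proj", "build"], project),
    ]


def run_steps(steps, dry_run=True):
    """Print each command and, when dry_run is False, execute it and stop on failure."""
    for command, cwd in steps:
        print(f"[{cwd}] {' '.join(command)}")
        if not dry_run:
            subprocess.run(command, cwd=cwd, check=True)


steps = rsuite_steps()
run_steps(steps)  # dry run: nothing is executed
```

Keeping the commands as data makes it easy to rerun a single step, for example `run_steps(steps[-1:], dry_run=False)` to rebuild only.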
You may have noticed that so far we have hardly touched RStudio. Not yet. All the commands have been sent from the terminal. If you are not familiar with the terminal or console, I think this is a good opportunity to exercise it a little. The terminal will definitely enhance your data science powers on whatever operating system you work with.
What’s next
What we have done so far is give a bird’s-eye view of how an RPyStats master project is built. What comes next is populating the master project with sub-projects and sub-packages. Examples are papers, theses, ebooks, articles, data analyses, reports, web applications, servers and clients, a blog, a machine learning algorithm demonstration, etc. Endless possibilities.
What you will be doing differently with #RPyStats is adding an extra layer of control and supervision above any R or Python project, making them play in concert.
Repository for this master project
Everything we covered in this episode can be found in this GitHub repository: https://github.com/f0nzie/rpystats-apollo11.
To reproduce the steps above, you will need to have installed:
- R 3.6.0 or above
- RStudio 1.2+
- Rtools 3.5+
- Anaconda3, 3.6+
- RSuite client
- rsuite package
Example repository
[https://github.com/f0nzie/rpystats-apollo11](https://github.com/f0nzie/rpystats-apollo11)
Links
- R 3.6.0: https://cran.r-project.org/bin/windows/base/old/3.6.0/
- RSuite client downloads: https://rsuite.io/RSuite_Download.php
- RStudio: https://www.rstudio.com/
- Anaconda Python: https://www.anaconda.com/distribution/#download-section
- Oil Gains Analytics GitHub repository: https://github.com/f0nzie?tab=repositories