R and Python commingled: how to get the best of both worlds. Season 1, Episode 1

Alfonso R Reyes
(12 July 2019)

Online


I confess. I have been in a long-term relationship with … Python. Sometimes it feels like 10+ years, other times like 15+ years, if I count my sporadic adventures with the language.

A few years ago, I finally dared to explore other universes and took the #rstats R route. I don’t regret it at all. It has been years of full productivity, challenges in learning the language, discovering its strong publishing tools (blogdown, bookdown, pkgdown, and the king of all: Rmarkdown), its science-oriented ecosystem, and, of course, making discoveries from data. Python has been a good, solid base for scripting solutions to the daily problems we engineers find when working with data.

I have dedicated the past few weeks to learning, building, testing and experimenting with this new paradigm of #RPyStats, proving that it works beyond the experiment bench.

I am laying out the background because today I am introducing a series of articles on a new way of doing data science: commingling R and Python.

This is what I am aiming to cover:

  • Establishing an integrated environment where R and Python share code and talk to each other, exchanging objects, variables and data structures.
  • Using Anaconda and RStudio as the base tools of the integration.
  • Explaining how we can benefit from both R and Python, with R as the receiver platform (the host) and Python as the transmitter (the secondary package provider).
  • Showing how to build a data science productivity platform with the R packages reticulate and rsuite to effectively talk to Python packages, modules, scripts and code chunks.
  • Educating the data scientist on how to achieve full reproducibility using the RPyStats paradigm, where packages and environment remain immutable (in a way, disconnected from the ever-changing global environment or physical machine) and where the updating process is controlled by the user. It will feel like having a Docker container without installing Docker!
  • Explaining how to handle R objects from Python and Python variables from R (r.data_table vs py$tensor).
  • Using the #rsuite package to prepare data science projects for deployment. You will get used to creating master projects instead of isolated projects or packages.
  • Building a data science master control project containing R and Python packages, source packages, binary packages and scripts, under Linux, Mac and Windows.
  • Learn, as in soccer, how to kick with both legs (left and right) in data science: (1) Using the terminal to build and deploy projects, and (2) using the RStudio GUI to view, edit, knit and organize multiple projects and packages.
  • Build an R+Python platform that will empower you to reproduce papers, articles and book chapters on data science and machine learning, making it easier to understand, and get introduced to, Artificial Intelligence applications.
  • Produce a tighter integration between R and Python by installing Python as an R SystemRequirement instead of a Conda environment. That will help you make reticulate and Windows issues go away.

It doesn’t have to be R vs Python anymore!

It is happening now! And it is a confluence of events.

To me, the current state of R and Python seems like an alignment of stars: RStudio’s commitment to integrating Python into its development workflow; the #reticulate package, which enables conversation between Python and R; #Anaconda opening decisively to software outside Python, making for a more stable #conda; a relatively new version of #RStudio (1.2) that allows #Rmarkdown notebooks combining Python code with #rstats chunks; and the ability to add a deployment layer on top of R and RStudio through the R package #rsuite, which makes it possible to install Python as a SystemRequirement of R.

To give you an example, the main driver for me to explore this new #RPython paradigm was the relatively weak presence of #machinelearning libraries in R. I particularly love the #PyTorch way of building #neuralnetworks, but it is not available in #rstats. That really motivated me to demonstrate that a reliable platform can be built that makes #datascience even more #reproducible, this time with two languages. Proof of that is the ~10 bookdown ebooks that I have been able to put together without being affected by changes in the global environment or physical machine, which are a common pain in R. They will just run and run.

Example repository

https://github.com/f0nzie/rpystats-apollo11