R and Python commingled: how to get the best of both worlds. Season 1, Episode 1

I confess. I have been in a long-term relationship with … Python. Sometimes it feels like 10+ years, other times like 15+ years, if I count my sporadic adventures with the language.

A few years ago, I finally dared to explore other universes and took the #rstats R route. I don’t regret it at all. These have been years of full productivity, of challenges in learning the language, of discovering its strong publishing tools (blogdown, bookdown, pkgdown, and the king of all: Rmarkdown) and its science-oriented ecosystem, and, of course, of making discoveries from data. Python has been a good, solid base for scripting solutions to the daily problems we engineers find when working with data.

I have dedicated the past few weeks to learning, building, testing and experimenting with this new paradigm of #RPyStats, proving that it works beyond the experiment bench.

I am laying out the background because today I will be introducing a series of articles on a new concept of doing data science: commingling R and Python.

This is what I am aiming to cover:

Establishing an integrated environment where R and Python share code, talk to each other, and exchange objects, variables and data structures.
Using Anaconda and RStudio as the base tools of the integration.
Explaining how we can benefit from R and Python, where R is the receiver platform (or host) and Python the transmitter (or secondary package provider).
Showing how to build a data science productivity platform with the R packages reticulate and rsuite to effectively talk to Python packages, modules, scripts and code chunks.
Educating the data scientist on how to achieve full reproducibility using the RPyStats paradigm, where packages and environment remain immutable (in some way, disconnected from the ever-changing global environment or physical machine) and where the updating process is controlled by the user. It will feel like having a Docker container without installing Docker!
Explaining how to handle R objects from Python and Python variables from R (r.data_table vs py$tensor).
Using the #rsuite package to prepare data science projects for deployment. You will start getting used to creating master projects instead of isolated projects or packages.
Build a data science master control project containing R and Python packages, source packages, binary packages and scripts, under Linux, Mac and Windows.
Learn, as in soccer, how to kick with both legs (left and right) in data science: (1) Using the terminal to build and deploy projects, and (2) using the RStudio GUI to view, edit, knit and organize multiple projects and packages.
Build an R+Python platform that will empower you to reproduce papers, articles and book chapters on data science and machine learning, making it easier to understand and get introduced to Artificial Intelligence applications.
Produce a tighter integration between R and Python by installing Python as an R SystemRequirement instead of a Conda environment. That will help you make reticulate and Windows issues go away.
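
To make the r.data_table vs py$tensor item a bit more concrete, here is a minimal sketch of what a Python chunk in an R Markdown notebook might look like. It is not taken from the series: the names data_table, column_means and means are made up for illustration, and a plain dict stands in for the R object so that the code runs without R. I am assuming the R side would hold a named list of numeric vectors, which reticulate hands to Python as a dict.

```python
# Hypothetical Python chunk for an R Markdown notebook run through reticulate.
# All names here (data_table, column_means, means) are illustrative assumptions.
from statistics import mean


def column_means(table):
    """Return the mean of every column in a dict of equal-length numeric lists."""
    return {name: mean(values) for name, values in table.items()}


# In a knitted document the input would come from the R session, for example:
#     data_table = r.data_table
# (assuming data_table is an R named list of numeric vectors).
# A plain dict stands in for it here so the sketch runs without R.
data_table = {"depth": [1.0, 2.0, 3.0], "rate": [10.0, 20.0, 30.0]}

# A later R chunk would read this result as py$means.
means = column_means(data_table)
print(means)
```

The direction of each prefix is the point: r. is how Python reaches into R, and py$ is how R reaches into Python.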

It doesn’t have to be R vs Python anymore!

It is happening now! And it is a confluence of events.

To me, the current state of R and Python seems like an alignment of stars: RStudio’s commitment to integrating Python in its development workflow; the presence of the #reticulate package, which enables conversation between Python and R; #Anaconda opening decisively to software outside Python, making for a more stable #conda; a relatively new version of #RStudio (1.2) that allows #Rmarkdown notebooks combining Python code with #rstats chunks; and the ability to add a new deployment layer on top of R and RStudio through the R package #rsuite, which makes it possible to install Python as a SystemRequirement of R.

To give you an example, the main driver for me to explore this new #RPython paradigm was the relatively weak presence of #machinelearning libraries in R. I particularly love the #PyTorch way of building #neuralnetworks, but it is not available in #rstats. That really motivated me to demonstrate how to build a reliable platform that makes #datascience even more #reproducible, but this time with two languages. Proof of that is the ~10 bookdown ebooks that I have been able to put together without being affected by changes in the global environment or physical machine, which is a common pain in R. They will just run and run.

Example repository

https://github.com/f0nzie/rpystats-apollo11
