I confess. I have been in a long-term relationship with … Python. Sometimes it feels like 10+ years, other times like 15+ years, if I count my sporadic adventures with the language.
A few years ago, I finally dared to explore other universes and took the #rstats R route. I don’t regret it at all. These have been years of full productivity: the challenge of learning the language, discovering its strong publishing tools (blogdown, bookdown, pkgdown, and the king of all: Rmarkdown), its science-oriented ecosystem, and, of course, making discoveries from data. Python has been a good, solid base from which to start scripting solutions to the daily problems we engineers find when working with data.
I have dedicated the past few weeks to learning, building, testing and experimenting with this new paradigm of #RPyStats, proving that it works beyond the experiment bench.
I am laying out the background because today I will be introducing a series of articles on a new concept of doing data science: commingling R and Python.
This is what I am aiming to cover:
- Establishing an integrated environment where R and Python can both share code together, talking to each other, sharing objects, variables and data structures.
- Using Anaconda and RStudio as the base tools of the integration.
- Explaining how we can benefit from R and Python, where R is the receiver platform (or host) and Python the transmitter (or secondary package provider).
- Showing how to build a data science productivity platform with the R packages reticulate and rsuite to effectively talk to Python packages, modules, scripts and code chunks.
- Educating the data scientist on how to achieve full reproducibility using this RPyStats paradigm, where packages and environment remain immutable (in some way, disconnected from the ever-changing global environment or physical machine) and where the updating process is controlled by the user. It will feel like having a Docker container without installing Docker!
- Explaining how to handle R objects and Python variables (r.data_table when reading an R object from Python vs py$tensor when reading a Python variable from R).
- Using the #rsuite package to prepare data science projects for deployment. You will start getting used to creating master projects instead of isolated projects or packages.
- Building a data science master control project containing R and Python packages, source packages, binary packages and scripts under Linux, Mac and Windows.
- Learning, as in soccer, how to kick with both legs (left and right) in data science: (1) using the terminal to build and deploy projects, and (2) using the RStudio GUI to view, edit, knit and organize multiple projects and packages.
- Building an R+Python platform that will empower you to reproduce papers, articles and book chapters on data science and machine learning, making it easier to understand them and to get introduced to Artificial Intelligence applications.
- Producing a tighter integration between R and Python by installing Python as an R SystemRequirement instead of a Conda environment. That will help you make reticulate and Windows issues go away.
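As a small taste of the r.data_table vs py$tensor item in the list above, here is a minimal sketch of the Python side of that exchange. It assumes the code runs as a Python chunk in an R Markdown notebook, where reticulate exposes R objects through a Python object named `r`. The stand-in data, the column names and the use of a plain list instead of a real tensor are my own placeholders, added only so the sketch also runs outside RStudio.

```python
from types import SimpleNamespace

try:
    # Inside an R Markdown Python chunk, reticulate provides `r`,
    # which gives access to objects living in the R session.
    r
except NameError:
    # Assumption for illustration only: a stand-in for the R session so the
    # sketch runs in a plain Python interpreter. The values are made up.
    r = SimpleNamespace(
        data_table={"well": ["A-1", "B-2"], "rate": [120.0, 95.5]}
    )

# Read an R object from Python: r.<name of the R object>
data_table = r.data_table

# Create a Python variable. A plain list stands in for a tensor here so that
# the sketch has no dependency on PyTorch.
tensor = [rate * 2 for rate in data_table["rate"]]
```

From an R chunk in the same notebook, the Python variable created above would then be read as `py$tensor`.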
It doesn’t have to be R vs Python anymore!
It is happening now! And it is a confluence of events.
To me, the current state of R and Python seems like an alignment of stars: RStudio’s commitment to integrating Python in its development workflow; the presence of the #reticulate package, which enables conversation between Python and R; #Anaconda opening decisively to software outside Python, making for a more stable #conda; a relatively new version of #RStudio (1.2) that allows #Rmarkdown notebooks of Python code combined with #rstats chunks; and being able to add a new layer for deployment on top of R and RStudio through the R package #rsuite, which makes it possible to install Python as a SystemRequirement of R.
To give you an example, the main driver for me to explore this new #RPython paradigm was the relatively weak presence of #machinelearning libraries in R. I particularly love the #PyTorch way of building #neuralnetworks, but it is not available in #rstats. That really motivated me to demonstrate building a reliable platform that makes #datascience even more #reproducible, but this time with two languages. Proof of that is the ~10 bookdown ebooks that I have been able to put together without being affected by changes in the global environment or physical machine, which is a common pain in R. They will just run and run.
Example repository
https://github.com/f0nzie/rpystats-apollo11
Links
- R package rsuite: https://rsuite.io/
- RSuite downloads: https://rsuite.io/RSuite_Download.php
- RStudio: https://www.rstudio.com/
- Anaconda Python: https://www.anaconda.com/distribution/#download-section
- R package reticulate: https://github.com/rstudio/reticulate
- R package Rmarkdown: https://rmarkdown.rstudio.com/
- R package bookdown: https://bookdown.org/yihui/bookdown/
- R package blogdown: https://bookdown.org/yihui/blogdown/
- R package pkgdown: https://pkgdown.r-lib.org/
- Python package PyTorch: https://pytorch.org/
- Python package TorchVision: https://pytorch.org/docs/stable/torchvision/index.html
- Python package numpy: https://www.numpy.org/
- Python package matplotlib: https://matplotlib.org/
- Python package pandas: https://pandas.pydata.org/
- Oil Gains Analytics GitHub repository: https://github.com/f0nzie?tab=repositories