
Any business embarking on a new big data analytics project has key questions to answer. Alongside the strategic ones, such as which use cases the solution will serve and what the expected ROI will be, there are several technology issues to address.


Central to this is the choice of programming language for developing big data applications. There is a range of options, but among the most popular are Python and R. Both have their champions and detractors, so choosing between them is tricky.


It could be argued that the choice is a matter of personal preference. Each has its own pros and cons, and either, used correctly, can help create powerful applications for analysing large quantities of data. But as with many things, it is rarely that simple.


The key use cases


Both Python and R are growing rapidly in popularity among enterprise users, but their histories differ, and this tends to dictate the use cases they are most commonly deployed for. R was originally designed by and for statisticians and has been very popular among researchers and academics, while Python is a general-purpose language with a greater emphasis on productivity and readability.


This means that Python is often seen as the easier choice when data analysis tasks need to be integrated with other applications, such as web apps or a production database. R, on the other hand, is mainly used for standalone analytics tasks, where its interactive mode lets analysts explore the data and refine their analysis step by step.
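As a rough illustration of that kind of integration, the sketch below reads rows from a database, summarises them and produces a JSON payload that a web app could return. It is a minimal example using only Python's standard library; the `orders` table, its columns, the sample figures and the `summarise_orders` helper are all hypothetical, and an in-memory SQLite database stands in for a real production system.

```python
import json
import sqlite3
import statistics


def summarise_orders(conn):
    """Return the mean and median order amount per region as a plain dict."""
    rows = conn.execute(
        "SELECT region, amount FROM orders ORDER BY region"
    ).fetchall()

    # Group the amounts by region.
    by_region = {}
    for region, amount in rows:
        by_region.setdefault(region, []).append(amount)

    return {
        region: {
            "mean": statistics.mean(amounts),
            "median": statistics.median(amounts),
        }
        for region, amounts in by_region.items()
    }


# Hypothetical data: an in-memory database standing in for production.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("north", 10.0),
        ("north", 20.0),
        ("north", 60.0),
        ("south", 5.0),
        ("south", 15.0),
    ],
)

summary = summarise_orders(conn)

# A web application could send this string back as an API response.
print(json.dumps(summary, indent=2))
```

The analysis step here is deliberately trivial. The point is that the database access, the calculation and the output format all sit in one short script, in one language.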


For companies just setting out on their big data journey, R may be easier to use in the early stages, as statistical models can be created with just a few lines of code. However, once processes become more advanced, Python’s wide range of features can be hugely beneficial when building algorithms for full production environments.


Usability and flexibility


How user-friendly each option is perceived to be, and how well it adapts to the particular needs of the business, will also be major factors in the decision. In this regard, some people consider Python to have the advantage, as it is a common, easy-to-understand language that many programmers will already be familiar with.


As a general-purpose language, Python is often regarded as easy and intuitive, with an emphasis on readability. By contrast, R has a steeper learning curve that can make it daunting for individuals coming to it for the first time. Without third-party tools to improve its performance, R can also be slow, which makes it less suitable for applications where fast results are needed.


On the other hand, R is a very mature language and interactive workbench with a rich, well-established ecosystem, and its strong graphical capabilities make visualising data much easier. Python is still playing catch-up in both these regards.


Taking a pragmatic approach


However, the two languages share many positives. For instance, both are distributed under open-source licences, meaning they are free for anyone to download and get started with. This also means there are many online communities developing advanced tools and offering support to users.


Ultimately, it’s impossible to say definitively whether R or Python is the better programming language for big data operations. Both have a wide range of features and quirks, so the decision should be based on what you intend to do, as well as the skills you already have.


So if, for instance, you’ve determined that Python is better suited to a certain task, but your data scientists already have some familiarity with R, you need to decide whether it would be better for them to stick to what they know rather than learn a new tool.


Both languages should be more than capable of handling most business big data tasks, so when evaluating your options, look at what skills you have, what’s already in use in your industry and what problems you need to solve.


And then there is Scala, but that is a whole other topic…