Big Data Analytics with R, Python and SAS on Hadoop

One of the side effects of the exploding world of analytic innovation is that taking advantage of the latest techniques often requires learning a new set of programming languages and tools.  Waiting for your favorite analytic tool vendor to catch up and provide an integrated solution isn’t usually an option.   Leading analytic teams will inevitably need to use multiple tools to support their business needs, so the best approach is to embrace diversity and create a flexible infrastructure that can operationalize models authored by a wide range of programming languages and tools.

Many studies have listed SAS, R and Python as the most widely used programming languages among big data analytics professionals, primarily because of their ability to take advantage of the enormous processing power and speed of the Hadoop environment for large data sets.

SAS, for example, has long been the undisputed market leader in the commercial analytics space. The software offers a huge array of statistical functions, has a good GUI that helps people learn quickly, and provides strong technical support. However, it ends up being an expensive option and is not always enriched with the latest statistical functions.

R, on the other hand, is the open source counterpart of SAS and has traditionally been used in academia and research. Because of its open source nature, the latest techniques become available in it quickly. There is a lot of documentation available on the internet, and it is a very cost-effective option.

Python originated as an open source scripting language, and its use for analytics has grown over time. Today it sports a large number of libraries and functions for almost any statistical operation or model-building task. Since the introduction of pandas, it has also become very strong at operating on structured data.

Python is used widely in web development, so if you are in an online business, using Python for both web development and analytics can provide synergies. SAS, on the other hand, used to have the big advantage of offering end-to-end analytics infrastructure (visual analytics, data warehousing, data quality, reporting and analytics); however, this advantage has recently been eroded by the integration and support of R on platforms like SAP HANA and Tableau (although R integration on these platforms is still far from the seamless integration SAS offers).

In this blog, we will briefly introduce R, Python and SAS and describe their integration with Hadoop for the purposes of Big Data Analytics.

R
R is perhaps the first choice of most data scientists and modelers, because it was designed for this very purpose. R has robust support for machine learning algorithms, and its library base keeps growing every day. There is hardly any discipline, be it biotechnology or geospatial processing, where a ready-made package is not available in R. R is also fairly easy to learn and has good tutorials and support available, though, being a free product, that support is often found in forums and examples rather than in formal documentation. R is also great for data and plot visualization, which is almost always necessary in data analysis.

Like Python, R is hugely popular (one poll suggested that these two open source languages were together used in nearly 85% of all Big Data projects) and supported by a large and helpful community. Where Python excels in simplicity and ease of use, R stands out for its raw number-crunching power. R is often called “a language for statisticians built by statisticians”. If you need an esoteric statistical model for your calculations, you’ll likely find it on CRAN, the Comprehensive R Archive Network. Its widespread adoption means you are probably executing code written in R every day, as it was used to create algorithms behind Google, Facebook, Twitter and many other services.

However, if you are not a data scientist and haven’t used MATLAB, SAS, or Octave before, it can take a bit of adjustment to become productive in R. While it is great for data analysis, it is less well suited to more general-purpose programming. You might construct a model in R, but you would consider translating it into Scala or Python for production, and you would be unlikely to write a clustering control system in the language.

R has thousands of extension packages that allow statisticians to undertake specialized tasks, including text analysis, speech analysis, and tools for the genomic sciences. The center of a thriving open-source ecosystem, R has become increasingly popular as programmers have created add-on packages for handling the big datasets and parallel processing techniques that have come to dominate statistical modeling today. Selected packages include:

  • Parallel, which helps R take advantage of parallel processing for both multicore Windows machines and clusters of POSIX (OS X, Linux, UNIX) machines.
  • Snow, which helps divvy up R calculations on a cluster of computers, which is useful for computationally intensive processes like simulations or AI learning processes.
  • RHadoop and RHIPE, which allow programmers to interface with Hadoop from R; this is particularly important for MapReduce’s pattern of dividing the computing problem among separate nodes and then recombining (reducing) the partial results into a single answer.

R is used in industries like finance, health care, marketing, business, pharmaceutical development, and more. Industry leaders like Bank of America, Bing, Facebook, and Foursquare use R to analyze their data, make marketing campaigns more effective, and improve their reporting.

Integrating Hadoop with R
Hadoop is the most important framework for working with Big Data. The best part of Hadoop is that it is scalable and can be deployed for any variety of data, whether structured, semi-structured or unstructured.

But to derive analytics from that data, Hadoop’s capabilities need to be extended, and this is where integration with R comes in. The most common ways in which Hadoop and R can be integrated are as follows.

RHadoop: an open source collection of packages provided by Revolution Analytics. It comes with three packages that can be readily used for R analysis and for working with data in the Hadoop framework:

  • The rmr package provides MapReduce functionality for the Hadoop framework, letting you write the map and reduce code in the R language;
  • The rhbase package gives you R database management capability through integration with HBase; and
  • The rhdfs package gives you file management capabilities through integration with HDFS.
RHIPE: an integrated programming environment developed as part of the Divide and Recombine (D&R) project for analyzing large amounts of data. RHIPE stands for R and Hadoop Integrated Programming Environment.

RHIPE lets you work with R and Hadoop in a single integrated programming environment. You can also use Python, Java or Perl to read data sets in RHIPE, and there are various functions in RHIPE that let you interact with HDFS.

ORCH: the Oracle R Connector for Hadoop, which can be used to work with Big Data on the Oracle Big Data Appliance or on non-Oracle frameworks like Hadoop.

The Oracle R Connector for Hadoop can be used to deploy R on the Oracle Big Data Appliance or on non-Oracle Hadoop frameworks with equal ease. ORCH lets you access the Hadoop cluster via R and write the mapping and reducing functions, and you can also manipulate data residing in the Hadoop Distributed File System.

HadoopStreaming: an R package available on CRAN that is intended to make R more accessible to Hadoop Streaming applications. Using it, you can write MapReduce programs in a language other than Java.

Hadoop Streaming lets you write MapReduce code in the R language, making it extremely user-friendly. Java may be the native language for MapReduce, but it is not well suited to the fast, exploratory data analysis that is needed today; this is where Hadoop Streaming comes into the picture, since the map and reduce steps can be written in scripting languages such as R, Python, Perl or even Ruby, with Hadoop Streaming passing the records between them over standard input and output. When you are doing data analysis with R, there is no need to write code at the command line interface: the data sits in the background, and all you need to do is create data, partition data and compute summaries.
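Because Hadoop Streaming just passes records over standard input and output, the map and reduce steps can be plain scripts in any of those languages. As a minimal, hedged sketch in Python (the classic word count; the streaming jar location and HDFS paths below are placeholders that vary by distribution):

    #!/usr/bin/env python
    # mapper.py - emit a (word, 1) pair for every word read from standard input
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py - Hadoop Streaming delivers mapper output sorted by key, so
    # counts for the same word arrive on consecutive lines and can be summed.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

The two scripts would then be submitted with something like hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py, and the same pattern works unchanged for R or Perl scripts.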

Python
Another free language, Python has great capabilities overall for general-purpose programming. Python works well for web scraping, text processing, file manipulation, and simple or complex visualizations. Compared to R, Python is still advancing in dealing with structured data and analytical models, and it does not support data visualization in as much detail as R. However, as opposed to R, Python is a traditional object-oriented language, so most developers will be fairly comfortable working with it, whereas a first exposure to R or Scala can be quite intimidating.

Python is one of the most popular open source (free) languages for working with the large and complicated datasets needed for Big Data. It has become very popular in recent years because it is both flexible and relatively easy to learn. Like most popular open source software it also has a large and active community dedicated to improving the product and making it popular with new users.

Python has been very popular in academia for more than a decade, especially in areas like Natural Language Processing (NLP). As a result, if you have a project that requires NLP work, you’ll face a huge number of choices, including the classic NLTK, topic modeling with gensim, or the blazing-fast and accurate spaCy. Similarly, Python scores well when it comes to neural networks, with Theano and TensorFlow; then there’s scikit-learn for machine learning, as well as NumPy and pandas for data analysis.
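As a small, hedged illustration of how compact that tooling can be, here is a toy topic-modeling run with gensim; the documents and topic count are invented for the example:

    # Toy topic modeling with gensim; the documents are invented for illustration.
    from gensim import corpora, models

    documents = [
        "hadoop stores big data across a cluster of machines",
        "mapreduce splits a job into map and reduce steps",
        "python and r are popular languages for data analysis",
        "statisticians use r for modelling and visualisation",
    ]

    # Tokenize naively and build the bag-of-words corpus gensim expects.
    texts = [doc.lower().split() for doc in documents]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # Fit a small LDA model and print the discovered topics.
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)
    for topic in lda.print_topics():
        print(topic)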

There’s Jupyter/IPython too, the web-based notebook server that allows you to mix code, plots, and, well, almost anything, in a shareable logbook format. This has been one of Python’s killer features, although these days the concept has proved so useful that it has spread to almost all languages that have a read-evaluate-print loop (REPL), including both Scala and R.

Python is a high-level language, meaning that its creators automated certain housekeeping processes in order to make code easier to write. Python has robust libraries that support statistical modeling (SciPy and NumPy), data mining (Orange and Pattern), and visualization (matplotlib). Scikit-learn, a library of machine learning techniques very useful to data scientists, has attracted developers from Spotify, OKCupid, and Evernote, but can be challenging to master.
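To give a feel for how these libraries fit together, here is a minimal, hedged sketch using NumPy, pandas and scikit-learn on a synthetic data set (the column names and the churn rule are invented for the example):

    # Minimal model-building sketch with NumPy, pandas and scikit-learn on synthetic data.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Fabricate a small structured data set in a pandas DataFrame.
    rng = np.random.default_rng(42)
    df = pd.DataFrame({
        "age": rng.integers(18, 70, size=500),
        "balance": rng.normal(1000, 300, size=500),
    })
    df["churned"] = (df["age"] + rng.normal(0, 10, size=500) > 45).astype(int)

    # Split into train/test, fit a logistic regression, and report held-out accuracy.
    X_train, X_test, y_train, y_test = train_test_split(
        df[["age", "balance"]], df["churned"], test_size=0.2, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))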

Integrating Hadoop with Python
Two of the most popular ways to integrate Hadoop with Python are Hadoopy and Pydoop.

Hadoopy: a Python wrapper for Hadoop Streaming written in Cython

Hadoopy is simple, fast, and readily hackable. It has been tested on 700+ node clusters.

The goals of Hadoopy are:

  • A similar interface to the Hadoop API (design patterns usable between the Python and Java interfaces);
  • General compatibility with dumbo, to allow users to switch back and forth;
  • Usable on Hadoop clusters without Python or admin access;
  • Fast conversion and processing;
  • Staying small and well documented;
  • Being transparent about what is going on;
  • Handling programs with complicated .so’s, ctypes, and extensions;
  • Code written for hackability;
  • Simple HDFS access (e.g., reading, writing, ls);
  • Supporting (and not replicating) the greater Hadoop ecosystem (e.g., Oozie, Whirr).

Killer features (unique to Hadoopy):

  • Automated job parallelization (‘auto-oozie’), available in the hadoopyflow project (maintained out of branch);
  • Local execution of unmodified MapReduce jobs with launch_local;
  • Reading/writing of TypedBytes sequence files directly to and from HDFS from Python (readtb, writetb);
  • Printing to stdout and stderr in Hadoop tasks without causing problems (via the ‘pipe hopping’ technique; both are available in the task’s stderr);
  • Working on clusters without any extra installation, Python, or Python libraries (using the PyInstaller included in the source tree).
Hadoopy is used in several use cases and tools, such as in searching computer help using screenshots and keywords, in web-scale computer vision using MapReduce for multimedia data mining, in Vitrieve (a visual search engine) and in Picarus (a Hadoop computer vision toolbox).
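To show what that looks like in practice, here is a hedged sketch following the word-count pattern from Hadoopy’s documentation (Hadoopy predates Python 3, and the HDFS paths and launch call below are illustrative, not prescriptive):

    # wc.py - a Hadoopy job: hadoopy.run wires the mapper and reducer into Hadoop Streaming.
    import hadoopy

    def mapper(key, value):
        # value is one line of text; emit a (word, 1) pair per word
        for word in value.split():
            yield word, 1

    def reducer(key, values):
        # sum all counts emitted for this word
        yield key, sum(values)

    if __name__ == '__main__':
        hadoopy.run(mapper, reducer)

    # driver.py - write input to HDFS, launch the frozen job, read the results back.
    import hadoopy

    input_path, output_path = '/tmp/wc_in', '/tmp/wc_out'   # placeholder HDFS paths
    hadoopy.writetb(input_path, enumerate(['big data with python', 'python on hadoop']))
    hadoopy.launch_frozen(input_path, output_path, 'wc.py')  # freezing means the cluster needs no Python installed
    print(dict(hadoopy.readtb(output_path)))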

Pydoop: Writing Hadoop Programs in Python

Installed as a layer above Hadoop, the open-source Pydoop package enables Python scripts to do big data work easily. Pydoop Script enables you to write simple MapReduce programs for Hadoop with mapper and reducer functions in just a few lines of code. When Pydoop Script isn’t enough, you can switch to the more complete Pydoop API, which provides the ability to implement a Python Partitioner, RecordReader, and RecordWriter. In addition, Pydoop makes it easy to interact with HDFS (Hadoop Distributed File System) through the Pydoop HDFS API (pydoop.hdfs), which allows you to retrieve information about directories, files, and several file system properties. The Pydoop HDFS API makes it possible to easily read and write files within HDFS from Python code. In addition, the lower-level API provides features similar to the Hadoop C HDFS API, so you can use it, for example, to build statistics of HDFS usage.
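As a hedged sketch of both pieces, following the word-count pattern in Pydoop’s documentation (the HDFS paths are placeholders):

    # wordcount.py - a Pydoop Script job: Pydoop supplies the surrounding MapReduce plumbing,
    # so only the mapper and reducer functions need to be written.
    def mapper(key, value, writer):
        # value is one input line; emit (word, 1) for every word
        for word in value.split():
            writer.emit(word, 1)

    def reducer(word, counts, writer):
        # counts is an iterable of the values emitted for this word
        writer.emit(word, sum(map(int, counts)))

The job would be submitted with something like pydoop script wordcount.py input_dir output_dir, and the HDFS module can also be used on its own:

    # Quick HDFS round trip with the pydoop.hdfs module (paths are placeholders).
    import pydoop.hdfs as hdfs

    hdfs.dump("hello from pydoop", "/tmp/pydoop_demo.txt")  # write a small file to HDFS
    print(hdfs.load("/tmp/pydoop_demo.txt"))                # read it back
    print(hdfs.ls("/tmp"))                                  # list the directory contents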

Pydoop might not be the best API for all Hadoop use cases, but its unique features make it suitable for a wide range of scenarios and it is being actively improved.

SAS – Statistical Analysis System
SAS is one of the most common tools out there for data processing and model development. When the analytics function started emerging in the financial services sector a couple of decades ago, SAS became the common choice because of its simplicity and the wealth of support and documentation behind it. SAS comes in handy both for step-by-step data processing and for automated scripting. All is well, except that SAS wasn’t and isn’t cheap. SAS also has limited visualization capabilities and, out of the box, almost no support for parallelization.

Integrating Hadoop with SAS
We will now describe the integration between SAS and the major Big Data Hadoop platforms, such as Hortonworks, Cloudera and MapR. We will also discuss some challenges faced while deploying SAS on the Hortonworks platform, and highlight best practices when using SAS on Hortonworks and Hive.

SAS developers and architects can easily access and manage Hadoop data. For example, data ingestion is made easy by using SAS tools such as SAS Data Loader for Hadoop or SAS code. Developers can easily access and integrate data from Hadoop, perform SAS processing on the Hadoop cluster by using MapReduce, or lift data from HDFS into memory and perform massively parallel distributed data processing. They can also perform complicated analytical calculations and use interactive exploratory analytics.

SAS Modules
In this architecture, SAS consists of the following five modules (Figure 1): user interface, metadata, data access, data processing, and HDFS file system storage.

Figure 1. SAS modules

SAS Data Access
Hadoop data can be accessed in three ways (Figure 2): by using SAS/ACCESS to Hadoop (i.e. “from” Hadoop), by using the SAS Embedded Process (SAS In-Database) accelerator (i.e. “in” Hadoop), or by using SAS In-Memory data access (i.e. “with” Hadoop).

Figure 2. Accessing Hadoop data with SAS

SAS Data Processing
Data processing in SAS involves four steps (Figure 3): accessing and managing data, exploring and visualizing data, analyzing and modeling data, and deploying and integrating data.

Figure 3. SAS data processing

SAS users can access HDFS or Hive data using the SAS/ACCESS engine (over MapReduce or Tez) and SAS/ACCESS to Impala. They can also read, write and integrate data to and from Hadoop using SAS Data Loader for Hadoop.

SAS users interactively explore and visualize Hadoop data using SAS Visual Analytics, which allows exploring all types of data stored in Hadoop and sharing results quickly via web reports, mobile devices or MS Office products.

SAS users analyze and model data using modern statistical and machine-learning methods to uncover patterns and trends in the Hadoop data. SAS Visual Statistics provides an interactive, intuitive web-based interface for building both descriptive and predictive models on data of any size. SAS High-Performance Analytics provides in-memory capabilities so that users can develop a superior analytical model using all of the data. This is achieved by loading data into memory and running complex analytical models on the distributed in-memory data to get results quickly.

SAS users can automatically deploy and score analytical models in Hadoop’s parallel environment by using the SAS Scoring Accelerator for Hadoop. The accelerator minimizes time to production for predictive models by automating model deployment in Hadoop. It also reduces data movement by bringing the scoring process to the Hadoop data, and produces faster answers by using Hadoop’s distributed processing. In addition, the SAS Event Stream Processing Engine puts intelligence into this process by continuously querying data streams to examine and filter data, even if millions of events are processed every second. By analyzing data as it arrives, users can detect anomalies and uncover useful data more quickly, and publish the filtered data to their Hadoop cluster.

Challenges during installation of SAS on Hortonworks
Make sure that Hortonworks Ranger is disabled while installing SAS on Hortonworks. KMS has to be installed manually, as Ranger KMS is not supported by SAS. Scripts to generate the audit report have to be written manually, and permissions have to be granted to users and/or groups manually as well. Make sure that auto-Kerberos is enabled for SAS users, as this will generate the Kerberos keytabs when they log in to the cluster. Finally, ask the system administrator to open the SAS ports so that access is not blocked by the firewall.

SAS on Hortonworks Best Practices
Make sure that the SAS Embedded Process is installed on the Hadoop cluster. This makes it possible for DATA step code to be executed in parallel against input data residing on HDFS, so that the DATA step runs inside Hadoop. To enable this, set the DSACCEL system option to ANY.

Using SAS in-database processing, we can run scoring models, SAS procedures, DS2 thread programs, and formatted SQL queries directly inside the data store, eliminating unnecessary data movement.

DS2 takes advantage of threaded processing, so it can bring significant performance gains when run directly inside distributed databases via the SAS In-Database Code Accelerator. We can run a DS2 program directly against a data source using the SAS In-Database Code Accelerator (multi-core processing on multiple nodes). Most SAS Embedded Process processing triggers MapReduce jobs, but with the latest release of SAS Data Loader for Hadoop, some specific data integration and data quality operations can be executed using Apache Spark instead of the MapReduce framework.

Apache Tez is another data processing framework available in Hortonworks distributions. It can be used as an alternative to MapReduce and is used by Hortonworks to dramatically improve on MapReduce’s speed while maintaining its ability to scale to petabytes of data.

We can use SASHDAT, the proprietary SAS High-Performance format, whose purpose is to make operations between data in memory and data in HDFS as fast as possible. It provides the best performance for the following tasks: loading HDFS data into the SAS LASR Analytic Server, saving a SAS LASR Analytic Server table to HDFS, and running a high-performance procedure (such as PROC HPDS2, HPSUMMARY, HPLOGISTIC, HPREG or HPGENSELECT) on data stored in HDFS.

Another recommended best practice is to run a simple CREATE TABLE AS SELECT (CTAS) in PROC SQL directly in Hadoop for cases like these. Using the DBIDIRECTEXEC option forces specific operations (table creation or deletion) to be passed down to the DBMS.

SAS on Hive Best Practices
Avoid merging SAS data with Hive data on the SAS side. It is recommended that you load the SAS data set into a Hive table and run the merge inside Hive, to leverage distributed processing and avoid network traffic between the SAS workspace server node and the Hadoop cluster.

The ORDER BY clause on a large table is very costly in Hive. SAS code optimization can avoid the use of the ORDER BY statement in Hadoop.

SAS can work with the ORC file format in Hive, with its default ZLIB compression. The SAS option that lets you create a table in an alternative format is DBCREATE_TABLE_OPTS=’stored as orc’ in the LIBNAME statement.

Data partitioning is often used with traditional database management systems to improve query performance, and Hive supports partitioned data with the same benefit. It is recommended that you partition on low-cardinality variables that are used often in queries.
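As a hedged illustration of the idea (not part of the SAS workflow itself), the HiveQL below creates a table partitioned on a low-cardinality column; the third-party PyHive client is used here only as a convenient way to submit the statements, and the host, port and table names are placeholders:

    # Hedged sketch: create a Hive table partitioned on a low-cardinality column.
    # PyHive is used only to submit the HiveQL; host/port/table names are placeholders.
    from pyhive import hive

    conn = hive.Connection(host="hive-server.example.com", port=10000)
    cursor = conn.cursor()

    # Partition on 'country': few distinct values, frequently used in query filters.
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS sales (
            order_id BIGINT,
            amount   DOUBLE
        )
        PARTITIONED BY (country STRING)
        STORED AS ORC
    """)

    # A query that filters on the partition column only scans the matching partitions.
    cursor.execute("SELECT count(*) FROM sales WHERE country = 'US'")
    print(cursor.fetchall())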