The Brite Group has been accredited as a CMMI Level 2 Organization!
The Brite Group Inc. has achieved a successful appraisal rating of Level 2 for Services by Capability Maturity Model Integration (CMMI).
The Brite Group Inc. has achieved a successful appraisal rating of Level 2 for Services by Capability Maturity Model Integration (CMMI).
U.S. government agencies, along with private corporations, are increasingly adapting cloud computing for data management. For many years organizations have struggled with traditional architectures due to high costs and ongoing maintenance. Companies such as Amazon, IBM, and Google, to name a few, are working closely with government agencies to help them accomplish their data management goals, by offering secured cloud services. With that comes a rigorous FedRAMP authorization that allows companies to handle government data in the cloud.
The Federal Risk and Authorization Management Program, or FedRAMP, is a government-wide program that provides a standardized approach to security assessment, authorization, and continuous monitoring for cloud products and services. In recent years many products have gone through the FedRAMP process. Currently, there are approximately 90 authorized FedRAMP products and 62 are in the process of authorization.
The US government has long prioritized electronic use of data. The 2012 Presidential Memorandum – Managing Government Records – called for a digital transition in government. Additionally, the memorandum called for research on the use of automated technologies to “reduce the burden of records management responsibilities”. Demands for data management and data analysis are growing rapidly, and government agencies need an architecture that can be cost efficient, energy efficient, scalable, fast, and secured.
Foundations for Data Management Architecture
Data integration helps government agencies gather data residing in different sources to provide business users with a unified view of their business domain. In recent years, government agencies have adopted an integrated architecture to improve data collection and process automation.
Data quality management and Metadata management are two other key processes government agencies are incorporating in their organizations. Data quality management empowers an agency to take a holistic approach to managing data quality across the enterprise. It proactively monitors and cleanses data across the enterprise to maximize the return on investment in data. Lastly, Metadata management processes collect metadata from a data integration environment and provide a visual map of the data flows within the environment. These three processes (data integration, data quality management and metadata management) provide a solid foundation to having a great data management architecture. Informatica and Talend, for example, are some of the companies that for many years have proven experience helping government agencies achieve business goals with data management products.
Cloud Data Management Principles
Cloud data management is a way to manage data across cloud platforms, either together with an on-premises storage infrastructure or without it. The cloud typically serves as a data storage tier for disaster recovery, backup, and long-term archiving.
With data management in the cloud, resources can be purchased as needed. Data can also be shared across private and public clouds, as well as with the on-premises storage infrastructure. While some platforms can manage and use data across cloud and on-premises environments, cloud data management takes into account that the data stored on-premises and the data stored in the cloud can be subject to completely different practices.
Indeed, data stored in the cloud has its own rules for data integrity and security. Traditional data management methods may not apply to the cloud, so having management in place designed for the particular requirements of the cloud is vital.
Typical Cloud Data Management components are the following:
1. Automation and orchestration, including services for application migration, provisioning and deploying virtual machines images and instances, and configuration management;
2. Cost management, including services for cloud instance right sizing and user chargeback and billing;
3. Performance monitoring of the compute, storage, networking and application infrastructure;
4. Security, including services for identity and access management (IAM), encryption, and mobile/endpoint security; and
5. Governance and compliance, including risk assessment/threat analysis, audits, and service and resource governance.
The benefits of using cloud data management include consolidation of processes such as backup, disaster recovery, archiving and analytics, as well as cost savings. Some cloud data management companies also offer ransomware protection, by keeping data and applications native to the platform in a secure, immutable format.
Best Practices for Cloud Data Management
The journey to the cloud can take many forms and follow diverse paths. An increasing number of organizations are making strategic commitments to the cloud as a preferred computing platform. These commitments involve a wide range of use cases, from operations to analytics to compliance. For example, the survey for TDWI’s Emerging Technologies Best Practices Report revealed that many enterprises already have cloud-based solutions for data warehousing (35% of respondents), analytics (31%), data integration (24%), and Hadoop (19%).
Organizations may move their entire application portfolio or just a single application to the cloud. However, the best practices for data management don’t go away, in fact, they are more important than ever. As more organizations begin their journey to the cloud, they need to plan how they will apply the best practices of data management to ensure that cloud-based, data-driven use cases are successful for end users while also complying with enterprise governance and data standards. The good news is that existing best practices work well in cloud environments, although adjustments and upgrades to existing skills and tool portfolios are usually needed.
Here are some of the key best practices to consider when deploying Cloud Data Management:
· Identify cloud use cases carefully;
· The cloud should not be treated separately, rather the data should be managed and governed holistically, regardless of the data’s platform or location;
· Substantial metadata management infrastructure is a must before deploying to the cloud;
· Choose data integration platforms carefully with cloud in mind by giving priority to data integration requirements for clouds;
· Data management should be designed for hybrid platforms and not entirely for on-premise or cloud platforms; and
· Organizational changes (skills, processes, and people) before embracing the cloud is vital.
Review of the cloud based data management services providers
Rubrik, which brands itself as “the Cloud Data Management Company,” is considered a major cloud data management player. The vendor’s Cloud Data Management scale-out platform uses a single interface to manage data across public and private clouds. In addition to offering ransomware recovery, Rubrik’s platform is the first one to support hybrid cloud[A2] environments.
Products that have traditionally excelled in on-premise implementations of data management solutions have also been adapting[A3] their services for the cloud. For example, Informatica offers robust cloud data management services, such as integrated contact verification, cloud test data management, customer, product and supplier 360 cloud services, and cloud data quality radar. Tableau Online provides self-service analytics in the cloud, and allows users to share and collaborate by interacting, editing and authoring on the web, to connect and access data from a variety of cloud databases, and to stay secure in the cloud. Talend, on the other hand, offers two types of products for cloud data management: a fully functional, open source data management solution offering data integration, data profiling and master data management; and a subscription-based data management solution for organizations with enterprise-scale data management needs.
Other vendors in the cloud data management space include Oracle Cloud, which offers a comprehensive data management platform for traditional and modern solutions that combine a diverse set of workloads and data types; Commvault, which has a cloud data management platform that supports multiple clouds and on-premises data; and VMware and its vRealize Suite, which delivers cloud data management for hybrid cloud environments. Komprise and Red Hat’s Cloud Forms platform also focus on cloud-based data management.
One of the side effects of the exploding world of analytic innovation is that taking advantage of the latest techniques often requires learning a new set of programming languages and tools. Waiting for your favorite analytic tool vendor to catch up and provide an integrated solution isn’t usually an option. Leading analytic teams will inevitably need to use multiple tools to support their business needs, so the best approach is to embrace diversity and create a flexible infrastructure that can operationalize models authored by a wide range of programming languages and tools.
Many studies have listed SAS, R and Python as the most widely used programming languages among the big data analytics professionals, primarily because of their ability to take advantage of the enormous processing power and speed of the Hadoop environment for large data sets.
SAS, for example, has been the undisputed market leader in commercial analytics space. The software offers huge array of statistical functions, has good GUI for people to learn quickly and provides awesome technical support. However, it ends up being an expensive option and is not always enriched with latest statistical functions.
R, on the other hand, is the open source counterpart of SAS, which has traditionally been used in academics and research. Because of its open source nature, latest techniques get released quickly. There is a lot of documentation available over the internet and it is a very cost-effective option.
With its origination as an open source scripting language, Python analytics usage has grown over time. Today, it sports a large number of libraries and functions for almost any statistical operation/model building. Since introduction of pandas, it has also become very strong in operations on structured data.
Python is used widely in web development, so if you are in an online business, using Python for web development and analytics can provide synergies. SAS, on the other hand, used to have a big advantage of deploying end-to-end infrastructure analytics solutions (visual analytics, data warehouse, data quality, reporting and analytics), however this has recently been mitigated by integration/support of R on platforms like SAP HANA and Tableau (although the R integration with these platforms is still away from the seamless integration of SAS).
In this blog, we will briefly introduce R, Python and SAS and describe their integration with Hadoop for the purposes of Big Data Analytics.
R is perhaps the first choice of most data scientists and modelers because R was designed for this very purpose. R has robust support for machine learning algorithms and its library base keeps on growing every day. There will hardly be any discipline – be it bio-technology or geo-spatial processing – where a ready package will not be available in R. R is also fairly easy to learn and has good tutorials and support available – though being free product, support is often in forums and examples rather than in documentations. R is also great for data and plot visualizations, which is almost always necessary for data analysis.
Like Python, R is hugely popular (one poll suggested that these two open source languages were between them used in nearly 85% of all Big Data projects) and supported by a large and helpful community. Where Python excels in simplicity and ease of use, R stands out for its raw number crunching power. R is often called “a language for statisticians built by statisticians”. If you need an esoteric statistical model for your calculations, you’ll likely find it on CRAN — its Comprehensive R Archive Network. Its widespread adoption means you are probably executing code written in R every day, as it was used to create algorithms behind Google, Facebook, Twitter and many other services.
However, if you are not a data scientist and haven’t used Matlab, SAS, or OCTAVE before, it can take a bit of adjustment to be productive in R. While it’s great for data analysis, it is less good at more general purposes. You would construct a model in R, but you would consider translating the model into Scala or Python for production, and you’d be unlikely to write a clustering control system using the language.
R has literally thousands of extension packages that allow statisticians to undertake specialized tasks, including text analysis, speech analysis, and tools for genomic sciences. The center of a thriving open-source ecosystem, R has become increasingly popular as programmers have created additional add-on packages for handling big datasets and parallel processing techniques that have come to dominate statistical modeling today. Selected packages include:
R is used in industries like finance, health care, marketing, business, pharmaceutical development, and more. Industry leaders like Bank of America, Bing, Facebook, and Foursquare use R to analyze their data, make marketing campaigns more effective, and reporting.
Hadoop is the most important framework for working with Big Data. The best part of Hadoop is that it is scalable and can be deployed for any type of data in various varieties like structured, unstructured and semi-structured type.
But in order to derive analytics from it, there is a need to extend its capabilities and hence the integration with R programming is a necessity in order to get Big Data analytics. The most common ways in which Hadoop and R can be integrated are as follows.
RHadoop: this is an Open Source package provided by Revolution Analytics. These come with three packages that can be readily used for R analysis and working with Hadoop framework data:
RHIPE: this is an integrated programming environment that is developed by the Divide and Recombine (D&R) for analyzing large amounts of data. RHIPE stands for R and Hadoop Integrated Programming Environment.
The RHIPE lets you work with R and Hadoop integrated programming environment. You can use Python, Java or Perl to read data sets in RHIPE. There are various functions in RHIPE that lets you interact with HDFS.
ORCH: this is the Oracle R Connector which can be used to exclusively work with Big Data in Oracle appliance or on non-Oracle framework like Hadoop.
The Oracle R Connector for Hadoop can be used for deploying R on Oracle Big Data Appliance or for non-Oracle frameworks like Hadoop with equal ease. The ORCH lets you access the Hadoop cluster via R and also to write the Mapping and Reducing functions. You can also manipulate the data residing in the Hadoop Distributed File System.
HadoopStreaming: this is the R Script available as part of the R package on CRAN. This intends to make R more accessible to Hadoop streaming applications. Using this you can write MapReduce programs in a language other than Java.
The Hadoop Streaming lets you write MapReduce codes in R language making it extremely user-friendly. Java might be the native language for MapReduce but it is not suited for high speed data analysis needs of today and hence there is a need for faster mapping and reducing steps with Hadoop and this is where Hadoop Streaming comes into the picture wherein you can write the codes in Python, Perl or even Ruby. The Hadoop Streaming then converts the data to how the users need it. When you are doing data analysis using R there is no need to write codes from the command line interface as the data will be sitting on the background and all you need to do is create data, partition data and compute summaries.
Another free language/software, Python has great capabilities overall for general purpose functional programming. Python works well for web-scrapping, text processing, file manipulations, and simple or complex visualizations. Compared to R, Python is advancing in dealing with structured data and analytical models, and it doesn’t support data visualization in as much detail as R. However, as opposed to R, Python is a traditional object-oriented language, so most developers will be fairly comfortable working with it, whereas first exposure to R or Scala can be quite intimidating.
Python is one of the most popular open source (free) languages for working with the large and complicated datasets needed for Big Data. It has become very popular in recent years because it is both flexible and relatively easy to learn. Like most popular open source software it also has a large and active community dedicated to improving the product and making it popular with new users.
Python has been very popular in academia for more than a decade, especially in areas like Natural Language Processing (NLP). As a result, if you have a project that requires NLP work, you’ll face a huge number of choices, including the classic NTLK, topic modeling with GenSim, or the blazing-fast and accurate spaCy. Similarly, Python scores well when it comes to neural networking, with Theano and Tensorflow; then there’s scikit-learn for machine learning, as well as NumPy and Pandas for data analysis.
There’s Juypter/iPython too — the Web-based notebook server that allows you to mix code, plots, and, well, almost anything, in a shareable logbook format. This had been one of Python’s killer features, although these days, the concept has proved so useful that it has spread across almost all languages that have a concept of Read-Evaluate-Print-Loop (REPL), including both Scala and R.
Python is a high-level language, meaning that the creators automated certain housekeeping processes in order to make code easier to write. Python has robust libraries that support statistical modeling (Scipy and Numpy), data mining (Orange and Pattern), and visualization (Matplotlib). Scikit-learn, a library of machine learning techniques very useful to data scientists, has attracted developers from Spotify, OKCupid, and Evernote, but can be challenging to master.
Two of the most popular ways to integrate Hadoop with Python are Hadoopy and Pydoop.
Hadoopy: a Python wrapper for Hadoop Streaming written in Cython
Hadoopy is simple, fast, and readily hackable. It has been tested on 700+ node clusters.
The goals of Hadoopy are:
Killer Features (unique to Hadoopy):
Hadoopy is used in several use cases and tools, such as in searching computer help using screenshots and keywords, in web-scale computer vision using MapReduce for multimedia data mining, in Vitrieve (a visual search engine) and in Picarus (a Hadoop computer vision toolbox).
Pydoop: Writing Hadoop Programs in Python
Installed as a layer above Hadoop, the open-source Pydoop package enables Python scripts to do big data work easily. Pydoop Script enables you to write simple MapReduce programs for Hadoop with mapper and reducer functions in just a few lines of code. When Pydoop Script isn’t enough, you can switch to the more complete Pydoop API, which provides the ability to implement a Python Partitioner, RecordReader, and RecordWriter. In addition, Pydoop makes it easy to interact with HDFS (Hadoop Distributed File System) through a Pydoop HDFS API (pydoop.hdfs), which allows you to retrieve information about directories, files, and several file system properties. The Pydoop HDFS API makes it possible to easily read and write files within HDFS by writing Python code. In addition, the lower-level API provides features similar to the HadoopC HDFS API, and so you can use it to build statistics of HDFS usage.
Pydoop might not be the best API for all Hadoop use cases, but its unique features make it suitable for a wide range of scenarios and it is being actively improved.
SAS – Statistical Analysis Software
SAS is one of the most common tools out there for data processing and model development. When analytics function started emerging in the financial service sector couple of decades ago, SAS became common choice because of its simplicity and lot of support and documentation. SAS comes handy both for step by step data processing and automated scripting. All is well, except, SAS wasn’t and isn’t cheap. SAS also has limited capabilities for visualizations and almost no support for parallelization.
We will now describe the integration between SAS and the major Big Data Hadoop platforms, such as Hortonworks, Cloudera and MapR. We will also discuss some challenges facing while deploying SAS on the Hortonworks platform, and highlight best practices when using SAS on Hortonworks and Hive.
SAS developers/architects can easily access and manage Hadoop data. For example, data injection is made easy by using SAS tools such as SAS Data loader for Hadoop or SAS Code. Developers can easily access and integrate data from Hadoop, perform SAS processing on the Hadoop cluster by using MapReduce, or lift data from HDFS to memory and perform massive parallel distributed data processing. They can also perform any complicated analytical calculations and utilize interactive exploratory analytics.
SAS consists of the following five modules (Figure 1): User interface, Metadata, Data Access, Data processing, and HDFS File System storage.
Figure 1. SAS modules
SAS Data Access
Hadoop data can be accessed in three ways (Figure 2): by using SAS/Access to Hadoop (i.e. From), by using the SAS Embedded Process (SAS In-DB) accelerator (i.e. In), or by using the SAS In-Memory Data Access (i.e. With).
Figure 2. Accessing Hadoop data with SAS
SAS Data Processing
Data processing in SAS involves four steps (Figure 3): Accessing and managing data, Exploring and visualizing data, Analyzing and modeling data, and Deploying and integrating data.
Figure 3. SAS data processing
SAS users can access the Hadoop HDFS or Hive data using the SAS Access engine (MR/TEZ) and SAS/Access to impala. They can also read/write and Integrate data to and from Hadoop using SAS Hadoop Data Loader.
SAS users interactively explore and visualize Hadoop data using SAS Visual Analytics, which allows exploring all types of data stored in Hadoop and sharing results quickly via Web reports, mobile devices or via MS Office products.
SAS users analyze and model data using modern statistical and machine-learning methods to uncover patterns and trends in the Hadoop data. SAS Visual Statistics provides an interactive, intuitive web-based interface for building both descriptive and predictive models of data of any size. SAS high-performance analytics provides in-memory capabilities so that users can develop a superior analytical model by using all the data. This is achieved by loading data into memory and running complex analytical models on the distributed in-memory data to quickly get results.
SAS users can automatically deploy and score analytical models in its parallel environment by using SAS Scoring Accelerator for Hadoop. The accelerator minimizes time to production for predictive models by automating model deployment in Hadoop. It also reduces data movement by bringing the scoring process to the Hadoop data, and produces faster answers by using the distributed processing in Hadoop. In addition, the SAS Event Stream Processing Engine puts intelligence into this process by continuously querying data streams to examine and filter data, even if millions of events are processed every second. By analyzing data as it arrives, users can detect anomalies and uncover useful data more quickly – and publish filtered data to their Hadoop cluster.
Make sure that Hortonworks Ranger is disabled while installing SAS on Hortonworks. We need to install KMS manually as the Ranger KMS is not supported by SAS. We need to manually write scripts to generate the audit report and manually grant permission to users and/or groups. We need to make sure that auto Kerberos has been enabled for SAS users, as this will generate the Kerberos key tabs when we login to the cluster. We need to request to the system admin to allow access to SAS ports so that the access is not blocked by the firewall.
Make sure that SAS Embedded Process is installed on the Hadoop cluster. It is possible for DATA step code to be executed in parallel against the input data residing on the HDFS file system and thus enable the DATA step to be run inside Hadoop. To enable this, set the DSACCEL system option to ANY.
Using SAS in-database processing, we can run scoring models, SAS procedures, DS2 thread programs, and formatted SQL queries directly inside the data store by eliminating data movement between each node.
DS2 takes advantage of threaded processing, so it can bring significant performance gains running directly inside distributed databases via the SAS In-Database Code Accelerator. We can run a DS2 program directly to a data source using SAS In-Database Code Accelerator (multi-core on multiple nodes processing). Most of the SAS Embedded Process processing triggers MapReduce jobs, but with the latest release of SAS Data Loader for Hadoop, some specific data integration and data quality operations can be executed by using Apache Spark instead of the MapReduce framework.
Apache Tez is another data processing framework available in Hortonworks distributions. It can also be used as an alternative to MapReduce and is used by Hortonworks to dramatically improve the MapReduce speed, while maintaining MapReduce’s ability to scale to petabytes of data.
We can use SASHDAT which is the SAS High Performance proprietary format whose purpose is to allow operations between data in memory and data in HDFS to be as fast as possible. It will provide the best performance for the following tasks: Load HDFS data in SAS LASR Analytic Server, Save SAS LASR Analytic Server table in HDFS, and run a High Performance procedure (such as PROC HPDS2, HPSUMMARY, HPLOGISTIC, HPREG, HPGENSELECT) on data stored in HDFS.
Recommended best practice is running a simple CREATE TABLE AS SELECT (CTAS) in PROC SQL as one of these cases in Hadoop. Using the option DBIDIRECTEXEC will force specific operations (table create or delete) to be passed to the DBMS.
Avoid merging SAS data with Hive data. It is recommended that you transform the SAS data set in a Hive table and run the merge inside Hive to leverage distributed processing and avoid network traffic between SAS workspace server node and the Hadoop cluster.
The ORDER BY clause on a large table is very costly in Hive. SAS code optimization can avoid the use of the ORDER BY statement in Hadoop.
SAS can work with the ORC file format in Hive with Default ZLIB compression. SAS option allowing you to create a file in an alternative format is ‘DBCREATE_TABLE_OPTS= stored as orc’ in the LIBNAME statement.
Data partitioning is often used when working with traditional database management systems to improve performances of queries. Hive supports partitioned data. Partitioning data in Hive can improve query performance. It is recommended that you use partitioning on low cardinality variables used often in the queries.
The Brite Group Inc. is in conformance with ISO 9001:2015 Quality Management System.
In 2016 the U.S. Energy Information Administration (EIA) unveiled the results of a revolutionary new hourly survey for electric power data – the first hourly data collection by any federal statistical agency. The data is collected from 66 electric systems (balancing authorities) and while hourly data is being produced by most balancing authorities, this is the first time it is being done for all the 48 contiguous states in a standardized format. The nearly real-time data provides actual and forecast demand, net generation, and the power flowing between the electric systems.
Why is this data collection important and how was it accomplished?
Due to the lack of sufficient cost-effective storage, electricity is produced at the moment it is used. This makes analysis of the U.S. electric grid challenging. Plus, the previous data collections could only provide monthly data that lagged several weeks in production and annual data that lagged by several months. The new, hourly visibility into the functioning of the U.S. electric grid has several potential applications such as 1) Assisting federal and state & local governments in tuning their electric regulation policies in a more timely manner, 2) Providing nearly real-time information on the recovery of the electric grid after a major disturbance (e.g. hurricanes), 3) Evaluating the impact of renewable power, smart grid, and demand response programs on the power industry, and more.
The technology responsible for the data collection is the Informatica Platform consisting of Informatica Data Exchange, Informatica PowerCenter Data Integration, and Informatica Data Quality along with Informatica B2B for parsing XML files. The automated data feeds come into the Informatica Data Exchange in XML format and are then converted into a relational database format for further processing.
Josh Bollinger, Informatica Developer at The Brite Group, Inc.