ISG Software Research Analyst Perspectives

EMC Looks to Be Pivotal for Big Data

Written by Mark Smith | Mar 6, 2013 2:42:03 PM

The big-data landscape just got a little more interesting with the release of EMC’s Pivotal HD distribution of Hadoop. Pivotal HD takes Apache Hadoop and extends it with a data loader and command center capabilities to configure, deploy, monitor and manage Hadoop. Pivotal HD, from EMC’s Pivotal Labs division, integrates with Greenplum Database, a massively parallel processing (MPP) database from EMC’s Greenplum division, and uses HDFS as the storage technology. The combination should help organizations realize a key part of the value of big data: information optimization.

Greenplum and EMC have been working with Hadoop technology to provide robust database and analytic technology offerings. EMC is using Hadoop and HDFS as a foundation to support a new generation of information architectures, on top of which the company provides a value-added layer of data and analytic processing to support a range of big data needs. The aim is to deliver one of the main benefits of big data technology, faster analysis; our big data benchmark research found that to be a key benefit for 70 percent of organizations.

EMC is placing a bet by building its distribution on top of Apache Hadoop 2.0.2, which has yet to be officially released. The company is testing its software on a thousand-node cluster to ensure it will be ready. While EMC calls Pivotal HD the most powerful Hadoop distribution, EMC is one of many new providers that are building on Hadoop technologies and commercializing them for organizations looking for direct support and services or for value-added technology on top of Hadoop. Oddly, however, EMC’s new offering appears to compete with its own licensing of MapR for a product it calls Greenplum MR.

EMC has named the advanced database processing technology in Pivotal HD HAWQ. It provides the ability to use ANSI SQL in an optimized manner against big data through a query parser and optimizer, with its own HAWQ nodes processing query execution against HDFS data nodes. HAWQ also has its own Xtension Framework for adaptability to other technologies. HAWQ improves upon the performance of regular SQL because it is a specialized technology for managing distributed and optimized queries against data in Hadoop.

By supporting SQL as the language to get to Hadoop, HAWQ standardizes and simplifies access to big data, and it optimizes queries through its query planning and pipelining methods. Providing a SQL interface and an ODBC connection is not new; many Hadoop distributions now provide ODBC connectivity, including Cloudera, Hortonworks and MapR. EMC, however, uses its optimized query and SQL connection in HAWQ as an accelerator, which lets it stack its software technology up against any data and analytic technology, not just Hadoop. The question for organizations thinking about making an investment in this approach is whether they are limiting their access to future Hadoop advancements by investing in HAWQ technology that operates with only the Pivotal HD distribution, or whether the gains provide enough immediate value to offset any Hadoop challenges in optimizing their infrastructure. It is my belief that if an organization adopts the HAWQ path, it will need to invest in an information architecture that includes integration technology at the HDFS level, as businesses will inevitably be operating against varying flavors of Hadoop.

Another area of differentiation EMC promises for HAWQ is performance. EMC claims exponential performance improvement from its query optimizer and SQL compared with using Hive to access HDFS, or Cloudera Impala and native Hadoop. In fact, it claims 19 to 648 times faster performance using its own benchmark. Since these benchmarks were not run independently, it is hard to place much weight on them for now. I made inquiries to many Hadoop software providers, including Cloudera; they said these metrics are probably not that accurate and invited performance comparisons against their technologies. Clearly these benchmarks should have been released to the Hadoop community so its members could design optimized Hive queries for more accurate comparisons, but EMC is hoping that its results will entice IT professionals to try it for themselves.
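For readers who do want to try a comparison themselves, a "times faster" figure is just the ratio of the baseline engine's run time to the candidate engine's run time for the same query. The Python sketch below shows one way to compute that ratio from repeated runs, using the median to dampen outliers. It is a minimal illustration, not EMC's benchmark method: the function names and the two stand-in workloads are hypothetical and would be replaced by real calls that submit the same query to each engine.

```python
import statistics
import time
from typing import Callable, List


def time_runs(run_query: Callable[[], object], repeats: int = 5) -> List[float]:
    """Return the wall-clock seconds taken by each of `repeats` executions."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings


def speedup(baseline_secs: List[float], candidate_secs: List[float]) -> float:
    """Ratio of median baseline time to median candidate time.

    A result of 20.0 means the candidate was 20 times faster.
    """
    return statistics.median(baseline_secs) / statistics.median(candidate_secs)


# Hypothetical stand-ins for "the same query on two engines".
# In a real test these would submit identical SQL to each system.
def baseline_engine_query() -> int:
    return sum(i * i for i in range(200_000))


def candidate_engine_query() -> int:
    return sum(i * i for i in range(2_000))


if __name__ == "__main__":
    baseline = time_runs(baseline_engine_query)
    candidate = time_runs(candidate_engine_query)
    print(f"Candidate is {speedup(baseline, candidate):.1f} times faster")
```

The point of the sketch is that any such number depends on the query, the data volume, the cluster and how well each side has been tuned, which is why vendor-run comparisons deserve independent repetition.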

EMC’s stature in the market and its work with a broad range of technology partners make it an important player in the big data market. Tableau Software is one of those partners, providing discovery on data from HAWQ and Pivotal HD for analytics. Cirro also announced support for Pivotal HD, enabling a new generation of what I call big data integration. These partners are good examples; they give EMC a more complete stack of technologies for an enterprise approach to big data, from analysis to connectivity with other data sources.

EMC can deliver its big data technology through a variety of deployment methods, including public cloud with OpenStack and Amazon Web Services (AWS), private cloud using VMware, and on-premises. Our big data research shows faster growth planned for hosted (59%) and software as a service (65%) than for future on-premises deployments. While EMC is not allowed to publicly mention its customer references, and I have yet to validate them, the company says they include some of the largest banks and manufacturers.

Meanwhile, the Hadoop community’s new project Tez provides an alternative that bypasses MapReduce to improve performance. It uses Hadoop YARN for a more efficient run time and better query performance. Also, the Stinger Initiative is a project to improve interactive query support for Hive.

EMC acknowledges open source efforts that focus on improving the performance of accessing HDFS and looks forward to those advancements and to incorporating them into its Pivotal HD product where it can, but it points to its query optimizer and ANSI SQL as a better approach. It also did not deny that its performance comparisons could have been more optimized. But EMC is betting that its HAWQ efforts and its reliance on the next release of Apache Hadoop 2 will place it in a good market position, leveraging open source technology that is expected to be released in 2013.

This move to introduce Pivotal HD Enterprise and HAWQ is clearly an opportunity to accelerate EMC’s efforts. Greenplum’s technology needed assistance to grow its adoption as it competes with approaches that encompass not only Hadoop but also in-memory, appliance and RDBMS technology. Only time will tell how EMC’s focus on big data with Pivotal HD and HAWQ will play out. The battle among big data providers continues to be very competitive, with dozens of approaches. As each company moves from experimentation to development to production, it must carefully determine what technology will best meet its unique needs. Organizations should evaluate HAWQ and Pivotal HD not just on performance or SQL access but also on the architectural and management needs of IT, which span adaptability, manageability, reliability and usability, and on the business value this technology delivers compared to other Hadoop and big-data technology approaches.

Regards,

Mark Smith

CEO & Chief Research Officer