Hadoop Summit is the biggest event on the West Coast centered on Hadoop, the open source technology for large-scale data processing. The conference organizer, Hortonworks, estimated that more than 2,400 people attended, which, if accurate, would represent double-digit growth over last year. Growth on the supplier side was even larger, which indicates the opportunity this market represents. Held in Silicon Valley, the event attracts enterprise customers, industry innovators, thought leaders and venture capitalists. Many announcements were made – too many to cover here. But I want to comment on a few important ones and explain what they mean to the emerging Hadoop ecosystem and the broader market.
Hortonworks is a company spun off by the architects of Yahoo’s Hadoop implementation. Flush with $50 million in new venture funding, the company announced the preview distribution of Apache Hadoop 2.0. This represents a fundamental shift away from the batch-only approach to processing big data of the previous generation. In particular, YARN (Yet Another Resource Negotiator; Yahoo roots are evident in the name) promises to solve the challenge of running multiple workloads on one cluster. YARN replaces the JobTracker, the MapReduce-specific job scheduler of Hadoop 1.x. In that system, a MapReduce job sees itself as the only tenant of the cluster and its HDFS file system, precluding any other workload. Under YARN, MapReduce becomes just one client of a general-purpose resource manager, which can allocate resources according to differing workload needs. According to Bob Page, product manager at Hortonworks, and Shaun Connolly, VP of corporate strategy, this mixed-workload capability opens the door to additional ISV plug-ins, including advanced analytics and other emerging workload types.
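To make that architectural shift concrete, here is a minimal sketch using the Hadoop 2.x YARN client API, showing how any framework (not just MapReduce) asks the ResourceManager for resources. The application name and launch command are placeholders of my own, not anything from Hortonworks’ materials.

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the cluster's ResourceManager, the arbiter that
    // YARN places between competing workloads and the cluster.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the ResourceManager to create a new application.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
    context.setApplicationName("analytics-workload"); // placeholder name

    // Declare what this workload's ApplicationMaster needs; the
    // ResourceManager allocates against all tenants' requests.
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(512);      // MB of RAM for the ApplicationMaster
    capability.setVirtualCores(1);
    context.setResource(capability);

    // Command that launches the framework's own ApplicationMaster;
    // "./my_app_master" is a hypothetical binary, not a real one.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList("./my_app_master 1>stdout 2>stderr"));
    context.setAMContainerSpec(amContainer);

    // Submit; from here on, MapReduce is just one client among many.
    ApplicationId appId = yarnClient.submitApplication(context);
    System.out.println("Submitted application " + appId);
  }
}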
As noted, various vendors announced their own Hadoop advances at the summit. For one, Teradata continues to expand its Hadoop-based product portfolio and its Unified Data Architecture, which I covered recently. The company introduced the Teradata Appliance for Hadoop as well as support for Hadoop running on commodity hardware from Dell. While another Hadoop appliance in the market may not be big news, Teradata’s commitment to the Hadoop community is important. Its professional services organization works in close partnership with Hortonworks, and the two will now offer full scoping and integration services alongside their current support services. This enables Teradata to maintain its trusted-advisor role within accounts while Hortonworks takes advantage of a robust services and account management structure to help create new business.
Quentin Clark, Microsoft’s VP for SQL Server, gave a keynote address.
Datameer made the important announcement of version 3.0 of its namesake flagship product, with a celebrity twist. Olympic athlete Sky Christopherson presented a keynote on how the U.S. women’s cycling team, a heavy underdog, used Datameer to help it earn a silver medal in London. Following that, Stefan Groschupf, CEO of Datameer and one of the original contributors to Nutch (Hadoop’s predecessor), discussed advances in 3.0, which include a variety of advanced analytic techniques such as clustering, decision trees, recommendations and column dependencies. The ability to perform these types of advanced analytics and visualize the data natively on Hadoop is not otherwise available in the market today. My coverage of Datameer from last year can be found here.
Splunk announced Hunk, a tool that integrates exploration and visualization in Hadoop and will enable easier access for ‘splunking’ Hadoop clusters. In this tool, Splunk introduces a virtual indexing technology in which indexing occurs in an ad hoc fashion as data is fed into a columnar store. This enables analysts to test hypotheses through a “slice and dice” approach once the initial search discovery phase is completed. Sanjay Mehta, VP of marketing for Splunk, explained to me how such a tool enables faster time to value for Hadoop. Today there are many requests for data resting in Hadoop, but it takes a data scientist to fulfill them. By applying Splunk’s tools to the Hadoop world, data scientists can move on to more valuable tasks while users trained in Splunk field and address such requests. Hunk is still somewhat technical in nature and requires specific Splunk training, but the demonstration showed a no-code approach in which the user-friendly Splunk interface returns robust descriptive data in visual form, which can then be worked with in an iterative fashion. My most recent analysis of Splunk can be found here.
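Splunk did not share implementation details, so purely to illustrate the general pattern at work, here is a toy sketch in Java, entirely my own construction rather than anything from Hunk, of query-time (“virtual”) indexing: instead of indexing every field up front, an inverted index for a field is built only when a search first touches it, after which slice-and-dice queries on that field become simple lookups.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of ad hoc, query-time indexing. This is NOT Hunk's
// implementation, only the general idea: defer index construction
// until a search first needs a given field.
public class VirtualIndexSketch {
  private final List<Map<String, String>> rawRecords;  // e.g., rows read out of HDFS
  private final Map<String, Map<String, List<Integer>>> fieldIndexes = new HashMap<>();

  public VirtualIndexSketch(List<Map<String, String>> rawRecords) {
    this.rawRecords = rawRecords;
  }

  // Build the inverted index for a field lazily, on first use.
  private Map<String, List<Integer>> indexFor(String field) {
    return fieldIndexes.computeIfAbsent(field, f -> {
      Map<String, List<Integer>> index = new HashMap<>();
      for (int row = 0; row < rawRecords.size(); row++) {
        String value = rawRecords.get(row).get(f);
        if (value != null) {
          index.computeIfAbsent(value, v -> new ArrayList<>()).add(row);
        }
      }
      return index;
    });
  }

  // "Slice" the records by a field value; the first call on a field
  // pays the indexing cost, and later calls are simple lookups.
  public List<Map<String, String>> search(String field, String value) {
    List<Map<String, String>> hits = new ArrayList<>();
    for (int row : indexFor(field).getOrDefault(value, Collections.emptyList())) {
      hits.add(rawRecords.get(row));
    }
    return hits;
  }
}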
Pentaho also made several announcements. The biggest in terms of market impact is that it has become the sole ETL provider for Rackspace’s Hadoop-as-a-service initiative, which aims to deliver a full big data platform in the cloud. Pentaho also announced Pentaho Labs, which will be the R&D arm for the Pentaho open source community. This move should lift both the enterprise and the community, especially in the context of Pentaho’s recent acquisition of Webdetails, a Portuguese analytics and visualization company active in Pentaho’s open source community. The company also announced its Adaptive Big Data Layer, which provides a series of plug-ins across the Hadoop ecosystem, including all of the major distributions, and a new partnership with Splunk enables read/write access to the Splunk data fabric. Pentaho is also providing tighter integration with MongoDB (including its aggregation framework) and the Cassandra DBMS. Terilyn Palanca, director of Pentaho’s big data product marketing, and Rebecca Shomair, corporate communications director, made the point that companies need to hedge their bets within the increasingly divergent Hadoop ecosystem and that Pentaho can help them reduce risk in this regard. Mark Smith’s most recent analysis of Pentaho can be found here.
In general, what struck me most about this exciting week in the world of Hadoop is the divergent philosophies and incentives at work in the market. The distributions of MapR, Hortonworks, Cloudera and Pivotal (Greenplum) continue to compete for dominance with varying degrees of proprietary and open source approaches. Teradata is also becoming a subscription reseller of Hortonworks HDP to provide even more options to its customers. Datameer and Platfora are taking pure-play integrated approaches, while Teradata, Microsoft and Pentaho are looking at ways to marry the old with the new, guarding current investments while adding new Hadoop-based capabilities.

Another thing that struck me was that no business intelligence vendors had a presence beyond the visual discovery provider Tableau. This is curious given that many vendors this week talked about responding to the demands of business users for easier access to Hadoop data. Our research shows this to be a buying trend in today’s environment: usability is the most important buying criterion in almost two out of three (64%) organizations. Use cases say a lot about usability, and the number of people talking on stage and off about their Hadoop experiences increased dramatically this year.

I recently wrote about how much has changed in the use of big data in one year, and the discussions at Hadoop Summit confirmed my thoughts. Hortonworks and its ecosystem of partners now have a greater opportunity to meet a new generation of big data and information optimization needs. At the same time, we are still in the early stages of turning this technology to business use, which requires a focus on use cases and on gaining benefits on a continuous basis. Disruptive innovations often take decades to be fully embraced by organizations and society at large. Keep in mind that it was only in December 2004 that Google Labs published its groundbreaking paper on MapReduce.
Regards,
Tony Cosentino
VP and Research Director