Datameer , a Hadoop-based analytics company, had a major presence at recent Hadoop Summit, led by CEO Stephan Groschupf’s keynote and panel appearance. Besides announcing its latest product release, which is an important advance for the company and its users, Datameer’s outspoken CEO put forth contrarian arguments about the current direction of some of the distributions in the Hadoop ecosystem.
The challenge for the growing ecosystem surrounding Hadoop, the open source processing paradigm, has been in accessing data
At the conference, Datameer made the announcement of version 3.0 of its namesake product with a celebrity twist. Olympic athlete Sky Christopherson presented a keynote telling how the U.S. women’s cycling team, a heavy underdog, used Datameer to help it earn a silver medal in London. Following that introduction, Groschupf, one of the original contributors to Nutch (Hadoop’s predecessor), discussed features of Datameer 3.0 and what the company is calling “Smart” analytics, which include a variety of advanced analytic techniques such as clustering, decision trees, recommendations and column dependencies.
Our benchmark research into predictive analytics shows that
Datameer’s column dependencies can show analysts relationships between different column variables. The output appears much like a correlation matrix, but uses a technique called Mutual Information. The key benefit of this technique over a traditional correlation approach is that it allows comparison between different types of variables, such as continuous and categorical variables. However, there is a trade-off in usability: The numeric output is not represented by the correlation coefficient with which many analysts are familiar. (I encourage Datameer to give analysts a quick reference of some type to help interpret the numbers associated with this less-known output.) Once the output is understood, it can be useful in exploring specific relationships and testing hypotheses. For instance, a company can test the hypothesis that it is more vertically focused than competitors by looking at industry and deal close rates. If there is no relationship between the variables, the hypothesis may be dismissed and a more horizontal strategy pursued.
The other technique Datameer spoke of is recommendation, also known as next best offer analysis; it is a relatively well known technique that has been popularized by Amazon and other retailers. Recommendation engines can help marketing and sales teams increase share of wallet through cross-sell and up-sell opportunities. While none of these four techniques is new to the world of analytics, the novelty is that Datameer allows this analysis directly on Hadoop, which incorporates new forms of data including Web behavior data and social media data. While many in the Hadoop ecosystem focus on descriptive analysis related to SQL, Datameer’s foray into more advanced analytics pushes the Hadoop envelope.
Aside from the launch of Datameer 3.0, Groschupf and his team used Hadoop Summit to espouse the position that the SQL approach of many Hadoop vendors is a mistake. The crux of the argument is that Hadoop is a sequential access technology (much like a magnetic cassette tape) in which a large portion of the data must be read before the correct data can be pulled off the disk. Groschupf argues that this is fundamentally inefficient and that current MPP SQL approaches do a much better job of processing SQL-related tasks. To illustrate the difference he characterized Hadoop as a freight train and an analytic appliance database as a Ferrari; each, of course, has its proper uses. Customers thus should decide what they want to do with the data from a business perspective and then chose the appropriate technology.
This leads to another point Groschupf made to me: that the big data discussion is shifting away from the technical details to a business orientation. In support of this point, he showed me a comparison of the Google search terms “big data” and “Hadoop.” The latter was more common in the past few years, when it was almost synonymous with big data, but now generic searches for big data are more common. Our benchmark research into business technology innovation shows a similar shift in buying criteria, with about two-thirds (64%) of buyers naming usability as the most important priority. By the way, a number of Ventana Research blogs including this one have focused on the trend of outcome based buying and decision making.
For organizations curious about big data and what they can do to take advantage of it, Datameer can be a low-risk place to start exploring. The company offers a free download version of its product so you can start looking at data immediately. The idea of time-to-value is critical with big data, and this is a key value proposition for Datameer. I encourage users to test the product with an eye to uncover interesting data that was never available for analysis before. This will help build the big data business use case especially in a bootstrap funding environment where money, skills and time are short.
Regards,
Tony Cosentino
VP and Research Director