Monday, February 24, 2014

Strata '14 Santa Clara: Lots of data, not so much analytics

While the quality of the conference was excellent overall, there were too many data pipeline platforms and database types showcased. As a quantitatively-oriented person, I don't care much about the latest in-memory database or being able to build my data pipeline graphically. In my own work, I do everything I can to abstract away the particulars of data handling; I just care that my algorithms work. I realize I need to know something about the underlying infrastructure if I want to maximize performance... but why should I have to?

Now, some vendors do offer analytic capabilities in their platforms, but why should I rely on their implementations? I should be able to easily apply an algorithm of my choosing, written against some kind of abstraction, and the vendor should support that abstraction. Furthermore, I should be able to examine the analytic code itself (it's great that 0xdata recognizes this: their platform is open source).
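
To make the kind of abstraction I'm after concrete, here's a minimal sketch in Python (nothing vendor-specific; all names are hypothetical): the algorithm only assumes it can iterate over records, so the storage layer underneath, whether an in-memory list, a file, or some vendor's database cursor, could be swapped out without touching the analytic code.

```python
# A minimal sketch of algorithm/infrastructure separation.
# The algorithm is written against a plain iterable of records,
# so the backend supplying those records is interchangeable.

from typing import Dict, Iterable


def running_mean(records: Iterable[Dict[str, float]], field: str) -> float:
    """Compute the mean of one field in a single streaming pass."""
    total, count = 0.0, 0
    for row in records:
        total += row[field]
        count += 1
    return total / count if count else float("nan")


# The same code works on an in-memory list of records...
in_memory = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
print(running_mean(in_memory, "x"))  # 2.0


# ...or on records streamed from some backend, as long as the vendor
# exposes an iterator over them (a generator stands in for that here).
def from_some_backend():
    for value in (4.0, 5.0, 6.0):
        yield {"x": value}


print(running_mean(from_some_backend(), "x"))  # 5.0
```

This is obviously a toy, but it's the shape of contract I'd like vendors to support: my code depends on the abstraction, not on their platform.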

This is the 'big data' picture I'm seeing, and I'm not liking the silos each vendor tries to build or the (current?) focus on data platforms. The VC panel emphasized that the value from 'big data' is going to come from applications, which of course rely on analytics. Maybe the reported data scientist shortage has something to do with this?

Please inform me if I'm missing something.

This is in contrast to what I've seen at PyData, where the culture of the attendees is perhaps more quantitative and technical, with a firm grasp of mathematics as well as computing issues. At that conference, infrastructure choices were dictated by the analytics in a top-down fashion.