Tuesday, December 24, 2013

PyData NYC 2013.. Breeding Ground for Highly-Innovative Data Science Tools

I recently attended the PyData conference in New York City. I'd like to give my general impression (as opposed to showing you details you can get by visiting the links on the website). The impression I want to express is that of the culture at the conference: hacky..in a good way. That is, tools written by non-specialists to tackle problems facing their application domain where they are specialists in. This is partly due to the flexibility of Python and partly due to the highly-technical profile of the attendees of the conference.

This is just an impression so it's very subjective and I'm biased due to my enthusiasm for Python and my scientific computing background which Python has a firm grip on. But try comparing PyData to Strata which is a higher-level view of the data science world with R being the main analysis tool. The broader data science world is a collision of computer science and statistics both in theory and the tools used. While the narrower PyData world has its roots in the more harmonious scientific computing world where math meets computing albeit the math is generally less statistical.

Until data science curricula are developed, I believe the scientific computing background is a better foundation for data science than statistics alone or computer science on its own. Computational, data, and domain expertise are present in skilled scientific programmers, some of whom attended the conference. The caliber of talent at the conference was astounding. Attendees could, for example, talk about specific CPU and GPU architectures, database systems, compiled languages, distributed systems, GUIs; as well as talk about monte carlo techniques, machine learning, and optimization. Such broad knowledge of all of these areas is important for the implementation of a speedy scientific workflow which happens to be necessary for data science as well.

I'm also going to claim there are more tech-savvy scientists than there are tech-savvy statisticians. This isn't to diminish the importance of statisticians in data science but the computational pride of statisticians is in the comprehensive but slow and inelegant R language. Meanwhile scientific programmers know about, and have built eco-systems around, C/C++, Fortran, and Python and all the CS-subjects associated with these languages including parallel programming, compilation, and data structures. This is just the natural result of statisticians traditionally working with "small data" while scientific programmers often work with "big computation".

The mastery of these issues within the same community is what allows for innovative products such as scidb, bokeh, numba, and ipython and all the various small-scale hackery presented at the conference.

Perhaps I should hold off on making such claims until I go to the next Strata conference but this is a blog journal and I'm very late in writing about PyData!