PyData NYC 2013 was a two-day conference this past weekend (Saturday and Sunday, 11/9 and 11/10), preceded by a day of tutorials on Friday. Saturday and Sunday each featured morning keynotes and three tracks of talks and workshops. JP Morgan graciously provided space in the financial district of New York City (complete with a gorgeous 60th-floor room for keynotes), and Continuum Analytics appeared to be one of the primary sponsors. Based on the nature and content of the talks and the speakers themselves, it was quite clear that this conference is by and for practitioners, the individuals writing code and munging data every day.
This post won’t try to detail every talk or every slide, but will instead highlight some portions of the keynotes and discuss other trends that emerged.
General Trends and Thoughts
I saw two general trends at this conference:
1. Python Does Data
Python now robustly covers the full software stack for data science, full stop. If you have been using SPSS, SAS, Stata, MATLAB, or even R, Python is now a very legitimate alternative for your data analysis needs that does not suffer the limitations of being a DSL (a domain-specific language originally designed for a small set of tasks). This set of Python tools includes but is not limited to:
- Pandas – a data analysis library for Python, including dataframes
- NumPy – a rich foundational library for scientific and numerical computing in Python
- SciPy – a rich foundational library for scientific computing, complementing NumPy
- StatsModels – rich statistical methods for Python
- scikit-learn – the machine learning toolkit
- NLTK – the Natural Language Toolkit for NLP
- PyMC3 – a module for Bayesian statistical modeling
- IPython – an interactive, easier-to-use Python shell
- IPython Notebook – a browser-based interface combining code, text, and plots
- Spyder – an IDE focused on data and numerical computation
In terms of ease of use, Continuum Analytics has solved a major issue by packaging the key Python components listed above into a single, easy-to-install distribution of Python named Anaconda.
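To make the claim concrete, here is a minimal sketch of my own (not from any talk, with a fabricated dataset) that exercises a few of those pieces together: NumPy to generate data, pandas to hold it, and StatsModels to fit the kind of ordinary least squares regression that SPSS, SAS, Stata, or R users run every day.

```python
# Minimal sketch: NumPy + pandas + StatsModels working together.
# The dataset is fabricated purely for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.RandomState(0)
df = pd.DataFrame({"x": rng.uniform(0, 10, size=100)})
df["y"] = 2.0 * df["x"] + rng.normal(scale=1.0, size=100)

# Ordinary least squares: y ~ x (with an intercept term)
model = sm.OLS(df["y"], sm.add_constant(df["x"])).fit()
print(model.summary())
```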
2. Python Eats The World
While Python handles the statistical data niche that traditional packages such as SPSS, Stata, SAS, and even R have historically ruled, it handles everything else as well, scaling from research code all the way to large-scale production systems handling truly big data.
Think of it this way: if you were a young but unusually wise graduate student tasked with some form of numerical computation or data analysis, would you want a domain-specific language good for one thing, or a programming language that opens numerous doors (assuming, of course, that you actually had a choice)?
Keynote Day 1 – Peter Wang
Peter Wang, co-founder of Continuum Analytics, gave what I would consider one of the best data talks that I have heard in the last few years. Leveraging his physics background, Peter discussed quite literally the full data stack, from data visualization down to the circuits flipping bits to perform computations.
Why this Community Now?
Peter addressed the fundamental question of “why this community now” (or, put less succinctly, why are Python and data–big or otherwise–exploding now). In short, he sees it as the perfect storm of disruptions in:
- storage technologies – mostly cloud-based, although fast SSDs don’t hurt
- processor cycle availability – again, almost unlimited CPU hours available in the cloud
- virtually unlimited data generation happening continuously – currently from web and mobile applications, and the Internet of Things will only add to it
- traditional BI tools not up to par
- demonstrated clear value in large datasets
Data Comprehension as a Core Competency
To put it in managerial terms: go back in time to 1996, before the ubiquity of the Internet, and you would hear predictions such as:
Businesses that build network-oriented capability into their core will fundamentally outcompete and destroy their competition.
At the time, some people argued vehemently with these predictions.
Peter believes, and I wholeheartedly agree, that a similar data revolution is afoot. Data is becoming core; data comprehension is a must in a quantized world. Thus, while some may disagree, it is safe to say that
Businesses that don’t understand data and that do not have data comprehension as a core competency will quickly be extinguished by those that do.
Peter also discussed how the velocity and volume of large data overwhelm old-school ETL in business data processing. In the past, the workflow perspective reflected how a factory works: data was like a train traveling from one station to the next, getting transformed or stored at each stop in the journey. Now that data is so large, moving the train from station to station simply requires too much effort and time. Thus, the train is kept in one place and the platforms and stations are moved to it.
Data Science as Scientific Computing 2.0
Peter also argued that data science is really scientific computing 2.0, as scientists have been using data, computers, and algorithms for longer than practically everyone else. He even half-jokingly offered up an equation for it from datagravity.org.
Whether you agree or disagree with his point, it is hard to argue against the fact that there are many, many lessons that data practitioners and data scientists not coming from science can learn from those who have come before. In particular, Peter highly recommended a paper led by Jim Gray at Microsoft entitled Scientific Data Management in the Coming Decade. This gem, written in 2004/2005, is full of absolutely fascinating insights.
The Future
While there were many additional points in this keynote that were equally fascinating, I will leave you with the one that Peter used to hint at things to come. Each layer of software abstraction built on top of the hardware–from the firmware to the assembler to the kernel to the operating system to the programming language and applications–is a lie. These constructs exist to hide implementation details from the higher levels to enable more rapid advancement. However, these abstractions constrain what can be done and can impose significant performance penalties.
Add to this the facts that
- numerous projects have virtualized or attempted to virtualize various levels of these abstractions, blurring the boundaries between the layers, and
- some cutting-edge research is underway to compile code directly into hardware layouts, completely cutting out the middlemen,
and you are left wondering what might just be coming in the future. Great minds often come to similar conclusions; for example, look at some of the thoughts behind the Julia language here.
NumFOCUS
NumFOCUS is a new 501(c)(3) non-profit organization designed to support the scientific and data software stack in Python, including NumPy, Matplotlib, SciPy, Pandas, and more. In their own words:
NumFOCUS supports and promotes world-class, innovative, open source scientific software. Most individual projects, even the wildly successful ones, find the overhead of a non-profit to be too large for their community to bear. NumFOCUS provides a critical service as an umbrella organization which removes the burden from the projects themselves to raise money.
…
NumFOCUS aims to ensure that money is available to keep projects in the scientific Python stack funded and available. So if you find value in these tools and have always wanted to give back, donating to NumFOCUS gives you a way of supporting either a specific project of your choice or all of these great codes at once!
Miscellaneous
There were simply too many things of interest to discuss in depth. I won’t go much further than listing them here, but I definitely recommend clicking through.
MADlib
an open source, in-database analytics engine built on SQL and PL/Python
Parakeet
a runtime accelerator for an array-oriented subset of Python, leveraging the power of the GPU
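To give a flavor, here is a hypothetical sketch of the decorator-based usage described in the project’s documentation (not code from the talk); the exact API may differ by version.

```python
# Hypothetical sketch of Parakeet's decorator-based usage; exact API may vary.
import numpy as np
from parakeet import jit

@jit
def sum_of_squares(x):
    # Plain array-oriented Python; Parakeet compiles it on first call
    return (x * x).sum()

data = np.arange(1e6)
print(sum_of_squares(data))
```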
PyMC3
PyMC is a Python module for Bayesian statistical modeling and model fitting which focuses on advanced Markov chain Monte Carlo fitting algorithms. Its flexibility and extensibility make it applicable to a large suite of problems.
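As a flavor of what model specification looks like, here is a minimal sketch of my own (fabricated data, not an example from the talk) using PyMC3’s model-context API to estimate the mean of some observations.

```python
# Minimal sketch: estimate an unknown mean with a Normal prior and MCMC sampling.
import numpy as np
import pymc3 as pm

data = np.random.normal(loc=3.0, scale=1.0, size=200)  # fabricated observations

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sd=10.0)            # prior on the unknown mean
    pm.Normal("obs", mu=mu, sd=1.0, observed=data)   # likelihood
    step = pm.Metropolis()                           # a simple MCMC step method
    trace = pm.sample(2000, step=step)               # draws from the posterior

print(trace["mu"].mean())
```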
Theano
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
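The canonical pattern, shown below as a small sketch, is to declare symbolic variables, build an expression, and compile it into a callable function.

```python
# Small sketch of the basic Theano workflow: symbolic expression -> compiled function.
import numpy as np
import theano
import theano.tensor as T

x = T.dmatrix("x")
y = T.dmatrix("y")
z = x + y                           # symbolic expression; nothing is computed yet
add = theano.function([x, y], z)    # Theano optimizes and compiles the graph here

a = np.ones((2, 2))
print(add(a, a))                    # [[2. 2.] [2. 2.]]
```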
That is all for now. As a reward for making it this far, here is a link to virtually every presentation given at the event! I was going to discuss the keynote given by Brian Granger, a professor at Cal Poly State University who is heavily involved in the IPython project, but I will save that for a separate post.