Computing Workflows, Data Science, and such


Jupyter and SageMathCloud

Long ago, I mentioned how to install TensorFlow on SageMathCloud. Today, I decided to give SageMathCloud a proper stream. It’s a good thing I did, because it’s gotten a lot better in the past year.

SageMathCloud has been a convenient way to use IPython Notebooks with a lot of prebuilt computing tools for a while now. But recently, they’ve upped their game on the machine learning libraries that come pre-installed. You no longer need a hacky workaround to install TensorFlow; it comes free! Prefer Theano? No problemo; it’s got that too. It even has a version of Keras so you can get going quickly. This is all in addition to the solid NumPy/SciPy/Matplotlib support. It’s a great platform.
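If you want to see exactly what a project ships with, a quick notebook cell like this will do it (just the standard __version__ attributes, nothing SageMathCloud-specific):

```python
# Everything below should import in a fresh SageMathCloud project
# with no manual installation.
import tensorflow
import theano
import keras

for lib in (tensorflow, theano, keras):
    print(lib.__name__, lib.__version__)
```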

I have only a couple of complaints. First, it doesn’t play nice with Firefox anymore (maybe it never did?), so you’ll have to install Chrom[ium] or IE or something to use it. Not a big deal. Also, while it does have several machine learning libraries, the versions are a bit dated. This might be tied to the Anaconda version they’re running, but it was a bummer to see Keras at version 1.1.2 on PyPI but only find version 0.3.2 on SageMathCloud. Not a big deal, but a number of things have changed since then. I know, however, that chasing versions would be a Sisyphean task given the pace of these libraries. Maybe I’ll just have to live with it.

TensorFlow and CNTK Revisited

It’s been two weeks since I last wrote, as I decided to combine the past two streams into one post. They’re both second looks at past libraries to see how far they’ve come in about a year: Google’s TensorFlow and Microsoft’s CNTK.

The TensorFlow episode was also the one-year anniversary of Dan Does Data! Since I had gotten a new hard drive since then, I actually needed to reinstall TensorFlow anyway, so I took the opportunity to do it live. Being much more familiar with the process these days (and already having CUDA and other prerequisites installed), I found it a pretty simple process involving pip.
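If you want to confirm that a fresh pip install can actually see the GPU, the TF 1.x-era check I’d reach for is something like this (device_lib lives in an internal-ish module, so treat it as a convenience rather than a stable API):

```python
from tensorflow.python.client import device_lib

# A working CUDA setup should list at least one device with
# device_type 'GPU' alongside the CPU.
for device in device_lib.list_local_devices():
    print(device.name, device.device_type)
```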

I also took suggestions from the audience on what future videos you’d like to see. There are more libraries out there to explore, more machine learning techniques to learn about, and of course, a world of applied problems to investigate. The winner by far was applied problems. While I’ve covered font recognition and a self-driving car (and have more to come in that area), which problem to tackle next is still open. If you have an opinion (or an active problem, with data), drop me a line!

CNTK was due for another look because they updated the library with Python bindings, something I asked for in the original episode. Microsoft delivered, so I gave it a whirl. I was hoping for a painless install and, already having a bunch of other libraries installed, tried to install it directly by grabbing their archived whl file. That claimed to install correctly but wouldn’t import. After fighting with it for a bit, I gave up and followed their “scripted” install instructions. These basically download and set up an entire Anaconda distribution in a special environment. Especially for just learning a library, this feels like massive overkill. But I appreciate that they offer the option, else I wouldn’t have gotten CNTK running at all.

The framework itself feels very similar to Theano or TensorFlow. You specify a computation graph of nodes, along with a loss function and input nodes. Then you set it to train given an optimizer of some kind. I personally prefer the syntax of TensorFlow to CNTK’s, but the differences were mainly aesthetic. Their claim that they’re much faster than other libraries is not something I could verify, so take that as you will.
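To make the comparison concrete, here’s the graph-then-train pattern sketched in TensorFlow 1.x syntax (the data and model are toy stand-ins; CNTK’s version differs mainly in spelling):

```python
import numpy as np
import tensorflow as tf

# Build the computation graph: input nodes, a linear model, a loss.
x = tf.placeholder(tf.float32, shape=[None, 2])
y = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable(tf.zeros([2, 1]))
b = tf.Variable(tf.zeros([1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))

# Hand the graph to an optimizer and run some training steps.
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    data_x = np.random.rand(10, 2).astype(np.float32)
    data_y = data_x.sum(axis=1, keepdims=True)
    for _ in range(100):
        sess.run(train_step, feed_dict={x: data_x, y: data_y})
```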

If you’re just getting started with machine learning, I’d suggest sticking with Theano or TensorFlow at this point. They seem to have simpler installs (though still not trivial) and more community support. Better yet, use Keras on top of one of those so you can focus on the fun data science part, and not on coding minutiae!
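For a sense of how much boilerplate Keras hides, the same sort of toy model is only a few declarative lines (Keras 1.x syntax, arbitrary layer sizes):

```python
from keras.models import Sequential
from keras.layers import Dense

# Keras wires up the graph, loss, and optimizer from these calls.
model = Sequential()
model.add(Dense(32, input_dim=2, activation='relu'))
model.add(Dense(1))
model.compile(loss='mse', optimizer='sgd')
# model.fit(data_x, data_y, nb_epoch=10)  # nb_epoch in Keras 1.x
```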

Dan Does Google Data Studio

I decided to take a break from machine learning and instead poke at Google Data Studio. GDS is a platform for building simple dashboards for data visualization. It comes with a bunch of built-in graphs for you to plop wherever on a page, and you can quickly change what data is displayed or the restrictions on it.

I’ll be honest: I haven’t had much use for building slick-looking dashboards. I find that one graphic with the key data is more than enough to get an important idea across. But if I were creating lots of general graphs, Google Data Studio makes it easy and pretty. It’s got a reasonable set of basic graphs (bar, scatter, time series, globe, etc.) and makes filtering pretty simple. Plus, it includes little interactions like showing data point details when you hover over a point. It’s clearly meant for you to create a dashboard and share it with whoever your client is.

So, I don’t think I’ll be using Google Data Studio for daily data science, but for compiling summary statistics, it’s pretty nice. If you serve a corporate overlord that wants nice-looking charts, this is a great way to get your points across without spending too much time fiddling.

Autonomous Car Data Collection Platform

Previously, I’ve used data from existing sources to train our autonomous tractor. This was convenient since it allowed me to get familiar with what variables would be important to me and to get something going just as our vehicle was physically coming together.

But the differences in physics, and the simulated nature of some of the data, meant it never quite fit right for Otto. The difference in top speed and turning radius alone would muck things up. While machine learning is powerful and can often fill in gaps in human understanding, this is a case where it makes more sense to collect our own data. Proper experimental design can drastically reduce the need for intensive computation. This is one step in that direction.

The overall design of this data collection is pretty simple. Every iteration, the webcam on Otto will snap a picture; then we receive the current state (speed, acceleration, steering angle, gas, etc.) from the FUBARino board, followed by a little processing. Every N steps, we save the data off to a file and start a new chunk.

During the stream itself, OBS had control of my webcam (to stream my face to viewers), so I couldn’t test that part of the problem. And perhaps more concerning, we haven’t nailed down what information will actually be sent from the FUBARino, or in what format. So this code remains a skeleton for now. But it’s better to have this skeleton than nothing at all.
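For the curious, the skeleton looks roughly like this. The port name, baud rate, chunk size, and especially the telemetry format are all placeholders until the FUBARino side is settled:

```python
import csv
import time

import cv2      # OpenCV, for the webcam
import serial   # pyserial, for the FUBARino's serial link

CHUNK_SIZE = 100              # the "N" above; actual value TBD
PORT = '/dev/ttyUSB0'         # hypothetical serial port

camera = cv2.VideoCapture(0)
board = serial.Serial(PORT, 9600, timeout=1)

step, chunk = 0, 0
log = open('telemetry_{}.csv'.format(chunk), 'w')
writer = csv.writer(log)

while True:
    # 1. Snap a picture.
    ok, frame = camera.read()
    if ok:
        cv2.imwrite('frame_{:06d}.png'.format(step), frame)
    # 2. Read the current state from the board. The format is still
    #    undecided, so just log the raw line for now.
    line = board.readline().decode('ascii').strip()
    writer.writerow([time.time(), step, line])
    step += 1
    # 3. Every CHUNK_SIZE steps, save off and start a new chunk.
    if step % CHUNK_SIZE == 0:
        log.close()
        chunk += 1
        log = open('telemetry_{}.csv'.format(chunk), 'w')
        writer = csv.writer(log)
```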

Autonomous Car RNN Model

Ah, Recurrent Neural Networks. So natural to describe (“Use some features computed previously in this computation too!”), but always tricky to implement in the proper libraries. I took a whack at it this time, with mixed results.

I believe I have a simple RNN working with just a single frame of lookback. At least, based on the number of parameters, I think that’s what must be happening. It’s always a little hard to tell if I’m loading the data correctly for these kinds of models.
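For reference, here’s the kind of parameter-count sanity check I mean, in Keras (the shapes are made up; “a single frame of lookback” translates to a sequence length of two):

```python
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN

timesteps, n_features = 2, 64   # current frame plus one lookback frame

model = Sequential()
model.add(SimpleRNN(32, input_shape=(timesteps, n_features)))
model.add(Dense(1))             # e.g. a predicted steering angle
model.compile(loss='mse', optimizer='adam')

# SimpleRNN should report 32 * (64 + 32 + 1) = 3104 parameters;
# if the count is off, the data is probably shaped wrong.
model.summary()
```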

I also learned about TimeDistributed layers, which can ease some of the burden of implementing more complex RNNs by handling the multiple inputs at each time step. This would have been useful when doing Font Recognition, but at least by doing it “manually”, I understood exactly what was happening. Tradeoffs of using libraries.
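A minimal sketch of that, continuing the hypothetical shapes from above: TimeDistributed applies the same wrapped layer independently at every time step, so each frame gets an identical Dense transform before the RNN consumes the sequence.

```python
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN, TimeDistributed

timesteps, n_features = 2, 64

model = Sequential()
# The same Dense(16) weights are applied to each of the two frames.
model.add(TimeDistributed(Dense(16), input_shape=(timesteps, n_features)))
model.add(SimpleRNN(32))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')
```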
