Computing Workflows, Data Science, and such


Crop Disease Model

This special episode of DanDoesData takes a look at a Crop Disease Identification Problem. It’s special because it’s sponsored by GrowMobile, LLC/NerdFarmer. This is only a quick stream, not a formal report, so thanks to these folks for donating to FUBAR Labs.

Anyway, this data is about 20k images of leaves. There are 38 different classes showcasing different diseases. The goal is to train a classifier to recognize the disease from a single image. Sounds easy, right? There are a few complications. The images are all different sizes. One dimension is always 256 pixels, but the other varies from 256 up to about 516. To handle this, I decided to cut the center out of the long dimension to square things up. This may cut out interesting data, but for a first pass, hopefully it’s ok. An alternative would be to pad the smaller dimension with empty data. You can then resize them all to standard dimensions, here 224x224. Once you deal with that, you’ll also notice that the photos are all taken at different angles and rotations. Oh well; hopefully that doesn’t impact too many of the disease markings, but it also means that generating more data by rotating images might be particularly useful. There are always little issues coming up.
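Here’s a rough sketch of that center-crop-and-resize step, assuming Pillow is available; the "leaves/" directory layout is just a hypothetical stand-in for however the dataset is actually organized.

    from pathlib import Path
    import numpy as np
    from PIL import Image

    TARGET = 224  # final square size fed to the network

    def square_and_resize(img):
        """Center-crop the long dimension to a square, then resize to TARGET."""
        w, h = img.size
        side = min(w, h)                       # one dimension is always 256
        left = (w - side) // 2
        top = (h - side) // 2
        img = img.crop((left, top, left + side, top + side))
        return img.resize((TARGET, TARGET))

    # Stack everything into an array of shape (n_images, 224, 224, 3)
    paths = sorted(Path("leaves").glob("*.jpg"))  # hypothetical layout
    data = np.stack([np.asarray(square_and_resize(Image.open(p))) for p in paths])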

The model itself I whipped up pretty quickly: five layers of convolutions with max pooling, followed by a few dense layers and a final softmax to predict 38 class probabilities. After training for 1 epoch (15 minutes), the final accuracy was about 15%. Pretty crummy, but better than random. And this was just a first pass model to get the process down. For production use, one should study what others have done, fine-tune a model with more (or fewer!) layers, and train for a while. It’s not so different from a certain Font Recognition problem.
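For the curious, a minimal Keras sketch of that sort of conv/pool stack looks something like the following; the filter counts and dense size here are my guesses, not the exact numbers from the stream.

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)))
    model.add(MaxPooling2D((2, 2)))
    for filters in (32, 64, 64, 128):           # four more conv/pool blocks
        model.add(Conv2D(filters, (3, 3), activation='relu'))
        model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dense(38, activation='softmax'))  # one probability per disease class

    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])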

Nvidia Autopilot Model

Happy New Year! We’re back with a brand new model, same old data. Nvidia posted a model in their End to End Learning for Self-Driving Cars paper last year, and I sought to recreate it in Keras. The paper is a pretty easy read and the model isn’t too complicated. The biggest trip-up for me was the convolutional layers: rather than using a stride of 1 and max-pooling with a stride of 2 (computing four values and then keeping the “strongest”), they just use a convolution with a stride of 2 (basically always taking the top-left box of each 2x2 grid). But this was pretty easy to do in Keras. The output activation, as I gathered from SullyChen’s TensorFlow implementation, is an arctangent function. Graphically, this is similar to the hyperbolic tangent, so I just used that (tanh is part of vanilla Keras while atan isn’t). Doing custom activations isn’t hard, but that wasn’t the purpose of this stream.
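A sketch of that kind of network in Keras, with strided convolutions in place of pooling and tanh standing in for atan on the output, might look like this. The layer sizes and 66x200 input follow the paper; my stream used smaller, compressed images, and the ReLU activations are an assumption on my part.

    from keras.models import Sequential
    from keras.layers import Conv2D, Flatten, Dense

    model = Sequential()
    model.add(Conv2D(24, (5, 5), strides=(2, 2), activation='relu',
                     input_shape=(66, 200, 3)))
    model.add(Conv2D(36, (5, 5), strides=(2, 2), activation='relu'))
    model.add(Conv2D(48, (5, 5), strides=(2, 2), activation='relu'))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(Flatten())
    model.add(Dense(100, activation='relu'))
    model.add(Dense(50, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1, activation='tanh'))   # steering output, tanh in place of atan

    model.compile(optimizer='adam', loss='mse')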

The results came out pretty well. At the very least, it matched similar “large” models I had created earlier. One thing to note is that I used my compressed dataset from DeepDrive, so my images are smaller and I have fewer interior values to flatten and pass to the first dense layer. Ultimately, this model suffers from the same issue as my previous models: it doesn’t stray much from driving straight. Not that this behavior is inherently wrong; it does go left and right, just not as much as I might have expected. But hey, 90% of driving is going straight, ya?

Bare Minimum of Matplotlib

Matplotlib is a great data plotting package you can use interactively via Python. While I’ve been using it on this stream since day 0 (and long before that professionally), I had never taken the time to walk through the “bare minimum” needed to make it useful. Thinking there might be other tools and methods that need only a small amount of explanation to be really powerful, I started this as a new series.

In this case, there are two key features of matplotlib that every data scientist should know. The first is how to make a simple scatter plot. With a single list, you can plot value vs. index by doing plt.plot(foo_list). That covers a surprising number of use cases when you just need a quick visual on some data. For 2-dimensional data, you probably want plt.plot(x, y, 'o') unless you really care about the order of the data points. The second salient feature is handling heat maps, which is a fancy way to say “color a grid based on the value in each box”. So if you’ve got a pixelated image, that’s basically a heat map with 3 color channels. matplotlib makes this easy with plt.pcolormesh(foo_array). You can change the color settings, show actual images, and do other cool stuff, but I find this handles nearly all situations. These functions don’t cover everything, but they’re the ones everyone should know.
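A quick demo of those calls, with made-up data standing in for anything real:

    import numpy as np
    import matplotlib.pyplot as plt

    y = np.random.randn(100).cumsum()
    plt.plot(y)                      # value vs. index
    plt.show()

    x = np.random.rand(50)
    z = x ** 2 + 0.1 * np.random.randn(50)
    plt.plot(x, z, 'o')              # scatter: unordered (x, y) pairs
    plt.show()

    grid = np.random.rand(16, 64)
    plt.pcolormesh(grid)             # heat map: color each cell by its value
    plt.colorbar()
    plt.show()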

Crashing and Burning with Autoencoders

Sometimes, you crash and burn. I wanted to take a look at autoencoders this week since that’s an area I haven’t worked with much. Things were going fine as I created a simple model with 32 encoded features. The results were crummy, sure, but it ran alright.
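For reference, a simple autoencoder with a 32-feature bottleneck looks roughly like this in Keras; the activations, loss, and layer details are my assumptions, not necessarily what I used on stream.

    from keras.models import Model
    from keras.layers import Input, Dense

    inputs = Input(shape=(1024,))                        # 16 * 64 flattened pixels
    encoded = Dense(32, activation='relu')(inputs)       # the 32-feature bottleneck
    decoded = Dense(1024, activation='sigmoid')(encoded)

    autoencoder = Model(inputs, decoded)
    autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
    # autoencoder.fit(X, X, epochs=5, batch_size=256)    # train to reconstruct the input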

But I wasn’t careful about my memory management. For data, I was using the text images from my font classification project. I store the pixels as unsigned bytes, or np.uint8’s. Each image is 16 * 64 = 1024 pixels and so takes up that many bytes. There are about 500,000 images total, so that’s more than 500,000,000 bytes, or about half a GB. No big deal. Ah, but then I divided by 255 to convert these to floats. No big deal, right? Wrong: because I didn’t specify a datatype when I did the conversion in NumPy, I got np.float64, which uses 8 bytes per element. So now my data is 4 GB. Still not too bad, until I made two copies of it, one flattened and one with the image dimensions. My machine nominally has 8 GB of memory, but between this indulgence, streaming, and various programs, everything froze.
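Here’s a small illustration of that silent promotion, scaled down to 1,000 images so it’s safe to run; asking for float32 up front would at least have halved the damage.

    import numpy as np

    images = np.zeros((1000, 16, 64), dtype=np.uint8)  # stand-in for the real ~500k images
    print(images.nbytes)                               # 1,024,000 bytes: one per pixel

    promoted = images / 255.0                          # NumPy silently promotes to float64
    print(promoted.dtype, promoted.nbytes)             # float64, 8x the memory

    explicit = images.astype(np.float32) / 255.0       # request float32 explicitly
    print(explicit.dtype, explicit.nbytes)             # float32, only 4x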

Important lesson: be careful about your memory. Python will hide things from you, but that can make it that much easier to hurt yourself.

Interactive Computing Workflow Intro

Every time I install a machine learning library, I complain about the “Jenga tower of dependencies” to make everything work. This time, I nearly ate my words as I demonstrated setting up and using my own custom data science workflow. But, it all worked out in the end. The gears of IPython, tmux, vim, and slimux all meshed beautifully to produce an elegant workflow, even in the pure terminal environment.

Of course, my workflow is described in much greater detail over at my other site. I still plan to publish that as a proper ebook at some point, but still need to do more editing. If you’d like to see it commercially available, let me know!
