With the (Blue, Green, Red) issue clarified regarding the DeepDrive data, I was able to start reading in the dataset proper and do some training. Nothing fancy yet; I just wanted to modify the model enough to accept this type of data.
So, the model did something sane, which was good to see. I only had 1000 samples in this slice of the data, and I didn’t bother to train a model that was supposed to be good. I’m just glad to see that it didn’t blow up in my face.
Also, just today I saw a new autonomous car dataset came out from comma.ai. This data includes pretty much the same kind of thing as in DeepDrive, but it’s 7 hours of actual highway driving in various conditions. The authors were also kind enough to publish the code and everything for their model. High fives to them! I’ll definitely be taking a look at their model and this data in a future livestream.
Tipped off by a good friend from FUBAR, I found a reasonable data set for training a self-driving car. DeepDrive is full video plus telemetry of a vehicle captured from the video game Grand Theft Auto V. This includes the speed, delta speed, accelerator, and other key data points, in addition to the graphics frame at that time. This is everything I wanted except for the rangefinder, and that wasn’t required anyway.
As DeepDrive is run by a single awesome guy, it’s not as fully described as it could be. Most of the data is self-explanatory, but I spent a good portion of this video figuring out the image format. I’m most used to Red, Green, Blue formats, but that never quite gave me the expected results. I could get close by inverting the color, but it still looked off.
Thankfully, Craig from DeepDrive heard my plight. He told me that Caffe, the ML library he was using, expects images in Blue, Green, Red format! Fixing this gives realistic-looking images when plotted normally.
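The fix itself is just reversing the channel axis before plotting. Here’s a minimal sketch; the file path and variable names are placeholders, not anything from the actual DeepDrive loader:

```python
import numpy as np
import matplotlib.pyplot as plt

# frame_bgr: a single (height, width, 3) image stored Caffe-style,
# with channels ordered Blue, Green, Red (hypothetical file name).
frame_bgr = np.load("frame.npy")

# Reversing the last axis converts BGR -> RGB, so matplotlib shows
# realistic colors instead of the oddly tinted originals.
frame_rgb = frame_bgr[..., ::-1]

plt.imshow(frame_rgb.astype(np.uint8))
plt.show()
```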
This episode didn’t really feature much machine learning, but that’s ok. This is “Dan Does Data” because understanding one’s data is key to any modeling problem. Often, sifting through the data itself takes more time than coding a model, especially given modern libraries. But it’s still important to look at the data.
As a one-time civil engineer, self-driving cars hold a place in my heart. As a commuter, I also hunger for the freedom that they will create. Not only the freedom to nap on my way to and from work, but the freedom to stop owning a car, the freedom for the elderly and disabled to get wherever they want to go, and the freedom from the reptilian side of my brain when it comes to parking. So I’m excited to be working on a Powerwheels Racing autonomous car with FUBAR Labs.
This video just plans out the model inputs/outputs and slaps together some skeleton code. I think we have the basics of a decent model with the following inputs at time t-1:
Acceleration/Braking
Speed
Distance to object in front
Steering wheel angle
Webcam front view
We can combine all these inputs into a neural net (or just a linear model) using a graph model. The webcam images will probably go through a few convolutional layers first, and then everything gets dumped into some densely connected layers. Once we get down to some true core features, we extract our desired outputs at time t (there’s a rough sketch after the output list below), namely:
Gas pedal/braking
Steering wheel angle
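Here’s roughly what that graph might look like, written as a sketch with Keras’s functional API. Every layer size, input name, and the 64x64 camera resolution below is an assumption made up for illustration; the real skeleton code may look quite different.

```python
from keras.layers import Input, Conv2D, Flatten, Dense, Concatenate
from keras.models import Model

# Scalar inputs at time t-1 (names are placeholders).
accel_in = Input(shape=(1,), name="accel_brake")
speed_in = Input(shape=(1,), name="speed")
dist_in = Input(shape=(1,), name="distance_ahead")
steer_in = Input(shape=(1,), name="steering_angle")

# Webcam front view at time t-1, run through a couple of conv layers.
cam_in = Input(shape=(64, 64, 3), name="webcam")
x = Conv2D(16, (3, 3), activation="relu")(cam_in)
x = Conv2D(32, (3, 3), activation="relu")(x)
cam_features = Flatten()(x)

# Merge everything and squeeze it down to some core features.
merged = Concatenate()([accel_in, speed_in, dist_in, steer_in, cam_features])
core = Dense(64, activation="relu")(merged)
core = Dense(32, activation="relu")(core)

# Desired outputs at time t.
gas_out = Dense(1, name="gas_brake")(core)
steer_out = Dense(1, name="steering_out")(core)

model = Model(inputs=[accel_in, speed_in, dist_in, steer_in, cam_in],
              outputs=[gas_out, steer_out])
model.compile(optimizer="adam", loss="mse")
```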
The hard part will be getting a good data set for this. Most of the data I’ve seen aimed at self-driving cars is trying really hard to do object detection with LIDAR and other fancy methods; I’m just looking for something with these simple parameters. And that’s before even worrying about the specific scenario of Powerwheels racing, which is more like go-karts than real cars. As with all modeling problems, it’s never quite what you initially imagine, but you always have something to work with.
This is it, a combined Convolutional/Recurrent model. The plain RNN was ok, but I really want to feed interesting subfeatures to the recurrence, not raw pixel values. This way, the RNN only has to worry about important features, rather than both computing the features and remembering the past.
This turned out to be more difficult to implement than I bargained for. Essentially, we want to run the same convolutional weights against every block (in time) of the input image. But Keras doesn’t inherently support this.
What it does support is graph-based models. Much like pure Theano or TensorFlow, you can specify general computation graphs and still benefit from automatic differentiation to do your optimization. So I had to read in the 8 “blocks” of each image as separate inputs, apply the same convolution to each, then merge the results. Finally, to get the shape the LSTM expects (a time x number-of-features array), I Reshape’d the merged intermediate output and dumped it into the same LSTM from last week.
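Here’s roughly how that wiring looks. I’ve written it as a sketch with the functional API rather than the older Graph interface I actually used, and the block size, filter counts, feature width, and number of font classes are all placeholders:

```python
from keras.layers import (Input, Conv2D, Flatten, Dense, Concatenate,
                          Reshape, LSTM)
from keras.models import Model

n_blocks, block_rows, block_cols = 8, 16, 16  # assumed chunking
feat_per_block = 32
n_fonts = 100  # placeholder for the number of font classes

# One input per block of the image.
block_inputs = [Input(shape=(block_rows, block_cols, 1), name="block_%d" % i)
                for i in range(n_blocks)]

# Reusing the same layer objects applies identical weights to every block,
# which is how the shared convolution is expressed in this API.
shared_conv = Conv2D(8, (3, 3), activation="relu")
shared_dense = Dense(feat_per_block, activation="relu")

block_features = [shared_dense(Flatten()(shared_conv(b))) for b in block_inputs]

# Merge the 8 feature vectors, then Reshape into the (time, features)
# array the LSTM expects: 8 timesteps of feat_per_block features each.
merged = Concatenate()(block_features)
sequence = Reshape((n_blocks, feat_per_block))(merged)

out = LSTM(64)(sequence)
out = Dense(n_fonts, activation="softmax")(out)

model = Model(inputs=block_inputs, outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```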
After some fighting with the exact input format model.fit expects (a list of the inputs, which meant swapping the axes for data points and blocks), I finally got the model to run. It cranked through about 2/3 of the training data (~300,000 examples), reaching about 60% accuracy. I was impressed that this was in the first epoch of the stupidest, yet most complex, model I could design in an hour.
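Continuing the sketch above, the axis shuffle for model.fit looks something like this; the data shapes and class count are again placeholders, not the real training set:

```python
import numpy as np

# Placeholder training data: samples x blocks x rows x cols x channels,
# plus one-hot labels over the assumed 100 font classes.
X_blocks = np.random.rand(1000, 8, 16, 16, 1)
y = np.eye(100)[np.random.randint(0, 100, size=1000)]

# model.fit wants a list with one array per input layer, so the block
# axis comes out front: slice the (samples, blocks, ...) array into
# 8 separate (samples, ...) arrays, one per block input.
inputs = [X_blocks[:, i] for i in range(X_blocks.shape[1])]

# `model` is the shared-convolution sketch from above.
model.fit(inputs, y, batch_size=128, epochs=1)
```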
Unfortunately, I ran into a problem with my GPU. A memory error came up and corrupted the model. This not only killed the model fitting, but it also kept me from accessing the weights. It’s possible that my GPU ran out of memory, but I think that’s unlikely: even with 64-bit floats, there are only about 20k parameters in the model, which would be roughly 160 kB of memory, way less than my card supports. So it’s still a mystery.
While there may be an encore presentation, I think this wraps up “For Well Thou Know’st”. We started with a simple logistic regression model for font classification, got even simpler, and then ramped up the complexity on a challenging data set. If I can get some serious results, I’ll publish them in a journal somewhere.
After a break last week, I stormed back into this font modeling problem. The jittered convolutional model gave reasonable results but required a huge model. To get around this, I reframed the problem as a time series. Rather than treat each image as one large 16 x 128 array of pixels, split it into 8 chunks, each one 16 rows by 16 columns (the 128 columns split 8 ways). Now compute some features using that chunk’s pixels, and some features extracted from the previous chunks. We slowly build up confidence about the font by looking at roughly one character at a time. Sounds almost human.
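The chunking itself is just an array reshape. Here’s a small sketch with placeholder names, assuming non-overlapping chunks of consecutive columns:

```python
import numpy as np

def to_chunks(images, n_chunks=8):
    """Split (n_samples, 16, 128) images into 8 consecutive column chunks.

    Returns an array of shape (n_samples, n_chunks, rows, cols // n_chunks),
    i.e. an 8-step sequence of 16x16 blocks per image.
    """
    n, rows, cols = images.shape
    chunks = images.reshape(n, rows, n_chunks, cols // n_chunks)
    # Move the chunk axis in front of the pixel axes: (n, chunks, rows, cols).
    return chunks.transpose(0, 2, 1, 3)

# e.g. an array of shape (1000, 16, 128) becomes (1000, 8, 16, 16).
```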
Since I only just got the model to work in 5 minutes of overtime on this video, I didn’t have a chance to train for more than a handful of epochs. But it seemed to be doing a bang-up job even with that paltry amount of training time.
Next week, I’ll refine this model. In the end, we’ll have a decent font recognizer that can work on a small slice of text.