Computing Workflows, Data Science, and such


Jittered Model

What I didn’t mention last week was that I tried some simple models with the rebuilt data…and they were still way too accurate to be believable. Even after filling the horizontal space, the vertical extent still gave the models something to cheat with. To get around this, I remembered a common technique for image data: jitter it. That is, for any given image, shift it around by a few pixels and use the shifted copy as well.
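The actual augmentation code lives in the project repo; here is a minimal NumPy sketch of the idea, with the function name and the max_shift and fill parameters chosen just for illustration:

```python
import numpy as np

def jitter(image, max_shift=2, fill=0):
    """Return a copy of `image` shifted by a random offset of up to
    `max_shift` pixels in each direction, padding with `fill`."""
    dy, dx = np.random.randint(-max_shift, max_shift + 1, size=2)
    h, w = image.shape[:2]
    out = np.full_like(image, fill)
    # Copy the overlapping region of the original into the shifted position.
    out[max(0, dy):min(h, h + dy), max(0, dx):min(w, w + dx)] = \
        image[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    return out
```

Training on several jittered copies of each image keeps the model from memorizing exactly where the text sits in the frame.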

Doing this yields a much more reasonable result with a typical multilayer perceptron model. The accuracy was still well above random guessing, but it wasn’t perfect either.

I also tried a convolutional neural net model on this data. As expected, this performed very well, getting nearly 100% accuracy. Fonts have lots of clues that give away their identity at the small scale.
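The post doesn’t pin down the architecture, so the layer sizes below are assumptions; this is just a small convolutional sketch of the kind of model described, operating on the rebuilt 64-pixel-wide, 16-pixel-tall grayscale images with five font classes:

```python
import tensorflow as tf

# Assumed shapes: 16x64 grayscale images, five font classes.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(16, 64, 1)),
    tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```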

One thing I still want to try is looking at a small slice of the image, say 8 pixels wide. Then I’ll run a small net on this slice to develop some confidence about the overall nature of the class. Moving on to the next slice, we’ll carry over some state and do it again. This should be faster to train (fewer weights) and should work on variable-length data.
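Sketched below in Keras (the framework, the GRU and its 32 units, and the assumption of 16-pixel-tall images with five font classes are all mine), the slice-and-remember-state idea looks like a small recurrent net over 8-pixel-wide columns:

```python
import tensorflow as tf

SLICE_W, IMG_H, N_CLASSES = 8, 16, 5

# Each sample is a sequence of flattened 8x16 slices; `None` lets the
# number of slices (and hence the image width) vary per sample.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, SLICE_W * IMG_H)),
    tf.keras.layers.GRU(32),   # state carried from slice to slice
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```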

Rewriting the Script

Further investigation into the “For well thou know’st” data set revealed that just the overall width and height of the space taken up by the text accounted for most of the predictive power.

This is really an artifact of how I created the data, so I don’t think it’s fair to model it that way. To solve this, I remade the data set using a variable number of characters for each font. Instead of always using 16 characters, I now use as many as necessary to fill a width of 64 pixels (half the width of the previous data), up to 32 characters. Because I had been careful about creating and reading in my data in the first place, this didn’t take long at all.
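The real generation script is in the repo; below is a minimal Pillow sketch of the fill-the-width idea, with render_sample, the font size, and the background/foreground values chosen just for illustration:

```python
from PIL import Image, ImageDraw, ImageFont

TARGET_W, IMG_W, IMG_H, MAX_CHARS = 64, 64, 16, 32

def render_sample(text, font_path, size=12):
    """Render as many characters as needed to span TARGET_W pixels,
    capped at MAX_CHARS, into a fixed-size grayscale image."""
    font = ImageFont.truetype(font_path, size)
    img = Image.new("L", (IMG_W, IMG_H), color=255)
    draw = ImageDraw.Draw(img)
    n = 1
    while n < min(MAX_CHARS, len(text)) and \
            draw.textlength(text[:n], font=font) < TARGET_W:
        n += 1
    draw.text((0, 0), text[:n], font=font, fill=0)
    return img
```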

Hark! A Model

“For well thou know’st” now has a model. Not a very deep or complex model, but a model. I created a simple logistic regression on the full 128-by-16-pixel images, which gave either 60% or 40% accuracy depending on the run. I think it was stuck in a local maximum where only a couple of classes were really having their weights adjusted. Still, it was a start.

Just for giggles, I decided to upgrade the logistic regression to a simple neural network with a single hidden layer of 16 neurons. And I trained it for all of 2 epochs. The result: 97% accuracy on the validation set! I think the difference is that the optimization could adjust all the neuron weights (or enough of them) to capture features of all the font classes. Not too shabby for a few extra seconds of effort.
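The post doesn’t name a framework, so take this as a sketch of the upgrade in Keras: flattened 128x16 images, one hidden layer of 16 neurons, a softmax over the five font classes, and two epochs of training (the activation and optimizer are assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128 * 16,)),        # flattened pixels
    tf.keras.layers.Dense(16, activation="relu"),    # the single hidden layer
    tf.keras.layers.Dense(5, activation="softmax"),  # five font classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=2, validation_data=(X_val, y_val))
```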

Looking more closely at the neuron weights and some example images, I suspect it’s just finding how tall and wide some of the text renderings are. Narrow fonts are centered and rarely reach into the edges, so those pixels get negative weight. Likewise, tall fonts reach above the others, so those pixels should get positive weight for some neurons. Ultimately, I believe the logistic regression should have been able to do something similar, but it kept getting stuck. Maybe if I prime it with sane starting weights, it will do a decent job.
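For the inspection itself, something like the following works on a Keras model like the sketch above, assuming the images were flattened row-major from a 16x128 array:

```python
import matplotlib.pyplot as plt

# First weight matrix of the network above: (2048 pixels, 16 neurons).
W = model.get_weights()[0]
fig, axes = plt.subplots(4, 4, figsize=(10, 4))
for i, ax in enumerate(axes.flat):
    # Reshape each neuron's input weights back into image coordinates.
    ax.imshow(W[:, i].reshape(16, 128), cmap="coolwarm")
    ax.axis("off")
plt.show()
```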

Because this simple model did so well, I may have to reevaluate the research problem. Right now, the model is really only keying in on a couple of size features, but I really want to learn what features are particular to different types of fonts. This may mean remaking the dataset so the text fills the available image space (a variable number of characters, even with a constant-size image). Or, I might step away from neural nets and try to build a model with features as simple as possible. That is, can I classify these fonts knowing just the overall size of the text area? Two features, five classes? Maybe.

Reading in Shakespeare

“For well thou know’st, to my dear doting heart”

William Shakespeare, Sonnet 131

I’m embarking on a new research project: recognizing a font from a snippet of Shakespeare set in that font. The above quote happened to be one of the example phrases I looked at, and I think it captures the spirit of machine learning.

This episode was mostly about reading in data. I had already prepped the dataset (hand-crafted using PIL), but I explored how to read an HDF5 file in Python. All the code for this project is available on GitHub.
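Here is a minimal h5py sketch of the reading step; the file name and the "images"/"labels" dataset keys are placeholders, not the project’s actual names:

```python
import h5py
import numpy as np

with h5py.File("fonts.hdf5", "r") as f:
    images = np.array(f["images"])   # e.g. (n_samples, 16, 128) pixel arrays
    labels = np.array(f["labels"])   # integer font-class ids

print(images.shape, labels.shape)
```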

Caffe Segfault

Caffe and I just don’t get along, it seems. Though I gave in and used the Python 2 Docker image for Caffe, I could not get basic models to run. This was supposed to be a CPU-only image, but all the failures cited CUDA problems.
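For reference, this is the standard pycaffe way to force CPU mode before loading a net (the prototxt and caffemodel file names are placeholders); it’s the call a CPU-only image should satisfy, not a fix I managed to verify:

```python
import caffe

caffe.set_mode_cpu()   # run everything on the CPU, no CUDA
net = caffe.Net("deploy.prototxt", "weights.caffemodel", caffe.TEST)
```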

I wanted to check out Caffe since it was another machine learning library that I had heard a lot about but never tried. But I guess it just wasn’t meant to be. Next week, it’s a fresh new research problem: “Dan (actually) Does Data”, for once.
