OK, I hope I don't become infamous for non-solutions, but I like to share what didn't work, since that can be almost as instructive as what did.
As TTT is a trivial and largely uninteresting game (as has been discussed), I was more interested in the journey than the result. I decided to brush up on my Q-learning, so I coded up a Q-learner AI, pitted it against itself, and let it run for several hours. (I didn't have time to observe it closely, or the inclination to implement reporting to make pretty graphs or whatnot.) I expected that when I came back it would be drawing against itself most of the time, but it was not. It seemed to be playing more or less decent games, with each side winning roughly half the time, but they very rarely drew. I think perhaps they were taking more risks than they ought to have, or perhaps I had set the exploration constant (the amount of randomness in their decisions) too high.
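To make the setup concrete, here's a minimal sketch of the kind of thing I mean (not my actual code, which is on the laptop at home): tabular Q-learning with epsilon-greedy exploration, two copies of the learner playing each other, with each learner's update deferred until its next turn. The names and constants (ALPHA, GAMMA, EPSILON, the reward values) are illustrative guesses, not my real values.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.3, 0.9, 0.1   # learning rate, discount, exploration rate

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def moves(board):
    """Indices of the empty squares."""
    return [i for i, cell in enumerate(board) if cell == ' ']

def choose(q, board):
    """Epsilon-greedy action selection over the legal moves."""
    legal = moves(board)
    if random.random() < EPSILON:
        return random.choice(legal)
    return max(legal, key=lambda a: q[(board, a)])

def update(q, state, action, reward, next_state):
    """One-step Q-learning update; next_state is the board as this player
    sees it on its next turn, or None if the game ended first."""
    if next_state is None:
        target = reward
    else:
        target = reward + GAMMA * max(q[(next_state, a)] for a in moves(next_state))
    q[(state, action)] += ALPHA * (target - q[(state, action)])

def self_play(episodes=100000):
    q = {'X': defaultdict(float), 'O': defaultdict(float)}
    for _ in range(episodes):
        board = (' ',) * 9
        pending = {'X': None, 'O': None}     # (state, action) awaiting an update
        player, opp = 'X', 'O'
        while True:
            state = board
            action = choose(q[player], state)
            board = board[:action] + (player,) + board[action + 1:]
            if winner(board) or not moves(board):
                # Terminal: reward the mover, and close out the opponent's
                # pending move with a loss (or nothing, on a draw).
                won = winner(board) is not None
                update(q[player], state, action, 1.0 if won else 0.0, None)
                if pending[opp]:
                    update(q[opp], *pending[opp], -1.0 if won else 0.0, None)
                break
            # The opponent's previous move finally gets its update now,
            # using the board as it looks on their turn.
            if pending[opp]:
                update(q[opp], *pending[opp], 0.0, board)
            pending[player] = (state, action)
            player, opp = opp, player
    return q
```

In a sketch like this, the "constant of randomness" is EPSILON; if it never decays, both copies keep making exploratory blunders and handing each other wins, which could produce exactly the trading-wins, rarely-drawing behaviour I saw.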
Of course it's also entirely possible I have an implementation bug.
The other possibility I'm afraid of is a theoretical bug. I'm not entirely sure Q-learning is appropriate for this problem the way I implemented it. My concern is that, as I implemented it, the process might not be a true MDP, because the state transition is not deterministic: the reward/next-state pair is taken from the position when it's our turn again, and what the other player does in between is nondeterministic.
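To make that worry concrete: one way to frame it is to fold the opponent into the environment, so that a single "step" runs from our move to our next turn. The next state is then sampled rather than determined, because it depends on whatever the other player does. A rough sketch of that framing, reusing the winner/moves helpers from above; opponent_policy is just a placeholder, not something from my actual code:

```python
def env_step(board, me, action, opponent_policy):
    """Apply our move, then the opponent's (sampled) reply, and return the
    board as we next see it, plus a reward and a done flag."""
    board = board[:action] + (me,) + board[action + 1:]
    if winner(board):                       # our move won outright
        return board, 1.0, True
    if not moves(board):                    # draw
        return board, 0.0, True
    opp = 'O' if me == 'X' else 'X'
    reply = opponent_policy(board)          # this is where the nondeterminism lives
    board = board[:reply] + (opp,) + board[reply + 1:]
    if winner(board):                       # the reply beat us
        return board, -1.0, True
    return board, 0.0, not moves(board)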
The laptop is at home and turned off, but I'd be happy to send the code later if anyone's interested in seeing it.