Research: Bioinformatic E. coli Simulation
Learning machine learning
Since I was still pre-med, I was doing a post-baccalaureate at UC Davis, and this was right when neural networks were starting to take off. Some very clever people found a way to make computers do things that weren't explicitly programmed, and I was fascinated.
For a course final project, I used the PyBrain library to code a neural network for breast cancer classification. It classified breast cancer aspirate samples as either benign or malignant, based on cell morphological features, with 97% accuracy. When I turned in the assignment, though, I sweated some bullets. I showed my professor the code and let him randomly pick samples to analyze. The first classification was fine, but the second one was wrong! A 3% chance, and it just so happened I hit bingo. Luckily the next nine worked fine, and he had to tell me to stop because I really wanted him to believe it was working. Please don't give me a B...
I was also interested in combining this rapidly advancing computer technology with the molecular biology background I already had. So I took a course in bioinformatics and ended up doing research in my professor's lab. I got to work with Javier, who was a great guy to work with: extremely smart, kind, and willing to teach me how these equations and algorithms worked. Honestly, though, a lot of my questions were about what some obscure bioinformatics acronym meant. You'd think you could just read the original paper that defined the term, but it was hard to find because the search results were full of papers that merely used it. (Try figuring out what FBA, or flux balance analysis, means: not just finding the equations, but what each variable in those equations specifically represents. Crazy stuff, but cool.)
This was when I really started to sink my teeth into programming. I needed to write a regular expression compiler to remap enzyme interactions into Matlab expressions that were compatible with the rest of the E. coli model. I didn't know what all those words meant initially, but I do now!
Describing biology
Beyond the coding aspect, it was also interesting that I needed to develop some way to faithfully represent the output of metabolic enzyme interactions. Basically, I needed to make up new bioinformatics. Specifically, I needed to handle two cases: a metabolic reaction that requires Enzyme 1 AND Enzyme 2, and one that can proceed with Enzyme 1 OR Enzyme 2. Here's an example:
( AcrA (b0463) and AcrB (b0462) and TolC (b3035) ) or ( AcrA (b0463) and AcrD (b2470) and TolC (b3035) )
In the end, I settled on representing (E1 AND E2) as min(E1, E2), because when both enzymes are required, the lower of the two is rate-limiting and determines the overall reaction rate. (E1 OR E2) is represented as E1 + E2, because if either enzyme suffices, both contribute to the overall reaction. What's important is that the rule still holds for more complex interactions, and it provides a simple way to compile them into Matlab expressions. So:
((E1 AND E2) OR E3) would be min(E1, E2) + E3
((E1 OR E2) AND E3) would be min((E1+E2), E3)
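To make the translation concrete, here is a minimal Python sketch of the AND-to-min, OR-to-sum rule. It is not the original regular expression compiler, which isn't shown here. The function name to_matlab is hypothetical, and three things are assumptions of this sketch: AND binds tighter than OR when parentheses are missing, entries like "AcrA (b0463)" are reduced to their locus tag to get a variable name, and an AND over three or more enzymes is written as nested two-argument min calls.

```python
import re
from functools import reduce

# "AcrA (b0463)" style entries: gene name followed by a locus tag (assumed format).
_TAGGED = re.compile(r"[A-Za-z]\w*\s*\((b\d+)\)")
_TOKEN = re.compile(r"\(|\)|\w+")


def to_matlab(rule):
    """Translate an and/or enzyme rule into a min/+ expression string."""
    # Keep only the locus tag so each enzyme is a single identifier token.
    tokens = _TOKEN.findall(_TAGGED.sub(r"\1", rule))
    pos = 0

    def peek():
        return tokens[pos].lower() if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of rule")
        pos += 1
        return tokens[pos - 1]

    def parse_chain(keyword, parse_operand):
        # Collect operands separated by the same keyword ("and" or "or").
        parts = [parse_operand()]
        while peek() == keyword:
            take()
            parts.append(parse_operand())
        return parts[0] if len(parts) == 1 else (keyword, parts)

    def parse_or():
        return parse_chain("or", parse_and)

    def parse_and():
        return parse_chain("and", parse_atom)

    def parse_atom():
        tok = take()
        if tok == "(":
            node = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return node
        return tok

    def render(node):
        if isinstance(node, str):
            return node
        op, children = node
        if op == "or":
            # Either enzyme suffices, so their contributions add up.
            return " + ".join(render(c) for c in children)
        # All enzymes are required, so the lowest one limits the rate.
        # Sums are parenthesized when they appear inside min(...).
        operands = [
            f"({render(c)})" if isinstance(c, tuple) and c[0] == "or" else render(c)
            for c in children
        ]
        return reduce(lambda a, b: f"min({a}, {b})", operands)

    tree = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return render(tree)
```

Under those assumptions, to_matlab("(E1 AND E2) OR E3") gives min(E1, E2) + E3, and the AcrA/AcrB/TolC example above comes out as min(min(b0463, b0462), b3035) + min(min(b0463, b2470), b3035).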
And that's bioinformatics being created right in front of your eyes. The papers are pretty dense, but if it's distilled into simple terms you can really see that magic unfold.