Accepted to Carnegie Mellon, Stanford, Cornell, and the University of
Washington, and rejected from MIT and Berkeley
Many applications now hinge on enormous amounts of data, so it is important to find algorithms that run efficiently on large datasets. Biologists have to fish through billions of base pairs to pry information out of the human genome. The Square Kilometer Array of telescopes will produce exabytes of data every day.
Unfortunately, because these datasets are so huge, it can be prohibitively expensive to run algorithms on them. In grad school I would like to design algorithms that help us process data at this massive scale.
This problem became personally important to me when I entered the (mock) Netflix challenge for a class project. With 100 million movie ratings, any algorithm I implemented took many hours to run, even with 12 GB of RAM. That's when I grew frustrated and started dreaming of ways to make big data smaller. I came across papers by Li et al. on conditional random sampling [1-2], which convinced me of the power of sketching: replacing a massive dataset with a much smaller summary that still answers the queries you care about. I was excited to learn that sketches can be used to estimate Euclidean distances, since computing Euclidean distances was the slowest step in some of my algorithms. I was also intrigued that sketching could help optimize database performance, since I had spent a whole term studying databases.
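As an illustration of the kind of trick that hooked me, here is a minimal distance-preserving sketch. It uses a dense Gaussian random projection (a Johnson-Lindenstrauss-style construction, not the conditional random sampling of [1-2]), and the dimensions, seed, and test vectors are all arbitrary choices of mine:

```python
import numpy as np

def make_sketcher(d, k, seed=0):
    """Map d-dimensional vectors to k-dimensional sketches.

    A dense Gaussian random projection: with k on the order of
    log(n) / eps**2, squared Euclidean distances among n vectors
    are preserved up to a (1 +/- eps) factor.
    """
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((k, d)) / np.sqrt(k)
    return lambda vec: proj @ vec

d, k = 100_000, 400                    # sketch is 250x smaller
sketch = make_sketcher(d, k)

rng = np.random.default_rng(1)
u, v = rng.standard_normal(d), rng.standard_normal(d)

exact = np.linalg.norm(u - v)
approx = np.linalg.norm(sketch(u) - sketch(v))  # by linearity, = sketch(u - v)
print(f"exact {exact:.1f}, sketched {approx:.1f}")
```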
Exploring further, I found Jelani Nelson's papers on data streams [3-4], which introduced me to the idea of computing summary statistics of data streams in sublinear space. I was amazed that he could approximate the norm of a constantly changing vector without ever storing its coordinates. I was equally impressed by Cormode and Muthukrishnan's paper on the count-min sketch [5], which answers inner-product and various other queries in the same tiny footprint. Reading these papers made me want to work on similar topics in grad school.
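To make that concrete, here is a minimal count-min sketch in the spirit of [5]. The table sizes and the example stream are illustrative, and Python's built-in hash stands in for the pairwise-independent hash families the paper uses:

```python
import numpy as np

class CountMinSketch:
    """A depth x width table of counters summarizing a frequency vector.

    Point queries overestimate the true count by at most
    eps * ||f||_1 with probability 1 - delta, for
    width = ceil(e / eps) and depth = ceil(ln(1 / delta)).
    """

    def __init__(self, width=2000, depth=5, seed=0):
        rng = np.random.default_rng(seed)
        self.width = width
        self.table = np.zeros((depth, width), dtype=np.int64)
        # one salt per row; hash((salt, item)) plays the role of the
        # paper's pairwise-independent hash functions
        self.salts = [int(s) for s in rng.integers(0, 2**31, size=depth)]

    def _cols(self, item):
        return [hash((salt, item)) % self.width for salt in self.salts]

    def update(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row, col] += count

    def query(self, item):
        # each row only overcounts (collisions add), so take the min
        return int(min(self.table[row, col]
                       for row, col in enumerate(self._cols(item))))

cm = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    cm.update(word)
print(cm.query("a"))   # 3, up to collision error
```

The min over rows is what keeps the estimate one-sided: hash collisions can only inflate a counter, never deflate it.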
I'm very interested in big data, but I have many other interests as well. Since high school I've done research in genetics, bioinformatics, psychology, cognitive science, and pure math. I think this interdisciplinary background would be an asset: where someone trained only in computer science will tend to think like a computer scientist, I can also think like a biologist or a mathematician. My math classes and research have likewise made me comfortable with abstraction, which would serve me well in a computer science PhD program.
When I was fifteen I modeled the dynamics of the selfish gene Medea. Our lab had built a gene that could spread very quickly through a population of flies, and I calculated how quickly it would spread under a wide range of conditions. This work led to three journal papers, one of them in Science.
While I was on medical leave from Caltech (2008-2010), I modeled the dynamics of therapist-client interactions. These dynamics can be very complicated, so we simplified them by considering only the therapist's and the client's "happiness." My mentor related these two variables with nonlinear differential equations, and I solved them, which led to two journal papers and two posters.
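As a sketch of what such a model can look like, the functional forms and coefficients below are hypothetical stand-ins (the actual equations are in the papers): each person's happiness decays toward baseline while responding, with saturation, to the other's.

```python
import numpy as np
from scipy.integrate import solve_ivp

def interaction(t, z, a=-0.2, b=0.6, c=0.5, d=-0.3):
    """Hypothetical coupled therapist-client dynamics.

    x = therapist's happiness, y = client's happiness; a, d < 0 pull
    each toward baseline, and tanh saturates each one's influence
    on the other.
    """
    x, y = z
    return [a * x + b * np.tanh(y),
            c * np.tanh(x) + d * y]

sol = solve_ivp(interaction, (0.0, 50.0), [1.0, -0.5])
print(sol.y[:, -1])   # long-run state of the coupled system
```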
The summer of my sophomore year I worked on Tutte polynomials and the Kontsevich conjecture, which states that the number of roots of a certain type of graph polynomial over a finite field depends polynomially on the size of the field. Even though the conjecture fails for graph polynomials, we thought an analogue might hold for Tutte polynomials. It turned out not to, but I did find some intermediate results that were published in the International Journal of Geometric Methods in Modern Physics.
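For the curious, the Tutte polynomial of a small graph can be computed by the standard deletion-contraction recurrence. The toy code below is mine, just to show the object the conjecture concerns:

```python
from sympy import symbols, expand

x, y = symbols("x y")

def is_bridge(edges, i):
    """True if removing edges[i] disconnects its endpoints."""
    (s, t), rest = edges[i], edges[:i] + edges[i + 1:]
    seen, stack = {s}, [s]
    while stack:
        node = stack.pop()
        for a, b in rest:
            for nxt in ((b,) if a == node else (a,) if b == node else ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return t not in seen

def tutte(edges):
    """Tutte polynomial of a multigraph given as a list of (u, v) edges.

    T(G) = 1 if G has no edges; y * T(G - e) if e is a loop;
    x * T(G / e) if e is a bridge; T(G - e) + T(G / e) otherwise.
    """
    if not edges:
        return 1
    (u, v), rest = edges[0], edges[1:]
    if u == v:                                   # loop
        return y * tutte(rest)
    contracted = [(u if a == v else a, u if b == v else b)
                  for a, b in rest]              # identify v with u
    if is_bridge(edges, 0):
        return x * tutte(contracted)
    return tutte(rest) + tutte(contracted)

# triangle: T(K3) = x**2 + x + y
print(expand(tutte([(0, 1), (1, 2), (0, 2)])))
```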
The summer before I entered college, my mentor and I modified the bioinformatics tool BLAST to better detect similarities between proteins. We did get some improvement, at least on the few datasets we tried, but we fell short of our main goal, which was to find motor proteins in bacteria. A couple of years later I did another bioinformatics project at Protabit, where I benchmarked their software against competing tools, and a small project at MIT, where I ran experiments on how people learn words.
Stanford is my top choice because it is a large school with world-class faculty in a wide variety of fields, which suits interests as broad and interdisciplinary as mine. At Stanford I could apply big-data techniques to problems in biology, machine learning, natural language processing, or any other field that calls for them, and still find a great advisor. My objective in pursuing a PhD at Stanford is to do good work on interesting problems and eventually become a professor.
Citations
[1] Li, Ping, Kenneth W. Church, and Trevor J. Hastie. "One sketch for all: Theory and application of conditional random sampling." Advances in Neural Information Processing Systems. 2008.
[2] Li, Ping, Kenneth W. Church, and Trevor J. Hastie. "Conditional random sampling: A sketch-based sampling technique for sparse data." Advances in Neural Information Processing Systems 19 (2007): 873.
[3] Kane, Daniel M., Jelani Nelson, and David P. Woodruff. "On the exact space complexity of sketching and streaming small norms." Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). SIAM, 2010.
[4] Harvey, Nicholas J. A., Jelani Nelson, and Krzysztof Onak. "Sketching and streaming entropy via approximation theory." Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE, 2008.
[5] Cormode, Graham, and S. Muthukrishnan. "An improved data stream summary: The count-min sketch and its applications." Journal of Algorithms 55.1 (2005): 58-75.