Projects
Code for Community
January 2021 - December 2021
At code for community we serve to build high quality, low traffic web applications for local non profit organizations in Boston. We consist of teams for frontend, backend, design and outreach. I was on the first team of backend developers at the groups inception. At the outset we were determined to work with numerous applications for numerous organizations. Colored by the modularity obsessed Khoury College software cirriculum we hoped to achieve this while duplicating the least amount of effort. For the first backend developers, this looked like designing and implementing the highest quality generic backend, meant to be extended by each web app to come. With this in mind, the focus was on optimal code architecture and test coverage. It was important to us to stay as true to pure Java as possible in order to ease the onboarding expenses for future developers, who might be coming from only course work experience and minimal knowledge of some of the heavy frameworks used in industry. To this end we onlu enlisted lightweight tech and basic apis, like Vert.x, maven, and postgres. Our goal was to let the Java speak for itself. Some of my most foundational contributions to this project were in architecture, end point implementation, data models and database schema design, exception handling design patterns and client side session handling through json web tokens. In this effort I had implemented my own JWT api in Java from scratch. Here I became deeply familiar some of the intricacies of HTTP, client side work, hashing/checksumming and security. I truly had a blast working in a completely flat team structure and having the ability to plan how an application ought to be designed from the get-go.
Breaking RSA
Final Project in Artificial Intelligence (group of 1)
Inspired by my recent course work in probabalistic analysis of algorithms and number theory, I wanted to see how tools from machine learning could be applied to RSA and integer factorization. In many ways this study is agnostic to RSA rather a statistical approach to understanding discrete math. Machine learning and statistical techniques as applied to number theoretic problems is quite a novel and green area. As my keystone artificial intelligence project I wanted to delve into it on my own. Recall that the efficacy of RSA lies in the fact that an integer \(i\) who is the product of two distinct primes, when taken large enough, is very difficult to factor. The current best algorithm for factorizing integers checks all possible factors (i.e \((O(\sqrt{i}))\)). Is there a place where a factor is more likely to be? To investigate this I hoped to be able to infer the location of the smaller prime factor on the number line. This approach needed some normalization to make sense. We know that the smaller prime factor must be less that \(\sqrt{i}\), so by defining the location of the smaller prime factor as a proportion of its value to \(\sqrt{i}\) we achieve the desired normalization of output. Obviously, each possible \(i\) is distinct and so trying to perform some inference on the input value of \(i\) itself is completely arbitrary and useless. The key insight was to not correlate \(i\) directly, but a function on \(i\) whose output descibes \(i\) in some way, independent of \(i\) increasing in value. Thus, I hypthosized for any \(i\) there exists functions of the form \(f : \mathbb{Z} \rightarrow \mathbb{Q}^n\) where \(f(i)\) would be correlated to \(i\)’s smaller prime factor’s location, proportional to \(\sqrt{i}\). The simplist example of such a function would be to just take the residue of \(i\) mod 2, so \(f\) in this case would have a codomain of \(\{0,1\}\). Do odd and even number's smaller prime factors differ in a statistically significant way? I refered to such functions as "characteristic functions" as they were meant to describe some characteristic of \(i\) independent of its magnitude. I formulated many such function from the simple to complex. It was critical that these functions were infact characteristic meaning independent to a growing \(i\) as well as having a codomain of reasonable size. Too small or too large the codomain, the more likely to overfit data or lack sufficient significance. To restate, the inputs of my model were \(f(i)\) and the goal was to correlate \(f(i)\) to \(\frac{p}{\sqrt{i}}\), we refer to the latter value as the "Boyer value under \(f\)", denoted \(B_f(i)\). Another import remark is that predicting the exact value of \(B_f(i)\) is futile, rather we want to find a distribution of where \(B_f(i)\) could be. In this way we thought of the Boyer value as a random variable. The less variant \(B_f(i)\) is, the faster we can factor \(i\).
The next step was to find the best model to perform inference. I toyed with some Bayesian approaches but it seemed too naive (machine learning joke :)). The first critical assumption I made was to assume that \(B_f(i)\), for fixed \(f\), exhibited the same form of distribution, regardless of \(f(i)\). In this case, I further assumed that all \(B_f(i)\) were gaussian with uniform variance. This allowed me to leverage a standard linear regression. I also considered different distributions as well as permitting the variance to be a function of \(f(i)\). I explored gaussian parameter estimation as well as some parameter estimation for non guassian distributions. I did not delve too deeply in this path as it became difficult to normalize results accross different choices of \(f\), due to their varying sizes of codomains. I stuck to the idea of \(B_f(i)\) being normally distributed with fixed variance. In addition to standard linear regression I decided to use a the non-parametric k-nearest neighbors regression technique, to give an angle with relaxed assumptions.
With the model in place, I now needed to run the data on my various ideas for characteristic functions and interpret the results. The most notable was the function \(b_n : \mathbb{Z} \rightarrow \mathbb{R}^n\) where \(b_n(i)\) was the the residues of \(i\) of the \(n\) closest non factor primes to \(\sqrt{i}\) divided by \(\sqrt{i}\). This function yielded linear and non-parametric regressions able to predict the location of the smaller factor with an \(R^2\) of about .11 in linear regression, and about .2 in k-nearest regression. These results were low, but not negligible. I believe with more time and thoughtful consideration of the statistical models (or perhaps just different characteristic function) could yield higher correlations and perhaps to conjectures about the distribution of prime factors given \(i\). If one could formally show that prime factors given \(i\) are not just random but concentrated according to some distribution, it would be one of the greatest mathematical breakthroughs ever. While my results are inconclusive, I was enthralled with brushing up against the mathematical unknown.
Most of the work in this project was foundational. Much time and care was spent in defining the problem statement and the mappings of the input and outputs of the models. It was important to make sure we would not arrive at some arbitrary or vacuous truth through poor definitions of inputs and outputs. What I found most exciting about my work is that it is quite modular (both the code and the problem statement). It is simple to define a new characteristic function or plug in a different model to test ones own hyposthesis and see the results. I also gained a familiarity with many machine learning techniques in my exploration of models, as well as some hands on experience scikit-learns regression module.
Take a look at my code and paper here. Also take a look at my resume for experiences in the workplace.
Other projects
- Concert ticket reselling; an algorithmic approach, Python, numpy, matplot, various web data apis
- Predicting wine quality with l1-regression and LP solver, Python, CVXOPT
- Online terminal dungeon game; random level generation with flock seperation behaviorPython, TCP, json, MST
- Twitter sentiment analyzer, Python, SQL, matplot, quantopian, twitter api
- JSON web token implementation, Java and native libraries
- Matrix theory and graphs, Python, numpy
- Wellness blog, Python, Django
- Automated personal business accounting system, Google sheets, Managerial and Forecasting accounting principles