Problem Setting
Ordinary Least Square (OLS), L2regularization and L1regularization are all techniques of finding solutions in a linear system. However, they serve for different purposes. Recently, L1regularization gains much attention due to its ability in finding sparse solutions. This post demonstrates this by comparing OLS, L2 and L1 regularization.
Consider the following linear system:
\[Ax = y\]
where \(A \in \reals^{m \times n}\), \(m\) is the number of rows (observations) and \(n\) is the number of columns (variable dimension), \(x\) is the variable coefficients and \(y\) is the response. There are three cases to consider:
 \(m=n\). This is the commonseen case. If \(A\) is not degenerate, the solution is unique.
 \(m>n\). This is called overdetermined linear system. There is usually no solutions, but an approximation can be easily found by minimizing the residue cost \(\norm{Axy}^2_2\) using least square methods, and it has a nice closedform solution \(x_{ls}=(A^T A)^{1} A^T y\). In L2regularization, we add a penalize term to minimize the 2norm of the coefficients. Thus, the objective becomes: \[\min_x \norm{Axy}^2_2 + \alpha \norm{x}_2\] where \(\alpha\) is a weight to decide the importance of the regularization.
 \(m<n\). This is called underdetermined linear system. There is usually no solution or infinite solutions. This is where it get interesting: when we have some prior knowledge in the solution structure, such as sparsity, we can have a ‘metric’ to find a better solution among a whole bunch. The objective is thus: \[\min_x \norm{Axy}^2_2 + \alpha \norm{x}_1\] The optimization technique for the above problem is called lasso, and there is an advanced version called elastic net, which combines the L2 and L1 regularization together, hoping to get the advantages of both: L1 regularization finds sparse solution but introduces a large Mean Square Error (MSE) error, while L2 is better at minimizing MSE.
An Example
In the following, we show their performances by solving a simple case.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 

Output:
1 2 

The above code snippets generates an underdetermined matrix \(A\), and a sparse coefficients which has 200 variables but only 10 of them are nonzeros. Noises are added to the responses. We then run the proposed three methods to try to recover the coefficients. It then generates two plots:
 The first plot is as shown in the top. As we can see, OLS does a very bad job, though the MSE is minimized to zero. L2regularization do find some of the sparks, but there are also lots of nonzeros introduced. Finally, L1regularization finds most of the nonzeros correctly and resembles the original coefficients most.
 The second plot shows how similar the recovered coefficients by lasso and elastic nets resemble the original coefficients. As we can see, both of them can recover most parts, while elastic nets contain some small ‘noise’. However, elastic net yields a slightly better MSE than lasso.
Probing Further
Scikit has some excellent examples on regualarization (1, 2). Quora has an excellent discussion on L2 vs L1 regualarization. I found the top three answers very useful in understanding deeper, especially from the Bayesian regularization paradigm perspective by thinking the regularization as MAP (Maximum A Posteriori) that adds a Laplacian (L1) or Gaussian (L2) prior to the original objective.