My understanding of any machine learning model is that the model is a simple mathematical or logical formulation that computes a prediction $\hat{y}$ for a given input $x$. In this post we will shift perspective & consider our model to compute a probability distribution over possible outputs $y$, given the input $x$.
We define a function $f$ that maps each $x$ to a probability distribution $p(y \mid x)$. This probability distribution is defined by parameter(s) $\theta$. For example, $\theta$ could be the parameters of a normal distribution, $\mu$ and $\sigma$. Each $x_i$ will have a corresponding probability distribution with parameters $\theta_i$, since we are computing conditional probability distributions. We can simplify $f$ to just output the distribution’s parameter, $\theta_i = f(x_i)$. For an $(x_i, y_i)$ pair in our training set, the conditional probability of the true $y_i$, given the corresponding $x_i$, should be the highest under the probability distribution defined by $f(x_i)$. This makes intuitive sense, since the true $y_i$ should have the highest probability for the corresponding $x_i$.
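To make this concrete, here is a minimal sketch in Python. The linear form of $f$ and all of the numbers are assumptions purely for illustration; the only point is that the model’s output is read as the parameter of a conditional distribution rather than as a point prediction.

```python
# A minimal sketch: f maps x to mu, the parameter of a normal
# distribution over y. The linear form and all numbers are made up.

def f(x, w=0.7, b=0.1):
    """Map an input x to mu, the mean of the conditional distribution p(y | x)."""
    return w * x + b

sigma = 1.0        # assumed fixed for now (the homoscedastic case)
x = 2.5
mu = f(x)          # for this x, p(y | x) = Normal(mu, sigma^2)
print(f"p(y | x={x}) is Normal(mean={mu:.2f}, std={sigma:.2f})")
```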
If $f(x_i)$ were defining a normal distribution (where $f(x_i)$ would just be the mean, $\mu_i$), then ⇒

$$p(y_i \mid x_i) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y_i - \mu_i)^2}{2\sigma^2}} \;\;\longrightarrow\;\; \text{eq. 1}$$
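As a quick sanity check of eq. 1, the sketch below (with a made-up linear $f$ and made-up numbers) evaluates the likelihood of an observed $y_i$ under the normal distribution whose mean is $f(x_i)$; the value is highest when $f(x_i)$ lands close to the true $y_i$.

```python
import numpy as np

# Evaluate eq. 1 for one (x_i, y_i) pair; f and the numbers are illustrative.

def gaussian_pdf(y, mu, sigma):
    return np.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def f(x):
    return 0.7 * x + 0.1        # hypothetical model: maps x_i to mu_i

x_i, y_i, sigma = 2.5, 1.9, 1.0
print(gaussian_pdf(y_i, mu=f(x_i), sigma=sigma))   # p(y_i | x_i) from eq. 1
```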
Since we want $p(y_i \mid x_i)$ to be very high for all $(x_i, y_i)$ pairs, we have to find the $f$ that maximises the product of these probabilities over all the pairs together, or →

$$\hat{f} = \arg\max_{f} \; \prod_{i} p(y_i \mid x_i)$$
We can see that $p(y_i \mid x_i)$ is a function of $y_i$ & $\mu_i$ (from eq. 1).
$\mu_i$ is given by our function $f(x_i)$. This implies that eq. 1 → $p(y_i \mid x_i)$ is a function of $y_i$ and $f(x_i)$. Therefore, we can simplify the above product of probabilities to -

$$\hat{f} = \arg\max_{f} \; \prod_{i} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y_i - f(x_i))^2}{2\sigma^2}}$$
Find $f$ such that the right-hand-side product of probabilities is maximum.
Instead of maximising the product, we can simplify to maximising the sum of the logs of the probabilities →

$$\hat{f} = \arg\max_{f} \; \sum_{i} \log\!\left(\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y_i - f(x_i))^2}{2\sigma^2}}\right)$$
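Besides making the algebra simpler, the switch has a practical motivation: a product of many probabilities underflows to zero in floating point, while the sum of logs stays well behaved, and since log is monotonic the maximiser doesn’t change. A quick sketch with made-up probabilities:

```python
import numpy as np

# Stand-ins for p(y_i | x_i) over a training set of 1000 pairs.
rng = np.random.default_rng(0)
probs = rng.uniform(0.01, 0.5, size=1000)

print(np.prod(probs))           # 0.0 -- the raw product underflows
print(np.sum(np.log(probs)))    # a finite value we can actually maximize
```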
Simplifying the log and removing constants ($\sigma$ is taken as a constant to simplify, more on this later) →

$$\hat{f} = \arg\max_{f} \; \sum_{i} -\frac{(y_i - f(x_i))^2}{2\sigma^2}$$
Right now we are maximizing the summation; we can also go the other way & add a negative sign at the front and minimize instead →

$$\hat{f} = \arg\min_{f} \; \sum_{i} \frac{(y_i - f(x_i))^2}{2\sigma^2}$$
Simplifying further (dropping the constant $\frac{1}{2\sigma^2}$, which doesn’t change the minimiser) →

$$\hat{f} = \arg\min_{f} \; \sum_{i} \left(y_i - f(x_i)\right)^2$$
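As a sanity check that nothing was lost along the way, here is a small sketch on synthetic data with a hypothetical linear model $f(x) = wx + b$: minimising the full Gaussian negative log-likelihood (with $\sigma$ held fixed) and minimising the plain sum of squared errors land on the same $(w, b)$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=200)
y = 1.5 * x - 0.5 + rng.normal(scale=0.8, size=200)

def neg_log_likelihood(params, sigma=1.0):
    w, b = params
    mu = w * x + b
    # full Gaussian NLL; the constant terms don't move the minimizer
    return np.sum(0.5 * ((y - mu) / sigma) ** 2 + np.log(sigma * np.sqrt(2 * np.pi)))

def sum_squared_errors(params):
    w, b = params
    return np.sum((y - (w * x + b)) ** 2)

print(minimize(neg_log_likelihood, x0=[0.0, 0.0]).x)   # approx. [1.5, -0.5]
print(minimize(sum_squared_errors, x0=[0.0, 0.0]).x)   # same minimizer
```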
We land on the formula for Ordinary Least Squares regression. This way of thinking can be extended to any scenario. In our example above, we considered $\sigma$ to be a constant, which is the case for homoscedastic data. In the case of heteroscedastic data, $\sigma$ could vary with $x$, in which case the function $f(x)$ would yield both $\mu$ & $\sigma$; and our minimization formulation would have both these terms.
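As a sketch of what that looks like, the snippet below assumes (purely for illustration) a linear mean $\mu(x)$ and a log-linear $\sigma(x)$, and minimises the full negative log-likelihood; because $\sigma$ now depends on $x$, it can no longer be dropped as a constant and it shows up in the loss alongside the squared-error term.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 5.0, size=500)
y = 2.0 * x + rng.normal(scale=0.3 * x)       # noise grows with x (heteroscedastic)

def neg_log_likelihood(params):
    w, b, a, c = params
    mu = w * x + b                            # mean depends on x
    sigma = np.exp(a * x + c)                 # std also depends on x, kept positive via exp
    # Gaussian NLL up to an additive constant: both mu and sigma appear in the loss
    return np.sum(0.5 * ((y - mu) / sigma) ** 2 + np.log(sigma))

result = minimize(neg_log_likelihood, x0=[1.0, 0.0, 0.0, 0.0], method="Nelder-Mead")
print(result.x)   # mean parameters come out near w=2, b=0; a, c capture the growing noise
```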
We are essentially saying that there is a whole range of possible $y$ values for each $x$. We are trying to find a probability distribution that peaks at the true $y$. Our function $f$ is just a mapping from $x$ to the distribution.
Doesn’t this probabilistic perspective of looking at machine learning with distributions make more intuitive sense than saying → find the line that best matches our data points with the least sum of all errors 😵💫