Understanding ML with Likelihood Maximization
12 Aug 2025

My understanding of any machine learning model is that the model is a simple mathematical or logical formulation that computes the prediction $y$ for a given input $x$. In this post we will shift perspective & consider our model to compute a probability distribution over the possible outputs $y$ given the input $x$.

We define a function $f$ that maps each $x_i$ to a probability distribution $Pr(y|x=x_i)$. This probability distribution is defined by parameter(s) $\theta_i$. For example, $\theta_i$ could be the parameters of a normal distribution, $\mu$ and $\sigma$. Each $x_i$ will have a corresponding probability distribution with parameters $\theta_i$, since we are computing conditional probability distributions. We can simplify $f$ to just output the distribution's parameter(s), $\theta_i$. For an $(x_i, y_i)$ pair in our training set, the conditional probability of the true $y$ given the corresponding $x$, $Pr(y=y_i|x=x_i)$, should be as high as possible under the probability distribution defined by $\theta_i$. This makes intuitive sense, since the true $y$ should have the highest probability for the corresponding $x$.
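
To make this mapping concrete, here is a minimal sketch in Python, assuming a hypothetical linear form for $f$ and a fixed $\sigma$ (these particular values are illustrative, not anything prescribed by the derivation):

```python
import numpy as np

# Hypothetical parametric form for f: a linear map from x to the mean
# of a normal distribution. The weights w, b and the fixed sigma are
# illustrative choices only.
def f(x, w=2.0, b=0.5):
    """Map an input x_i to the parameter theta_i (here, the mean mu_i)."""
    return w * x + b

x_i = 3.0
theta_i = f(x_i)        # mean of Pr(y | x = x_i)
sigma = 1.0             # assumed constant (homoscedastic) for now
print(theta_i, sigma)   # the distribution for this x_i is N(theta_i, sigma^2)
```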

If $\theta_i$ defined a normal distribution (where $\theta$ would just be the mean, $\mu$), then ⇒

$$Pr(y_i|x_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(y_i-\theta_i)^2}{2\sigma^2}\right) \tag{1}$$
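
As a quick sanity check on eq. 1, the snippet below evaluates the density by hand for some illustrative values and compares it against `scipy.stats.norm.pdf`:

```python
import numpy as np
from scipy.stats import norm

y_i, theta_i, sigma = 2.3, 2.0, 1.0   # illustrative values only

# eq. 1: Gaussian density of y_i under N(theta_i, sigma^2)
p_manual = np.exp(-(y_i - theta_i) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
p_scipy = norm.pdf(y_i, loc=theta_i, scale=sigma)

print(p_manual, p_scipy)   # both ~0.381, the two agree
```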

Since we want $Pr(y=y_i|x=x_i)$ to be as high as possible for all $(x_i, y_i)$ pairs, we have to find the $f$ that maximises the product of these probabilities over all the $(x_i, y_i)$ pairs together, or →

$$\prod_i \left[\, Pr(y_i|x_i) \,\right]$$
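
For a small, made-up dataset (reusing the hypothetical $f(x) = 2x + 0.5$ from the sketch above), this product can be computed directly:

```python
import numpy as np
from scipy.stats import norm

# Illustrative data and the hypothetical linear f with fixed sigma.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.7, 2.4, 4.6, 6.4])
sigma = 1.0

mu = 2.0 * x + 0.5                 # theta_i = f(x_i) for each training point
likelihoods = norm.pdf(y, loc=mu, scale=sigma)
print(np.prod(likelihoods))        # the quantity we want f to maximise
```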

We can see that $Pr(y|x_i)$ is a function of $y$ & $\theta_i$ (from eq. 1).

$\theta_i$ is given by our function $f$. This implies that eq. 1 → $Pr(y|x_i)$ is a function of $y$ and $f$. Therefore, we can rewrite the above product of probabilities as →

$$\underset{f}{\operatorname{argmax}} \;\prod_i \left[\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(y_i-f(x_i))^2}{2\sigma^2}\right)\right]$$

Find $f$ such that the right-hand-side product of probabilities is maximised.

Instead of maximising the product, we can equivalently maximise the sum of the logs of the probabilities, since log is monotonically increasing and leaves the maximiser unchanged →

$$\underset{f}{\operatorname{argmax}} \;\sum_i \left[\log\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(y_i-f(x_i))^2}{2\sigma^2}\right)\right)\right]$$
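
Besides being easier to work with, the sum of logs is also what you would compute in practice: the raw product of many probabilities underflows to zero in floating point, while the log-likelihood stays well behaved. A small illustration on synthetic data (assumed values, not from the post):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=1000)
y = 2.0 * x + 0.5 + rng.normal(0, 1.0, size=1000)   # synthetic data, sigma = 1

mu = 2.0 * x + 0.5
probs = norm.pdf(y, loc=mu, scale=1.0)

print(np.prod(probs))                              # 0.0: the product underflows
print(np.sum(np.log(probs)))                       # finite, well-behaved log-likelihood
print(np.sum(norm.logpdf(y, loc=mu, scale=1.0)))   # same value, computed stably
```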

Simplifying the log and removing constants ($\sigma$ is taken as constant to simplify, more on this later)

$$\underset{f}{\operatorname{argmax}} \;\sum_i \left[-(y_i-f(x_i))^2\right]$$
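
Dropping the constants is safe because adding a constant and scaling by a positive factor never move the maximiser. A quick check over a grid of candidate slopes, assuming a hypothetical one-parameter model $f(x) = w \cdot x$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=200)
y = 2.0 * x + rng.normal(0, 1.0, size=200)   # synthetic data generated with slope 2

ws = np.linspace(0.0, 4.0, 401)              # candidate slopes
loglik = np.array([norm.logpdf(y, loc=w * x, scale=1.0).sum() for w in ws])
neg_sse = np.array([-np.sum((y - w * x) ** 2) for w in ws])

# Full log-likelihood and the constants-dropped version pick the same slope.
print(ws[np.argmax(loglik)], ws[np.argmax(neg_sse)])
```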

Right now we are maximising the summation; we can also go the other way & add a negative sign at the front and minimise instead.

$$\underset{f}{\operatorname{argmin}} \;-\sum_i \left[-(y_i-f(x_i))^2\right]$$

Simplifying further →

$$\underset{f}{\operatorname{argmin}} \;\sum_i \left[(y_i-f(x_i))^2\right]$$

We land on the familiar least-squares objective: for a linear $f$, this is exactly Ordinary Least Squares regression. This way of thinking can be extended to any scenario. In our example above, we considered $\sigma$ to be a constant, which is the case for homoscedastic data. For heteroscedastic data, $\sigma$ could vary with $x_i$, in which case the function $f$ would yield both $\mu$ and $\sigma$, and our minimisation objective would contain both of these terms (see the sketch below).
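
Here is a minimal sketch of that heteroscedastic case, assuming a hypothetical linear form for both the mean and the log standard deviation, fit on synthetic data. We minimise the full Gaussian negative log-likelihood, $\sum_i \left[\log\sigma_i + \frac{(y_i-\mu_i)^2}{2\sigma_i^2}\right]$, since the $\sigma$-dependent terms no longer drop out as constants:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=500)
sigma_true = 0.2 + 0.3 * x                      # noise grows with x (heteroscedastic)
y = 2.0 * x + 0.5 + rng.normal(0, sigma_true)

def neg_log_likelihood(params):
    a, b, c, d = params
    mu = a * x + b                              # f now outputs a mean ...
    log_sigma = c * x + d                       # ... and a (log) std-dev per input
    sigma = np.exp(log_sigma)
    # Gaussian NLL; the sigma terms no longer vanish as constants.
    return np.sum(log_sigma + (y - mu) ** 2 / (2 * sigma ** 2))

result = minimize(neg_log_likelihood, x0=np.zeros(4))
print(result.x)   # should recover roughly a ~ 2.0, b ~ 0.5, with sigma growing in x
```

The `exp` on the log standard deviation is just a convenient way to keep $\sigma$ positive during optimisation.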

We are essentially saying that there is a whole range of possible $y$ values for each $x$. We are trying to find a probability distribution that peaks at the true $y$. Our function $f$ is just a mapping from $x$ to the distribution.

Doesn’t this probabilistic perspective of looking at machine learning through distributions make more intuitive sense than saying → find the line that best matches our data points with the smallest sum of squared errors 😵‍💫