My understanding of any machine learning model is that the model is a simple mathematical or logical formulation that computes a prediction $\hat{y}$ for a given input $x$. In this post we will shift perspective & consider our model to compute a probability distribution over possible outputs $y$, given the input $x$.
We define a function $f$ that maps each $x$ to a probability distribution $p(y \mid x)$. This probability distribution is defined by parameter(s) $\theta$. For example, $\theta$ could be the parameters of a normal distribution, $\mu$ and $\sigma$. Each $x_i$ will have a corresponding probability distribution with parameters $\theta_i$, since we are computing conditional probability distributions. We can simplify $f$ to just output the distribution’s parameter, $\theta_i = f(x_i)$. For an $(x_i, y_i)$ pair in our training set, the conditional probability of the true $y_i$, given the corresponding $x_i$, should be the highest under the probability distribution defined by $f(x_i)$. This makes intuitive sense, since the true $y_i$ should have the highest probability for the corresponding $x_i$.
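To make this concrete, here is a minimal sketch in Python. The linear form of $f$ and all of the numbers are assumptions purely for illustration; the only point is that the model’s output is read as the parameter of a conditional distribution rather than as a point prediction.

```python
# A minimal sketch: f maps x to mu, the parameter of a normal
# distribution over y. The linear form and all numbers are made up.

def f(x, w=0.7, b=0.1):
    """Map an input x to mu, the mean of the conditional distribution p(y | x)."""
    return w * x + b

sigma = 1.0        # assumed fixed for now (the homoscedastic case)
x = 2.5
mu = f(x)          # for this x, p(y | x) = Normal(mu, sigma^2)
print(f"p(y | x={x}) is Normal(mean={mu:.2f}, std={sigma:.2f})")
```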
If $f(x_i)$ were defining a normal distribution (where $f(x_i)$ would just be the mean, $\mu_i$), then ⇒

$$p(y_i \mid x_i) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y_i - \mu_i)^2}{2\sigma^2}} \;\;\longrightarrow\;\; \text{eq. 1}$$
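As a quick sanity check of eq. 1, the sketch below (with a made-up linear $f$ and made-up numbers) evaluates the likelihood of an observed $y_i$ under the normal distribution whose mean is $f(x_i)$; the value is highest when $f(x_i)$ lands close to the true $y_i$.

```python
import numpy as np

# Evaluate eq. 1 for one (x_i, y_i) pair; f and the numbers are illustrative.

def gaussian_pdf(y, mu, sigma):
    return np.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def f(x):
    return 0.7 * x + 0.1        # hypothetical model: maps x_i to mu_i

x_i, y_i, sigma = 2.5, 1.9, 1.0
print(gaussian_pdf(y_i, mu=f(x_i), sigma=sigma))   # p(y_i | x_i) from eq. 1
```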
Since we want $p(y_i \mid x_i)$ to be very high for all $(x_i, y_i)$ pairs, we have to find the $f$ that maximises the product of these probabilities over all the pairs together, or →

$$\hat{f} = \arg\max_{f} \; \prod_{i} p(y_i \mid x_i)$$
We can see that $p(y_i \mid x_i)$ is a function of $y_i$ & $\mu_i$ (from eq. 1).
$\mu_i$ is given by our function $f(x_i)$. This implies that eq. 1 → $p(y_i \mid x_i)$ is a function of $y_i$ and $f(x_i)$. Therefore, we can simplify the above product of probabilities to -

$$\hat{f} = \arg\max_{f} \; \prod_{i} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y_i - f(x_i))^2}{2\sigma^2}}$$
Find $f$ such that the right-hand-side product of probabilities is maximum.
Instead of maximising the product, we can simplify to maximising the sum of the logs of the probabilities →

$$\hat{f} = \arg\max_{f} \; \sum_{i} \log\!\left(\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y_i - f(x_i))^2}{2\sigma^2}}\right)$$
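Besides making the algebra simpler, the switch has a practical motivation: a product of many probabilities underflows to zero in floating point, while the sum of logs stays well behaved, and since log is monotonic the maximiser doesn’t change. A quick sketch with made-up probabilities:

```python
import numpy as np

# Stand-ins for p(y_i | x_i) over a training set of 1000 pairs.
rng = np.random.default_rng(0)
probs = rng.uniform(0.01, 0.5, size=1000)

print(np.prod(probs))           # 0.0 -- the raw product underflows
print(np.sum(np.log(probs)))    # a finite value we can actually maximize
```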
Simplifying the log and removing constants ($\sigma$ is taken as a constant to simplify, more on this later) →

$$\hat{f} = \arg\max_{f} \; \sum_{i} -\frac{(y_i - f(x_i))^2}{2\sigma^2}$$
Right now we are maximizing the summation; we can also go the other way & add a negative sign at the front and minimize instead →

$$\hat{f} = \arg\min_{f} \; \sum_{i} \frac{(y_i - f(x_i))^2}{2\sigma^2}$$
Simplifying further (dropping the constant $\frac{1}{2\sigma^2}$, which doesn’t change the minimiser) →

$$\hat{f} = \arg\min_{f} \; \sum_{i} \left(y_i - f(x_i)\right)^2$$
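As a sanity check that nothing was lost along the way, here is a small sketch on synthetic data with a hypothetical linear model $f(x) = wx + b$: minimising the full Gaussian negative log-likelihood (with $\sigma$ held fixed) and minimising the plain sum of squared errors land on the same $(w, b)$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=200)
y = 1.5 * x - 0.5 + rng.normal(scale=0.8, size=200)

def neg_log_likelihood(params, sigma=1.0):
    w, b = params
    mu = w * x + b
    # full Gaussian NLL; the constant terms don't move the minimizer
    return np.sum(0.5 * ((y - mu) / sigma) ** 2 + np.log(sigma * np.sqrt(2 * np.pi)))

def sum_squared_errors(params):
    w, b = params
    return np.sum((y - (w * x + b)) ** 2)

print(minimize(neg_log_likelihood, x0=[0.0, 0.0]).x)   # approx. [1.5, -0.5]
print(minimize(sum_squared_errors, x0=[0.0, 0.0]).x)   # same minimizer
```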
We land on the formula for Ordinary Least Squares regression. This way of thinking can be extended to any scenario. In our example above, we considered $\sigma$ to be a constant, which is the case for homoscedastic data. In the case of heteroscedastic data, $\sigma$ could vary with $x$, in which case the function $f(x)$ would yield both $\mu$ & $\sigma$; and our minimization formulation would have both these terms.
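As a sketch of what that looks like, the snippet below assumes (purely for illustration) a linear mean $\mu(x)$ and a log-linear $\sigma(x)$, and minimises the full negative log-likelihood; because $\sigma$ now depends on $x$, it can no longer be dropped as a constant and it shows up in the loss alongside the squared-error term.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 5.0, size=500)
y = 2.0 * x + rng.normal(scale=0.3 * x)       # noise grows with x (heteroscedastic)

def neg_log_likelihood(params):
    w, b, a, c = params
    mu = w * x + b                            # mean depends on x
    sigma = np.exp(a * x + c)                 # std also depends on x, kept positive via exp
    # Gaussian NLL up to an additive constant: both mu and sigma appear in the loss
    return np.sum(0.5 * ((y - mu) / sigma) ** 2 + np.log(sigma))

result = minimize(neg_log_likelihood, x0=[1.0, 0.0, 0.0, 0.0], method="Nelder-Mead")
print(result.x)   # mean parameters come out near w=2, b=0; a, c capture the growing noise
```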
We are essentially saying that there is a whole range of possible $y$ values for each $x$. We are trying to find a probability distribution that peaks at the true $y$. Our function $f$ is just a mapping from $x$ to the distribution.
Doesn’t this probabilistic perspective of looking at machine learning with distributions make more intuitive sense than saying → find the line that best matches our data points with the least sum of all errors 😵💫