Binary and Multinomial Logistic Regression as GLMs and Neural Networks
Sep 3, 2020
Take-aways: logistic regression is the special (binary) case of softmax regression, and the logistic loss is the special (binary) case of the cross-entropy loss. Logistic regression can be viewed as a single neuron with a sigmoid activation function, whereas softmax regression can be viewed as a single layer of neurons with a softmax activation function.
1. Logistic regression as generalized linear model
We can view logistic regression as a generalized linear model: its output distribution, the Bernoulli, is a member of the exponential family.
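In the standard exponential-family notation ($\eta$ is the natural parameter, $T(y)$ the sufficient statistic, $a(\eta)$ the log-partition function, and $b(y)$ the base measure; this is the usual GLM convention), every member of the family has a PDF/PMF of the form

$$p(y; \eta) = b(y)\exp\big(\eta^{\top} T(y) - a(\eta)\big).$$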
The PMF of the Bernoulli random variable is parameterized by the success probability ϕ.
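Written as a single expression covering both outcomes $y \in \{0, 1\}$, it is

$$p(y; \phi) = \phi^{y}(1-\phi)^{1-y}.$$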
Matching the Bernoulli PMF against the exponential-family form, we can read off the parameters.
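Rewriting the Bernoulli PMF as an exponential makes this explicit:

$$p(y; \phi) = \exp\!\Big(y\log\frac{\phi}{1-\phi} + \log(1-\phi)\Big),$$

so that

$$\eta = \log\frac{\phi}{1-\phi}, \qquad T(y) = y, \qquad a(\eta) = -\log(1-\phi) = \log(1+e^{\eta}), \qquad b(y) = 1.$$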
The point here is to identify the function g that maps the natural parameter η (i.e., $\theta^{\top}x$) to the canonical parameter ϕ, which parameterizes the PMF of the Bernoulli random variable above.
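Solving $\eta = \log\frac{\phi}{1-\phi}$ for ϕ gives

$$\phi = g(\eta) = \frac{1}{1+e^{-\eta}}, \qquad \text{so} \qquad h_{\theta}(x) = g(\theta^{\top}x) = \frac{1}{1+e^{-\theta^{\top}x}}.$$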
Note that g here is the sigmoid function, which explains why a sigmoid appears in logistic regression and why the coefficients we get are interpreted in terms of log odds.
2. Logistic regression as neural network
Logistic regression can also be seen as a neural network with only one neuron. What makes it a neural network is that logistic regression consists of two components: a linear component and a non-linear component. Here the linear component is the affine function $x^{\top}w + b$, whereas the non-linear component is the sigmoid activation function $\sigma(z) = \frac{1}{1+e^{-z}}$, where z is the output of the affine function.
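As a minimal NumPy sketch of this single-neuron view (the function names, the toy data, and the gradient-descent loop below are illustrative choices, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    # Non-linear component: squashes the affine output into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # Linear component (x^T w + b) followed by the sigmoid activation:
    # a single "neuron" with a sigmoid activation function.
    return sigmoid(X @ w + b)

# Toy usage: fit w and b by gradient descent on the (mean) logistic loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 samples, 3 features
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(500):
    p = predict_proba(X, w, b)
    w -= lr * (X.T @ (p - y)) / len(y)        # gradient of the mean logistic loss w.r.t. w
    b -= lr * np.mean(p - y)                  # gradient w.r.t. b
```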
3. Loss function for logistic regression
We want to find the model parameter θ that maximizes the likelihood of getting what we observed. θ can be computed using gradient ascent on the log-likelihood function or gradient descent on the cost function. Here, $h_{\theta}(x) = g(\theta^{\top}x) = \frac{1}{1+e^{-\theta^{\top}x}}$ and m is the number of samples.
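The log-likelihood takes the standard Bernoulli form (the superscript $(i)$ indexes the samples):

$$\ell(\theta) = \sum_{i=1}^{m}\Big[y^{(i)}\log h_{\theta}(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1 - h_{\theta}(x^{(i)})\big)\Big].$$

Negating and averaging it (dividing by $m$ is one common convention) gives the logistic-loss cost function to minimize:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h_{\theta}(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1 - h_{\theta}(x^{(i)})\big)\Big].$$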
4. Softmax regression as generalized linear model
The PMF of a single multinomial trial is parameterized by the success probabilities $\phi = (\phi_1, \phi_2, \ldots, \phi_k)$.
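With $\mathbf{1}\{y = i\}$ denoting the indicator that the outcome is class $i$, one standard way to write it is

$$p(y; \phi) = \prod_{i=1}^{k}\phi_i^{\,\mathbf{1}\{y = i\}}, \qquad \sum_{i=1}^{k}\phi_i = 1.$$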
With some manipulation of the PMF, we can identify the function g that maps the natural parameters to the canonical parameters, and thus obtain $h_{\theta}(x)$.
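Carrying that manipulation through (it mirrors the binary case, with one natural parameter $\eta_i = \theta_i^{\top}x$ per class) gives the softmax function:

$$\phi_i = g(\eta)_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}}, \qquad h_{\theta}(x) = \begin{bmatrix} \frac{e^{\theta_1^{\top}x}}{\sum_{j=1}^{k} e^{\theta_j^{\top}x}} \\ \vdots \\ \frac{e^{\theta_k^{\top}x}}{\sum_{j=1}^{k} e^{\theta_j^{\top}x}} \end{bmatrix}.$$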
5. Softmax regression as neural network
Softmax regression can also be seen as a neural network with a single layer of neurons, where the number of neurons equals the number of classes. Each neuron in this layer has its own linear part, with its own weight vector $w_i$ and offset $b_i$. The outputs of these linear parts go into the softmax activation function, which can be viewed as the generalized version of the sigmoid. This activation function g is determined by mapping the natural parameters of the exponential family to the canonical parameters that parameterize the probability distribution of a single multinomial trial. The predicted $\hat{y}$ is a $k$-vector, where $k$ equals the total number of neurons/classes.
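A minimal NumPy sketch of this single-layer view (the shapes and variable names below are illustrative):

```python
import numpy as np

def softmax(z):
    # Generalized sigmoid: exponentiate each score and normalize so each row sums to 1.
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_proba(X, W, b):
    # One linear part per class/neuron (a column of W and an entry of b),
    # followed by the shared softmax activation.
    return softmax(X @ W + b)

# Toy usage: 4 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))      # 5 samples
W = rng.normal(size=(4, 3))      # one weight vector per class (one column per neuron)
b = np.zeros(3)
y_hat = predict_proba(X, W, b)   # each row is a length-3 probability vector summing to 1
```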
6. Loss function for softmax regression
We want to find the model parameter θ that maximizes the likelihood of getting what we observed. θ can be computed using gradient ascent on the log-likelihood function or gradient descent on the cost function. Here, n is the number of samples.
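Using the same indicator notation as above, the log-likelihood of the $n$ observations is

$$\ell(\theta) = \sum_{i=1}^{n}\sum_{j=1}^{k}\mathbf{1}\{y^{(i)} = j\}\,\log\frac{e^{\theta_j^{\top}x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^{\top}x^{(i)}}}.$$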
Taking the negative of the log-likelihood, we get the cross-entropy loss function.
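In averaged form (dividing by $n$ is one common convention), it is

$$J(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{k}\mathbf{1}\{y^{(i)} = j\}\,\log\frac{e^{\theta_j^{\top}x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^{\top}x^{(i)}}}.$$

With $k = 2$ this reduces to the logistic loss above, which is exactly the take-away at the top: the logistic loss is the binary special case of the cross-entropy loss.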