Binary and Multinomial Logistic Regression as GLMs and Neural Networks

Take-aways: logistic regression is the special (binary) case of softmax regression, and the logistic loss is the special (binary) case of the cross-entropy loss. Logistic regression can be viewed as a single neuron with a sigmoid activation function, whereas softmax regression can be viewed as a single layer of neurons with a softmax activation function.

1. Logistic regression as generalized linear model

We are familiar with logistic regression as a generalized linear model built on the Bernoulli distribution, a member of the exponential family, whose PDFs all take the form

$$p_Y(y;\eta) = \frac{b(y)\,e^{\eta^T T(y)}}{e^{a(\eta)}}$$

The PMF of the Bernoulli random variable, parameterized by the success probability $\phi$, is

$$p_Y(y;\phi)=\phi^y(1-\phi)^{1-y}=e^{\log\left(\frac{\phi}{1-\phi}\right)y+\log(1-\phi)}$$

Matching the two PDFs, we have

$$\eta=\log\left(\frac{\phi}{1-\phi}\right)\Rightarrow \phi=\frac{1}{1+e^{-\eta}}$$

The point here is to identify the function $g$ that maps the natural parameter $\eta$ (i.e., $x^T\theta$) to the canonical parameter $\phi$, which parameterizes the PMF of the Bernoulli random variable above.

Borrowed from CS229; thanks to Andrew Ng.

Note that $g$ here is the sigmoid function, which explains why we use a sigmoid function in logistic regression and why the coefficients we get are in terms of log odds.

$$\frac{p(X)}{1-p(X)}=e^{X^T\theta}\Rightarrow \log\left(\frac{p(X)}{1-p(X)}\right)=X^T\theta$$
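To make the log-odds interpretation concrete, here is a minimal NumPy sketch (the feature vector `x` and coefficients `theta` are made-up values for illustration) showing that the sigmoid and the logit are inverses, so $x^T\theta$ is exactly the log-odds of the predicted probability.

```python
import numpy as np

def sigmoid(z):
    """Map a log-odds value z = x^T theta to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Map a probability back to log-odds; the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

# Made-up feature vector (with an intercept term of 1) and coefficients.
x = np.array([1.0, 2.0, -0.5])
theta = np.array([0.3, -0.8, 1.2])

eta = x @ theta          # natural parameter: the log-odds x^T theta
phi = sigmoid(eta)       # canonical parameter: the success probability

# Recovering the log-odds from the probability gives back x^T theta,
# which is why each coefficient reads as a change in log-odds.
assert np.isclose(logit(phi), eta)
print(f"log-odds = {eta:.3f}, probability = {phi:.3f}")
```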

2. Logistic regression as neural network

  • Logistic regression can also be seen as a neural network with only one neuron. What makes it a NN is that logistic regression consists of two components: a linear component and a non-linear component. Here the linear component is the affine function $X^Tw+b$, whereas the non-linear component is the sigmoid activation function $\frac{1}{1+e^{-z}}$, where $z$ in this case is the output of the affine function. A minimal sketch follows the figure below.
logistic regression as NN
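As a rough illustration of the single-neuron view, the sketch below (assuming NumPy and randomly generated inputs) computes the affine part $Xw+b$ and then applies the sigmoid activation.

```python
import numpy as np

def neuron_forward(X, w, b):
    """Single 'neuron': affine transform followed by a sigmoid activation."""
    z = X @ w + b                      # linear component: x^T w + b for each row of X
    return 1.0 / (1.0 + np.exp(-z))    # non-linear component: sigmoid

# Made-up batch of 4 samples with 3 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
w = rng.normal(size=3)
b = 0.1

probs = neuron_forward(X, w, b)        # one predicted probability per sample
print(probs)
```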

3. Loss function for logistic regression

We want to find the model parameters $\theta$ that maximize the likelihood of what we observed. $\theta$ can be computed using gradient ascent on the log-likelihood function or gradient descent on the cost function. Here, given $h_\theta(x)=g(x^T\theta)=\frac{1}{1+e^{-\theta^Tx}}$ and $m$ being the number of samples,

$$\begin{aligned} L(\theta)&=p(y\mid x;\theta)=\prod_{i=1}^{m}\left(h_\theta(x^{(i)})\right)^{y^{(i)}}\left(1-h_\theta(x^{(i)})\right)^{1-y^{(i)}} \\ \log L(\theta)&=\sum_{i=1}^{m}y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right) \\ J(\theta)&=\frac{1}{m}\sum_{i=1}^{m}-y^{(i)}\log h_\theta(x^{(i)})-(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right) \end{aligned}$$
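Below is a minimal sketch of this cost plus a plain gradient-descent update, assuming NumPy and synthetic data; the gradient expression $\frac{1}{m}X^T(h_\theta(x)-y)$ follows from differentiating $J(\theta)$ above.

```python
import numpy as np

def binary_cross_entropy(theta, X, y, eps=1e-12):
    """Cost J(theta): average negative log-likelihood under the Bernoulli model."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))       # h_theta(x) for every sample
    h = np.clip(h, eps, 1.0 - eps)               # guard against log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_step(theta, X, y, lr=0.1):
    """One gradient-descent update; the gradient of J is X^T (h - y) / m."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return theta - lr * (X.T @ (h - y)) / len(y)

# Made-up data: 100 samples, 3 features, binary labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (rng.random(100) < 0.5).astype(float)

theta = np.zeros(3)
for _ in range(200):
    theta = gradient_step(theta, X, y)
print(binary_cross_entropy(theta, X, y))
```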

4. Softmax regression as generalized linear model

The PMF of a single multinomial trial, parameterized by the success probabilities $\phi=(\phi_1,\phi_2,...,\phi_k)$, is

$$\begin{aligned} p(y;\phi) &=\phi_{1}^{1\{y=1\}} \phi_{2}^{1\{y=2\}} \cdots \phi_{k}^{1\{y=k\}} \\ &=\phi_{1}^{1\{y=1\}} \phi_{2}^{1\{y=2\}} \cdots \phi_{k}^{1-\sum_{i=1}^{k-1} 1\{y=i\}} \\ &=\phi_{1}^{(T(y))_{1}} \phi_{2}^{(T(y))_{2}} \cdots \phi_{k}^{1-\sum_{i=1}^{k-1}(T(y))_{i}} \end{aligned}$$

With some manipulation of the PMF, we can identify the function $g$ that maps the natural parameters to the canonical parameters, and thus obtain $h_{\theta}(x)$: exponentiating $\eta_i=\log\frac{\phi_i}{\phi_k}$ gives $\phi_i=\phi_k e^{\eta_i}$, and summing over all classes with $\sum_i \phi_i = 1$ solves for $\phi_k$, which yields the softmax form below.

$$\eta_{i}=\log \frac{\phi_{i}}{\phi_{k}}$$
$$\phi_{i}=\frac{e^{\eta_{i}}}{\sum_{j=1}^{k} e^{\eta_{j}}}$$
$$h_{\theta}(x)=\begin{bmatrix}\frac{\exp(\theta_{1}^{T} x)}{\sum_{j=1}^{k} \exp(\theta_{j}^{T} x)} \\ \frac{\exp(\theta_{2}^{T} x)}{\sum_{j=1}^{k} \exp(\theta_{j}^{T} x)} \\ \vdots \\ \frac{\exp(\theta_{k-1}^{T} x)}{\sum_{j=1}^{k} \exp(\theta_{j}^{T} x)}\end{bmatrix}$$
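Here is a small NumPy sketch of this hypothesis for a single input, with a made-up parameter matrix `Theta` holding one column per class. Unlike the $k-1$ entries written above (where $\phi_k$ is implied), it simply returns all $k$ probabilities, and it subtracts the maximum $\eta_i$ before exponentiating for numerical stability.

```python
import numpy as np

def softmax_hypothesis(x, Theta):
    """h_theta(x): phi_i = exp(theta_i^T x) / sum_j exp(theta_j^T x).

    Theta holds one parameter column per class (shape: n_features x k).
    """
    eta = Theta.T @ x            # natural parameters eta_i = theta_i^T x
    eta = eta - eta.max()        # shift by the max for numerical stability
    exp_eta = np.exp(eta)
    return exp_eta / exp_eta.sum()

# Made-up example with 3 features and k = 4 classes.
rng = np.random.default_rng(2)
x = rng.normal(size=3)
Theta = rng.normal(size=(3, 4))
phi = softmax_hypothesis(x, Theta)
print(phi, phi.sum())            # class probabilities summing to 1
```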

5. Softmax regression as neural network

Softmax regression can also be seen as a neural network with a single layer of neurons, where the number of neurons in the layer equals the number of classes. Each neuron in this layer has its own linear part, with its own weight vector $w_i$ and offset $b_i$. The outputs of these neurons then go into a non-linear activation function, which can be viewed as a generalized version of the sigmoid function. This activation function $g$ is determined by mapping the natural parameters of the exponential family to the canonical parameters that parameterize the probability distribution of a single multinomial trial. The predicted $\hat{y}$ is an $n$-vector, where $n$ equals the total number of neurons/classes. A batch forward pass is sketched after the figure below.

Softmax as neural network
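The same idea for a batch of inputs, sketched as a single dense layer in NumPy: `W` stacks each neuron's weight vector $w_i$ as a column and `b` collects the offsets $b_i$; the inputs here are randomly generated for illustration.

```python
import numpy as np

def softmax_layer(X, W, b):
    """One dense layer of k neurons followed by a softmax activation.

    Column i of W and entry i of b belong to neuron/class i, so each neuron
    has its own linear part z_i = X w_i + b_i.
    """
    Z = X @ W + b                                   # (n_samples, k) linear outputs
    Z = Z - Z.max(axis=1, keepdims=True)            # stabilize the exponentials
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)   # rows are the predicted y-hat vectors

# Made-up batch: 5 samples, 3 features, k = 4 classes.
rng = np.random.default_rng(3)
X = rng.normal(size=(5, 3))
W = rng.normal(size=(3, 4))
b = np.zeros(4)
Y_hat = softmax_layer(X, W, b)   # each row is a probability vector
print(Y_hat.sum(axis=1))         # every row sums to 1
```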

6. Loss function for softmax regression

We want to find the model parameters $\theta$ that maximize the likelihood of what we observed. $\theta$ can be computed using gradient ascent on the log-likelihood function or gradient descent on the cost function. Here, $n$ is the number of samples.

$$\begin{aligned}\ell(\theta) &=\sum_{i=1}^{n} \log p\left(y^{(i)} \mid x^{(i)} ; \theta\right) \\ &=\sum_{i=1}^{n} \log \prod_{l=1}^{k}\left(\frac{e^{\theta_{l}^{T} x^{(i)}}}{\sum_{j=1}^{k} e^{\theta_{j}^{T} x^{(i)}}}\right)^{1\{y^{(i)}=l\}}\end{aligned}$$

Taking the negative of the log-likelihood, we get the cross-entropy loss function:

$$J(\theta)=-\sum_{i=1}^{n}\log\left(\frac{e^{\theta_{y^{(i)}}^{T} x^{(i)}}}{\sum_{j=1}^{k} e^{\theta_{j}^{T} x^{(i)}}}\right) = -\sum_{i=1}^{n}\log \hat{y}_{i}$$

where $\hat{y}_i$ is the predicted probability assigned to the true class $y^{(i)}$ of sample $i$.
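A minimal NumPy sketch of this loss, assuming softmax outputs `Y_hat` and integer labels `y` are given (the numbers are made up): it picks out the predicted probability of each true class and sums the negative logs.

```python
import numpy as np

def cross_entropy_loss(Y_hat, y, eps=1e-12):
    """J(theta): sum of negative logs of the probability assigned to each true class.

    Y_hat: (n_samples, k) predicted class probabilities (e.g. softmax outputs).
    y:     (n_samples,) integer class labels in {0, ..., k-1}.
    """
    picked = Y_hat[np.arange(len(y)), y]            # predicted probability of the true class
    return -np.sum(np.log(np.clip(picked, eps, 1.0)))

# Made-up predictions for 3 samples over k = 3 classes.
Y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
y = np.array([0, 1, 2])
print(cross_entropy_loss(Y_hat, y))  # = -(log 0.7 + log 0.8 + log 0.4)
```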
