Alvin Wan

Brian, Digital Ocean

Mar. 1, 2019

# Understanding Neural Networks

## Neural Networks

$$\text{neural network}: \text{face} \rightarrow \text{emotion}$$

To start, we can think of neural networks as predictors. Each network accepts data $X$ as input and outputs a predicted value $\hat{y}$. The model is parameterized by weights $w$, meaning each value of $w$ defines a different model, just as each pair of slope and intercept $(m, b)$ defines a different line.

$$\hat{y} = f(X; w)$$
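To make the line analogy concrete, here is a minimal sketch in which the predictor is a line and the weights are $w = (m, b)$. The function name `predict` and the sample values are illustrative assumptions, not part of the original article.

```python
def predict(x, w):
    """Line predictor f(x; w) with w = (m, b): y_hat = m * x + b."""
    m, b = w
    return m * x + b

# The same input under two different weight values gives two different models.
y_hat_a = predict(2.0, (3.0, 1.0))    # 3 * 2 + 1
y_hat_b = predict(2.0, (-1.0, 0.5))   # -1 * 2 + 0.5
```

Changing `w` changes the model; the input `x` is untouched.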

On top of this output, we then define a loss function.

$$L(\hat{y}, y)$$

Recall that our goal is to minimize the loss by changing $w$. Plugging in our definition of $\hat{y}$, we obtain a new expression.

$$\min_w L(f(X; w), y)$$
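As a rough illustration of this objective, the sketch below evaluates $L(f(X; w), y)$ for two candidate values of $w$. It assumes a line predictor and a squared-error loss averaged over the data; the article does not fix a particular loss, so both choices (and all names here) are assumptions made for the example.

```python
def predict(x, w):
    """Line predictor f(x; w) with w = (m, b)."""
    m, b = w
    return m * x + b

def squared_loss(y_hat, y):
    """Assumed loss: squared error between prediction and label."""
    return (y_hat - y) ** 2

def objective(w, xs, ys):
    """Mean of L(f(x; w), y) over the dataset; this is what we minimize over w."""
    return sum(squared_loss(predict(x, w), y) for x, y in zip(xs, ys)) / len(xs)

xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]                        # generated by y = 2x + 1

loss_good = objective((2.0, 1.0), xs, ys)   # weights that match the data
loss_bad = objective((0.0, 0.0), xs, ys)    # weights that ignore the data
```

The weights that generated the data give zero loss, while the all-zero weights give a strictly larger value, which is the sense in which minimizing over $w$ selects a better model.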

To solve this objective, we can take the derivative, set it to 0, and solve. However, unlike with least squares, we are not guaranteed a closed-form solution. In other words, it may be impossible to solve for $w$ after setting the derivative to 0. As a result, we use an alternative optimization procedure called stochastic gradient descent. In short, we start from a random $w$, which we will call $w_0$, and incrementally update $w_i$. We iteratively apply the following rule:

$$w_{i+1} = w_i - \alpha_i \nabla_w$$

Each $w_{i+1}$ is computed from the previous $w_i$, the gradient $\nabla_w$, and a learning rate $\alpha_i$.
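The update rule can be sketched in a few lines. This example assumes a one-parameter model $\hat{y} = wx$, a squared-error loss, and a constant learning rate in place of a per-step $\alpha_i$; the gradient is the derivative of the loss with respect to $w$ on one randomly chosen sample. The function name `sgd_fit` and its defaults are hypothetical.

```python
import random

def sgd_fit(xs, ys, alpha=0.05, steps=200, seed=0):
    """Fit y_hat = w * x by stochastic gradient descent on squared loss."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)          # w_0: random starting point
    for _ in range(steps):
        i = rng.randrange(len(xs))      # "stochastic": one random sample per step
        x, y = xs[i], ys[i]
        grad = 2.0 * x * (w * x - y)    # dL/dw for L = (w * x - y)^2
        w = w - alpha * grad            # w_{i+1} = w_i - alpha * gradient
    return w

xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]                    # generated by y = 3x
w_final = sgd_fit(xs, ys)
```

On this noise-free toy data every step shrinks the distance to the true weight, so `w_final` ends up very close to 3. With real data and a neural network, the loss is not this well behaved and the learning rate usually needs tuning.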

For now, we define $\nabla_w = \frac{\partial L}{\partial w} \Big|_x$. In this way, we can obtain an iteratively improved neural network parameterized by $w_i$.

With this cursory overview of the optimization algorithm, we can now discuss the neural network itself. Start with the inputs $x_1, x_2, \ldots, x_n$.

As stated before, the neural network simply denotes a series of computations. The fundamental unit in this computation graph is the node. Say our neural network is precisely one node.

First, we discuss the input to the node, $S$. Each incoming edge has a scalar weight $w_i$ (here the subscript $i$ indexes the inputs, not the update steps of gradient descent). The input $S$ to the node is simply a weighted sum of all inputs:

$$S = \sum_{i=1}^n w_i x_i$$

The node itself represents the application of a nonlinear function $g$ to the input. We call this function an activation. Then our node's output is the following:

$$\hat{y} = g(S)$$

This completes our neural network. We promised earlier that a neural network is a predictor, $\hat{y} = f(X; w)$. We have one such possible predictor now, which is $\hat{y} = g(\sum_{i=1}^n w_i x_i)$.
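A single node is small enough to write out directly. The sketch below uses the sigmoid as the activation $g$; the article does not name a specific activation, so that choice and the function names are assumptions made for illustration.

```python
import math

def sigmoid(s):
    """One common nonlinear activation; an assumed choice for g."""
    return 1.0 / (1.0 + math.exp(-s))

def node(weights, inputs, g=sigmoid):
    """One-node network: y_hat = g(sum_i w_i * x_i)."""
    if len(weights) != len(inputs):
        raise ValueError("need exactly one weight per input")
    s = sum(w * x for w, x in zip(weights, inputs))   # weighted sum S
    return g(s)                                       # activation applied to S

# S = 0.5*2 + (-1)*3 + 2*1 = 0, and sigmoid(0) = 0.5
y_hat = node([0.5, -1.0, 2.0], [2.0, 3.0, 1.0])
```

Passing a different function as `g` swaps the activation without changing the weighted sum.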

Say we stack many of these nodes. This set of nodes forms a fully connected layer. There are other popular neural network layers as well. Another layer we will use is called the convolutional layer. For brevity in the meantime, we can think of a convolutional layer as an edge or feature detector.
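To illustrate the fully connected layer described here, the following sketch stacks several nodes that all read the same inputs, each with its own row of weights. The use of `tanh` as the activation and the name `fully_connected` are assumptions for the example.

```python
import math

def fully_connected(weight_rows, inputs, g=math.tanh):
    """Apply one node per weight row; every node sees every input."""
    return [
        g(sum(w * x for w, x in zip(row, inputs)))   # g(S) for this node
        for row in weight_rows
    ]

weight_rows = [
    [1.0, 0.0],    # node 1 looks only at x_1
    [0.0, 1.0],    # node 2 looks only at x_2
    [1.0, -1.0],   # node 3 looks at the difference x_1 - x_2
]
outputs = fully_connected(weight_rows, [0.5, 0.5])
```

Three weight rows produce three outputs, one per node. The third node outputs 0 here because its weighted sum is $0.5 - 0.5 = 0$ and $\tanh(0) = 0$.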
