The Linear Layer
We now introduce Linear layers, a neural network layer abstraction that lets us quickly build feedforward networks. For various historical and non-historical reasons, you may see other deep learning resources or libraries refer to these as dense or perceptron layers, but they all mean basically the same thing. For our purposes, a Linear layer is one that simply applies a linear (more precisely, affine) transformation to its input. That is, for an input $X \in \mathbb{R}^{m \times n}$, it applies a transformation that looks like this
$$\text{Linear}(X) = W X + b$$
We call $W$ the weight matrix for the layer and $b$ the bias vector. If you view $W$ as applying a linear map to $X$, then $b$ allows us to shift that mapping off the origin, which is key to the representational power of the affine transformation. You can refer to the Wikipedia article on affine transformations to learn more about their interesting properties, but some of the key ones are that they preserve
- Collinearity (points lying on a common line still lie on a common line after the transformation)
- Parallelism (parallel lines remain parallel after the transformation)
- Convexity (convex sets in the domain remain convex after the transformation is applied)
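To make the transformation concrete, here's a tiny standalone NumPy sketch (illustration only, not part of our library) applying $Wx + b$ to a single input point:
import numpy as np

W = np.array([[2.0, 0.0],
              [0.0, 3.0]])   # weight matrix
b = np.array([1.0, -1.0])    # bias vector
x = np.array([0.5, 2.0])     # a single input point

print(W @ x + b)             # the affine transform: [2. 5.]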
Now that we've introduced the Linear layer, let's work on its implementation within our flamethrower library.
The Implementation
First things first, let's recall the Module class of our nn library. This provides a basic construction for neural network "parts" that we can string together to make a full model. As such, our Linear layer class will inherit from nn.Module.
Now we can look toward the class's __init__ method. If you look back at the layer definition, all we need to do is specify the two parameters which comprise it: $W$ and $b$. Remember, $X$ is a matrix whose rows hold our training examples and whose columns hold the features, so the number of columns of $X$ will be the in_size of our layer. Instead of multiplying on the left by $W$ as in the layer definition, we'll actually multiply by it on the right in order to make the dimensions work out. Therefore, $W$ will have in_size rows and out_size columns. Having this figured out, let's start the implementation of __init__.
The __init__ Method
def __init__(self, in_size, out_size):
    super(Linear, self).__init__()
    self.in_size = in_size
    self.out_size = out_size
Remember that because we're inheriting from nn.Module, we need to use the Python built-in super() function to initialize the parent class.
Next, let's add a couple of enhancements to the above __init__ implementation. First, let's defer the actual initialization of the model parameters, W and b, to a private method _init_params(). Second, let's pass a keyword argument use_bias to __init__ which allows us to specify whether we want to use the bias in the layer. The updated __init__ should now look like this.
def __init__(self, in_size, out_size, use_bias=True):
    super(Linear, self).__init__()
    self.in_size = in_size
    self.out_size = out_size
    self.use_bias = use_bias
    self._init_params()
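Once we've written _init_params (coming up next), constructing a layer will look something like this (just a usage sketch):
layer = Linear(784, 128)                       # 784 input features -> 128 outputs, with a bias
no_bias = Linear(784, 128, use_bias=False)     # same shape, but skips the bias term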
The _init_params Method
Now, let's turn our attention to actually filling the W and b parameters with their initial values. If you've read the lesson on parameter initialization (which you should have!), you'll know that there are various schemes for sampling the initial values of weight matrices (and vectors). We'll implement these in a separate initialization module, which we'll import as init. The _init_params method will just take an optional init_fn keyword argument, a reference to the desired parameter initialization function that handles all the value sampling. The default we'll use is glorot_uniform, which takes care of initializing W. For b, we'll just initialize it to a vector of zeros, a commonly used choice for initial biases that works well in practice.
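For reference, Glorot (Xavier) uniform initialization samples each weight from $U(-a, a)$ with $a = \sqrt{6 / (\text{fan}_{\text{in}} + \text{fan}_{\text{out}})}$. A minimal NumPy sketch of such a function might look like the following; our actual init module may differ in its details:
import numpy as np

def glorot_uniform(fan_in, fan_out):
    # Sample uniformly from [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out))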
Another thing we want to do is wrap W and b as Tensor objects. This ensures that they're tracked by autograd and will be optimized via backpropagation during neural network training.
Finally, we'll want to call self.new_param(param_name, param) for each of the parameters we initialize. This is an underlying method of the Module class that registers the parameters as part of the module, so that they can be used by Optimizers (more on these later). Let's see what all of this looks like in code.
def _init_params(self, init_fn=None):
    if not init_fn:
        init_fn = init.glorot_uniform
    self.W = Tensor(init_fn(self.in_size, self.out_size))
    self.new_param('W', self.W)
    if self.use_bias:
        self.b = Tensor(tl.zeros((1, self.W.shape[1])))
        self.new_param('b', self.b)
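One note: __init__ calls _init_params() with no arguments, so the default glorot_uniform always gets used. If you wanted a different scheme, you could thread an init_fn argument through __init__, or call _init_params again yourself with any callable that takes (in_size, out_size) and returns an array of that shape (assuming new_param is happy to overwrite an entry of the same name). For example, with a hypothetical small-Gaussian initializer:
import numpy as np

def small_normal(in_size, out_size):
    # Hypothetical initializer: small Gaussian weights.
    return 0.01 * np.random.randn(in_size, out_size)

layer = Linear(4, 3)
layer._init_params(init_fn=small_normal)   # re-initializes W (and b) with the custom scheme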
The forward Method
Now it's time for the bread and butter of the implementation. If you'll recall from the Module section, every Module must implement a forward() method which specifies the computation performed when the module is called on an input. In our case, we just implement the simple equation from the Linear layer definition. Remember, we can use the @ operator for (NumPy-style) matrix multiplication. The code looks as follows.
def forward(self, X):
    if self.use_bias:
        return X @ self.W + self.b
    else:
        return X @ self.W
The Entire Thing (Imports and All)
from .module import Module
from flamethrower.autograd import Tensor
import flamethrower.autograd.tensor_library as tl
import flamethrower.autograd.tensor_library.random as tlr
import flamethrower.nn.initialize as init

class Linear(Module):
    def __init__(self, in_size, out_size, use_bias=True):
        super(Linear, self).__init__()
        self.in_size = in_size
        self.out_size = out_size
        self.use_bias = use_bias
        self._init_params()

    def _init_params(self, init_fn=None):
        # Default to Glorot uniform initialization for the weight matrix.
        if not init_fn:
            init_fn = init.glorot_uniform
        self.W = Tensor(init_fn(self.in_size, self.out_size))
        self.new_param('W', self.W)
        if self.use_bias:
            # The bias starts out as a row vector of zeros.
            self.b = Tensor(tl.zeros((1, self.W.shape[1])))
            self.new_param('b', self.b)

    def forward(self, X):
        # Apply the affine transformation X W + b (or just X W without a bias).
        if self.use_bias:
            return X @ self.W + self.b
        else:
            return X @ self.W
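As a final sanity check, here's a quick usage sketch (it assumes our Tensor wraps a NumPy array and that the result of forward exposes a .shape attribute, as the layer's own code relies on):
import numpy as np

layer = Linear(10, 5)
X = Tensor(np.random.randn(32, 10))   # a batch of 32 examples, 10 features each
out = layer.forward(X)
print(out.shape)                      # we expect (32, 5)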