Learning Computer Vision Week 4

Week 4 of documenting my AI/ML learning journey (Sept 29 - Oct 5)

Previously…

  • Relationships between the output functions and the types of regression

  • How SVMs work

  • Overfitting vs Underfitting

  • HOG (Histogram of Oriented Gradients)

Neural Networks aren’t that complicated…until you get into backpropagation.

Ugh.

That’s when I forced myself to learn multivariable calculus, and I actually kinda succeeded in it (nah, not really; I was just a high schooler, and still am a high schooler). But of course, because of all the cool computers and libraries we have now, we don’t have to worry about all that dirty calculus, hooray!

Disclaimer

Most of the code samples in this newsletter have been taken from IBM’s course (listed in the Resources section), and thus aren’t original snippets by me.

Tuesday, October 1st

Happy October!

Today I dove into the differences in performance between the ReLU activation function and the sigmoid activation function.

I worked with some code examples that tested this comparison, using the same Fully Connected Neural Network structure:

Sigmoid

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, D_in, H1, H2, D_out):
        super(Net, self).__init__()
        self.linear1 = nn.Linear(D_in, H1)
        self.linear2 = nn.Linear(H1, H2)
        self.linear3 = nn.Linear(H2, D_out)
    
    # Prediction
    def forward(self,x):
        x = torch.sigmoid(self.linear1(x)) 
        x = torch.sigmoid(self.linear2(x))
        x = self.linear3(x)
        return x

ReLU

class NetRelu(nn.Module):
   
    def __init__(self, D_in, H1, H2, D_out):
        super(NetRelu, self).__init__()
        self.linear1 = nn.Linear(D_in, H1)
        self.linear2 = nn.Linear(H1, H2)
        self.linear3 = nn.Linear(H2, D_out)
    
    # Prediction
    def forward(self, x):
        x = torch.relu(self.linear1(x))  
        x = torch.relu(self.linear2(x))
        x = self.linear3(x)
        return x

D_in

This is the input size of the first/input layer.

H1 and H2

These are the output size of the first layer/input size of the second layer, and output size of the second layer/input size of the third layer, respectively.

D_out

This is the output size of the third/output layer.
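To make these dimensions concrete, here’s a small usage sketch (my own example; the sizes match the MNIST setup used later in the lab):

import torch

# Example sizes: 28*28 input pixels, two hidden layers of 50 neurons, 10 output classes
model_sigmoid = Net(28 * 28, 50, 50, 10)
model_relu = NetRelu(28 * 28, 50, 50, 10)

x = torch.randn(1, 28 * 28)      # one flattened 28x28 "image"
print(model_sigmoid(x).shape)    # torch.Size([1, 10]) -> one value per class
print(model_relu(x).shape)       # torch.Size([1, 10])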

What is a “Fully Connected Neural Network”?

Wdym, “fully connected layers”?

Neural networks can have many layers, and fully connected layers are one of four common types of layers. Here are the four types and what they do (I explain more in the next day’s entry).

  1. Fully Connected (Dense): These are usually the layers you learn first, and they’re the easiest to understand: in a fully connected layer, every neuron is connected to every neuron in the previous and next layers.

  2. Convolutional Layers: Convolutional layers consist of a set of learnable filters/kernels that slide over the input data and compute each output pixel value from the filter’s current weights, effectively extracting spatial features from the input (usually an image).

  3. Recurrent Layers: Built for more complex tasks like predicting stock market changes, recurrent layers are designed to work with sequential data by maintaining a “memory”. To do this, they have connections that feed back into themselves, so each new output is calculated not only from the current input but also from previous inputs. (LSTMs can be more effective than plain RNNs in certain cases, but that’ll come up later.)

  4. Pooling Layers: Often paired with convolutional layers, pooling layers reduce the size of the input by “summarizing” regions, which cuts down the number of parameters and calculations the network needs; this can also be thought of as “resizing” or “condensing” the data of an image. One example of a pooling method is max pooling, where the maximum value from each “region” of pixels (usually a square) becomes the output value that “represents” that region, as shown in the sketch after this list.
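To make the convolutional and pooling layer types a bit more concrete, here’s a small PyTorch sketch (my own example, not from the course) showing how they change the shape of an image tensor; the channel count and kernel sizes are just example values:

import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)        # (batch, channels, height, width): one grayscale image

conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, padding=2)
pool = nn.MaxPool2d(kernel_size=2)
fc = nn.Linear(16 * 14 * 14, 10)     # fully connected layer on the flattened feature maps

features = conv(x)                   # -> (1, 16, 28, 28): 16 feature maps, size kept by padding
pooled = pool(features)              # -> (1, 16, 14, 14): each 2x2 region replaced by its max
scores = fc(pooled.view(1, -1))      # flatten, then map to 10 class scores
print(features.shape, pooled.shape, scores.shape)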

Wednesday, October 2nd

I finished off the lab today, here are some snippets:

def train(model, criterion, train_loader, validation_loader, optimizer, epochs=100):
    useful_stuff = {'training_loss': [], 'validation_accuracy': []}
    # Number of times we train on the entire training dataset
    for epoch in range(epochs):
        for i, (x, y) in enumerate(train_loader):
            optimizer.zero_grad()
            z = model(x.view(-1, 28 * 28))
            loss = criterion(z, y)
            loss.backward()
            optimizer.step()
            # Saves the loss
            useful_stuff['training_loss'].append(loss.data.item())
        # Checks accuracy on the validation set after each epoch
        correct = 0
        for x, y in validation_loader:
            z = model(x.view(-1, 28 * 28))
            _, label = torch.max(z, 1)
            correct += (label == y).sum().item()
        useful_stuff['validation_accuracy'].append(100 * correct / len(validation_loader.dataset))
    return useful_stuff

# Later in the code...
model = Net(28 * 28, 50, 50, 10)
criterion = nn.CrossEntropyLoss()
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=2000, shuffle=True)
validation_loader = torch.utils.data.DataLoader(dataset=validation_dataset, batch_size=5000, shuffle=False)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)  # model must exist before the optimizer can see its parameters
training_results = train(model, criterion, train_loader, validation_loader, optimizer, epochs=10)

for i, (x, y) in enumerate(train_loader)

This iterates through each batch (a chunk of the whole training set) in the train_loader, which in this context is an object of PyTorch’s DataLoader class.

optimizer.zero_grad()

Resets the calculated gradients; they accumulate across iterations if they’re not reset.

z = model(x.view(-1, 28 * 28))

Makes a prediction on the batch by flattening each 28×28 image tensor into a vector of 28 * 28 = 784 values (one row per image).

loss = criterion(z, y)

Calculates the loss between the prediction and the actual class.

loss.backward()

Calculates the gradient value with respect to each weight and bias.

optimizer.step()

Updates the weights and biases according to the calculated gradients.

!! OOP Appearance !!

The DataLoader class (by PyTorch)

The DataLoader class makes it much easier to train a neural network model using batches: when you create a DataLoader object, it automatically splits the dataset you give it (the dataset parameter) into batches of the specified size (the batch_size parameter sets how many samples go in each batch), so you can easily iterate through each batch. To be more specific, each batch comes as a tuple: a tensor with the shape (batch_size, feature_dimensions) for the input features and a tensor with the shape (batch_size,) or (batch_size, num_classes) for the target labels.
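Here’s a small usage sketch of DataLoader (with a made-up tensor dataset, not the MNIST data from the lab):

import torch
from torch.utils.data import TensorDataset, DataLoader

# A made-up dataset: 100 samples with 4 features each, plus 100 integer labels
features = torch.randn(100, 4)
labels = torch.randint(0, 3, (100,))
dataset = TensorDataset(features, labels)

loader = DataLoader(dataset=dataset, batch_size=25, shuffle=True)

for i, (x, y) in enumerate(loader):
    print(i, x.shape, y.shape)   # 4 batches: torch.Size([25, 4]) and torch.Size([25])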

What is an epoch? (I also kinda accidentally explain backpropagation here)

Artificial neural networks learn through backpropagation, which is basically calculus magic that the computer does behind the scenes to figure out whether each neuron’s weight (a number that the input is multiplied by) and bias (a number that is added to the input) should increase or decrease, and by how much, so that the model produces a more accurate, desirable result where the correct output neuron (class, number, image, etc.) ends up with a higher value (the probability the model assigns to it being correct).

An epoch is simply one full pass over the training dataset, doing a forward pass (algebra) and backpropagation (multivariable calculus) for each batch. The more epochs you run, the more accurate your model will probably be, but training takes longer and the accuracy gains shrink with each additional epoch.
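To make the “calculus magic” a little more concrete, here’s a toy example (my own, not from the lab) of PyTorch’s autograd computing the gradients that backpropagation uses:

import torch

w = torch.tensor(2.0, requires_grad=True)   # a single weight
b = torch.tensor(0.5, requires_grad=True)   # a single bias
x, y = torch.tensor(3.0), torch.tensor(10.0)

y_hat = w * x + b                 # forward pass (algebra)
loss = (y_hat - y) ** 2           # squared error
loss.backward()                   # backpropagation: d(loss)/dw and d(loss)/db

print(w.grad)   # tensor(-21.) = 2 * (y_hat - y) * x
print(b.grad)   # tensor(-7.)  = 2 * (y_hat - y)

The optimizer then nudges each weight and bias in the opposite direction of its gradient.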

Comparing the validation accuracy of the two models showed that the FCNN with ReLU as an activation function learned much more effectively than the FCNN with sigmoid as an activation function.

Thursday, October 3rd

Today, I discovered the concept of Receptive Fields: a receptive field is the region of the input that contributes to a single pixel value in the activation map.

The larger each unit’s receptive field, the fewer units a layer needs to cover the same input; for example, stacking two 3 by 3 convolutions gives each output value a 5 by 5 receptive field in the original input.

Another concept that I learned but didn’t quite explain fully is flattening layers: the input data (e.g. an image) is transformed into a 1D form (array/list) by “taking off” the top row (y = 0), placing the second row right after it, then the third row, and so on.

A quick visual representation of flattening
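Here’s a tiny sketch of what flattening does to a tensor (made-up values, not from the lab):

import torch

image = torch.tensor([[1, 2],
                      [3, 4]])   # a tiny 2x2 "image"

flat = image.flatten()           # rows laid end to end
print(flat)                      # tensor([1, 2, 3, 4])

This is the same idea behind x.view(-1, 28 * 28) in the training code above, which flattens each 28x28 image.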

Friday, October 4th

Today, I looked at an example of some CNN code.

class CNN(nn.Module):
    
    # Constructor
    def __init__(self, out_1=16, out_2=32):
        super(CNN, self).__init__()
        # We start with 1 channel because the input is a single black-and-white (grayscale) image
        # Feature map width after this layer is 16 (a 5x5 kernel with padding=2 keeps the 16x16 input width)
        self.cnn1 = nn.Conv2d(in_channels=1, out_channels=out_1, kernel_size=5, padding=2)
        # Feature map width after this layer is 8
        self.maxpool1 = nn.MaxPool2d(kernel_size=2)
        
        # Feature map width after this layer is 8
        self.cnn2 = nn.Conv2d(in_channels=out_1, out_channels=out_2, kernel_size=5, stride=1, padding=2)
        # Feature map width after this layer is 4
        self.maxpool2 = nn.MaxPool2d(kernel_size=2)
        # In total we have out_2 (32) feature maps, each 4 x 4 in size based on the widths above (the feature maps are square)
        # The output is a value for each class
        self.fc1 = nn.Linear(out_2 * 4 * 4, 10)
    
    # Other functions such as forward and activations...
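The forward pass isn’t shown in the snippet, but based on the layers defined above, it would look roughly like this (my own sketch, so treat it as an assumption rather than the lab’s exact code):

    # Hypothetical forward pass for the CNN class above
    def forward(self, x):
        x = torch.relu(self.cnn1(x))    # convolution + activation
        x = self.maxpool1(x)            # 16x16 -> 8x8
        x = torch.relu(self.cnn2(x))
        x = self.maxpool2(x)            # 8x8 -> 4x4
        x = x.view(x.size(0), -1)       # flatten to (batch, out_2 * 4 * 4)
        return self.fc1(x)              # one score per class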

Saturday, October 5th

I was pretty busy today with a tournament and HoCo, so I didn’t get much done; that aside, I learned what the common CNN architectures are.

CNNs can have any number of layers, of any type, with any number of neurons in each layer, so how do researchers and machine learning engineers determine the best way to build a CNN with all of this customization at their disposal? Well, that’s where CNN architectures come in: CNN architectures are…well…architectures or “models” of CNNs that have proven, time and again, to consistently achieve a certain goal, whether that’s high accuracy or low training time.

LeNet-5

The most successful use case of LeNet-5 is the MNIST dataset of handwritten digits, since it works best on (normally grayscale) images. Its first layer is a convolutional layer that applies a 5 by 5 filter with a stride of 1, producing a volume of 28 by 28 outputs. The next layer is a pooling layer with 14 by 14 outputs. This filter-then-pooling pattern repeats once more, and then everything is flattened into fully connected layers of 120 and 84 neurons, using a sigmoid activation function, to produce the output.

Visualization of LeNet-5, by IBM.
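Here’s a rough PyTorch sketch of a LeNet-5-style network following the description above (my own reconstruction; the channel counts of 6 and 16 feature maps come from the original LeNet-5 paper, and the 32 by 32 input is the classic padded MNIST size):

import torch.nn as nn

lenet5_style = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, stride=1),    # 5x5 filter, stride 1 -> 6 x 28 x 28
    nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2),                 # pooling -> 6 x 14 x 14
    nn.Conv2d(6, 16, kernel_size=5, stride=1),   # -> 16 x 10 x 10
    nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2),                 # -> 16 x 5 x 5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),                  # fully connected: 120 neurons
    nn.Sigmoid(),
    nn.Linear(120, 84),                          # fully connected: 84 neurons
    nn.Sigmoid(),
    nn.Linear(84, 10),                           # one output per digit class
)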

AlexNet

AlexNet, introduced in 2012, reached about 63.3% (top-1) accuracy, which was better than any other CNN architecture at the time. It takes a 227 by 227 pixel image as input, then applies a convolutional layer with 96 filters sized 11 by 11 pixels, a pooling layer, a convolutional layer with 256 filters sized 5 by 5 pixels, a pooling layer, a convolutional layer with 384 filters of size 3×3, another convolutional layer with 384 filters of size 3×3, a convolutional layer with 256 filters of size 3×3, a pooling layer, two fully-connected layers with 4096 neurons back-to-back, a fully-connected layer with 1000 neurons (one per ImageNet class), and then finally, a softmax function.

VGGNet

VGGNet is a much deeper architecture/model compared to the ones I’ve discussed so far:

For a more in-depth explanation of VGGNet, I looked at the GeeksForGeeks explanation of it.

The main reason architectures such as VGGNet could grow so much deeper is that replacing large convolutional filters with stacks of smaller filters greatly reduces the number of parameters the model has to handle, while also reducing the number of operations it has to do, for the same receptive field.
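As a rough sanity check of that claim (my own example, not from the course), you can count the weights in one 5 by 5 convolution versus two stacked 3 by 3 convolutions, which cover the same 5 by 5 receptive field:

import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

channels = 64   # example channel count

one_5x5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
two_3x3 = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
)

print(count_params(one_5x5))   # 102464 weights and biases
print(count_params(two_3x3))   # 73856 -- fewer parameters, same receptive field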

ResNet

Now we’re going into the deep learning side of artificial intelligence, where the layers become more and more numerous and complex: this is where the vanishing gradient problem begins to show up. The vanishing gradient problem happens during backpropagation (where the model “learns”): the calculus behind the “learning” (more specifically, the chain rule) multiplies many small numbers together, resulting in even smaller gradients, and the model can’t learn much from those microscopic numbers. These tiny values are often the result of small numbers being multiplied over and over, e.g. 0.2 to the 12th power.

One fix is using the ReLU (rectified linear unit) activation function in place of sigmoid and/or tanh, but another solution is ResNet. ResNet stands for “Residual Network”: its “residual blocks” or “skip connections” let data (and, during backpropagation, the gradients) bypass/skip certain layers, so the gradients have “shortcut” paths to the neurons they need to reach.

A visual for how the concept of ResNet could work
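Here’s a minimal sketch of a residual block in PyTorch (my own simplified version, assuming the input and output have the same shape so they can be added together):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super(ResidualBlock, self).__init__()
        # Two convolutions that keep the spatial size and channel count unchanged
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = torch.relu(self.conv1(x))
        out = self.conv2(out)
        # The skip connection: add the original input back in,
        # giving gradients a "shortcut" around the convolutions
        return torch.relu(out + x)

block = ResidualBlock(channels=16)
print(block(torch.randn(1, 16, 8, 8)).shape)   # torch.Size([1, 16, 8, 8])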

Lessons Learned

The DataLoader class is a big part of PyTorch, and is very useful.

There are things like ReLU (activation function) and ResNet (residual net architecture) that help circumvent the vanishing gradient problem, especially in deeper neural networks with more layers.

There are many different neural network architectures, and I learned about some specific models and architectures that were revolutionary and are still popular.

Resources

Course I followed: