Learning Computer Vision Week 3

Week 3 of documenting my AI/ML learning journey (Sept 22 - Sept 28)

Neural Networks can learn a lot of things, but how exactly do they learn them? This was the question I dived into later this week, where I found out that a neural network is just a bunch of algebra (at least, the first part of it); it's so simple that a person even built a machine-learning tic-tac-toe player out of a bunch of matchboxes!

Monday, September 23

Today I learned about the SoftMax vs ArgMax output functions, which sit at the end of a model's classification pipeline: ArgMax returns only the value (index/class) that has the largest probability, while SoftMax returns the whole list of probabilities, similar to logistic regression. Thus, ArgMax is mainly used when you just need to know the most probable category, while SoftMax is mainly used when you need a full probability distribution across the classes. During training, it's the SoftMax probabilities that get compared against the true labels, which is what pushes the model to consistently classify test data correctly most of the time.
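
A quick sketch of the difference in PyTorch (the logits here are made up just for illustration):

import torch

# Made-up raw model outputs (logits) for 4 classes
logits = torch.tensor([2.0, 1.0, 0.1, -1.0])

probs = torch.softmax(logits, dim=0)   # SoftMax: a full probability distribution that sums to 1
winner = torch.argmax(logits)          # ArgMax: just the index of the most probable class

print(probs)    # four probabilities that add up to 1
print(winner)   # tensor(0)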

Relationships between the output functions and the types of regression (in image classification):

Binary Classification

    Single output (maximum probability): SVMs

    Probabilistic outputs: Logistic Regression + sigmoid

Multiclass Classification

    Single output (maximum probability): SVMs + multiclass strategy (OvR or OvO) + ArgMax

    Probabilistic outputs: Logistic Regression + SoftMax

import torch
import torch.nn as nn

class SoftMax(nn.Module):
    # Constructor
    def __init__(self, input_size, output_size):
        super(SoftMax, self).__init__()
        # Creates a linear layer of the given input size and output size
        self.linear = nn.Linear(input_size, output_size)

    # Prediction
    def forward(self, x):
        # Runs the x value through the single layer defined above (returns raw logits)
        z = self.linear(x)
        return z

super()

This function is used to call methods from the parent class (i.e. nn.Module) in the subclass (i.e. the SoftMax class); this is a prime example of how inheritance works in machine learning.

!! OOP Alert !!

Machine Learning and Inheritance

Inheritance is when a class, while being defined, takes attributes and methods from another class, thus making the newly-defined class a “child” of the “parent” class it took some of its attributes and methods from.
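
A tiny, generic sketch of the idea (the class names here are made up, nothing ML-specific):

class Parent:
    def greet(self):
        return "hello from the parent"

class Child(Parent):                  # Child inherits everything Parent defines
    def greet_loudly(self):
        return self.greet().upper()   # reuses the inherited greet() method

print(Child().greet())                # works even though Child never defined greet()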

learning_rate = 0.1
# The optimizer will update the model parameters using the learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
# The criterion will measure the loss between the prediction and actual label values
# This is where the SoftMax occurs; it is built into nn.CrossEntropyLoss
criterion = nn.CrossEntropyLoss()
# Create a training data loader so we can set the batch size
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=100)
# Create a validation data loader so we can set the batch size
validation_loader = torch.utils.data.DataLoader(dataset=validation_dataset, batch_size=5000)

The learning rate (usually written as α (alpha) or η (eta), and sometimes γ (gamma), in formulas and scholarly articles) controls how far the model moves its parameters at each step of backpropagation (the process through which ANNs learn by reducing the error metric). If the learning rate is too high, the model may overshoot the minimum and never settle on the lowest error; if the learning rate is too low, gradient descent will be too slow.
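
Here's a minimal sketch of the training loop these pieces usually plug into (the model, the 28 × 28 flattening, and the epoch count are assumptions based on the MNIST-style lab, not shown above):

# Assumes `model`, `criterion`, `optimizer`, and the data loaders defined above
for epoch in range(10):
    for x, y in train_loader:
        optimizer.zero_grad()            # clear the gradients from the previous batch
        z = model(x.view(-1, 28 * 28))   # forward pass (flattening the images is an assumption)
        loss = criterion(z, y)           # CrossEntropyLoss applies the SoftMax internally
        loss.backward()                  # backpropagation: compute the gradients
        optimizer.step()                 # nudge the parameters by the learning rate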

softmax = nn.Softmax(dim=1)

This will make a SoftMax object, which can be used to find the SoftMax probabilities for a model on some test images.

dim (int) – A dimension along which Softmax will be computed (so every slice along dim will sum to 1).
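
For example, applied to a made-up batch of model outputs, dim=1 means each row (i.e. each image) gets its own probability distribution:

import torch
import torch.nn as nn

softmax = nn.Softmax(dim=1)

# Made-up logits for a batch of 2 images and 3 classes
z = torch.tensor([[1.0, 2.0, 0.5],
                  [0.1, 0.2, 3.0]])

probs = softmax(z)
print(probs.sum(dim=1))     # each row sums to 1
print(torch.max(probs, 1))  # the max probability and its class index for each image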

Tuesday, September 24


Today, I learned about SVMs, or Support Vector Machines.

Support Vector Machines are a type of ML algorithm used to draw a decision border in classification-type problems, and they are especially useful when the data is not linearly separable. In that case, a linear hyperplane (decision function) can't effectively split the data into distinct decision regions that contain only their respective classes. Thus, we have to map the original training data points onto a new space, where the data points get “stretched” by the mapping until a linear hyperplane in that space can separate the training data.

Now, there are different strategies for how to map the training data, and these strategies have a name: kernels. Of course, we already know one kind of kernel: linear, which keeps the original feature space and just uses a straight-line decision function (wx + b). Here are the other popular types of kernels:

  1. Polynomial

    Polynomial kernels map the data using, well, polynomial functions of the original features.

  2. RBF (Radial Basis Function)

    Radial Basis Functions are interesting, and are one of the most popular kernels used in SVMs. The RBF kernel is very flexible and can bend the decision boundary into (almost) any shape, even a spiral-like one! But RBF relies on one key parameter, gamma, which controls how “flexible” the decision boundary can get. The higher the gamma, the more flexible the decision boundary will be during training, but also the higher the risk of overfitting.

If you noticed that the model with the RBF kernel looks overfitted, then good job! It probably is.
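
Here's a rough sketch of how the different kernels might be compared with scikit-learn's SVC (the half-moons dataset and the gamma value are just illustrative choices, not from the course):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable in the original feature space
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel, gamma=2).fit(X_train, y_train)  # gamma only matters for poly/rbf
    print(kernel, clf.score(X_test, y_test))

Typically the RBF kernel comes out on top on this kind of curved data, with the linear kernel trailing behind.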

What is Overfitting vs Underfitting?

You ever heard of the saying, “too much of a good thing is bad?”

To an extent, the same applies to training AI and Machine Learning models. When a model is being “trained” with data, it adjusts its model parameters to decrease the error metrics (which measure how far off the predictions are from the true values) through a process called gradient descent. Ideally, any test data the model sees would be similar to the data it was trained on, but when the model is trained too much and starts capturing the noise in the training data, it becomes overfitted. That means the model “assumes” the test data is almost equivalent to the training data (which, most of the time, isn't the case), and therefore makes faulty predictions on the test data.

On the other hand, underfitting is when the model is trained too little on the training data, leaving the model parameters so “generalized” that it can't make accurate predictions on any data, not even the training set.
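
A quick way to see overfitting in practice is to compare training accuracy with test accuracy; here's a small sketch using an RBF SVM with a deliberately extreme gamma (all values are illustrative):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy two-class data that a smooth boundary would fit just fine
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An extreme gamma lets the RBF boundary wrap around individual (noisy) training points
overfit = SVC(kernel="rbf", gamma=500).fit(X_train, y_train)
print("train:", overfit.score(X_train, y_train))  # close to 1.0
print("test:", overfit.score(X_test, y_test))     # noticeably lower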

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_logistic = scaler.fit_transform(X_train)
X_test_logistic = scaler.transform(X_test)

StandardScaler() (scikit-learn)

StandardScaler() is a class provided by scikit-learn; when an object is initialized with it, that instance can calculate the mean and standard deviation of the data (during fitting) and transform the data to a standardized scale. Note that the scaler is fit only on the training data and then reused to transform the test data, so no information from the test set leaks into the preprocessing.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Part 1 (making a logit)
logit = LogisticRegression(C=0.01, penalty='l1', solver='saga', tol=0.1, multi_class='multinomial')
logit.fit(X_train_logistic, y_train)
# Part 2 (testing the model)
y_pred_logistic = logit.predict(X_test_logistic)
print("Accuracy: " + str(logit.score(X_test_logistic, y_test)))
# Part 3 (error metrics)
label_names = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
cmx = confusion_matrix(y_test, y_pred_logistic, labels=label_names)

LogisticRegression() (scikit-learn)

LogisticRegression() is a class in scikit-learn, and, understandably, creates a logistic regression model object. Get more info about what the parameters do here (sklearn documentation)!

!! OOP Alert !!

Machine Learning Models and the .fit() method

When making a model object, chances are high that calling .fit(X_train, y_train) will train the model, at least in scikit-learn.

Thursday, September 26

Looking deeper, I learned about Image Features, and how different image features prove to be more useful in some situations compared to others.

If a model only used color, going off the frequencies of each color (red, green, and blue), that wouldn't actually help much in image classification: for example, if an image of a red square and an image of a red circle were shown to a model with that logic, the model would classify both of those images as the same class. So instead, image classification models rely on things like HOG and Sobel edge detection to extract image features to analyze.

HOG stands for Histogram of Oriented Gradients: the image is divided into small localized regions (cells), the gradient orientation is computed at each pixel, and HOG then builds a histogram of those orientations for each region, effectively counting the different types of curves and edges.

So as it turns out, I learned the whole image classification process backwards; it's as follows (a rough sketch of the pipeline appears after this list):

  1. Feature Extraction (using HOG and Sobel edge detection)

  2. Kernel (using SVM)

  3. Linear Classification (using Logistic Regression)
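
Here's a rough sketch of that pipeline, using scikit-learn's small built-in digits dataset and scikit-image's hog function as stand-ins for the course's setup (an RBF SVM plays the classifier here; the course also used logistic regression):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from skimage.feature import hog

# Small 8x8 digit images standing in for a real image dataset
digits = load_digits()

# Step 1: feature extraction, one HOG descriptor per image
features = np.array([
    hog(img, orientations=8, pixels_per_cell=(4, 4), cells_per_block=(1, 1))
    for img in digits.images
])

X_train, X_test, y_train, y_test = train_test_split(features, digits.target, random_state=0)

# Steps 2-3: a kernel SVM draws the (non-linear) decision boundary over the HOG features
clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
print("Accuracy:", clf.score(X_test, y_test))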

Friday, September 27

I did a lab with IBM’s CV Studio, but I also reviewed a specific layout of error metrics: a confusion matrix.

This would be an example of an optimal confusion matrix for a model.

A confusion matrix is a performance measurement for a classification problem. It is a table with a combination of predicted and actual values. On the y-axis, we have the “true” label and on the x-axis we have the “predicted” label. This example will focus on a binary classifier, i.e. a yes or no model.

-IBM

So the end goal is to have counts as close to zero as possible in every square except along the top-left to bottom-right diagonal, meaning the model doesn't “confuse” an image with a class that isn't its own.
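
For instance, a toy example with made-up labels for a 3-class problem shows where those off-diagonal “confusions” land:

from sklearn.metrics import confusion_matrix

# Made-up true vs. predicted labels for a 3-class problem
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 2]

# Rows are the true labels, columns are the predicted labels;
# the single off-diagonal count is the one image the model "confused"
print(confusion_matrix(y_true, y_pred))
# [[2 0 0]
#  [0 2 1]
#  [0 0 4]]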

Saturday, September 28

Today, I actually got into Neural Networks, and went through a lab about a simple XOR model in the form of a neural network. Check out this series by 3b1b if you've never seen the logic behind these amazing models!

What is XOR?

XOR can be referred to as a “gate” or a “condition”, and it's used in a lot of STEM fields, not just in Machine Learning. XOR stands for “exclusive or”: the output of the XOR “gate” will only be 1 (positive) if exactly one of its inputs is satisfied. This means that if two out of two or zero out of two inputs are positive and they are processed through the condition, the output will be 0 (negative).

XOR is a unique condition/logic gate because it requires an “in the middle” condition: the gate can't simply check all of the inputs and see whether there are “enough positive” or “enough negative” inputs to decide if the output should be positive or negative.

In a setting where we have two inputs, imagine a number line from 0 to 2 representing how many of the inputs are positive (x), and a line (y) representing whether the output is positive or negative. This creates a “hill” kind of shape, where you can't divide the number line into positive-output and negative-output parts using just one split.

What you see here is what mathematicians would refer to as a “step function”, where the output of the function changes almost immediately as the input moves from one value to another.

Because this data can't be split into positive and negative parts with a single cut, we have to combine the outputs of two different functions, where one is subtracted from the other, in order to get this “hill” shape. And essentially, this “combination” of function outputs is the heart of the first part of neural networks: forward passing! The “hill” won't always look like a block (a step function); it can look like an actual curve, using activation functions to adjust how “sensitive” each output calculation (also known as a neuron) should be to the inputs/outputs of the neurons connected to it. Some commonly used activation functions include sigmoid, tanh, and ReLU, to name a few.
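
To tie it together, here's a minimal sketch of an XOR network in PyTorch (the hidden size, learning rate, and epoch count are my own choices, not the lab's; with such a tiny network it can occasionally get stuck and need a rerun):

import torch
import torch.nn as nn

# A tiny network: 2 inputs -> a small hidden layer -> 1 output,
# with sigmoid activations shaping the "hill"
model = nn.Sequential(
    nn.Linear(2, 4),
    nn.Sigmoid(),
    nn.Linear(4, 1),
    nn.Sigmoid(),
)

# The four XOR input/output pairs
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for epoch in range(5000):
    optimizer.zero_grad()            # clear old gradients
    loss = criterion(model(X), y)    # forward pass + error metric
    loss.backward()                  # backpropagation
    optimizer.step()                 # update the parameters

print(model(X).round())              # ideally [[0.], [1.], [1.], [0.]]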

Lessons Learned:

SVMs are a neat way to make non-linear decision borders for classification!

Inheritance is also used in ML (ooooooooh scary OOP)

The differences between SVMs and Logistic Regression (the former is usually for a single output, while the latter is usually for probabilistic outputs).

HOG is pretty cool in the context of image feature extraction.

Resources

The course I followed:

And of course, ChatGPT