Learning Deep Learning Week 4

Week 15 of documenting my AI/ML learning journey (Jan 19 - Jan 25)

What was discussed last week…

  • What the “include_top” parameter does when loading a pre-trained model in Keras

  • The basics of transpose convolution, and its uses (e.g. super-resolution!)

Sunday, January 19th

So…transpose convolution!

I mentioned last week that transpose convolution is essentially the opposite of regular convolution, but how does it do that?

It does this by first dilating the original input tensor with zeros (as placeholders), and then convolving that dilated tensor with the learned weights (the kernel) from the neural network it’s in!

A good infographic about transpose convolution I found online
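To make that dilation step concrete, here’s a tiny sketch of my own (a made-up 2×2 input and a stride of 2, not code from the lab) of what the zero-insertion looks like before the learned kernel slides over the result:

import numpy as np

x = np.array([[1, 2],
              [3, 4]])
stride = 2

# Insert zeros between the original values; the learned kernel is then
# convolved over this larger, "dilated" grid.
dilated = np.zeros((x.shape[0] * stride - 1, x.shape[1] * stride - 1))
dilated[::stride, ::stride] = x  # drop the original values into every other cell
print(dilated)
# [[1. 0. 2.]
#  [0. 0. 0.]
#  [3. 0. 4.]]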

To get from the input tensor and the weights to an output tensor with a larger shape than the input tensor, the following equation determines the output’s dimensions:

OUTPUT DIMENSIONS: Transpose Convolution Output Size = (Input Size − 1) × Stride + Filter Size − 2 × Padding + Output Padding

-Keshav Aggarwal, Digital Ocean

This means that there are three variables left to decide: the stride (how many cells the kernel moves before making the next dot-product calculation), the kernel size (the length/height of the kernel square), and the padding (how many zeros will be inserted around the elements of the new matrix as “placeholders”). For example, given an input tensor shape of 6×6 and a required output tensor shape of 20×20, there are multiple ways to satisfy this equation, but many of them end up producing errors or badly formed tensors. Below is a quick sketch that enumerates the valid combinations, followed by the main failure modes:
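This little brute-force sketch (my own, assuming an output padding of 0) lists every (stride, kernel size, padding) combination that maps 6×6 to 20×20 under the equation above:

input_size, target = 6, 20
for stride in range(1, 7):
    for kernel_size in range(1, 9):
        for padding in range(0, 5):
            # Transpose convolution output-size equation (output padding = 0)
            if (input_size - 1) * stride + kernel_size - 2 * padding == target:
                print(f"stride={stride}, kernel_size={kernel_size}, padding={padding}")
# The example combinations below (4/2/1, 3/5/0, 4/4/2) all appear in this list.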

Checkerboard Patterns/Artifacts

This is a Checkerboard Artifact…

And so is this one!

Example variables: Stride = 4, Kernel Size = 2, Padding = 1

Proof: (6 − 1) × 4 + 2 − 2 × 1 = 20

Checkerboard Artifacts appear when the kernel size is considerably larger or smaller than the stride, causing excessive overlaps or holes between patches; some regions receive multiple contributions while others receive none, creating a visible “checkerboard pattern” across the whole output tensor.

Checkerboard Artifacts can be further categorized by whether the kernel size is significantly greater than the stride (Overlapping Artifact) or significantly smaller (Loss of Feature Detail).

Note that when the output tensor is really large (e.g. 1280×1024), Overlapping Artifacts can be hard to notice and effectively “diluted” when the difference between the kernel size and the stride is small (i.e. +1 or +2).

Boundary Artifacts

Example variables: Stride = 3, Kernel Size = 5, Padding = 0

Proof: (6 − 1) × 3 + 5 − 2 × 0 = 20

Boundary Artifacts usually happen when the padding is zero (or asymmetrical, i.e. a different amount on each side), because the patches on the outside edges of the output tensor don’t have enough room for their contributions due to the lack of padding (placeholder zeros). A good formula to avoid this type of artifact is:

(Ideal) Padding = (Kernel Size - 1) / 2 [round down if it's a decimal]
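As a quick sanity check of that rule of thumb (my own numbers, not from the lab), integer division does the rounding down for us:

for kernel_size in (2, 3, 4, 5):
    print(kernel_size, (kernel_size - 1) // 2)  # e.g. kernel size 5 -> ideal padding 2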

Mismatched Dimensions

I don’t think this is an accurate depiction of the patches of an output tensor with mismatched dimensions, but I can assure you that it will look pretty chaotic.

Example variables: Stride = 2, Kernel Size = 4, Padding = 3.3 (and much more…)

Proof (or more so a proof of not being a proof): (6 − 1) × 2 + 4 − 2 × 3.3 = 7.4 ≠ 20

Mismatched Dimensions occur when the chosen variables just don’t add up to the target output size in the equation; this kind of inequality leads to misaligned intersections, uneven density of patches, and just chaos, as you might expect from an unsatisfied equation.

You might be weary of witnessing all of these wrong and warped output tensors, so here is…

The Optimal Output Tensor in this case of 6×6 to 20×20

It’s beautiful! (the slight overlaps are to ensure smoothness)

Example variables: Stride = 4, Kernel Size = 4, Padding = 2

Proof: (6 − 1) × 4 + 4 − 2 × 2 = 20; also, (4 - 1) / 2 = 1 (rounded down)

Note that having the kernel size = stride + 1 or 2 is optimal for larger-scale tensors/images, but for our purposes (a 6×6 tensor/image to a 20×20 tensor/image), having a kernel size equal to the stride is better because of how small the tensors/images are, meaning that overlaps and checkerboard artifacts would be really easy to spot.

Wednesday, January 22nd

Today, I learned how to implement transpose convolution in code:

conv_layer = Conv2D(filters=32, kernel_size=(2, 2), activation='relu', padding='same')(input_layer) # This layer does the convolving, for the following layer to try to "reconstruct" via transpose convolution

transpose_conv_layer = Conv2DTranspose(filters=1, kernel_size=(3, 3), activation='sigmoid', padding='same')(conv_layer) # This line does the transposing (essentially trying to "undo" what the conv_layer did)

The Conv2DTranspose() layer performs the transpose convolution on the layer specified at the end in parentheses (i.e. conv_layer in this situation); here are some of its useful parameters (a short sketch showing their effect on the output shape follows the list):

  • filters (int): The dimensionality of the output space (i.e. the number of output channels/filters)

  • kernel_size (int, or tuple of ints): Specifies the size of the kernel that will be used (height, width)

  • strides (int, or tuple of ints): Specifies the stride length (how many units the kernel moves between placements)

  • activation (string): The type of activation function that will be applied

  • padding (string, ‘same’ or ‘valid’): ‘same’ pads evenly to the left/right and up/down so that the output’s spatial size works out to the input size × the stride, while ‘valid’ (the default) applies no padding.
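Here’s the short sketch I mentioned (a random 6×6 input and parameter values of my own choosing, not the lab’s) showing how strides and padding change the output shape of Conv2DTranspose:

import numpy as np
from tensorflow.keras.layers import Conv2DTranspose

x = np.random.rand(1, 6, 6, 1).astype("float32")  # a batch of one 6x6, single-channel "image"

# 'same' padding scales the spatial size by the stride: 6x6 -> 12x12
print(Conv2DTranspose(filters=1, kernel_size=4, strides=2, padding='same')(x).shape)   # (1, 12, 12, 1)

# 'valid' padding follows the full formula: (6 - 1) * 2 + 4 = 14 -> 14x14
print(Conv2DTranspose(filters=1, kernel_size=4, strides=2, padding='valid')(x).shape)  # (1, 14, 14, 1)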

It also turns out that the test data I used in the lab was some random numpy arrays:

X_test = np.random.rand(200, 28, 28, 1) 

y_test = X_test # the X and y data are the same because the model is trying to find what the original data was before convolution
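For context, here’s a minimal end-to-end sketch (my own reconstruction from the snippets above, not the lab’s exact code; shapes and hyperparameters are assumed) of how those pieces fit together into a trainable model:

import numpy as np
from tensorflow.keras.layers import Input, Conv2D, Conv2DTranspose
from tensorflow.keras.models import Model

input_layer = Input(shape=(28, 28, 1))
conv_layer = Conv2D(filters=32, kernel_size=(2, 2), activation='relu', padding='same')(input_layer)
transpose_conv_layer = Conv2DTranspose(filters=1, kernel_size=(3, 3), activation='sigmoid', padding='same')(conv_layer)

model = Model(inputs=input_layer, outputs=transpose_conv_layer)
model.compile(optimizer='adam', loss='mean_squared_error')

X_test = np.random.rand(200, 28, 28, 1)
model.fit(X_test, X_test, epochs=1, batch_size=32)  # target = input: the model tries to reconstruct the random "images"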

So as a human looking at the training data and the model’s predictions…

The images were just…static.

Thursday, January 23rd

In a post not too long ago, I outlined the fundamental steps in making a model in TensorFlow: one of those steps was compiling the model via the .compile() function (it comes after creating the model and before training it). However, I didn’t talk much about it, and there’s one specific parameter of the function that I want to talk about: the loss.

loss is short for “loss function”, a function that calculates how “bad” the model performed; loss functions are crucial in backpropagation, since the model assigns “blame” to certain neurons based on what the loss function returns: that’s how ANNs (Artificial Neural Networks) learn.

If you take a look at the code I used to compile my model yesterday:

model.compile(optimizer='adam', loss='mean_squared_error')

Mean squared error (MSE) was the loss function I used; MSE severely punishes outliers and is mainly used for regression models, but the lab used this loss function for a convolution-transpose convolution model because (a small numeric illustration of MSE follows this list):

  1. The output of this model is a continuous tensor (i.e. image), and MSE works well with continuous values (which is why it’s used often in regression models).

  2. MSE is one of the simplest error metrics (it has a very simple function compared to other error metrics), which makes it easy to understand.

  3. MSE also helps emphasize the effect variables such as the stride, kernel size, and padding have on the model, making the errors “easier to notice”.
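Here’s the small numeric illustration I mentioned (made-up numbers of my own) of why MSE “severely punishes” outliers: squaring the error makes one big miss dominate many small ones.

import numpy as np

y_true         = np.array([1.0, 2.0, 3.0, 4.0])
y_small_errors = np.array([1.1, 2.1, 2.9, 3.9])  # every prediction off by 0.1
y_one_outlier  = np.array([1.0, 2.0, 3.0, 7.0])  # one prediction off by 3.0

mse = lambda a, b: np.mean((a - b) ** 2)
print(mse(y_true, y_small_errors))  # ~0.01
print(mse(y_true, y_one_outlier))   # 2.25: the single outlier dominates the score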

Friday, January 24th

Today I just reviewed everything I’ve learned through the past couple of weeks, including transfer learning (pre-trained models + machine learning), image augmentation (via ImageDataGenerator), and transpose convolution (basically the reverse process of convolution).

In a reflection discussion prompt asking “How do these techniques help in scenarios with limited data availability?” I answered:

“Data Augmentation and Transpose Convolution help with making more "high quality" data for the model to be trained on so that it can become "smarter" and more well-suited for more new and ambiguous situations, while PT Models provide people with insufficient hardware requirements (like me!) with the tools to create powerful models suited to their specific problems and needs.”

-Brandon Kim, 2025

Lessons Learned

  • I learned about the many types of errors transpose convolution can fall into, and the “artifacts” that are visible signs of said errors.

  • I also learned more about the .compile() function: specifically, the versatility of its loss parameter.

Resources

Course I followed: