Learning Computer Vision Week 5
Week 5 of documenting my AI/ML learning journey (Oct 6 - Oct 12)
What was discussed last week…
Different ways people solved (or at least mitigated) the vanishing gradient problem in deep learning models
The DataLoader class from PyTorch
Popular neural network architectures
Goodness! It’s truly hard to keep up with these things. Writing a newsletter for every single day of learning can take more time than the initial learning process! I guess, in a way, I’m applying the Feynman technique, which is cool…
Thursday, October 10
While taking a course quiz, I learned a new concept: it’s called data augmentation. Whatever people feed into a neural network isn’t all going to be perfect, pristine, crystal-clear data with no imperfections. There will be image noise, some blur here and there, you get the point. Essentially, the aim of data augmentation is to simulate those imperfect, “natural” conditions: the training data is “messed up” a little bit to mimic the conditions the model will face in “public”. Augmentation can include anything from adding noise to blurring, scaling, skewing, rotating, and/or translating images.
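Here’s a rough sketch of what augmentation can look like, using plain NumPy (in practice you’d probably reach for something like torchvision’s transforms; the noise level and shift amount below are made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Return a few 'messed up' copies of a grayscale image array."""
    noisy = np.clip(image + rng.normal(0, 10, image.shape), 0, 255)  # additive noise
    flipped = image[:, ::-1]             # horizontal flip
    shifted = np.roll(image, 5, axis=1)  # crude translation
    return [noisy, flipped, shifted]

image = rng.integers(0, 256, (64, 64)).astype(float)
variants = augment(image)
print(len(variants))  # 3 augmented training samples from one original
```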
Beyond classifying a whole image, I learned how CV and ML models conduct object localization, which means “narrowing” the whole input image down to a “sub-image” that contains a certain object class, and object detection, which performs object localization for multiple sub-images and their classes in one input image. There are many ways computers can do this, including:
Sliding Windows
Sliding Windows is an object detection method where the model predetermines a certain sub-image size, say, 20 pixels by 50 pixels, and “slides” that sub-image “window” across the entire input image, classifying the sub-image at each window position until it reaches a position where the model classifies that window as the class we want (e.g. dog, cat, hotdog).
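The scan itself is simple enough to sketch in plain Python (the window size and stride here are made-up numbers, and each window would then be handed to some classifier):

```python
import numpy as np

def sliding_windows(image, win_h, win_w, stride):
    """Yield (top, left, window) for every window position in the image."""
    H, W = image.shape[:2]
    for top in range(0, H - win_h + 1, stride):
        for left in range(0, W - win_w + 1, stride):
            yield top, left, image[top:top + win_h, left:left + win_w]

image = np.zeros((100, 100))
positions = [(t, l) for t, l, _ in sliding_windows(image, 20, 50, 10)]
print(len(positions))  # 9 vertical x 6 horizontal = 54 windows to classify
```

Even this toy example hints at the cost: one small image already produces dozens of windows, each needing a classifier pass.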
Sliding Windows, however, runs into problems when photos are stretched, come from a different aspect ratio, or have objects that are right next to or even “inside” each other; that’s where Bounding Box comes in.
Bounding Box
Bounding Box, instead of using a fixed sub-image size, uses two points: the top-left-most point (ymin, xmin) and the bottom-right-most point (ymax, xmax) of the sub-image that the model will try to classify. The model then aims to predict the coordinates of those two points, which form a “bounding box” around the sub-image containing the image class that is wanted (e.g. dog, cat, hotdog).
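In code, the two-corner representation is just array slicing; a minimal sketch (the coordinates are made-up):

```python
import numpy as np

def crop_box(image, ymin, xmin, ymax, xmax):
    """Extract the sub-image inside a predicted bounding box."""
    return image[ymin:ymax, xmin:xmax]

image = np.arange(100 * 100).reshape(100, 100)
sub = crop_box(image, 10, 20, 40, 80)
print(sub.shape)  # (30, 60) -- box size falls out of the two corners
```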
After using either one (or more!) of these methods, a score, a kind of certainty in the class assigned to each sub-image, is calculated for every classified sub-image. A threshold can then be set so that the model only accepts detections above that score; low-scoring (and absurd) detections, like a person doing a plank exercise being classified as a cat, are rejected.
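The thresholding step is just a filter over (box, class, score) triples; a toy sketch with made-up detections:

```python
# Each detection: (bounding box, class label, confidence score)
detections = [
    ((10, 20, 40, 80), "dog", 0.92),
    ((55, 5, 90, 60), "cat", 0.31),   # low-score (absurd) detection
    ((12, 70, 45, 99), "dog", 0.75),
]

THRESHOLD = 0.5
accepted = [d for d in detections if d[2] >= THRESHOLD]
print(len(accepted))  # the 0.31 "cat" is rejected, 2 detections survive
```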
Friday, October 11
Today I dove into the concepts behind Haar-Cascade Classifiers, an ML method that is trained on both positive images (images of the target class) and negative images, i.e. background (sub)images. It’s also worth mentioning that this classifier is based on the Haar wavelet sequence, an advanced concept in math: here’s a video that I watched in an effort to understand the concepts of wavelets and stuff. Supposedly the classifier uses the sequence as convolutional kernels to extract features like lines and edges. There’s also the Integral Image concept, where, for each pixel, the value of the pixel (in the Integral Image) is the sum of the intensity values of all the pixels to the left of and above it, including the pixel itself. I really don’t know how the Integral Image concept applies to CV, honestly, I might’ve forgotten.
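(For what it’s worth, the usual answer is speed: once the integral image is built, the sum of pixel intensities over any rectangle takes only four lookups, which is what makes evaluating thousands of rectangular Haar features cheap.) A minimal NumPy sketch of the idea, using a tiny made-up image; OpenCV’s cv2.integral() computes the same thing, just padded with an extra row and column of zeros:

```python
import numpy as np

image = np.arange(16, dtype=np.int64).reshape(4, 4)

# Integral image: each entry is the sum of all pixels above and to the
# left of it, inclusive.
integral = image.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, y0, x0, y1, x1):
    """Sum of image[y0:y1+1, x0:x1+1] using only 4 lookups."""
    total = ii[y1, x1]
    if y0 > 0:
        total -= ii[y0 - 1, x1]
    if x0 > 0:
        total -= ii[y1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total

print(rect_sum(integral, 1, 1, 2, 2))  # 30, same as image[1:3, 1:3].sum()
```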
Machine Learning and Weird Math Concepts
A trend in computer science in general, and especially in more academic and complex fields such as machine learning and deep learning, is that there is almost no math concept that doesn’t apply to some shape or form of algorithm behind some program. And the deeper you go into computer science, the more complex the math behind the programs gets, too. For example, I pushed myself to learn the basics of multivariable calculus when I was learning about neural networks (backpropagation, to be specific) a couple months ago, and that experience was…interesting, so to say.
The interesting thing about the Haar-Cascade Classifier, though, is its use of progressive filtering. Progressive filtering is akin to guard if statements: the sub-image being classified is put through one classifier, and if that classifier deems the sub-image to be the class we are looking for, the sub-image moves on to the next classifier, until it either makes it through all of the classifiers and is therefore classified as the target class, or is discarded as not being the object class we are looking for.
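That guard-statement structure is easy to sketch. The stage functions below are made up for illustration (real cascade stages are learned Haar-feature classifiers), but the early-reject flow is the point:

```python
# Each "stage" is a cheap test; a window must pass every stage to be accepted.
def stage_has_edges(window):    return window["edges"] > 5
def stage_has_symmetry(window): return window["symmetry"] > 0.8
def stage_fine_detail(window):  return window["detail"] > 0.9

STAGES = [stage_has_edges, stage_has_symmetry, stage_fine_detail]

def cascade_classify(window):
    for stage in STAGES:
        if not stage(window):  # guard: reject early, skip remaining stages
            return False
    return True                # survived every stage -> target class

face_like  = {"edges": 9, "symmetry": 0.95, "detail": 0.93}
background = {"edges": 2, "symmetry": 0.10, "detail": 0.00}
print(cascade_classify(face_like), cascade_classify(background))  # True False
```

Since most windows in an image are background, most of them fail the first, cheapest stage, which is why the cascade is fast overall.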
Saturday, October 12
Today, I tinkered with some code that built a model using Haar-Cascade Classifiers, and thankfully, all of that functionality comes prepackaged in one function from the OpenCV (cv2) library: cv2.CascadeClassifier()!

A custom image I tested the model on (it did pretty well, not perfect though)
cv2.CascadeClassifier()
detector = cv2.CascadeClassifier(haar_name)
The cv2.CascadeClassifier() function takes in a filename (a string) as its only parameter, and it only accepts XML files.
urllib.request.urlretrieve() (may be deprecated soon)
haarcascade_url = 'https://raw.githubusercontent.com/andrewssobral/vehicle_detection_haarcascades/master/cars.xml'
haar_name = "cars.xml"
urllib.request.urlretrieve(haarcascade_url, haar_name)
This function comes from the urllib.request library; it retrieves the resource at the URL given as the first parameter, and if you want to be more specific, you can also pass a local file path (filename) to save it to as the second parameter: the full documentation can be found here.
.detectMultiScale() (a member function of CascadeClassifier)
def detect_obj(image):
    # clean your image
    plt_show(image)
    ## detect the car in the image
    object_list = detector.detectMultiScale(image)
    print(object_list)
    # for each car, draw a rectangle around it
    for obj in object_list:
        ...
There are many parameters that can be passed into the .detectMultiScale() member function, but the most essential and intuitive one is image, which is the image itself (as an array, not a URL) in which you want the model to detect the target object. More parameters are explained here.
By the way, while searching for info on the function through OpenCV’s docs, I realized it has one of the more “older” sites for AI library documentations:

Come on, use Sphinx or something! Why with the skeuomorphism on the navbar too?
Lessons Learned
CV models can conduct object localization and detection through different, but intuitive, strategies like sliding windows and bounding boxes.
In CS (computer science), there is almost no such thing as a “useless” math concept.
Resources
Course I followed: