The Code Who Cried Wolf
Sorry for the long wait! Lessons from my Science Fair Project, Part 13
What was discussed the week before last week…
Notebooks and Integrated Development Environments (IDEs) can both be used for programming, but they have different use cases.
Notebooks are better for teaching, giving lessons, and data visualization.
IDEs are better for larger-scale projects, collaborative programming, and developing highly technical programs.
Sticking with either notebooks or IDEs (raw code) for collaborative projects is crucial, as trying to bridge the gap between the two mediums can be cumbersome, as is the case when moving code from Google Colab into an IDE, for example.
As a review, segmentation models are machine learning models that differentiate the shape and position of the object in focus from the rest of the “background” data. Recently, I added a fourth dataset to the compiled dataset I use to train my segmentation model on 3D CT scans of the cervical (neck) spine.
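As a quick sketch of what that means in code (made-up arrays here, not my actual data), a CT scan and its segmentation are just two 3D arrays of the same shape: one holds voxel intensities, the other holds a class label per voxel:

import numpy as np

# Toy volume and label map (fake data, not from my compiled dataset).
volume = np.random.randn(64, 64, 64).astype(np.float32)  # stand-in CT intensities
labels = np.zeros((64, 64, 64), dtype=np.uint8)           # 0 = background

# Pretend a small block of voxels belongs to the C3 vertebra (class 3).
labels[20:30, 25:35, 25:35] = 3

# The segmentation model's job is to predict this label map from the raw volume.
print("voxels labeled C3:", int((labels == 3).sum()))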
Wed >> January 14
When I tried to resume training my model (training AI models can take days, even weeks), I couldn’t get it to resume for some reason. At first I thought it was because I had added another dataset to the compiled 3D cervical spine dataset, but no: it was because I had edited the patch size of my AI model.
A “patch size” is the size of the section of data from a training file that the AI model (in this case a segmentation model) processes at once, so it affects how much context the model can comprehend from each patch. If a patch size is too small, the model can’t “connect the dots” across the whole file as well, because it’s only looking at the most minuscule details of the file; but too large of a patch size will overload the computer’s resources, making it unable to train at all.
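Here is a minimal sketch of the idea (a placeholder array, not one of my real scans), showing what cropping a patch out of a full volume looks like and why a bigger patch size costs more memory:

import numpy as np

# Stand-in for a full 3D CT scan (zeros instead of real intensities).
scan = np.zeros((512, 512, 300), dtype=np.float32)

def extract_patch(volume, start, patch_size):
    """Crop one patch_size-shaped block out of the volume, starting at `start`."""
    z, y, x = start
    dz, dy, dx = patch_size
    return volume[z:z + dz, y:y + dy, x:x + dx]

small = extract_patch(scan, (0, 0, 0), (32, 32, 32))     # little context, cheap
large = extract_patch(scan, (0, 0, 0), (192, 192, 192))  # more context, more memory

# Rough memory cost of just the raw patch (float32 = 4 bytes), before the
# model's activations and gradients multiply that cost many times over:
for name, patch in [("small", small), ("large", large)]:
    print(name, patch.shape, f"{patch.nbytes / 1e6:.1f} MB")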
As I would find out later, there are also errors that can happen if a patch size is too “awkward” for a certain segmentation model, which is why I kept getting errors like this one (while trying to resume training with the patch size pushed as large as the computer could still handle):
File "/usr/local/lib/python3.12/dist-packages/torch/_refs/__init__.py", line 436, in _broadcast_shapes
torch._check(
File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 1695, in _check
_check_with(RuntimeError, cond, message)
File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 1677, in _check_with
raise error_type(message_evaluated)
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in function iadd>(*(FakeTensor(..., device='cuda:0', size=(5, 256, 22, 22, 22),
dtype=torch.float16, grad_fn=<ViewBackward0>), FakeTensor(..., device='cuda:0', size=(5, 256, 21, 21, 21),
dtype=torch.float16, grad_fn=<ViewBackward0>)), **{}): got RuntimeError('Attempting to broadcast a dimension of length 21 at -1! Mismatching argument at index 1 had torch.Size([5, 256, 21, 21, 21]); but expected shape should be broadcastable to [5, 256, 22, 22, 22]')

The main takeaway I got from this experience is that patch sizes can be frustratingly unpredictable when trying to get them to work with AI models: some sizes will work, and others will blow up with shape-mismatch errors like the one above.
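As a rough illustration of why that happens (using a stand-in pooling/upsampling pair, not my actual model architecture): segmentation networks typically halve each spatial dimension on the way down and double it on the way back up, so an odd dimension like 21 gets rounded down when halved and comes back one voxel short of the skip connection it has to match.

import torch
import torch.nn as nn

down = nn.MaxPool3d(kernel_size=2)  # halves each spatial dim (rounding down)
up = nn.Upsample(scale_factor=2)    # doubles each spatial dim

for side in (22, 21):
    x = torch.zeros(1, 1, side, side, side)
    restored = up(down(x))
    try:
        _ = x + restored  # mimics the skip-connection add that failed above
        print(side, "-> OK, shapes match:", tuple(restored.shape[2:]))
    except RuntimeError:
        print(side, "-> mismatch:", tuple(restored.shape[2:]), "vs", tuple(x.shape[2:]))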
Thurs >> January 15
With the local science fair happening soon, I went ahead and tested the three different models that I had trained to 110 epochs against test data that they had never seen before. The testing data was still drawn from the compiled dataset that I collected, to ensure that the error metrics reflected their true performance.
To my horror, they all got accuracies (Dice Similarity Coefficients, DSCs) of less than 0.75; in fact, they were around a pitiful 0.4 DSC, which is poor-performing no matter how you look at it! Why did this happen?
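For reference, the DSC just measures how much the predicted mask overlaps the ground-truth mask, from 0 (no overlap at all) to 1 (a perfect match). A minimal sketch with made-up masks:

import numpy as np

def dice(pred, truth):
    """Dice Similarity Coefficient between two binary masks (0 to 1)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

# Made-up ground truth: a block of voxels standing in for one vertebra.
truth = np.zeros((64, 64, 64), dtype=np.uint8)
truth[20:40, 20:40, 20:40] = 1

good_pred = truth.copy()           # perfect prediction
bad_pred = np.zeros_like(truth)
bad_pred[30:50, 30:50, 30:50] = 1  # shifted prediction, only partial overlap

print("good:", dice(good_pred, truth))  # 1.0
print("bad:", dice(bad_pred, truth))    # 0.125, far below 0.75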
Originally, I thought that these three models (“folds”) would perform well because they had a high “Pseudo DSC”: a DSC derived from each model’s current inferences of what the segmentation looks like, scored against the validation data instead of against the ground-truth testing data.
Let me explain; I have two theories:
Theory #1 Both the validation data and testing data include cases where some vertebrae are “cut off”, meaning that the model would have to learn the actual shapes of the different types of vertebrae as well as “cut-off” versions of them. This might pose a problem: vertebrae in real-life CT scans are most likely not “cut off”, and having the segmentation model try to segment vertebrae that may or may not be cut off makes it harder to differentiate which vertebra is which.

C5 (the fifth vertebra down from the brain stem) is cut off in this segmentation file.
Theory #2 When looking over the prediction error metrics (the “graded test results”, so to say), I found that the model was broken: the “FN” and “FP” (False Negative and False Positive, respectively) error metrics, for example, were both 0, which meant that the segmentation model didn’t even try to predict the shape of C1, the top-most vertebra. This led me to the conclusion that even though the “Pseudo DSC” was high, it was probably derived from the accuracy of the model’s predictions only on the parts of the file it had been trained on (I’m assuming the model is being trained on data from the “bottom up” relative to 3D space, because the files themselves are 3D data).
If that didn’t make sense, think of it like this: the “Pseudo DSC” is a teacher that tests the model (the “student”) only on the parts of the 3D data it has already looked over (in this case, the bottom parts of the 3D training data) and grades it based on that, instead of grading how well the model predicts the whole segmentation regardless of how much or how little of the 3D data it has seen.
"metrics": {
"1": {
"Dice": NaN, <- Likely caused by division by 0
"FN": 0,
"FP": 0,
"IoU": NaN, <- Same here
"TN": 205520896,
"TP": 0,
"n_pred": 0,
"n_ref": 0
},
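For what it’s worth, here is how Dice and IoU are normally computed from these confusion counts (I’m assuming my evaluation tool uses the same standard formulas), and why they come out as NaN when every count in the denominator is 0:

import math

# Standard definitions from confusion counts (TP, FP, FN). When the model
# never predicts the class (TP = FP = 0) and the reference is empty too
# (FN = 0), the denominator is 0 and the result is undefined, i.e. NaN.
def dice_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return math.nan if denom == 0 else 2 * tp / denom

def iou_from_counts(tp, fp, fn):
    denom = tp + fp + fn
    return math.nan if denom == 0 else tp / denom

# The counts from the class "1" (C1) entry above:
print(dice_from_counts(tp=0, fp=0, fn=0))  # nan
print(iou_from_counts(tp=0, fp=0, fn=0))   # nan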
The results that I had seen from my segmentation models using the “Pseudo DSC” were like watching a terrific house get built up, and then seeing those scores crumble under the weight of real testing data. It’s as if those former error metrics were the boy(s) who cried wolf.
Lessons Learned
Using test data is absolutely necessary for confirming a model’s performance: validation data or anything else isn’t the same as giving the model the testing data and forcing it to make an inference (like how a unit test is more accurate at assessing a student’s knowledge than a bunch of different, easier classwork assignments that are only focused on the current topic).
Patch sizes can make machine learning volatile, at least in my case when dealing with segmentation models.