New Research is Good Research

My Science Fair Journey, Part 6

What was discussed last week…

  • When working with AI models, a smaller amount of clean, high-quality data is better than a larger amount of "mixed"-quality data.

  • Or you can just try cleaning “dirty” data, which can take a while.

Wednesday, November 5th

I almost had a heart attack when I found out that some researchers at Duke just released a research paper about the very same topic I was working on .-.

Oh no! What do I do?

They used the same model framework, the same scope, everything! But on the bright side, their main goal was to create a comprehensive dataset (one eerily close to what I was doing) using their own sources and experimentation, meaning that it isn't just another cross-study. I'm basically doing a cross-study, so it would've been bad if all they had done was a cross-study too. Additionally, they explicitly mentioned that they researched and put together their ("homemade", if you will) dataset to help further efforts in the field. Wait, what? That's exactly where I could come in!

Now, although new research might not always be the best news, you can still get great finds out of it, like what I got today, and that, to me, is rewarding.

Friday, November 7th

Having a college friend or being at some kind of institution can be really helpful when it comes to research like this. I say that because a lot of valuable info, such as research papers and datasets, sits behind "paywalls" or other restrictions. For example, the dataset I discovered earlier this week was locked behind a college login, and I had to borrow my friend's GT (Georgia Tech) account login and verification in order to access it.

So it turns out new research is highly-guarded research too, huh?

Saturday, November 8th

While trying to clean some data today, I ran into a bottleneck in the "cleaning" process. I had to reformat, rename, and filter a lot of raw data, and when I tried to create a single program that does all of that in one fell swoop for a given dataset, it ended up running from the morning through the whole school day, which brought it to my attention that something was wrong.

Turns out, the program was getting stuck gzipping every file it went through, so to alleviate the issue, I created a separate program that gzipped all the files and ran it on its own. In doing so, I had implemented modular programming, a style of programming that splits an entire "operation" into multiple, broken-down parts (such as 5 files that together complete task X versus 1 file that tries to complete task X by itself), which helps resolve "bottlenecks" like the gzip situation I was in. A rough sketch of that split is below.
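To make the split concrete, here's a minimal sketch in Python. This isn't my actual program; the file names (compress_files.py, clean.py), the data/ folder, and the .csv extension are all made up for illustration. The idea is just that compression lives in its own file and runs by itself:

```python
# compress_files.py (hypothetical name; the real cleaning code isn't shown here).
# The gzip step, pulled out of the main cleaning script into its own module,
# so reformatting/renaming/filtering in clean.py never waits on compression.
import gzip
import shutil
from pathlib import Path

DATA_DIR = Path("data")  # assumed folder layout; point this at the real dataset

def gzip_file(path: Path) -> None:
    """Write a compressed .gz copy of `path` next to the original file."""
    with open(path, "rb") as src, gzip.open(path.with_name(path.name + ".gz"), "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the bytes so big files don't fill memory

if __name__ == "__main__":
    # Compress every raw data file in one pass; run this separately from clean.py.
    for raw_file in sorted(DATA_DIR.glob("*.csv")):
        gzip_file(raw_file)
        print(f"gzipped {raw_file}")
```

With the gzip step isolated like this, the cleaning script can finish on its own schedule, and if compression ever hangs again, only this one script gets stuck instead of the whole pipeline.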

[Figure] A representation of what a "bottleneck" could look like, with the circles representing data and the black regions representing the processing and flow.

Lessons Learned

No matter what topic or field you're in, always learning and researching can (no, will) be a valuable skill for problem-solving and innovating.

Modular programming is important for many reasons, one of them being that it keeps a single slow step from bottlenecking an entire pipeline.