By Bijan Bowen

Testing the vision capabilities of GPT-4o

After receiving the email notification that I could now access the GPT-4o API, I was eager to see how well the vision capabilities held up to what was displayed in the demonstration. Naturally, we always expect product demonstrations to show the best, and only the best, of what the product has to offer.

A picture of the R1 with the Razer Kiyo mounted on it.

Since the Ominous Industries R1 is the robot I currently sell for $199, I figured the best test would be to integrate the API's vision feature into the robot to give it a sort of "sightseeing ability," if you will.

The only camera hardware I had on hand for this was a slightly beat-up Razer Kiyo webcam. As far as webcams go, it is a solid contender, but a far cry from the high-end phone cameras used in the OpenAI demo, and likely much weaker than the image sources OpenAI intends to feed GPT-4o's vision capabilities.
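The integration itself is conceptually simple: grab a frame from the webcam, encode it, and send it to the vision API alongside a text prompt. A minimal sketch in Python is below; note that the original project ran on Android, so the function names here (`frame_to_data_url`, `describe_frame`) and the exact prompt are my own illustrative assumptions, not the author's code. The message shape with an `image_url` part is the standard way the OpenAI chat completions API accepts images.

```python
import base64

def frame_to_data_url(jpeg_bytes: bytes) -> str:
    """Encode raw JPEG bytes as a base64 data URL the vision API accepts."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"data:image/jpeg;base64,{b64}"

def describe_frame(client, jpeg_bytes: bytes) -> str:
    """Ask GPT-4o to describe one captured frame.

    `client` is an openai.OpenAI() instance; the image travels inline as a
    data URL inside a single user message, next to the text prompt.
    """
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what you see in this image."},
                {"type": "image_url",
                 "image_url": {"url": frame_to_data_url(jpeg_bytes)}},
            ],
        }],
    )
    return resp.choices[0].message.content
```

In a desktop setup, the JPEG bytes could come from something like OpenCV's `cv2.VideoCapture(0)` followed by `cv2.imencode(".jpg", frame)`; on the robot's Android side the capture path would differ, but the request shape stays the same.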

The R1 robot taking a picture, as indicated by the camera's lit LED.

It took a few hours to get everything working well enough to properly test. My first glimpse of the vision capabilities came while I was troubleshooting some Android logcat messages to figure out why the robot was not telling me what it saw. When I saw the log describing the image it had captured, I was blown away. It had described my entire space, down to details I did not expect to be picked up, especially by this webcam. Not only did it describe the elements of my space, including a Razor scooter leaned up against a couch and a ring LED light (which wasn't even directly facing the camera), but it also accurately described the contents of art hanging on walls 15 feet away.

The logcat output describing the image

I quickly fixed the programming mishap that kept the robot from saying what it was seeing and began to experiment. First, it recognized me holding a Mac Mini, describing the computer by name. That is impressive, but the Mac Mini is a pretty identifiable product. Next, I showed it a Raspberry Pi: it first described it as an SBC (single-board computer), then correctly identified it as a Raspberry Pi. I was extremely surprised by this. Someone on Reddit later shared a screenshot with me in which he showed the vision model an Orange Pi with the logos obscured, and it still identified it specifically as an Orange Pi.

The R1 robot seeing a Raspberry Pi.

I experimented further with some hand-drawn artifacts, which it picked up on easily. Next, I tried to confuse it by showing it a blonde natural acoustic bass, which it correctly identified. I then showed it a blonde natural acoustic guitar, which it once again identified with ease.

Finally, as can be seen in the intro of my video, I held a mannequin head in front of myself in a questionable NSFW position, and it responded by saying it could not help me with this image, suggesting it had fully understood my intent and was abiding by its content filters against describing NSFW material.

The R1 robot seeing a handwritten note.

Coming away from this, I am extremely impressed. The model was not simply identifying static elements in an image and naming them off; it was picking up on details and drawing conclusions from the sum of what it saw. For example, one image (not shown in the video) had it identify my living space, picking up on the studio lights in the background and suggesting this was some sort of workspace. The ability not only to identify objects with frankly scary precision but also to draw conclusions and infer context makes this an impressive, and somewhat terrifying, technology.

I must also mention that my experience with all of this came not from a top-of-the-line camera, but from a beat-up, years-old mid-range webcam. The implications of this technology are far-reaching, and I do wonder how it could be used to further understand human behavior, in the sense of "predicting" what someone will or will not do. I come away from this impressed, but more interested in the continued development of local, open-source models that can be run offline with a privacy-first mindset.

Here is a link to the experiment video on my YouTube channel.