By Bijan Bowen

Testing GPT-4o with Video Input

Following my recent exploration of the advanced vision capabilities of the most powerful AI models, I'm excited to delve deeper by incorporating actual video input into GPT-4o. However, the model cannot directly process video streams. To work around this limitation, the video is converted into a series of individual frames, which are then sent to the model for analysis. This approach enables the AI not only to evaluate each static image but also to synthesize information across multiple frames, effectively simulating an understanding of video content.
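The frames-as-images workaround can be sketched as follows: each JPEG-encoded frame is base64-encoded and attached as an image part of a single chat message, which is the format the OpenAI vision endpoints accept for inline images. The function name and structure here are my own illustration, not the exact script used in this article.

```python
import base64

def frames_to_message(frames_jpeg, prompt):
    """Pack a text prompt plus several JPEG frames into one chat message.

    `frames_jpeg` is a list of raw JPEG bytes. Each frame becomes a
    base64 data URL in an "image_url" content part, so the model sees
    all frames in a single request and can reason across them.
    """
    content = [{"type": "text", "text": prompt}]
    for jpeg in frames_jpeg:
        b64 = base64.b64encode(jpeg).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return [{"role": "user", "content": content}]
```

The returned list can be passed as the `messages` argument of a chat-completion call; sending the frames together in one message is what lets the model pick up on the "theme" across frames rather than treating each image in isolation.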

GPT-4o Vision

As I wanted to keep my testing similar to the static image testing, I went ahead and used the same Razer Kiyo webcam along with my Ominous Industries R1, which I sell for $199. Fortunately, since I had already implemented the image capabilities, it was not very difficult to modify my script to instead take a batch of images to analyze as video. I set the camera to record for 5 seconds at a time, which produced roughly 75 frames per 5-second interval.
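A minimal sketch of the 5-second capture window, assuming the webcam read is abstracted behind a `read_frame` callable (a hypothetical stand-in for something like OpenCV's `VideoCapture.read`) so the timing logic stands on its own:

```python
import time

def capture_clip(read_frame, seconds=5.0, clock=time.monotonic):
    """Collect frames from `read_frame()` for a fixed wall-clock window.

    `read_frame` is any zero-argument callable that returns one frame.
    At a webcam's ~15 fps effective rate, a 5-second window yields
    roughly 75 frames, matching the counts described above.
    """
    frames = []
    deadline = clock() + seconds
    while clock() < deadline:
        frames.append(read_frame())
    return frames
```

Injecting the clock as a parameter is just a convenience for testing the loop without real hardware; in practice the default `time.monotonic` is used.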

Logcat Output

To keep things simple and account for the latency involved, I had the robot send 10 randomly sampled frames from each 5-second interval to GPT-4o for analysis. The first things I tested were random actions: playing guitar, holding a mannequin arm, and holding the OM-47 Laser Blaster with a stressed expression. As with the static image tests, the results were accurate, and the model also described the room's surroundings correctly.
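The random sampling step can be sketched like this (the helper name is my own); sorting the sampled indices keeps the chosen frames in temporal order, so the model still sees the action unfold chronologically:

```python
import random

def sample_frames(frames, k=10, seed=None):
    """Pick k distinct frames at random, returned in temporal order."""
    rng = random.Random(seed)
    indices = sorted(rng.sample(range(len(frames)), k))
    return [frames[i] for i in indices]
```

Sampling 10 of ~75 frames keeps the request payload (and therefore the round-trip latency) manageable while still spanning the whole 5-second clip.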

First test with the Acoustic Guitar

The one thing that impressed me was when it correctly identified the "third arm" as a mannequin arm and inferred that my action was humorous and playful. It was interesting to see how the model interpreted the video frames sent to it with consideration for the "theme" of the actions, as well as the static actions themselves.

Mannequin arm

Finally, I decided to try something obscure. I picked up my trusty TI-108 calculator and held it up to my face, licking it suggestively. While the first attempt did not produce a speech output, I was able to see the response in Logcat: it described the frames as someone engaging in a perplexing activity with two mobile phones. It didn't correctly identify the object as a calculator, but it did describe the activity as perplexing, which I found pretty funny. On a second attempt, it identified the calculator as an electronic device and categorized the behavior as humorous.

The calculator

You can see the video relating to this article here.

