
By Bijan Bowen

CogVideoX 2B & 5B Models Fully Tested! Local AI Video Gen Just Got Way Better!

I recently stumbled upon a post on the r/LocalLLaMA subreddit that mentioned the CogVideoX text-to-video generation models. While CogVideo itself has been around for a very long time (by Gen AI standards), the repository was recently updated with 2B and 5B models, dubbed CogVideoX-2B and CogVideoX-5B. After a quick peek at the repository, I found myself impressed by some of the sample results and decided to look further into these models. Truthfully, I have learned to take most project samples with a grain of salt, as a lot of them are generated with high-end Nvidia cards in university labs, meaning they are not necessarily results a home user would be able to reproduce. What really made me stop and think were the requirements listed to actually run the CogVideoX models.

The CogVideo GitHub Repo

Until this point, my only other experience with locally run text-to-video generation was with Open Sora, which I had previously written about. While it was a fantastic first step into exploring this technology in a local environment, the requirements were very restrictive for anyone without a 24GB+ Nvidia card. I had anticipated a similar barrier to entry with CogVideoX, but I was very mistaken:

"You can run CogVideoX-2B on older GPUs like the GTX 1080TI, and run the CogVideoX-5B model on mid-range GPUs like the RTX 3060"

The CogVideoX Requirements

I was shocked to see such "accessible" requirements. While pricing is, of course, relative to one's geographical location, these requirements mean that anyone with an extra $200 can pick up a card and generate videos on their local machine, offline, without any subscription fees. To the best of my knowledge, the idea of generating even half-decent videos on a 1080TI would have been laughable just a few months ago. I had already planned to try the model, but seeing how accessible it was moved it to the top of my to-do list.

Though tempted to just stick with the 5B model and try for the best possible results, I felt it would be a disservice to myself and any others who stumble upon this to not take the time to compare the results of the 2B and 5B models. To do this, I simply generated a prompt (with the help of GPT) and ran it once on the 2B and once on the 5B. I feel this is a good time to briefly touch upon the install and setup process. Fortunately, I did not encounter any major hiccups while installing the requirements and downloading the models.
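For anyone who wants to script a side-by-side comparison like this, here is a minimal sketch. It assumes the Hugging Face diffusers library rather than whatever scripts produced the results in this article, and the model IDs, dtypes, seed, sampling settings and file names are all illustrative assumptions to check against the repository before use. The planning step is plain Python; `run_job` is the only part that needs a GPU and a model download.

```python
from dataclasses import dataclass
from pathlib import Path

# Hugging Face model IDs (assumed; verify against the CogVideo repository).
MODELS = {"2b": "THUDM/CogVideoX-2b", "5b": "THUDM/CogVideoX-5b"}


@dataclass(frozen=True)
class Job:
    model_key: str
    model_id: str
    prompt: str
    seed: int
    output: Path


def plan_comparison(prompts, out_dir="outputs", seed=42):
    """Return one job per (prompt, model) pair, reusing the same seed."""
    jobs = []
    for index, prompt in enumerate(prompts):
        for key, model_id in MODELS.items():
            output = Path(out_dir) / f"prompt{index:02d}_{key}.mp4"
            jobs.append(Job(key, model_id, prompt, seed, output))
    return jobs


def run_job(job):
    """Generate one clip with diffusers (needs a GPU and a model download)."""
    # Imported lazily so the planning code works without these packages.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    # Assumption: float16 for the 2B model, bfloat16 for the 5B model.
    dtype = torch.float16 if job.model_key == "2b" else torch.bfloat16
    pipe = CogVideoXPipeline.from_pretrained(job.model_id, torch_dtype=dtype)
    pipe.enable_model_cpu_offload()  # lower VRAM use at the cost of speed
    pipe.vae.enable_tiling()         # decode the video in tiles to save memory

    generator = torch.Generator().manual_seed(job.seed)
    frames = pipe(
        prompt=job.prompt,
        num_inference_steps=50,  # assumed sampling settings
        num_frames=49,
        guidance_scale=6.0,
        generator=generator,
    ).frames[0]

    job.output.parent.mkdir(parents=True, exist_ok=True)
    export_to_video(frames, str(job.output), fps=8)


if __name__ == "__main__":
    for job in plan_comparison(["POV of a street racer at night"]):
        print(job.model_id, "->", job.output)
        # run_job(job)  # uncomment on a machine with a suitable GPU
```

Keeping the prompt and seed fixed across both models means any difference in the output comes from the model size rather than from the inputs.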

I did experience a rather frustrating issue with the 5B model: after the generation had completed and the video was about to be saved, the terminal would simply close and no output mp4 would be placed in the directory. While this was usually resolved by simply running the generation again, it did end up wasting a bit of time, as the 5B model took about 9 minutes per generation and the 2B model took between 2 and 3 minutes. My only other note on actually running the models is to follow the repository's advice and use GPT to touch up your prompts. The video quality was EXTREMELY influenced by the quality of the prompt, and the more detail the better.
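Since the fix was usually just to run the generation again, that rerun can be automated. The sketch below is one possible workaround, not a diagnosis of the underlying crash: it launches the generation as a separate process, checks whether a non-empty video file was actually written, and tries again if not. The function name, the retry limit and the script name in the usage note are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def generate_with_retry(command, output_path, max_attempts=3):
    """Run `command` until `output_path` exists and is non-empty.

    Returns the number of attempts used. Raises RuntimeError if every
    attempt finishes without producing the file.
    """
    output = Path(output_path)
    for attempt in range(1, max_attempts + 1):
        # Remove stale output so an old file is not mistaken for success.
        output.unlink(missing_ok=True)
        # A separate process means a crash there cannot take this loop down.
        result = subprocess.run(command)
        if output.is_file() and output.stat().st_size > 0:
            return attempt
        print(
            f"attempt {attempt}: no video written "
            f"(exit code {result.returncode})",
            file=sys.stderr,
        )
    raise RuntimeError(f"no output at {output} after {max_attempts} attempts")
```

A call might look like `generate_with_retry([sys.executable, "my_generate_script.py"], "street_racer_5b.mp4")`, where `my_generate_script.py` stands in for whatever command you normally use to generate a clip. At roughly 9 minutes per 5B attempt, a small retry limit keeps a persistent failure from silently burning an afternoon.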

I began by having it generate a video from the POV of a street racer. Instead of trying to verbosely explain the generations, I will just place screenshots of them here for comparison (the full video examples can be seen in the YouTube video for this article).

First, the 2B model's output:

The 2B Model Output

Next, the 5B model's output:

The 5B Model Output

While I was impressed with the 2B model's result, especially compared to my experience with Open Sora, the 5B model absolutely blows it out of the water. Not only is its scenery more detailed, it also adhered to the prompt much more closely. The prompt had specifically mentioned the driver seeing his reflection in the mirror, and though the face did look a bit horror film-esque, it was clearly visible in the side-view mirror. This was truly an impressive result for something that was locally generated without pricey institutional hardware.

While I don't want it to seem like I am just promoting the video I did on this, in the interest of brevity I will not try to describe the differences between the rest of the results I generated, as they are best appreciated by watching the video. For those of us who would much rather read an article than watch a video (myself included), I will leave static images of the rest of the generations below, sans any commentary.

Next up, we have a scene of an airport in the mid-1950s.

First, the 2B model's output:

The 2B Model Output

Next, the 5B model's output:

The 5B Model Output

Following this, a LAN party full of RGB-decorated gaming computers.

First, the 2B model's output:

The 2B Model Output

Next, the 5B model's output:

The 5B Model Output

Finally, a horse dunking a basketball. This one was actually a bit interesting in terms of the comparison, as the 5B model's generation was less "exciting" than the 2B model's, though it did look better in terms of the specific details of the horse.

First, the 2B model's output:

The 2B Model Output

Next, the 5B model's output:

The 5B Model Output

Overall, I believe the CogVideoX models are an exciting milestone in accessible, locally run video generation. While the results are impressive, what really excites me is how much more accessible the models are in terms of hardware requirements. Being able to generate half-decent videos on cards that are years old makes me very optimistic about the future of locally run text-to-video generation.

You can view the video for this article on my YouTube Channel
