By Bijan Bowen
A Demo of Magentic-One agents working with Ollama and Llama 3.2 Vision
I was immediately interested when I heard about the Magentic-One release built on top of AutoGen. I have long been looking forward to a time when I can use an agent that autonomously browses the web and performs other actions. My first test of Magentic-One was by following the steps in the GitHub repo, which ultimately required an OpenAI API key. While I do have an API key and was happy to use it to try Magentic-One, my real interest in these sorts of tools rests on the assumption that they can be run locally.
After a good experience with Magentic-One and GPT-4o, I decided to dive into the code and attempt to get it working with the recent Llama-3.2-11B-Vision release on Ollama. While it would be possible to use a different model, the multimodal nature of 3.2-Vision meant the web surfer agent could work as it did in the demo, looking at screenshots of the browser and using what it sees to guide its actions. Since an 11B-parameter model is far, far less capable than the GPT-4o model the demo calls for, I wasn't sure how much work would be needed, or whether a functioning team of agents would even be possible with this model.
After changing the utils.py script to point to my local Ollama server and the vision model it was running (the gist of that change is sketched below), the agents would start and behave well, until it was time to browse the internet. The biggest hurdle I faced was with the web surfer agent. The initial behavior was sort of funny in hindsight: the model was not actually navigating the web but was hallucinating, completely making up everything it claimed to be doing. After scratching my head and trying to look up the websites it said it was visiting, I enabled the save-screenshots flag so I could see the images it was referencing, which showed that it never left the Google homepage.
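For reference, the Ollama side of that change essentially amounts to pointing an OpenAI-compatible client at the local server. Here is a minimal sketch of the idea, assuming Ollama's default port and the llama3.2-vision model tag; the actual client setup in Magentic-One's utils.py is structured differently:

    # Sketch only: Ollama exposes an OpenAI-compatible API, so the existing
    # client code can be redirected by swapping the base URL and model name.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",  # local Ollama server, default port
        api_key="ollama",                      # placeholder; Ollama ignores the key
    )

    response = client.chat.completions.create(
        model="llama3.2-vision",  # assumed tag for the 11B vision model
        messages=[{"role": "user", "content": "Describe this screenshot."}],
    )
    print(response.choices[0].message.content)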
To deal with the hallucinating, I had to make the text prompt in the web surfer script far more specific and guiding, as the model was happy to produce output that made it look like it was performing actions when in reality it was not. After a few iterations of a prompt centered on telling the model to output the names of the tools it wants to use in JSON, it began to get past the Google homepage and actually browse the web.
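As an illustration of the kind of constraint that helped, here is a sketch of such a prompt; the wording and tool names below are examples of my own, not the actual prompt or tool list from the web surfer script:

    # Sketch only: a stricter instruction that forces the model to commit to a
    # concrete tool call instead of narrating imaginary browsing.
    WEB_SURFER_PROMPT = """You control a real web browser and can only act by calling a tool.
    Do NOT describe pages or results you have not seen in the latest screenshot.
    Respond with a single JSON object and nothing else, for example:
    {"tool_name": "click", "tool_args": {"target_id": 12}}
    Available tools (examples): visit_url, click, input_text, scroll_down, summarize_page.
    """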
Overall, I had some decent results. The setup is somewhat functional, able to browse the web and find information thanks to the multimodal nature of the vision model. It still needs tweaking for web browsing, as it sometimes gets confused and cannot figure out how to click an element. The orchestrator agent works well, as does the coding agent. It is very cool to watch them interact knowing they are running on a local model in Ollama. For reference, below is the API cost incurred from experimenting for a while when it was still linked to GPT-4o; you can see why I wanted to use a local model instead.
The big upside to this is the cost savings. API usage gets expensive pretty quickly with the amount of "communication" between the agents, so if you have a system that can run Llama 3.2 Vision, being able to play around with this without worrying about API costs makes experimenting a lot more fun. I am eager to try this with a larger vision model and would love to see the results with Llama-3.2-90B-Vision.
To view the video demonstration of this, see here: Run AI Agents Locally with Ollama! (Llama 3.2 Vision & Magentic One)
For the GitHub Repo with these changes, see here: Magentic One Ollama Fork