Automating Complex PC Tasks With Multi-modal Large Language Models
PC-Agent automates complex PC tasks with Multi-modal Large Language Models, addressing the challenges of intricate workflows and interactive environments. Experimental results show a significant improvement in task success rates on the new PC-Eval benchmark.
Disclaimer: this report was generated with my tool: https://github.com/DTeam-Top/tsw-cli. Treat it as an experiment, not formal research 😄.

Summary

This paper introduces PC-Agent, a novel framework designed to automate complex tasks on PCs using Multi-modal Large Language Models (MLLMs). It addresses the challenges posed by the intricate interactive environments and workflows typical of PC applications, which are more demanding than those found on smartphones. PC-Agent incorporates an Active Perception Module (APM) to enhance perception of screen content and a h...