shlogg · Early preview
Foxgem @foxgem

Automating Complex PC Tasks With Multi-modal Large Language Models

PC-Agent uses Multi-modal Large Language Models (MLLMs) to automate complex PC tasks, addressing the challenges of intricate workflows and interactive environments. Experimental results show a significant improvement in task success rates on the new PC-Eval benchmark.

Disclaimer: this report was generated with my tool, https://github.com/DTeam-Top/tsw-cli. Treat it as an experiment, not formal research 😄.


  
  


  
  
Summary

This paper introduces PC-Agent, a novel framework designed to automate complex tasks on PCs using Multi-modal Large Language Models (MLLMs). It addresses the challenges posed by the intricate interactive environments and workflows typical of PC applications, which are more demanding than those found on smartphones. PC-Agent incorporates an Active Perception Module (APM) to enhance perception of screen content and a h...