> [!TIP]
> Objective: Enable multi-modal (vision/files) perception via CLI parameters and build a dynamically-loaded plugin-based skill system.
1. Letting the CLI “See” the World
Multi-modal capabilities don’t necessarily require complex SDKs. In VISAGENT, we leverage the gemini CLI’s native support for file-path references (@path) to achieve sensory integration at the RoleEngine layer:
```python
def _do_raw_invoke(self, message, files=None):
    # Construct the multi-modal suffix: one @path reference per file
    mm_suffix = ""
    if files:
        mm_suffix = "\n" + "\n".join(f"@{f}" for f in files)
    # Append to the final prompt
    full_input = f"{message}{mm_suffix}"
    # ... execute subprocess
```
Field Experience: To handle complex visual tasks, we encapsulated a dedicated vision_expert skill. By using the DEEP reasoning mode, we guide the AI through Chain-of-Thought reasoning, enabling precise identification of screenshots and UI components.
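Under this scheme, wiring files into the prompt is pure string assembly. Here is a minimal sketch of that step (`build_gemini_input` is an illustrative name, not part of VISAGENT's API):

```python
def build_gemini_input(message, files=None):
    # Append one @path reference per file so the gemini CLI
    # pulls each file into the model's context.
    if not files:
        return message
    mm_suffix = "\n".join(f"@{f}" for f in files)
    return f"{message}\n{mm_suffix}"
```

The resulting string is what `_do_raw_invoke` hands to the subprocess: the user message, followed by one `@path` reference per attached file on its own line.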
2. Dynamic Skill Tree: SkillHandler
The flexibility of a “hand-rolled” system lies in modularization. We designed the SkillHandler module to automatically discover all capabilities within the skills/ directory:
- Automated Discovery: Scans directories at startup.
- Manifest Specification: Each skill includes a manifest.json defining its functional description and Sovereignty Permissions.
- Sovereignty-Aware: During prompt injection, the system explicitly informs the AI about authorized FS paths and network domains, enforcing security constraints.
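The discovery and injection steps above can be sketched in a few lines. The manifest keys (`description`, `permissions`, `fs`, `net`) are assumptions about the schema, and `discover_skills` / `permissions_prompt` are illustrative names:

```python
import json
from pathlib import Path

def discover_skills(skills_dir="skills"):
    # Automated Discovery: every subdirectory holding a manifest.json
    # becomes a skill, keyed by its directory name.
    skills = {}
    for manifest_path in Path(skills_dir).glob("*/manifest.json"):
        skills[manifest_path.parent.name] = json.loads(
            manifest_path.read_text(encoding="utf-8"))
    return skills

def permissions_prompt(manifest):
    # Sovereignty-Aware injection: tell the AI exactly which FS paths
    # and network domains this skill is authorized to touch.
    perms = manifest.get("permissions", {})
    return (f"Authorized FS paths: {perms.get('fs', [])}\n"
            f"Allowed network domains: {perms.get('net', [])}")
```

At startup, the handler calls `discover_skills()` once and prepends each skill's `permissions_prompt(...)` to the system prompt when that skill is active.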
3. Flexible Hooks Mechanism
To insert logic before and after AI calls, we implemented a set of Python Hooks:
```python
def trigger_hook(self, role_name, hook_name, *args):
    module = self.skills_hooks.get(role_name)
    if module and hasattr(module, hook_name):
        return getattr(module, hook_name)(*args)
```
Common Use-Cases:
- Pre-invoke: Dynamically adjust the system prompt based on image content before sending.
- Post-invoke: Strictly validate the returned JSON or redact sensitive terms.
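Under this contract, a skill's hook module is just a plain Python file whose functions `trigger_hook` finds by name via `hasattr`/`getattr`. The sketch below assumes the hook names `pre_invoke` and `post_invoke`, matching the use-cases above (the exact names are an assumption):

```python
import json

# Hypothetical skills/<role>/hooks.py module.

def pre_invoke(message, files=None):
    # Pre-invoke: steer the prompt when images are attached.
    if files:
        return "Reason step by step about the attached images.\n" + message
    return message

def post_invoke(raw_output):
    # Post-invoke: strictly validate the returned JSON.
    try:
        return json.loads(raw_output)
    except json.JSONDecodeError:
        return {"error": "invalid JSON", "raw": raw_output}
```

Because hooks are looked up dynamically, a skill can implement either, both, or neither without any registration step.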
4. Skill Synthesis
This is the most hardcore part: during execution, if the Agent discovers a reusable pattern, it attempts to autonomously “synthesize” a new Skill JSON. This makes VISAGENT more than just an executor: it is a “digital artisan” that keeps evolving in the trenches.
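Mechanically, synthesis reduces to writing a new skill directory that the next discovery pass will pick up. A minimal sketch, assuming the manifest schema from section 2 (`synthesize_skill` and the field names are illustrative):

```python
import json
from pathlib import Path

def synthesize_skill(name, description, steps, skills_dir="skills"):
    # Persist a reusable pattern as a new skill; it becomes available
    # the next time SkillHandler scans the skills/ directory.
    skill_dir = Path(skills_dir) / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "description": description,
        "steps": steps,
        # Least privilege by default: no FS or network grants.
        "permissions": {"fs": [], "net": []},
    }
    manifest_path = skill_dir / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path
```

Starting every synthesized skill with empty permission grants keeps the sovereignty model intact: a new capability earns FS or network access explicitly, never by default.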
Conclusion
Through simple CLI parameter passing and dynamic directory scanning, we built a highly extensible perception and skill system. No complex containerized plugins needed—just a few lines of importlib calls, and the Agent gains infinite possibilities.
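The “few lines of importlib calls” can look like this sketch (`load_hook_module` is an illustrative name; VISAGENT's actual loader may differ):

```python
import importlib.util

def load_hook_module(role_name, hooks_path):
    # Load a skill's hooks.py as a live module at runtime;
    # trigger_hook can then look functions up on it by name.
    spec = importlib.util.spec_from_file_location(
        f"skills.{role_name}.hooks", hooks_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```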
Next part: Digital Metabolism — Zero-cost Architecture Maintenance and the Agent Self-repair loop.