> [!TIP]
> Objective: Enable multi-modal (vision/files) perception via CLI parameters and build a dynamically-loaded plugin-based skill system.
1. Letting the CLI “See” the World
Multi-modal capabilities don’t necessarily require complex SDKs. In VISAGENT, we leverage the gemini CLI’s native support for file-path references (@path) to achieve sensory integration at the RoleEngine layer:
```python
def _do_raw_invoke(self, message, files=None):
    # Construct the multi-modal suffix: one @path reference per file
    mm_suffix = ""
    if files:
        mm_suffix = "\n" + "\n".join(f"@{f}" for f in files)
    # Append to the final prompt
    full_input = f"{message}{mm_suffix}"
    # ... execute subprocess
```
Field Experience: To handle complex visual tasks, we encapsulated a dedicated vision_expert skill. By using the DEEP reasoning mode, we guide the AI through Chain-of-Thought reasoning, enabling precise identification of screenshots and UI components.
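Under this scheme, wiring files into the prompt is pure string assembly. Here is a minimal sketch of that step (`build_gemini_input` is an illustrative name, not part of VISAGENT's API):

```python
def build_gemini_input(message, files=None):
    # Append one @path reference per file so the gemini CLI
    # pulls each file into the model's context.
    if not files:
        return message
    mm_suffix = "\n".join(f"@{f}" for f in files)
    return f"{message}\n{mm_suffix}"
```

The resulting string is what `_do_raw_invoke` hands to the subprocess: the user message, followed by one `@path` reference per attached file on its own line.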
2. Dynamic Skill Tree: SkillHandler
The flexibility of a “hand-rolled” system lies in modularization. We designed the SkillHandler module to automatically discover all capabilities within the skills/ directory:
- Automated Discovery: Scans directories at startup.
- Manifest Specification: Each skill includes a manifest.json defining its functional description and Sovereignty Permissions.
- Sovereignty-Aware: During prompt injection, the system explicitly informs the AI about authorized FS paths and network domains, enforcing security constraints.
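The discovery and injection steps above can be sketched in a few lines. The manifest keys (`description`, `permissions`, `fs`, `net`) are assumptions about the schema, and `discover_skills` / `permissions_prompt` are illustrative names:

```python
import json
from pathlib import Path

def discover_skills(skills_dir="skills"):
    # Automated Discovery: every subdirectory holding a manifest.json
    # becomes a skill, keyed by its directory name.
    skills = {}
    for manifest_path in Path(skills_dir).glob("*/manifest.json"):
        skills[manifest_path.parent.name] = json.loads(
            manifest_path.read_text(encoding="utf-8"))
    return skills

def permissions_prompt(manifest):
    # Sovereignty-Aware injection: tell the AI exactly which FS paths
    # and network domains this skill is authorized to touch.
    perms = manifest.get("permissions", {})
    return (f"Authorized FS paths: {perms.get('fs', [])}\n"
            f"Allowed network domains: {perms.get('net', [])}")
```

At startup, the handler calls `discover_skills()` once and prepends each skill's `permissions_prompt(...)` to the system prompt when that skill is active.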
3. Flexible Hooks Mechanism
To insert logic before and after AI calls, we implemented a set of Python Hooks:
```python
def trigger_hook(self, role_name, hook_name, *args):
    module = self.skills_hooks.get(role_name)
    if module and hasattr(module, hook_name):
        return getattr(module, hook_name)(*args)
```
Common Use-Cases:
- Pre-invoke: Dynamically adjust the system prompt based on image content before sending.
- Post-invoke: Strictly validate the returned JSON or redact sensitive terms.
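Under this contract, a skill's hook module is just a plain Python file whose functions `trigger_hook` finds by name via `hasattr`/`getattr`. The sketch below assumes the hook names `pre_invoke` and `post_invoke`, matching the use-cases above (the exact names are an assumption):

```python
import json

# Hypothetical skills/<role>/hooks.py module.

def pre_invoke(message, files=None):
    # Pre-invoke: steer the prompt when images are attached.
    if files:
        return "Reason step by step about the attached images.\n" + message
    return message

def post_invoke(raw_output):
    # Post-invoke: strictly validate the returned JSON.
    try:
        return json.loads(raw_output)
    except json.JSONDecodeError:
        return {"error": "invalid JSON", "raw": raw_output}
```

Because hooks are looked up dynamically, a skill can implement either, both, or neither without any registration step.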
4. Skill Synthesis
This is the most hardcore part: during execution, if the Agent discovers a reusable pattern, it attempts to autonomously “synthesize” a new Skill JSON. This makes VISAGENT more than just an executor: it is a “digital artisan” that keeps evolving in the trenches.
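Mechanically, synthesis reduces to writing a new skill directory that the next discovery pass will pick up. A minimal sketch, assuming the manifest schema from section 2 (`synthesize_skill` and the field names are illustrative):

```python
import json
from pathlib import Path

def synthesize_skill(name, description, steps, skills_dir="skills"):
    # Persist a reusable pattern as a new skill; it becomes available
    # the next time SkillHandler scans the skills/ directory.
    skill_dir = Path(skills_dir) / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "description": description,
        "steps": steps,
        # Least privilege by default: no FS or network grants.
        "permissions": {"fs": [], "net": []},
    }
    manifest_path = skill_dir / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path
```

Starting every synthesized skill with empty permission grants keeps the sovereignty model intact: a new capability earns FS or network access explicitly, never by default.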
Conclusion
Through simple CLI parameter passing and dynamic directory scanning, we built a highly extensible perception and skill system. No complex containerized plugins needed—just a few lines of importlib calls, and the Agent gains infinite possibilities.
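The “few lines of importlib calls” can look like this sketch (`load_hook_module` is an illustrative name; VISAGENT's actual loader may differ):

```python
import importlib.util

def load_hook_module(role_name, hooks_path):
    # Load a skill's hooks.py as a live module at runtime;
    # trigger_hook can then look functions up on it by name.
    spec = importlib.util.spec_from_file_location(
        f"skills.{role_name}.hooks", hooks_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```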
Next part: Digital Metabolism — Zero-cost Architecture Maintenance and the Agent Self-repair loop.