The AI agent world is now crowded with frameworks that promise autonomy, orchestration, reasoning, and "self-improvement," but most of them are really just glorified tool-call loops in a trench coat. The more interesting category is much smaller: frameworks that ship with a native self-improvement loop, meaning they can critique themselves, remember what worked, optimize future behavior, or revise outputs without you hand-building the whole mechanism from scratch.
That distinction matters because "can loop" and "can learn from the loop" are not the same thing. A framework that retries because a tool call failed is useful; a framework that stores a lesson, rewrites a skill, or evolves a prompt for the next run is far more interesting.
It also turns out the global leaderboard looks different when you stop pretending the world ends at Silicon Valley. OpenClaw, Dify, and MetaGPT all matter a lot in global open source, and any ranking that ignores that is basically a local weather report pretending to be climate science.
## The Leaderboard
The list below ranks the most popular globally relevant frameworks and tools that ship some native self-improvement, self-correction, or self-evolving mechanism today. Star counts are directional, taken from public GitHub snapshots and star-tracking pages on April 14, 2026.
| Name | What it is | GitHub stars today | Why the self-improvement loop matters |
|---|---|---|---|
| OpenClaw | Persistent open-source AI assistant with "Dreaming" memory. | ~356k | It writes back persistent memory after interactions, so future behavior changes based on prior experience rather than only current context. |
| AutoGPT | The original mainstream autonomous-agent project. | ~183k | It popularized the think-act-observe cycle with built-in iteration and self-critique, even if its current momentum lags its historic fame. |
| Dify | Open-source visual agent and workflow platform with huge global adoption. | ~138k | ReAct-style loops are native in the workflow system, so evaluation and retries are part of the product rather than custom plumbing. |
| LangChain | The largest LLM app ecosystem and orchestration layer overall. | ~128k | It matters because of sheer adoption, but its native loop engine is really LangGraph, not base LangChain. |
| MetaGPT | Multi-agent "AI software company" framework with SOP-driven roles. | ~64k | Architect, engineer, QA, and debugging loops are part of the built-in operating model, which makes correction feel structural instead of improvised. |
| AG2 / AutoGen | Multi-agent conversational framework descended from Microsoft AutoGen. | ~50k | Reflection is a native pattern through writer-critic or planner-executor-reviewer conversations. |
| CrewAI | Role-based multi-agent orchestration with heavy production usage. | ~47k | Reviewer roles and conditional flows make "do, review, revise" a first-class pattern. |
| Hermes Agent | Nous Research's self-improving agent built to "grow with you." | ~44k | It writes reusable skills from completed work, stores them in memory, and pairs with an optimizer stack that evolves prompts and behaviors over time. |
| DSPy | Stanford's framework for LLM optimization rather than manual prompt crafting. | ~33k | It is the strongest native prompt-evolution engine in the open, with GEPA and other optimizers that mutate and score prompt candidates automatically. |
| Smolagents | Hugging Face's minimal code-first agent framework. | ~26k | CodeAct-style execution loops make failures concrete, which gives the agent better raw material for self-correction than vague natural-language retries. |
| LangGraph | Stateful graph runtime for durable long-running agents. | ~25k | Retry, branch, checkpoint, and re-plan loops are explicit and auditable, which is why it is more important in production than its star count alone suggests. |
| OpenAI Agents SDK | OpenAI's official lightweight agent runtime. | ~19k | It natively loops through tool use until completion, but it is more "run until done" than "reflect and evolve" unless you add evaluator agents. |
The headline is simple: OpenClaw wins on raw global traction, Hermes Agent is one of the clearest examples of native self-improvement by design, DSPy is the most serious optimization engine, and LangGraph remains the grown-up choice when you want loops that operations teams can actually inspect.
## Not All Loops Are Equal
There are at least four different animals hiding under the label "self-improving agent." First, you have retry loops, where the system simply runs again after a failure. Second, you have reflection loops, where one agent or step critiques the output before the next pass.
Third, you have memory loops, where the system stores a lesson that can influence a future run. Fourth, you have optimizer loops, where prompts, tool instructions, or behaviors get automatically mutated and evaluated against a metric so the agent literally improves its setup over time.
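The distinction between the four is easiest to see in code. Here is a minimal Python sketch, with `run_task`, `critique`, `extract_lesson`, and `score` as hypothetical stand-ins for whatever a real framework provides:

```python
# Four loop flavors, sketched with hypothetical stand-in callables.
# What distinguishes them is what survives after the loop ends.

def retry_loop(run_task, attempts=3):
    """Retry: nothing persists; the system simply runs again on failure."""
    for _ in range(attempts):
        result = run_task()
        if result is not None:
            return result
    return None

def reflection_loop(run_task, critique, rounds=2):
    """Reflection: a critique steers the next pass, but only inside this task."""
    feedback, result = None, None
    for _ in range(rounds):
        result = run_task(feedback)
        feedback = critique(result)
        if feedback is None:  # the critic is satisfied
            return result
    return result

def memory_loop(run_task, extract_lesson, memory):
    """Memory: a lesson outlives the task and shapes future runs."""
    result = run_task(memory)
    lesson = extract_lesson(result)
    if lesson:
        memory.append(lesson)  # persists across tasks
    return result

def optimizer_loop(candidates, score):
    """Optimizer: the configuration itself is mutated and selected on a metric."""
    return max(candidates, key=score)
```

Only the last two persist anything beyond the current call, which is the line between self-correcting and self-improving.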
That is why star count alone is not enough. AutoGPT is still enormous by stars, but Hermes Agent plus DSPy is closer to what most people mean when they say "self-evolving."
## The Reflection Loop
The first pattern is the classic reviewer loop: a user asks for something, an orchestrator delegates to a worker, a critic reviews the result, and the orchestrator decides whether to send it back for revision or ship it. This is the design language behind AG2 / AutoGen and a lot of well-structured CrewAI systems.
Because FlowZap sequence diagrams demand explicit multi-lane chronology, explicit handles, and request-response "ping-pong" between lanes, the loop is clearer when the orchestrator acts as the hub instead of pretending the worker and critic magically read each other's minds.
user { # User
n1: circle label="Start request"
n2: rectangle label="Send task"
n4: rectangle label="Receive acknowledgement"
n22: rectangle label="Receive final answer"
n1.handle(right) -> n2.handle(left)
n2.handle(bottom) -> orchestrator.n3.handle(top) [label="Task"]
n4.handle(right) -> n22.handle(left)
}
orchestrator { # Orchestrator
n3: rectangle label="Acknowledge task"
n5: rectangle label="Dispatch drafting job"
n8: rectangle label="Receive draft"
n9: rectangle label="Send for review"
n12: rectangle label="Receive critique"
n13: rectangle label="Request revision"
n16: rectangle label="Receive revised draft"
n17: rectangle label="Request final review"
n20: rectangle label="Receive approval"
n21: rectangle label="Send final answer"
n3.handle(top) -> user.n4.handle(bottom) [label="Ack"]
n3.handle(right) -> n5.handle(left)
n5.handle(bottom) -> worker.n6.handle(top) [label="Draft task"]
n8.handle(right) -> n9.handle(left)
n9.handle(bottom) -> critic.n10.handle(top) [label="Review draft"]
n12.handle(right) -> n13.handle(left)
n13.handle(bottom) -> worker.n14.handle(top) [label="Revise"]
n16.handle(right) -> n17.handle(left)
n17.handle(bottom) -> critic.n18.handle(top) [label="Review again"]
n20.handle(right) -> n21.handle(left)
n21.handle(top) -> user.n22.handle(bottom) [label="Final"]
}
worker { # Worker Agent
n6: rectangle label="Generate draft"
n7: rectangle label="Return draft"
n14: rectangle label="Revise draft"
n15: rectangle label="Return revision"
n6.handle(right) -> n7.handle(left)
n7.handle(top) -> orchestrator.n8.handle(bottom) [label="Draft ready"]
n14.handle(right) -> n15.handle(left)
n15.handle(top) -> orchestrator.n16.handle(bottom) [label="Revision ready"]
}
critic { # Critic Agent
n10: rectangle label="Review draft"
n11: rectangle label="Return critique"
n18: rectangle label="Approve revision"
n19: rectangle label="Return approval"
n10.handle(right) -> n11.handle(left)
n11.handle(top) -> orchestrator.n12.handle(bottom) [label="Critique"]
n18.handle(right) -> n19.handle(left)
n19.handle(top) -> orchestrator.n20.handle(bottom) [label="Approved"]
}
This is the friendliest form of self-correction because it matches how teams already work: someone does the thing, someone else pokes holes in it, and then it comes back less embarrassing. It is also the easiest one to fake, because if the critic is weak, you did not build a learning loop — you built a meeting.
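Stripped of the lane choreography, the same loop fits in a few lines. This is a framework-agnostic sketch, with `draft`, `revise`, and `review` as hypothetical callables rather than any particular framework's API:

```python
def reviewer_loop(draft, revise, review, max_rounds=3):
    """Orchestrator hub: draft, send to the critic, revise until approved.

    `draft`, `revise`, and `review` are hypothetical callables; `review`
    returns None when the output is approved, else a critique string.
    """
    output = draft()
    for _ in range(max_rounds):
        critique = review(output)
        if critique is None:
            return output, "approved"
        output = revise(output, critique)
    # Never ship silently after exhausting rounds; surface the state.
    return output, "unapproved"


state = {"rounds": 0}

def review(text):
    state["rounds"] += 1
    return None if "conclusion" in text else "add a conclusion"

result, status = reviewer_loop(
    draft=lambda: "intro and body",
    revise=lambda text, note: text + " and conclusion",
    review=review,
)
```

Returning an explicit status instead of shipping silently keeps the weak-critic failure mode visible instead of buried.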
## The Memory Loop
The second pattern is where things get juicier: the agent does not just revise inside the current task; it stores something reusable that changes the next task. That is the lane where OpenClaw's "Dreaming" memory and Hermes Agent's skill-writing design become much more interesting than ordinary retry logic.
This diagram gives the memory system its own lane, instead of reducing it to a vague invisible database blob. That matters because the whole point of this category is that memory is not decoration; memory is part of the control loop.
user { # User
n1: circle label="Start task"
n2: rectangle label="Submit complex task"
n4: rectangle label="Receive acknowledgement"
n26: rectangle label="Receive improved result"
n1.handle(right) -> n2.handle(left)
n2.handle(bottom) -> planner.n3.handle(top) [label="Task"]
n4.handle(right) -> n26.handle(left)
}
planner { # Planner
n3: rectangle label="Acknowledge task"
n5: rectangle label="Ask for similar lessons"
n8: rectangle label="Receive prior lessons"
n9: rectangle label="Build plan"
n10: rectangle label="Dispatch execution"
n13: rectangle label="Receive execution output"
n14: rectangle label="Request evaluation"
n17: rectangle label="Receive evaluation"
n18: diamond label="Good enough?"
n19: rectangle label="Store new lesson"
n22: rectangle label="Confirm lesson saved"
n23: rectangle label="Send retry plan"
n25: rectangle label="Send improved result"
n3.handle(top) -> user.n4.handle(bottom) [label="Ack"]
n3.handle(right) -> n5.handle(left)
n5.handle(bottom) -> memory.n6.handle(top) [label="Retrieve lessons"]
n8.handle(right) -> n9.handle(left)
n9.handle(right) -> n10.handle(left)
n10.handle(bottom) -> executor.n11.handle(top) [label="Execute plan"]
n13.handle(right) -> n14.handle(left)
n14.handle(bottom) -> evaluator.n15.handle(top) [label="Evaluate output"]
n17.handle(right) -> n18.handle(left)
n18.handle(right) -> n19.handle(left) [label="Yes"]
n18.handle(bottom) -> n23.handle(top) [label="No"]
n19.handle(bottom) -> memory.n20.handle(top) [label="Save lesson"]
n22.handle(right) -> n25.handle(left)
n23.handle(bottom) -> executor.n24.handle(top) [label="Retry with lessons"]
n25.handle(top) -> user.n26.handle(bottom) [label="Improved result"]
}
memory { # Memory Store
n6: rectangle label="Search memory"
n7: rectangle label="Return best lessons"
n20: rectangle label="Persist new lesson"
n21: rectangle label="Confirm persistence"
n6.handle(right) -> n7.handle(left)
n7.handle(top) -> planner.n8.handle(bottom) [label="Lessons"]
n20.handle(right) -> n21.handle(left)
n21.handle(top) -> planner.n22.handle(bottom) [label="Saved"]
}
executor { # Executor
n11: rectangle label="Run first attempt"
n12: rectangle label="Return output"
n24: rectangle label="Run improved attempt"
n27: rectangle label="Return improved output"
n11.handle(right) -> n12.handle(left)
n12.handle(top) -> planner.n13.handle(bottom) [label="Output"]
n24.handle(right) -> n27.handle(left)
n27.handle(top) -> evaluator.n28.handle(bottom) [label="Improved output"]
}
evaluator { # Evaluator
n15: rectangle label="Judge first output"
n16: rectangle label="Return verdict"
n28: rectangle label="Judge improved output"
n29: rectangle label="Return final verdict"
n15.handle(right) -> n16.handle(left)
n16.handle(top) -> planner.n17.handle(bottom) [label="Verdict"]
n28.handle(right) -> n29.handle(left)
n29.handle(top) -> planner.n19.handle(bottom) [label="Pass"]
}
This is where self-improvement starts to feel less like marketing and more like machinery. OpenClaw stores learned preferences and insights in persistent memory, while Hermes Agent goes further by turning successful completions into reusable skill artifacts that can be loaded when a similar task appears again.
In plain English: the system builds a playbook. That is a much bigger deal than "the model tried once more."
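The control flow above can be sketched without any particular framework. `LessonStore` below is a toy stand-in for OpenClaw-style memory or Hermes-style skill files, which use far richer retrieval; the point is only that the lesson outlives the task:

```python
class LessonStore:
    """Toy persistent memory: lessons keyed by a task tag.

    A stand-in for real memory systems, which retrieve by embedding
    similarity or skill manifests rather than exact tags.
    """
    def __init__(self):
        self._lessons = {}  # tag -> list of lesson strings

    def retrieve(self, tag):
        return self._lessons.get(tag, [])

    def persist(self, tag, lesson):
        self._lessons.setdefault(tag, []).append(lesson)


def run_with_memory(task_tag, execute, evaluate, store):
    """Planner loop from the diagram: retrieve lessons, execute,
    evaluate, retry once with feedback, and persist what worked."""
    lessons = store.retrieve(task_tag)
    output = execute(lessons)
    verdict = evaluate(output)
    if not verdict["ok"]:
        # Retry with the evaluator's feedback folded in as a lesson.
        output = execute(lessons + [verdict["feedback"]])
        verdict = evaluate(output)
    if verdict["ok"]:
        store.persist(task_tag, verdict.get("lesson", "approach worked"))
    return output
```

The second run with the same tag starts from the stored lesson, which is the whole difference between a retry loop and a memory loop.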
## The Evolution Loop
The third pattern is the most ambitious one: instead of just revising an answer or storing a memory, the system improves the instructions that drive future performance. This is the territory where DSPy becomes a star and where Hermes Agent's self-evolution layer starts looking less like a workflow and more like a miniature training pipeline.
A good mental model is not "agent talks to tool" but "optimizer runs experiments on an agent configuration." That means you need more participants: a benchmark owner, an optimizer, a candidate agent, an evaluation harness, and a metrics store.
product { # Product Team
n1: circle label="Start benchmark run"
n2: rectangle label="Submit eval set"
n3: rectangle label="Receive run acknowledgement"
n4: rectangle label="Receive upgraded prompt pack"
n1.handle(right) -> n2.handle(left)
n2.handle(bottom) -> optimizer.n5.handle(top) [label="Eval set"]
n3.handle(right) -> n4.handle(left)
}
optimizer { # Optimizer
n5: rectangle label="Acknowledge benchmark"
n6: rectangle label="Create prompt candidate"
n7: rectangle label="Receive run traces"
n8: rectangle label="Request scoring"
n9: rectangle label="Receive score"
n10: diamond label="Score high enough?"
n11: rectangle label="Mutate prompt"
n12: rectangle label="Log winning config"
n13: rectangle label="Receive log confirmation"
n14: rectangle label="Publish winner"
n5.handle(top) -> product.n3.handle(bottom) [label="Ack"]
n5.handle(right) -> n6.handle(left)
n6.handle(bottom) -> candidate.n15.handle(top) [label="Prompt candidate"]
n7.handle(right) -> n8.handle(left)
n8.handle(bottom) -> evaluator.n19.handle(top) [label="Score traces"]
n9.handle(right) -> n10.handle(left)
n10.handle(bottom) -> n11.handle(top) [label="No"]
n11.handle(bottom) -> candidate.n17.handle(top) [label="New candidate"]
n10.handle(right) -> n12.handle(left) [label="Yes"]
n12.handle(bottom) -> metrics.n23.handle(top) [label="Log winner"]
n13.handle(right) -> n14.handle(left)
n14.handle(top) -> product.n4.handle(bottom) [label="Winning pack"]
}
candidate { # Candidate Agent
n15: rectangle label="Run benchmark tasks"
n16: rectangle label="Return execution traces"
n17: rectangle label="Run mutated candidate"
n18: rectangle label="Return new traces"
n15.handle(right) -> n16.handle(left)
n16.handle(top) -> optimizer.n7.handle(bottom) [label="Traces"]
n17.handle(right) -> n18.handle(left)
n18.handle(top) -> evaluator.n21.handle(bottom) [label="New traces"]
}
evaluator { # Eval Harness
n19: rectangle label="Score first candidate"
n20: rectangle label="Return score"
n21: rectangle label="Score mutated candidate"
n22: rectangle label="Return improved score"
n19.handle(right) -> n20.handle(left)
n20.handle(top) -> optimizer.n9.handle(bottom) [label="Score"]
n21.handle(right) -> n22.handle(left)
n22.handle(top) -> optimizer.n12.handle(bottom) [label="Best score"]
}
metrics { # Metrics Store
n23: rectangle label="Persist winning config"
n24: rectangle label="Confirm persistence"
n23.handle(right) -> n24.handle(left)
n24.handle(top) -> optimizer.n13.handle(bottom) [label="Logged"]
}
This is the most "self-evolving" pattern in the whole category because the system is not merely changing an answer; it is changing the recipe that produces future answers. DSPy's optimizers such as GEPA and MIPROv2 are exactly about this kind of structured prompt and program evolution.
It is also the point where sloppy evaluation will absolutely ruin your day. A bad optimizer loop does not just fail once; it can confidently teach the system the wrong habit at scale.
## So Who Really Deserves the Hype?
If you care about raw global popularity, OpenClaw is the loudest story on the board right now. If you care about visual productized workflows with native loops, Dify earns its spot because it brought iterative agent behavior to a huge worldwide audience instead of hiding it behind framework-only ergonomics.
If you care about enterprise reliability and explicit loop control, LangGraph remains one of the most credible choices because it makes state, retries, and re-entry visible instead of magical. If you care about reviewer-style multi-agent correction, AG2 / AutoGen still has one of the cleanest conceptual models.
But if the question is, "Which frameworks feel closest to agents that genuinely get better over time?" then the most interesting stack is Hermes Agent + DSPy. Hermes gives you skill accumulation and persistent improvement; DSPy gives you systematic optimization instead of artisanal prompt fiddling.
The easiest way to say it is this: some agent frameworks retry, some reflect, some remember, and a few actually evolve. The future belongs to the ones that can do all four without turning your codebase into an archaeological dig.
## Inspirations
- https://o-mega.ai/articles/self-improving-ai-agents-the-2026-guide
- https://news.ycombinator.com/item?id=46509130
- https://www.gradually.ai/en/openclaw-statistics/
- https://microsoft.github.io/autogen/0.2/docs/topics/prompting-and-reasoning/reflection/
- https://github.com/stanfordnlp/dspy/blob/main/docs/docs/learn/optimization/optimizers.md
- https://agentlas.pro/frameworks/openai-agents-sdk/
- https://github.com/NousResearch/hermes-agent-self-evolution/blob/main/PLAN.md
- https://www.star-history.com/openclaw/openclaw
- https://github.com/EvanLi/Github-Ranking/blob/master/Top100/Top-100-stars.md
- https://nerq.ai/profile/foundationagents-metagpt
- https://www.star-history.com/significant-gravitas/autogpt
- https://www.star-history.com/stanfordnlp/dspy/
- https://www.ultralytics.com/glossary/auto-gpt
- https://github.com/Significant-Gravitas/AutoGPT/releases
- https://dify.ai/blog/100k-stars-on-github-thank-you-to-our-amazing-open-source-community
- https://www.instaclustr.com/education/agentic-ai/agentic-ai-frameworks-top-10-options-in-2026/
- https://github.com/langchain-ai/langgraph
- https://github.com/FoundationAgents/MetaGPT
- https://github.com/ag2ai/ag2
- https://github.com/crewaiinc/crewai
- https://github.com/nousresearch/hermes-agent
- https://hai.stanford.edu/research/dspy-compiling-declarative-language-model-calls-into-state-of-the-art-pipelines
- https://github.com/stanfordnlp/dspy
- https://github.com/huggingface/smolagents
- https://openai.github.io/openai-agents-python/
