
What is Agentic AI? Part 2: The Role of LLMs

In our first post in the series "What is Agentic AI?", we introduced Agentic AI: intelligent systems that perceive, reason, and act autonomously to complete tasks. At the core of these agents lies a revolutionary technology: Large Language Models (LLMs). These advanced AI systems bring natural language understanding, reasoning, and decision-making to the world of automation, making Agentic AI possible.

 

But how do LLMs work, and what makes them so transformative?

 


Why LLMs?

 

At their core, Large Language Models (LLMs) are AI systems trained on massive datasets that span books, codebases, websites, videos, and other curated content. This enables them to grasp the structure of language, relationships between concepts, and even domain-specific knowledge.

 

LLMs are designed to understand and generate human-like language, enabling them to interpret inputs such as questions or commands, reason about these inputs to provide answers, and generate outputs like text, code, images or audio.

 

These models receive an input (prompt) and use statistical learning to predict the most appropriate sequence of outputs, such as phrases, entire documents, code instructions, or even conversational audio.
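As a rough sketch of that input-to-output cycle, here is what sending a prompt and reading back the model's prediction might look like in Python, using the OpenAI client as just one example of a provider; the model name and prompt are illustrative:

```python
# A minimal sketch of the prompt -> prediction cycle. The OpenAI Python SDK
# is used as one example client; the model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # any capable LLM; this particular choice is an assumption
    messages=[
        {"role": "system", "content": "You are an automation assistant."},
        {"role": "user", "content": "Draft a short status update for a failed nightly backup job."},
    ],
)

print(response.choices[0].message.content)  # the model's predicted output sequence
```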

 

For example, when an LLM agent is told "Log into the customer portal," it infers that this instruction likely requires finding a login button, entering credentials into the username and password fields, and clicking submit, all because it has seen these interaction patterns across countless interfaces during training.
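In an agent setting, that interpretation is usually made explicit by asking the LLM for a structured plan. The sketch below shows one hypothetical shape such a plan could take for the instruction above; the action schema is entirely illustrative, not a standard format:

```python
# Hypothetical structured plan an LLM might produce for
# "Log into the customer portal"; the action schema is illustrative only.
login_plan = [
    {"action": "find", "target": "login button", "why": "portal pages expose a login entry point"},
    {"action": "type", "target": "username field", "value": "<username>"},
    {"action": "type", "target": "password field", "value": "<password>"},
    {"action": "click", "target": "submit button"},
]
```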

 

LLMs process inputs in context, considering the broader meaning of the conversation or task at hand.

 

In practice, LLMs serve as the "brain" of AI agents, enabling them to interact with digital environments much like humans do. Whether navigating a desktop application, operating a web interface, or processing documents, these agents use LLMs to perceive their environment through various inputs (like screenshots or text), reason about what they observe, and decide what actions to take.
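A minimal sketch of that loop, with the environment and model left as stand-ins (none of these method names come from a specific agent framework):

```python
# Illustrative perception-reasoning-action loop. The environment and llm
# objects are placeholders, not a specific framework's API.
def run_agent(llm, environment, goal, max_steps=20):
    for _ in range(max_steps):
        observation = environment.screenshot()              # perceive the current state
        action = llm.decide_next_action(observation, goal)  # reason about what to do next
        if action["name"] == "done":                        # the model signals completion
            break
        environment.execute(action)                         # act on the interface
```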

 


 

This perception-reasoning-action loop, powered by the sophisticated capabilities of LLMs, allows agents to operate autonomously across a wide range of digital environments and tasks. Let's explore the key capabilities that make this possible.

 

 

Key LLM Capabilities for Agentic AI 

 

The capabilities that make LLMs the foundation of Agentic AI begin with their ability to see and understand the world much as humans do.

 

Multimodal Perception: The Power to See

 

The development of multimodal perception, particularly vision-language models, represents one of the most significant breakthroughs enabling Agentic AI. While traditional LLMs could only process text, modern models can see and understand visual information alongside text, fundamentally changing how AI can interact with user interfaces and the digital world.

 

This ability to understand visual information was crucial for Agentic AI. Think about how humans operate software: we don't read HTML or parse DOM trees—we look at the screen, understand what we see, and interact with visual elements. Vision-language models give AI agents this same capability, allowing them to "see" and understand interfaces just as humans do.

 

These models can process a screenshot alongside textual instructions and understand the relationship between what they see and what they're asked to do.

 

When a multimodal LLM receives a screenshot of a web application, it doesn't just see pixels—it recognizes buttons, text fields, dropdown menus, and other UI elements. It understands that a blue underlined piece of text is likely a clickable link, that a box with a downward arrow is a dropdown menu, and that a circular button with a gear icon probably leads to settings.

 

This visual understanding is crucial for operating user interfaces effectively. Consider a task like "click the submit button after filling out the form." The LLM needs to visually locate the form fields, understand their labels, identify the submit button, and verify that it's in the right state (enabled vs. disabled).
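To make this concrete, here is roughly how a screenshot and an instruction travel to the model together; the OpenAI Python SDK is shown as one possible client, and the prompt and model name are assumptions for illustration:

```python
# Sketch of sending a screenshot plus a textual instruction to a
# vision-capable model. The OpenAI SDK is one example; the model name
# and prompt are illustrative.
import base64
from openai import OpenAI

client = OpenAI()

with open("form_screenshot.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-language model; this choice is an assumption
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Locate the form fields and the submit button in this screenshot, "
                     "and say whether the submit button appears enabled."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```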

 

Without vision capabilities, this kind of interaction would be nearly impossible or extremely brittle, requiring exact coordinate positions or element selectors that break whenever the interface changes.

 

World Knowledge and Pattern Recognition

 

Building on their ability to see, multimodal LLMs develop sophisticated world knowledge during training. Through exposure to vast amounts of multimodal data (text, images, videos, and code), these models develop an understanding of how the digital world works that mirrors human intuition. Imagine watching every single video on YouTube!

 

Think of it like human intuition about software interfaces. When we see a magnifying glass icon, we instinctively know it's for search. When we see a hamburger menu (☰), we expect it to open a navigation menu. LLMs have developed similar pattern matching through their exposure to vast amounts of interface-related content during training.

 

This world model helps LLMs understand common patterns in user interfaces, standard workflow sequences, and general concepts about how software operates. A multimodal LLM knows that clicking an 'X' typically closes a window not because it was explicitly programmed with this rule, but because it has observed this pattern countless times in its training data.

 

However, it's important to note that this world knowledge isn't perfect. While LLMs can make remarkably good generalizations, they can sometimes misinterpret elements, struggle with highly unusual layouts, or make incorrect assumptions based on their training data.

 

In-Context Learning: Show Me How to Do It

 

While world knowledge provides a foundation, in-context learning allows LLMs to adapt to new, specific tasks without traditional training or specialized fine-tuning. This capability exists on a spectrum, from understanding tasks with no examples to learning from multiple demonstrations.

 

At one end, we have zero-shot learning, where the model can perform tasks with no examples at all—just a clear instruction. For instance, you might tell an agent "extract the invoice total from this email," and it understands what to do immediately. This works because the LLM powering the agent has developed a general understanding of concepts like "extraction" and "invoice totals" during its training.
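A zero-shot prompt for that invoice example could be as simple as the following; the email text is invented for illustration:

```python
# Zero-shot: no examples, just a clear instruction (the email text is invented).
email_body = """Hi, please find attached invoice #4821.
Subtotal: $1,180.00, Tax: $94.40, Total due: $1,274.40 by March 31."""

zero_shot_prompt = (
    "Extract the invoice total from this email and reply with only the amount.\n\n"
    f"Email:\n{email_body}"
)
# Sent to an LLM, this should yield something like "$1,274.40".
```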

 

Moving along the spectrum, we find few-shot learning, where we provide one or two examples to help the model understand exactly what we want. For example, we might show the LLM one example of how to process a specific type of form, with a screenshot and the expected action, and it can then handle thousands of similar forms following the same pattern. This is particularly powerful when dealing with company-specific formats or workflows that the model hasn't encountered before.
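A few-shot version of the same idea adds a couple of worked examples so the model locks onto the exact output format we want; all of the text below is made up for illustration:

```python
# Few-shot: a couple of examples pin down the exact output format.
# All example text here is invented for illustration.
few_shot_prompt = """Extract the purchase order number and total from the text.

Text: "PO 7731 approved. Amount: 2,300 EUR, delivery in June."
Answer: {"po_number": "7731", "total": "2,300 EUR"}

Text: "Order PO-0094 was invoiced at $410.50 last week."
Answer: {"po_number": "0094", "total": "$410.50"}

Text: "Please process PO 5512, billed at 980 GBP."
Answer:"""
# The model continues the pattern: {"po_number": "5512", "total": "980 GBP"}.
```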

 

Finally, there's many-shot learning, where we provide multiple examples to help the model understand more complex patterns or edge cases. This might be useful when dealing with highly specialized tasks or when we need very precise outputs.

 

Chain-of-Thought Reasoning

 

Building on their ability to see, understand patterns, and learn from examples, LLMs can perform sophisticated chain-of-thought reasoning: breaking complex problems into smaller, logical steps and solving them sequentially. This is a fundamental capability for decomposing enterprise workflows into sequential, verifiable steps.

 

When an agent processes an invoice, for instance, it might first extract the relevant fields like date and amount, then verify the data's accuracy, enter it into an accounting system, and finally flag any discrepancies for review.
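One common way to elicit that behavior is simply to spell the steps out in the prompt and ask the model to reason through them in order; the wording and step list below are assumptions, not a fixed recipe:

```python
# Sketch of a chain-of-thought style prompt for invoice processing.
# The step list and wording are illustrative, not a fixed recipe.
cot_prompt = """You are processing the attached invoice. Work step by step:
1. Extract the invoice date, vendor, and total amount.
2. Check that the total matches the sum of the line items.
3. Draft the entry for the accounting system.
4. List any discrepancies a human should review.

Show your reasoning for each step before giving the final result."""
```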

 

This capability allows agents to handle multi-step workflows autonomously, making them valuable for real-world business operations. The agent can explain its reasoning at each step, making it easier for humans to understand and validate its decisions.

 

Agentic RAG: Dynamic Knowledge Integration

 

Finally, retrieval-augmented generation (RAG) takes all these capabilities to the next level by allowing agents to intelligently integrate external information into their workflow. Unlike simple document retrieval, Agentic RAG is an active part of the agent's decision-making process, enabling it to dynamically access and use additional knowledge, examples, or instructions as needed.

 

Consider an agent handling a customer support ticket about a specific product issue. When it receives a customer complaint about a feature not working, the agent first reviews the ticket details and attached screenshots. It then intelligently determines that it needs more context about the specific feature, querying the product documentation database, searching recent bug reports, and checking for relevant troubleshooting guides.

 

The agent doesn't just blindly use whatever information it finds. It evaluates the retrieved information for relevance, identifies specific troubleshooting steps that match the customer's issue, and may perform additional searches if the initial information is insufficient.

 

Finally, it combines this retrieved information with its general knowledge to create a personalized response with specific troubleshooting steps.
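Put together, this retrieve-evaluate-retrieve-again behavior might be sketched like this; the llm and knowledge_base objects and their method names are placeholders rather than a specific framework's API:

```python
# Illustrative agentic RAG loop. The llm and knowledge_base objects are
# placeholders for whatever model client and search index are actually used.
def answer_ticket(llm, knowledge_base, ticket, max_rounds=3):
    query = llm.summarize_issue(ticket)                # decide what context is needed
    context = []
    for _ in range(max_rounds):
        results = knowledge_base.search(query)          # docs, bug reports, guides
        relevant = [r for r in results if llm.is_relevant(r, ticket)]
        context.extend(relevant)
        if llm.has_enough_context(ticket, context):     # stop once the gap is filled
            break
        query = llm.refine_query(ticket, context)       # otherwise search again
    # Blend retrieved context with the model's general knowledge into a reply
    return llm.draft_response(ticket, context)
```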

 

 

Why Should Tech Leaders Care?

 

The capabilities of LLMs translate directly into practical applications for Agentic AI. In customer support, agents can analyze emails, generate personalized responses, and even resolve issues without human involvement. For data analysis, they can process and summarize vast amounts of information, helping teams make data-driven decisions faster.

 

From onboarding employees to managing supply chains, Agentic AI reduces manual effort and ensures consistency across workflows.

 

Organizations leveraging Agentic AI powered by LLMs can adapt quickly, as agents learn and execute new workflows without major reprogramming, and can improve accuracy through advanced planning and error-correction capabilities.

 

While LLMs are undoubtedly powerful and transformative, they're not without limitations and risks. In our next posts, we'll explore real-world applications of Agentic AI across different industries, important limitations of LLMs and how to address them, inherent risks and best practices for responsible deployment, and strategies for combining LLMs with other technologies for more robust solutions.

 

This balanced approach, understanding both the capabilities and limitations of LLMs, is crucial for organizations looking to leverage Agentic AI effectively. The technology isn't magic, but when applied thoughtfully, it can revolutionize how we approach automation and workplace productivity.

 

Agentic AI, powered by LLMs, represents a significant leap forward in automation capabilities. The question isn't just whether you're ready to embrace it, but how to do so thoughtfully and effectively.

 

Stay tuned as we continue to explore the practical implications of this technology and how your organization can harness its power responsibly and effectively.

 

 

Cheers, 

Pedro Saleiro, 

Co-Founder/Chief AI Officer