Technical Report – Daniel Możdżyński
Large Language Models (LLMs) and LLM-based agents are increasingly deployed in enterprise environments to support tasks such as ERP implementation, IT service management, customer support and knowledge management. While existing systems demonstrate strong performance in single-turn or short multi-turn interactions, long-horizon reasoning—reasoning over extended task sequences, time spans, and evolving contexts—remains a major bottleneck for production use. In this report we analyze the specific challenges of long-horizon reasoning in enterprise LLM systems, relate them to current architectural patterns such as Retrieval-Augmented Generation (RAG), multi-agent planning-and-execution, and memory architectures, and outline a practical design space for robust systems. We complement this analysis with a conceptual case study of an AI SAP consultant system for implementation and AMS (Application Management Services) work. Finally, we propose a research and evaluation agenda for long-horizon enterprise scenarios, including benchmark characteristics and metrics.
The last two years have seen rapid adoption of LLMs in enterprises, ranging from copilots embedded in productivity tools to domain-specific assistants for software development, legal work or ERP systems. However, the majority of successful deployments focus on short-horizon interactions: answering a question, drafting a single document, or assisting in a relatively local coding task.
Enterprise reality is different. Projects such as SAP S/4HANA implementations, system upgrades, or incident management in AMS unfold over weeks to months, involve multiple teams, heterogeneous tools and documentation, and require a consistent chain of reasoning over many steps. Current LLM systems struggle to maintain:
Coherence and traceability across long sequences of actions.
Robustness when context grows beyond a single prompt window.
Alignment with strict governance, compliance and safety requirements.
Recent work on Retrieval-Augmented Generation (RAG) has shown how integrating structured retrieval with LLMs helps ground responses in enterprise knowledge and reduce hallucination. At the same time, a growing literature on LLM agents and multi-agent systems highlights planning, tool use and memory as critical ingredients for long-horizon tasks.
In this report we focus on the intersection: what makes long-horizon reasoning particularly hard in enterprise settings, and what architectural patterns appear most promising.
We use long-horizon reasoning to denote tasks where:
The number of steps is large (tens to hundreds of actions or decisions).
The time span is extended (hours to months).
The state of the environment evolves (documents change, tickets progress, systems are configured).
Correct behavior depends on linking early decisions to late outcomes.
Long-horizon enterprise tasks include:
Designing and executing an ERP implementation (e.g. S/4HANA greenfield or upgrade).
Managing a portfolio of AMS incidents, including root-cause analysis, knowledge reuse and regression patterns.
Complex compliance and governance workflows, where obligations and evidence accumulate over time.
In other words, enterprise projects are closer to multi-week or multi-month problem solving than to single-turn question answering: early steps constrain later options, and success depends on maintaining a coherent line of reasoning over time rather than producing a single, isolated answer.
Compared to synthetic benchmarks, enterprise scenarios introduce constraints that exacerbate long-horizon difficulties:
Heterogeneous data silos – knowledge is spread across wikis, tickets, documents, emails and source systems.
Strict governance and access control – role-based access, data residency and auditability are mandatory.
High cost of errors – misconfiguring a production ERP system or mishandling regulated data can have severe impact.
Human-in-the-loop workflows – LLM systems must interoperate with project managers, consultants and engineers, not replace them in a single step.
These constraints make naïve scaling of context (just sending more tokens) both technically and organizationally infeasible.
Most current enterprise “AI copilots” behave like enhanced search interfaces wrapped around a large language model. Each query is handled largely independently: the system retrieves a handful of documents, generates an answer and discards most of the intermediate state. Project structure, long-term goals and governance constraints typically live outside the AI system—in the heads of consultants and in project plans—rather than being explicitly represented in the architecture of the copilot itself.
Empirical and anecdotal evidence from deployed systems, as well as recent studies of LLM agents, reveals several common failure modes in long-horizon scenarios.
Adding more context does not monotonically improve performance. As contexts grow, models:
Focus on the wrong parts of the prompt.
Forget or misinterpret earlier constraints.
Generate answers that are locally plausible but globally inconsistent with previous steps.
This context paradox has been explicitly observed in production systems: beyond a certain size, additional context decreases system stability.
In long workflows, models often:
Fail to maintain an explicit plan across many steps.
Re-explain or re-derive decisions instead of reusing earlier conclusions.
Make contradictory decisions in steps that are far apart in the sequence.
Research on plan-and-execute architectures shows that naive prompting leads to low success rates on long web and tool-use tasks.
Enterprise LLM systems frequently rely on tools (ticketing APIs, SAP interfaces, knowledge base search). Fragility appears when:
Tool-calling logic is encoded only in prompts rather than in robust control flow.
Error handling, timeouts, or partial failures are not modeled.
Changes in underlying systems silently break tasks.
Without explicit memory architecture, long-term information is spread across:
System logs and traces.
Vector databases.
Internal state of multiple agents.
External project tools (e.g. Jira, SAP Solution Manager).
This fragmentation leads to incomplete or inconsistent recall, especially when systems are restarted, upgraded or scaled.
We now outline architectural patterns that address these failure modes.
RAG combines a retriever over enterprise content with a generator LLM, grounding responses in relevant documents.
For long-horizon tasks, RAG helps to:
Keep prompts concise by retrieving only task-relevant slices of history or documentation.
Provide verifiable references in generated outputs (e.g. blueprint chapters linked to source documents).
Decouple knowledge updates (indexing new documentation) from model updates.
However, vanilla “query → retrieve → answer” RAG is fundamentally short-horizon; it does not in itself enforce multi-step plans or long-term memory beyond what fits in each prompt.
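The vanilla pattern can be made concrete with a minimal sketch: retrieve the most similar chunks, then assemble a grounded prompt whose document identifiers remain citable. All names here (Chunk, retrieve, the toy index and its pre-computed vectors) are illustrative stand-ins, not a specific library's API; a production system would use a real embedding model and vector store.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str          # identifier usable as a verifiable reference
    text: str
    vector: list         # pre-computed embedding (toy values below)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, k=2):
    """Return the k most similar chunks -- the task-relevant slice of history."""
    return sorted(index, key=lambda c: cosine(query_vec, c.vector), reverse=True)[:k]

def build_prompt(question, chunks):
    """Ground the answer in retrieved evidence; doc_ids stay citable in the output."""
    context = "\n".join(f"[{c.doc_id}] {c.text}" for c in chunks)
    return f"Answer using only the sources below.\n{context}\n\nQ: {question}"

# toy index; a real system would embed enterprise documents at indexing time
index = [
    Chunk("blueprint-3.2", "Pricing is configured via the SD condition technique.", [1.0, 0.0]),
    Chunk("note-123", "FI posting periods are maintained per variant.", [0.0, 1.0]),
]
top = retrieve([0.9, 0.1], index, k=1)
prompt = build_prompt("How is pricing configured?", top)
```

Note that the retrieval step keeps the prompt small regardless of how much history has accumulated, which is exactly what makes RAG attractive, and insufficient, for long horizons.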
A common pattern for long-horizon tasks is to separate:
A Planner agent that generates and updates high-level plans (task decomposition, milestones).
An Executor agent (or agents) that carry out individual steps, including tool calls and document generation.
This pattern appears across multiple recent systems and analyses of agent architectures. Benefits for enterprise settings include:
Clear traceability: every action can be linked to a plan node.
Flexibility: planners and executors can use different models or configurations (e.g. larger planner, smaller executor).
Easier governance: plans can be surfaced to humans for approval.
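The control flow behind this split can be sketched as follows. The planner and executor here are deliberately trivial stand-ins (a real planner would call an LLM); the point is the structural property that every executed action is logged against a plan node, which is what enables traceability and human approval.

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    step_id: str
    description: str
    status: str = "pending"   # pending | done | failed

@dataclass
class Trace:
    records: list = field(default_factory=list)

    def log(self, step_id, action, result):
        # every action links back to a plan node for auditability
        self.records.append({"step": step_id, "action": action, "result": result})

def plan(goal):
    """Planner stand-in: a real planner would invoke a (larger) LLM."""
    return [PlanStep("s1", f"Decompose goal: {goal}"),
            PlanStep("s2", "Execute first milestone")]

def execute(step, trace):
    """Executor stand-in: tool calls or document drafting would happen here."""
    trace.log(step.step_id, step.description, "ok")
    step.status = "done"

trace = Trace()
steps = plan("Draft blueprint section 4.1")
for step in steps:
    execute(step, trace)     # a governance hook could require approval here
```

Because the plan is an explicit data structure rather than implicit prompt state, it can be surfaced to humans, diffed across re-planning rounds, and persisted across sessions.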
Long-horizon reasoning requires hierarchical memory, analogous to human short-term and long-term memory. Recent work and practitioner literature converge on several tiers:
Short-term (working) memory – the immediate conversational or task context within the prompt window.
Episodic memory – structured records of past interactions, tasks and decisions (e.g. incident resolution timelines, workshop sessions).
Semantic memory – distilled knowledge from episodes (patterns, templates, playbooks).
External artifacts – documents generated by the system: blueprints, tickets, diagrams.
Architecturally, these can be implemented via vector stores for episodic and semantic memory, relational or graph databases for structured decision logs, and indexes and RAG over generated artifacts. The crucial point is consistency and addressability: each memory record must have identifiers that agents and humans can reference.
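The addressability requirement can be illustrated with a minimal in-memory sketch (the class and field names are hypothetical, not a reference to any particular memory framework): every record carries a stable identifier, and records in higher tiers keep back-references to the episodes they were distilled from.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class MemoryRecord:
    tier: str                 # "episodic" | "semantic" | "artifact"
    content: str
    record_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    refs: list = field(default_factory=list)   # ids of related records

class MemoryStore:
    def __init__(self):
        self._by_id = {}

    def add(self, rec):
        self._by_id[rec.record_id] = rec
        return rec.record_id

    def get(self, record_id):
        # addressability: agents and humans dereference the same id
        return self._by_id[record_id]

store = MemoryStore()
ep = MemoryRecord("episodic", "Workshop 2024-05: pricing requirements agreed")
ep_id = store.add(ep)
# a semantic record distilled from the episode keeps a back-reference to it
sem = MemoryRecord("semantic", "Pattern: pricing via SD condition technique",
                   refs=[ep_id])
sem_id = store.add(sem)
```

In a production system the store would be backed by a database and the ids would also appear in generated documents, so that a blueprint paragraph can cite the workshop episode it was derived from.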
Recent research on agentic systems introduces thought management layers that:
Monitor agent trajectories.
Prune unproductive reasoning branches.
Trigger re-planning when necessary.
In enterprise deployments, a lightweight version of thought management can enforce guardrails, detect looping or inconsistent behavior, and log decision rationales for later audit.
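A lightweight loop detector of this kind can be sketched in a few lines. This is an assumed, simplistic heuristic (repeating the same action too often within a sliding window), not a published algorithm; real thought-management layers would inspect richer trajectory features.

```python
from collections import deque

class LoopGuard:
    """Trigger re-planning when an agent repeats the same action too often."""

    def __init__(self, window=6, max_repeats=3):
        self.recent = deque(maxlen=window)   # sliding window of recent actions
        self.max_repeats = max_repeats

    def observe(self, action: str) -> bool:
        """Return True if the trajectory looks stuck and should be re-planned."""
        self.recent.append(action)
        return self.recent.count(action) >= self.max_repeats

guard = LoopGuard()
stuck = False
for action in ["search", "read", "search", "read", "search"]:
    if guard.observe(action):
        stuck = True     # hand control back to the planner here
        break
```

The same hook is a natural place to log the decision rationale that triggered re-planning, feeding the audit trail mentioned above.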
To make these ideas concrete, we consider a conceptual system: an AI SAP consultant that supports implementation projects and AMS incident handling. The system is deployed on sovereign infrastructure and fine-tuned on organization-specific project documentation.
Use cases include:
Implementation / Upgrade Support: facilitate Explore-phase workshops, structure requirements into a business blueprint, suggest configuration options and integration patterns, and track design decisions over months.
AMS (Application Management Services): triage and analyze incidents (e.g. FI/CO, SD, MM, IS-U), reuse knowledge from previous incidents and SAP Notes, and propose structured responses and remediation steps.
Both categories are inherently long-horizon: they involve repeated interactions with the same landscape and stakeholders, and the system must remain consistent with past decisions.
A long-horizon-oriented architecture for this system can be structured as follows:
RAG Layer: indexes project documentation (blueprints, functional specs, customizing guides), AMS tickets, SAP Notes metadata, best practices; provides retrieval pipelines tailored to workshops, incident analysis, and regression/change-impact analysis.
Planner Agents: a Project Planner that maintains the map of blueprint sections, workshops and follow-up tasks; an Incident Planner that manages the lifecycle of an incident.
Executor Agents: a Workshop Facilitator, a Blueprint Drafter, and an Incident Analyst responsible for executing individual steps.
Memory Subsystem: episodic store for timelines of workshops and incidents, semantic store for patterns extracted from resolved incidents, and a decision log linking requirements, design decisions and configuration.
Governance & Human-in-the-Loop: role-based access, masking and redaction, approval steps for high-impact actions, and a full trace of which agent proposed which change and why.
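The decision log in the memory subsystem can be sketched as a chain of linked entries, so that any configuration change can be traced back through the design decision to the original requirement. The entry ids and summaries below are purely illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogEntry:
    entry_id: str
    kind: str          # "requirement" | "decision" | "config"
    summary: str
    links: tuple = ()  # upstream entry ids

log = {}

def record(entry):
    log[entry.entry_id] = entry

record(LogEntry("REQ-17", "requirement", "Dunning per company code"))
record(LogEntry("DEC-04", "decision", "Use standard dunning procedure",
                links=("REQ-17",)))
record(LogEntry("CFG-09", "config", "Dunning procedure Z001 maintained",
                links=("DEC-04",)))

def provenance(entry_id):
    """Walk upstream links: from a config change back to its requirement."""
    chain, current = [], entry_id
    while current:
        entry = log[current]
        chain.append(entry.entry_id)
        current = entry.links[0] if entry.links else None
    return chain
```

A query like `provenance("CFG-09")` yields the chain config → decision → requirement, which is precisely the traceability that audit and change-impact analysis need.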
Such a system directly targets several long-horizon challenges:
Maintaining design coherence: a persistent blueprint model ensures that suggestions in later project phases are consistent with earlier decisions.
Knowledge reuse in AMS: the episodic memory captures resolved incidents; semantic patterns help detect recurrences.
Traceability and audit: every agent action is linked to a plan step and to retrieved evidence.
At the same time, the system still faces open issues, such as managing index drift as documentation evolves, robustly evaluating agents in real-world, noisy environments, and balancing autonomy with human control in production SAP landscapes.
Recent surveys on LLM agent evaluation emphasize that long-horizon tasks are among the hardest to evaluate systematically, especially in dynamic enterprise settings. We outline a minimal evaluation framework for enterprise long-horizon systems.
Tasks should reflect realistic business processes (e.g. completing a blueprint section, resolving an AMS incident with multiple dependencies), require multi-step planning (at least 10–20 steps), and involve evolving state (new documents, updated tickets). Synthetic frameworks like TaskWeaver and long-horizon benchmarks such as HeroBench provide inspiration for controllable task generation.
Key metrics include:
Task success rate as a function of horizon length.
Plan stability (degree to which plans are followed or require re-planning).
Consistency with previous decisions and policies.
Evidence quality (correctness and appropriateness of retrieved documents).
Human effort saved (e.g. reduction in consultant hours for blueprinting, or AMS resolution time).
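The first two metrics are straightforward to compute from run logs; a minimal sketch (function names and the run format are assumptions, not part of any benchmark suite):

```python
def success_rate_by_horizon(runs):
    """runs: iterable of (horizon_steps, succeeded) pairs; bucket by horizon."""
    buckets = {}
    for horizon, ok in runs:
        n, wins = buckets.get(horizon, (0, 0))
        buckets[horizon] = (n + 1, wins + int(ok))
    return {h: wins / n for h, (n, wins) in buckets.items()}

def plan_stability(planned, executed):
    """Fraction of planned steps executed in order (1.0 = plan followed exactly)."""
    followed = sum(1 for p, e in zip(planned, executed) if p == e)
    return followed / max(len(planned), 1)

runs = [(10, True), (10, True), (50, True), (50, False)]
rates = success_rate_by_horizon(runs)
stability = plan_stability(["a", "b", "c"], ["a", "b", "x"])
```

Plotting success rate against horizon length makes the degradation discussed in the failure-mode analysis directly visible, instead of hiding it in an aggregate score.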
Evaluation modes may include:
Offline replay – run the system on historic project or AMS data with ground-truth outcomes.
Shadow mode – the system proposes plans and actions while humans execute; compare to the actually chosen steps.
Controlled A/B tests – compare workflows with and without long-horizon features enabled (e.g. with vs. without hierarchical memory).
Enterprise resource planning (ERP), and SAP S/4HANA in particular, provide a useful vertical testbed for long-horizon reasoning research. Projects expose rich, pre-existing structure in the form of processes, blueprints and governance frameworks, while AMS queues offer a continuous stream of real incidents tied to the same landscape. This combination allows cognitive architectures to be evaluated not only on synthetic benchmarks, but also on process-level indicators such as time-to-blueprint, rework cycles or mean time to resolution.
Based on the above, we see several directions where further research is needed:
Task and benchmark design for realistic, multi-month enterprise workflows, beyond synthetic puzzles and short web tasks.
Hybrid planning methods that combine symbolic, process-aware representations (e.g. BPMN, SAP Activate templates) with neural planners.
Robust memory architectures that provide guarantees on freshness, consistency and access control across multiple tools and agents.
Governance and safety frameworks tailored to long-horizon agents, including rollback strategies and fail-safe defaults.
Learning from traces – using historical project logs and AMS tickets to improve planning and decision policies.
Long-horizon reasoning is emerging as a central bottleneck for deploying LLM systems in serious enterprise settings. Simply increasing context windows or fine-tuning models on more data is not sufficient. Instead, we must design architectures that explicitly separate planning and execution, provide multi-tier memory for episodic and semantic knowledge, ground decisions via RAG in evolving enterprise documentation, and operate under strict governance and human-in-the-loop constraints.
The conceptual AI SAP consultant system discussed here illustrates how these principles can be instantiated in the domain of SAP implementations and AMS. More broadly, the same patterns apply to CRM, supply chain, and other complex enterprise workflows.
We hope this report provides both practitioners and researchers with a structured view of the challenges and a starting point for systematic experimentation on long-horizon enterprise LLM systems.
1. Klesel, M., & Wittmann, H. F. (2025). Retrieval-Augmented Generation (RAG). Business & Information Systems Engineering, 67(4), 551–561.
2. Mohammadi, M. et al. (2025). Evaluation and Benchmarking of LLM Agents: A Survey.
3. LangWatch. (2025). The 6 Context Engineering Challenges Stopping AI from Scaling in Production.
4. Anonymous. (2025). Improving Planning of Agents for Long-Horizon Tasks. arXiv preprint.
5. Bidochko, A. (2025). Thought Management System for Long-Horizon, Goal-Driven AI Agents. Future Generation Computer Systems.
6. Anonymous. (2025). TaskWeaver: Probing the Limits of Endurance in Long-Horizon Tasks.
7. Anonymous. (2025). HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning.
8. Anthropic. (2025). Effective context engineering for AI agents. Anthropic Engineering Blog, 29 September 2025.
Acknowledgements
The author would like to thank Prof. Dr. Michael Klesel (Frankfurt University of Applied Sciences) for his valuable comments and feedback on earlier drafts of this technical report. Any remaining errors are the author’s own.