
Enterprise AI relies on context sourced from dozens of SaaS applications, including document management, chat and email, project management, code repos, customer relationship management (CRM), and enterprise resource planning (ERP) systems. This context forms the basis for decision-making, helping to dictate what action an agent should take next. Access to the right context depends on how well your AI vendor designed their data connectors. Their architecture determines whether they can scale to hundreds of millions of documents, retrieve real-time data, and maintain resilient, highly reliable indexes. Just as crucially, data connectors define how enterprise information is accessed.
When AI vendors begin offering support for data connectors, it’s essential to ask questions about their design. These questions help ensure that enterprise context — the foundation of AI-driven decisions — remains intact.
1. Are you indexing data or running a federated fetch?
Effective search, the backbone of quality enterprise AI, starts with indexing — storing and organizing data for fast, accurate retrieval. Think of the index at the back of a book: instead of reading the entire book to find information, you can go straight to the relevant page. Search indexes work the same way, but beyond speed, they enhance quality by normalizing results, applying scoring, and handling synonyms and acronyms consistently. This is how Google structured the web to understand the world’s information. While that was a hard problem, internet data had the advantage of a standard data model and an open architecture. In contrast, enterprises have heterogeneous data that’s permissioned for each user.
A different approach, federated fetch, retrieves data directly from various sources via their search APIs. However, API quality and latency vary significantly, making results inconsistent. In a federated system, you are querying dozens of systems in parallel and then waiting for the slowest to return before generating an answer. There’s no unified scoring system across sources, often leading to a simple union of results. Given the limited context windows of LLMs, this issue compounds, forcing AI to process enterprise data inefficiently and without truly understanding context, diminishing the accuracy and relevance of results. Federated fetch also requires users to know exactly which data source to query. A giveaway of a federated fetch system is when providers require user authentication for each data source.
At Glean, we’ve found that most customers pull data from at least three different sources to answer a single question. Data is fragmented, and users don’t always know where the information they need is stored. This is why indexing is critical for enterprise AI.
Your vendor should index your data using both semantic and lexical search indexes, grounding your AI in the most relevant enterprise context.
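To make that concrete, here is a minimal sketch of hybrid retrieval over a toy in-memory corpus. The term-overlap function stands in for a real lexical scorer like BM25, the hand-written vectors stand in for model-generated embeddings, and the blend weight `alpha` is purely illustrative; a production system would persist both indexes and normalize scores before combining them.

```python
from collections import Counter
import math

# Toy corpus: each "embedding" is a hand-written vector and text is raw
# content. In a real system, embeddings come from a model and both
# indexes are persisted, sharded stores.
DOCS = {
    "doc1": {"text": "quarterly revenue report for finance", "vec": [0.9, 0.1, 0.3]},
    "doc2": {"text": "engineering onboarding guide", "vec": [0.2, 0.8, 0.5]},
    "doc3": {"text": "finance onboarding checklist", "vec": [0.6, 0.6, 0.4]},
}

def lexical_score(query: str, text: str) -> float:
    """Naive term-overlap score standing in for a real scorer like BM25."""
    q, d = Counter(query.lower().split()), Counter(text.lower().split())
    return float(sum(min(q[t], d[t]) for t in q))

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_search(query: str, query_vec, alpha: float = 0.5):
    """Blend lexical and semantic scores into a single ranked list."""
    scored = [
        (alpha * lexical_score(query, doc["text"])
         + (1 - alpha) * cosine(query_vec, doc["vec"]), doc_id)
        for doc_id, doc in DOCS.items()
    ]
    return sorted(scored, reverse=True)

print(hybrid_search("finance onboarding", [0.7, 0.5, 0.4]))
```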
2. How do you enforce data source permissions?
Each connector comes with its own set of permissions, so your AI provider should respect granular permission structures at the data source level. Enterprise systems use a mix of:
- Access Control Lists (ACLs)
- Directory structures
- Group hierarchies
- Individual overrides
- Link sharing settings
- Real-time changes to all of the above, which must be enforced instantly
Your AI provider should ensure real-time updates to permissions to prevent unauthorized data access. Enterprise controls that manage indexing down to an individual document level also ensure that sensitive or irrelevant data stays out of your knowledge base.
An often overlooked aspect of connector design is the need to build an identity graph across data sources so users only see what they’re permitted to. Usernames and emails also differ from application to application, making alias linking in an identity graph crucial.
Some vendors sidestep this complexity by only indexing publicly available company data, like wikis and public Slack channels, or applying only broad, high-level permissions. However, to be secure, enterprise AI should mirror the granularity of real-world access controls.
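To make the identity-graph requirement concrete, here is a minimal sketch of query-time permission filtering. The alias, group, and ACL structures are hypothetical; real connectors crawl ACLs from each source and keep them continuously synchronized.

```python
# Identity graph: aliases from each application resolve to one canonical
# user, so permissions apply consistently across sources.
IDENTITY_GRAPH = {
    "slack:U123": "user:ana",
    "jira:ana.garcia": "user:ana",
    "gdrive:ana@acme.com": "user:ana",
}

# Group membership, resolved against canonical identities.
GROUPS = {"group:finance": {"user:ana"}}

# Each indexed document carries the ACL crawled from its source.
DOC_ACLS = {
    "doc1": {"allow": {"group:finance"}},
    "doc2": {"allow": {"user:bob"}},
}

def canonical(alias: str) -> str:
    return IDENTITY_GRAPH.get(alias, alias)

def can_read(alias: str, doc_id: str) -> bool:
    """True if the user, or any group they belong to, is on the doc's ACL."""
    user = canonical(alias)
    principals = {user} | {g for g, members in GROUPS.items() if user in members}
    return bool(principals & DOC_ACLS[doc_id]["allow"])

# The same person, seen through three different application aliases,
# gets identical results:
for alias in ("slack:U123", "jira:ana.garcia", "gdrive:ana@acme.com"):
    print(alias, "->", [d for d in DOC_ACLS if can_read(alias, d)])
```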
3. How are you protecting sensitive data?
Permissions alone aren’t enough to safeguard enterprise data. Under the weight of SaaS sprawl, many organizations have suffered from lax data retention and enforcement policies. As AI adoption scales beyond single file uploads to large-scale enterprise data integration, providers should be evaluated on whether they offer an extra layer of governance to prevent sensitive information—trade secrets, customer data, financial records—from leaking into AI-generated responses.
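One simple form such a governance layer can take is pattern-based redaction applied before content reaches the index or an LLM. The patterns below are hypothetical stand-ins; real systems combine ML classifiers, labels from DLP tooling, and admin-defined policies.

```python
import re

# Hypothetical patterns; real governance layers combine classifiers,
# labels from DLP tools, and admin-defined policies.
SENSITIVE_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text: str) -> str:
    """Mask sensitive spans before text is indexed or sent to an LLM."""
    for label, pattern in SENSITIVE_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text

print(redact("Customer SSN is 123-45-6789, card 4111 1111 1111 1111."))
```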
4. How often is new data updated and made accessible for AI?
Enterprise data can decay in value quickly. As AI moves towards agentic systems that take action based on real-time context, it’s more important than ever that results are grounded in context from recent Slack messages, Microsoft Outlook emails, or Zoom meetings. You should expect real-time connectors to reflect data updates in the span of minutes, not hours.
Keeping data and permissions up to date is deceptively challenging. Providers operate within API rate limits set by data sources, where request costs and quotas vary—and can shift dynamically based on server load. You’ll want a provider that uses multiple crawl strategies intelligently to retrieve as much data as possible without flooding the API. However, this requires finesse and constant adherence to changing API limits. It also requires careful prioritization of which data to index first, putting the indexing of permissions ahead of content to keep data secure.
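Here is a minimal sketch of those two ideas together, assuming a hypothetical token-bucket throttle to stay under a source’s quota and a priority queue that crawls permission updates before content. Real crawlers adapt rates dynamically and distribute this work across many machines.

```python
import heapq
import time

class TokenBucket:
    """Client-side throttle that keeps requests under a source's API quota."""
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate, self.capacity = rate_per_sec, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def acquire(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            time.sleep((1 - self.tokens) / self.rate)  # wait for a token
            self.tokens = 1
        self.tokens -= 1

# Lower number = higher priority: permissions are crawled before content.
PRIORITY = {"permissions": 0, "content": 1}

def crawl(tasks, bucket: TokenBucket):
    queue = [(PRIORITY[kind], i, kind, item) for i, (kind, item) in enumerate(tasks)]
    heapq.heapify(queue)
    while queue:
        _, _, kind, item = heapq.heappop(queue)
        bucket.acquire()                   # respect the API quota
        print(f"fetching {kind}: {item}")  # stand-in for the real API call

crawl([("content", "doc1"), ("permissions", "doc1"), ("content", "doc2")],
      TokenBucket(rate_per_sec=5, burst=2))
```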
5. Who is responsible for data modeling?
Another key factor to assess is the effort required to make each data source usable with an AI system. Many providers offer extensive integration catalogs but leave data modeling and indexing to the customer. They may also rely on system integrators to do the heavy lifting rather than offering native data connector support.
Data modeling is complex, as every application has its own structure. Take Slack, for example:
- Public vs. private channels
- DMs, group messages, and threaded conversations
- Top posts vs. mentions
- Attachments, links, and shared workspaces
Each element is interconnected — threads link back to original posts, and Slack messages lack traditional titles, which can affect ranking in a typical search engine. Effective data modeling helps AI interpret an application’s structure and rank information.
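As a sketch of what a normalized Slack document might look like after data modeling, the schema below is hypothetical; the synthetic title field illustrates one way a connector could compensate for Slack messages having no real titles.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SlackDocument:
    """Hypothetical normalized document emitted by a Slack connector."""
    message_id: str
    channel: str
    is_private: bool                 # drives ACL handling downstream
    thread_root_id: Optional[str]    # links replies back to the top post
    author: str
    text: str
    mentions: list = field(default_factory=list)
    attachments: list = field(default_factory=list)
    synthetic_title: str = ""

    def __post_init__(self):
        # Slack messages have no real titles; derive one from the first
        # words so ranking has a title-like field to score against.
        if not self.synthetic_title:
            self.synthetic_title = " ".join(self.text.split()[:8])

doc = SlackDocument(
    message_id="m1", channel="#launch", is_private=False,
    thread_root_id=None, author="U123",
    text="Kicking off the Q3 launch plan, see thread for owners",
)
print(doc.synthetic_title)
```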
Most enterprises have already invested heavily in data modeling for structured analytics. Extending that effort to unstructured data for LLMs is resource-intensive. AI providers need to adapt their data models continuously rather than follow a set-it-and-forget-it approach, because each of these SaaS applications is constantly shipping new features and evolving its data model.
6. What data are you indexing?
The most relevant enterprise information is often authoritative, recent, and personalized. Capturing these factors requires more than just indexing content — activity and people data reflect how and by whom the information is used. The broader context is key because enterprise search and AI applications lack strong feedback signals. Without it, AI struggles to understand and learn the importance of a single document.
Understanding what content is being indexed is also important. In many cases, only a fraction of a data source’s content or functionality is supported, limiting AI’s ability to provide comprehensive insights. Evaluating the depth and breadth of indexing ensures enterprise AI has the full context needed for relevant results.
For enterprises with years of historical context, intelligent indexing is key to cost efficiency. Setting rigid time-based limits, like indexing only the past year, can exclude high-value authoritative data. A more effective approach uses dynamic algorithms that decide what to index based on ongoing activity and relevance, ensuring AI processes the most valuable data while avoiding unnecessary storage costs.
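A minimal sketch of such a decision heuristic, with entirely illustrative weights and signals, might blend recency, recent activity, and inbound links so that an old but heavily used document still makes the cut:

```python
from datetime import datetime, timezone

def index_priority(last_modified: datetime, views_90d: int, inbound_links: int) -> float:
    """Blend recency, activity, and authority; weights are illustrative."""
    age_days = (datetime.now(timezone.utc) - last_modified).days
    recency = 1.0 / (1.0 + age_days / 365)       # decays with age, never zero
    activity = min(views_90d / 50, 1.0)          # saturating view signal
    authority = min(inbound_links / 10, 1.0)     # docs other docs point at
    return 0.4 * recency + 0.4 * activity + 0.2 * authority

def should_index(score: float, threshold: float = 0.3) -> bool:
    return score >= threshold

# An old but heavily used, heavily linked document still qualifies:
old_but_canonical = index_priority(
    datetime(2019, 5, 1, tzinfo=timezone.utc), views_90d=120, inbound_links=30)
print(should_index(old_but_canonical))  # True
```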
7. Are you organizing data in a knowledge graph?
Indexing content, activity, and people data is just the first step. The real differentiator in enterprise AI data connectors is knowledge graphs.
Enterprise knowledge isn’t just about processing text — it’s about understanding how information is used, who depends on it, and its overall impact. Even within the same organization, terms can have different meanings; for example, “POC” might mean “proof of concept” in engineering but “point of contact” in sales. Knowledge graphs provide this structure, giving AI the necessary context to make knowledge actionable and drive better decision making.
Constructing a knowledge graph is complex, requiring enterprise data to be fused together and then sifted to identify and understand concepts across documents, messages, and queries. This process must go beyond matching text as concepts are often referred to by multiple names.
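To illustrate the disambiguation problem, here is a toy knowledge graph where alias resolution is driven by the user’s department. The node, edge, and alias structures are all hypothetical.

```python
# Toy knowledge graph: nodes are concepts, edges carry relations, and
# aliases resolve differently depending on the user's department.
ALIASES = {
    "POC": [("proof_of_concept", "engineering"), ("point_of_contact", "sales")],
}
EDGES = [
    ("proof_of_concept", "validated_by", "design_partner_program"),
    ("point_of_contact", "owned_by", "account_executive"),
]

def resolve(term: str, user_team: str) -> str:
    """Pick the concept whose context matches the user's team."""
    candidates = ALIASES.get(term, [])
    for concept, team in candidates:
        if team == user_team:
            return concept
    return candidates[0][0] if candidates else term

def related(concept: str):
    """Follow outgoing edges from a concept."""
    return [(rel, dst) for src, rel, dst in EDGES if src == concept]

for team in ("engineering", "sales"):
    concept = resolve("POC", team)
    print(team, "->", concept, related(concept))
```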
8. Where is the index stored?
As the number of connectors grows, your entire enterprise knowledge base is effectively being indexed, creating a near replica of your data. This raises important considerations about where and how that data is stored. A cloud-prem deployment model, where data resides in your isolated VPC, provides greater control and security compared to traditional multi-tenant SaaS environments.
Beyond storage, protecting the transfer of enterprise data to LLMs is equally important. Establishing agreements with LLM providers for zero data retention and no training on your data prevents accidental leakage.
9. How does your connector infrastructure scale?
Crawling and indexing data across an enterprise ecosystem of connectors requires a horizontally distributed crawl infrastructure. Each data source requires a different amount of work: a Slack message, for example, is much smaller than a Google Drive file, and a plain text file is easier to index than a PDF, which requires preprocessing. Beyond data complexity, API quotas introduce challenges. These quotas don’t scale proportionally to data volume, creating bottlenecks in your crawl infrastructure and limiting the speed at which data can be indexed.
This challenge doesn’t manifest for small teams, but it does for enterprises with 10,000 to 100,000 users managing hundreds of millions of documents. Handling enterprise-scale data requires an indexing approach that optimizes for efficiency and independent scaling while navigating API constraints.
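One way to picture independent scaling is per-connector worker pools sized to each source’s workload and quota, rather than one shared pool. The pool sizes and fetch function below are purely illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

# Per-connector pool sizes, tuned to each source's document sizes and
# API quota. Numbers are illustrative.
POOL_SIZES = {"slack": 8, "gdrive": 2, "jira": 4}

def fetch(source: str, doc_id: str) -> str:
    return f"{source}:{doc_id}"  # stand-in for the real crawl + parse

def crawl_all(work):
    """Run each source on its own pool so one slow source can't starve the rest."""
    pools = {s: ThreadPoolExecutor(max_workers=n) for s, n in POOL_SIZES.items()}
    futures = [pools[s].submit(fetch, s, d) for s, docs in work.items() for d in docs]
    for f in futures:
        print(f.result())
    for pool in pools.values():
        pool.shutdown()

crawl_all({"slack": ["m1", "m2"], "gdrive": ["f1"], "jira": ["t1"]})
```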
Final thoughts
Data connectors directly impact the quality and security of enterprise AI. A strong connector foundation that indexes content, activity, and people data, enforces permissions, and delivers real-time updates gives AI agents the context they need to make automated decisions intelligently. By asking these tough questions, you can gauge whether a vendor truly understands the complexity of building an enterprise-ready connector framework. Don’t just take their word for future connector support — dig into how they plan to implement and maintain them. That way, you can make informed decisions about how your data fuels AI agents and, ultimately, the future of AI in your organization.