Argus · White Paper v0.1 · DRAFT
The architectural argument

Feature flags for multi-tenant SaaS.

Most feature flag platforms treat multi-tenancy as a targeting attribute on a user identity. It works at small scale. It strains at dozens of tenants. It breaks at hundreds. This paper makes the case that the tenant belongs in the data model, not the rule engine.

Author Shahin Zangenehpour, PhD
Version 0.1 · draft
Date 2026-05-01
Reading time ~30 min

00 · Abstract

Abstract

Most feature flag platforms treat multi-tenancy as a targeting attribute on a user identity. This works when you have a handful of tenants. It stops working ... quietly, then painfully ... when you have dozens or hundreds. The failure modes are specific: rule explosion, identity-keyed performance degradation, and audit fragmentation. This paper argues that multi-tenancy belongs in the data model, not the rule engine. It defines the distinction between attribute-based and structural multi-tenancy, walks through three real B2B scenarios, surveys how the major platforms handle tenants today, and introduces Argus ... a feature flag platform built around tenant-as-primitive. Engineers building B2B platforms that ship the same product to multiple customer organisations should read this. You will leave with a clear technical framework for evaluating how any flag tool ... including one you build yourself ... handles multi-tenancy.

01 · Executive summary

The argument in one section.

Feature flags solved the deployment-versus-release problem. The tooling for that problem is mature. The problem they did not solve is what happens when the product itself is not single-tenant.

Feature flags solved the deployment-versus-release problem. Wrap new code in a conditional, ship it dark, and turn it on when you are ready. The tooling for this is mature. LaunchDarkly, Unleash, Flagsmith, GrowthBook, and Statsig each do single-tenant feature flagging well. The rule engines are expressive. The SDKs are reliable. The dashboards are usable.

The problem starts when the product is not single-tenant.

B2B SaaS platforms ... telecom OEMs, vertical SaaS vendors, embedded fintech providers ... ship the same codebase to many customer organisations. Each organisation may need a different feature surface, a different rollout cadence, and a different audit boundary. This is not an edge case. It is the defining operational constraint of any platform that sells to other businesses.

The dominant approach today is to model tenancy as an attribute on the user identity. LaunchDarkly calls it a "context kind." Flagsmith stores it as a "trait." GrowthBook expresses it as a targeting condition. The mechanism varies. The architecture does not: the tenant is a value on the user, and per-tenant decisions are expressed as rules over that value.

Three conclusions:

  1. The dominant multi-tenancy approach is attribute-based. Every major feature flag platform represents the tenant as a property of the user identity and expresses per-tenant rollouts through targeting rules.
  2. That approach has known scaling and audit failure modes. Rule grids grow multiplicatively with tenant and tier count. Per-tenant aggregations require identity-store scans. Audit trails for "who changed what for which tenant" must be reconstructed from rule diffs rather than queried directly.
  3. A structural approach ... tenant-as-primitive ... is materially different. When the tenant is a first-class entity in the data model with its own override hierarchy, rollout configuration, and audit stream, per-tenant operations become O(1) lookups. Rule grids stay small. Audit becomes a native query. SOC 2 evidence collection becomes automated rather than manual.
Figure 01 · Side-by-side data models · attribute vs. structural
Attribute-based
User
id: user_abc
tenantId: carrier_b
tier: premium
region: eu-west
↓ targeting rule
Flag · new_pairing_flow
if tenantId in [carrier_a] → true
if tenantId == carrier_b AND tier == premium → true
if tenantId in [carrier_c] → false
default → false
vs.
Structural
Flag · new_pairing_flow
defaultValue: false
↓ environments
Environment · production
value: false
Tenant override · carrier_b
value: true
rollout: 25%
updatedBy: pm@co
Left: tenant decisions live inside the rule engine. Right: tenant decisions live in the data layer as their own document, sibling to the environment.

Argus is a feature flag platform built on this structural model. It treats the tenant as a document ... sibling to the environment ... with its own override hierarchy and immutable audit log. This paper is not a product pitch for Argus. It is an architectural argument that the structural approach produces better outcomes for any team operating a multi-tenant platform. Argus is the embodiment of the argument, not its precondition.

02 · The reality

The multi-tenant SaaS reality.

The term "multi-tenant" is overloaded. In this paper it means something narrower and more operational: one codebase, multiple customer organisations, distinct feature surfaces per customer.

This is not theoretical. It is the default architecture for any B2B platform that sells the same product to other businesses and lets those businesses customise their experience. Three scenarios illustrate the pattern and its operational tax.

Scenario A · Telecom OEM

A consumer electronics platform ships a single iOS and Android codebase under three carrier brands. Carrier A is in North America, Carrier B is in Europe, and Carrier C is in Japan. Each carrier has different regulatory requirements, different commercial agreements about which features they have licensed, and different rollout cadences driven by their own QA cycles.

The engineering team wants to ship weekly. Carrier B wants a two-week validation window. Carrier C requires regulatory sign-off before any feature touching device telemetry goes live. A feature that is ready for Carrier A today may not be deployable for Carrier C for six weeks.

Without per-tenant feature control, the team has three options: fork the codebase, multiply environments, or slow the entire release to the pace of the slowest carrier. All three have been tried. All three carry cost. Forking creates drift. Environment multiplication creates configuration sprawl. Slowing the release punishes the carriers who are ready.

The operational tax: a growing matrix of "which carrier has which feature" maintained in spreadsheets, Slack threads, and release notes. No single system of record. Audit requests from Carrier B's compliance team answered by grepping git history.

Scenario B · Vertical SaaS with tiered packages

A legaltech platform sells three tiers: Solo, Firm, and Enterprise. Each tier unlocks a different feature surface. Solo gets document drafting. Firm gets collaboration and e-signature. Enterprise gets audit trails, SSO, and API access. All three tiers run on the same codebase and the same infrastructure.

The product team ships a new AI-assisted contract review feature. They want to roll it out to Enterprise tenants first, validate adoption, then promote to Firm, then Solo. But two Enterprise tenants have explicitly opted out of AI features for compliance reasons. One Firm tenant has been granted early access as part of a commercial negotiation.

The rollout is not "Enterprise = on, Firm = off." It is "Enterprise = on except Tenant D and Tenant F, Firm = off except Tenant G, Solo = off." That is a five-clause targeting rule for a single flag. Multiply by the number of flags shipped per quarter ... the legaltech team ships roughly 15 feature flags per quarter ... and the total clause count across all flags climbs past 75 within three months. After a year, the rule grid has hundreds of clauses, each one an assumption about which tenant should see which feature. The system of record is no longer the flag tool. It is the collective memory of the product managers who wrote the clauses.

The operational tax: product managers maintaining per-tenant exception lists. Engineering deploying changes that accidentally override a commercial commitment. Customer success discovering that a tenant lost access to a feature because someone edited a targeting rule without understanding the full clause set.

Scenario C · Embedded / white-label fintech

A payments platform is OEMed into five PSP partners. Each partner has their own brand, their own compliance requirements, and their own user base. The payments platform provides the infrastructure. The PSP partners provide the customer relationship.

Partner A wants 3D Secure enabled by default. Partner B wants it behind a toggle. Partner C is in a jurisdiction where a specific payment method is prohibited. Partner D wants access to a fraud-scoring feature that Partner E has not licensed.

Each partner needs their own audit boundary. When Partner A's compliance team asks "show us every feature change that affected our users in the last 90 days," the answer cannot be "here is a diff of all targeting rules across all partners." It needs to be scoped to Partner A's tenant.

The operational tax: audit evidence collected manually by filtering rule changelogs. Tenant offboarding accomplished by searching for every rule that references the departing partner's identifier and removing the clause. No cascade. No single operation. Just careful, error-prone manual work.

These three scenarios share a structural pattern. The product is one. The tenants are many. The feature surface is different per tenant. The audit boundary is per tenant. The rollout cadence is per tenant. And the tools built for single-tenant feature flagging accommodate this pattern by adding tenant as an attribute on the user and expressing per-tenant decisions as targeting rules.

That accommodation works. Until it doesn't.

03 · The market

How the market handles tenants today.

The platforms in this section are production-grade, well-funded, and widely deployed. The criticism is narrow: their multi-tenancy model was designed after their core architecture, and that sequence has consequences.

Platform-by-platform

Not every platform handles tenants the same way. A fact-check of public documentation ... conducted against primary sources, not vendor comparison pages ... reveals meaningful differences in how each tool models the tenant.

LaunchDarkly has the most sophisticated tenant model in the market. Its "contexts" system, introduced to replace the older user-centric model, allows non-user entities to be first-class evaluation targets. You can define an organization context kind and write targeting rules that evaluate against organisation attributes directly, without routing through a user identity. An AWS Partner Network reference architecture demonstrates this pattern explicitly: create a tenant context kind, assign a plan attribute, build segments per tier, and target flags against those segments. This is real multi-tenant engineering. It works. The question is not whether LaunchDarkly can handle tenants ... it can ... but whether expressing every per-tenant decision as a rule is the right abstraction at scale.

Split / Harness Feature Management & Experimentation takes a similar direction with its "traffic types" concept. Traffic types define the entity being evaluated ... user, account, customer, machine ... and allow per-entity targeting and rollout. For B2B platforms, defining an account traffic type and rolling out by account is a documented, first-class pattern. Harness completed its acquisition of Split in 2024, and the combined product positions account-level evaluation as a core capability.

DevCycle explicitly supports rollout randomisation by Account, Organisation, Tenant, or Store ID through a custom-property randomisation feature. This lets operators choose which identifier is used for percentage bucketing, making tenant-level gradual rollouts a native operation rather than a workaround.

Statsig does not expose a named tenant object, but its custom IDs and unit ID conditions make company-level bucketing a supported native pattern. You can define a companyID as a unit type and evaluate flags against it. This is not a workaround. It is an intentional design that generalises across entity types. Statsig's acquisition by OpenAI (announced 2025, pending close at time of writing) introduces some uncertainty about the product's independent roadmap, but the current evaluation model is sound.

Unleash supports tenant rollout through custom context fields and configurable stickiness. You can add a tenantId field to the Unleash context and use constraints to target flags per-tenant. The stickiness configuration lets you hash on tenantId for consistent bucketing. This is a generic context system ... flexible, but not tenant-opinionated. The operator must know to configure stickiness correctly, and the system does not enforce tenant boundaries.

GrowthBook targets tenants through attributes and saved groups. Its warehouse-native analytics model is strong for experimentation ... GrowthBook runs analysis directly in your data warehouse, which is a genuine architectural advantage for teams that care about data ownership. But analytics integration is not the same as a tenant-native rollout primitive. GrowthBook's pricing page references a "Multi-tenant Mode" for self-hosted installations, which suggests the team recognises the use case, even if the current targeting model handles it through generic attributes.

Flagsmith uses identities, traits, and segments ... the least tenant-opinionated model in this group. Tenancy is expressed by setting a tenantId trait on the identity and building segments that match against it. This works, but it means every tenant operation ... rollout, override, audit ... routes through the identity and segment abstractions.

Credit where it is due: LaunchDarkly's contexts, Split's traffic types, and DevCycle's custom-property randomisation are real engineering. They work for teams with a moderate number of tenants and engineers who understand the targeting model deeply. The platforms in this space are production-grade, well-funded, and widely deployed.

Where the model strains

The strain shows in three specific places.

Rule explosion

Consider a flag that needs to be independently configured for 12 tenants across 3 tiers. In an attribute-based model, this is a targeting rule with up to 36 clauses. Each clause is a condition: "if tenant is X and tier is Y, serve variation Z." Add a second flag with the same requirements and you have 72 clauses. Multiply by the number of flags a B2B platform ships in a quarter ... often 30 to 50 ... and the rule grid becomes a spreadsheet that happens to live inside a flag tool.

This is not hypothetical. LaunchDarkly's own documentation notes that SDK initialisation time scales with the combination of total flags, variation size, and the number and complexity of targeting rules across all flags. The rule engine is doing work that the data model should be doing.

Figure 02 · Rule explosion · 12 tenants × 3 tiers · one flag
Tenant      Standard   Premium   Canonical
Tenant A    off        off       100%
Tenant B    off        100%      off
Tenant C    off        50%       off
Tenant D    off        100%      off
Tenant E    25%        off       off
Tenant F    100%       off       off
Tenant G    10%        off       off
Tenant H    off        off       off
Tenant I    100%       off       off
Tenant J    off        100%      off
Tenant K    off        75%       off
Tenant L    off        off       off
Attribute model: 36 clauses for one flag. Now multiply by 40 flags. · Structural model: 12 tenant documents, one override each. No grid.
A targeting-rule grid for a single flag across 12 tenants and 3 tiers. Each cell is a clause in the rule engine. The structural equivalent is twelve documents.

Identity-keyed performance

When the platform's primary evaluation key is the user identity and tenancy is an attribute on that identity, per-tenant aggregations require scanning the identity store. Flagsmith's architecture evolution illustrates this concretely. Their original Core API used relational storage for identity evaluation. As identity volumes grew, they migrated to an Edge API backed by DynamoDB global tables for low-latency evaluation at scale. Their local evaluation mode ... the fastest SDK path ... was shipped without identity override support because including identity data in the environment document would create unbounded payload sizes. These are rational engineering decisions, but they are consequences of an identity-keyed architecture being asked to do tenant-scoped work.

Audit fragmentation

"Who turned on flag X for Tenant B in production on March 15th?" In an attribute-based model, answering this question requires reconstructing the targeting rule history for the flag, finding the diff that added or modified the clause referencing Tenant B, and correlating that diff with the user who made the change. In a structural model, the answer is a direct query against the tenant's audit log.

The reconstruction process is worth spelling out, because it is the part that costs real engineering time. Step one: open the flag's change history. Step two: scroll through rule diffs until you find one that mentions the tenant. Step three: read the diff to understand what changed ... was a clause added, modified, or reordered? Step four: determine whether the change was the intended one or a side effect of a broader rule edit that touched multiple tenants. Step five: correlate the change with an approval record, if one exists. This process takes minutes per flag. Multiply by a compliance review covering 50 flags across three tenants, and the audit exercise takes a full engineering day.

This matters most during compliance reviews. SOC 2 Type II evidence collection asks for proof that changes were authorised, reviewed, and scoped correctly. When the audit trail is a sequence of rule diffs, evidence collection is a manual reconstruction exercise. When the audit trail is a per-tenant log with timestamps, actors, and before/after values, evidence collection is a database query.

Fair framing

These platforms are not wrong. They were built for monolithic SaaS, and multi-tenancy was added later as an attribute. LaunchDarkly's contexts system is the most ambitious attempt to generalise beyond user-identity evaluation, and it works well for many teams. The architectural decision predates the use case. The question is whether the accommodation is sufficient ... or whether the use case deserves its own primitive.

04 · The distinction

Two models. Same goal. Different structures.

Different operational outcomes follow from different data shapes. The distinction is not stylistic. It is structural.

Attribute model

The tenant is a key-value pair on the user identity (or, in LaunchDarkly's more general model, a context attribute on a context kind). The flag system stores no tenant entity. Per-tenant decisions are expressed as conditions in the rule engine. The rule engine evaluates the user's attributes at request time and resolves to a variation.

JSON · Attribute model · schematic
User {
  id: "user_abc",
  attributes: {
    tenantId: "carrier_b",
    tier: "premium",
    region: "eu-west"
  }
}

Flag "new_pairing_flow" {
  rules: [
    { if user.tenantId in ["carrier_a"] → true },
    { if user.tenantId in ["carrier_b"] AND user.tier == "premium" → true },
    { if user.tenantId in ["carrier_c"] → false },
    { default → false }
  ]
}

The tenant's state is distributed across every flag's rule set. There is no single place to look up "what does Carrier B see?"

Structural model

The tenant is a first-class document with its own ID, metadata, override hierarchy, rollout configuration, and audit stream. Per-tenant decisions are reads and writes against that document, not rule evaluations.

JSON · Structural model · schematic
Flag "new_pairing_flow" {
  environments: {
    production: {
      defaultValue: false,
      tenants: {
        carrier_a: { value: true, updatedBy: "eng@co", updatedAt: "..." },
        carrier_b: { value: true, rollout: 25, updatedBy: "pm@co", updatedAt: "..." },
        carrier_c: { value: false, updatedBy: "compliance@co", updatedAt: "..." }
      }
    }
  }
}

The tenant's state is co-located. One read answers "what does Carrier B see for this flag?" One query across flags answers "what does Carrier B see for everything?"

Four operations compared

"Turn flag X on for Tenant B in production."

Attribute model: open the flag's targeting rules, find or create a clause referencing tenantId == carrier_b, set its variation to true, save the rule set. The change is recorded as a diff to the rule array.

Structural model: navigate to flag X → production → carrier_b, set value to true, save. The change is recorded as a write to the tenant override document with actor, timestamp, and previous value.

"Show me all flag changes for Tenant B in the last 30 days."

Attribute model: query the audit log for all flag changes in the environment, filter for rule diffs that mention carrier_b in a clause, reconstruct the before/after state from the diff. This requires parsing rule structures and understanding clause semantics.

Structural model: query the audit log with { tenant: "carrier_b", timestamp: { $gte: 30_days_ago } }. Direct read. No reconstruction.
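
To make the contrast concrete, here is a minimal sketch of that structural-model query, written with Firestore's admin SDK (the store Argus uses, per Section 6). The collection path and field names are illustrative assumptions, not Argus's published schema.

TypeScript · audit query · sketch
import { getFirestore, Timestamp } from "firebase-admin/firestore";

// Tenant-scoped audit read: a direct query, no rule-diff reconstruction.
// Path and field names are illustrative assumptions.
async function tenantAuditLast30Days(flagKey: string, tenantId: string) {
  const db = getFirestore();
  const since = Timestamp.fromMillis(Date.now() - 30 * 24 * 60 * 60 * 1000);
  const entries = await db
    .collection(`flags/${flagKey}/audit`)
    .where("tenant", "==", tenantId)   // scoped to one tenant
    .where("timestamp", ">=", since)   // last 30 days
    .orderBy("timestamp", "desc")      // needs a composite index on (tenant, timestamp)
    .get();
  // Each entry already carries actor, before/after values, and timestamp.
  return entries.docs.map((d) => d.data());
}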

"Roll flag X out to 25% of users in Tenant A and 100% of users in Tenant B simultaneously."

Attribute model: create two targeting rules on the same flag. Rule 1: if tenantId == tenant_a, serve a 25% rollout. Rule 2: if tenantId == tenant_b, serve true. Rule ordering matters. Interaction with other rules must be verified manually.

Structural model: set tenant_a's override to { rollout: 25 } and tenant_b's override to { value: true }. Each tenant's configuration is independent. No rule ordering. No interaction risk.

"Audit which tenants have flag X enabled."

Attribute model: parse the flag's complete rule set, evaluate each clause to determine which tenant identifiers would resolve to true, account for rule ordering and default fallthrough. This is a rule-evaluation problem.

Structural model: read the tenant overrides collection for the flag. Each document states the value directly. This is a data-retrieval problem.

Figure 03 · Argus document hierarchy · data plane
Flag · document
name · description · type
Environment · production
defaultValue · updatedBy · updatedAt
Tenant override · carrier_a
value · rollout? · updatedBy · updatedAt
Tenant override · carrier_b
value · rollout? · updatedBy · updatedAt
Tenant override · carrier_c
value · rollout? · updatedBy · updatedAt
Audit log · flag-scoped
actor · env · tenant? · field · previousValue · newValue · timestamp
Tenant state is a document read, not a rule evaluation. Audit is a native query, not a diff reconstruction.

Operational implications

SOC 2 evidence collection becomes a query: "show all changes to tenant_b across all flags in production for Q1." In the attribute model, this requires log parsing, rule-diff interpretation, and manual correlation.

Tenant offboarding becomes a cascade delete or archive on the tenant's override documents. In the attribute model, offboarding requires scanning every flag's rule set for clauses referencing the departing tenant and removing them individually.

Tenant-scoped RBAC becomes a permission on the tenant document: "this operator can modify overrides for tenant_b but not tenant_a." In the attribute model, RBAC must be enforced as a guard on the rule editor, which is harder to scope precisely.
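
The offboarding implication can be sketched. Under the structural model, a cascade delete is one collection-group query over the departing tenant's override documents. The sketch below assumes the tenant ID is denormalised onto each override document, since Firestore collection-group queries filter on fields rather than document IDs.

TypeScript · tenant offboarding · sketch
import { getFirestore } from "firebase-admin/firestore";

// One collection-group query finds every override for the departing
// tenant across all flags and environments; one batch removes them.
// Assumes each override document carries a tenantId field.
async function offboardTenant(tenantId: string): Promise<number> {
  const db = getFirestore();
  const overrides = await db
    .collectionGroup("tenants")
    .where("tenantId", "==", tenantId)
    .get();

  const batch = db.batch(); // batches cap at 500 writes; chunk for large fleets
  overrides.docs.forEach((doc) => batch.delete(doc.ref));
  await batch.commit();
  return overrides.size; // number of overrides removed
}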

The structural model is not universally better. For platforms with one tenant or two tenants, the attribute model is simpler and sufficient. The structural model earns its complexity at the point where per-tenant operations ... overrides, audits, rollouts, offboarding ... become a recurring operational cost rather than an occasional exception.

05 · Patterns

A model for tenant-tier rollouts.

Four patterns any team operating a multi-tenant platform should consider, regardless of which flag tool they use. For each, the data-model and audit primitives required ... and where attribute-based platforms require workarounds.

Pattern 1 · Canonical / Premium / Standard tiers

Define one "canonical" tenant that always has the latest features. This is often the vendor's own internal deployment or a design partner. New features ship to canonical first. Once validated ... meaning the feature has been running in production under real load for a defined period ... it promotes to Premium tenants. After a further stabilisation window, it promotes to Standard.

This creates a predictable rollout gradient: canonical → premium → standard. Each tier is a set of tenants, not a set of users. The rollout decision is "which tier is this feature in?" not "which users should see this feature?"

What this requires: the ability to group tenants into tiers; the ability to set a flag's default value per tier; the ability to override at the individual tenant level when a specific tenant needs to deviate from its tier's default.

Attribute-model workaround: create a segment per tier, add a tier attribute to each user context, target the segment in the flag's rules. This works, but the tier-to-tenant mapping is stored in the segment definition, not in the tenant's own configuration. Promoting a tenant from Standard to Premium means editing the segment, not the tenant. And if the segment is shared across flags, editing it to promote one tenant changes that tenant's tier for every flag that targets the segment ... even flags where the promotion is not yet validated. The coupling between tier membership and flag evaluation is implicit, which makes partial promotions (Tenant X gets Premium for flag A but stays Standard for flag B) difficult to express without creating per-flag segments, which re-creates the rule explosion problem at the segment level.
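
The precedence this pattern implies ... tenant override, then tier default, then environment default ... can be stated in a few lines. A minimal sketch with hypothetical shapes; Argus's actual schema appears in Section 6.

TypeScript · tier precedence · sketch
// Hypothetical shapes illustrating Pattern 1's resolution order.
type TierName = "canonical" | "premium" | "standard";

interface EnvironmentConfig {
  defaultValue: boolean;
  tierDefaults?: Partial<Record<TierName, boolean>>;
  tenantOverrides?: Record<string, boolean>;
}

function resolveForTenant(
  env: EnvironmentConfig,
  tenantId: string,
  tier: TierName
): boolean {
  if (env.tenantOverrides && tenantId in env.tenantOverrides) {
    return env.tenantOverrides[tenantId]; // individual tenant deviation wins
  }
  if (env.tierDefaults && tier in env.tierDefaults) {
    return env.tierDefaults[tier]!;       // tier default
  }
  return env.defaultValue;                // environment default
}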

Pattern 2 · Per-tenant gradual rollout

The same flag at 10% in Tenant A and 100% in Tenant B. This is common when one tenant is cautious and wants a gradual ramp while another has already validated the feature in staging and wants it immediately.

Deterministic per-user bucketing matters here. If end users can move between tenants ... for example, a contractor who works with multiple PSP partners ... the bucketing must be stable per user, not per tenant-user pair. The rollout percentage is scoped to the tenant, but the hash input must include a user-stable identifier to avoid re-bucketing on tenant switch.

What this requires: per-tenant rollout percentage configuration; deterministic hashing that produces identical bucket assignments across platforms (iOS, Android, server); the ability to inspect a specific user's bucket assignment for debugging.

Attribute-model workaround: create two targeting rules on the same flag, one per tenant, each with its own percentage rollout. Rule ordering must be verified. Adding a third tenant means adding a third rule. At 20 tenants with independent rollout percentages, the flag has 20 rules, each with its own rollout configuration.
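
In the structural model, the same pattern is data rather than rules: the rollout percentage lives on each tenant's override document, and bucketing is per-user (the hashing sketch is in Section 6). A minimal sketch with illustrative shapes:

TypeScript · per-tenant rollout · sketch
// Illustrative override shape: the rollout lives on the tenant document.
interface TenantOverride {
  value: boolean;
  rollout?: number; // 0–100; present only during a gradual ramp
}

// userBucket is the user's stable 0–99 bucket (see Section 6).
function isEnabledForUser(o: TenantOverride, userBucket: number): boolean {
  if (o.rollout === undefined) return o.value; // hard on/off
  return o.value && userBucket < o.rollout;    // ramp scoped to this tenant
}

// Tenant A at 25% and Tenant B at 100% are two independent documents:
// no rule ordering, no interaction risk.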

Pattern 3 · Tenant-scoped emergency rollback

A feature is live across all tenants. Tenant C reports a critical issue. The operator needs to disable the flag for Tenant C within minutes, without affecting Tenant A or Tenant B, and with same-day audit evidence that the rollback was scoped correctly.

This is the most time-sensitive operation in multi-tenant feature management. The operator is under pressure. The blast radius must be contained. The audit trail must be clean.

What this requires: the ability to override a single tenant's flag value without touching other tenants; an immutable audit log entry recording the actor, timestamp, previous value, and new value; the ability to produce this evidence within hours, not days.

Attribute-model workaround: add an individual targeting rule or clause for Tenant C that overrides the default. This works mechanically, but the audit evidence is a rule diff showing a new clause was added. Proving that only Tenant C was affected requires demonstrating that no other clauses were modified in the same change ... which requires diff analysis rather than a scoped log entry.
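
Under the structural model, the rollback and its evidence are a single transactional write. A hedged sketch ... the paths and field names are assumptions, not Argus's published schema:

TypeScript · tenant-scoped rollback · sketch
import { getFirestore, FieldValue } from "firebase-admin/firestore";

// One transaction flips a single tenant's override and appends an
// immutable audit entry scoped to exactly that tenant.
async function rollbackForTenant(
  flagKey: string,
  env: string,
  tenantId: string,
  actor: string
) {
  const db = getFirestore();
  const overrideRef = db.doc(
    `flags/${flagKey}/environments/${env}/tenants/${tenantId}`
  );
  const auditRef = db.collection(`flags/${flagKey}/audit`).doc();

  await db.runTransaction(async (tx) => {
    const before = await tx.get(overrideRef); // reads precede writes
    tx.set(overrideRef, { value: false, updatedBy: actor }, { merge: true });
    tx.set(auditRef, {
      actor,
      env,
      tenant: tenantId,
      field: "value",
      previousValue: before.get("value") ?? null,
      newValue: false,
      timestamp: FieldValue.serverTimestamp(),
    });
  });
}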

Pattern 4 · White-label feature policy

The tenant's own administrators control a subset of flags scoped to their tenant. Tenant B's product manager wants to enable a feature for their users without asking the platform vendor to make the change.

This is the most advanced pattern. It requires tenant-scoped RBAC: the ability to grant an external operator write access to a specific tenant's overrides without granting access to other tenants or to the flag's global configuration.

This pattern also has an economic argument. Every time a tenant's administrator asks the platform vendor to toggle a flag, that request consumes support and engineering time. At five tenants, this is manageable. At fifty, it is a staffing problem. Self-serve tenant-level feature control turns a support cost into a product feature. But it only works if the access boundaries are structural. An external operator who can accidentally read another tenant's configuration is a data exposure incident, not a product feature.

What this requires: tenant-scoped permissions; a UI or API that allows tenant-level operators to modify only their tenant's overrides; audit entries that distinguish between platform-level and tenant-level changes; the ability to define which flags are tenant-controllable and which are platform-controlled (not all flags should be exposed to tenant operators).

Attribute-model workaround: there is no clean workaround. Granting external operators access to the rule engine means granting them the ability to see (and potentially modify) rules that affect other tenants. Most teams solve this by building a custom abstraction layer on top of their flag tool ... which is, in effect, building a tenant-as-primitive layer themselves.
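
The boundary that has no clean attribute-model equivalent is small when it is structural. A hypothetical server-side guard, with illustrative claim shapes:

TypeScript · tenant-scoped RBAC guard · sketch
// Hypothetical operator claims: tenant-level operators carry exactly
// one tenantId; platform operators carry none.
interface OperatorClaims {
  role: "platform" | "tenant";
  tenantId?: string;
}

// A tenant operator may write only their own tenant's overrides,
// never another tenant's and never the flag's global configuration.
function canWriteOverride(
  claims: OperatorClaims,
  targetTenantId: string
): boolean {
  if (claims.role === "platform") return true;
  return claims.role === "tenant" && claims.tenantId === targetTenantId;
}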

Figure 04 · Tenant-tier rollout matrix · 5 flags · 8 tenants · 3 tiers
Tier        Tenant     pairing_flow   ai_review   3ds_default   fraud_score   telemetry_v2
Canonical   Tenant A   100%           100%        100%          100%          100%
Premium     Tenant B   100%           50%         100%          100%          25%
Premium     Tenant C   100%           50%         100%          100%          off
Premium     Tenant D   100%           off *       100%          100%          25%
Standard    Tenant E   25%            off         off           off           off
Standard    Tenant F   25%            off         off           off           off
Standard    Tenant G   10%            off         off           off           off
Standard    Tenant H   off            off         off           off           off
Legend: 100% = fully on · off = off · other percentages = partial rollout · * = tenant exception (overrides tier default)
Tenant D in the Premium tier has the AI-review flag overridden to off, despite the tier default being on. Per-tenant exception handling without rule grids.

06 · The implementation

Argus in practice.

Argus is the implementation of the structural model described in Section 4. It is not the only possible implementation. Argus is one such build, in production, with specific technology choices worth examining.

Architecture

Argus is a React frontend backed by Cloud Firestore, with Cloud Functions enforcing server-side business rules for approval workflows and audit logging.

The choice of Firestore is deliberate. Firestore's document-subcollection model maps directly onto the flag → environment → tenant hierarchy. A flag is a document. Each environment is a document in a subcollection. Each tenant override is a document in a subcollection under the environment. This is not a relational schema forced into a document store. The data model and the storage model are the same shape.

RBAC is enforced at three levels: Super User (full access), Approver (can approve changes but not self-approve), and Contributor (can propose changes, cannot approve). Tenant-scoped permissions are layered on top: an operator can be a Contributor for Tenant A and have no access to Tenant B.

Figure 05 · Argus system architecture · tenant as primitive
Operate · Web dashboard · Propose, review, and approve flag changes. Updates stream in real time over Firestore onSnapshot.
Consume · iOS & Android SDKs · Resolve flag values on demand for a given tenant and environment.
Cloud Functions · RBAC, the approval state machine, and audit triggers — all server-enforced. A client cannot bypass the log or self-approve a change.
resolveFlags endpoint · Reads current state and returns resolved values. A tenant override wins over the environment value.
Firestore — system of record
Customer · customerId
Flags → Environments → Tenant overrides · Tenants · Users · Conditions · Audit log
One deployment, every customer. Every document carries customerId and the security rules scope each read and write to it. The tenant override is a node in the tree — not a clause evaluated against a user identity.

Two real-time surfaces, two mechanisms

"Real-time" in Argus means two different things, served by two different mechanisms. Conflating them is a common source of imprecision, so they are stated separately.

The operator dashboard is genuinely real-time. The Argus web UI subscribes to Firestore through onSnapshot listeners. When one operator changes a flag, every other operator's dashboard reflects it within Firestore's replication window ... typically under a second ... with no polling, no WebSocket layer, and no pub/sub middleware. The database is the change channel. This is a property the document store provides directly, and it is the right tool for the job: a small number of authenticated operator sessions watching a shared, governed dataset.
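
For concreteness, the dashboard's live path looks roughly like this with the Firebase web SDK. The document path is illustrative, and the snippet assumes initializeApp() has already run:

TypeScript · dashboard subscription · sketch
import { getFirestore, doc, onSnapshot } from "firebase/firestore";

// One onSnapshot listener per visible document. Fires within
// Firestore's replication window after any operator's write; no
// polling, no bespoke WebSocket layer.
const db = getFirestore();
const unsubscribe = onSnapshot(
  doc(db, "flags/new_pairing_flow/environments/production"),
  (snap) => {
    console.log("production default is now", snap.get("defaultValue"));
  }
);
// Call unsubscribe() when the panel unmounts.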

SDK flag delivery is a request/response resolution model. Client SDKs do not hold Firestore subscriptions. They call the resolution endpoint and receive a snapshot of resolved flag values. To observe a change, an SDK resolves again ... on app launch, on foreground, or on a polling interval the integrating team controls. This is a deliberate choice for the current stage: it keeps the SDK dependency-free (a plain HTTPS client, no persistent connection), it keeps the client decoupled from the storage layer, and it makes the resolution path cacheable at the edge. The cost is propagation latency bounded by the SDK's resolve cadence rather than by Firestore's replication window.

Firestore is the system of record; it is not the client delivery edge. Every flag, environment, tenant override, approval, and audit entry lives in Firestore, where the governance model ... RBAC, approval workflow, immutable audit ... is enforced. The resolution endpoint reads from that same store. This unifies the write path and the read path behind one consistent, governed source. What it deliberately does not do today is push individual flag changes to client devices.

Real-time client delivery is a defined roadmap item, not a present claim. A future stage may introduce a push-oriented read edge ... a server-sent-events stream, or a projection of resolved values into a store optimised for fan-out subscriptions, fed by Firestore triggers. That is a separate architectural decision with its own trade-offs: persistent-connection infrastructure, cost per connected client, and the staleness model. It will be evaluated when the SDK client libraries reach that stage. The honest present-tense statement is: the dashboard is real-time, SDK delivery is resolve-on-demand, and the system of record is Firestore.

Deterministic bucketing

Percentage rollouts use FNV-1a hashing. The hash input is {flagKey}:{userId}, producing a 32-bit integer mapped to a 0–100 bucket. The same algorithm runs in the resolution endpoint, the iOS SDK, and the Android SDK. A user hashed into bucket 37 on the server is in bucket 37 on iOS and bucket 37 on Android. No drift. No inconsistency.

This is not novel. Most flag platforms use deterministic hashing. What matters is that the hash is scoped to the user, not the tenant-user pair. A user who appears in multiple tenants gets the same bucket assignment regardless of tenant context. The rollout percentage is per-tenant. The bucketing is per-user.
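
A sketch of the bucketing described above, using the standard 32-bit FNV-1a parameters. The modulo mapping to 0–99 and the ASCII-only input are simplifying assumptions; a production SDK would hash UTF-8 bytes.

TypeScript · FNV-1a bucketing · sketch
// 32-bit FNV-1a over "{flagKey}:{userId}", mapped to a 0–99 bucket.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;                 // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);         // assumes ASCII code points
    hash = Math.imul(hash, 0x01000193);  // FNV prime, 32-bit multiply
  }
  return hash >>> 0;                     // force unsigned 32-bit
}

function bucketFor(flagKey: string, userId: string): number {
  return fnv1a32(`${flagKey}:${userId}`) % 100;
}

// Same input, same bucket, on server, iOS, and Android. The tenant is
// deliberately absent from the hash input, so a user who appears in
// multiple tenants keeps one bucket.
const inRollout = bucketFor("new_pairing_flow", "user_abc") < 25; // 25% ramp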

SDK integration

Swift · iOS · argus-sdk
let argus = ArgusManager(
    apiURL: URL(string: "https://flags.example.com/v1/resolve")!,
    tenantId: "telecom_ca",
    userId: currentUser.id
)

if argus.isEnabled("new_pairing_flow") {
    showNewPairingFlow()
}
Kotlin · Android · argus-sdk
val argus = ArgusFeatureFlagService(
    apiUrl = "https://flags.example.com/v1/resolve",
    tenantId = "telecom_ca",
    userId = currentUser.id
)

if (argus.isEnabled("new_pairing_flow")) {
    showNewPairingFlow()
}
Bash · resolution endpoint · HTTP
# Request
curl -s "https://flags.example.com/v1/resolve" \
  -H "Content-Type: application/json" \
  -d '{
    "tenantId": "telecom_ca",
    "userId": "user_abc",
    "flags": ["new_pairing_flow", "ai_contract_review"]
  }'

# Response
{
  "new_pairing_flow":   { "value": true,  "source": "tenant_override" },
  "ai_contract_review": { "value": false, "source": "environment_default" }
}

The source field in the response is a debugging aid. It tells the consumer whether the resolved value came from a tenant override, an environment default, or a rollout evaluation. This is the kind of metadata that makes production debugging tractable.

Performance characteristics

Argus does not publish benchmark numbers because the resolution path depends on Firestore's latency characteristics, which vary by region and document size. What can be stated: the resolution endpoint performs one document read per flag per tenant (the tenant override document), with Firestore's caching layer handling repeat reads. The payload is a JSON object keyed by flag name. There is no rule evaluation at resolution time. The value is a direct read, not a computation.
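
The read path can be sketched in a few lines. Collection names follow Figure 05's hierarchy but are assumptions, and the rollout branch is elided (it would hash the user into a bucket as described above):

TypeScript · resolution read path · sketch
import { getFirestore } from "firebase-admin/firestore";

// One override read per flag per tenant, with fallback to the
// environment default. No rule evaluation: the value is a field.
async function resolveFlag(
  customerId: string,
  flagKey: string,
  env: string,
  tenantId: string
): Promise<{ value: boolean; source: string }> {
  const db = getFirestore();
  const envRef = db
    .collection("customers").doc(customerId)
    .collection("flags").doc(flagKey)
    .collection("environments").doc(env);

  const override = await envRef.collection("tenants").doc(tenantId).get();
  if (override.exists) {
    return { value: override.get("value"), source: "tenant_override" };
  }

  const envDoc = await envRef.get();
  return { value: envDoc.get("defaultValue"), source: "environment_default" };
}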

Figure 06 · End-to-end request flow · change → SDK
01 · Operator
Dashboard
Sets new_pairing_flow to true for tenant telecom_ca.
02 · Server
Cloud Function
Validates RBAC, checks approval, writes tenant override + audit row. Client cannot bypass the log.
03 · Data plane
Firestore
Change is now the system of record. Operator dashboards subscribed via onSnapshot reflect it within a second.
04 · Edge
Resolve endpoint
Next SDK resolve call reads current Firestore state and returns it. Source field tagged tenant_override.
05 · Client
SDK · iOS / Android
argus.isEnabled("new_pairing_flow") resolves to true.
Five steps. Server-enforced audit at step 02; Firestore as system of record at step 03; resolve-on-demand SDK delivery at step 04.

07 · Case study

One codebase, three tenants.

A representative scenario: a product team ships a single application ... iOS and Android apps, a tenant portal, and supporting services ... under three distinct customer brands in three countries. The same codebase. Three distinct feature surfaces.

Before Argus

The team used Firebase Remote Config for feature flagging. Per-tenant configuration was managed by maintaining separate Remote Config projects per environment, with tenant-specific values set manually. This produced environment proliferation: 3 tenants × 3 environments (dev, staging, production) × 2 platforms (iOS, Android) = 18 configuration surfaces. The team referred to this informally as "the matrix."

Audit was the sharpest pain. When one tenant's compliance team asked for evidence of feature changes in the prior quarter, the engineering team spent two days reconstructing the history from Remote Config version diffs, Jira tickets, and Slack messages. The answer was delivered as a spreadsheet, not a system-generated report.

Per-tenant rollout was manual. A flag that needed to be at 100% for the first tenant, 25% for the second, and off for the third required three separate Remote Config updates, each in the correct project, each verified manually.

What Argus shipped

Argus replaced the matrix with a single flag store. Each flag has one configuration document per environment. Each environment has override documents per tenant. The 18-surface matrix collapsed to 1 flag store × 3 environments × 3 tenants = 9 tenant override documents per flag.

At the time of this writing, Argus manages active flags across three tenants, with per-tenant overrides and rollout percentages configured independently. The audit log captures every change with actor, timestamp, environment, tenant, and before/after values.

Time to ship a tenant-scoped change went from "open the correct Remote Config project, find the parameter, update the value, verify you're in the right project, document the change in Jira" (15–30 minutes per change, plus risk of project confusion) to "open the flag, click the tenant, set the value" (under a minute, audit-logged automatically).

What broke

The first version stored tenant overrides as fields within the environment document rather than as subcollection documents. This worked until the team needed to set up Firestore security rules that allowed tenant-scoped RBAC ... because field-level security in Firestore is limited compared to document-level rules. The data model was refactored to use subcollections, which aligned the security boundary with the data boundary. This took two days and required a migration script.

A second trade-off emerged around Firestore's subcollection query model. Querying "which tenants have flag X enabled?" across all environments requires a collection group query, which works but requires a composite index. Querying "what does Tenant B see across all flags?" requires iterating over flag documents and reading each tenant subcollection. Firestore's document model optimises for the per-tenant-per-flag read path (the hot path for SDK resolution) at the cost of cross-flag aggregation queries (the warm path for dashboard views). This trade-off was accepted because the resolution path is latency-sensitive and the dashboard path is not.
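
The warm-path shape is worth seeing. A hedged sketch of the dashboard query "what does Tenant B see across all flags?", iterating flag documents and reading each tenant subcollection; paths are assumptions:

TypeScript · cross-flag tenant view · sketch
import { getFirestore } from "firebase-admin/firestore";

// Dashboard path: linear in flag count. Accepted because this path is
// not latency-sensitive, unlike SDK resolution.
async function tenantView(env: string, tenantId: string) {
  const db = getFirestore();
  const flags = await db.collection("flags").get();
  const view: Record<string, unknown> = {};
  for (const flag of flags.docs) {
    const override = await flag.ref
      .collection("environments").doc(env)
      .collection("tenants").doc(tenantId)
      .get();
    if (override.exists) view[flag.id] = override.get("value");
  }
  return view;
}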

The lesson: the document hierarchy is not just a storage convenience. It is the security model. And the query model follows the document model ... optimise the hierarchy for the hot path, accept the cost on the cold path.

Operator perspective

"Before, I had to remember which Remote Config project belonged to which tenant, and I was always one click away from pushing a change to the wrong tenant. Now the tenant is right there in the UI. I pick the tenant, I see its overrides, I make the change, and the audit log writes itself. The compliance team gets their evidence in a query, not a spreadsheet."

Illustrative quote · reflecting the operational improvement described by the engineering team

08 · Roadmap

What's next, and the AI angle.

Argus is a production system, not a finished product. Several capabilities are planned or in progress.

Code references

Scan the consuming repositories to find where each flag key is read in the codebase. Surface this in the dashboard: "flag new_pairing_flow is referenced in PairingViewController.swift:142, PairingFragment.kt:89, and pairing-api.ts:34." This makes dead flag detection concrete. If a flag key has no code references and has been stable for 90 days, it is a candidate for cleanup. Dead flags are not just clutter. They are cognitive load on every engineer who encounters them and wonders whether they are safe to remove.

AI-assisted tenant rollout suggestions

Given historical rollout data ... which tenant tiers received which flags, how long each flag spent at each rollout percentage before promotion, which flags were rolled back and why ... suggest a rollout plan for a new flag. "Based on your last 20 rollouts, new flags typically spend 2 weeks at canonical, then 1 week at 50% in Premium before full promotion. Suggest: canonical today, Premium at 50% on June 15, Premium at 100% on June 22, Standard at 25% on July 1."

This is operational AI, not generative AI. It does not write the flag configuration. It suggests a rollout schedule based on observed patterns and lets the operator accept, modify, or reject.

Web SDK and edge resolution

The current resolution endpoint serves iOS and Android. A JavaScript SDK and an edge-resolution layer (for server-side rendering and API middleware) are planned. The JavaScript SDK is a prerequisite for the tenant portal consumer.

SOC 2 readiness

The audit log already captures the data SOC 2 Type II requires: who changed what, when, with what authorisation. The remaining work is export formatting, retention policies, and access controls on the log itself. This is compliance packaging, not architectural change.

What the AI angle is not

Argus is not entering the prompt management space. LaunchDarkly's AI Configs (GA as of May 2025) and Statsig's prompt evaluation tools address a different problem: managing LLM prompts and model configurations as runtime variables. That is a valid use case, but it is not the problem Argus solves. Argus's AI work is operational: reduce the toil of managing flags at scale across many tenants. Not manage the AI. Manage the flags.

09 · Appendix

Glossary, FAQ, sources.

Glossary

Tenant
A customer organisation that consumes the platform. In Argus, a tenant is a document in the data model with its own override hierarchy and audit stream. In attribute-based platforms, a tenant is a value on the user identity.
Environment
A deployment stage (dev, staging, production). Each environment has its own flag configurations and, in Argus, its own per-tenant overrides.
Override
A tenant-specific value for a flag within an environment. Overrides take precedence over the environment's default value.
Rollout
A percentage-based gradual exposure of a flag's value. In Argus, rollouts are configured per-tenant, not globally.
Segment
A group of users or contexts defined by shared attributes. In attribute-based platforms, segments are the primary mechanism for per-tenant targeting. In Argus, segments are not needed for tenant targeting because the tenant is a structural primitive.
Resolution
The process of determining a flag's value for a specific user in a specific tenant in a specific environment. In attribute-based platforms, resolution is rule evaluation. In Argus, resolution is a document read plus optional rollout hashing.
Identity
A user or entity being evaluated. In attribute-based platforms, the identity carries tenant information as an attribute. In Argus, tenant and user identity are separate inputs to the resolution call.
Attribute
A key-value pair on an identity or context. The mechanism by which attribute-based platforms encode tenant information.
Structural primitive
A first-class entity in the data model ... not a value on another entity, but its own document with its own schema, permissions, and audit trail.

FAQ

How is this different from LaunchDarkly Custom Attributes?

LaunchDarkly's contexts and context kinds (such as organization) allow non-user entity evaluation, which is more sophisticated than simple attribute matching. The distinction is architectural: in LaunchDarkly, per-tenant decisions are expressed as targeting rules evaluated by the rule engine. In Argus, per-tenant decisions are stored as override documents read from the data layer. The operational difference shows up in audit (query vs. reconstruction), offboarding (cascade delete vs. rule scanning), and RBAC (document-level permissions vs. rule-engine guards).

Can I use Argus alongside an existing flag tool?

Yes. Argus handles tenant-scoped flags. Your existing tool handles non-tenant flags ... A/B tests, user-level personalisation, experimentation. The two systems coexist. The SDK calls are independent.

What about end-user experimentation and A/B testing?

Argus is not an experimentation platform. It does not compute statistical significance, run multi-armed bandits, or manage experiment lifecycles. If your primary need is experimentation, use a tool built for that. If your primary need is tenant-scoped feature management with clean audit trails, that is what Argus does.

Self-hosted option?

Argus is built on Firebase (Firestore + Cloud Functions) and React. The entire stack can run on a Google Cloud project you own. There is no proprietary backend service. Self-hosting means deploying the Cloud Functions and hosting the React frontend on your own infrastructure.

Pricing model?

Not yet formalised for external distribution. The operational cost is Firestore read/write pricing plus Cloud Functions invocations. For a platform with 50 flags, 10 tenants, and 3 environments, the Firestore cost is measured in single-digit dollars per month. The cost scales with the number of resolution calls, not the number of seats or the number of flags.

Sources

  1. LaunchDarkly contexts documentation. launchdarkly.com/docs/home/flags/contexts/intro
  2. LaunchDarkly targeting rules documentation. launchdarkly.com/docs/home/flags/target-rules
  3. LaunchDarkly context kinds for releases. launchdarkly.com/docs/home/releases/context-kinds
  4. Harness Feature Management & Experimentation traffic types. developer.harness.io/docs/feature-management-experimentation/traffic-types
  5. DevCycle custom-property rollout randomisation. docs.devcycle.com/platform/feature-flags/targeting/randomize-using-custom-property
  6. Statsig targeting conditions documentation. docs.statsig.com/feature-flags/conditions
  7. Unleash context fields and stickiness. docs.getunleash.io/reference/unleash-context
  8. GrowthBook targeting documentation. docs.growthbook.io/features/targeting
  9. Flagsmith identity and traits documentation. docs.flagsmith.com/basic-features/managing-identities
  10. Flagsmith Edge API migration. docs.flagsmith.com/performance/edge-api
  11. Flagsmith local evaluation identity override limitation (GitHub #1762). github.com/Flagsmith/flagsmith/issues/1762
  12. LaunchDarkly AI Configs GA announcement (May 2025). launchdarkly.com/blog/ai-configs-ga-runtime-control-prompts-models
  13. OpenAI / Statsig acquisition announcement. openai.com/index/vijaye-raji-to-become-cto-of-applications-with-acquisition-of-statsig
  14. Harness / Split acquisition. prnewswire.com/news-releases/harness-completes-acquisition-of-split-software-302170987.html
  15. Feature Flag & Management Platform Market Analysis (Mid-2026). Manus research draft, May 2026. Fact-checked independently; corrections noted in companion matrix.
  16. Reddit r/devops LaunchDarkly pricing discussions. March 2026. Cited as secondary source for pricing sentiment; not primary for specific figures.