Overview
This page is the contributor entry point. It gives a fast system map and links to canonical detail pages.
Use this page as the map, then drill down to the linked pages for implementation-level detail.
Section Map
| Section | What it answers | Canonical details |
|---|---|---|
| 1. Architecture Overview | Why the platform is split this way | This page |
| 2. Infrastructure | What on-prem/k3s baseline the system assumes | Infrastructure |
| 3. Service Architecture | Who owns what behavior across services | Service Architecture |
| 4. Runtime and Lifecycle Flows | How requests/deployments/access move | Runtime and Lifecycle Flows |
| 5. State Reconciliation | How desired and actual state are converged | State Reconciliation |
| 6. Security | Where trust boundaries and controls are enforced | Security |
| 7. Storage and Data | Why each data technology exists in the design | Storage and Data |
| 8. Scalability | Which limits and quotas govern throughput | Scalability |
| 9. Observability and Operations | How to diagnose and operate safely under load | Observability and Operations |
| 10. Design Principles | Why major architectural decisions were chosen | Design Principles |
1. Architecture Overview
The platform is organized into four primary blocks with clear ownership boundaries.
| Block | Responsibility |
|---|---|
| Management & Business | User-facing entry point. Contestants interact through the Contestant Portal, and their requests are served by the backend services. The Admin Portal is built on the open-source CTFd platform and provides management interfaces for the Jury, Challenge Writers, and Platform Admins. |
| Deployment Orchestration | Coordinates challenge build and deployment through Deployment Center and Argo Workflows. Manages challenge lifecycle operations and exposes control APIs such as stop, status check, and log retrieval. |
| Infrastructure (Kubernetes) | Runs each challenge as an isolated pod and cleans up expired challenges via scheduled CronJobs. |
| Shared Persistence & Caching | Provides shared state infrastructure. MariaDB stores durable data, while Redis supports caching, locking, and temporary lifecycle state management. |
2. Infrastructure
| Aspect | Current design | Why it matters |
|---|---|---|
| Platform | Self-managed on-prem k3s | Assumes no cloud-managed dependencies. |
| Networking baseline | Flannel and Traefik disabled; Calico installed explicitly | Predictable CNI behavior and policy control. |
| Ingress | Ingress NGINX via Helm | Consistent routing for the app and the runtime access boundary. |
| Runtime sandbox | Optional gVisor RuntimeClass, gated by USE_GVISOR (opt-in sketched below) | Additional isolation for untrusted challenge workloads. |
| Service exposure | ClusterIP + Ingress (primary), NodePort (fallback) | Works in constrained lab DNS/ingress environments. |
Key namespaces: app, argo, db, monitoring, registry, cattle-system, challenge (dynamically created, labeled namespaces), storage.
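The USE_GVISOR toggle maps onto the standard Kubernetes RuntimeClass opt-in: when set, challenge pods request the gvisor runtime and the kubelet runs them under the runsc handler. A minimal sketch with the official Python client; the helper name and pod layout are illustrative, not the platform's actual code:

```python
import os
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() in-cluster

def challenge_pod(name: str, image: str) -> client.V1Pod:
    """Build a challenge pod spec, opting into the gVisor sandbox when enabled."""
    spec = client.V1PodSpec(
        containers=[client.V1Container(name="challenge", image=image)],
    )
    # Gate the sandbox on USE_GVISOR; a pod referencing a missing
    # RuntimeClass fails scheduling, which keeps the flag truly optional.
    if os.environ.get("USE_GVISOR") == "true":
        spec.runtime_class_name = "gvisor"  # backed by the runsc handler
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name, namespace="challenge"),
        spec=spec,
    )
```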
Canonical details: Infrastructure.
3. Service Architecture
| Service boundary | Primary responsibility |
|---|---|
| Contestant Service | Competition domain APIs and race-safe quota logic (atomicity sketched below) |
| Deployment Center | Deployment intent, orchestration coordination, workflow/pod status surfaces |
| Challenge Gateway | Sole runtime ingress boundary (token-based HTTP/TCP access) |
| Deployment Consumer + Argo | Async deploy/build execution pipeline |
| Deployment Listener | Reconciliation of Redis/MariaDB lifecycle truth against Kubernetes events |
Boundary reminder: Challenge Gateway is the only runtime ingress boundary for challenge traffic.
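The Contestant Service row's race-safe quota logic means the quota check and the increment must be a single atomic step, or two concurrent deploys can both claim the same last slot. A minimal sketch of one common pattern, a Redis Lua check-and-increment; the key layout and limit are hypothetical, not the service's actual schema:

```python
import redis

r = redis.Redis()

# The script executes atomically inside Redis, so two concurrent requests
# can never both pass a limit of N while only N-1 slots are taken.
QUOTA_SCRIPT = r.register_script("""
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
if current >= tonumber(ARGV[1]) then
    return 0  -- quota exhausted: reject
end
redis.call('INCR', KEYS[1])
return 1      -- slot reserved
""")

def try_reserve_slot(team_id: str, limit: int = 3) -> bool:
    """Reserve one concurrent-deployment slot for a team, race-free."""
    key = f"quota:team:{team_id}:running"  # hypothetical key layout
    return bool(QUOTA_SCRIPT(keys=[key], args=[limit]))

def release_slot(team_id: str) -> None:
    r.decr(f"quota:team:{team_id}:running")
```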
Canonical details: Service Architecture.
4. Runtime and Lifecycle Flows
| Flow | Path |
|---|---|
| User request | ingress -> frontend -> Contestant Service |
| Deployment | Deployment Center -> RabbitMQ -> Deployment Consumer -> Argo -> Kubernetes (intent publish sketched below) |
| Runtime access | signed token -> Challenge Gateway -> internal challenge service |
| Stop/cleanup | control intent -> runtime termination -> reconciled STOPPED state |
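The deployment path is async-first: the API records intent, publishes a message, and returns, while the consumer drives Argo separately. A minimal producer-side sketch with pika; the queue name and message fields are illustrative assumptions (the bounded queue declaration itself appears in the Scalability sketch further down):

```python
import json
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = conn.channel()
channel.confirm_delivery()  # surfaces a reject-publish overflow as an error

def publish_deploy_intent(team_id: str, challenge_id: str) -> None:
    """Hand the deploy to the async pipeline; the API returns immediately."""
    channel.basic_publish(
        exchange="",              # default exchange routes by queue name
        routing_key="deploy.intents",
        body=json.dumps({"team_id": team_id, "challenge_id": challenge_id}),
        properties=pika.BasicProperties(delivery_mode=2),  # persistent
    )
```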
Canonical details: Runtime and Lifecycle Flows.
5. State Reconciliation
| Concept | Implementation in FCTF |
|---|---|
| Desired state | Redis lifecycle keys + MariaDB tracking rows |
| Actual state | Kubernetes namespaces/pods/readiness/restarts |
| Control loop | desired -> execute -> observe -> reconcile |
| Drift sources | watch disconnect, stale resourceVersion, out-of-band deletes, partial workflow completion |
| Reconciler | Deployment Listener shards pod events and applies cleanup/state correction (watch loop sketched below) |
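The drift sources above map directly onto how such a watch loop has to be written: streams disconnect routinely, and a stale resourceVersion comes back as HTTP 410, which forces a fresh list. A minimal sketch with the official Python client; reconcile() is a hypothetical stand-in for the Listener's actual correction logic:

```python
from kubernetes import client, config, watch

config.load_incluster_config()
v1 = client.CoreV1Api()

def reconcile(pod) -> None:
    """Hypothetical stand-in: compare the observed pod against desired
    state in Redis/MariaDB and apply cleanup/state correction."""

def watch_challenge_pods() -> None:
    resource_version = None
    while True:  # watches disconnect routinely; always resume or re-list
        try:
            stream = watch.Watch().stream(
                v1.list_namespaced_pod,
                namespace="challenge",
                resource_version=resource_version,
                timeout_seconds=300,
            )
            for event in stream:
                pod = event["object"]
                resource_version = pod.metadata.resource_version
                reconcile(pod)
        except client.ApiException as exc:
            if exc.status == 410:  # stale resourceVersion: re-list from scratch
                resource_version = None
            else:
                raise
```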
Canonical details: State Reconciliation.
6. Security
| Layer | Control |
|---|---|
| Runtime exposure | Challenge pods are never directly exposed; gateway-only boundary |
| Access tokens | HMAC-signed token flow for HTTP and TCP challenge access (sketched below) |
| Network isolation | Role-scoped NetworkPolicies in app and challenge namespaces |
| Data-plane authz | Redis ACL, MariaDB least privilege, scoped RabbitMQ topology |
| Runtime hardening | Optional gVisor; RBAC scoped by service role |
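The access-token row is standard HMAC construction: the issuer signs the claims, the gateway recomputes the signature over what it received, and comparison is constant-time. A minimal sketch; the payload fields and secret handling are illustrative assumptions:

```python
import hashlib
import hmac
import time

SECRET = b"shared-gateway-secret"  # illustrative; load from a real secret store

def sign_token(team_id: str, challenge_id: str, ttl: int = 600) -> str:
    """Issue 'payload.signature' granting time-limited challenge access."""
    payload = f"{team_id}:{challenge_id}:{int(time.time()) + ttl}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"

def verify_token(token: str) -> bool:
    """Gateway side: recompute the HMAC, then check expiry."""
    try:
        payload, sig = token.rsplit(".", 1)
        expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected):  # constant-time compare
            return False
        return int(payload.rsplit(":", 1)[1]) > time.time()
    except ValueError:
        return False
```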
Canonical details: Security.
7. Storage and Data
| Store | Role in architecture |
|---|---|
| MariaDB | Durable business truth and audit history |
| Redis | Low-latency coordination, quotas, lifecycle state (TTL pattern sketched below) |
| RabbitMQ | Async buffering and backpressure for deploy intents |
| NFS | Shared challenge assets and workflow build contexts |
| Harbor | Private image distribution for challenge runtime namespaces |
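The MariaDB/Redis split shows up as a dual write with different guarantees: a durable tracking row for audit history, plus a TTL'd Redis key so a lifecycle entry whose explicit cleanup is missed still ages out on its own. A minimal sketch of the pattern; the table, key, and state names are hypothetical:

```python
import redis

r = redis.Redis()

def record_deployment(dep_id: str, team_id: str, ttl: int = 3600) -> None:
    # Durable business truth: an audit row in MariaDB (driver call elided).
    #   INSERT INTO deployments (id, team_id, state) VALUES (%s, %s, 'RUNNING')
    #
    # Low-latency lifecycle state: a Redis key with a TTL, so the entry
    # expires by itself even if explicit cleanup never runs.
    r.set(f"lifecycle:{dep_id}", "RUNNING", ex=ttl)
```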
Canonical details: Storage and Data.
8. Scalability
| Control area | Current approach |
|---|---|
| Queue pressure | Bounded queue length and reject-publish behavior (sketched below) |
| Worker throughput | Prefetch/batch controls and workflow concurrency gate |
| Runtime entry load | Gateway connection and rate limits (global, per-IP, per-token) |
| Fairness | Team-level concurrent and per-challenge limits via Redis |
| Scaling style | Hybrid: HPA for selected stateless services, manual tuning for worker/reconciler components |
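Queue pressure and worker throughput reduce to a few concrete RabbitMQ knobs: a length-bounded queue that rejects publishes at the cap, and consumer prefetch to limit in-flight work per worker. A minimal sketch with pika; the names and numbers are illustrative, not the deployed configuration:

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = conn.channel()

# Bound the queue: at the cap, the broker rejects new publishes instead of
# growing without limit, so backpressure surfaces at the producer.
channel.queue_declare(
    queue="deploy.intents",
    durable=True,
    arguments={"x-max-length": 200, "x-overflow": "reject-publish"},
)

# Cap unacked deliveries per worker so a burst cannot pile onto one consumer.
channel.basic_qos(prefetch_count=4)

def on_intent(ch, method, properties, body):
    # ... submit the Argo workflow, then ack only on success ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="deploy.intents", on_message_callback=on_intent)
channel.start_consuming()
```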
Canonical details: Scalability.
9. Observability and Operations
| Capability | Stack or surface |
|---|---|
| Metrics | Prometheus (instrumentation sketched below) |
| Visualization | Grafana |
| Log aggregation | Loki |
| Runtime diagnostics | Deployment Center workflow/pod/log query endpoints |
| Cluster operations | Scripted bootstrap/install path for on-prem environments |
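On the service side, the Prometheus row usually just means each service exposes a scrape endpoint. A minimal sketch with prometheus_client; the metric names are illustrative, not the platform's actual series:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names, not the platform's actual series.
DEPLOYS = Counter("fctf_deployments_total",
                  "Deploy intents processed", ["outcome"])
DEPLOY_LATENCY = Histogram("fctf_deploy_seconds", "End-to-end deploy time")

start_http_server(9100)  # expose /metrics for the Prometheus scraper

@DEPLOY_LATENCY.time()
def handle_deploy(intent) -> None:
    try:
        ...  # run the deployment
        DEPLOYS.labels(outcome="success").inc()
    except Exception:
        DEPLOYS.labels(outcome="error").inc()
        raise
```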
Canonical details: Observability and Operations.
10. Design Principles
| Principle | Practical effect on implementation |
|---|---|
| Async-first control path | Keeps user-facing APIs responsive under deployment load |
| Isolation by default | Limits blast radius and cross-team interference |
| Reconciliation over assumption | Prevents stale runtime state from persisting |
| Defense in depth | Applies layered controls at gateway/network/data/runtime |
| On-premise-first operability | Works without managed cloud dependencies |
| Contributor-first boundaries | Keeps service, flow, and consistency concerns separate |
Canonical details: Design Principles.
If you are new to the codebase, read sections in this order: Overview -> Service Architecture -> Runtime and Lifecycle Flows -> State Reconciliation.