How we design systems
Monorepo
We maintain a monorepo architecture for our codebase. Currently the main React app is located outside the monorepo, but we plan to migrate it in. This will enable us to share code and types across frontend and backend, allowing complete end-to-end features to be delivered in a single pull request.
Key principle: Services should only import code from commons modules, never directly from other services. If you find yourself needing to import code from another service, create a shared commons module instead to maintain proper separation of concerns.
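For illustration, a sketch of the import rule (the package and directory names below are hypothetical, not our actual layout):

```typescript
// Inside a service, e.g. services/segments:

// OK: depend on a shared commons module
import { scoreRecord } from "@goodfit/commons-scoring"; // hypothetical package name

// Not OK: reach into another service's source tree
// import { buildDataset } from "../datasets/src/build"; // move the shared code into commons instead
```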
See also: Code & repos for information about our repositories.
Serverless first
Our infrastructure follows a serverless-first approach:
- Default choice: Use the Serverless framework with AWS Lambda, SQS, and Step Functions for most services
- Exception for appBackend: We use App Runner with Docker for the appBackend service. This makes local testing easier and avoids the cold-start latency inherent to Lambda functions
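As a sketch, a default-choice service might look like this in serverless.ts (the service and resource names are illustrative, not a real service of ours):

```typescript
// serverless.ts — minimal sketch of the default Lambda + SQS shape
import type { AWS } from "@serverless/typescript";

const serverlessConfiguration: AWS = {
  service: "example-service", // hypothetical name
  frameworkVersion: "3",
  provider: {
    name: "aws",
    runtime: "nodejs18.x",
    stage: "${opt:stage, 'dev'}",
  },
  functions: {
    processMessage: {
      handler: "src/handlers/processMessage.handler",
      // Consume work from the queue defined below
      events: [{ sqs: { arn: { "Fn::GetAtt": ["WorkQueue", "Arn"] } } }],
    },
  },
  resources: {
    Resources: {
      WorkQueue: {
        Type: "AWS::SQS::Queue",
        Properties: { QueueName: "example-work-queue-${sls:stage}" },
      },
    },
  },
};

module.exports = serverlessConfiguration;
```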
See also: Node / Serverless setup for detailed configuration guidance.
Secrets and configuration
Configuration management follows a clear separation between sensitive and non-sensitive data:
- Secrets: Store all secrets in AWS Secrets Manager. These can be referenced directly in your serverless configuration files
- Non-sensitive configuration: Store configuration parameters in Systems Manager Parameter Store and reference them in serverless config
- Runtime injection: Use environment variables within serverless services to inject configuration values at runtime
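Putting the three rules together, a provider section might look like this (the parameter and secret paths are hypothetical):

```typescript
// Excerpt from a serverless.ts — values are resolved at deploy time
// and injected into the service as environment variables
const provider = {
  name: "aws",
  environment: {
    // Non-sensitive: from Systems Manager Parameter Store
    API_BASE_URL: "${ssm:/example-service/${sls:stage}/api-base-url}",
    // Secret: from AWS Secrets Manager, via the SSM secretsmanager reference path
    DB_PASSWORD: "${ssm:/aws/reference/secretsmanager/example-service/${sls:stage}/db-password}",
  },
};
```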
Infrastructure
Infrastructure is split based on ownership and scope:
- Shared infrastructure: Define shared underlying infrastructure in CDK within the gf-infrastructure repository. These resources are needed by multiple services
- Service-specific components: Define components owned by a single service in that service's serverless.ts file
- Known exception: The crawler ECS infrastructure is incorrectly structured for historic reasons and does not follow this pattern
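A sketch of the shared side in gf-infrastructure (CDK; the construct and export names are illustrative, not our actual stack layout):

```typescript
// gf-infrastructure — sketch of a shared resource used by multiple services
import * as cdk from "aws-cdk-lib";
import * as s3 from "aws-cdk-lib/aws-s3";

export class SharedInfraStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // A bucket needed by several services
    const sharedBucket = new s3.Bucket(this, "SharedDataBucket");

    // Export the name so individual services can reference it from serverless.ts
    new cdk.CfnOutput(this, "SharedDataBucketName", {
      value: sharedBucket.bucketName,
      exportName: "SharedDataBucketName", // hypothetical export name
    });
  }
}
```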
See also: App Environments for information about our dev, staging, and production environments.
Storage and state
We choose storage solutions based on data size and access patterns:
- S3 for archival: Use S3 for archiving data long-term
- S3 for temporary data: Use S3 for storing temporary working datasets (such as during file uploads), but always attach a retention policy to prevent indefinite storage (see the lifecycle sketch after this list)
- Postgres for fast access: Store smaller datasets that require fast access in Postgres. This includes data needed to respond to interactive application requests
- Redshift for analytics: Store larger datasets where higher access latency (10+ seconds) is acceptable in Redshift
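For the temporary-data case, a sketch of a service-owned bucket with a retention policy, defined in a serverless.ts resources section (the names and the seven-day window are illustrative):

```typescript
// serverless.ts resources excerpt — scratch bucket that cleans itself up
const resources = {
  Resources: {
    UploadScratchBucket: {
      Type: "AWS::S3::Bucket",
      Properties: {
        BucketName: "example-upload-scratch-${sls:stage}", // hypothetical name
        LifecycleConfiguration: {
          Rules: [
            {
              Id: "ExpireTemporaryUploads",
              Status: "Enabled",
              ExpirationInDays: 7, // delete working data after a week
            },
          ],
        },
      },
    },
  },
};
```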
Think multi-tenant
We operate a multi-tenant platform serving many clients. Every architectural decision must consider tenant isolation:
Consistent experience: Each client must have a consistent, reliable experience regardless of how other clients use the platform.
Resource isolation: Ensure that one client using the platform unexpectedly, being misconfigured, or operating at higher scale than anticipated does not impact other clients.
Authentication and authorization:
- All client APIs must be authenticated in the backend
- Data can reside in shared tables, but you must verify access to records in every API call
- When a user requests a record that exists but belongs to another client, return 404 (not found), not 403 (forbidden). This prevents leaking information about the existence of records across clients, as illustrated in the sketch below
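A sketch of that check in an Express handler (the auth middleware and lookup helper are hypothetical stand-ins, stubbed here so the example is self-contained):

```typescript
import express from "express";

// Hypothetical helpers, declared for illustration only
type AuthedRequest = express.Request & { user: { clientId: string } };
declare function requireAuth(req: express.Request, res: express.Response, next: express.NextFunction): void;
declare function findRecordById(id: string): Promise<{ id: string; clientId: string } | null>;

const app = express();

app.get("/records/:id", requireAuth, async (req, res) => {
  const { user } = req as AuthedRequest;
  const record = await findRecordById(req.params.id);

  // 404 for both "does not exist" and "belongs to another client":
  // a 403 would confirm that the record exists
  if (!record || record.clientId !== user.clientId) {
    return res.status(404).json({ error: "Not found" });
  }

  return res.json(record);
});
```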
Shared resource considerations: Be cautious when using shared resources like queues across all clients. Consider the impact if one client fills a shared queue, potentially blocking or slowing down other clients.
Step functions for multi-step workflow orchestration
Step Functions are our preferred tool for orchestrating complex, multi-step workflows. They provide excellent orchestration capabilities and observability into workflow execution.
Client-scoped workflows: Many Step Function workflows are scoped to a single client (for example, the build dataset, update segment, market preview workflows). This scoping allows workflows to run in parallel with minimal cross-client impact.
When to use Step Functions: If you find yourself designing a system with more than one stage and more than one queue, consider whether Step Functions would be a better architectural choice.
Distributed Map for scale: Use Distributed Map mode for large-scale data processing tasks, such as processing hundreds of files in S3 or processing multi-line files.
State size limits: Be aware of the 256KB state size limit per execution. For larger datasets, store results in S3 (with a retention policy) and store only the S3 URL in the Step Function state.
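A sketch of the S3 offloading pattern in a workflow step (the bucket name and key scheme are hypothetical; the bucket is assumed to carry a retention policy, per Storage and state above):

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const BUCKET = "example-workflow-scratch"; // hypothetical bucket with a lifecycle rule

export const handler = async (event: { executionId: string }) => {
  const results = await buildLargeResultSet(); // hypothetical expensive step

  const key = `step-results/${event.executionId}.json`;
  await s3.send(
    new PutObjectCommand({
      Bucket: BUCKET,
      Key: key,
      Body: JSON.stringify(results),
      ContentType: "application/json",
    })
  );

  // Return only the S3 location, keeping state well under the 256KB limit
  return { resultUrl: `s3://${BUCKET}/${key}` };
};

declare function buildLargeResultSet(): Promise<unknown[]>;
```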
Development cycle
Our development workflow supports running services locally while connecting to shared dev environment resources:
Frontend development: Run the React app against either the dev environment (for frontend-only changes) or a locally running appBackend Express instance (when backend changes are needed).
Backend development: Run appBackend locally against the dev environment's Postgres database. You can apply migrations to dev and deploy work-in-progress changes at any time.
Serverless development: Test Lambda functions locally using sls invoke local -f <function> --stage=dev -d '{}' for simple cases. For complex workflows (such as Step Functions), deploy directly to dev using sls deploy --stage=dev or sls deploy function -f <function> --stage=dev.
See also: App Environments for environment details, Local database and test setup for local testing configuration, and Node / Serverless setup for Serverless framework usage.
Observability
We are actively iterating and improving our observability practices:
CloudWatch for basic logging: Use CloudWatch logs (console.log) for basic logging, being mindful of payload sizes to manage costs.
New Relic for events and metrics: Use New Relic (via the Logger helper) for capturing events and metrics.
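A sketch of how the two fit together in a handler (the Logger helper's real module path and method names may differ; they are shown here as assumptions):

```typescript
import { Logger } from "@goodfit/commons-logger"; // hypothetical module path

export const handler = async (event: { clientId: string; rows: unknown[] }) => {
  // CloudWatch: basic logging — log a summary, not the full payload, to control cost
  console.log(`Processing ${event.rows.length} rows for client ${event.clientId}`);

  // New Relic (via the Logger helper): events and metrics
  Logger.recordEvent("rows_processed", { clientId: event.clientId, count: event.rows.length }); // hypothetical method
};
```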
See also: New Relic for monitoring dashboards and usage.
Security
Our security architecture follows a defense-in-depth approach with multiple layers of protection:
Network isolation: All services deploy inside the VPC's private subnet by default. Services requiring public access are exposed via API Gateway or App Runner.
Database protection: Databases are never publicly accessible; they're protected behind VPC firewalls and require tunneling for access. See Accessing databases via port forwarding for connection instructions.
Security monitoring: We monitor the #eng-security-alerts channel for Aikido alerts, which cover dependency scanning, Docker image scanning, and AWS misconfiguration detection.
SQL injection prevention: All database SQL must be generated using the sqlT helper to prevent SQL injection attacks—never use raw strings. See sql templating.
Input validation: We use zod to parse and validate all incoming user parameters wherever possible.
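A sketch combining the two rules (sqlT's actual signature may differ; it is shown here as a tagged template, with db as a hypothetical query client):

```typescript
import { z } from "zod";

// Hypothetical stand-ins for the real helpers
declare const db: { query: (q: unknown) => Promise<unknown[]> };
declare function sqlT(strings: TemplateStringsArray, ...values: unknown[]): unknown;

const ParamsSchema = z.object({
  recordId: z.string().uuid(),
});

export async function getRecord(rawParams: unknown, clientId: string) {
  // Reject malformed input before it reaches any query
  const params = ParamsSchema.parse(rawParams);

  // Values are bound as parameters, never interpolated as raw strings
  return db.query(sqlT`
    SELECT * FROM records
    WHERE id = ${params.recordId} AND client_id = ${clientId}
  `);
}
```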
Related documentation
- GoodFit Tech Overview - High-level system architecture and technology stack
- AWS User Access - How to access AWS accounts and configure CLI
- Git Conventions - Our git workflow and conventions
- Development Lifecycle - How we develop and ship features