
Queues & Async

Long-running or error-prone work doesn't belong in a request handler. It belongs in a queue. This is factor VIII of our architecture: scale out via the process model, not by making individual requests do more.

We use BullMQ backed by Redis. Workers are independent processes: they can be scaled, restarted, or rate-limited without touching the API.

The problem with doing work in the request

A request handler has a hard contract: respond quickly, don't fail the user because a downstream service is slow. When you do real work inside a request:

  • If the external API (OpenAI, GCS) is slow, the user waits, or times out
  • If the service restarts mid-operation, the work is lost with no way to retry
  • If traffic spikes, every request does the full work: no backpressure, no throttling

The same applies to in-process alternatives:

| Pattern | Problem |
| --- | --- |
| `EventEmitter` for mutations | In-process: if the service crashes, the event is gone. No retry, no trace. |
| MongoDB change streams | Listener is in-process: a restart means every missed event is lost forever. |
| `@Cron` / NestJS scheduler | Every running instance fires the job. Two instances = duplicate emails, duplicate reports. |

The rule: if the outcome matters and it can fail, it goes in a queue.

What queues give you

  • Retries with backoff: a failed job is retried automatically, not silently dropped
  • Backpressure: a rate limiter on the worker protects external APIs from being hammered
  • Chaining: a processor enqueues the next job, so a multi-step pipeline (receive → transcribe → parse) retries each step independently
  • Observability: every job has an ID, a state, a history; Bull Board shows what's running, waiting, or failed
  • Horizontal scaling: add workers for the slow queues, leave the fast ones alone

Queues in this codebase

The audio pipeline is the clearest example of why this matters:

call-log.receive → asset.acquire → transcribe.openai → transcribe.parse

Each step is independently retryable. A transient OpenAI timeout retries just the transcription, not the entire pipeline from the start.

Other domains follow the same pattern: chat message delivery, attachment analysis, schema audits, Slack notifications; anything that touches an external service or takes more than a few milliseconds.

Why BullMQ and not Pub/Sub or RabbitMQ

Different brokers solve different problems. We use BullMQ because our jobs are tightly coupled to the backend: they need NestJS DI and MongoDB access, and Redis is already in the stack. It's the right tool for task queues within a single service.

| | BullMQ | Pub/Sub (GCP) | RabbitMQ |
| --- | --- | --- | --- |
| Best for | Background jobs within a service | Cross-service event fan-out at scale | Complex routing across many services |
| Delivery | At-least-once (exactly-once with Redis lock) | At-least-once (exactly-once available on pull) | At-least-once with manual ack; at-most-once with auto-ack |
| Model | Pull (workers poll Redis) | Push or pull (per subscription) | Push (broker pushes to subscribed consumers) |
| Persistence | Redis | Google-managed | Broker-managed |
| Retries | Built-in, per-job | Ack/nack, per-subscription | Ack/nack, per-consumer |
| Routing | Queue name | Topic + subscription filters | Exchanges + binding keys |
| Overhead | Low (Redis already present) | Managed, no infra to run | Requires a broker to operate |

If we ever need to fan events out to multiple independent services (e.g. a data pipeline consuming the same call log events as the backend), Pub/Sub would be the right addition, not a replacement.

Further reading: BullMQ docs · Cloud Pub/Sub overview · RabbitMQ tutorials