How to Gracefully Shut Down 16 Microservices? Autional's Unified Bootstrapper Revealed
Service restarts are the norm in production—Kubernetes rolling updates, node evictions, resource scaling—each sends a SIGTERM signal to the Pod. A microservice that doesn’t handle SIGTERM will terminate immediately, leading to:
- In-flight HTTP requests being interrupted, clients seeing connection reset errors
- Messages taken from RabbitMQ but not yet processed being permanently lost (auto-acked)
- gRPC streams cut off mid-way, downstream services receiving UNAVAILABLE errors
- Database connection pool violently closed, uncommitted transactions rolled back
Autional’s 16 microservices achieve zero-downtime graceful shutdown through the unified micro-middleware/app bootstrapper.
Wild Shutdown vs Graceful Shutdown
Wild Shutdown (Nothing Done)
Timeline:
T+0s Kubernetes sends SIGTERM → process exits immediately
T+0s 8 in-flight HTTP requests all disconnected → users see 502
T+0s 3 MQ messages auto-acked but not processed → message loss
T+0s 2 database transactions uncommitted → data inconsistency
T+0s gRPC stream disconnected → downstream retries (avalanche risk)
This is the most common and most dangerous scenario—a simple go run cmd/server/main.go with no signal handling and no shutdown logic.
Graceful Shutdown (Autional Pattern)
Timeline:
T+0s Received SIGTERM → shutdown sequence starts
T+0s Stop accepting new HTTP requests (return 503 + Retry-After header)
T+5s Wait for 3 in-flight HTTP requests to complete
T+6s HTTP server.Shutdown() complete
T+6s Stop MQ consumer, wait for in-flight messages to finish
T+8s All 3 MQ messages acked
T+8s gRPC server.GracefulStop() → wait for stream transfers to complete
T+12s Close database connection pool (LIFO)
T+12s Process exits
Clients are unaware. Kubernetes terminationGracePeriodSeconds is set to 30 seconds, which leaves ample buffer beyond the roughly 12 seconds this sequence takes.
Autional Unified Bootstrapper Design
Application Builder Pattern
Each service builds its startup configuration via a Builder in main.go:
import app_pkg "gitee.com/linmes/authms/micro-middleware/app"

func main() {
    cfg := config.Load("configs/service/identity-service.yaml")
    logger := logger_base.New(cfg.Log)

    // Initialize dependencies
    db := gorm_client.MustInitDB(cfg.DB, domainModels...)
    mq := rabbitmq_client.Connect(cfg.RabbitMQ)
    redis := redis_client.Connect(cfg.Redis)

    // router, healthHandler, grpcServer, consumerServer, sqlDB and auditClient
    // are set up elsewhere (omitted in this excerpt).

    // Build application
    app := app_pkg.New(cfg.Service.Name, logger).
        WithRouter(router).
        WithHealth(healthHandler).
        WithServer("grpc", grpcServer).
        WithServer("mq-consumer", consumerServer).
        WithCloser("db", sqlDB.Close).     // Database connection pool
        WithCloser("redis", redis.Close).  // Redis connection pool
        WithCloser("mq", mq.Close).        // MQ connection
        WithCleanupNamed("audit", auditClient.Stop)

    app.Run(cfg.Service.Port)
}
Each WithServer and WithCloser registers a named shutdown callback. During shutdown, they execute in reverse registration order (LIFO), so the resource created first is closed last and every component is torn down before the things it depends on.
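The reverse-order rule can be sketched with a tiny registry. This is a minimal illustration rather than Autional's actual micro-middleware/app code: the lifecycle type and its withCloser and shutdown methods are hypothetical names chosen for the example.

```go
package main

import "fmt"

// namedCloser pairs a cleanup function with a name used for logging.
// Hypothetical type, for illustration only.
type namedCloser struct {
	name string
	fn   func() error
}

// lifecycle is a hypothetical stand-in for the application builder.
type lifecycle struct {
	closers []namedCloser
}

// withCloser appends a closer and returns the receiver so calls can be chained.
func (l *lifecycle) withCloser(name string, fn func() error) *lifecycle {
	l.closers = append(l.closers, namedCloser{name: name, fn: fn})
	return l
}

// shutdown runs closers in reverse registration order (LIFO) and returns
// the names in the order they were executed.
func (l *lifecycle) shutdown() []string {
	order := make([]string, 0, len(l.closers))
	for i := len(l.closers) - 1; i >= 0; i-- {
		c := l.closers[i]
		if err := c.fn(); err != nil {
			fmt.Printf("close %s failed: %v\n", c.name, err)
		}
		order = append(order, c.name)
	}
	return order
}

func main() {
	noop := func() error { return nil }
	l := (&lifecycle{}).
		withCloser("db", noop).
		withCloser("redis", noop).
		withCloser("mq", noop)
	fmt.Println(l.shutdown()) // [mq redis db]
}
```

Registering db first and mq last means mq is closed first and db last, which matches the dependency direction: later components may still need earlier ones while they drain.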
WithServer: Lifecycle Management
WithServer registers components implementing the app.Server interface:
type Server interface {
    ListenAndServe() error
    Shutdown(ctx context.Context) error
}
Common Server implementations in Autional:
| Component | Implementation | Purpose |
|---|---|---|
| Gin Router | app.NewHTTPServer(addr, handler) | HTTP service |
| gRPC Service | grpc_mw.Server via app.NewGRPCServer wrapper | gRPC endpoints |
| MQ Consumer | consumer_pkg.NewServer(consumer) | RabbitMQ consumption |
| Health Check | health.StartStandaloneServer | Pure health probe |
On shutdown, app.Run calls each Server’s Shutdown(ctx) in reverse order, passing the context timeout (default 30 seconds) to each component.
WithCloser vs WithCleanupNamed
Autional distinguishes two cleanup methods:
- WithCloser(name, fn): a simple func() error closure for single-step cleanup (close DB, close Redis)
- WithCleanupNamed(name, fn): same as Closer but semantically for “side-effect cleanup” (e.g., audit client flush buffer)
- Deprecated: WithCleanup(func()): no name, no error return, not observable
// Correct: returns error, has a name
app.WithCloser("db", func() error {
    sqlDB, err := db.DB()
    if err != nil {
        return err // surface the failure instead of calling Close on a nil handle
    }
    return sqlDB.Close()
})

// Wrong: no name, no error
app.WithCleanup(func() { db.Close() })
Shutdown Sequence in Detail
Step 1: Stop Accepting New Requests (0-1 sec)
Upon receiving SIGTERM, app.Run immediately calls http.Server.Shutdown(ctx):
func (a *Application) Run(port int) {
    // ... start all Servers ...

    // Block until SIGINT or SIGTERM arrives
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
    <-quit
    a.logger.Info("shutting down", slog.String(logger_base.KeyReason, "signal"))

    // Step 1: HTTP stops accepting new requests
    ctx, cancel := context.WithTimeout(context.Background(), a.shutdownTimeout)
    defer cancel()

    // Walk the servers in reverse registration order
    for i := len(a.servers) - 1; i >= 0; i-- {
        srv := a.servers[i]
        a.logger.Info("shutting down server", slog.String("name", srv.Name))
        if err := srv.Shutdown(ctx); err != nil {
            a.logger.Error("shutdown failed",
                slog.String("name", srv.Name),
                slog.Any(logger_base.KeyError, err))
        }
    }
}
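HTTP Server Shutdown behavior:
- Closes the listening socket, so new connections are refused; any 503 that clients see comes from a drain handler or an upstream proxy, not from Shutdown itself
- Waits for all in-flight requests to complete or for the ctx deadline to expire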
Step 2: Drain MQ Consumers (3-8 sec)
MQ consumers achieve graceful shutdown through consumer_pkg.Server:
// internal implementation of consumer_pkg.NewServer
func (s *Server) Shutdown(ctx context.Context) error {
    s.cancel() // ← triggers consumer internal context cancellation
    // consumer on receiving cancel:
    // 1. Stops subscribing (no longer receives new messages)
    // 2. Waits for all in-flight messages to finish processing
    s.wg.Wait() // ← waits for all handler goroutines to exit
    return nil
}
In Autional’s consumer architecture, messages are acked only after successful processing (manual acknowledgment mode). If the process exits before a handler finishes, that message was never acked, so RabbitMQ re-queues it once the channel or connection closes and nothing is lost.
For long-running messages (e.g., compliance report generation, potentially 30+ seconds), context cancellation interrupts processing, and the message returns to the queue to be picked up by another Pod.
Step 3: gRPC GracefulStop (8-10 sec)
func (s *GRPCServer) Shutdown(ctx context.Context) error {
    done := make(chan struct{})
    go func() {
        s.grpcServer.GracefulStop() // blocks until all RPCs complete
        close(done)
    }()
    select {
    case <-done:
        return nil
    case <-ctx.Done():
        s.grpcServer.Stop() // force close on timeout
        return ctx.Err()
    }
}
gRPC’s GracefulStop ensures ongoing stream transfers can complete fully, while Stop is the hard-close fallback.
Step 4: Close Infrastructure Connection Pools (10-12 sec)
Closed in LIFO order:
Close order (reverse of registration):
[4] audit_client.Stop() ← flush buffered logs first
[3] mq.Close() ← close MQ connection
[2] redis.Close() ← close Redis connection pool
[1] db.Close() ← close database connection pool (first opened, last closed)
↑ sql.DB.Close() stops new queries from starting and waits for queries already running on the server to finish
Each step is logged:
INFO shutting down server name=http
INFO shutting down server name=grpc
INFO shutting down server name=mq-consumer
INFO closing name=audit
INFO closing name=mq
INFO closing name=redis
INFO closing name=db
INFO shutdown complete
If a Closer returns an error, the remaining Closers are not skipped; every cleanup step still runs. This is defensive design: even if the Redis connection is already broken and Close fails, the DB connection pool still needs to be released normally.
Timeout and Fallback
const defaultShutdownTimeout = 30 * time.Second

// In app.Run
ctx, cancel := context.WithTimeout(context.Background(), a.shutdownTimeout)
defer cancel()

// If all Shutdown steps don't complete within 30 seconds, force exit
go func() {
    <-ctx.Done()
    if errors.Is(ctx.Err(), context.DeadlineExceeded) {
        a.logger.Error("shutdown deadline exceeded, forcing exit")
        os.Exit(1) // hard exit, let Kubernetes restart the Pod
    }
}()
Why set a timeout:
- The Kubernetes default terminationGracePeriodSeconds is 30 seconds
- If graceful shutdown doesn’t complete within 30 seconds, Kubernetes sends SIGKILL to force-kill the Pod
- Autional’s 30-second default aligns with this, but can be customized via WithShutdownTimeout
Verified in Production
Autional’s graceful shutdown performance in production:
Scenario 1: Normal Rolling Update
Pod identity-service-7f8b9c-abc12 receives SIGTERM
→ 0.2s: Stop accepting new requests
→ 2.1s: Last 3 requests complete
→ 3.5s: MQ messages acked
→ 3.8s: gRPC stream complete
→ 5.0s: DB connection pool released
→ 5.0s: Process exits
The gateway load balancer detects Pod termination and automatically routes traffic to the new Pod. Zero errors, zero 5xx.
Scenario 2: Database Connection Failure (Fallback Test)
Pod billing-service-6c3d9a-xyz78 receives SIGTERM
→ 0.1s: Stop accepting new requests
→ 0.3s: HTTP shutdown successful
→ 0.5s: gRPC shutdown successful
→ 0.5s: Close DB failed → error logged, continues
→ 0.6s: Close Redis successful
→ 0.7s: Close MQ successful
→ 0.7s: Process exits (despite db.Close failure)
Here db.Close() returned an error, but the WithCloser implementation always calls every Closer and never aborts because of a single failure:
// Walk the closers in reverse registration order; log failures and keep going
for i := len(s.closers) - 1; i >= 0; i-- {
    closer := s.closers[i]
    if err := closer.Fn(); err != nil {
        logger.Error("close resource failed",
            slog.String("name", closer.Name),
            slog.Any(logger_base.KeyError, err))
    }
}
Why This Matters
User Experience
Zero-downtime graceful shutdown means: users in the middle of two-factor authentication (MFA), submitting a password reset request, or checking wallet balances—none of these in-progress operations are interrupted by deployments. Users don’t notice a thing.
Data Integrity
MQ messages are not lost: unacked messages are re-queued after shutdown and taken over by new Pods. Database transactions don’t hang: the connection pool shuts down gracefully, waiting for goroutines to return their connections and complete their transactions.
Operations-Friendly
The complete shutdown sequence is recorded in logs. If a Pod consistently fails to shut down, operations can quickly locate the problematic component from “close resource failed name=xxx” logs.
Summary
Autional’s micro-middleware/app bootstrapper uses less than 300 lines of code to uniformly manage the lifecycle of 16 microservices:
- Declarative Registration: Builder pattern with WithServer + WithCloser
- Signal-Driven: Listens for SIGTERM/SIGINT, automatically triggers shutdown sequence
- Tiered Gracefulness: HTTP → MQ → gRPC → Infrastructure, orderly shutdown
- Fallback Mechanism: Hard exit on timeout + single-step failure doesn’t interrupt subsequent cleanup
- Full Logging: Every component shutdown has name and error recorded
If you’re building microservices in Go, there’s no need to reinvent the wheel—this pattern can be directly copied into your project. The core principle is just one rule: Never let SIGTERM directly kill your in-flight requests.