How to Gracefully Shutdown 16 Microservices? Autional's Unified Bootstrapper Revealed

Service restarts are the norm in production—Kubernetes rolling updates, node evictions, resource scaling—each sends a SIGTERM signal to the Pod. A microservice that doesn’t handle SIGTERM will terminate immediately, leading to:

In-flight HTTP requests being interrupted, clients seeing connection reset errors
Messages taken from RabbitMQ but not yet processed being permanently lost (auto-acked)
gRPC streams cut off mid-way, downstream services receiving UNAVAILABLE errors
Database connection pool violently closed, uncommitted transactions rolled back

Autional’s 16 microservices achieve zero-downtime graceful shutdown through the unified micro-middleware/app bootstrapper.

Wild Shutdown vs Graceful Shutdown

Wild Shutdown (Nothing Done)

Timeline:
T+0s   Kubernetes sends SIGTERM → process exits immediately
T+0s   8 in-flight HTTP requests all disconnected → users see 502
T+0s   3 MQ messages auto-acked but not processed → message loss
T+0s   2 database transactions uncommitted → data inconsistency
T+0s   gRPC stream disconnected → downstream retries (avalanche risk)

This is the most common and most dangerous scenario—a simple go run cmd/server/main.go with no signal handling and no shutdown logic.

Graceful Shutdown (Autional Pattern)

Timeline:
T+0s   Received SIGTERM → shutdown sequence starts
T+0s   Stop accepting new HTTP requests (return 503 + Retry-After header)
T+5s   Wait for 3 in-flight HTTP requests to complete
T+6s   HTTP server.Shutdown() complete
T+6s   Stop MQ consumer, wait for in-flight messages to finish
T+8s   All 3 MQ messages acked
T+8s   gRPC server.GracefulStop() → wait for stream transfers to complete
T+12s  Close database connection pool (LIFO)
T+12s  Process exits

Consumers are unaware. Kubernetes terminationGracePeriodSeconds is set to 30 seconds, providing ample buffer time.

Autional Unified Bootstrapper Design

Application Builder Pattern

Each service builds its startup configuration via a Builder in main.go:

import app_pkg "gitee.com/linmes/authms/micro-middleware/app"

func main() {
    cfg := config.Load("configs/service/identity-service.yaml")
    logger := logger_base.New(cfg.Log)
    
    // Initialize dependencies
    db := gorm_client.MustInitDB(cfg.DB, domainModels...)
    mq := rabbitmq_client.Connect(cfg.RabbitMQ)
    redis := redis_client.Connect(cfg.Redis)
    
    // Build application
    app := app_pkg.New(cfg.Service.Name, logger).
        WithRouter(router).
        WithHealth(healthHandler).
        WithServer("grpc", grpcServer).
        WithServer("mq-consumer", consumerServer).
        WithCloser("db", sqlDB.Close).             // Database connection pool
        WithCloser("redis", redis.Close).           // Redis connection pool
        WithCloser("mq", mq.Close).                 // MQ connection
        WithCleanupNamed("audit", auditClient.Stop)
    
    app.Run(cfg.Service.Port)
}

Each WithServer and WithCloser registers a named shutdown callback. During shutdown, they execute in reverse registration order (LIFO), ensuring “first created, first opened, and opened resources close in dependency order”:

WithServer: Lifecycle Management

WithServer registers components implementing the app.Server interface:

type Server interface {
    ListenAndServe() error
    Shutdown(ctx context.Context) error
}

Common Server implementations in Autional:

Component	Implementation	Purpose
Gin Router	`app.NewHTTPServer(addr, handler)`	HTTP service
gRPC Service	`grpc_mw.Server` via `app.NewGRPCServer` wrapper	gRPC endpoints
MQ Consumer	`consumer_pkg.NewServer(consumer)`	RabbitMQ consumption
Health Check	`health.StartStandaloneServer`	Pure health probe

On shutdown, app.Run calls each Server’s Shutdown(ctx) in reverse order, passing the context timeout (default 30 seconds) to each component.

WithCloser vs WithCleanupNamed

Autional distinguishes two cleanup methods:

WithCloser(name, fn) — simple func() error closure for single-step cleanup (close DB, close Redis)
WithCleanupNamed(name, fn) — same as Closer but semantically for “side-effect cleanup” (e.g., audit client flush buffer)
Deprecated: WithCleanup(func()) — no name, no error return, not observable

// Correct: returns error, has a name
app.WithCloser("db", func() error {
    sqlDB, _ := db.DB()
    return sqlDB.Close()
})

// Wrong: no name, no error
app.WithCleanup(func() { db.Close() })

Shutdown Sequence in Detail

Step 1: Stop Accepting New Requests (0-1 sec)

Upon receiving SIGTERM, app.Run immediately calls http.Server.Shutdown(ctx):

func (a *Application) Run(port int) {
    // ... start all Servers ...
    
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
    <-quit
    
    a.logger.Info("shutting down", slog.String(logger_base.KeyReason, "signal"))
    
    // Step 1: HTTP stops accepting new requests
    ctx, cancel := context.WithTimeout(context.Background(), a.shutdownTimeout)
    defer cancel()
    
    for _, srv := range a.servers {  // reverse order
        a.logger.Info("shutting down server", slog.String("name", srv.Name))
        if err := srv.Shutdown(ctx); err != nil {
            a.logger.Error("shutdown failed", 
                slog.String("name", srv.Name),
                slog.Any(logger_base.KeyError, err))
        }
    }
}

HTTP Server Shutdown behavior:

Closes the listening socket → new connections rejected, returns 503
Waits for all in-flight requests to complete or timeout (ctx deadline)

Step 2: Drain MQ Consumers (3-8 sec)

MQ consumers achieve graceful shutdown through consumer_pkg.Server:

// internal implementation of consumer_pkg.NewServer
func (s *Server) Shutdown(ctx context.Context) error {
    s.cancel()  // ← triggers consumer internal context cancellation
    // consumer on receiving cancel:
    //  1. Stops subscribing (no longer receives new messages)
    //  2. Waits for all in-flight messages to finish processing
    s.wg.Wait()  // ← waits for all handler goroutines to exit
    return nil
}

In Autional’s consumer architecture, messages are only acked after successful processing (manual acknowledgment mode). So even if the MQ consumer hasn’t finished processing messages during process shutdown, messages are re-queued (not acked) and are not lost.

For long-running messages (e.g., compliance report generation, potentially 30+ seconds), context cancellation interrupts processing, and the message returns to the queue to be picked up by another Pod.

Step 3: gRPC GracefulStop (8-10 sec)

func (s *GRPCServer) Shutdown(ctx context.Context) error {
    done := make(chan struct{})
    go func() {
        s.grpcServer.GracefulStop()  // blocks until all RPCs complete
        close(done)
    }()
    
    select {
    case <-done:
        return nil
    case <-ctx.Done():
        s.grpcServer.Stop()  // force close on timeout
        return ctx.Err()
    }
}

gRPC’s GracefulStop ensures ongoing stream transfers can complete fully, while Stop is the hard-close fallback.

Step 4: Close Infrastructure Connection Pools (10-12 sec)

Closed in LIFO order:

Close order (reverse of registration):
[4] audit_client.Stop()      ← flush buffered logs first
[3] mq.Close()               ← close MQ connection
[2] redis.Close()            ← close Redis connection pool
[1] db.Close()               ← close database connection pool (last opened, first closed)
    ↑ sql.DB.Close() waits for all borrowed goroutines to return connections

Each step is logged:

INFO shutting down server name=http
INFO shutting down server name=grpc
INFO shutting down server name=mq-consumer
INFO closing name=audit
INFO closing name=mq
INFO closing name=redis
INFO closing name=db
INFO shutdown complete

If a Closer returns an error, it does not skip subsequent Closers—all cleanup steps are executed. This is defensive design: even if Redis connection is already broken causing Close to fail, the DB connection pool still needs to be released normally.

Timeout and Fallback

const defaultShutdownTimeout = 30 * time.Second

// In app.Run
ctx, cancel := context.WithTimeout(context.Background(), a.shutdownTimeout)
defer cancel()

// If all Shutdown steps don't complete within 30 seconds, force exit
go func() {
    <-ctx.Done()
    if errors.Is(ctx.Err(), context.DeadlineExceeded) {
        a.logger.Error("shutdown deadline exceeded, forcing exit")
        os.Exit(1)  // hard exit, let Kubernetes restart the Pod
    }
}()

Why set a timeout:

Kubernetes default terminationGracePeriodSeconds is 30 seconds
If graceful shutdown doesn’t complete within 30 seconds, Kubernetes sends SIGKILL to force-kill the Pod
Autional’s 30-second default aligns with this, but can be customized via WithShutdownTimeout

Verified in Production

Autional’s graceful shutdown performance in production:

Scenario 1: Normal Rolling Update

Pod identity-service-7f8b9c-abc12 receives SIGTERM
→ 0.2s: Stop accepting new requests
→ 2.1s: Last 3 requests complete
→ 3.5s: MQ messages acked
→ 3.8s: gRPC stream complete
→ 5.0s: DB connection pool released
→ 5.0s: Process exits

The gateway load balancer detects Pod termination and automatically routes traffic to the new Pod. Zero errors, zero 5xx.

Scenario 2: Database Connection Failure (Fallback Test)

Pod billing-service-6c3d9a-xyz78 receives SIGTERM
→ 0.1s: Stop accepting new requests
→ 0.3s: HTTP shutdown successful
→ 0.5s: gRPC shutdown successful
→ 0.5s: Close DB failed → error logged, continues
→ 0.6s: Close Redis successful
→ 0.7s: Close MQ successful
→ 0.7s: Process exits (despite db.Close failure)

Because db.Close() returned an error, but the WithCloser implementation always calls all Closers, never interrupting due to a single failure:

for _, closer := range s.closers {  // reverse order
    if err := closer.Fn(); err != nil {
        logger.Error("close resource failed",
            slog.String("name", closer.Name),
            slog.Any(logger_base.KeyError, err))
    }
}

Why This Matters

User Experience

Zero-downtime graceful shutdown means: users in the middle of two-factor authentication (MFA), submitting a password reset request, or checking wallet balances—none of these in-progress operations are interrupted by deployments. Users don’t notice a thing.

Data Integrity

MQ messages are not lost: unacked messages are re-queued after shutdown and taken over by new Pods. Database transactions don’t hang: connection pool shuts down gracefully, waiting for all goroutines to return connections and complete transactions.

Operations-Friendly

The complete shutdown sequence is recorded in logs. If a Pod consistently fails to shut down, operations can quickly locate the problematic component from “close resource failed name=xxx” logs.

Summary

Autional’s micro-middleware/app bootstrapper uses less than 300 lines of code to uniformly manage the lifecycle of 16 microservices:

Declarative Registration: Builder pattern with WithServer + WithCloser
Signal-Driven: Listens for SIGTERM/SIGINT, automatically triggers shutdown sequence
Tiered Gracefulness: HTTP → MQ → gRPC → Infrastructure, orderly shutdown
Fallback Mechanism: Hard exit on timeout + single-step failure doesn’t interrupt subsequent cleanup
Full Logging: Every component shutdown has name and error recorded

If you’re building microservices in Go, there’s no need to reinvent the wheel—this pattern can be directly copied into your project. The core principle is just one rule: Never let SIGTERM directly kill your in-flight requests.