Overview of the Problem
A DataSource timeout occurs when an application cannot get a database connection from the connection pool within a configured wait time, or when a query/command does not complete before its execution timeout. These errors signal a mismatch between demand and supply: too many callers waiting for too few connections, long-running SQL blocking the pool, misconfigured timeouts, network instability, or a database that is saturated or unreachable. If left unaddressed, DataSource timeouts can cause cascading failures, high latency, and user-visible errors across services.
DataSource timeouts often show up as “connection pool exhausted,” “SQL timeout,” or “socket timeout,” depending on whether the issue is at acquire time, execution time, or on the network path. Understanding where the timeout happens is the fastest way to the fix.
How a DataSource Timeout Manifests
Symptoms and Typical Error Messages
- Connection acquisition (pool) timeouts:
  - Java/JDBC (HikariCP):
    com.zaxxer.hikari.pool.PoolBase – HikariPool-1 – Timeout failure waiting for connection from pool
    java.sql.SQLTransientConnectionException: HikariPool-1 – Connection is not available, request timed out after 30000ms
  - Apache DBCP/C3P0:
    org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool exhausted
  - Node.js (Sequelize/generic-pool):
    SequelizeConnectionError: ResourceRequest timed out
- Execution/command timeouts:
  - JDBC:
    java.sql.SQLTimeoutException: Query timed out
  - .NET (SqlClient):
    Microsoft.Data.SqlClient.SqlException: Execution Timeout Expired.
  - Postgres (via JDBC):
    org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
- Network/socket timeouts:
  - Java:
    java.net.SocketTimeoutException: Read timed out
  - Node.js:
    Error: connect ETIMEDOUT / read ECONNRESET
These messages help determine whether you’re failing to acquire a connection, timing out during query execution, or hitting network-level delays.
Possible Causes
Quick-reference causes and solutions:
- Cause: Connection pool exhaustion (connections not returned, leaks, pool too small)
  - Solution: Fix connection leaks, reduce long transactions, right-size the pool, lower maxLifetime, add metrics/alerts.
- Cause: Slow or blocking queries (missing indexes, suboptimal plans, large scans)
  - Solution: Use EXPLAIN/EXPLAIN ANALYZE, add indexes, rewrite queries, update statistics, use proper JOINs and filters.
- Cause: Lock contention or deadlocks (long-running transactions, high isolation)
  - Solution: Shorten transactions, lower isolation where safe, use consistent access order, break up hot spots, retry deadlocked transactions.
- Cause: Database saturation (CPU, I/O, memory, max_connections)
  - Solution: Scale vertically/horizontally, tune caches and connection limits, use read replicas, queue load.
- Cause: Network/DNS/firewall issues (latency, packet loss, DNS timeouts, stateful firewalls)
  - Solution: Fix DNS, pin IPs temporarily, adjust keepalives, MTU, and firewall idle timeouts, verify routing.
- Cause: Misconfigured timeouts (too aggressive or too lax)
  - Solution: Align connectionTimeout, socketTimeout, and commandTimeout with SLOs and query profiles; set server-side statement timeouts.
- Cause: Idle connection eviction or NAT timeouts breaking connections
  - Solution: Enable keepalives, reduce idle timeouts, prefer shorter connection lifetimes, retry on transient failures.
Step-by-Step Troubleshooting Guide
Step 1: Confirm Scope and Baseline
- Identify if the problem is persistent or spike-based.
- Collect timestamps, impacted services, and environment (prod/stage).
- Gather existing metrics: connection pool utilization, wait time, DB CPU/IO, QPS, p95/p99 latency.
- Check recent deployments, schema changes, or traffic shifts.
Tip: Add a correlation ID per request to tie errors back to specific queries and code paths.
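One lightweight way to do this in a Java service is to put the ID into the logging context at the request boundary. A minimal sketch using SLF4J's MDC; the key name "correlationId" is an arbitrary choice:
import org.slf4j.MDC;
import java.util.UUID;

// In a servlet filter or interceptor: tag every log line produced for this request,
// including pool-timeout and SQL errors, with the same correlation ID.
String correlationId = UUID.randomUUID().toString();
MDC.put("correlationId", correlationId);
try {
    // handle the request (controller, repository calls, etc.)
} finally {
    MDC.remove("correlationId");   // don't leak the ID onto the next request on this thread
}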
Step 2: Identify Where the Timeout Occurs
Figure out whether the error occurs on connection acquisition, during query execution, or at the network level; the Java sketch at the end of this step shows one way to surface that distinction in code.
For Java/JDBC
- If you see Hikari “Timeout failure waiting for connection from pool,” it’s acquisition.
- If it’s SQLTimeoutException on executeQuery/update, it’s execution timeout.
- If it’s SocketTimeoutException/Read timed out, check network or socket-timeout settings.
Enable Hikari leakDetectionThreshold to detect connection leaks:
spring.datasource.hikari.leakDetectionThreshold=20000
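If the pool is built in code rather than through Spring properties, the same setting can be applied on HikariConfig. A minimal sketch; the URL and threshold are illustrative:
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://db:5432/app");   // illustrative URL
config.setLeakDetectionThreshold(20_000);             // warn if a connection is held longer than 20s
HikariDataSource ds = new HikariDataSource(config);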
For .NET (System.Data.SqlClient or Microsoft.Data.SqlClient)
- Execution Timeout Expired indicates command-level timeout.
- Timeout expired while trying to open indicates connection acquisition/network.
- Use EventSource logging or DiagnosticSource to trace connection open vs command execution durations.
For Node.js (Sequelize, generic-pool)
- “ResourceRequest timed out” → pool acquisition.
- ECONNRESET/ETIMEDOUT → network/socket.
- Enable pool logging and set the acquire and idle timeouts to tell the cases apart.
For Python (SQLAlchemy)
- QueuePool errors (QueuePool limit reached) → acquisition timeout.
- OperationalError with timeout hints → execution/network.
- Enable echo_pool, pool_pre_ping, and set pool_timeout for visibility.
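Whichever stack you are on, it helps to surface the distinction explicitly in code. A minimal Java sketch, using the exception types described above for JDBC/HikariCP; the helper itself is illustrative, not a library API:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLTimeoutException;
import java.sql.SQLTransientConnectionException;
import javax.sql.DataSource;

public final class TimeoutClassifier {
    // Returns a label describing where a timeout occurred, so logs and metrics can
    // separate pool-acquisition waits from query-execution timeouts.
    public static String run(DataSource ds, String sql) throws SQLException {
        try (Connection c = ds.getConnection();                     // may block on the pool
             PreparedStatement ps = c.prepareStatement(sql)) {
            ps.setQueryTimeout(30);                                  // execution budget in seconds
            ps.execute();
            return "ok";
        } catch (SQLTransientConnectionException e) {
            return "pool acquisition timed out: " + e.getMessage();  // e.g., HikariCP connectionTimeout
        } catch (SQLTimeoutException e) {
            return "query execution timed out: " + e.getMessage();   // driver-enforced queryTimeout
        }
    }
}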
Step 3: Inspect the Connection Pool
What to check
- Pool size (maxPoolSize), active vs idle connections, average wait time.
- Connection lifetime/maxLifetime, idleTimeout, and leakDetection.
- Whether connections are being returned in finally blocks or via try-with-resources/using.
Fixes
- Ensure connections are closed in all code paths:
  - Java:
    try (Connection c = ds.getConnection();
         PreparedStatement ps = c.prepareStatement(sql);
         ResultSet rs = ps.executeQuery()) {
        // work
    } // auto-close
  - .NET:
    using var conn = new SqlConnection(cs);
    await conn.OpenAsync();
    using var cmd = new SqlCommand(sql, conn) { CommandTimeout = 30 };
    using var reader = await cmd.ExecuteReaderAsync();
- Right-size the pool: too small causes waits; too large can overload the DB. Rule of thumb: number of cores on the DB node times a small factor; benchmark under your workload (see the worked sizing example after this list).
- Set maxLifetime slightly shorter than the DB’s idle/NAT timeouts to avoid broken connections (e.g., HikariCP maxLifetime=30m, keepaliveTime=2m).
- Enable leak detection and log stack traces of overdue connections to find code paths that don’t return connections.
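As a worked example of the sizing rule of thumb above (the numbers are assumptions, not recommendations):
// Assume an 8-core database host and a mostly CPU-bound OLTP workload.
// A common starting point (e.g., the HikariCP pool-sizing guidance) is roughly
// cores * 2 connections for the whole database, shared across all app instances.
int dbCores = 8;
int targetTotalConnections = dbCores * 2;                        // ~16 connections to the DB in total
int appInstances = 4;
int poolSizePerInstance = targetTotalConnections / appInstances; // ~4 per instance as a starting point
// Benchmark under realistic load and adjust; a bigger pool is not automatically faster.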
Step 4: Analyze Queries and Locks
Collect slow queries
- Enable slow query logs or APM tracing.
- Postgres:
  ALTER SYSTEM SET log_min_duration_statement = '500ms';
  SELECT pg_reload_conf();
  SELECT pid, state, wait_event_type, wait_event, query FROM pg_stat_activity WHERE state <> 'idle';
- MySQL:
  SET GLOBAL slow_query_log = 'ON';
  SET GLOBAL long_query_time = 0.5;
  SHOW PROCESSLIST;
- SQL Server:
  Use Query Store, Extended Events, or sp_WhoIsActive.
Read execution plans
- Use EXPLAIN/EXPLAIN ANALYZE to find scans, bad join orders, missing indexes:
  - Postgres:
    EXPLAIN (ANALYZE, BUFFERS) SELECT …;
  - SQL Server:
    Include Actual Execution Plan in SSMS.
Lock contention and deadlocks
- Symptoms: many sessions waiting on locks; deadlock graphs.
- Postgres:
  SELECT * FROM pg_locks l JOIN pg_stat_activity a ON l.pid = a.pid WHERE NOT granted;
- SQL Server: Extended Events “deadlock graph.”
- MySQL: SHOW ENGINE INNODB STATUS\G
- Solutions:
  - Shorten transactions; move long reads to READ COMMITTED/READ UNCOMMITTED where safe.
  - Access tables in consistent order to reduce deadlocks.
  - Batch writes and commit frequently.
  - Add indexes to reduce lock footprints.
Fixes
- Add or adjust indexes (covering indexes for frequent filters and sorts).
- Rewrite N+1 queries, use proper pagination (keyset over OFFSET for large pages; see the keyset sketch after this list).
- Update statistics and analyze tables.
- Break monolithic transactions into smaller units.
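To illustrate the keyset-pagination point above, here is a minimal JDBC sketch; the table and column names (orders, created_at, id) are illustrative assumptions:
import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import javax.sql.DataSource;

// Fetch the next page strictly after the last row already shown, using an indexed
// (created_at, id) ordering instead of OFFSET, so deep pages stay cheap.
List<Long> nextPage(DataSource ds, Timestamp lastCreatedAt, long lastId) throws SQLException {
    String sql = "SELECT id, created_at FROM orders " +
                 "WHERE (created_at, id) < (?, ?) " +      // row-value seek predicate (Postgres row-value syntax)
                 "ORDER BY created_at DESC, id DESC LIMIT 50";
    List<Long> ids = new ArrayList<>();
    try (Connection c = ds.getConnection();
         PreparedStatement ps = c.prepareStatement(sql)) {
        ps.setTimestamp(1, lastCreatedAt);
        ps.setLong(2, lastId);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                ids.add(rs.getLong("id"));
            }
        }
    }
    return ids;
}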
Step 5: Check Database Resource Saturation
What to review
- CPU near 100%, high I/O wait, memory pressure leading to evictions.
- Connection limits: Postgres max_connections, MySQL max_connections, SQL Server worker threads.
- Tempdb (SQL Server) or temp tablespace pressure.
Actions
- Scale up or out (larger instance, read replicas, sharding).
- Add caching (application-level, query results); see the cache sketch after this list.
- Rate-limit or queue traffic during surges.
- Tune server parameters (effective_cache_size, work_mem, innodb_buffer_pool_size).
- Ensure you are not over-connecting (too many app instances with large pools).
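For the application-level caching item above, a small sketch using Caffeine (an assumed choice of cache library; the key and loader function are illustrative):
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.TimeUnit;

// Cache hot, read-mostly query results for a short TTL so bursts don't all hit the database.
Cache<Long, String> customerNames = Caffeine.newBuilder()
        .maximumSize(10_000)                      // bound memory use
        .expireAfterWrite(30, TimeUnit.SECONDS)   // short TTL keeps data reasonably fresh
        .build();

// loadCustomerNameFromDb is a hypothetical loader that runs only on a cache miss.
String name = customerNames.get(customerId, id -> loadCustomerNameFromDb(id));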
Step 6: Network and DNS
Diagnostics
- Check DNS resolution time; verify DB hostname resolves quickly.
- Test connectivity and latency:
- ping/traceroute/mtr to DB endpoint (where allowed).
- TCP checks:
telnet host 5432 or nc -vz host 3306.
- Inspect firewall/NAT timeouts on idle connections.
- Review load balancer or proxy logs (e.g., HAProxy, PgBouncer, Azure PG gateway).
Fixes
- Set TCP keepalives and shorter connection lifetimes to avoid stale NAT entries (see the JDBC keepalive sketch after this list).
- Pin IPs temporarily to rule out DNS; fix slow or misconfigured resolvers.
- Align LB/proxy idle timeouts with pool idle and server keepalive settings.
- Ensure MTU is consistent to avoid fragmentation-induced retransmissions.
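As one concrete client-side example of these fixes, the PostgreSQL JDBC driver exposes keepalive and timeout parameters on the connection URL. A minimal sketch; the values are illustrative, and other drivers use different parameter names:
// Driver-level TCP keepalive plus bounded connect/read timeouts (pgJDBC URL parameters).
String url = "jdbc:postgresql://db:5432/app"
        + "?tcpKeepAlive=true"     // let the OS probe idle connections before NAT/firewalls drop them
        + "&connectTimeout=10"     // seconds allowed to establish the connection
        + "&socketTimeout=30";     // seconds to wait on any single read before failing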
Step 7: Tune Timeouts Intelligently
Application-side
- Connection acquisition timeout:
  - HikariCP:
    spring.datasource.hikari.maximumPoolSize=20
    spring.datasource.hikari.connectionTimeout=30000
    spring.datasource.hikari.maxLifetime=1800000
    spring.datasource.hikari.keepaliveTime=120000
- Command/query timeout:
  - JDBC:
    stmt.setQueryTimeout(30); // seconds
  - .NET:
    var cmd = new SqlCommand(sql, conn) { CommandTimeout = 30 };
  - Postgres JDBC:
    Properties p = new Properties();
    p.setProperty("socketTimeout", "30"); // seconds
    p.setProperty("loginTimeout", "10");
- Set timeouts based on your SLOs. A too-long timeout hides issues; a too-short one causes spurious failures.
Database-side
- Postgres:
  ALTER SYSTEM SET statement_timeout = '30s';
  ALTER SYSTEM SET idle_in_transaction_session_timeout = '60s';
  SELECT pg_reload_conf();
- MySQL:
  SET GLOBAL wait_timeout=600;
  SET GLOBAL interactive_timeout=600;
- SQL Server:
  Use Resource Governor or query hints judiciously; consider MAXDOP and RECOMPILE hints for parameter sniffing issues.
Infrastructure
- Load balancer idle timeout >= client idle timeout + a small buffer.
- TCP keepalive enabled on hosts; set intervals shorter than NAT/LB idle timeouts.
Step 8: Hardening and Resilience
Retries and circuit breakers
- Implement bounded retries with jitter for transient timeouts (retry safe, idempotent operations only; see the sketch after this list).
- Use circuit breakers to shed load when the DB is degraded.
- Apply bulkheads: separate pools per workload (read vs write) to prevent starvation.
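A minimal plain-Java sketch of the bounded-retry-with-jitter idea; no specific retry library is assumed, and it should only wrap idempotent work:
import java.sql.SQLTransientException;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public final class TransientRetry {
    // Retries op up to maxAttempts times on transient SQL failures, sleeping a random
    // (jittered) delay that grows exponentially between attempts.
    public static <T> T call(Callable<T> op, int maxAttempts) throws Exception {
        long baseDelayMs = 100;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (SQLTransientException e) {
                if (attempt >= maxAttempts) {
                    throw e;                                    // bounded: surface the failure
                }
                long cap = baseDelayMs * (1L << attempt);       // exponential ceiling
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap)); // full jitter
            }
        }
    }
}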
Backpressure and timeout budgets
- Apply a global request timeout and allocate a budget to downstream calls (the DB included).
- Use queues and rate limiting during burst traffic to avoid pool overload.
Configuration Examples
Spring Boot + HikariCP
application.properties:
spring.datasource.url=jdbc:postgresql://db:5432/app
spring.datasource.username=app
spring.datasource.password=secret
spring.datasource.hikari.maximumPoolSize=20
spring.datasource.hikari.minimumIdle=5
spring.datasource.hikari.connectionTimeout=30000
spring.datasource.hikari.idleTimeout=600000
spring.datasource.hikari.maxLifetime=1800000
spring.datasource.hikari.keepaliveTime=120000
spring.datasource.hikari.leakDetectionThreshold=20000
Log pool metrics periodically:
@Bean
public MeterBinder hikariMetrics(HikariDataSource ds) {
    return registry -> registry.gauge("hikari.active", ds,
            d -> d.getHikariPoolMXBean().getActiveConnections());
}
.NET SqlClient
Connection string and command timeout:
var cs = "Server=tcp:sqlserver.database.windows.net,1433;Database=app;User ID=user;Password=secret;Connect Timeout=10;Encrypt=True;TrustServerCertificate=False;";
using var conn = new SqlConnection(cs);
await conn.OpenAsync();
using var cmd = new SqlCommand(“SELECT …”, conn) { CommandTimeout = 30 };
using var reader = await cmd.ExecuteReaderAsync();
Enable transient fault handling (Polly):
// Retry transient SQL and timeout errors with exponential backoff (exception types are illustrative).
var retry = Policy.Handle<SqlException>()
    .Or<TimeoutException>()
    .WaitAndRetryAsync(3, i => TimeSpan.FromMilliseconds(200 * Math.Pow(2, i)));
Postgres server settings
postgresql.conf:
max_connections = 300
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 64MB
statement_timeout = '30s'
idle_in_transaction_session_timeout = '60s'
Operational queries:
SELECT * FROM pg_stat_activity WHERE state <> 'idle';
SELECT * FROM pg_locks WHERE NOT granted;
MySQL
my.cnf:
[mysqld]
max_connections = 500
innodb_buffer_pool_size = 8G
wait_timeout = 600
interactive_timeout = 600
Slow query log:
SET GLOBAL slow_query_log=ON;
SET GLOBAL long_query_time=0.5;
Common Mistakes and How to Avoid Them
- Relying solely on increasing pool size: This can overload the database and worsen timeouts. Instead, fix query performance and leaks.
- Ignoring connection leaks: Not closing connections in error paths starves the pool. Use try-with-resources/using and enable leak detection.
- Setting very high timeouts: Masks issues and holds threads. Use timeouts aligned with SLOs and set server-side statement timeouts.
- Overusing retries: Blind retries amplify load. Implement bounded retries with jitter and idempotency checks.
- Skipping query plan analysis: Indexes and plan quality are often the root cause. Always EXPLAIN slow queries.
- Mixing long-running jobs with user traffic in the same pool: Separate pools or services to avoid starvation.
- Not monitoring pool metrics: Without visibility into active, idle, and wait times, you’ll guess at fixes.
- Forgetting network/LB timeouts: NAT/firewall idle timeouts silently drop connections; add keepalives and tune lifetimes.
Prevention Tips / Best Practices
- Monitor everything: connection pool metrics, DB resource usage, slow queries, lock waits, error rates, and latency distributions.
- Implement safe defaults:
- Reasonable maximumPoolSize and connectionTimeout.
- Server-side statement_timeout (Postgres) or CommandTimeout policies (.NET).
- Idle-in-transaction timeouts to prevent session pinning.
- Optimize hot paths:
- Add appropriate indexes, avoid SELECT *, paginate efficiently.
- Keep transactions short; commit early and often.
- Separate workloads:
- Use read replicas for heavy reads, dedicated pools for background jobs.
- Enforce per-endpoint budgets and rate limits.
- Resilience patterns:
- Retries with backoff for transient failures; circuit breakers; bulkheads.
- Use pool pre-ping/validation and keepalives to avoid stale connections.
- Capacity planning:
- Ensure DB can handle peak concurrent queries; scale before saturation.
- Load-test with realistic concurrency and data volumes.
- Operational hygiene:
- Regularly update statistics, reindex where needed, vacuum/analyze (Postgres).
- Review execution plans after schema or data distribution changes.
Key Takeaways
- Pinpoint the timeout location first: pool acquisition vs query execution vs network.
- Most issues stem from pool exhaustion, slow/blocked queries, or DB saturation—optimize rather than just increasing pool size.
- Use timeouts wisely: align connection, command, socket, and server-side timeouts with your SLOs.
- Close connections reliably, keep transactions short, and monitor pool metrics continuously.
- Add resilience: retries (with care), circuit breakers, and separate pools/workloads to prevent cascading failures.
FAQ
How do I differentiate a connection pool timeout from a SQL execution timeout?
Check the exception type and message. “Timeout waiting for connection from pool” or “pool exhausted” indicates acquisition timeout. “SQLTimeoutException” or “Execution Timeout Expired” during executeQuery/ExecuteNonQuery indicates an execution timeout. Socket/Read timeouts suggest network issues. Enable pool metrics to see wait times and active connections.
What should I set as a reasonable maximum pool size?
Start small and data-driven. For many OLTP systems, a pool size between 10–50 per service instance is sufficient. Consider DB core count and workload; too many connections can reduce throughput due to context switching and lock contention. Load-test to find the knee of the curve, and remember aggregate connections across replicas and instances.
Can increasing CommandTimeout or statement_timeout fix the problem?
It can reduce false positives for legitimately long queries, but it’s often a band-aid. If queries are slow due to missing indexes or blocking, raising timeouts only delays failures and increases resource contention. Prefer optimizing queries and setting timeouts that reflect your SLO.
How do I detect and fix connection leaks?
Enable leak detection (e.g., Hikari leakDetectionThreshold) and ensure every connection/command/reader is closed in finally blocks or via try-with-resources/using. Add metrics for borrowed/returned counts. Use static code analysis and integration tests to confirm connections are freed in error paths.
What network settings help avoid idle disconnects?
Enable TCP keepalives on clients and servers, set keepalive intervals lower than NAT/LB idle timeouts, reduce pool idleTimeout, and set maxLifetime shorter than infrastructure idle timeouts. If using proxies or load balancers, align their idle timeouts with your application and database settings.
