
As a cloud engineer, I manage the networking for a construction project management company, and we rely heavily on AWS services. AWS provides robust cloud infrastructure, but networking issues can sometimes disrupt the smooth operation of an application. I recently faced such an issue, and in this post I share what the issue was and how I diagnosed and resolved it.
The main problem was database connection timeouts. I got a call from our client saying that something was wrong with the application: it was running very slowly and failing to save real-time project updates.
I first checked the CloudWatch logs and metrics and noticed many connection timeout errors between the application servers hosted on EC2 and the RDS for PostgreSQL database. That pointed to some kind of networking problem.
Having narrowed the problem down, I followed these steps to diagnose it:
First, I ran the following command from an EC2 instance to check whether the database was reachable:
telnet db.xxxx.us-east-1.rds.amazonaws.com 5432
The connection could not be established, which confirmed that something was blocking network traffic.
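If telnet is not installed on the instance, the same reachability check can be sketched in Python with only the standard library. This is a minimal sketch, not the exact check I ran; the function name `can_connect` and the 5-second timeout are assumptions for illustration.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures
        return False

# Example call, using the endpoint from the telnet command:
# can_connect("db.xxxx.us-east-1.rds.amazonaws.com", 5432)
```

A `False` result only says that the TCP handshake did not complete. It does not say which layer (security group, NACL, routing, or the database itself) is responsible.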
Next, I verified the security groups and NACLs associated with both the EC2 servers and RDS. The inbound rules for RDS correctly allowed traffic from the EC2 instances, and the NACLs were correctly configured to allow inbound and outbound traffic. However, the rules did not follow the principle of least privilege: some unneeded ports and IP ranges were left open. I thought this might be contributing to the problem.
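A review like this can be partly automated. The sketch below assumes the rules are available in the `IpPermissions` shape that boto3's `describe_security_groups` returns. The helper names and the group IDs (`sg-app`, `sg-other`) are hypothetical and only there for illustration.

```python
def sg_allows(ip_permissions, port, source_group_id):
    """Return True if any rule allows TCP `port` from the given source security group."""
    for perm in ip_permissions:
        proto = perm.get("IpProtocol")
        if proto == "-1":
            port_ok = True  # "-1" means all protocols and all ports
        elif proto == "tcp":
            port_ok = perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535)
        else:
            continue
        groups = {pair.get("GroupId") for pair in perm.get("UserIdGroupPairs", [])}
        if port_ok and source_group_id in groups:
            return True
    return False

def world_open_rules(ip_permissions):
    """Return the rules that accept traffic from any IPv4 address (0.0.0.0/0)."""
    return [
        perm for perm in ip_permissions
        if any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", []))
    ]

# Hypothetical inbound rules for the database security group
db_rules = [
    {"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
     "UserIdGroupPairs": [{"GroupId": "sg-app"}], "IpRanges": []},
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "UserIdGroupPairs": [], "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
]

print(sg_allows(db_rules, 5432, "sg-app"))  # is PostgreSQL reachable from the app tier?
print(len(world_open_rules(db_rules)))      # how many rules are open to the internet?
```

The first helper answers whether the app tier can reach PostgreSQL, and the second flags the kind of overly broad rule that least privilege is meant to prevent.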
In the next step, I analyzed VPC flow logs to capture network traffic and look for anomalies.
The logs showed that some packets were being dropped when the EC2 instances tried to connect to RDS.
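To illustrate how such records can be filtered, here is a minimal sketch that assumes the flow logs use the default (version 2) record format and have already been exported as text lines. The sample records, IP addresses, and the `rejected_to_port` helper are made up for the example and are not taken from our logs.

```python
# Field order of the default VPC flow log record format (version 2)
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()

def rejected_to_port(lines, port=5432):
    """Return parsed records whose action is REJECT and whose destination port matches."""
    hits = []
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip headers, blank lines, or custom-format records
        record = dict(zip(FIELDS, parts))
        if record["action"] == "REJECT" and record["dstport"] == str(port):
            hits.append(record)
    return hits

# Made-up sample records for illustration
sample = [
    "2 123456789012 eni-0abc 10.0.1.25 10.0.2.40 49152 5432 6 3 180 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0abc 10.0.1.25 10.0.2.40 49153 5432 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0abc - - - - - - - 1700000000 1700000060 - NODATA",
]

for rec in rejected_to_port(sample):
    print(rec["srcaddr"], "->", rec["dstaddr"], "port", rec["dstport"], rec["action"])
```

Grouping the matches by source address and time window is a natural next step to see whether the drops come from specific instances or specific periods.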
After that, I checked the route tables and subnets. Our database was in a private subnet and our EC2 instances were in a public subnet. The route table for the private subnet had no issues, as it allowed traffic within the VPC.
I still could not identify the root cause until I found a problem with the NAT Gateway. Our NAT Gateway, which allows instances in private subnets to reach the internet, was experiencing high latency and packet drops. It was overloaded because it was shared across multiple services.
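One way to confirm this kind of overload is to look at the NAT Gateway's CloudWatch metrics. The sketch below builds a `get_metric_statistics` request for the `AWS/NATGateway` namespace, where `PacketsDropCount` and `ErrorPortAllocation` are useful metrics to check. The helper names, the 3-hour window, and the gateway ID are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def nat_metric_request(nat_gateway_id, metric_name, hours=3):
    """Build keyword arguments for CloudWatch get_metric_statistics (5-minute sums)."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/NATGateway",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "NatGatewayId", "Value": nat_gateway_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,
        "Statistics": ["Sum"],
    }

def fetch_nat_metric(nat_gateway_id, metric_name):
    """Fetch datapoints ordered by time. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder works without AWS access
    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **nat_metric_request(nat_gateway_id, metric_name)
    )
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])

# Hypothetical gateway ID:
# fetch_nat_metric("nat-0123456789abcdef0", "PacketsDropCount")
```

Sustained non-zero sums for either metric during the incident window would be consistent with a congested gateway.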
Once I had identified the issue, I took the following steps to resolve the problem:
i. Deployed a new NAT Gateway:
I created a dedicated NAT Gateway for our application servers to prevent congestion and updated the private subnet’s route table to use it.
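For readers who prefer to script this step, here is a rough sketch using boto3's EC2 client. It assumes the private subnet's route table already has a `0.0.0.0/0` route pointing at the old gateway and that an Elastic IP has been allocated. The function name and all resource IDs are hypothetical.

```python
def switch_to_dedicated_nat(ec2, public_subnet_id, allocation_id, route_table_id):
    """Create a NAT Gateway and point the route table's default route at it.

    `ec2` is a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    # A public NAT Gateway is placed in a public subnet and needs an Elastic IP
    created = ec2.create_nat_gateway(SubnetId=public_subnet_id, AllocationId=allocation_id)
    nat_id = created["NatGateway"]["NatGatewayId"]

    # Wait until the gateway is usable before sending traffic through it
    ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])

    # Swap the existing default route over to the new gateway
    ec2.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        NatGatewayId=nat_id,
    )
    return nat_id

# Hypothetical IDs:
# import boto3
# switch_to_dedicated_nat(boto3.client("ec2"), "subnet-0pub", "eipalloc-0abc", "rtb-0priv")
```

Passing the client in as an argument keeps the function easy to exercise with a stub before running it against a real account.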
ii. Modified security group rules:
I added an explicit allow rule for traffic between the EC2 instances and RDS. This helped rule out potential issues caused by AWS security changes.
iii. Enabled monitoring for RDS:
I turned on Enhanced Monitoring for Amazon RDS to get deeper insight into performance issues.
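Enhanced Monitoring can also be switched on through the API. The sketch below builds the arguments for boto3's `modify_db_instance` call. It assumes an IAM role that lets RDS publish monitoring data already exists. The helper name, instance identifier, and role ARN are hypothetical.

```python
# Collection intervals (in seconds) accepted by RDS; 0 turns Enhanced Monitoring off
VALID_INTERVALS = {0, 1, 5, 10, 15, 30, 60}

def enhanced_monitoring_request(db_instance_id, role_arn, interval=60):
    """Build keyword arguments for the RDS modify_db_instance call."""
    if interval not in VALID_INTERVALS:
        raise ValueError(f"interval must be one of {sorted(VALID_INTERVALS)}")
    return {
        "DBInstanceIdentifier": db_instance_id,
        "MonitoringInterval": interval,
        "MonitoringRoleArn": role_arn,
        "ApplyImmediately": True,
    }

# Hypothetical identifiers; requires boto3 and AWS credentials:
# import boto3
# boto3.client("rds").modify_db_instance(
#     **enhanced_monitoring_request(
#         "app-db", "arn:aws:iam::123456789012:role/rds-monitoring-role"
#     )
# )
```

A 60-second interval is a gentle starting point. Shorter intervals give finer detail at the cost of more monitoring data.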
iv. Optimized database queries:
Some queries were taking longer than expected, causing timeouts, so I optimized them to improve performance. Other queries, such as retrieving project details, were requested very frequently. Instead of querying the database every time, I cached their results in Amazon ElastiCache (Redis). The following Python code shows the query caching in Redis.
import json
import redis
import psycopg2

cache = redis.Redis(host='cache.xx.us-east-1.cache.amazonaws.com', port=6379)

def get_project(workforce):
    cached_data = cache.get(f"project:{workforce}")
    if cached_data:
        return tuple(json.loads(cached_data))  # Return the cached result
    # If not in cache, fetch from the database
    conn = psycopg2.connect("dbname=db123 user=admin2 password=dbpswd host=apprds.xxx.us-east-1.rds.amazonaws.com")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT name, status FROM projects WHERE id = %s", (workforce,))
            project = cur.fetchone()
    finally:
        conn.close()  # Always release the database connection
    # Redis cannot store a tuple directly, so serialize it to JSON and cache for 1 hour
    if project is not None:
        cache.setex(f"project:{workforce}", 3600, json.dumps(project))
    return project
After implementing these changes, the connectivity and latency issues were resolved. The application started working smoothly, and our clients could access real-time project updates without delays.
The main things to consider while resolving networking issues are:
- Always monitor VPC flow logs to detect network issues early.
- Avoid overloading NAT Gateways. Instead, use dedicated ones for critical services.
- Security Groups and Network ACLs should be regularly reviewed to avoid unexpected blocks.
- Enable RDS Enhanced Monitoring to spot database performance problems before they cause timeouts.
- Use caching (Redis) for frequently accessed data.