How to handle AWS performance & scaling

So it's deadline time for our cloud project. We are load testing our app and no one seems to know where the problem is. We are in a mild panic to get our app released, and the app is suffering from major performance issues.

Why is load testing failing?

Only half of our load tests were coming back successful. QA reported that our app couldn't handle the volume of users we'd expect on high-traffic days. Then the questioning begins...

"Is it this? Is it that? Is it something on your end"
"I think it could be this?"

Have you ever been here?

What to do when you have performance issues with your AWS app?

Start with a map of your architecture. It could be a drawing on a napkin for all we care.

Our app looks like this:

A hybrid cloud app with the back-end on AWS.

We are trying to figure out where in the process it's slowing down. We are not sure if it's the network, database, middleware, front-end, etc. We are using PostgreSQL (RDS) as the database and Fargate for our Node middleware.

Our Architecture

Summary of our Journey:

QA reported load testing problems and many of the tests were failing

  1. We realized we needed to identify where our performance issues were coming from
  2. We added AppDynamics to our app to help us monitor performance
  3. CPU was spiking for RDS & Fargate and we weren't sure where to focus
  4. Our PostgreSQL database wasn't performing well on intensive queries
  5. I recalled a former co-worker saying EC2 instances have network bandwidth limits
  6. We discovered that we needed to beef up CPU/memory and add auto-scaling to our Fargate service
  7. We created a Terraform script to handle scaling in our pipeline

How do you add AppDynamics to Fargate?

We thought we'd try our enterprise monitoring since we weren't ready to go into the X-Ray land of AWS. We didn't have time to dig into that just yet.

Insert the AppDynamics code in the startup of your Node app (or other main entry file) in the Fargate image. See: https://docs.appdynamics.com/display/PRO45/Install+the+Node.js+Agent

require("appdynamics").profile({
        controllerHostName: "appd-api.domain.org",
        controllerPort: XXX,
        controllerSslEnabled: false,
        accountName: "customerAccountName",
        accountAccessKey: "asdjfaksdjfi",
        applicationName: "OUR APPNAME",
        tierName: "DEPTNAME",
        nodeName: "NODENAME",
        debug: true
    });
  • Make sure the right ports are open from the Fargate service to the AppDynamics server

We had to tinker around with AWS security, firewalls, and security groups until we understood what was going on. We also had to dive into our enterprise standards a bit more. Then our system admin did some work with the ports, our on-prem load balancer, and the network. Finally, after several meetings with our various teams, we had something working: our Fargate service could talk to our AppDynamics monitoring server. Yeah! This took a large part of our time. Understanding all the network security and getting the right people together on a Teams call was quite a job.
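For the AWS side of that, the kind of rule we ended up needing looks roughly like the sketch below. The port, CIDR range, and security group name are placeholders for our environment, not real values:

# Sketch: allow Fargate tasks to reach the AppDynamics controller
resource "aws_security_group_rule" "appdynamics_egress" {
  type              = "egress"
  from_port         = 8090            # placeholder: your controller port
  to_port           = 8090
  protocol          = "tcp"
  cidr_blocks       = ["10.0.0.0/8"]  # placeholder: on-prem network range
  description       = "Fargate tasks to AppDynamics controller"
  security_group_id = aws_security_group.fargate_tasks.id  # placeholder SG name
}

The on-prem firewall and load balancer changes were the other half of the picture, and our system admin handled those outside of Terraform.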

Just expect that learning AWS security and networking will add overhead to your team's efforts, especially if your devs aren't used to it. Cloud requires you to know a little bit of everything, as it were. Just realize "COLLABORATION" is key when it comes to cloud projects.

Is the database the problem?

So we called everyone we knew until we found a good DBA.

Here's what we found.

  • Connections were limited to between 2 and 10 by default, which is the Knex.js connection pool default (see below) rather than a PostgreSQL limit

So what is the max connection setting we can set?

  • Our RDS db.t2.medium instance can handle 150 connections.

The RDS instance classes and their default max_connections limits:

Instance class   max_connections
t2.micro         66
t2.small         150
m3.medium        296
t2.medium        312
m3.large         609
t2.large         648
m4.large         648
m3.xlarge        1237
r3.large         1258
m4.xlarge        1320
m2.xlarge        1412
m3.2xlarge       2492
r3.xlarge        2540

https://serverfault.com/questions/862387/aws-rds-connection-limits

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.DBInstanceClass.html
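For reference, these defaults follow the formula in the default RDS PostgreSQL parameter group:

max_connections = LEAST({DBInstanceClassMemory/9531392}, 5000)

DBInstanceClassMemory is somewhat less than the instance class's nominal RAM (the OS and RDS processes reserve a share of it), which is why the observed limits above are lower than you'd get by dividing the full RAM by 9531392. You can also raise max_connections yourself in a custom parameter group, at the cost of memory per connection.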


Pooling Config of Knex.js

We are using Knex.js and had to make adjustments to our knex config file.

The client created by the configuration initializes a connection pool.
This connection pool has a default setting of min: 2, max: 10.

To change the config settings for the pool, pass a pool option as one of the keys in the initialize block.

Knex.js - A SQL Query Builder for Javascript
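As a rough sketch of what that looks like (the connection details and pool sizes here are placeholders, not our real values):

// db.js sketch: connection details and pool sizes are placeholders
const knex = require("knex")({
  client: "pg",
  connection: {
    host: "our-db.xxxxxxxx.us-east-1.rds.amazonaws.com",
    user: "app_user",
    password: process.env.DB_PASSWORD,
    database: "appdb",
  },
  // Override the default pool of min: 2, max: 10.
  // Keep max comfortably below the RDS max_connections limit,
  // remembering that every Fargate task gets its own pool.
  pool: { min: 2, max: 50 },
});

module.exports = knex;

One thing worth keeping in mind: each running Fargate task opens its own pool, so the effective connection count against RDS is roughly the pool max times the number of tasks.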

The CPU on RDS is spiking

Sometimes CPU spikes because the database is waiting...
https://aws.amazon.com/blogs/database/analyzing-amazon-rds-database-workload-with-performance-insights/

The second challenge was optimizing queries. Our Back-end Dev & DBA helped get our queries working optimally.

I'm not going to go into this other than to say AWS and Google have various articles on it. I chatted with AWS support and their suggestions took us in another direction.


Increase CPU & Memory on Fargate

Then I get on a web chat session with an AWS support technician.
I show him all of our performance stats in CloudWatch and he says:

“let's try upping the CPU and Memory for your Fargate instances.”

We are using Terraform, and we were using our company's default Terraform modules.
We didn't have an auto-scaling policy written into our script, so we added one. Now we are auto-scaling and have 4x the CPU and memory we had before.

Enterprise Module Default
Our Terraform module that now overrides the defaults
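Roughly, the change amounts to bumping the task size in the task definition. The sketch below uses illustrative names and numbers, not our exact enterprise defaults:

# Sketch: resource names and sizes are illustrative
resource "aws_ecs_task_definition" "this" {
  family                   = local.app_name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"

  # Bumped to roughly 4x the module defaults
  cpu    = 1024   # CPU units (1 vCPU)
  memory = 2048   # MiB

  execution_role_arn    = var.execution_role_arn     # assumed module input
  container_definitions = var.container_definitions  # assumed module input (JSON)
}

Fargate only accepts certain CPU/memory combinations, so check the supported pairs before picking numbers.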

I recalled I had chatted with my AWS friend Joey about an issue he had a year ago.
He said:

"we were hitting what looked like I/O issues, but it was really just us hitting the cap on our instance size's bandwidth to EBS"

So in the end, upping our Fargate task size not only improved CPU performance but network throughput as well.

Different instance size = different maximum throughput

  • Different Fargate/EC2 sizes (CPU/memory) have different network bandwidth and performance.
AWS Fargate Network Performance
In this post we take a look at AWS Fargate's network performance and discuss some interesting observations we had.
AWS clearly states it in the fine print, so to speak.

Again, I'll repeat:
Different instance size = different maximum throughput.
See the EBS-optimized instances link.


Auto-scaling really helped; here's how

We decided to set up auto-scaling to trigger based on Fargate CPU utilization.

The recommendation was a high threshold of 45 and a low of 10. We set ours a bit lower: our average CPU is quite low and we wanted Fargate to spin up tasks sooner. Performance is now flawless in AWS as far as I can tell, and we are very happy with it. Our load testing came back with a 100% success rate.
We no longer had failed calls.


We made a Terraform auto_scaling.tf file

- Our pipeline rocks now!

We just added this to our Fargate Terraform folder and, ta-da, we have liftoff!

# auto_scaling.tf

resource "aws_appautoscaling_target" "target" {
  // The ECS service needs to exist before autoscaling target
  count = local.use_autoscaling ? 1 : 0


  service_namespace  = "ecs"
  resource_id        = "service/${local.app_name}/${local.app_name}"
  scalable_dimension = "ecs:service:DesiredCount" #The desired task count of an ECS service.
  min_capacity       = var.autoscaling_min_capacity #Required variable for autoscaling per local.use_autoscaling
  max_capacity       = var.autoscaling_max_capacity #Required variable for autoscaling per local.use_autoscaling

  depends_on = [aws_ecs_service.this]
}

# Automatically scale capacity up by one
resource "aws_appautoscaling_policy" "up" {
  // The Autoscaling target needs to exist before autoscaling policy
  count = local.use_autoscaling ? 1 : 0
  
  name               = "${local.app_name}_scale_up"
  service_namespace  = "ecs"
  resource_id        = "service/${local.app_name}/${local.app_name}"
  scalable_dimension = "ecs:service:DesiredCount"

  step_scaling_policy_configuration {
    adjustment_type         = "ChangeInCapacity"
    cooldown                = var.autoscaling_cooldown_in
    metric_aggregation_type = "Maximum"

    step_adjustment {
      metric_interval_lower_bound = 0
      scaling_adjustment          = 1
    }
  }

  depends_on = [aws_appautoscaling_target.target[0]]
}

# Automatically scale capacity down by one
resource "aws_appautoscaling_policy" "down" {
  count = local.use_autoscaling ? 1 : 0
  
  name               = "${local.app_name}_scale_down"
  service_namespace  = "ecs"
  resource_id        = "service/${local.app_name}/${local.app_name}"
  scalable_dimension = "ecs:service:DesiredCount"

  step_scaling_policy_configuration {
    adjustment_type         = "ChangeInCapacity"
    cooldown                = var.autoscaling_cooldown_out
    metric_aggregation_type = "Maximum"

    step_adjustment {
      metric_interval_upper_bound = 0
      scaling_adjustment          = -1
    }
  }

  depends_on = [aws_appautoscaling_target.target[0]]
}

# CloudWatch alarm that triggers the autoscaling up policy
resource "aws_cloudwatch_metric_alarm" "service_cpu_high" {
  count = local.use_autoscaling ? 1 : 0
  
  alarm_name          = "${local.app_name}_scaletrigger_cpu_high"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "2"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/ECS"
  period              =  var.autoscaling_cooldown_in
  statistic           = "Average"
  threshold           =  var.autoscaling_cpu_threshold_up

  dimensions = {
    ClusterName = aws_ecs_cluster.this.name
    ServiceName = aws_ecs_service.this.name
  }

  alarm_actions = [aws_appautoscaling_policy.up[0].arn]
}

# CloudWatch alarm that triggers the autoscaling down policy
resource "aws_cloudwatch_metric_alarm" "service_cpu_low" {
  count = local.use_autoscaling ? 1 : 0
  
  alarm_name          = "${local.app_name}_scaletrigger_cpu_low"
  comparison_operator = "LessThanOrEqualToThreshold"
  evaluation_periods  = "2"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/ECS"
  period              =  var.autoscaling_cooldown_out
  statistic           = "Average"
  threshold           = var.autoscaling_cpu_threshold_down

  dimensions = {
    ClusterName = aws_ecs_cluster.this.name
    ServiceName = aws_ecs_service.this.name
  }

  alarm_actions = [aws_appautoscaling_policy.down[0].arn]
}
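
For completeness, the inputs that auto_scaling.tf expects look roughly like the sketch below. The defaults are illustrative (the 45/10 thresholds mirror the numbers mentioned above) rather than anything prescriptive:

# variables.tf (sketch; defaults are illustrative)

variable "autoscaling_min_capacity" {
  description = "Minimum number of Fargate tasks; leave null to disable autoscaling"
  type        = number
  default     = null
}

variable "autoscaling_max_capacity" {
  description = "Maximum number of Fargate tasks; leave null to disable autoscaling"
  type        = number
  default     = null
}

variable "autoscaling_cpu_threshold_up" {
  description = "Average service CPU (%) at or above which the scale-up alarm fires"
  type        = number
  default     = 45
}

variable "autoscaling_cpu_threshold_down" {
  description = "Average service CPU (%) at or below which the scale-down alarm fires"
  type        = number
  default     = 10
}

variable "autoscaling_cooldown_in" {
  description = "Cooldown and alarm period in seconds for the scale-up side"
  type        = number
  default     = 60
}

variable "autoscaling_cooldown_out" {
  description = "Cooldown and alarm period in seconds for the scale-down side"
  type        = number
  default     = 300
}

locals {
  # Autoscaling is enabled only when both capacity bounds are provided
  use_autoscaling = var.autoscaling_min_capacity != null && var.autoscaling_max_capacity != null
}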

For further reading see my article on "Cloud Projects require a new mindset"
