Choosing an EC2 Instance

Hey guys, in this post I am going to talk about everything you should think about before choosing an EC2 instance for whatever workload you have in mind.

The Basics

Ok, if you are new to cloud, the first thought that comes to mind is something akin to

I gotta look at the number of cores, the RAM and choose an EBS (magnetic / general purpose SSD / provisioned IOPS?) size, balance it with cost, what’s complicated in that?

Well, lots! Let’s look at everything that should be considered before making a decision:

Number of processing threads
Speed of individual threads
Graphic processing?
Disk throughput requirement
Disk latency requirements
Network throughput
Network latency
Instance availability
Corresponding Spot pricing for instance type sensitive workloads
Availability across different regions
Does the application horizontally scale?
Does the application framework (.Net, Java, NodeJS) vertically scale well with hardware?
Price

Cool, now that we have established that there is in fact a reason for me to write this article, let’s move forward and delve into each one of the above properties. Some will be simple and straightforward while others will need to be considered with a lot more thought. We will first determine the nature and requirements of our app and then proceed to actually choosing the instance type.

Number of processing threads

The number of processing threads determines the number of tasks that your processor can do in true parallel. Remember that this is different from the asynchronous behavior exhibited by some applications. While asynchronous execution is pseudo-parallel and requires far fewer physical threads, parallel computing by definition requires at least two threads.
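To make the asynchronous versus parallel distinction concrete, here is a minimal Python sketch. The function names and the sleep-based "IO call" are made up for illustration. It shows three simulated IO waits overlapping on a single OS thread, which is why IO-heavy apps can get by with few hardware threads.

```python
import asyncio
import threading


async def fake_io_call(name, delay_seconds):
    # Stand-in for a database or HTTP call: while this task waits,
    # the thread is free to run other tasks.
    await asyncio.sleep(delay_seconds)
    return name, threading.get_ident()


async def fetch_all():
    # All three waits overlap, yet everything runs on one OS thread.
    return await asyncio.gather(
        fake_io_call("db", 0.05),
        fake_io_call("cache", 0.05),
        fake_io_call("api", 0.05),
    )


results = asyncio.run(fetch_all())
names = [name for name, _ in results]
thread_ids = {thread_id for _, thread_id in results}
print(names, "ran on", len(thread_ids), "thread(s)")
```

Compute-heavy work would not overlap this way; it needs real hardware threads (for example through a process pool) to run in true parallel.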

In order to estimate the number of threads required, ask yourself these questions:

Is my application IO intensive or compute intensive? In other words, does my application go to the database or some other outside service and process little (general web applications fall here), or does my application load and compute a large amount of data in parallel (think analytical data crunching, or a Google Maps style service computing shortest paths to a destination)?
Does my application aggregate data from a lot of sources? Even if the application is IO intensive and doesn’t do much parallel compute, aggregation from multiple sources requires a constant, healthy supply of application threads (not to be confused with hardware ones), which ultimately depends on the number of hardware threads available to the system.
Do I have more than one application on the server? I would say change that, because more than one app dilutes your decisions about the best hardware, auto-recovery, auto-scaling, redundancy planning and fault detection, and in general you wind up spending more money. More on this specific topic in a later article, maybe.

If your answer to question 1 is “IO intensive” and your answer to question 2 is “No”, your choice of EC2 instance does not pivot on the number of hardware threads available.

Speed of individual threads

Everybody wants a faster CPU and it never hurts. However, given a choice, what is better for you: more threads or a faster processor?

The questions to ask yourself:

Does my application handle a “lot” of data over the network? Data transfer requires an actively participating CPU thread, and faster threads process and clear data buffers faster.
Does my application do a lot of compute with the data? E.g. a map location service computing hundreds of paths to determine the optimal route for your vehicle (or feet) to take with the least amount of traffic.
Does my application depend upon multiplexing instead of multi-threading? E.g. Redis is mostly a single-threaded app: it uses one thread to handle all requests and another (or more) for admin tasks. Clearly, adding more than two threads is a waste here, whereas increasing the speed of individual threads has a direct correlation to its performance.

If your answer to any one of the above is “yes” then your application pivots on speed of the CPU.

Graphic processing

Simple and straightforward: you either need this or you don’t. Before you choose a specific instance family, take note that this is about to come (at the time of writing this article): https://aws.amazon.com/ec2/Elastic-GPUs/

Disk throughput

High disk throughput is typically required for applications such as databases, or any application which reads from and writes to the file system a lot. Determine the number of IO operations that happen per second (IOPS). Typically it’s a good idea to find the average peak IOPS requirement and then pad over that to make sure the app is never bottlenecked on IOPS.
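The "average peak plus padding" idea can be sketched in a few lines of Python. The sample numbers and the 30% headroom are assumptions for illustration, not a recommendation; pick a padding that matches how spiky your own workload is.

```python
import math


def iops_target(peak_samples, headroom_pct=30):
    """Average the observed peak IOPS values, then pad by headroom_pct percent."""
    if not peak_samples:
        raise ValueError("need at least one peak sample")
    average_peak = sum(peak_samples) / len(peak_samples)
    return math.ceil(average_peak * (100 + headroom_pct) / 100)


# Hypothetical daily peak IOPS taken from your monitoring.
daily_peaks = [2400, 3100, 2800, 2900]
target = iops_target(daily_peaks)
print(target)  # 3640 (average peak of 2800 padded by 30%)
```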

Disk latency

Latency is the time the disk takes to complete a single operation, which is different from throughput. Basically, you will require lower latency for time-sensitive operations such as seeks for a database, but not so much for logs that are being written asynchronously to the file system (in which case you will probably require more bandwidth).
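A rough way to see how latency and throughput relate is the back-of-the-envelope rule that operations per second are about the number of outstanding requests divided by the latency of each one. The sketch below uses that simplified model (it ignores device and bandwidth limits) with made-up numbers.

```python
def max_iops(latency_ms, outstanding_requests=1):
    """Upper bound on IOPS if each operation takes latency_ms and
    outstanding_requests operations are in flight at once (simplified model)."""
    return outstanding_requests * 1000 / latency_ms


# A database doing one dependent seek at a time is bound by latency:
print(max_iops(latency_ms=5))    # 200.0
print(max_iops(latency_ms=0.5))  # 2000.0

# An async log writer can keep many writes in flight, so latency matters less:
print(max_iops(latency_ms=5, outstanding_requests=32))  # 6400.0
```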

Network throughput

This matters when your application connects to other services, whether third-party or local. Network throughput will play a pivotal role if you make a lot of requests over the network and your result depends on the responses. It’s a good idea to note how many packets your application pushes per second (PPS). Typically you want to find the average peak PPS requirement and then pad over that to make sure the app is never bottlenecked on PPS.
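PPS and bandwidth are related but not the same limit. As a small illustration (the packet sizes and rates below are invented), the same packet rate can mean very different bandwidth depending on packet size:

```python
def bandwidth_mbps(packets_per_second, avg_packet_bytes):
    """Convert a packet rate and an average packet size into megabits per second."""
    return packets_per_second * avg_packet_bytes * 8 / 1_000_000


# 50,000 packets per second of large packets versus small ones.
large = bandwidth_mbps(50_000, 1500)
small = bandwidth_mbps(50_000, 100)
print(large, small)  # 600.0 40.0
```

A chatty service with tiny payloads can therefore hit a packet-rate ceiling long before it gets anywhere near the advertised bandwidth, which is why it helps to measure both.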

In AWS, the bigger the machine, the higher the network bandwidth. Keep that in mind while selecting your machine class.

Instance availability

This one is worth mentioning as I have personally been burned by it. Typically, when AWS launches the next, more awesome server (in my case it was the c4 series), some people jump on it to get better hardware at a lower cost, which is generally the case in AWS. However, it is worth waiting a few months before you shift production workloads over. What happened in my case was that AWS literally ran out of c4.large machines during peak time when my auto-scaling kicked in, resulting in slower responses from our product. AWS is generally very good at increasing capacity when they launch something new; however, for the cautious it’s still worth waiting.

Corresponding Spot pricing for instance type sensitive workloads

In case you do mixed scaling to save money (see this article in case you aren’t using spots well enough) and you keep the instance type the same for better predictability of CloudWatch metrics (I do that), it’s worth noting how dependable spot servers are for the instance type that you are about to choose. Sometimes choosing a smaller or bigger machine (and therefore a higher or lower number of machines) can result in a better price balance. You can see the dependability and rates of these spot instances on AWS at its Spot Bid Advisor.
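To compare a "many small machines" fleet against a "few big machines" fleet, a quick calculation like the one below can help. Everything here is hypothetical: the instance labels, the prices (in cents per hour) and the 50/50 on-demand to spot split are placeholders you would replace with real figures.

```python
from dataclasses import dataclass


@dataclass
class FleetOption:
    label: str                  # hypothetical name, not a real instance type
    vcpus_per_instance: int
    on_demand_cents: int        # hypothetical on-demand price, cents per hour
    spot_cents: int             # hypothetical spot price, cents per hour


def hourly_cost_cents(option, total_vcpus, spot_fraction):
    """Cost of a mixed fleet sized to provide at least total_vcpus."""
    count = -(-total_vcpus // option.vcpus_per_instance)  # ceiling division
    spot_count = int(count * spot_fraction)
    on_demand_count = count - spot_count
    return on_demand_count * option.on_demand_cents + spot_count * option.spot_cents


small = FleetOption("small-type", 2, on_demand_cents=10, spot_cents=3)
big = FleetOption("big-type", 4, on_demand_cents=20, spot_cents=4)

small_cost = hourly_cost_cents(small, total_vcpus=16, spot_fraction=0.5)
big_cost = hourly_cost_cents(big, total_vcpus=16, spot_fraction=0.5)
print(small_cost, big_cost)  # 52 48
```

With these made-up prices the bigger machine wins because its spot discount is deeper, even though the on-demand price per vCPU is identical. Real spot prices move around, so rerun the numbers with current rates.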

Availability across different regions

Not all instance types are available in all regions. AWS has that goal, I’m sure, however they come up with newer regions every now and then, and hoping that everything is everywhere is just impractical. It’s best to check that the instance you are using is available across all the regions that you plan to use, or at least to have an alternative ready for any region that does not support the instance you are going for.

One way to find out is to go here and see whether the instance is listed in the regions that you choose from the region dropdown towards the top of the page.
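If you would rather script the check, something along these lines works once you have a list of what each region offers. The region-to-instance mapping below is invented sample data; in practice it could be filled from the page mentioned above or from the EC2 DescribeInstanceTypeOfferings API.

```python
def regions_missing(instance_type, offerings_by_region, planned_regions):
    """Return the planned regions that do not offer instance_type."""
    return sorted(
        region
        for region in planned_regions
        if instance_type not in offerings_by_region.get(region, set())
    )


# Invented sample data, NOT real availability.
sample_offerings = {
    "region-a": {"c4.large", "m4.large"},
    "region-b": {"m4.large"},
    "region-c": {"c4.large"},
}

missing = regions_missing("c4.large", sample_offerings, ["region-a", "region-b", "region-c"])
print(missing)  # ['region-b']
```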

Does the application horizontally scale?

Horizontal scalability of an application is defined (in my words) as:

The ability to throw more machines to cater to more requests

Blatantly simple. However, it also means that all your dependent services like databases, cache servers, session maintenance etc. need to be horizontally scalable as well. Typically, applications start bottlenecking at databases, especially if you are using an RDBMS, in which you can have a limited number of read replicas and, more importantly, a single active write machine. NoSQL is better at this particular problem; however, most of these still do not have infinite horizontal scalability.

Anyhow, I digress, for the purpose of this article horizontal scalability would be more like:

The ability of my application to be run on at least thrice the number of machines that I predict I will ever need

The answer to the above question depends upon the size of the machine, but you should know whether the number of machines matters to your application, either due to licensing limitations (get rid of such software) or because some dependent service slows down with a higher number of connections to it (sometimes a service allows only a limited number of connections per server, and that is not directly correlated with the load on any one machine; think logging as an example).
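The "thrice the machines I will ever need" test can be checked against a connection limit with a few lines of Python. The limit and per-machine connection counts are assumed values for illustration.

```python
def machines_within_limit(service_connection_limit, connections_per_machine):
    """How many app machines a dependent service can accept before it runs out of connections."""
    return service_connection_limit // connections_per_machine


def can_scale_to(predicted_machines, service_connection_limit, connections_per_machine, factor=3):
    """True if `factor` times the predicted fleet still fits within the connection limit."""
    ceiling = machines_within_limit(service_connection_limit, connections_per_machine)
    return predicted_machines * factor <= ceiling


# Hypothetical: a logging service accepts 500 connections, each app machine opens 10.
print(machines_within_limit(500, 10))  # 50
print(can_scale_to(15, 500, 10))       # True  (45 machines fit)
print(can_scale_to(20, 500, 10))       # False (60 machines do not)
```

If the check fails, smaller machines make things worse (more of them, more connections), which is one reason machine size and horizontal scalability have to be decided together.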

With this in mind, let’s move on to the next topic.

Does the application framework (.Net, Java, NodeJS) vertically scale well with hardware?

It’s generally taken for granted that if I add more processors, more memory etc., I will get proportionately better performance or at least load-bearing capacity. However, not all application frameworks scale equally well with more hardware. In my experience .Net and Java do magnificently well. My experience with most others is very limited, although I have heard PHP does not do so well (without hooks and tinkering). So before you choose to get an x1.32xlarge machine (128 vCPUs and 1,952 GiB of memory {I want to play Counter-Strike on this one}), make sure the application can actually make use of it.
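One standard way to reason about this (not something specific to any framework, and not a measurement of .Net, Java or PHP) is Amdahl’s law: if only a fraction of the work can run in parallel, extra cores help less and less. A small sketch, with the parallel fractions chosen arbitrarily:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Theoretical speedup on `cores` cores when only
    parallel_fraction (0.0 to 1.0) of the work can run in parallel."""
    serial_fraction = 1 - parallel_fraction
    return 1 / (serial_fraction + parallel_fraction / cores)


for fraction in (0.5, 0.95, 1.0):
    print(fraction, round(amdahl_speedup(fraction, 128), 2))
```

With half of the work stuck on one thread, 128 cores give less than a 2x speedup, so it pays to load test on a mid-sized machine before paying for the biggest one.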

Price

The list obviously cannot be complete without this. Before you write off something as too expensive though, make sure you look at all the pricing strategies that AWS has to offer @ https://aws.amazon.com/ec2/pricing/

Instance families and their traits

AWS has a set of instance families which are focused on different things, either CPU, GPU, memory, disks or even being above average at everything (but not particularly good at anything either).

Look at this for specific capabilities, I will point out the not-so-obvious traits below.

The Cx’s

I started with C3’s, graduated to C4’s and recently read an AWS post about C5’s about to be released. Basically, these are newer generations of the “C” family (If you are an Indian you are probably sniggering, if you are not, forget it).

These are best for APIs, scientific processing, and anything heavily dependent on the speed of individual threads.

The CPU is the best among them all: it has the fastest speed per thread with the fewest errors per 1000 computations, which means they are even faster than you would perceive by simply looking at the age-old GHz value.

Intel AVX and AVX2 capability (better floating point calculations, think scientific calculations, data analytics, 3D modelling, audio/video processing)

They also have low RAM per CPU, which is make or break depending upon your application’s needs. Typically, applications do not use more than a GB or two even under heavy load though. In case you are on .Net, make sure you set machine.config to allow the .Net process at least 80% of the machine’s memory (40% is default), as the machine is just supposed to run your software and Windows, for which 20% is enough.

The T’s

The T series (t2.nano/micro etc.) are basically cheap machines and are apt for lower-than-production environments or anywhere else where machines are not continuously in use. Basically, on these machines you get credits which get exhausted if you go over the baseline performance for the specific machine (the bigger the T machine, the higher the baseline CPU). Once the credits are gone you are stuck at the baseline. In my experience, dev, QA and some batch job machines never really ran out of credits and we saved a lot of money this way.
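Here is a deliberately simplified hourly model of the credit idea, to get a feel for when a machine would hit the baseline. It assumes one credit equals one vCPU at 100% for one minute; the earn rate, starting balance and maximum balance in the example are placeholder inputs that you should replace with the figures AWS publishes for your instance size.

```python
def simulate_credits(hourly_utilization_pct, earn_per_hour, start_balance, max_balance, vcpus=1):
    """Return the credit balance at the end of each hour (simplified model).

    Each hour the instance earns earn_per_hour credits and spends
    utilization% * 60 minutes * vcpus credits. A balance of 0 means the
    instance would be held to its baseline during that hour.
    """
    balance = start_balance
    history = []
    for utilization_pct in hourly_utilization_pct:
        spent = utilization_pct * 60 * vcpus / 100
        balance = min(max_balance, max(0.0, balance + earn_per_hour - spent))
        history.append(balance)
    return history


# Placeholder inputs: 6 credits earned per hour means a 10% baseline on one vCPU.
history = simulate_credits(
    hourly_utilization_pct=[5, 50, 50, 10],
    earn_per_hour=6,
    start_balance=30,
    max_balance=144,
)
print(history)  # [33.0, 9.0, 0.0, 0.0]
```

In this made-up run, two busy hours at 50% burn through the balance, after which the machine just breaks even at its 10% baseline. A dev or QA box that idles most of the day would instead keep topping up.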

To learn in detail how the CPU credit system works, here is an AWS article on the same.

The M’s

These are the general purpose instances with a balance of memory and CPU power. They are best for UI applications which serve dynamic pages (MVC apps), especially when images and JS files come from the same server. A mature application will typically not serve those files off the machines, or will have a CDN in front of them, which means hits for those files will seldom reach the machines, in which case you might be better off with the Cx’s. The M series is powerful otherwise.

The R’s

Memory (RAM) focused machines. Their RAM is actually the fastest, along with the best price-to-memory ratio that you can get in AWS. These machines are typically best for cache stores, databases etc.
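A quick way to compare the C, M and R families is memory per vCPU. The sketch below uses the specs of three ".large" sizes as I understand them at the time of writing; treat them as illustrative and confirm against the current AWS instance type pages before relying on them.

```python
# (vCPUs, memory in GiB) per instance type. Illustrative figures, please verify.
INSTANCE_SPECS = {
    "c4.large": (2, 3.75),
    "m4.large": (2, 8.0),
    "r4.large": (2, 15.25),
}


def gib_per_vcpu(instance_type):
    vcpus, memory_gib = INSTANCE_SPECS[instance_type]
    return memory_gib / vcpus


def candidates(min_gib_per_vcpu):
    """Instance types offering at least min_gib_per_vcpu of memory per vCPU."""
    return sorted(
        name for name in INSTANCE_SPECS if gib_per_vcpu(name) >= min_gib_per_vcpu
    )


for name in INSTANCE_SPECS:
    print(name, gib_per_vcpu(name))

print(candidates(4))  # ['m4.large', 'r4.large']
```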

The I’s

Focused on attached SSDs, these machines are indispensable if your workload is extremely sensitive to disk speed. Attached SSDs give better disk performance than any EBS type you could choose, since EBS volumes are, at the end of the day, basically network drives.

Before choosing this family, do note that attached SSDs are volatile, which means that if you stop and start your instance from the console (reboots are ok) all SSD storage is wiped out. This happens because AWS does not guarantee that you will actually get the same hardware, and more often than not when you shut down and turn your instance back on you get new hardware allocated to you. The major takeaway from volatile SSDs is therefore that the application using them must have a good level of redundancy (replication), such that if an instance is stopped or fails you don’t wind up losing all your data.

Note that instance restarts can happen without you wanting them to. There have been cases where bugs in AWS caused instances to restart, although it has happened to me only once since I started using them about 4 years ago.

Elasticsearch is one example where these instances come in handy.

There are other instances which I have not specifically mentioned here, some because they have nothing special going on and a few which I haven’t delved deep into yet. Let me know in the comments below if you are interested in anything specific.