Sounds simple, but not necessarily done correctly always. In this article I will talk about the following (try not to skip stuff):
- What should be in every log line
- What should you be logging at a high level
- Must have application feature for logging
- Monitoring and alert mechanisms so that you don’t have people staring at the screen overnight.
This is a must do in case you have a highly distributed system like if you are using Microservices architecture (MSA), it will be a nightmare if the 7th service that you are calling returned something unexpected and you have no idea what really happened there. If you are on MSA, do check this blog post out.
What should be in every log line?
Logs have the following fundamental properties which must not be overlooked:
- Categorized and sub-categorized (Error / Critical)
- Request tag to tie in logs (all categories) together with a single request
- Session tag to tie in logs together with a user session
- Time-stamped, all logs must be time-stamped, accurate to milliseconds
- User information
- Server information to determine where the code was running
While the above list is definitely not complete and some compliances (e.g. PCI) might mandate others, the list mentions what you would definetly need regardless of the type of the application you have.
Logs should be clearly marked with categories, this means different things with different types of logs, e.g. Trace Logs should be marked with level of verbosity like chatty, high level business flow (not in these words, hopefully) and exception logs with “Medium” (errors that you can live with, like 3rd party temp failure), “High” (ones that become sev 2’s like incorrect configuration from a client stopping business), “Critical” (ones where all hands are on deck until resolved).
Here is how I had my logs categorized for an application that I (and team) built which are applicable to every app out there:
Points to note in the above categorization:
The sub-categories in the trace and error logs are hierarchical, i.e.
they are basically enums in the code and I decide based on the alert
configurations which error should raise which alert (call me or just
email me etc) and in case of trace logs to control the amount of
verbosity that my application sends to the logging system.
Request & user session tagging
People always wind up going back and adding this for any application getting more than 100 requests a day (I made that number up), to avoid that however, lets find out what that means.
A request tag can be anything, GUIDs are generally good for this, basically you generate a unique identifier for every request and pass that along to other participating services (if you have any) so that when you are looking at your logs you can actually filter down to a single request to see what happened there to either identify a problem or just to see where it takes more time than others.
Session tags are similar to request tags, basically use this identifier to see what the user was doing while she was playing around with your application, this is essential in replicating complicated defects and later on provides an awesome structured data to give you good insights to what your users do in your site to help you improve their experience or even up-sell stuff.
Every log line must contain a timestamp in UTC (you don’t want local server time here, think global deployments). Timestamps moreover should be accurate to milliseconds, which means no pre-initializing the log object at the starting of the request, put it in when the log data is ready to be shipped (and not when it actually ships, think asynchronous). Timestamps not only gives you the obvious context of time to your data, it also helps programmers identify bottlenecks in codes later while identifying slowness not easily replicable by your favorite tool in your IDE.
Enough to help you identify the user making the requests, don’t keep too much repeatable data which would bloat your logs, an id is generally enough.
Put in the servers IP address, machine name (these are generally same for machines spawned from the same image, look to IP for them). This helps in identifying those rare and pesky bugs which creep up on a server. You can use this information to tie in with the system (OS) logs to identify issues.
What should you log?
Let’s get the list done with first and we’ll delve deeper into their uses later:
- Trace logs, small data lines which helps to figure out what happened in a request
- Request / Response blobs
- Server events (OS logs) to determine the health of servers & the services it hosts
The simplest one and the one which most application get wrong.
I define exceptions (in my own words) as:
Exceptions are “errors” or more aptly put “exceptional” scenarios
which your application either did not expect or hoped not to
The first takeaway from the above statement is that exceptions shouldn’t be used as placeholders for logging some information, ever, period. I find that quite a few developers do this because of any or a combination of these factors:
- They don’t know better
- There is no infrastructure to do any other type of logging
- The other (trace) logging is normally turned off to save space and turned on only when there are issues.
The first one generally made by dev’s new in the business can be corrected by code reviews, the others are a result of a bad design and a team which poorly communicates. I say communication here as I cannot imagine a world where an architect would say “no, we will never let you log anything but errors“.
While we all log exceptions (errors) I have seen few applications which categorize them well enough to be actionable. If all your exceptions look the same, how will you prioritize? How will you decide (in an automated fashion) which error is bad enough to wake up your developers (see monitoring mechanisms below) and start looking at it immediately?
See the image at the top to see some categorization I did for an app.
Categorized generally speaking into three categories, namely:
- High level business flow (should always be on)
- Module level flow
- Method level (maximum verbosity)
Let me digress a little bit here and write an opinion:
Every app should have the ability to change the required verbosity of
trace logs via a configuration without need for deployments
The above statement ensures that although normally I wouldn’t load my logging systems with logs like “Entered method .” which nobody would read, I have the power to increase the verbosity of my logs to locate issues in a running system without disrupting it.
Nlog is a great tool in the Microsoft world which can easily be extended to do the above.
Request / Response Blobs
Applicable to API’s (I would avoid it for UI, not much benefit) there is nothing like having the knowledge handy of what exactly was sent out to the client, this rounds up as a first point of reference for those sticky issues where there are no errors but the application behaved in a manner it shouldn’t have.
This includes calls to any third-party API’s as well, if you use a payment provider, you might want to keep these in case of any disputes with that provider, make sure you mask any CC numbers or equivalent before you send these logs out though otherwise, well, boom!
Server events / OS logs
If you are on cloud (if you are not, get on it) and have auto scaling setup (checkout how to save money with AWS auto scaling in this blog post) you will have servers getting destroyed and created all the time. In case if your services stop working in between chances are that your auto-scaling would have already destroyed the server before you get a chance to RDP into it to check what had happened (you can prevent that though by using lifecycle hooks).
I recommend using logstash agents on your servers (I have used it in both windows and linux) to ship these logs as they come so they are always handy.
Monitoring and Alert Mechanisms
Logging all the data that you would need is the definite first step, but alas, it’s of no use really if you do not have a good, solid, reliable, robust, infallible (running out of adjectives) monitoring system in place to make sure you are not dependent on humans to watch for errors or god/science forbid, clients to report them.
Monitoring systems need other monitoring systems (pretty easy, its not overkill) to make sure that the monitoring systems aren’t dead as well, it costs you to set these up, doesn’t have to cost you to maintain them and in fact saves a lot of money as you can get rid of the necessity of 24×7 support (like we did).
Here is an overview of a system that worked for me:
- Elastic Search (No SQL search and analytics DB)
- Kibana (to visualize ES data, has custom dashboards for your viewing pleasure)
- S3 for request / response blob storage, you don’t want large files going into ES, plus it provides good archiving methods.
- Cloudwatch (backup monitoring system)
- AWS Lambda (keep alive data to cloudwatch to make sure Elastic search is alive)
- SNS (to send us mails, invoke pagerduty in case of sev 1’s)
- Pagerduty (to make phone calls to make us jump out of the bed in case of sev 1’s, has escalation policies if the first person doesn’t pick up)
- Log pipeline is another topic, takes care of shipping large amount of logs in proper sources after manipulating them for additional information
To explain the entire monitoring system will be too much for a single post, however, I do plan to cover each of these over time. Let me know by commenting below if there is any topic from the above diagram that you would prefer me to focus on first! Also, subscribe to email alerts in case you are interested in such posts in the future.
UPDATE: You can find the set up guide with introduction to set up Elastic search and Kibana on AWS here.