Taming your Datadog bills

I’ve been using Datadog for a few years now and it’s a pretty great observability tool. The only real drawback is the cost, and to be fair this doesn’t seem unique to Datadog; it seems to come with the observability-platform territory. Because their pricing scales with the request and log volume of your services, costs can spike very, very quickly. This has been pointed out in a few different places, most notably with that $65 million Datadog bill.

Although my own usage (and bills!) were never anywhere near this absurd amount, I’ve learned a few documented-but-not-emphasized things in the process of trying to keep costs both manageable and predictable:

Set up a monitor/alert for overall log volume. Because the total number of log messages is one of the main pricing variables, having even a single (micro)service go into meltdown and start spewing logs can drive your bills up in no time. Start by alerting on a threshold of something like 2-5x the typical daily log volume from the past few days. If this gets noisy, nudge it up a bit. Consider also setting some alerts for "red alert" levels, like 10x, 100x, 1000x log volumes, so that if those ever go off someone can respond quickly.
Filter out any logs that you don’t need at the agent level, not at ingestion/indexing time. Because you are charged for the amount of logs that you ingest, as well as for how long they are retained, you want to stop logs that will never be used before they ever leave the host. The Datadog Agent supports some pretty powerful pattern-matching configuration that does exactly this, and other things as well.
Filter out your healthchecks! For most multi-tier applications there will be a series of healthchecks running to ensure that each service is available and responding. These can typically be dropped from logging (following the step above), and they can be dropped from APM tracing as well.
Don’t be that team that logs PII. Again, use the filtering mentioned above to either remove or redact PII if you have to, but if your services make a habit of logging PII in plaintext, you have some work to do.
Finally, on the topic of alerting/paging, this Datadog blog post still has to be one of the best out there.
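
To make the log-volume tip a bit more concrete, here is a minimal sketch of the arithmetic behind those thresholds. It is not Datadog's API or monitor syntax; the function names, the multipliers, and the sample counts are all made up for illustration. It assumes you can get daily log counts from somewhere (a usage report, an export, whatever you have) and uses the median of the recent days as the baseline, so that one bad day doesn't quietly raise the bar for the next alert.

```python
from statistics import median

# Hypothetical multipliers: one everyday warning level plus the "red alert" levels.
ALERT_MULTIPLIERS = {"warn": 3, "red_10x": 10, "red_100x": 100, "red_1000x": 1000}


def volume_thresholds(daily_counts, multipliers=ALERT_MULTIPLIERS):
    """Return {level: threshold}, scaled from the median of recent daily log counts."""
    if not daily_counts:
        raise ValueError("need at least one day of log counts")
    baseline = median(daily_counts)
    return {level: baseline * factor for level, factor in multipliers.items()}


def triggered_levels(current_count, thresholds):
    """Return the name of every level whose threshold the current count exceeds."""
    return [level for level, limit in thresholds.items() if current_count > limit]


if __name__ == "__main__":
    # Made-up daily log counts for the past five days.
    recent = [1_200_000, 1_100_000, 1_300_000, 1_250_000, 1_150_000]
    thresholds = volume_thresholds(recent)
    print(thresholds)
    print(triggered_levels(15_000_000, thresholds))
```

The resulting numbers are what you would plug into whatever monitors you set up; if the "warn" level turns out to be noisy, bump its multiplier rather than the baseline.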
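
For the filtering tips (unwanted logs, healthchecks, PII), the Agent-side mechanism I have in mind is `log_processing_rules`, with rule types such as `exclude_at_match` and `mask_sequences`; check the current docs for the exact syntax. The sketch below is not the Agent and not its config format. It is a small local imitation of "drop on match, then mask" that is handy for sanity-checking patterns against sample log lines before they go anywhere near production. The patterns, placeholder, and function name are assumptions for illustration, and Python's `re` syntax is not guaranteed to match the regex flavor the Agent uses, so re-test anything fancy.

```python
import re

# Hypothetical patterns; adjust them to your own log formats.
# Lines matching any of these are dropped entirely (healthchecks, probes).
EXCLUDE_PATTERNS = [
    re.compile(r"GET /health(?:z|check)?\b"),
    re.compile(r"kube-probe"),
]

# Sequences matching these are replaced with a placeholder (PII redaction).
MASK_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[REDACTED_EMAIL]"),
]


def process_line(line):
    """Return None if the line would be dropped, else the line with masks applied."""
    if any(pattern.search(line) for pattern in EXCLUDE_PATTERNS):
        return None
    for pattern, placeholder in MASK_PATTERNS:
        line = pattern.sub(placeholder, line)
    return line


if __name__ == "__main__":
    samples = [
        "GET /health HTTP/1.1 200",
        "GET /orders/42 HTTP/1.1 200",
        "signup email=jane.doe@example.com ok",
    ]
    for sample in samples:
        print(repr(process_line(sample)))
```

Running a day's worth of real log lines through something like this also gives a rough idea of how much volume a rule would actually remove.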

I’ll end with a comment on something strange I’ve noticed about Datadog: unlike almost every other dev tool or programming language, it’s nearly impossible to find anything Datadog-related outside of their own docs just by Googling. Like, almost nothing, whether it’s Stack Overflow, blog posts, etc. It’s either the world’s best SEO or not enough people are using it or it’s so intuitive that only I ever get confused…?