
Google Cloud Platform – Costs and Optimizations

Ondřej Kristejn

Google Cloud Platform (GCP) is a suite of cloud computing services that runs on the same infrastructure Google uses internally for its end-user products (Gmail, Google Drive, Google Search, YouTube, etc.). Alongside a set of management tools, it provides a series of modular cloud services including computing, data storage, data analytics, machine learning, web application hosting, infrastructure as a service, serverless computing, and more.

What Is It All About?

That’s all very nice, but how do I work with GCP? It’s easy. You pay and it just works. The rule is that you only pay for what you use. That is a very important thing to remember, because it is a key difference from on-prem solutions (to be honest, you also pay for what you use on-prem, but generally no one cares until there is a shortage of resources). There are reams of books, workshops, and philosophies on how to exist in the cloud and not eat up the entire company’s budget overnight (which can happen). Generally, these fields are grouped under a term called FinOps, which stands for cloud financial operations or cloud financial management. More on FinOps later.

General Recommendations – Best Practices

Know Your Target

One challenge with cost optimization is that there is a myriad of ways to achieve it, but not all of them lead to positive outcomes. It is worth keeping in mind that cloud systems are not cheaper than physical (on-prem) infrastructure (cloud services also run on hardware that eats up space, energy, and maintenance); rather, they provide a better time-to-value ratio (cloud services take care of a lot of things you would otherwise have to handle yourself, e.g. backups, database deployments, high availability, specialized tools, etc.). Cost-cutting thus requires an understanding of what you do and what you need. GCP provides a lot of products, and it’s important to know which of them you actually require.

Key Products (not an exhaustive list):

  • Computing (App Engine, Compute Engine, Kubernetes Engine, …)
  • Storage & Databases (Cloud Storage, Cloud SQL, Bigtable, Spanner, Datastore, …)
  • Networking (load balancing, CDN, DNS, Cloud Armor – firewalls and DDoS protection, …)
  • Big Data (BigQuery, Cloud Dataflow, Datalab, …)
  • Cloud AI (AutoML, Machine Learning Engine, Speech-to-Text, Text-to-Speech, …)
  • Management Tools (monitoring/logging/diagnostics, console, shell, APIs, …)
  • Identity & Security (IAM, SSO, Data Loss Prevention API, resource management, …)
  • API Platform (monetization, analytics, …)

Know Your Needs

Leverage the variable nature of the cloud and on-demand resources. GCP offers many ways to do so.

  • Beware of the unused! As we said: You only pay for what you use. If you use "NVIDIA Tesla P100 Virtual Workstation" for your hello-world app, you will have to pay for it. The easiest way of reducing your GCP bill is to get rid of resources that are not used.

 – Identify idle VMs, apps, jobs, etc. (GCP’s built-in Recommender can flag these for you; see the sketch below)

 – Beware of excess unused storage, RAM, and CPUs
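
A minimal sketch of hunting idle VMs with GCP’s Recommender service, assuming the google-cloud-recommender Python client library; the project and zone names are placeholders:

from google.cloud import recommender_v1

client = recommender_v1.RecommenderClient()
# Built-in recommender that flags Compute Engine VMs that have sat idle.
parent = (
    "projects/my-project/locations/us-central1-a"
    "/recommenders/google.compute.instance.IdleResourceRecommender"
)
for rec in client.list_recommendations(parent=parent):
    # Each recommendation names the idle resource and the projected saving.
    print(rec.description, rec.primary_impact.cost_projection.cost)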

  • Choose the appropriate products: Another significant way to save money is by understanding your application’s needs and behaviour, and the products GCP offers. For example:

 – Preemptible VM instances: These instances are up to 91% cheaper than regular ones. The difference is that they can be stopped by Compute Engine if it needs the resources for other tasks. You can save a lot of money if your apps are fault-tolerant and can withstand possible instance preemptions. A sketch of creating one follows.
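
A minimal sketch of creating a preemptible instance, assuming the google-cloud-compute Python client library; the project, zone, machine type, and image are placeholder values:

from google.cloud import compute_v1

instance = compute_v1.Instance(
    name="batch-worker",
    machine_type="zones/us-central1-a/machineTypes/e2-standard-4",
    # The one line that makes the VM preemptible (and much cheaper):
    scheduling=compute_v1.Scheduling(preemptible=True),
    disks=[
        compute_v1.AttachedDisk(
            boot=True,
            auto_delete=True,
            initialize_params=compute_v1.AttachedDiskInitializeParams(
                source_image="projects/debian-cloud/global/images/family/debian-12"
            ),
        )
    ],
    network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
)
compute_v1.InstancesClient().insert(
    project="my-project", zone="us-central1-a", instance_resource=instance
)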

  • Use discounts: There are two types of discounts in GCP that you can use to save some $$$.

 – Sustained use discounts: Some resources (CPU, RAM, etc.) automatically earn a discount once they run for more than 25% of a month, scaling up to 30% off for resources that run the entire month.

 – Committed use discounts: You can commit to a contract and buy resources in bulk, including CPUs, disks, machines, etc. The contract lasts 1–3 years and can earn you a discount of up to 57% (up to 70% for some machine types). This is ideal for services you plan to use over the long term. Example (hourly prices for the n2d-node-224-896 machine): $8.30 on-demand, $6.70 with a 1-year CUD, $5.01 with a 3-year CUD.
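
A quick back-of-the-envelope check of the quoted prices:

on_demand = 8.30  # $/hour, n2d-node-224-896 on-demand
for label, price in [("1y CUD", 6.70), ("3y CUD", 5.01)]:
    saving = (1 - price / on_demand) * 100
    print(f"{label}: ${price:.2f}/h, {saving:.0f}% below on-demand")
# 1y CUD: $6.70/h, 19% below on-demand
# 3y CUD: $5.01/h, 40% below on-demand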

  • Specify your hardware: Tailoring hardware to your needs with custom machine types can (based on Google internal statistics) save you up to 19% on your machine bills. For mid-sized or large projects, it is almost always more cost-efficient to configure the hardware yourself rather than use predefined machines.
  • Storage: This is a very big topic. Choosing the right database and storage is not an easy task, but paying attention to storage type, utilization, and configuration can result in substantial cost savings.

 – When talking about storage, GCP also provides a service for large dataset migration called the Transfer Appliance. Google will come to your company with a "suitcase" able to hold 1 PB of data. When you fill it with your data (everything inside is encrypted and very secure), they take the suitcase to their data centre, plug it in, and you can migrate your data at a much faster pace.

  • Optimize & Automate: GCP is a very policy-driven environment. There are lots of tools for automating and optimizing your work and resources at your disposal.

 – Autoscaling: You can turn on automatic scaling (CPUs, RAM, instances), which can be a complex thing (it can be scripted, scheduled, etc.). Proper usage can, based on internal Google statistics, save up to 60% of your expenses. The usual use case for autoscaling is tasks with high but infrequent CPU or RAM spikes (unpredictable environments) where you don’t have stable workloads. A related process is called rightsizing: for certain products, GCP will even suggest and recommend optimal resources based on the previous 8 days of usage, and you usually choose between cost-based and performance-based recommendations. A sketch of a simple autoscaler setup follows.
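
A minimal sketch of a CPU-based autoscaler for an existing managed instance group, assuming the google-cloud-compute Python client library; all names and the 60% target are placeholders:

from google.cloud import compute_v1

autoscaler = compute_v1.Autoscaler(
    name="web-autoscaler",
    target="zones/us-central1-a/instanceGroupManagers/web-mig",
    autoscaling_policy=compute_v1.AutoscalingPolicy(
        min_num_replicas=1,   # scale down to one instance when idle
        max_num_replicas=10,  # hard cap so a spike cannot run away
        cpu_utilization=compute_v1.AutoscalingPolicyCpuUtilization(
            utilization_target=0.6  # add instances above ~60% average CPU
        ),
    ),
)
compute_v1.AutoscalersClient().insert(
    project="my-project", zone="us-central1-a", autoscaler_resource=autoscaler
)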

 – Scheduling: Set your services, jobs, and instances to start and stop automatically, typically to shut down dev environments outside working hours; see the sketch below.
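
A minimal sketch of the "stop dev VMs for the night" idea, assuming the google-cloud-compute Python client library; names are placeholders. In practice you would trigger this from Cloud Scheduler, or use Compute Engine’s built-in instance schedules:

from google.cloud import compute_v1

client = compute_v1.InstancesClient()
for name in ["dev-vm-1", "dev-vm-2"]:
    # Stopped instances stop billing for CPU/RAM (attached disks still bill).
    client.stop(project="my-project", zone="us-central1-a", instance=name)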

 – Deduplication: Another common source of waste is duplicate data. Beware of duplicates that accumulate on their own, e.g. via object versioning in storage buckets. Deduplication gets rid of duplicate data and can save significant sums.

 – Lifecycle management: Data in your cloud storage might require different treatment over time. For example, your logs for the last month need to be available in storage buckets, but older data is of no day-to-day use; however, you may need to keep it for audits or compliance obligations. With lifecycle management you can automatically migrate data you no longer access to a cheaper storage class (like Coldline or Archive storage), as the sketch below shows.
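
A minimal sketch of lifecycle rules, assuming the google-cloud-storage Python client library; the bucket name and day thresholds are placeholders:

from google.cloud import storage

bucket = storage.Client().get_bucket("my-log-bucket")
# After 30 days, move objects to the cheaper Coldline storage class...
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=30)
# ...and delete them once the retention period for audits has passed.
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()  # persist the new lifecycle configuration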

 – Quotas: Restrictions save money. With quotas in place, a runaway job cannot allocate more resources than you planned for.

 – Alerts: This is self-explanatory. Wherever you have quotas, you should have an alert to cover them. One special and very important set are budget alerts. There is no built-in kill switch that stops a service when its budget limit has been reached: if you reach your limit, an email will notify you, but nothing else happens and your app keeps eating money. However, there is a solution. You need to set up a budget to monitor your bills, enable budget notifications, and then configure a Cloud Function that calls the Cloud Billing API to disable billing for your project, as in the sketch below.
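
A minimal sketch of that kill switch, following Google’s documented example: a Pub/Sub-triggered Cloud Function that reads the budget notification and, once cost exceeds budget, detaches the billing account. Note this stops all paid services in the project; the project name is a placeholder and the google-api-python-client library is assumed:

import base64
import json

from googleapiclient import discovery

PROJECT_NAME = "projects/my-project"  # placeholder

def stop_billing(event, context):
    # Budget notifications arrive as base64-encoded JSON on a Pub/Sub topic.
    data = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    if data["costAmount"] <= data["budgetAmount"]:
        return  # still under budget, nothing to do

    billing = discovery.build("cloudbilling", "v1", cache_discovery=False)
    # Detaching the billing account disables all paid services in the project.
    billing.projects().updateBillingInfo(
        name=PROJECT_NAME, body={"billingAccountName": ""}
    ).execute()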

 – GCP Price Calculator: Use this tool to estimate your bill for a given setup. It’s pretty robust: you can choose from lots of GCP services and apply a range of settings. It is not a very precise source, though, probably because the GCP environment is so interconnected and dynamic; you most likely won’t be able to capture all the external and internal influences affecting your actual, final bill.

  • Monitoring: Usually, applications are created with monitoring in mind, but GCP also provides many ways to monitor your apps via custom dashboards, logging, alerting, etc. But beware: These tools are VERY expensive. Try to use your own solutions whenever possible.
  • Network: Networking in GCP is very tricky, with all the zones, regions, in and out traffic, inter-machine communications, etc., and it can eat up your bill very quickly, even if it’s not at all apparent. There is a GCP service called "VPC Flow Logs" that you can activate, which is very useful for network pricing optimization and estimates.

 – Ingress: Traffic coming into GCP (to GCP resources). This traffic is free.

 – Egress: Traffic going out of GCP (from GCP resources). You pay for this traffic.

 – There are special charges on top of that for communication between GCP zones and for the type of network tier you use (the Standard Tier carries traffic between your services and your users over the public internet; the Premium Tier uses Google’s premium backbone with more direct, faster, lower-latency routes). Whether your traffic uses external or internal IP addresses also determines whether you are charged.

 – Cloud DNS: You pay not just for the service itself but also for DNS queries (the pricing is aggregated across all your DNS zones).

 – Beware: Two machines communicating inside GCP is billable too; the sending side is charged for egress (the receiving side’s ingress is free).

When Messing Around

You’re not in Kansas anymore (as they say in the US). There will most probably be three types of environments: Dev, Pre-Prod, and Prod. Though we generally know that doing stupid things in production is not a good idea, we happily do stupid things (I mean test things) in our stage environments. That attitude won’t fly in GCP, where you pay for all of it, and where there will of course be attempts to cap your Dev bills. Mess around carelessly and you can eat up all the money on the first day of the month.

Horror Stories

Story 1.

A start-up with great business and lots of nice products was working as expected, serving its customers, and caring for its employees. One day, a new feature was introduced, tested, and worked as expected. It was time to ship it to production. And…

Long story short, the next day when people got to work, there were emails about the "free plan being upgraded", the "100% budget limit reached", and finally a declined credit card. Someone checked the billing dashboard and there was a bill of over $55k, with $600 being added every minute. Panic broke out!

When they finally managed to halt it, they were something over $70k poorer. It was all the worse because they needed to disconnect everything to regain control, halting their other products because of insufficient funding. So one deployment crippled the whole company.

What happened? Recursive behaviour combined with autoscaling defaults. The app was a scraper that scraped a website, saved the data to the DB, and for each link on the page made another scrape call. It was all done via the GCP Cloud Run service, which was feeding itself URLs as scraping targets. The default autoscaling cap for the Cloud Run service was 1,000 instances, so the scraper very effectively and quickly reached this limit, resulting in 33M database records, 116B reads, and 1.8 years of cloud computing time in around 5 hours of uptime. Lowering that cap is a small change, as the sketch below shows.
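
A minimal sketch of lowering the Cloud Run instance cap, assuming the google-cloud-run admin client library; the project, region, and service names are placeholders:

from google.cloud import run_v2

client = run_v2.ServicesClient()
name = "projects/my-project/locations/us-central1/services/scraper"
service = client.get_service(name=name)
# Replace the 1,000-instance default with a cap the budget can survive.
service.template.scaling.max_instance_count = 10
client.update_service(service=service).result()  # wait for the rollout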

It was a catastrophe, but they contacted Google about their mistake. They were excused from the $70k bill, which was nice, but there is no guarantee everyone will receive the same treatment.

Story 2.

A small crowdfunding company had a website with counters showing total donations. These numbers were calculated on the client side (per visitor). Everything was OK until they went viral. Then the big mistake manifested itself as a good old complexity-problem daemon!

What happened? If you have 5 donation types and 100 visitors, that’s 500 database reads. If you have 5 donation types and 100k visitors, you have 500k database reads. But if you have 5k donation types and 100k visitors, that’s 500M database reads (each donation type adds another 100k DB reads).
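
The blow-up in numbers, with reads growing as donation types times visitors:

for types, visitors in [(5, 100), (5, 100_000), (5_000, 100_000)]:
    print(f"{types:>5,} types x {visitors:>7,} visitors = {types * visitors:>11,} reads")
# Output:
#     5 types x     100 visitors =         500 reads
#     5 types x 100,000 visitors =     500,000 reads
# 5,000 types x 100,000 visitors = 500,000,000 reads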

This mistake cost them around $30k for around 40B database reads. Even here, Google was not "evil" and forgave the bill.

Story 3.

The last story is about security. A developer accidentally leaked some private credentials in a public git repository. They were discovered and misused: someone hacked into the company’s GCP cluster and spun up Compute Engine instances (probably for mining). Before anyone noticed, the company was $5k lighter.

Story 4.

Last-last story. This is something I came across on the internet that apparently happens often: loops created by poorly designed GCP triggers. GCP is an event-based environment where you usually chain one action to another. This can lead to some very unfortunate consequences. Say you have a Cloud Function that is triggered by storage uploads; if its output file is then stored back in the same storage bucket, the upload triggers the Cloud Function again and again and again… It sounds stupid, but apparently these kinds of mistakes happen. A sketch of the guard against it follows.
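
A minimal sketch of the loop and its guard, assuming a first-generation Cloud Function triggered by storage object finalization; the bucket and prefix names are placeholders:

from google.cloud import storage

def process_upload(event, context):
    name = event["name"]
    # Without this guard, writing the result back into the same bucket
    # re-triggers the function on its own output, forever.
    if name.startswith("processed/"):
        return

    bucket = storage.Client().bucket(event["bucket"])
    data = bucket.blob(name).download_as_bytes()
    result = data.upper()  # stand-in for the real processing
    bucket.blob(f"processed/{name}").upload_from_string(result)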

FinOps & Conclusion

So back to FinOps. By now it should be clear that working in the cloud effectively means that we developers need to care not only about the environments where we run our apps; the apps themselves should be tailored and written with the cloud in mind to truly leverage the benefits it provides.

FinOps is a field that tries to deal with exactly that. It tries to connect people from development, product, and finance: engineers usually don’t spend a lot of time in the finance department, and finance managers may not understand exactly what they are paying for, so cooperation and communication are key. Many of the recommendations mentioned above are one-time changes and should be considered as part of the initial deployment of new projects and services.

However, with new features and improvements arriving all the time, we could quite easily miss something if we stop paying attention. It’s crucial to build cost reports and review them regularly, and there’s an argument to be made that, as we scale up and use more and more services, this increasingly requires a full-time employee looking for savings.

Author

Ondřej Kristejn

Ondra works as a SW developer in the One-Account team. He loves working (he’s probably a workaholic), computers (technology in general), learning (self-improvement in anything he can), and… hamburgers (or sausages, or pizza, or <insert any unhealthy food>).
