Don Nguyen - Code & Life

Cracking the GCP Billing Code: My Two-Month Odyssey to Solve a Ridiculous Egress Charge

A software developer's frustrating journey to uncover the source of a mysterious and inflated egress charge in Google Cloud Platform (GCP). Learn how BigQuery and data analysis saved the day (and the budget).

5 min read

Let’s face it, nobody enjoys getting unexpected cloud bills. It’s even worse when those bills seem way too high for what you’re actually using.

That’s what happened to me with Google Cloud Platform (GCP). I went on a long and confusing journey to solve a billing problem that lasted for two months, involved many support teams, and made me question if I even understood GCP.

The Case of the Huge Outgoing Traffic: Finding the Mysterious Charge

It all began simply enough. I was looking at my GCP billing reports (like I usually do) when I saw a strange charge: “Network Internet Data Transfer Out from Sydney to Australia.” This SKU belongs to the Compute Engine category, which threw me off at first, because most of my core services run on fully managed Cloud Run. I do have one older service still running on Kubernetes (k8s), though, which is why both Compute Engine and Cloud Run ended up being part of this billing mystery.

The cost of this egress charge had grown a lot recently, and I didn’t know why. My first reaction was confusion and worry. My web service is a relatively small operation, serving about 8,000 users per day on average. At peak times, we don’t handle more than 25 requests per second, and I hadn’t changed anything in my app that would explain such a big increase in outgoing traffic costs. But according to this SKU, we were somehow generating around 220GB of outbound traffic per day! That averages out to more than 27MB for every user, every day. It just didn’t make sense.
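To put those numbers side by side, here is a quick back-of-the-envelope check in Python. It is only a sketch: it assumes "GB" means decimal gigabytes (10^9 bytes), and it treats the peak rate of 25 requests per second as if it were sustained all day, which overstates the real request volume.

```python
# Back-of-the-envelope check of the billed egress against the traffic figures above.
# Assumption: "GB" is decimal (1 GB = 10**9 bytes); the billing report's exact unit may differ.

BILLED_GB_PER_DAY = 220       # outbound traffic per day according to the SKU
USERS_PER_DAY = 8_000         # average daily users
PEAK_REQUESTS_PER_SEC = 25    # peak request rate
SECONDS_PER_DAY = 86_400

billed_bytes = BILLED_GB_PER_DAY * 10**9

# Average data sent per user per day, in megabytes.
mb_per_user = billed_bytes / USERS_PER_DAY / 10**6

# Average sustained throughput over a whole day, in megabits per second.
avg_mbit_per_sec = billed_bytes * 8 / SECONDS_PER_DAY / 10**6

# Average response size, in kilobytes, if the service ran at its peak rate
# around the clock (an upper bound on request volume, not the real traffic).
kb_per_request_at_constant_peak = (
    billed_bytes / (PEAK_REQUESTS_PER_SEC * SECONDS_PER_DAY) / 10**3
)

print(f"Per user per day:          {mb_per_user:.1f} MB")
print(f"Average throughput:        {avg_mbit_per_sec:.2f} Mbit/s")
print(f"Per request at 24/7 peak:  {kb_per_request_at_constant_peak:.1f} KB")
```

Even with the generous assumption of peak load all day long, every single response would have to average around 100 KB to reach the billed volume, and real traffic sits well below peak most of the time.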

Looking at the Metrics: Something Doesn’t Add Up

I started digging into the metrics for my Cloud Run and Compute Engine services, looking at how much data they were sending out. But even at peak times, the numbers were way too low to explain the 220GB per day that the billing report was showing.

The Billing Team Couldn’t Help: “We Don’t Know What’s Wrong”

My first idea was to ask the GCP billing support team for help. Surely, they would know how to explain this strange charge. But my first conversations with the billing team weren’t very helpful. They couldn’t tell me where the charges were coming from and suggested I ask the Compute Engine team for technical support.

Compute Engine’s Wrong Guess: VMs and Numbers That Don’t Match

The Compute Engine team, who are experts in virtual machines and networks, thought the high outgoing traffic was coming from a VM instance or a Google Kubernetes Engine (GKE) node. We spent a lot of time looking at metrics, network logs, and trying to match the billing data with how much I was actually using the resources. But the numbers didn’t make sense. The traffic from my VMs and GKE pods was way less than what the billing charge said.

Cloud Run Gets Blamed (But Says It’s Not Their Fault)

Based on some things they saw in my billing reports, the Compute Engine team started to think that Cloud Run, GCP’s serverless platform, might be the problem. I opened a support case with the Cloud Run team, but they said it wasn’t their fault. “Our numbers are fine,” they said. “We don’t see any high outgoing traffic.”

Video Calls and Growing Frustration: Getting Lost in Support

At this point, the investigation was stuck. It had been two weeks, and I still didn’t have any answers. I was getting frustrated. The support teams suggested video calls to talk more about the problem. We had long conversations, but we still couldn’t figure out what was wrong. I felt like I was being passed around from one team to another, and nobody really understood the whole picture.

The BigQuery Solution: Finding Answers in the Data

One of the support engineers had a good idea: export my detailed billing data to BigQuery, which is GCP’s data warehouse. They said this would let me look at the data more closely and maybe find patterns that weren’t obvious in the regular billing reports.

I exported the data and it didn’t take me long to find the culprit!
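A minimal sketch of the kind of query that can surface this sort of thing (not the exact query used in this investigation). It assumes the detailed usage cost export, whose tables are typically named `gcp_billing_export_resource_v1_<BILLING_ACCOUNT_ID>` and include a `resource.name` field. The table path, the `build_egress_query` helper, and the default keyword and time window are all hypothetical placeholders.

```python
import re

# Hypothetical placeholder: replace with your own project, dataset, and export table.
EXAMPLE_TABLE = "my-project.billing_export.gcp_billing_export_resource_v1_XXXXXX"


def build_egress_query(table: str, sku_keyword: str = "Data Transfer Out", days: int = 30) -> str:
    """Build a BigQuery SQL string that ranks resources by cost for matching SKUs."""
    # Keep the inputs boring so they can be interpolated into the SQL text safely.
    if not re.fullmatch(r"[A-Za-z0-9_.\-]+", table):
        raise ValueError(f"unexpected characters in table name: {table!r}")
    if not re.fullmatch(r"[A-Za-z0-9 ]+", sku_keyword):
        raise ValueError(f"unexpected characters in SKU keyword: {sku_keyword!r}")
    if days <= 0:
        raise ValueError("days must be positive")

    return f"""
SELECT
  DATE(usage_start_time) AS usage_date,
  service.description AS service_name,
  sku.description AS sku_name,
  resource.name AS resource_name,  -- only present in the detailed export
  SUM(usage.amount_in_pricing_units) AS usage_in_pricing_units,
  ANY_VALUE(usage.pricing_unit) AS pricing_unit,
  SUM(cost) AS total_cost
FROM `{table}`
WHERE sku.description LIKE '%{sku_keyword}%'
  AND usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {days} DAY)
GROUP BY usage_date, service_name, sku_name, resource_name
ORDER BY total_cost DESC
LIMIT 50
""".strip()


if __name__ == "__main__":
    # Print the query so it can be pasted into the BigQuery console.
    print(build_egress_query(EXAMPLE_TABLE))
```

The printed SQL can be pasted into the BigQuery console, or passed to `google.cloud.bigquery.Client().query(...)` if you prefer to run it from code. Grouping by resource as well as SKU is the point: the regular billing report stops at the SKU, while the resource-level rows show which specific resource is generating the charge.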

The “Aha” Moment: Unmasking the Load Balancer’s Egress Shenanigans

I quickly discovered that the high outgoing traffic charges were coming from a Load Balancer that I had set up with a Cloud Storage bucket.

It turned out that the Load Balancer was sending a lot of traffic to the Cloud Storage bucket, and this traffic was being counted as outgoing traffic from my Sydney region. None of the standard metrics – not from Cloud Run, Compute Engine, or even the networking tools – provided any visibility into this type of traffic. It was easy to miss! We’ll talk more about how this happened – and the silly mistake I made that caused 220GB of outbound traffic per day for a small service – in a future post. Stay tuned!

Success (and The Most Important Lesson)

The feeling of finally solving the problem was great. I felt relieved, happy, and a little bit like saying “I told you so!” to the support teams (but I didn’t… mostly).

This whole experience taught me that if you have a billing charge you can’t explain, the best thing to do is export your detailed billing data to BigQuery. This lets you slice the data any way you want and find the source of the problem.

Sharing What I Learned (So You Don’t Have to Suffer)

I think sharing my story can help other GCP users avoid the same frustration I went through. I hope this blog post will help you:

  • Watch your bills closely: Look for anything unusual or unexpected in your billing reports.
  • Don’t be afraid to export your billing data to BigQuery: It’s the best way to get to the bottom of strange charges.

Cloud billing can be tricky, but with the right tools and a little detective work, you can manage it and avoid unnecessary stress (and big bills!).

Happy coding (and saving money on your cloud bill)!