Google Cloud Landing Zone Series – Part 12: Monitoring


In the last blog post, we briefly introduced the different logging components and discussed the key points, including an architecture for centralized logging and the collection of audit logs.

In this blog post, we will shift our focus to monitoring, especially the monitoring of a landing zone.

Why Monitoring

So why is monitoring important? Let's recap the most important points:

  • Security: Continuous monitoring helps detect potential security threats and vulnerabilities in real time, enabling timely responses to mitigate risks.
  • Performance: Monitoring helps in tracking the performance of cloud resources, ensuring optimal utilization and identifying underperforming or overutilized resources.
  • Cost Management: Monitoring usage patterns helps in tracking and managing cloud costs, identifying areas of waste, and optimizing expenditure.
  • Operational Efficiency: Automated monitoring tools can trigger alerts for any anomalies, allowing for proactive management of issues.
  • Scalability: Tracking load and resource usage over time helps you plan capacity and scale resources up or down as demand changes.

In conclusion, effective monitoring in a cloud landing zone is essential for maintaining the security, performance, and cost-efficiency of cloud operations. It enables proactive management, ensures compliance, and provides valuable insights for continuous improvement.

Out-of-the-box monitoring and dashboards

If you are starting to build your own dashboards, it is good to know that Google Cloud Monitoring (formerly Stackdriver) already provides several out-of-the-box dashboards. These dashboards offer preconfigured insights and visualizations for various Google Cloud resources. For example, if you go to Monitoring > Dashboards in the Google Cloud console, you will find preconfigured dashboards as well as a dashboard library. Here are some examples of those dashboards:

  • VM Instances Dashboard
    • Provides insights into the performance and health of virtual machine instances, including CPU utilization, memory usage, disk I/O, and network traffic.
  • Google Kubernetes Engine (GKE) Dashboard
    • Offers comprehensive monitoring for GKE clusters, including cluster health, node performance, pod metrics, and resource utilization.
  • Google Cloud Functions Dashboard
    • Monitors the performance and execution of serverless functions, including invocation count, execution times, and error rates.
  • Google Cloud Pub/Sub Dashboard
    • Tracks the performance and usage of Pub/Sub topics and subscriptions, including message delivery, processing latency, and backlog size.
  • Google Cloud BigQuery Dashboard
    • Provides insights into BigQuery performance, including query execution times, slot usage, and job statistics.
  • Google Cloud SQL Dashboard
    • Offers monitoring for Cloud SQL instances, covering metrics like CPU utilization, memory usage, disk space, and connection counts.
  • Google Cloud Storage Dashboard
    • Tracks the usage and performance of Cloud Storage buckets, including storage capacity, request counts, and latency metrics.
  • Google App Engine Dashboard
    • Provides insights into the performance of applications running on App Engine, including request counts, response times, and error rates.
  • Google Cloud Load Balancing Dashboard
    • Monitors the performance of load balancers, including request counts, response times, and backend instance health.

The screenshot of the Google Compute Engine (GCE) VM Instances dashboard shows six charts, each covering the last hour, with one line per VM instance:

  • GCE VM Instance - CPU Utilization
    • CPU utilization in percent. One instance shows a significant spike around 10:15 PM.
  • GCE VM Instance - Uptime
    • Uptime in milliseconds per second (ms/s). All instances stay flat at around 1000 ms/s, indicating stable uptime.
  • Disk Read Operations
    • Rate of disk read operations per second (ops/s), with a significant peak around 10:15 PM that then drops to near zero.
  • Disk Write Operations
    • Rate of disk write operations per second (ops/s), with periodic spikes, the highest around 10:15 PM.
  • Disk Read Bytes
    • Rate of bytes read from disk in kibibytes per second (KiB/s), with a notable peak around 10:15 PM.
  • Disk Write Bytes
    • Rate of bytes written to disk in kibibytes per second (KiB/s), with intermittent spikes, the highest around 10:15 PM.
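
Besides the preconfigured dashboards, you can also define your own dashboards as code, which fits well with a landing zone that is already managed through automation. The following Python sketch generates a dashboard definition in the JSON format of the Cloud Monitoring Dashboards API, with charts similar to the VM dashboard described above. The metric types are standard Compute Engine metrics; the dashboard name, the two-column grid, and the 60-second mean/rate alignment are assumptions chosen for illustration.

```python
import json

# (chart title, Cloud Monitoring metric type, per-series aligner)
# CPU utilization is a gauge, so it is averaged per minute; the other
# metrics are delta counters, so they are plotted as rates per second.
CHARTS = [
    ("CPU Utilization", "compute.googleapis.com/instance/cpu/utilization", "ALIGN_MEAN"),
    ("Uptime", "compute.googleapis.com/instance/uptime", "ALIGN_RATE"),
    ("Disk Read Operations", "compute.googleapis.com/instance/disk/read_ops_count", "ALIGN_RATE"),
    ("Disk Write Operations", "compute.googleapis.com/instance/disk/write_ops_count", "ALIGN_RATE"),
    ("Disk Read Bytes", "compute.googleapis.com/instance/disk/read_bytes_count", "ALIGN_RATE"),
    ("Disk Write Bytes", "compute.googleapis.com/instance/disk/write_bytes_count", "ALIGN_RATE"),
]


def xy_chart_widget(title: str, metric_type: str, aligner: str) -> dict:
    """Return one line-chart widget for a metric of the gce_instance resource."""
    return {
        "title": title,
        "xyChart": {
            "dataSets": [
                {
                    "timeSeriesQuery": {
                        "timeSeriesFilter": {
                            "filter": f'metric.type="{metric_type}" AND resource.type="gce_instance"',
                            "aggregation": {
                                "alignmentPeriod": "60s",
                                "perSeriesAligner": aligner,
                            },
                        }
                    },
                    "plotType": "LINE",
                }
            ]
        },
    }


def build_dashboard(display_name: str) -> dict:
    """Assemble all chart widgets into a two-column grid dashboard."""
    return {
        "displayName": display_name,
        "gridLayout": {
            "columns": "2",
            "widgets": [xy_chart_widget(*chart) for chart in CHARTS],
        },
    }


if __name__ == "__main__":
    # Hypothetical dashboard name; print the JSON so it can be redirected to a file.
    print(json.dumps(build_dashboard("Landing Zone - VM Instances"), indent=2))
```

You could redirect the output to a file, for example vm-dashboard.json, and create the dashboard with gcloud monitoring dashboards create --config-from-file=vm-dashboard.json. Keeping such definitions in version control makes it easy to roll out the same dashboards to every project of the landing zone.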

What components should be monitored in a Landing Zone?

For a landing zone, you should monitor all the key components. The following list gives some examples:

  • Monitoring VPNs (Virtual Private Networks) and Cloud Interconnect is crucial in a cloud landing zone to ensure secure, reliable, and efficient connectivity between on-premises networks and the cloud environment. In detail, you should monitor the connection status, performance metrics, basic traffic patterns, resource utilization, errors and event logs, and redundancy.
  • If you use components such as Serverless VPC Access connectors, you should monitor them as well. As with VPNs, you should check performance metrics and resource utilization.
  • Any network appliances should be monitored as well, as they are critical to the connectivity of the landing zone.
  • In the last blog post, we discussed logging and auditing. If you export those logs, it is useful to set up at least basic monitoring for them as well.
  • Violations of organization policies can also be monitored.
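
For connectivity components, the most basic check is whether the connection is up at all. The following minimal sketch assumes Cloud VPN (not Interconnect) and produces an alert policy in the JSON format of the Cloud Monitoring AlertPolicy API: the condition fires when the vpn.googleapis.com/tunnel_established metric of a vpn_gateway resource stays below 1 for five minutes. The policy name, the five-minute duration, and the notification channel name are hypothetical choices you would adapt to your landing zone.

```python
import json


def vpn_tunnel_alert_policy(notification_channels: list[str]) -> dict:
    """Alert policy that fires when a Cloud VPN tunnel is not established."""
    return {
        "displayName": "Landing Zone - Cloud VPN tunnel down",  # hypothetical name
        "combiner": "OR",
        "conditions": [
            {
                "displayName": "VPN tunnel not established",
                "conditionThreshold": {
                    # tunnel_established reports 1 while a tunnel is up.
                    "filter": 'metric.type="vpn.googleapis.com/tunnel_established" '
                              'AND resource.type="vpn_gateway"',
                    "aggregations": [
                        # A one-minute mean below 1 means the tunnel was down
                        # for at least part of that minute.
                        {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN"}
                    ],
                    "comparison": "COMPARISON_LT",
                    "thresholdValue": 1,
                    "duration": "300s",  # condition must hold for five minutes
                    "trigger": {"count": 1},
                },
            }
        ],
        # Full resource names such as "projects/PROJECT_ID/notificationChannels/CHANNEL_ID".
        "notificationChannels": notification_channels,
    }


if __name__ == "__main__":
    # Without notification channels, incidents are only visible in the console.
    print(json.dumps(vpn_tunnel_alert_policy([]), indent=2))
```

The resulting file could be uploaded with gcloud alpha monitoring policies create --policy-from-file=vpn-alert.json (at the time of writing, this command is available in the alpha and beta gcloud tracks). A threshold condition does not fire if a tunnel stops reporting data entirely, so you may want to add a metric-absence condition to the same policy.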

What tools can be used?

Google Cloud Monitoring is already quite powerful. Nevertheless, you can also use other tools:

  • Cloud Monitoring offers an importer that converts Grafana dashboard files in JSON format into Cloud Monitoring dashboards and can optionally upload them to your Google Cloud project.

With the importer, you can convert and upload Grafana dashboards to Cloud Monitoring in a single operation, or you can perform the conversion and upload steps separately. The second approach is useful if you want to edit the converted dashboards before uploading them.

  • You can also run your own Grafana instance on top of Google Cloud Managed Service for Prometheus (often called Google Managed Prometheus). It is a fully managed monitoring solution that is compatible with the open-source Prometheus monitoring system. It lets you use Prometheus' monitoring and alerting capabilities while benefiting from Google Cloud's scalability, security, and integration with other Google Cloud services.
  • You can also use other visualization and BI tools such as Looker.
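
When Grafana or other Prometheus tooling is connected to Managed Service for Prometheus, queries go through a Prometheus-compatible HTTP API. The sketch below assumes the global query endpoint under monitoring.googleapis.com and a placeholder project ID, and builds an instant PromQL query URL. It also maps a Cloud Monitoring metric type to its PromQL name (the first "/" becomes ":", other special characters become "_"), which should let you query landing-zone metrics such as VPN tunnel status with PromQL. The project ID is hypothetical, and obtaining the access token (for example with gcloud auth print-access-token) is not shown.

```python
import json
import re
import urllib.parse
import urllib.request

# Assumed Prometheus-compatible instant-query endpoint; {project} is a placeholder.
BASE_URL = ("https://monitoring.googleapis.com/v1/projects/{project}"
            "/location/global/prometheus/api/v1/query")


def _sanitize(part: str) -> str:
    """Replace every character that is not a letter, digit or underscore with '_'."""
    return re.sub(r"[^A-Za-z0-9_]", "_", part)


def promql_metric_name(metric_type: str) -> str:
    """Map a Cloud Monitoring metric type to its PromQL metric name."""
    domain, _, path = metric_type.partition("/")  # split at the first "/"
    return f"{_sanitize(domain)}:{_sanitize(path)}"


def build_query_url(project_id: str, promql: str) -> str:
    """Build a URL-encoded instant-query URL for the Prometheus HTTP API."""
    return BASE_URL.format(project=project_id) + "?" + urllib.parse.urlencode({"query": promql})


def run_query(url: str, access_token: str) -> dict:
    """Execute the query with an OAuth 2.0 bearer token and return the JSON body."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {access_token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical example: select VPN tunnels that are currently not established.
    metric = promql_metric_name("vpn.googleapis.com/tunnel_established")
    print(build_query_url("my-landing-zone-project", f"{metric} < 1"))
```

A Grafana Prometheus data source would point at the same base URL (without the /api/v1/query suffix); how authentication is handled there depends on your Grafana setup.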

Author

Dr. Guido Söldner

Managing Director

Guido Söldner is Managing Director and Principal Consultant at Söldner Consult. His areas of expertise include cloud infrastructure, automation and DevOps, Kubernetes, machine learning, and enterprise programming with Spring.