What are Prometheus and Thanos?

A comprehensive guide to Prometheus monitoring and Thanos scalability


Overview

Let’s explore Prometheus and Thanos, two powerful tools for monitoring and scaling metrics collection in cloud-native environments.


What is Prometheus?

Prometheus is an open-source monitoring and alerting system designed for reliability and scalability in dynamic cloud-native environments. It collects and stores time-series metrics with powerful querying capabilities, enabling real-time monitoring, trend analysis, and alerting for modern infrastructure and applications.

Introduction to Prometheus

Prometheus Fundamentals

Developed in 2012 at SoundCloud and now maintained by the Cloud Native Computing Foundation (CNCF), Prometheus has become the de facto standard for monitoring in Kubernetes environments.

Key design principles:

  • Pull-based Architecture: Prometheus actively scrapes metrics from monitored targets
  • Time Series Database: Optimized for storing metrics with timestamps
  • Dimensional Data Model: Uses metric names and key-value pairs (labels) for efficient querying
  • Resource Efficiency: Designed to operate with minimal overhead
  • Operational Simplicity: Single binary deployment with no external dependencies

Prometheus is particularly well-suited for microservices architectures, containerized environments, and dynamic infrastructure where traditional monitoring solutions often struggle.
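The pull model and scrape intervals are configured declaratively. A minimal sketch of a `prometheus.yml` scrape configuration (target addresses and job names are illustrative placeholders):

```yaml
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often alert/recording rules are evaluated

scrape_configs:
  - job_name: "prometheus"  # Prometheus monitoring itself
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node"        # host metrics via Node Exporter
    static_configs:
      - targets: ["node-exporter:9100"]  # placeholder address
```

In Kubernetes environments, `static_configs` would typically be replaced by a service discovery block such as `kubernetes_sd_configs`.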

Core Components and Architecture

```mermaid
graph TD
    A[Monitored Targets] -->|Expose Metrics| B[Prometheus Server]
    B -->|Stores| C[Time Series Database]
    B -->|Evaluates| D[Alert Rules]
    D -->|Sends Alerts| E[Alertmanager]
    E -->|Notifies| F[Notification Channels]
    B -->|Serves| G[PromQL API]
    G -->|Queries| H[Visualization Tools]
    I[Service Discovery] -->|Target Discovery| B
    style A fill:#f5f5f5,stroke:#333,stroke-width:1px
    style B fill:#a5d6a7,stroke:#333,stroke-width:1px
    style C fill:#64b5f6,stroke:#333,stroke-width:1px
    style D fill:#ffcc80,stroke:#333,stroke-width:1px
    style E fill:#ce93d8,stroke:#333,stroke-width:1px
    style F fill:#f5f5f5,stroke:#333,stroke-width:1px
    style G fill:#ef9a9a,stroke:#333,stroke-width:1px
    style H fill:#f5f5f5,stroke:#333,stroke-width:1px
    style I fill:#9fa8da,stroke:#333,stroke-width:1px
```
Component Descriptions
Prometheus Server
  • The core component that scrapes and stores time series data
  • Evaluates alert rules against collected metrics
  • Provides a query API for accessing stored data
  • Manages service discovery to identify targets
Alertmanager
  • Handles alerts sent by the Prometheus server
  • Implements deduplication, grouping, and routing of alerts
  • Integrates with various notification channels (email, Slack, PagerDuty, etc.)
  • Manages silencing and inhibition of alerts
Pushgateway
  • Allows ephemeral and batch jobs to expose metrics
  • Provides a push-based metrics collection mechanism
  • Bridges the gap for services that cannot be scraped directly
Exporters
  • Specialized programs that expose metrics from third-party systems
  • Common examples: Node Exporter (system metrics), MySQL Exporter, Redis Exporter
  • Convert system-specific metrics to Prometheus format
Client Libraries
  • Libraries for instrumenting application code
  • Available for various languages (Go, Python, Java, Ruby, etc.)
  • Enable custom metric collection directly from applications
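The client libraries above ultimately emit the Prometheus text exposition format on a `/metrics` endpoint. As a minimal sketch of what that format looks like, the following standard-library-only Python snippet renders one metric family by hand (in practice you would use the official `prometheus_client` library rather than this hypothetical helper):

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) tuples.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Labels are rendered as key="value" pairs inside braces
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines)

output = render_metric(
    "http_requests_total",
    "Total HTTP requests served.",
    "counter",
    [({"method": "GET", "code": "200"}, 1027),
     ({"method": "POST", "code": "500"}, 3)],
)
print(output)
```

This produces the `# HELP`/`# TYPE` comment lines followed by one sample line per label combination, which is exactly what Prometheus parses when it scrapes a target.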

Key Features and Capabilities

Prometheus Strengths
  • Multidimensional Data Model
    • Time series identified by metric name and key-value pairs (labels)
    • Enables flexible filtering, grouping, and aggregation
    • Compresses samples efficiently across many label combinations (though unbounded label cardinality should still be avoided)
  • PromQL (Prometheus Query Language)
    • Powerful functional query language specifically designed for time series
    • Supports complex mathematical operations and transformations
    • Enables sophisticated aggregations and joins across metrics
    • Built-in functions for rate calculations, histograms, and trend analysis
  • Service Discovery Integration
    • Native support for Kubernetes, Consul, AWS, Azure, GCP, and others
    • Automatically adapts to changing infrastructure
    • Supports both static and dynamic target configuration
  • Alert Management
    • Declarative alert definitions with PromQL expressions
    • Multi-stage alert pipeline with deduplication and grouping
    • Silencing and inhibition capabilities for alert management
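A declarative alert definition pairs a PromQL expression with labels that Alertmanager uses for routing. The following rule file is an illustrative sketch (the metric name, threshold, and labels are examples, not fixed conventions):

```yaml
groups:
  - name: example-alerts
    rules:
      - alert: HighErrorRate
        # PromQL: ratio of 5xx responses to all responses over 5 minutes
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m                # condition must hold for 10m before firing
        labels:
          severity: critical    # used by Alertmanager routing/grouping
        annotations:
          summary: "Error rate above 5% for 10 minutes"
```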



What is Thanos?

Thanos is an open-source project that extends Prometheus capabilities by adding long-term storage, global query view, and high availability features. It enables a unified monitoring system that can scale across multiple clusters and regions while maintaining compatibility with the Prometheus ecosystem.


Introduction to Thanos

Addressing Prometheus Limitations

While Prometheus excels at monitoring individual environments, it faces several limitations in large-scale, distributed deployments:

  • Limited Storage Capacity: Prometheus stores data locally, limiting retention periods
  • High Availability Challenges: Single Prometheus instance represents a potential single point of failure
  • Data Silos: Multiple Prometheus instances create isolated data that cannot be easily queried together
  • Cross-Cluster Visibility: No built-in way to view metrics across multiple clusters

Thanos was designed to address these limitations while maintaining full compatibility with the Prometheus ecosystem, enabling organizations to scale their monitoring infrastructure without sacrificing functionality or requiring a complete architecture redesign.

Core Components and Architecture

```mermaid
graph TD
    A[Prometheus + Sidecar] -->|Uploads Blocks| B[Object Storage]
    A -->|Real-time Query| C[Thanos Querier]
    D[Prometheus + Sidecar] -->|Uploads Blocks| B
    D -->|Real-time Query| C
    B -->|Historical Data| E[Store Gateway]
    E -->|Historical Query| C
    C -->|Deduplicated Results| F[Grafana/API Clients]
    B --> G[Compactor]
    G -->|Compacted Blocks| B
    H[Rule] -->|Alert/Recording Rules| C
    style A fill:#a5d6a7,stroke:#333,stroke-width:1px
    style B fill:#64b5f6,stroke:#333,stroke-width:1px
    style C fill:#ffcc80,stroke:#333,stroke-width:1px
    style D fill:#a5d6a7,stroke:#333,stroke-width:1px
    style E fill:#ce93d8,stroke:#333,stroke-width:1px
    style F fill:#f5f5f5,stroke:#333,stroke-width:1px
    style G fill:#ef9a9a,stroke:#333,stroke-width:1px
    style H fill:#9fa8da,stroke:#333,stroke-width:1px
```
Component Descriptions
Sidecar
  • Runs alongside each Prometheus instance
  • Uploads metrics data to long-term object storage
  • Exposes Prometheus metrics via a common API
  • Enables transparent query integration with Thanos Query
Querier
  • Implements the Prometheus API for querying
  • Aggregates data from multiple sources (Sidecars, Store Gateways)
  • Deduplicates metrics from redundant sources
  • Presents a unified view across all connected Prometheus instances
Store Gateway
  • Accesses metrics in object storage
  • Serves historical data to the Querier
  • Implements intelligent caching for performance optimization
  • Indexes object storage data for efficient queries
Compactor
  • Applies compression and downsampling to stored metrics
  • Optimizes storage usage and query performance
  • Handles data retention policies
  • Ensures efficient long-term storage of metrics
Ruler
  • Evaluates recording and alerting rules
  • Distributes rule evaluation across the cluster
  • Stores rule results in object storage
  • Ensures consistent alerting in distributed environments
Receiver
  • Implements the remote write API
  • Receives data from Prometheus instances
  • Stores received data in object storage
  • Enables a push-based approach to data ingestion
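Several of these components (Sidecar, Store Gateway, Compactor, Receiver) share a common bucket configuration file for object storage access. An illustrative S3-compatible example (bucket name, endpoint, and credentials are placeholders; credentials can also come from the environment or IAM roles):

```yaml
type: S3
config:
  bucket: "thanos-metrics"                # placeholder bucket name
  endpoint: "s3.us-east-1.amazonaws.com"  # placeholder endpoint
  region: "us-east-1"
  access_key: "ACCESS_KEY"                # placeholder credential
  secret_key: "SECRET_KEY"                # placeholder credential
```

The same file format supports `GCS`, `AZURE`, and other providers by changing the `type` and provider-specific `config` keys.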

Key Features and Benefits

Thanos Advantages
  • Global Query View
    • Seamless querying across multiple Prometheus instances
    • Unified view regardless of geographic distribution
    • Deduplication of replicated metrics
    • Cross-cluster and cross-region visibility
  • Unlimited Storage Retention
    • Integration with object storage (S3, GCS, Azure Blob, etc.)
    • Configurable retention periods beyond Prometheus capabilities
    • Cost-efficient long-term metrics storage
    • Downsampling for optimal storage utilization
  • High Availability
    • Redundant Prometheus deployments with deduplication
    • No single point of failure in the architecture
    • Resilience against instance and zone failures
    • Continuous operation during upgrades and maintenance
  • Prometheus Compatibility
    • Maintains compatibility with the Prometheus API
    • Works with existing Prometheus deployments
    • Compatible with Prometheus alerting and recording rules
    • Supports PromQL without modifications



Data Flow in Prometheus and Thanos

Understanding the complete data flow from collection to visualization is essential for implementing and troubleshooting a Prometheus and Thanos deployment. This section details how metrics move through the system and how the various components interact to provide a scalable monitoring solution.

Metrics Collection and Storage Workflow

```mermaid
sequenceDiagram
    participant Target as Monitored Target
    participant Prom as Prometheus Server
    participant Sidecar as Thanos Sidecar
    participant Object as Object Storage
    participant Store as Store Gateway
    participant Query as Thanos Query
    participant User as User/Grafana
    Target->>Prom: Expose metrics endpoint
    Prom->>Target: Scrape metrics
    Prom->>Prom: Process and store locally
    Prom->>Sidecar: Expose local storage
    Note over Prom,Sidecar: Every 2 hours (configurable)
    Sidecar->>Object: Upload blocks
    Note over Query,User: Real-time queries
    User->>Query: Execute PromQL query
    Query->>Prom: Query recent data
    Query->>Store: Query historical data
    Store->>Object: Fetch relevant blocks
    Store->>Query: Return historical data
    Prom->>Query: Return recent data
    Query->>Query: Deduplicate and process
    Query->>User: Display results
```

Component Interaction Details

Detailed Data Flow Process
  1. Metrics Collection
    • Prometheus scrapes metrics from monitored targets at configured intervals
    • Metrics are processed and stored in the local time series database (TSDB)
    • Local storage is organized in 2-hour blocks (configurable)
  2. Data Uploading
    • Thanos Sidecar monitors the Prometheus TSDB directory
    • Completed blocks (typically 2 hours of data) are uploaded to object storage
    • Blocks include both raw samples and metadata
    • Original data remains in Prometheus local storage until its retention period expires
  3. Long-term Storage
    • Object storage acts as the central repository for historical metrics
    • Data is organized in a well-defined structure for efficient access
    • Thanos Compactor periodically processes the stored blocks
    • Compaction and downsampling reduce storage requirements while maintaining data utility
  4. Query Processing
    • Thanos Query receives PromQL queries from users or visualization tools
    • Determines which data sources (Prometheus instances, Store Gateways) to query
    • Distributes the query to relevant sources in parallel
    • Recent data comes directly from Prometheus instances via Sidecars
    • Historical data is retrieved from object storage via Store Gateways
    • Results are deduplicated, merged, and returned to the user
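As a sketch, a Thanos Query instance that fans out to two Sidecars and a Store Gateway, deduplicating on a replica label, might be started like this (the addresses and the `replica` label name are illustrative; the label must match the external label set on the HA Prometheus pair):

```shell
# Fan out to two Sidecars (recent data) and a Store Gateway (historical data),
# deduplicating series from the HA pair on the "replica" external label.
thanos query \
  --http-address=0.0.0.0:10902 \
  --grpc-address=0.0.0.0:10901 \
  --store=prometheus-a-sidecar:10901 \
  --store=prometheus-b-sidecar:10901 \
  --store=store-gateway:10901 \
  --query.replica-label=replica
```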

Data Flow Architecture in Prometheus and Thanos

```mermaid
graph TD
    A[Prometheus] -->|Scrape metrics| B[Local Storage]
    A -->|Real-time query and data push| C[Thanos Sidecar]
    C -->|Query data| D[Thanos Query]
    C -->|Upload blocks| G[Object Storage]
    D -->|Query historical data| E[Thanos Store]
    E -->|Serve historical data| G
    F[Thanos Compactor] -->|Compact and downsample| G
    D -->|Return unified results| H[Users/Dashboards]
    style A fill:#a5d6a7,stroke:#333,stroke-width:1px
    style B fill:#64b5f6,stroke:#333,stroke-width:1px
    style C fill:#ffcc80,stroke:#333,stroke-width:1px
    style D fill:#ce93d8,stroke:#333,stroke-width:1px
    style E fill:#ef9a9a,stroke:#333,stroke-width:1px
    style F fill:#9fa8da,stroke:#333,stroke-width:1px
    style G fill:#80deea,stroke:#333,stroke-width:1px
    style H fill:#f5f5f5,stroke:#333,stroke-width:1px
```



Implementation Considerations

⚠️ Key Deployment Considerations

When implementing a Prometheus and Thanos monitoring solution, consider these critical factors:

Prometheus Deployment Strategies

Consider these architectural patterns for Prometheus deployments:

  • Per-Service Monitoring: Dedicated Prometheus instances for critical services
  • Per-Team Monitoring: Team-managed instances with relevant scrape targets
  • Hierarchical Federation: Local instances with aggregation at higher levels
  • Functional Sharding: Instances specialized by metric type or purpose

Key resource considerations:

  • Memory Requirements: Scale primarily with the number of active series, since each active series keeps its recent samples and index data in memory
  • Storage Requirements: Roughly 1-2 bytes per sample on disk after TSDB compression
  • CPU Scaling: Increases with query complexity and scrape target count
  • Retention Period: Local storage typically set to 15 days or less when using Thanos
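When pairing Prometheus with a Thanos Sidecar, Prometheus is typically started with short local retention and fixed 2-hour blocks, so local compaction is disabled and the Sidecar can ship completed blocks as-is. An illustrative invocation (paths are placeholders):

```shell
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=2h
```

Setting `min-block-duration` equal to `max-block-duration` is what disables local compaction, leaving block management to the Thanos Compactor.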

Thanos Component Sizing

Resource Guidelines for Thanos Components
  • Sidecar
    • CPU: 1-2 cores depending on upload frequency
    • Memory: 512MB-1GB base plus ~100MB per concurrent request
    • Storage: Temporary space for block processing
  • Query
    • CPU: 2-4 cores, scales with query volume
    • Memory: 1-4GB base plus additional memory for concurrent queries
    • Consider horizontal scaling for high-traffic deployments
  • Store Gateway
    • CPU: 2 cores minimum, scales with query volume
    • Memory: 2-8GB depending on index cache size and block count
    • Scale horizontally for large object storage datasets
  • Compactor
    • CPU: 1-2 cores, periodic usage pattern
    • Memory: 8-16GB for large block processing
    • Temporary storage: 3x the size of the largest block

Object Storage Considerations

Object Storage Planning
  • Provider Selection
    • S3: AWS S3, MinIO, Ceph Object Gateway
    • GCS: Google Cloud Storage
    • Azure: Azure Blob Storage
    • Consider cost, performance, and geographic distribution
  • Data Growth Planning
    • Raw metrics growth: ~1-2 bytes per sample
    • Example: 1M active series at 15s scrape interval ≈ 2.5TB per year
    • Downsampling can reduce storage by 10-100x for older data
  • Access Controls
    • Use IAM roles with minimal necessary permissions
    • Consider bucket policies to restrict access
    • Enable encryption for sensitive metrics
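The raw-metrics growth estimate above can be reproduced with simple arithmetic. The sketch below assumes ~1.2 bytes per sample after compression (within the commonly cited 1-2 bytes/sample range; actual ratios vary with metric churn and value patterns):

```python
def yearly_storage_bytes(active_series, scrape_interval_s, bytes_per_sample=1.2):
    """Back-of-the-envelope raw (non-downsampled) storage estimate."""
    samples_per_second = active_series / scrape_interval_s
    seconds_per_year = 365 * 24 * 3600
    return samples_per_second * seconds_per_year * bytes_per_sample

# 1M active series scraped every 15s
tb = yearly_storage_bytes(1_000_000, 15) / 1e12
print(f"~{tb:.1f} TB per year")  # prints "~2.5 TB per year"
```

This matches the ~2.5 TB/year figure cited above; downsampled 5m and 1h resolutions add comparatively little on top of the raw data.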



Future Implementation Guide

In the upcoming post, we'll provide a detailed implementation guide for Prometheus and Thanos in Kubernetes environments using Helm charts, covering configuration, deployment, and maintenance best practices.

Next Steps

Topics for the Implementation Guide
  • Setting up Prometheus and Thanos using Helm Charts
    • Detailed deployment configurations
    • Resource requirements and scaling guidelines
    • Security best practices
  • Configuring Multi-Cluster Metric Collection
    • Cross-cluster communication setup
    • Network considerations and security
    • Unified service discovery
  • Implementing Unified Monitoring Across Clusters
    • Global view configuration
    • Cross-cluster alerting
    • Grafana dashboard setup for unified visualization
  • Performance Tuning and Maintenance
    • Query optimization techniques
    • Resource management strategies
    • Backup and disaster recovery planning



Key Points

💡 Prometheus and Thanos Summary
  • Prometheus Core Features
    - Pull-based metric collection with service discovery
    - Powerful query language (PromQL) for flexible data analysis
    - Integrated alerting and recording rules
    - Designed for dynamic cloud-native environments
  • Thanos Extensions
    - Global query view across multiple Prometheus instances
    - Unlimited metric retention with object storage integration
    - High availability through redundant deployment
    - Downsampling for efficient long-term storage
  • Implementation Considerations
    - Proper resource allocation for each component
    - Strategic deployment across environments
    - Object storage planning for long-term metrics
    - Security and access control implementation


