# Moose / Olap / Schema Optimization Documentation – Python

## Included Files

1. moose/olap/schema-optimization/schema-optimization.mdx

## schema-optimization

Source: moose/olap/schema-optimization/schema-optimization.mdx

# Schema Optimization

Choosing the right data types and column ordering for your tables is crucial for ClickHouse performance and storage efficiency. Poor schema design can lead to 10-100x slower queries and 2-5x larger storage requirements.

## Data Types

Keep the following best practices in mind when defining your column types:

### Avoid Nullable Columns

Nullable columns in ClickHouse have significant performance overhead. Instead of using `| None` or `Optional[type]`, give the column a sensible default, either with a plain Python default value or by annotating the type with `Annotated[type, clickhouse_default("...")]`.

```py filename="AvoidNullable.py"
from datetime import datetime

from moose_lib import OlapTable, OlapTableConfig
from pydantic import BaseModel, Field

# ❌ Bad: Using nullable columns
class BadUserEvent(BaseModel):
    id: str
    user_id: str
    event_type: str
    description: str | None = None  # Nullable
    metadata: dict | None = None    # Nullable
    created_at: datetime

# ✅ Good: Use default values instead
class UserEvent(BaseModel):
    id: str
    user_id: str
    event_type: str
    description: str = ""                          # Default empty string
    metadata: dict = Field(default_factory=dict)   # Default empty dict
    created_at: datetime

user_events_table = OlapTable[UserEvent]("user_events", OlapTableConfig(
    order_by_fields=["id", "created_at"]
))
```

### Use `LowCardinality` where possible

`LowCardinality` is ClickHouse's most efficient string type for columns with limited unique values.
```py filename="LowCardinality.py"
from datetime import datetime
from typing import Annotated, Literal

from moose_lib import OlapTable, OlapTableConfig
from pydantic import BaseModel

class UserEvent(BaseModel):
    id: str
    user_id: str
    event_type: Annotated[str, "LowCardinality"]      # ✅ Good for limited values
    status: Literal["active", "inactive", "pending"]  # ✅ Literals become LowCardinality automatically
    country: Annotated[str, "LowCardinality"]         # ✅ Good for country codes
    user_agent: str                                   # ❌ Keep as String for high cardinality
    created_at: datetime

user_events_table = OlapTable[UserEvent]("user_events", OlapTableConfig(
    order_by_fields=["id", "created_at"]
))
```

### Pick the right Integer types

Choose the smallest integer type that fits your data range to save storage and improve performance.

```py filename="IntegerTypes.py"
from datetime import datetime
from typing import Annotated

from moose_lib import OlapTable
from pydantic import BaseModel

class UserEvent(BaseModel):
    id: str
    user_id: str
    age: Annotated[int, "uint8"]          # ✅ 0 to 255 (1 byte)
    score: Annotated[int, "int16"]        # ✅ -32,768 to 32,767 (2 bytes)
    view_count: Annotated[int, "uint32"]  # ✅ 0 to ~4 billion (4 bytes)
    timestamp: Annotated[int, "int64"]    # ✅ Unix timestamp (8 bytes)
    event_type: str
    created_at: datetime

# Integer type ranges:
# UInt8:  0 to 255
# UInt16: 0 to 65,535
# UInt32: 0 to 4,294,967,295
# UInt64: 0 to 18,446,744,073,709,551,615
# Int8:   -128 to 127
# Int16:  -32,768 to 32,767
# Int32:  -2,147,483,648 to 2,147,483,647
# Int64:  -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
```

### Use the right precision for `DateTime`

Choose appropriate DateTime precision based on your use case to balance storage and precision.
```py filename="DateTimePrecision.py"
from datetime import datetime
from typing import Annotated

from moose_lib import OlapTable, OlapTableConfig, clickhouse_datetime
from pydantic import BaseModel

class UserEvent(BaseModel):
    id: str
    user_id: str
    event_type: str
    created_at: datetime                                       # ✅ Second precision (default)
    updated_at: Annotated[datetime, clickhouse_datetime(3)]    # ✅ Millisecond precision
    processed_at: Annotated[datetime, clickhouse_datetime(6)]  # ✅ Microsecond precision
    logged_at: Annotated[datetime, clickhouse_datetime(9)]     # ✅ Nanosecond precision

user_events_table = OlapTable[UserEvent]("user_events", OlapTableConfig(
    order_by_fields=["id", "created_at"]
))
```

### Use Decimal over Float

Use `Decimal` for financial and precise calculations to avoid floating-point precision issues.

```py filename="DecimalOverFloat.py"
from datetime import datetime
from typing import Annotated

from moose_lib import OlapTable, OlapTableConfig, clickhouse_decimal
from pydantic import BaseModel

class Order(BaseModel):
    id: str
    user_id: str
    amount: Annotated[float, clickhouse_decimal(10, 2)]   # ✅ 10 total digits, 2 decimal places
    tax: Annotated[float, clickhouse_decimal(8, 2)]       # ✅ 8 total digits, 2 decimal places
    discount: Annotated[float, clickhouse_decimal(5, 2)]  # ✅ 5 total digits, 2 decimal places
    total: Annotated[float, clickhouse_decimal(12, 2)]    # ✅ 12 total digits, 2 decimal places
    created_at: datetime

# ❌ Bad: Using float for financial data
class BadOrder(BaseModel):
    id: str
    amount: float  # Float - can cause precision issues
    tax: float     # Float - can cause precision issues

orders_table = OlapTable[Order]("orders", OlapTableConfig(
    order_by_fields=["id", "created_at"]
))
```

### Use `NamedTuple` over `Nested`

`NamedTuple` is more efficient than `Nested` for structured data in ClickHouse.
```py filename="NamedTupleOverNested.py"
from datetime import datetime
from typing import Annotated

from moose_lib import OlapTable, OlapTableConfig
from pydantic import BaseModel

class Location(BaseModel):
    latitude: float
    longitude: float
    city: str
    country: str

class Metadata(BaseModel):
    version: str
    source: str
    priority: int

class UserEvent(BaseModel):
    id: str
    user_id: str
    event_type: str
    location: Annotated[Location, "ClickHouseNamedTuple"]  # lat, lon, city, country
    metadata: Annotated[Metadata, "ClickHouseNamedTuple"]  # version, source, priority
    created_at: datetime

# ❌ Bad: Using nested objects (less efficient)
class BadUserEvent(BaseModel):
    id: str
    location: Location  # Nested - less efficient
    metadata: Metadata  # Nested - less efficient

user_events_table = OlapTable[UserEvent]("user_events", OlapTableConfig(
    order_by_fields=["id", "created_at"]
))
```

## Ordering

### Choose columns that you will use in WHERE and GROUP BY clauses

Optimize your `order_by_fields` (or `order_by_expression`) for your most common query patterns.

```py filename="OrderByOptimization.py"
from datetime import datetime

from moose_lib import OlapTable, OlapTableConfig
from pydantic import BaseModel

class UserEvent(BaseModel):
    id: str
    user_id: str
    event_type: str
    status: str
    created_at: datetime
    country: str

# ✅ Good: Optimized for common query patterns
user_events_table = OlapTable[UserEvent]("user_events", OlapTableConfig(
    order_by_fields=["user_id", "created_at", "event_type"]  # Most common filters first
))

# Common queries this optimizes for:
# - WHERE user_id = ? AND created_at > ?
# - WHERE user_id = ? AND event_type = ?
# - GROUP BY user_id, event_type
```

### `ORDER BY` should prioritize LowCardinality columns first

Place `LowCardinality` columns first in your `order_by_fields` (or reflect this priority in your `order_by_expression`) for better compression and query performance.
```py filename="LowCardinalityOrdering.py"
from datetime import datetime
from typing import Annotated

from moose_lib import OlapTable, OlapTableConfig
from pydantic import BaseModel

class UserEvent(BaseModel):
    id: str
    user_id: str
    event_type: Annotated[str, "LowCardinality"]  # ✅ Low cardinality
    status: Annotated[str, "LowCardinality"]      # ✅ Low cardinality
    country: Annotated[str, "LowCardinality"]     # ✅ Low cardinality
    created_at: datetime                          # High cardinality
    session_id: str                               # High cardinality

# ✅ Good: LowCardinality columns first
user_events_table = OlapTable[UserEvent]("user_events", OlapTableConfig(
    order_by_fields=["event_type", "status", "country", "created_at", "session_id"]
))

# ❌ Bad: High cardinality columns first
bad_user_events_table = OlapTable[UserEvent]("bad_user_events", OlapTableConfig(
    order_by_fields=["created_at", "session_id", "event_type", "status"]  # Less efficient
))
```
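When it isn't obvious which columns are low cardinality, measuring a sample of real data can settle it. The sketch below is a hypothetical helper (neither `distinct_ratio` nor `rank_order_by_candidates` is part of moose_lib): it ranks columns by the fraction of distinct values in a sample, so the lowest-cardinality columns surface first as sort-key candidates. As a rule of thumb from the ClickHouse docs, `LowCardinality` typically pays off when a column stays under roughly 10,000 distinct values.

```python
def distinct_ratio(sample: list) -> float:
    """Fraction of unique values in a sample; lower means lower cardinality."""
    return len(set(sample)) / max(len(sample), 1)

def rank_order_by_candidates(columns: dict[str, list]) -> list[str]:
    """Sort column names so the lowest-cardinality columns come first."""
    return sorted(columns, key=lambda name: distinct_ratio(columns[name]))

# Sampled values per column (illustrative data)
events = {
    "event_type": ["click", "view", "click", "view", "purchase"],
    "session_id": ["s1", "s2", "s3", "s4", "s5"],
    "country":    ["US", "US", "DE", "US", "DE"],
}

print(rank_order_by_candidates(events))
# ['country', 'event_type', 'session_id'] — lowest-cardinality columns first
```

The resulting order is exactly the shape you want for `order_by_fields`: `country` (2 distinct values) ahead of `event_type` (3), with the unique-per-row `session_id` last.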