# Moose / Olap / Model Table Documentation – Python ## Included Files 1. moose/olap/model-table/model-table.mdx ## Modeling Tables Source: moose/olap/model-table/model-table.mdx Model your database schema in code using native TypeScript/Python typing # Modeling Tables ## Overview Tables in Moose let you define your database schema entirely in code using native TypeScript/Python typing. You can integrate tables into your pipelines as destinations for new data or as sources for analytics queries in your downstream transformations, APIs, and more. ```py filename="FirstTable.py" copy from pydantic import BaseModel from moose_lib import Key, OlapTable from pydantic import BaseModel class MyFirstTable(BaseModel): id: Key[str] name: str age: int # Create a table named "first_table" my_table = OlapTable[MyFirstTable]("first_table") # No export needed - Python modules are automatically discovered ``` ## Basic Usage ### Standalone Tables Create a table directly for custom data flows or when you need fine-grained control: ```py filename="StandaloneTable.py" from moose_lib import Key, OlapTable from pydantic import BaseModel class ExampleSchema(BaseModel): id: Key[str] date_field: Date numeric_field: float boolean_field: bool # Create a standalone table named "example_table" from moose_lib import OlapTable, OlapConfig from moose_lib.blocks import ReplacingMergeTreeEngine example_table = OlapTable[ExampleSchema]("example_table", OlapConfig( order_by_fields=["id", "date_field"], engine=ReplacingMergeTreeEngine() )) ``` ### Creating Tables in Ingestion Pipelines For end-to-end data flows, create tables as part of an ingestion pipeline: ```py filename="PipelineTable.py" from moose_lib import IngestPipeline, Key, OlapTable from pydantic import BaseModel class UserEvent(BaseModel): id: Key[str] user_id: str timestamp: Date event_type: str from moose_lib import IngestPipeline, IngestPipelineConfig, OlapConfig from moose_lib.blocks import ReplacingMergeTreeEngine events_pipeline = IngestPipeline[UserEvent]("user_events", IngestPipelineConfig( ingest_api=True, # Creates a REST API endpoint at POST localhost:4000/ingest/user_events stream=True, # Creates a Kafka/Redpanda topic table=OlapConfig( # Creates and configures the table named "user_events" order_by_fields=["id", "timestamp"], engine=ReplacingMergeTreeEngine() ) )) # Access the table component when needed: events_table = events_pipeline.get_table() ``` ## Data Modeling ### Special ClickHouse Types (LowCardinality, Nullable, etc) ```py filename="ClickHouseTypes.py" copy from moose_lib import Key, clickhouse_decimal, ClickHouseNamedTuple from typing import Annotated from pydantic import BaseModel from datetime import datetime class Customer(BaseModel): name: str address: str class Order(BaseModel): order_id: Key[str] amount: clickhouse_decimal(10, 2) status: Literal["Paid", "Shipped", "Delivered"] # translated to LowCardinality(String) in ClickHouse created_at: datetime customer: Annotated[Customer, "ClickHouseNamedTuple"] ``` ### Default values Use defaults instead of nullable columns to keep queries fast and schemas simple. You can specify defaults at the column level so Moose generates ClickHouse defaults in your table DDL. ```py filename="Defaults.py" copy from typing import Annotated from pydantic import BaseModel from moose_lib import OlapTable, Key, clickhouse_default, clickhouse_decimal from datetime import datetime class Event(BaseModel): id: Key[str] # Static defaults status: Annotated[str, clickhouse_default("'pending'")] # DEFAULT 'pending' retries: Annotated[int, clickhouse_default("0")] # DEFAULT 0 # Server-side timestamps created_at: Annotated[datetime, clickhouse_default("now()")] # Decimal with default amount: Annotated[float, clickhouse_decimal(10, 2)] = 0 events = OlapTable[Event]("events", { "orderByFields": ["id", "created_at"], }) ``` The value passed into the `clickhouse_default` function can either be a string literal or a stringified ClickHouse SQL expression. If a field is optional in your app model but you provide a ClickHouse default, Moose infers a non-nullable ClickHouse column with a DEFAULT clause. - Optional without default → ClickHouse Nullable type. - Optional with default (using `clickhouse_default("18")` in annotations) → non-nullable column with default `18`. This lets you keep optional fields at the application layer while avoiding Nullable columns in ClickHouse when a server-side default exists. ### Database Selection By default, tables are created in the database specified in your `moose.config.toml` ClickHouse configuration. You can override this on a per-table basis using the `database` field: ```py filename="DatabaseOverride.py" copy from moose_lib import Key, OlapTable, OlapConfig from pydantic import BaseModel class UserData(BaseModel): id: Key[str] name: str email: str # Table in default database (from moose.config.toml) default_table = OlapTable[UserData]("users") # Table in specific database (e.g., "analytics") analytics_table = OlapTable[UserData]("users", OlapConfig( database="analytics", order_by_fields=["id"] )) ``` To use custom databases, configure them in your `moose.config.toml`: ```toml [clickhouse_config] db_name = "local" additional_databases = ["analytics", "staging"] ``` The databases in `additional_databases` will be created automatically when you start your Moose application. ### Primary Keys and Sorting You must configure table indexing using one of these approaches: 1. Define at least one `Key` in your table schema 2. Specify `orderByFields` in the table config 3. Use both (all `Key` fields must come first in the `orderByFields` array) ```py filename="PrimaryKeyConfig.py" copy from moose_lib import Key, OlapTable from pydantic import BaseModel class Record1(BaseModel): id: Key[str] # Primary key field field1: str field2: int table1 = OlapTable[Record1]("table1") # id is the primary key ``` ### Order By Fields Only Leverage the `OlapConfig` class to configure your table: ```py filename="OrderByFieldsOnly.py" copy from moose_lib import Key, OlapTable, OlapConfig from pydantic import BaseModel from datetime import datetime class SchemaWithoutPrimaryKey(BaseModel): field1: str field2: int field3: datetime table2 = OlapTable[SchemaWithoutPrimaryKey]("table2", OlapConfig( order_by_fields=["field1", "field2"] # Specify ordering without primary key )) ``` ### Order By Expression Use a ClickHouse SQL expression to control ordering directly. This is useful for advanced patterns (functions, transformations) or when you want to disable sorting entirely. ```py filename="OrderByExpression.py" copy from moose_lib import OlapTable, OlapConfig from pydantic import BaseModel from datetime import datetime class Events(BaseModel): user_id: str created_at: datetime event_type: str # Equivalent to order_by_fields=["user_id", "created_at", "event_type"] events = OlapTable[Events]("events", OlapConfig( order_by_expression="(user_id, created_at, event_type)", )) # Advanced: functions inside expression events_by_month = OlapTable[Events]("events_by_month", OlapConfig( order_by_expression="(user_id, toYYYYMM(created_at))", )) # No sorting unsorted = OlapTable[Events]("events_unsorted", OlapConfig( order_by_expression="tuple()", )) ``` ### Using Both Primary Key and Order By Fields ```py filename="ComboKeyAndOrderByFields.py" copy from moose_lib import Key, OlapTable, OlapConfig from pydantic import BaseModel class SchemaWithKey(BaseModel): id: Key[str] field1: str field2: int table3 = OlapTable[SchemaWithKey]("table3", OlapConfig( order_by_fields=["id", "field1"] # Primary key must be first )) ``` ### Using Multiple Primary Keys ```py filename="MultiKeyTable.py" copy from moose_lib import Key, OlapTable, OlapConfig from pydantic import BaseModel class MultiKeyRecord(BaseModel): key1: Key[str] key2: Key[int] field1: str multi_key_table = OlapTable[MultiKeyRecord]("multi_key_table", OlapConfig( order_by_fields=["key1", "key2", "field1"] # Multiple keys must come first )) ``` ### Table engines By default, Moose will create tables with the `MergeTree` engine. You can use different engines by setting the `engine` in the table configuration. ```py filename="TableEngine.py" copy from moose_lib import OlapTable, OlapConfig from moose_lib.blocks import MergeTreeEngine, ReplacingMergeTreeEngine # Default MergeTree engine table = OlapTable[Record]("table", OlapConfig( order_by_fields=["id"] )) # Explicitly specify engine dedup_table = OlapTable[Record]("table", OlapConfig( order_by_fields=["id"], engine=ReplacingMergeTreeEngine() )) ``` #### Deduplication (`ReplacingMergeTree`) Use the `ReplacingMergeTree` engine to keep only the latest record for your designated sort key: ```py filename="DeduplicatedTable.py" copy from moose_lib import Key, OlapTable, OlapConfig from moose_lib.blocks import ReplacingMergeTreeEngine class Record(BaseModel): id: Key[str] updated_at: str # Version column deleted: int = 0 # Soft delete marker (UInt8) # Basic deduplication table = OlapTable[Record]("table", OlapConfig( order_by_fields=["id"], engine=ReplacingMergeTreeEngine() )) # With version column (keeps record with highest version) versioned_table = OlapTable[Record]("table", OlapConfig( order_by_fields=["id"], engine=ReplacingMergeTreeEngine(ver="updated_at") )) # With soft deletes (requires ver parameter) soft_delete_table = OlapTable[Record]("table", OlapConfig( order_by_fields=["id"], engine=ReplacingMergeTreeEngine( ver="updated_at", is_deleted="deleted" # UInt8 column: 1 marks row for deletion ) )) ``` ClickHouse's ReplacingMergeTree engine runs deduplication in the background AFTER data is inserted into the table. This means that duplicate records may not be removed immediately. **Version Column (`ver`)**: When specified, ClickHouse keeps the row with the maximum version value for each unique sort key. **Soft Deletes (`is_deleted`)**: When specified along with `ver`, rows where this column equals 1 are deleted during merges. This column must be UInt8 type. For more details, see the [ClickHouse documentation](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree). #### Streaming from S3 (`S3Queue`) Use the `S3Queue` engine to automatically ingest data from S3 buckets as files are added: S3Queue tables only process **new files** added to S3 after table creation. When used as a source for materialized views, **no backfill occurs** - the MV will only start populating as new files arrive. See the [Materialized Views documentation](./model-materialized-view#backfill-destination-tables) for more details. ```py filename="S3StreamingTable.py" copy from moose_lib import OlapTable, OlapConfig from moose_lib.blocks import S3QueueEngine class S3Event(BaseModel): id: str timestamp: datetime data: dict # Modern API using engine configuration s3_events = OlapTable[S3Event]("s3_events", OlapConfig( engine=S3QueueEngine( s3_path="s3://my-bucket/data/*.json", format="JSONEachRow", # ⚠️ WARNING: See security callout below about credentials aws_access_key_id="AKIA...", aws_secret_access_key="secret..." ), settings={ "mode": "unordered", "keeper_path": "/clickhouse/s3queue/events" } )) ``` **Security Risk**: Hardcoding credentials in your code embeds them in Docker images and deployment artifacts, creating serious security vulnerabilities. **Solution**: Use `mooseRuntimeEnv` for runtime credential resolution: ```py filename="SecureS3Streaming.py" copy from moose_lib import OlapTable, OlapConfig, moose_runtime_env from moose_lib.blocks import S3QueueEngine # ✅ RECOMMENDED: Runtime environment variable resolution secure_s3_events = OlapTable[S3Event]("s3_events", OlapConfig( engine=S3QueueEngine( s3_path="s3://my-bucket/data/*.json", format="JSONEachRow", aws_access_key_id=moose_runtime_env.get("AWS_ACCESS_KEY_ID"), aws_secret_access_key=moose_runtime_env.get("AWS_SECRET_ACCESS_KEY") ), settings={ "mode": "unordered", "keeper_path": "/clickhouse/s3queue/events" } )) ``` **Then set environment variables:** ```bash filename="Terminal" copy export AWS_ACCESS_KEY_ID="AKIA..." export AWS_SECRET_ACCESS_KEY="your-secret-key" moose prod up ``` **Benefits:** - Credentials never embedded in Docker images - Supports credential rotation (changing passwords triggers table recreation) - Different credentials per environment (dev/staging/prod) - Clear error messages if environment variables are missing S3Queue requires ClickHouse 24.7+ and proper ZooKeeper/ClickHouse Keeper configuration for coordination between replicas. Files are processed exactly once across all replicas. #### Direct S3 Access (`S3`) Use the `S3` engine for direct read/write access to S3 storage without streaming semantics: ```py filename="S3Table.py" copy from moose_lib import OlapTable, OlapConfig, moose_runtime_env from moose_lib.blocks import S3Engine # S3 table with credentials (recommended with moose_runtime_env) s3_data = OlapTable[DataRecord]("s3_data", OlapConfig( engine=S3Engine( path="s3://my-bucket/data/file.json", format="JSONEachRow", aws_access_key_id=moose_runtime_env.get("AWS_ACCESS_KEY_ID"), aws_secret_access_key=moose_runtime_env.get("AWS_SECRET_ACCESS_KEY"), compression="gzip" ) )) # Public S3 bucket (no authentication needed - just omit credentials) public_s3 = OlapTable[DataRecord]("public_s3", OlapConfig( engine=S3Engine( path="s3://public-bucket/data/*.parquet", format="Parquet" ) )) ``` - **S3**: Direct read/write access to S3 files. Use for batch processing or querying static data. - **S3Queue**: Streaming engine that automatically processes new files as they arrive. Use for continuous data ingestion. Both engines support the same credential management and format options. #### In-Memory Buffer (`Buffer`) The `Buffer` engine provides an in-memory buffer that flushes data to a destination table based on time, row count, or size thresholds: ```py filename="BufferTable.py" copy from moose_lib import OlapTable, OlapConfig from moose_lib.blocks import MergeTreeEngine, BufferEngine # First create the destination table destination_table = OlapTable[Record]("destination", OlapConfig( engine=MergeTreeEngine(), order_by_fields=["id", "timestamp"] )) # Then create buffer that writes to it buffer_table = OlapTable[Record]("buffer", OlapConfig( engine=BufferEngine( target_database="local", target_table="destination", num_layers=16, min_time=10, # Min 10 seconds before flush max_time=100, # Max 100 seconds before flush min_rows=10000, # Min 10k rows before flush max_rows=1000000, # Max 1M rows before flush min_bytes=10485760, # Min 10MB before flush max_bytes=104857600 # Max 100MB before flush ) )) ``` - Data in buffer is **lost if server crashes** before flush - Not suitable for critical data that must be durable - Best for high-throughput scenarios where minor data loss is acceptable - Buffer and destination table must have identical schemas - Cannot use `orderByFields`, `partitionBy`, or `sampleByExpression` on buffer tables For more details, see the [ClickHouse Buffer documentation](https://clickhouse.com/docs/en/engines/table-engines/special/buffer). #### Distributed Tables (`Distributed`) The `Distributed` engine creates a distributed table across a ClickHouse cluster for horizontal scaling: ```py filename="DistributedTable.py" copy from moose_lib import OlapTable, OlapConfig from moose_lib.blocks import DistributedEngine # Distributed table across cluster distributed_table = OlapTable[Record]("distributed_data", OlapConfig( engine=DistributedEngine( cluster="my_cluster", target_database="default", target_table="local_table", sharding_key="cityHash64(id)" # Optional: how to distribute data ) )) ``` - Requires a configured ClickHouse cluster with remote_servers configuration - The local table must exist on all cluster nodes - Distributed tables are virtual - data is stored in local tables - Cannot use `orderByFields`, `partitionBy`, or `sampleByExpression` on distributed tables - The `cluster` name must match a cluster defined in your ClickHouse configuration For more details, see the [ClickHouse Distributed documentation](https://clickhouse.com/docs/en/engines/table-engines/special/distributed). #### Replicated Engines Replicated engines provide high availability and data replication across multiple ClickHouse nodes. Moose supports all standard replicated MergeTree variants: - `ReplicatedMergeTree` - Replicated version of MergeTree - `ReplicatedReplacingMergeTree` - Replicated with deduplication - `ReplicatedAggregatingMergeTree` - Replicated with aggregation - `ReplicatedSummingMergeTree` - Replicated with summation ```py filename="ReplicatedEngines.py" copy from moose_lib import OlapTable, OlapConfig from moose_lib.blocks import ( ReplicatedMergeTreeEngine, ReplicatedReplacingMergeTreeEngine, ReplicatedAggregatingMergeTreeEngine, ReplicatedSummingMergeTreeEngine ) class Record(BaseModel): id: Key[str] updated_at: datetime deleted: int = 0 # Basic replicated table with explicit paths replicated_table = OlapTable[Record]("records", OlapConfig( order_by_fields=["id"], engine=ReplicatedMergeTreeEngine( keeper_path="/clickhouse/tables/{database}/{shard}/records", replica_name="{replica}" ) )) # Replicated with deduplication replicated_dedup = OlapTable[Record]("dedup_records", OlapConfig( order_by_fields=["id"], engine=ReplicatedReplacingMergeTreeEngine( keeper_path="/clickhouse/tables/{database}/{shard}/dedup_records", replica_name="{replica}", ver="updated_at", is_deleted="deleted" ) )) # For ClickHouse Cloud or Boreal (no parameters needed) cloud_replicated = OlapTable[Record]("cloud_records", OlapConfig( order_by_fields=["id"], engine=ReplicatedMergeTreeEngine() )) ``` The `keeper_path` and `replica_name` parameters are **optional** for replicated engines: - **Omit both parameters** (recommended): Moose uses smart defaults that work in both ClickHouse Cloud and self-managed environments. The default path pattern `/clickhouse/tables/{uuid}/{shard}` with replica `{replica}` works automatically with Atomic databases (default in modern ClickHouse). - **Provide custom paths**: You can still specify both parameters explicitly if you need custom replication paths for your self-managed cluster. **Note**: Both parameters must be provided together, or both omitted. The `{uuid}`, `{shard}`, and `{replica}` macros are automatically substituted by ClickHouse at runtime. For more details, see the [ClickHouse documentation on data replication](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replication). ### Irregular column names and Python Aliases If a ClickHouse column name isn't a valid Python identifier or starts with an underscore, you can use a safe Python field name and set a Pydantic alias to the real column name. MooseOLAP then uses the alias for ClickHouse DDL and data mapping, so your model remains valid while preserving the true column name. ```python from pydantic import BaseModel, Field class CHUser(BaseModel): # ClickHouse: "_id" → safe Python attribute with alias UNDERSCORE_PREFIXED_id: str = Field(alias="_id") # ClickHouse: "user name" → replace spaces, keep alias user_name: str = Field(alias="user name") ``` ## Externally Managed Tables If you have a table that is managed by an external system (e.g Change Data Capture like ClickPipes), you can still use Moose to query it. You can set the config in the table config to set the lifecycle to `EXTERNALLY_MANAGED`. ```py filename="ExternallyManagedTable.py" copy from moose_lib import OlapTable, OlapConfig, LifeCycle # Table managed by external system external_table = OlapTable[UserData]("external_users", OlapConfig( order_by_fields=["id", "timestamp"], life_cycle=LifeCycle.EXTERNALLY_MANAGED # Moose won't create or modify this table in prod mode )) ``` Learn more about the different lifecycle options and how to use them in the [LifeCycle Management](/stack/olap/lifecycle) documentation. ## Invalid Configurations ```py filename="InvalidConfig.py" copy from moose_lib import Key, OlapTable, OlapConfig from typing import Optional class BadRecord1(BaseModel): field1: str field2: int bad_table1 = OlapTable[BadRecord1]("bad_table1") ## No primary key or orderByFields class BadRecord2(BaseModel): id: Key[str] field1: str bad_table2 = OlapTable[BadRecord2]("bad_table2", OlapConfig( order_by_fields=["field1", "id"] # Wrong order - primary key must be first )) class BadRecord3(BaseModel): id: Key[str] field1: str field2: Optional[int] bad_table3 = OlapTable[BadRecord3]("bad_table3", OlapConfig( order_by_fields=["id", "field2"] # Can't have nullable field in orderByFields )) ``` ## Development Workflow ### Local Development with Hot Reloading One of the powerful features of Moose is its integration with the local development server: 1. Start your local development server with `moose dev` 2. When you define or modify an `OlapTable` in your code and save the file: - The changes are automatically detected - The TypeScript compiler plugin processes your schema definitions - The infrastructure is updated in real-time to match your code changes - Your tables are immediately available for testing For example, if you add a new field to your schema: ```py filename="HotReloading.py" copy # Before class BasicSchema(BaseModel): id: Key[str] name: str # After adding a field class BasicSchema(BaseModel): id: Key[str] name: str created_at: datetime ``` The Moose framework will: 1. Detect the change when you save the file 2. Update the table schema in the local ClickHouse instance 3. Make the new field immediately available for use ### Verifying Your Tables You can verify your tables were created correctly using: ```bash filename="Terminal" copy # List all tables in your local environment moose ls ``` #### Connecting to your local ClickHouse instance You can connect to your local ClickHouse instance with your favorite database client. Your credentials are located in your `moose.config.toml` file: ```toml filename="moose.config.toml" copy [clickhouse_config] db_name = "local" user = "panda" password = "pandapass" use_ssl = false host = "localhost" host_port = 18123 native_port = 9000 ```