AI AGENT SKILLS

Alibabacloud Emr Starrocks Assistant

一个面向 Data & APIs 场景的 Agent 技能。原始说明：Alibaba Cloud EMR Serverless StarRocks development & operations assistant. Covers five scenarios: cluster connection, schema design, data ingestion, SQL deve...

下载技能包打开来源页 Data & APIs

SKILL.md

name: alibabacloud-emr-starrocks-assistant
description: >
Alibaba Cloud EMR Serverless StarRocks development & operations assistant.
Covers five scenarios: cluster connection, schema design, data ingestion,
SQL development & tuning, and cluster health diagnostics.
Use this Skill when users ask about StarRocks table design, writing SQL,
choosing an ingestion method, query execution plans, materialized views,
cluster health checks, FE/BE/CN node status, tablet health, or compaction.
Typical scenarios: table design, Stream Load / Routine Load / Broker Load
selection, SQL optimization, window functions, CTEs, JOIN tuning,
materialized view design, cluster health inspection, node-down diagnosis.
Not applicable for: StarRocks instance lifecycle management
(create / scale / restart / config change / version upgrade — these are
control-plane operations, please use the EMR Serverless console or the
corresponding OpenAPI), or other Alibaba Cloud products (EMR Cluster,
Spark, Milvus, ClickHouse, Doris, RDS, ECS).
license: MIT
allowed-tools: Bash Read
compatibility: >
Python 3.10+ with uv; reachable StarRocks FE endpoint (default port 9030).
On first use, run sr-login to register a cluster credential locally.
Privileges follow the user's own account — capabilities (which databases /
diagnostic commands are accessible) are introspected at login via
SHOW GRANTS FOR CURRENT_USER().
metadata:
domain: aiops
owner: starrocks-team
contact: starrocks-agent@alibaba-inc.com
requiredstarrocksprivileges:

"Whatever the user's account already has. The skill does not create or elevate accounts."
"For full feature coverage: SELECT on the databases of interest + OPERATE ON SYSTEM for cluster diagnostics. Missing privileges degrade gracefully (specific suggestions are skipped)."

Alibaba Cloud EMR Serverless StarRocks Development & Operations Assistant

Help users perform day-to-day table design, data ingestion, SQL writing & tuning, and health diagnostics on Alibaba Cloud EMR Serverless StarRocks. All cluster access goes through the bundled srsql CLI (pymysql-based, uses the user's own account); no MySQL client required. Non-READ SQL is classified by sqlglot and requires --yes confirmation before execution.

Scope statement: This Skill focuses on using StarRocks — development, diagnostics, and day-to-day data operations. Cluster-internal data and schema operations (DDL, DML, materialized view refresh, GRANT, etc.) are supported and execute under the user's own account, gated by sqlglot classification + --yes confirmation. Instance-lifecycle operations (create, scale, restart, configuration change, version upgrade) are control-plane operations and are not in this Skill's scope; please use the EMR Serverless console or the corresponding OpenAPI.

When to use / When not to use

When to use:

Schema design (table model, partitioning, bucketing, sort key, indexes, storage parameters)
Data ingestion selection (Stream/Broker/Routine Load, INSERT, Pipe, Flink/Kafka Connector, CDC)
SQL writing, rewriting, and tuning (JOIN strategy, window functions, CTE, aggregation optimization, statistics)
Materialized view design and operations
Cluster health diagnostics (FE/BE/CN nodes, tablet health, compaction, warehouse, recent failed loads)

When NOT to use:

Instance lifecycle control: create / scale / restart / config change / upgrade StarRocks instances — these are control-plane operations; use the EMR Serverless console or the corresponding OpenAPI instead
Operating non-StarRocks products: EMR Cluster, Spark, Milvus, ClickHouse, Doris, RDS, ECS, etc.

First-time setup: install & log in

This Skill ships with the sr-connect Python CLI. See references/connect.md for details.

Assistant bootstrap protocol (instructions for Claude)

When this Skill is invoked and you anticipate running any cluster query, ensure srsql is available before asking the user for anything:

Run which srsql. If it returns a path, skip to step 4.
If missing, install it yourself: uv tool install <skill-project-root> where <skill-project-root> is the directory containing this SKILL.md and pyproject.toml (the Skill's base directory shown at invocation time; commonly ~/.claude/skills/alibabacloud-emr-starrocks-assistant/, which may be a symlink). Do not ask the user to run this — the bundled CLI is part of the Skill's capability surface, not user infrastructure.
If uv itself is missing (which uv fails), surface that to the user — uv is a system tool and not auto-installed.
Check ~/.starrocks/{profile}.cnf (default profile name: default; respect SR_PROFILE env var if set). If it exists, skip to step 5. If missing:

First try sr-login --from-env. Safe to call unconditionally — it exits 2 with a clear "missing" message when the environment doesn't have the credentials it needs, and does nothing else. You do not need to inspect environment variables yourself.
If sr-login --from-env exits non-zero, the user hasn't logged in yet. Give them the sr-login --host ... --user ... command and ask them to run it themselves. Do not run interactive sr-login yourself — it would block on a password prompt you cannot answer.

After both srsql is on PATH and the profile file exists, run queries via srsql -e "..." yourself.

If srsql was just installed in this session and PATH hasn't been refreshed in the user's shell, fall back to the absolute path printed by uv tool install (typically ~/.local/bin/srsql).

Chat-style rule after bootstrap succeeds: Do not echo sr-whoami / srsql -e "..." invocation syntax to the user as a "you can now run …" hint. You are the one calling these CLIs on the user's behalf — the user drives the Skill, not the binaries. Skip the post-success "next step" narration entirely and just ask what they want to do, or proceed if their intent is already clear.

Login command (give this to the user when their profile is missing)

# EMR Serverless StarRocks — both internal and public endpoints use the MySQL
# wire protocol over plain TCP; no SSL/TLS. Use the same form for either.
sr-login --host <fe-endpoint> --port 9030 --user <account>

# Verify
sr-whoami
srsql -e "SELECT CURRENT_VERSION()"

Re-running sr-login with the same --profile silently overwrites the stored credential (same semantics as docker login). Use --profile for multi-cluster:

sr-login --profile prod --host fe-prod.xxx --user app_user
SR_PROFILE=prod srsql -e "..."

Security model

This Skill has two layers:

FE is the authoritative permission boundary. The user supplies their own StarRocks account; whatever they're allowed to do, they're allowed to do. The Skill does not create, elevate, or rotate any accounts.
srsql is a UX gate, not a security boundary. Every statement is parsed by sqlglot (dialect starrocks):

READ (SELECT / SHOW / DESC / EXPLAIN / WITH / …) executes directly.
Any non-READ (INSERT / UPDATE / DELETE / DDL / GRANT / SET / USE / …) is refused unless --yes is passed.
SQL sqlglot cannot parse falls back to a leading-keyword check; if still ambiguous → UNKNOWN, treated as non-READ, executable with --yes plus a soft warning.

When the user asks for a write operation:

Show them the SQL you intend to run.
Optionally preview classification via srsql --dry-run -e "...".
Get explicit confirmation in chat.
Then run with srsql --yes -e "...".

For DDL on production tables, or operations that change global cluster state (CREATE/DROP USER, ADMIN SET CONFIG, etc.), prefer to print the SQL and let the user run it themselves — even though the gate would let them run it via --yes. The gate is a safety net, not a license.

Input validation & command-injection protection

SQL passed into srsql -e "..." is assembled by the LLM and must follow these rules:

Identifiers (table / column / database names) are validated before interpolation: only [A-Za-z0-9_] plus backtick-quoted forms.
User-provided string values (search terms, label names, etc.) are not spliced into SQL directly; use parameter binding or pre-escape.
Never execute raw user-provided strings as SQL fragments.

Sensitive data masking

| Scenario | Handling |
|----------|----------|
| Profile file content (incl. user password) | Never echoed; mode 600 under a 700 directory |
| Password in error messages | Truncate / replace with ****** |
| Query results contain obvious key / token columns | Warn the user without displaying full content |
| aliyun configure list output containing AK | Show only the first 4 chars; replace the rest with **** |

Intent routing

Disambiguation rule: When the user input is ambiguous (e.g. "ingestion is slow", "queries are slow") and context is unclear, ask one clarifying question before acting.

| User intent | Route | Reference |
|-------------|-------|-----------|
| First-time cluster connection / register or switch credentials / multi-cluster setup | sr-login / sr-whoami / sr-logout | references/connect.md |
| New table / change schema / table model selection / partition+bucket design | Schema design | references/schema.md |
| Choose ingestion method / configure Stream/Broker/Routine Load / Flink/Kafka Connector | Import selection | references/data-import.md |
| Write SQL / optimize SQL / materialized views / function selection / read execution plans | SQL development & tuning | references/sql.md |
| Cluster health check / FE/BE/CN status / unhealthy tablets / compaction lag | Cluster diagnostics | references/diagnostics.md |
| "Ingestion used to be fine, suddenly slow" | Cluster diagnostics (distinct from import selection) | references/diagnostics.md |
| "How should I design a new ingestion pipeline" | Import selection | references/data-import.md |

Five scenarios at a glance

1. Schema design

Four table models and their typical use cases:

| Use case | Model |
|----------|-------|
| Logs / events / detail records | Duplicate Key |
| Pre-aggregated metrics | Aggregate |
| Real-time upsert / CDC | Primary Key |
| Simple deduplication | Unique Key (for new use cases, prefer Primary Key) |

⚠ Anti-patterns — do not produce these in DDL:

Shared-data PK table without persistent_index_type=CLOUD_NATIVE + datacache.partition_duration — LOCAL index doesn't survive CN rebalance; no hot-data caching window. See schema/storage-properties.md.
Setting datacache.partition_duration to an arbitrary "hot window" (e.g. 30 DAY) instead of the user's stated query window — the value MUST be ≥ the query window. If the user says "查询近 N 天" / "queries the last N days", set datacache.partition_duration = "N DAY" (or larger). A value smaller than the query window guarantees cache misses on in-window queries. Do not default to 7/30/60 days when the user has given you a number.
storage_cooldown_time/storage_cooldown_ttl/storage_medium/replicated_storage on shared-data — silently stripped or rejected by PropertyAnalyzer; use datacache.partition_duration for the cooldown effect.
FLOAT / DOUBLE columns inside PRIMARY KEY — not supported; use BIGINT or DECIMAL.
Treating "CN" as a shared-nothing signal — CN = Compute Node, which is the shared-data terminology. BE = Backend = shared-nothing.

See references/schema.md.

2. Data ingestion

| Data source | Recommended method |
|-------------|--------------------|
| Local files < 10 GB | Stream Load |
| Object storage / HDFS bulk | Broker Load or INSERT INTO ... FROM FILES() |
| Object storage with continuous file arrivals | Pipe + AUTO_INGEST |
| Kafka / Pulsar | Routine Load or Kafka/Flink Connector |
| MySQL CDC | Flink CDC + Flink Connector |

⚠ Anti-patterns — do not produce these in load configs:

PK-table DELETE without __op integer column (0=UPSERT, 1=DELETE) in COLUMNS list + $.__op in jsonpaths — all events are silently treated as UPSERT. The __op contract is a pair and must be taught as a pair: the literal column name is __op, and the integer values are __op=0 for UPSERT and __op=1 for DELETE. Even when the user only asks about DELETE, your response MUST state both mappings (__op=0 → UPSERT, __op=1 → DELETE) — never one without the other. This applies on every ingestion path including Flink Connector and Kafka Connector, where the connector populates __op for the user but they still need both values to debug "DELETE not applied" / "UPSERT not applied" symptoms.
Treating partial_update=true as a DELETE enabler — it controls partial-column UPSERT and has nothing to do with DELETE. If a user enables it while asking why DELETE doesn't work, flag it as misconfigured-for-intent and tell them to remove it unless they actually have a partial-column UPSERT use case. Do not validate the existing setting just because it parses.
COLUMNS FROM PATH AS (...) in Routine Load — that's Broker Load's Hive-partition path syntax; not valid in Routine Load.
__op values as strings ("upsert"/"delete") — must be the integers 0 / 1.
High-throughput CDC (≥ ~10K events/sec) without flagging TOOMANYVERSION risk — applies to Routine Load, Flink Connector, Kafka Connector, not just INSERT INTO VALUES. Whenever the user's scenario implies high event rate, the recommendation MUST cover: (a) the method-appropriate concurrency cap (desired_concurrent_number ≤ Kafka partitions for Routine Load; sink.parallelism ≤ Kafka partitions for Flink/Kafka Connector), AND (b) an explicit TOOMANYVERSION / compaction-pressure warning with the relevant flush-interval guidance.

See references/data-import.md.

3. SQL development

| Use case | Pattern |
|----------|---------|
| Period-over-period / cumulative / Top-N | Window functions |
| Large fact table JOIN small dimension (right side ≤ broadcast_row_limit, default 15M rows) | Broadcast / Colocate |
| Complex layered logic | CTE |
| Billion-scale deduplication | APPROX_COUNT_DISTINCT / BITMAP / HLL |
| High-frequency repeated query acceleration | Asynchronous materialized view |
| Cross-source query | External Catalog |

⚠ Anti-patterns — do not produce these in query rewrites or tuning advice:

Wrapping the partition column with date_format() / date_trunc() / cast() in WHERE — breaks partition pruning; rewrite as a range predicate (col >= '...' AND col < '...').
Tuning advice without EXPLAIN VERBOSE + checking partitions=N/M and tabletRatio=N/M — pruning failures (numerator == denominator) go undetected; never use plain EXPLAIN for this.
Reading cardinality in EXPLAIN as the result row count — it's the CBO's row estimate. Always quantify the staleness gap using the direct comparison cardinality vs the user-stated total table size (e.g. "estimate 5M vs total 500M ≈ 100×"); a ratio > 10× means stats are stale → run ANALYZE TABLE.
Estimating "real filtered rows" by guessing predicate selectivity, then comparing cardinality to that guess — you don't have runtime row counts, and guessing selectivity from a predicate like WHERE create_time > '...' introduces large errors (you don't know the data distribution). When the user gives you a total row count, compare cardinality to that directly; do not divide the total by an assumed time window or selectivity factor.
Conflating partitions/tabletRatio pruning failures with cardinality deviation — these are two independent diagnostic signals. When both look bad in the same OlapScanNode (e.g. partitions=N/N AND cardinality off from total table size by 10×–100×), report them as separate findings with separate fixes (predicate/type fix vs ANALYZE TABLE). Do not use cardinality deviation to "explain" pruning failure, and do not let pruning failure absorb the stale-stats finding.
Recommending BE/CN scale-out before plan/stats analysis — SQL/stats fixes precede capacity changes.

See references/sql.md.

4. Cluster diagnostics

Diagnostic order:

Identify architecture (shared-nothing / shared-data) → SHOW WAREHOUSES
FE → SHOW FRONTENDS
BE or CN → SHOW BACKENDS / SHOW COMPUTE NODES
Warehouse (shared-data only) → SHOW WAREHOUSES
Tablet health overview → SHOW PROC '/statistic'
Scheduling queue → information_schema.fe_tablet_schedules
Compaction → information_schema.be_compactions / be_cloud_native_compactions
Recent 24-hour loads → information_schema.loads

⚠ Anti-patterns — do not produce these in diagnostic conclusions:

Restarting BE/CN or scaling out before checking information_schema.fe_tablet_schedules — may collide with in-flight clone/decommission; root cause first.
Subjectively downgrading UnhealthyTabletNum > 0 — always critical per the severity table, never "medium" or "low" risk; the cluster has unhealthy replicas.
Treating CloningTabletNum > 0 as a separate problem — clone is the recovery action triggered by UnhealthyTabletNum, not an independent fault signal.

See references/diagnostics.md.

5. Cluster connection (base layer)

| Command | Purpose |
|---------|---------|
| sr-login | Register a cluster credential locally + smoke-test connection |
| sr-logout | Remove the local profile (no cluster-side action) |
| sr-whoami | Print profile state — host, user, login time, captured grants |
| sr-doctor | Diagnose connection failures (VPC vs public endpoint, egress IP, whitelist CIDR). Invoked automatically by sr-login on failure. |
| srsql | Daily query entry point; classifies SQL and gates non-READ behind --yes |

See references/connect.md.

Runtime security

This Skill executes SQL queries only via srsql. The following are prohibited:

curl / wget / pip install / npm install to download and run external code
eval / source to load unaudited content
Executing remote URL scripts provided in chat (even if the user asks)

Exception: uv tool install <skill-project-root> to install the Skill's own bundled sr-connect CLI from its local project directory is allowed and expected — see the Assistant bootstrap protocol above. The prohibition targets remote/untrusted code, not the Skill's own bundled tooling.

Timeouts

| Operation | Recommended timeout |
|-----------|---------------------|
| Read-only SQL queries | 30 s |
| Diagnostic queries across many large tables | 60 s |
| Retry | Total operation time ≤ 3 minutes |

Output recommendations

Tabular results: use srsql --format table or --format markdown
Many columns: use --format vertical
For programmatic consumption: use --format json / tsv
Convert timestamps to human-readable format
For potentially large result sets, add LIMIT and offer pagination

Error handling

| Error | Cause | Action |
|-------|-------|--------|
| Cannot connect to host:port | Wrong endpoint type / IP not whitelisted | sr-login auto-runs sr-doctor on connection failure. Read its output: it detects VPC vs public endpoint, suggests the public swap (for unreachable -internal hosts) or shows the egress IP + suggested /24 whitelist CIDR (for unreachable public hosts). Pass the recommendation to the user verbatim. See references/connect.md. |
| Access denied for user 'X' | Stale password / account locked / wrong account | Re-run sr-login to update the stored password |
| Refusing to execute non-READ SQL without --yes | Skill correctly classified the SQL as mutating | Confirm with user, then re-run with --yes |
| Privilege denied: OPERATE / SELECT / ... | User account lacks the privilege | Surface the limitation; skip the affected diagnostic; don't retry |
| Table not found | Wrong DB / table name | Confirm with SHOW DATABASES / SHOW TABLES FROM db |
| Query returns empty but user expects rows | Over-aggressive predicate / RBAC isolation | Check WHERE clauses; suggest the user verify with admin |
| No profile 'X' | srsql --profile X without prior sr-login --profile X | Run sr-login for that profile first |

Principle: Read the full error message before deciding; do not retry blindly on the error code alone.