references/ack.md
# ACK (Container Service for Kubernetes) Reference
Script: `scripts/ack.py` | API style: ROA | Version: CS/2015-12-15
ACK is suited for containerized workloads and Kubernetes orchestration scenarios. Cluster creation takes a long time (5-15 minutes), so **prefer reusing existing clusters**.
Dependency: `pip install kubernetes` (Job execution uses the K8s Python SDK; no local kubectl install required).
## End-to-end workflow
```
1. Decide node specs
→ 2. Cost confirmation (ECS price query + Pro edition management fee)
→ 3. Network preparation (VPC)
→ 4. ensure_cluster (name match → reuse any available cluster → create + wait_nodes_ready)
→ 5. run_script_as_job (submit Job via K8s SDK, dynamic image resolution)
→ 6. cleanup_resources (reverse order: cluster → wait → SG)
```
> `create_and_run` wraps steps 3-6 to complete the full flow in a single call.
>
> **Key**: step 4 not only waits for the cluster to be running, it also waits for nodes to become Ready, so Jobs can actually be scheduled.
## Cost confirmation (mandatory)
**Before creating a cluster you must estimate the cost and confirm with the user. Do not call create_cluster without confirmation.**
ACK itself **has no price-query API**. ACK cost = cluster management fee + Worker node ECS cost:
- Managed basic edition (ManagedKubernetes): cluster management fee is **free**
- Managed Pro edition: cluster management fee ~CNY 0.64/hour
- Worker node cost: estimate via ECS `describe_price`
```python
# Use the ECS pricing helper (see SKILL.md)
num_of_nodes = 2  # Worker node count used for the estimate
result = describe_price(
    instance_type="ecs.g7.xlarge",  # Worker node spec
    price_unit="Hour",
    instance_charge_type="PostPaid",
    system_disk_category="cloud_essd",
    system_disk_size=120,
)
hourly_per_node = result["PriceInfo"]["Price"]["TradePrice"]
total_hourly = hourly_per_node * num_of_nodes
```
**Determining cluster edition**: when reusing an existing cluster, check the `profile` field returned by `describe_cluster_detail`:
- `profile="Default"` → basic edition (management fee is free)
- `profile="XEnhance"` or user explicitly requests Pro → Pro edition (CNY 0.64/hour)
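The arithmetic behind the cost templates can be sketched as a small helper. This is a minimal illustration, not part of `scripts/ack.py`: the function name `estimate_ack_cost` and the fee table are assumptions, and the fee values are simply the figures quoted in this section. `Decimal` is used so the displayed totals do not pick up floating-point noise.

```python
from decimal import Decimal

# Hourly management fee by cluster profile, using the figures quoted above.
# This mapping is an illustrative assumption, not an API of scripts/ack.py.
MANAGEMENT_FEE_CNY_PER_HOUR = {
    "Default": Decimal("0"),      # basic edition: management fee is free
    "XEnhance": Decimal("0.64"),  # Pro edition
}


def estimate_ack_cost(profile, hourly_per_node, num_of_nodes, hours=1):
    """Return (management, nodes, total) in CNY for the given runtime."""
    fee = MANAGEMENT_FEE_CNY_PER_HOUR[profile]
    management = fee * hours
    # str() keeps the price as typed (0.84) instead of its binary expansion.
    nodes = Decimal(str(hourly_per_node)) * num_of_nodes * hours
    return management, nodes, management + nodes


basic = estimate_ack_cost("Default", 0.84, 2)
pro = estimate_ack_cost("XEnhance", 0.84, 2)
print(f"basic total: CNY {basic[2]}")  # basic total: CNY 1.68
print(f"pro total: CNY {pro[2]}")      # pro total: CNY 2.32
```

The per-node price would normally come from `describe_price`; it is hard-coded here only to reproduce the numbers in the templates.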
Cost display template (basic edition):
```
ACK cost estimate:
Cluster type: ManagedKubernetes (managed basic edition, management fee free)
Worker spec: ecs.g7.xlarge (4vCPU, 16GB) x 2 nodes
Per-node price: CNY 0.84/hour
Total node cost: CNY 1.68/hour
System disk: cloud_essd 120GB x 2
Estimated runtime: ~1 hour
Estimated total cost: CNY 1.68
Billing method: pay-as-you-go (PostPaid)
Proceed with creation?
```
Cost display template (Pro edition):
```
ACK cost estimate:
Cluster type: ManagedKubernetes Pro (managed Pro edition)
Cluster management fee: CNY 0.64/hour
Worker spec: ecs.g7.xlarge (4vCPU, 16GB) x 2 nodes
Per-node price: CNY 0.84/hour
Total node cost: CNY 1.68/hour
System disk: cloud_essd 120GB x 2
Estimated runtime: ~1 hour
Estimated total cost: CNY 2.32 (management CNY 0.64 + nodes CNY 1.68)
Billing method: pay-as-you-go (PostPaid)
Proceed with creation?
```
## API quick reference
### ensure_cluster(...) -> str
**Recommended entry point**. Three-tier lookup: name match → `acf-` prefix cluster → create new. Returns cluster_id and **guarantees nodes are Ready**.
```python
from ack import ensure_cluster
cluster_id = ensure_cluster(
    cluster_name="acf-cluster",               # used for name matching
    vpcid="vpc-xxx",                          # VPC ID (used when creating)
    vswitch_ids=["vsw-xxx"],                  # VSwitch ID list (used when creating)
    worker_instance_types=["ecs.g7.xlarge"],  # Worker node ECS spec
    num_of_nodes=1,                           # Worker node count
    reuse_any=False,                          # whether to search for available acf- prefixed clusters (default False)
    region=None,
)
# cluster_id -> "c-xxx" (may be a reused existing cluster)
```
Logic:
1. `describe_clusters(name=cluster_name)` looks up clusters with the same name
2. Same-name + running -> reuse directly after RBAC check passes
3. Same-name + initial -> `wait_cluster_running` + `wait_nodes_ready`
4. No same-name + `reuse_any=True` -> search running clusters with `acf-` prefix (concurrent probing, 3s timeout); reuse any cluster that has Ready nodes and RBAC permission via the K8s API
5. None of the above -> `create_cluster` + `wait_cluster_running` + `wait_nodes_ready`
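The branching in the steps above can be sketched as a pure function over the dicts returned by `describe_clusters`. This is only an approximation under stated assumptions: `plan_cluster_action` is a hypothetical name, the input is assumed to be the full cluster list, and the RBAC and node-readiness probes are reduced to a `"probe"` result that the caller would still have to carry out.

```python
def plan_cluster_action(clusters, cluster_name, reuse_any=False, prefix="acf-"):
    """Pick the next step for a list of cluster dicts (name, state, cluster_id)."""
    same_name = [c for c in clusters if c["name"] == cluster_name]
    # Step 2: a running same-name cluster is reused (after the RBAC check).
    for c in same_name:
        if c["state"] == "running":
            return ("reuse", c["cluster_id"])
    # Step 3: a same-name cluster that is still initializing is waited on.
    for c in same_name:
        if c["state"] == "initial":
            return ("wait", c["cluster_id"])
    # Step 4: only with reuse_any=True are other acf- clusters considered.
    if reuse_any:
        for c in clusters:
            if c["name"].startswith(prefix) and c["state"] == "running":
                return ("probe", c["cluster_id"])
    # Step 5: nothing usable was found, so a new cluster is needed.
    return ("create", None)


clusters = [
    {"cluster_id": "c-aaa", "name": "acf-other", "state": "running"},
    {"cluster_id": "c-bbb", "name": "acf-cluster", "state": "initial"},
]
print(plan_cluster_action(clusters, "acf-cluster"))  # ('wait', 'c-bbb')
```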
> **Note**: `reuse_any` defaults to `False`, so only same-name clusters are reused. Other `acf-` prefixed clusters are searched only when the user explicitly sets `reuse_any=True`. Clusters whose names do not start with `acf-` are never reused.
### wait_nodes_ready(cluster_id, min_nodes, region, timeout) -> int
Polls node Ready status via the K8s API. Returns the count of Ready nodes.
```python
from ack import wait_nodes_ready
ready_count = wait_nodes_ready(
    cluster_id="c-xxx",
    min_nodes=1,   # minimum number of Ready nodes required
    timeout=300,   # timeout in seconds
)
```
> `ensure_cluster` calls `wait_nodes_ready` internally, so manual invocation is usually unnecessary.
### describe_clusters(name, cluster_type, region) -> list[dict]
Lists clusters. Calls the `DescribeClustersV1` API (`GET /api/v1/clusters`).
```python
from ack import describe_clusters
clusters = describe_clusters(name="acf-cluster")
for c in clusters:
    print(f"{c['cluster_id']}: {c['name']} ({c['state']})")
```
### create_cluster(...) -> dict
Creates a managed ACK cluster. Returns `{cluster_id, task_id, request_id}`.
```python
from ack import create_cluster
result = create_cluster(
    cluster_name="acf-cluster",               # required
    cluster_type="ManagedKubernetes",         # managed edition (recommended)
    vpcid="vpc-xxx",                          # required
    vswitch_ids=["vsw-xxx"],                  # required
    container_cidr="172.20.0.0/16",           # Pod CIDR
    service_cidr="172.21.0.0/20",             # Service CIDR
    worker_instance_types=["ecs.g7.xlarge"],  # Worker ECS spec
    num_of_nodes=2,                           # node count
    worker_system_disk_category="cloud_essd",
    worker_system_disk_size=120,
    # security_group_id=None -> auto-create an enterprise security group, released with the cluster
    # login_password=None -> auto-generate a random password
    region=None,
)
```
> **Security group**: when `security_group_id` is omitted, an enterprise security group is auto-created (`is_enterprise_security_group=True`) and released automatically when the cluster is deleted, so no manual cleanup is needed.
> **Login credentials**: when neither `login_password` nor `key_pair` is provided, a 16-character random password is generated.
### wait_cluster_running(cluster_id, region, timeout=900) -> dict
Waits for the cluster to reach the `running` state. Default timeout is 900 seconds (15 minutes).
### describe_cluster_detail(cluster_id, region) -> dict
Returns cluster details. State values: `initial`, `running`, `failed`, `deleted`, `deleting`.
### run_script_as_job(...) -> str
**Runs a script in the cluster via the Kubernetes Python SDK**. Automatically resolves the in-cluster image registry, polls Pod status, and raises on errors.
```python
from ack import run_script_as_job
output = run_script_as_job(
    cluster_id="c-xxx",            # required
    script_content="echo hello",   # script content
    job_name="acf-job",            # Job name
    script_type="shell",           # "shell" or "python"
    # image=None -> dynamically resolve the image registry prefix from kube-system pods
    poll_interval=10,              # status polling interval (seconds)
    timeout=600,                   # maximum wait time (seconds)
    region=None,
)
# output -> "hello\n" (Job stdout)
```
Key features:
- **Dynamic image resolution**: extracts the in-cluster registry prefix from kube-system pods (e.g. `registry-cn-hangzhou-vpc.ack.aliyuncs.com/acs/`), guaranteeing the image is pullable from inside the cluster
- **Polling-based status checks**: checks every `poll_interval` seconds; raises immediately on fatal errors such as `ImagePullBackOff` or `CrashLoopBackOff`
- **Auto cleanup of stale Jobs**: if a Job with the same name exists, it is deleted and recreated
- **Auto Job reclamation**: sets `ttlSecondsAfterFinished=300`, so completed Jobs are cleaned up 5 minutes after completion
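Two of the features above, registry-prefix resolution and fatal-error detection, can be approximated with plain string handling. The sketch below is an assumption about how such logic could look, not the implementation in `scripts/ack.py`. The helper names are hypothetical, and the `coredns:v1` and `busybox:latest` image names are illustrative only. In practice the image list would come from kube-system pod specs and the waiting reasons from container statuses.

```python
# Container waiting reasons treated as fatal, per the feature list above.
FATAL_WAITING_REASONS = {"ImagePullBackOff", "CrashLoopBackOff"}


def registry_prefix(image):
    """Return everything up to and including the last "/" of an image reference."""
    head, sep, _ = image.rpartition("/")
    return head + sep  # empty string when the image has no registry path


def resolve_image(system_images, image_name):
    """Prefix image_name with the registry of the first qualified system image."""
    for image in system_images:
        prefix = registry_prefix(image)
        if prefix:
            return prefix + image_name
    raise RuntimeError("no registry prefix found among kube-system images")


def fatal_reason(waiting_reasons):
    """Return the first fatal container waiting reason, or None."""
    for reason in waiting_reasons:
        if reason in FATAL_WAITING_REASONS:
            return reason
    return None


images = ["registry-cn-hangzhou-vpc.ack.aliyuncs.com/acs/coredns:v1"]
print(resolve_image(images, "busybox:latest"))
print(fatal_reason(["ContainerCreating", "ImagePullBackOff"]))
```

A poll loop would call `fatal_reason` on each tick and raise as soon as it returns something other than `None`, instead of waiting for the full timeout.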
### delete_cluster(cluster_id, region) -> dict
Submits a cluster delete request (asynchronous). Use together with `wait_cluster_deleted`.
### wait_cluster_deleted(cluster_id, region, timeout=600) -> None
Waits for the cluster to be fully deleted. Polls until the state is `deleted` or the API returns 404.
### cleanup_resources(cluster_id, security_group_id, region) -> None
**Reverse-order resource cleanup**: delete cluster -> wait for deletion -> optionally delete security group.
```python
from ack import cleanup_resources
# Auto-created security groups need no manual cleanup (released with the cluster)
cleanup_resources(cluster_id="c-xxx")
# Externally provided security groups must be specified explicitly
cleanup_resources(cluster_id="c-xxx", security_group_id="sg-xxx")
```
### create_and_run(...) -> dict
**One-stop convenience entry point**. Runs the full flow automatically: VPC preparation -> cluster reuse/creation -> script execution -> auto cleanup.
```python
from ack import create_and_run
# Option 1: auto-create network (specify zone_id)
result = create_and_run(
    script_content="echo hello",
    cluster_name="acf-cluster",
    zone_id="cn-hangzhou-h",  # auto-create VPC/VSwitch
    worker_instance_types=["ecs.t6-c1m2.large"],
    num_of_nodes=1,
    script_type="shell",
    auto_cleanup=True,  # delete the cluster automatically after execution
)
# Option 2: use existing network
result = create_and_run(
    script_content="print('hello')",
    vpcid="vpc-xxx",
    vswitch_ids=["vsw-xxx"],
    script_type="python",
    auto_cleanup=False,  # keep the cluster for later use
)
# result -> {"cluster_id": "c-xxx", "job_output": "hello\n"}
```
Key features:
- **Cluster reuse**: looks up an existing running cluster by name to avoid redundant creation
- **Automatic network preparation**: when `zone_id` is provided, the network is created automatically via the network preparation helpers (see SKILL.md)
- **Hands-off security group**: when no SG is provided, ACK auto-creates an enterprise security group that is released with the cluster
- **try/finally auto cleanup**: cleans up the cluster even if script execution fails (when `auto_cleanup=True`)
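The try/finally behavior can be illustrated with a small skeleton. This is a sketch of the pattern only: `run_with_cleanup` is a hypothetical name, and the three callables stand in for `ensure_cluster`, `run_script_as_job`, and `cleanup_resources` so the example runs without cloud access.

```python
def run_with_cleanup(ensure, run, cleanup, auto_cleanup=True):
    """Ensure a cluster, run the job, and clean up even if the job raises."""
    cluster_id = ensure()
    try:
        output = run(cluster_id)
        return {"cluster_id": cluster_id, "job_output": output}
    finally:
        # Runs on success and on failure; skipped only when auto_cleanup=False.
        if auto_cleanup:
            cleanup(cluster_id)


cleaned = []
result = run_with_cleanup(
    ensure=lambda: "c-xxx",
    run=lambda cluster_id: "hello\n",
    cleanup=cleaned.append,
)
print(result, cleaned)
```

Because the cleanup sits in `finally`, a reused cluster would also be deleted unless `auto_cleanup=False` is passed, which is why that setting matters when the cluster was not created by the current run.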
### Other APIs
- `describe_regions(region)` -> dict — query available regions
- `get_cluster_kubeconfig(cluster_id, region)` -> str — fetch kubeconfig YAML
- `create_cluster_node_pool(cluster_id, ...)` -> dict — create a node pool
- `describe_cluster_node_pools(cluster_id, region)` -> dict — list node pools
## Documentation search
Search the documentation when a parameter is unclear, an error code is unfamiliar, or the latest API docs are needed:
```python
from doc_search import search_and_format
print(search_and_format("CreateCluster managed cluster parameters", product="ack"))
print(search_and_format("node pool GPU scheduling", product="ack"))
```
ACK API reference: https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/developer-reference/api-cs-2015-12-15-overview
## Notes
- Cluster creation takes 5-15 minutes, so prefer `ensure_cluster`, which reuses a same-name cluster; pass `reuse_any=True` explicitly to search other clusters as well
- When `reuse_any=True`, only `acf-` prefixed clusters are searched; clusters whose names do not start with `acf-` are never auto-reused
- Cluster probing uses a 3-second timeout, concurrency of 10, and inspects up to 100 candidates while validating RBAC permissions (batch/jobs)
- After the cluster is running, `ensure_cluster` calls `wait_nodes_ready` automatically and only returns once nodes are schedulable
- `container_cidr` and `service_cidr` must not overlap with the VPC CIDR
- When `security_group_id` is omitted, ACK auto-creates an enterprise security group that is released with the cluster
- Cluster cost mainly comes from Worker node ECS instances and is billed continuously even with no workload; the Pro edition adds a CNY 0.64/hour management fee
- Images are resolved dynamically from kube-system pods, so no manual registry configuration is needed
- When reusing an existing cluster, set `auto_cleanup=False` to avoid deleting a cluster that was not created by this run