文件预览

README.md

查看 ASR Personal Hotwords 技能包中的文件内容。

文件内容

README.md

# ASR Personal Hotwords

An OpenClaw skill that automatically mines personalized hotwords and ambiguous terms from OpenClaw conversation histories, powered by the [asr-corrector](https://github.com/asr-corrector) service.

## Overview

ASR (Automatic Speech Recognition) systems often misrecognize domain-specific terms, proper nouns, and niche vocabulary. This tool extracts conversation data from OpenClaw agent sessions and feeds it into a hotword mining pipeline to build a **per-agent personalized** disambiguation vocabulary table.

The generated hotword table (`hotwords.md`) can be directly injected into ASR model prompts (e.g., Qwen2.5-Omni) to improve transcription accuracy.

**Pipeline:**

```
OpenClaw Sessions (.jsonl) → Extract & Filter → asr-corrector API → Vocabulary Table → hotwords.md
```

## Features

- **Per-agent isolation** — Each agent maintains its own hotword table; `["self"]` mode auto-detects the current agent from skill path + `openclaw.json`
- **Smart filtering** — Skips system noise (cron tasks, heartbeats, sub-agent metadata, NO_REPLY, pure English, URLs, code blocks)
- **Incremental anchors** — Loads **all** historical vocab results as anchors (not just the latest), vocabulary only grows
- **Auto-chunking** — Splits large data into API-compliant chunks (≤400 lines / ≤10,000 chars) and merges results with deduplication
- **Zero-config credentials** — Reads LLM API key, base URL, and model from `~/.openclaw/openclaw.json` automatically
- **UTC→CST date handling** — Correctly filters messages by CST date, not UTC
- **HTTP retry** — 3 retries with exponential backoff for 502/503/504 errors
- **Safe re-runs** — Automatically backs up previous results when re-running the same date
- **Multiple export formats** — txt, json, csv, prompt (for ASR injection)

## Project Structure

```
asr-personal-hotwords/
├── README.md               # This file
├── SKILL.md                # OpenClaw skill instructions & install wizard
├── config.yaml             # Configuration
├── extract_sessions.py     # Session data extraction module
├── run.py                  # Main pipeline controller (includes export)
├── run.sh                  # Shell entry point
├── hotwords.md             # Generated hotword table (git-ignored)
└── output/                 # Mining results archive (git-ignored)
    └── vocab_YYYY-MM-DD.json
```

## Prerequisites

- Python 3.9+
- A running [asr-corrector](https://github.com/asr-corrector) service
- OpenClaw installed with session history in `~/.openclaw/agents/`

**Python dependencies:**

```bash
pip install requests pyyaml
```

## Installation

This skill is designed to be installed **per-agent** in the agent's workspace:

1. Determine the agent's workspace from `~/.openclaw/openclaw.json` (`agents.list[].workspace` or `agents.defaults.workspace`)
2. Install to `{workspace}/skills/asr-personal-hotwords/`

```bash
# Install to agent workspace
clawhub install asr-personal-hotwords
```

See `SKILL.md` for the full 7-step installation wizard.

## Quick Start

```bash
# Mine yesterday's conversations (default)
bash run.sh

# Mine a specific date
bash run.sh --date 2026-04-26

# Mine a date range
bash run.sh --start 2026-04-20 --end 2026-04-26

# Export hotword table only (no mining)
bash run.sh --export-only -f prompt -o hotwords.md
```

## Configuration

Edit `config.yaml`:

```yaml
# Remote asr-corrector service
server_url: "http://124.174.11.138:65000"

# Extraction settings
extract:
  agents: ["self"]            # ["self"] = auto-detect, ["*"] = all, or ["main", "claude"]
  max_content_len: 300        # Max chars per message (truncates beyond)
  min_freq: 3                 # Minimum word frequency threshold
```

### Credential Resolution

LLM API key, base URL, and model are **automatically** resolved from `~/.openclaw/openclaw.json`:

```
agents.defaults.model.primary → "provider/model_id"
    ↓
models.providers.{provider} → apiKey, baseUrl, api format
```

No manual API key or model configuration is needed.

## How It Works

### 1. Session Extraction (`extract_sessions.py`)

Scans `~/.openclaw/agents/{agent}/sessions/*.jsonl` files and extracts `user`/`assistant` message pairs for the target date(s).

**Agent resolution (`["self"]` mode):**
1. Infers workspace path from the skill's file location
2. Looks up `~/.openclaw/openclaw.json` to find the matching agent by workspace path
3. Falls back to `OPENCLAW_AGENT_NAME` env var, then `"main"`

**Filtering rules:**
- Only `type: "message"` with `role: "user"` or `role: "assistant"`
- UTC timestamps converted to CST before date comparison
- Skips: NO_REPLY, HEARTBEAT_OK, cron contexts, sub-agent metadata, system internals, pure JSON, code blocks, pure English (no Chinese characters), URLs
- Files with mtime before target date are skipped entirely (fast path)
- Messages truncated to `max_content_len` characters
- Output format: `发送者:内容` (one message per line)

### 2. Anchor Loading

Loads **all** `output/vocab_*.json` files (excluding `.bak` backups), merges by term with deduplication. This ensures the vocabulary only grows over time — terms discovered in earlier runs are never lost.

### 3. API Submission (`run.py`)

Chat data is submitted to the asr-corrector service's `/api/build-vocab` endpoint. If data exceeds API limits, it is automatically split into chunks. Each chunk is submitted as a separate task, polled for completion, and results are merged with deduplication and frequency aggregation.

**Reliability:**
- HTTP requests use a retry-enabled session (3 retries, exponential backoff, 502/503/504)
- Task polling with configurable timeout (default 600s)
- Same-date re-runs automatically back up previous results

### 4. Export (integrated in `run.py`)

`run.py` automatically exports `hotwords.md` after mining. For standalone export:

```bash
bash run.sh --export-only -f prompt -o hotwords.md
```

Merges all historical vocab files and exports a clean hotword table.

**Supported formats:**

| Format | Command | Use Case |
|--------|---------|----------|
| `prompt` | `bash run.sh --export-only -f prompt -o hotwords.md` | Direct ASR prompt injection |
| `json` | `bash run.sh --export-only -f json -o hotwords.json` | Programmatic access |
| `csv` | `bash run.sh --export-only -f csv -o hotwords.csv` | Human review / spreadsheet |
| `txt` | `bash run.sh --export-only -f txt` | Simple word list |

Entries with both empty `desc` and empty `possible_asr_errors` are automatically filtered out.

### 5. ASR Integration

The generated `hotwords.md` is in prompt format, ready to be injected into ASR model system prompts:

```
以下是用户场景中的专有名词和易混淆词汇表。
语音转录时,如果听到与「常见误识别」列发音相近的内容,请优先转录为「正确词」列的写法。

| 正确词 | 常见误识别 |
|--------|-----------|
| 飞书 | 飞鼠、非书、肥书 |
| VLA | 微辣、微拉、为了啊 |
| 持久化 | 持酒话、迟就化 |
```

## Modules

### `extract_sessions.py`

Standalone module for OpenClaw session data extraction.

```bash
python3 extract_sessions.py --date 2026-04-26 --output chat.txt
python3 extract_sessions.py --start 2026-04-20 --end 2026-04-26 --agents main claude
```

| Argument | Default | Description |
|----------|---------|-------------|
| `--date` | Yesterday | Target date (YYYY-MM-DD) |
| `--start` | - | Start date for range |
| `--end` | - | End date for range |
| `--agents` | `*` (all) | Agent names to extract |
| `--max-len` | 300 | Max characters per message |
| `--output` | stdout | Output file path |

### `run.py`

Main pipeline controller.

```bash
python3 run.py --date 2026-04-26
python3 run.py --start 2026-04-20 --end 2026-04-26
```

| Argument | Default | Description |
|----------|---------|-------------|
| `--date` | Yesterday | Target date |
| `--start` | - | Start date for range |
| `--end` | - | End date for range |

### `run.py --export-only`

Standalone hotword table export.

```bash
bash run.sh --export-only -f prompt -o hotwords.md
bash run.sh --export-only -f json --min-freq 3
```

| Argument | Default | Description |
|----------|---------|-------------|
| `-f, --format` | `prompt` | Output format: txt, json, csv, prompt |
| `-o, --output` | stdout | Output file path |
| `--output-dir` | `./output` | Vocab files directory |
| `--min-freq` | 1 | Minimum frequency filter |

## Example Output

```
$ bash run.sh --date 2026-04-26

18:07:54 INFO 从 openclaw.json 自动读取 LLM 配置...
18:07:54 INFO   Model: claude-opus-4-6, Format: anthropic
18:07:54 INFO 检查远端服务 http://124.174.11.138:65000...
18:07:54 INFO   服务正常 ✓
18:07:54 INFO 提取对话: 2026-04-26 ~ 2026-04-26 (agents: ['self'])
18:07:54 INFO 自动识别当前 agent: main
18:07:54 INFO 扫描 10 个文件, 2 个匹配, 提取 3 条消息
18:07:54 INFO 从 2 个历史文件加载 91 个 anchors(去重后)
18:08:33 INFO 块 1 完成: 100 条热词
18:08:33 INFO 结果保存到: output/vocab_2026-04-26.json

{
  "status": "success",
  "date": "2026-04-26",
  "messages_count": 3,
  "vocab_count": 100,
  "top_terms": [
    {"term": "飞书", "freq": 2, "asr_errors": ["飞鼠", "飞输", "非书"]},
    {"term": "机器之心", "freq": 2, "asr_errors": ["机器知心", "机器之新"]},
    {"term": "反爬虫", "freq": 2, "asr_errors": ["反扒虫", "翻爬虫"]}
  ]
}
```

## License

Internal use only.