Auto-Updater Skill
一个面向 Other 场景的 Agent 技能。原始说明:Automatically update Clawdbot and all installed skills once daily. Runs via cron, checks for updates, applies them, and messages the user with a summary of what changed.
name: remotion-word-highlight-subtitles
description: Add word-level highlighted subtitles to local short videos using Whisper word timestamps and Remotion rendering.
version: 0.1.4
metadata:
openclaw:
requires:
bins:
emoji: "🎬"
This skill turns a local video into a subtitled video using the reusable "fine version": Whisper word timestamps plus Remotion-rendered current-word highlighting. Use this instead of plain SRT burn-in unless the user explicitly asks for simple static subtitles.
Detailed usage guide, README, and effect preview: https://github.com/0x00000003/remotion-word-highlight-subtitles
Use this skill for requests like:
If the user only gives one video path, write the output next to the source video.
.mp4, .mov, .m4v, or audio/video file with usable audio.<source-stem>_remotion逐词高亮字幕.mp4 in the same directory unless the user names another target.--language zh --word_timestamps True --output_format json.startMs and endMs. Long Whisper segments must be split by punctuation, timing gaps, duration, or length before rendering.height * 0.28 bottom padding, then adjust only if the source framing clearly needs it.height * 0.056 (about 60px) and never render below 56px; for vertical 720x1280, target 40-46px. For other sizes, scale from height but enforce the same perceived size in still QA.。 , 、 ? ! : ; , . ? ! : ;. Keep punctuation only for splitting/timing unless the user explicitly asks for verbatim punctuation.ffprobe for width, height, fps, duration, and audio presence.turbo is a good default when available. Use --initial_prompt when the glossary contains high-risk names or English terms.public/captions.json using scripts/whisper_json_to_captions.py, passing corrections with --replace or --replace-phrase. Apply transcript corrections, merge terms, strip display punctuation, and keep the helper's caption splitting enabled if Whisper returns long paragraphs.public/input.mp4 or encode an H.264 compatibility copy as public/input-h264.mp4.ffprobe; extract and inspect a still from the rendered video to confirm the final encoded file kept the approved subtitle look.Use this shape, adjusting model and paths as needed:
whisper "/absolute/path/video.mp4" --model turbo --language zh --word_timestamps True --output_format json --output_dir "/absolute/path"
For names and mixed Chinese/English topics, add a short prompt rather than relying on Whisper to infer them:
whisper "/absolute/path/video.mp4" \
--model turbo \
--language zh \
--word_timestamps True \
--output_format json \
--initial_prompt "本视频可能出现这些词:Cursor、Kimi 2.5、马斯克、AI大模型、转推、套壳。" \
--output_dir "/absolute/path"
If Whisper fails on the video container, extract audio first:
ffmpeg -y -i "/absolute/path/video.mp4" -vn -ac 1 -ar 16000 "/absolute/path/video_audio.wav"
Then run Whisper on the WAV and keep the output naming clear.
After Whisper finishes, print or read the segment transcript before building captions. Treat this as a required gate, not a nice-to-have.
Check especially:
Cursor, Kimi 2.5, ChatGPT, Claude, OpenAICreate a short correction map and apply it before rendering. Use --replace for single-token fixes and --replace-phrase for words split across adjacent Whisper tokens.
Example:
python3 scripts/whisper_json_to_captions.py \
"/absolute/path/transcript.json" \
"/absolute/path/remotion-project/public/captions.json" \
--replace-phrase "科舍=Cursor" \
--replace-phrase "KMI 2.5=Kimi 2.5" \
--replace-phrase "Kimi 2.5=Kimi 2.5" \
--replace-phrase "AI却=AI圈" \
--replace-phrase "死腿=转推" \
--merge-term "Cursor" \
--merge-term "Kimi 2.5" \
--keyword "AI" \
--keyword "Kimi 2.5"
If a correction is uncertain, prefer rerunning Whisper with a better --initial_prompt or inspect the relevant audio/video moment before deciding. Report the correction map in the final response.
Remotion should load public/captions.json with this shape:
[
{
"text": "我们用手机随便拍张照片",
"startMs": 0,
"endMs": 1600,
"tokens": [
{ "text": "我们", "startMs": 0, "endMs": 300, "keyword": false },
{ "text": "手机", "startMs": 440, "endMs": 700, "keyword": true }
]
}
]
Use the helper script:
python3 scripts/whisper_json_to_captions.py \
"/absolute/path/transcript.json" \
"/absolute/path/remotion-project/public/captions.json" \
--keyword "提示词" \
--keyword "Codex" \
--replace-phrase "错识别词=正确词" \
--max-caption-chars 28 \
--max-caption-duration-ms 4200 \
--split-gap-ms 260 \
--min-punctuation-caption-ms 900
Run the transcript QA pass before this conversion command. If any correction changes adjacent tokens into one display term, also pass that final term with --merge-term or a same-text --replace-phrase so the highlight appears as a clean word instead of broken characters.
By default, scripts/whisper_json_to_captions.py strips visible display punctuation from final caption text and tokens while still using punctuation as a split signal. Do not pass --keep-display-punctuation unless the user explicitly asks for verbatim transcript punctuation. If building JSON manually, remove punctuation-only tokens and trim trailing punctuation before rendering.
The helper splits long Whisper segments by visible length, duration, punctuation, and word timing gaps. Keep this behavior on for short-video subtitles; otherwise a single Whisper paragraph can become an unreadable multi-line caption.
The caption layer should:
OffthreadVideo for the source video.captions.json with delayRender, continueRender, and staticFile.currentMs >= startMs && currentMs < endMs.currentMs is within the token's timing.letterSpacing: 0.60px and never below 56px; if a still looks like small UI annotation text, enlarge and rerender.WebkitTextStroke as a thick Chinese subtitle outline. It easily eats the white fill at small resolutions. Prefer multi-direction textShadow; if WebkitTextStroke is used at all, keep it at or below 1.5px and verify a still.<= 0.16, tight padding, and verify it does not look like a dark banner.Reject and revise any still where the caption has muddy/gray text, a thick black halo, a large black box, clipped words, awkward wrapping, visible sentence punctuation, text that is too small for the frame, or placement over the mouth/chin in a talking-head video.
Use these style constants as the baseline:
const captionBottom = Math.round(height * 0.28);
const isLandscape = width >= height;
const baseCaptionFontSize = Math.round(height * (isLandscape ? 0.056 : 0.032));
const minCaptionFontSize = isLandscape && height >= 1000 ? 56 : Math.round(height * 0.032);
const captionFontSize = Math.max(baseCaptionFontSize, minCaptionFontSize);
const captionMaxWidth = Math.round(width * 0.88);
const activeColor = "#FFE45C";
const keywordColor = "#D6FFF8";
const outlinePx = Math.min(3, Math.max(1.5, captionFontSize * 0.055));
Use a clean shadow outline rather than a heavy stroke:
const textShadow = [
`${outlinePx}px 0 0 rgba(0, 0, 0, 0.96)`,
`-${outlinePx}px 0 0 rgba(0, 0, 0, 0.96)`,
`0 ${outlinePx}px 0 rgba(0, 0, 0, 0.96)`,
`0 -${outlinePx}px 0 rgba(0, 0, 0, 0.96)`,
`0 ${outlinePx * 1.4}px ${outlinePx * 1.4}px rgba(0, 0, 0, 0.72)`,
].join(", ");
For a 1920x1080 landscape video, this yields about fontSize: 60, and it must not fall below 56. For a 720x1280 vertical video, this is roughly paddingBottom: 358, fontSize: 41, maxWidth: 634, and a 2px outline. For the original 2160x2974 source that inspired this skill, the equivalent bottom padding was about 830.
If the user does not specify keywords, infer a small set from the transcript:
Codex, Remotion, Whisper提示词, 照片, 手机, 封面Keep keyword highlighting sparse. Too many colored words makes the active yellow word less clear.
Prefer:
<source-stem>_remotion逐词高亮字幕.mp4
If iterating for the same source, add _v2, _v3, etc. Do not overwrite the user's source file.
Tell the user the output path, mention that it used the reusable Remotion + Whisper word timestamp flow, and include the transcript correction map or say no corrections were needed. If any verification step could not be run, say that plainly.