Multi Search Engine
An Agent skill for Research scenarios. Original description: Multi search engine integration with 16 engines (7 CN + 9 Global). Supports advanced search operators, time filters, site search, privacy engines, and Wolfra...
name: vocational-ed-policy-scraper
displayName: 职业教育政策信息抓取工具 | Vocational Education Policy Scraper
description: 自动抓取教育部、人社部及各省教育厅官网的职业教育政策文件和课题申报信息。支持按关键词筛选和定期汇总。| Automatically scrapes vocational education policy documents and project announcements from Ministry of Education, Ministry of Human Resources, and provincial education departments. Supports keyword filtering and periodic summaries. Use when searching for: (1) Vocational education policy documents, (2) Teaching achievement award applications, (3) Industry-education integration files, (4) 1+X certificate policies, (5) Double high plan notifications
category: research
version: 1.0.0
author: Hermes
tags: [vocational-education, policy-scraper, government-documents, research, chinese]
languages: [zh, en]
自动抓取教育部、人社部及各省教育厅官网的职业教育政策文件、课题申报信息,支持按关键词筛选和定期汇总。
Automatically scrapes vocational education policy documents and project announcements from Ministry of Education, Ministry of Human Resources, and provincial education departments. Supports keyword filtering and periodic summaries.
支持的数据源 | Supported Sources:
抓取内容 | Content Types:
分类体系 | Classification System:
- policy: 政策文件 (Policy Documents)
- project: 课题申报 (Project Applications)
- achievement: 教学成果奖 (Teaching Achievement Awards)
- integration: 产教融合 (Industry-Education Integration)
- certificate: 1+X证书 (1+X Certificates)
- double_high: 双高计划 (Double High Plan)

筛选功能 | Filtering Capabilities:
汇总功能 | Summary Features:
中文示例 | Chinese Examples:
# 抓取最近30天的所有政策文件
python scripts/scrape_voc_ed_policy.py --days 30
# 按关键词筛选(双高计划、产教融合)
python scripts/scrape_voc_ed_policy.py --keywords "双高计划" "产教融合" --days 30
# 按类别筛选(仅政策文件)
python scripts/scrape_voc_ed_policy.py --category policy --days 7
# 综合筛选(关键词 + 类别 + 时间)
python scripts/scrape_voc_ed_policy.py --keywords "1+X证书" --category certificate --days 14
# 保存到指定文件
python scripts/scrape_voc_ed_policy.py --keywords "教学成果奖" --output results.json
English Examples:
# Scrape all policy documents from the last 30 days
python scripts/scrape_voc_ed_policy.py --days 30 --lang en
# Filter by keywords
python scripts/scrape_voc_ed_policy.py --keywords "双高计划" "产教融合" --days 30 --lang en
# Filter by category
python scripts/scrape_voc_ed_policy.py --category policy --days 7 --lang en
# Comprehensive filtering
python scripts/scrape_voc_ed_policy.py --keywords "1+X证书" --category certificate --days 14 --lang en
# Save to specified file
python scripts/scrape_voc_ed_policy.py --keywords "教学成果奖" --output results.json --lang en
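| 参数 (Parameter) | 说明 (Description) | 示例 (Example) |
|------|------|------|
| --keywords | 关键词列表,可传多个 (one or more keywords) | --keywords "双高计划" "产教融合" |
| --days | 回溯天数,默认30 (look-back window in days, default 30) | --days 7 |
| --category | 筛选类别 (category filter: policy, project, achievement, integration, certificate, double_high) | --category policy |
| --output | 输出文件路径 (output file path) | --output results.json |
| --lang | 语言 (language: zh or en) | --lang zh |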
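The flags in the table above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the actual parser in scripts/scrape_voc_ed_policy.py: the `build_parser` name, the help strings, and the defaults for `--output` and `--lang` are assumptions. Only the flag names, the 30-day default, and the category values come from this document.

```python
import argparse

# Category values taken from the classification list in this document.
CATEGORIES = ["policy", "project", "achievement",
              "integration", "certificate", "double_high"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(
        description="Scrape vocational education policy documents")
    # nargs="*" lets several quoted keywords follow one --keywords flag.
    parser.add_argument("--keywords", nargs="*", default=[],
                        help="keywords to match")
    parser.add_argument("--days", type=int, default=30,
                        help="look-back window in days")
    parser.add_argument("--category", choices=CATEGORIES, default=None,
                        help="restrict results to one category")
    # Assumed default: without --output, nothing is written to disk.
    parser.add_argument("--output", default=None,
                        help="path of the JSON file to write")
    # Assumed default: zh.
    parser.add_argument("--lang", choices=["zh", "en"], default="zh",
                        help="language of terminal messages")
    return parser


# Parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args(
    ["--keywords", "双高计划", "产教融合", "--days", "7"])
print(args.keywords, args.days, args.category)
```

Restricting `--category` with `choices` makes a typo such as `--category polcy` fail immediately instead of silently returning no results.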
中文:
明确需要抓取的内容类型、时间范围、关键词和类别。
English:
Clarify the content type, time range, keywords, and category needed.
示例 | Example:
中文:
根据需求配置参数,运行抓取脚本。
English:
Configure parameters based on requirements and run the scraping script.
python scripts/scrape_voc_ed_policy.py --keywords "双高计划" --category policy --days 30
中文:
抓取完成后,查看生成的JSON文件或终端输出摘要。
English:
After scraping is complete, review the generated JSON file or terminal summary.
输出格式 | Output Format:
{
  "websites_scraped": 3,
  "total_documents": 45,
  "results": [
    {
      "title": "教育部关于公布中国特色高水平高职学校和专业建设计划名单的通知",
      "url": "https://www.moe.gov.cn/...",
      "date": "2024-01-15",
      "source": "教育部",
      "category": "double_high",
      "keywords": ["双高计划", "高职学校"]
    }
  ],
  "errors": [],
  "timestamp": "2024-01-20T10:30:00",
  "filters": {
    "keywords": ["双高计划"],
    "days": 30,
    "category": "policy"
  }
}
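How the `filters` block relates to each record can be illustrated with a small filter applied to the output after the fact. This is a sketch under assumptions: the `matches` helper is hypothetical, a keyword counts as matched when it appears in the title or in the record's `keywords` list, and the category comparison is exact. The real script may apply its filters differently.

```python
from datetime import date, timedelta


def matches(item, keywords=(), days=None, category=None, today=None):
    """Return True if one result record passes all the given filters."""
    today = today or date.today()
    if category and item["category"] != category:
        return False
    if days is not None:
        published = date.fromisoformat(item["date"])
        if published < today - timedelta(days=days):
            return False
    if keywords:
        haystack = item["title"] + " " + " ".join(item.get("keywords", []))
        if not any(kw in haystack for kw in keywords):
            return False
    return True


# Record copied from the sample output; ref_day is the sample's timestamp date.
record = {
    "title": "教育部关于公布中国特色高水平高职学校和专业建设计划名单的通知",
    "date": "2024-01-15",
    "category": "double_high",
    "keywords": ["双高计划", "高职学校"],
}
ref_day = date(2024, 1, 20)
print(matches(record, keywords=["双高计划"], days=30, today=ref_day))
print(matches(record, category="policy", today=ref_day))
```

Under this strict reading, a `double_high` record would not pass a `policy` category filter, so the sample output is presumably illustrative rather than the result of an exact category match.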
中文:
根据抓取结果进行分析,生成汇总报告。
English:
Analyze the scraped results and generate summary reports.
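One simple form of summary is a count of documents per category and per source. The sketch below assumes the output format documented for the scraper; the `summarize` function and the sample records are hypothetical and made up for illustration.

```python
from collections import Counter


def summarize(results):
    """Count documents per category and per source."""
    items = results.get("results", [])
    return {
        "total": len(items),
        "by_category": dict(Counter(i["category"] for i in items)),
        "by_source": dict(Counter(i["source"] for i in items)),
        "errors": len(results.get("errors", [])),
    }


# Made-up sample data, reduced to the two fields the summary needs.
sample = {
    "results": [
        {"category": "double_high", "source": "教育部"},
        {"category": "policy", "source": "教育部"},
        {"category": "policy", "source": "人社部"},
    ],
    "errors": [],
}
print(summarize(sample))
```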
中文:
使用 cron 设置定期抓取任务。
English:
Use cron to schedule recurring scraping tasks.
# 每天早上8点抓取最近30天的政策文件 | Every day at 08:00, scrape the last 30 days of policy documents
0 8 * * * python /path/to/scripts/scrape_voc_ed_policy.py --days 30 --output /path/to/results/daily_$(date +\%Y\%m\%d).json
# 每周一早上8点抓取最近7天的政策文件 | Every Monday at 08:00, scrape the last 7 days of policy documents
0 8 * * 1 python /path/to/scripts/scrape_voc_ed_policy.py --days 7 --output /path/to/results/weekly_$(date +\%Y\%m\%d).json
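Because the daily job re-scrapes a 30-day window, consecutive output files will largely overlap. One way to combine them is to merge by URL, as in the sketch below. The `merge_runs` helper is hypothetical, and it assumes every file follows the output format documented for the scraper and that `url` identifies a document uniquely.

```python
import json
from pathlib import Path


def merge_runs(paths):
    """Merge result files, keeping one record per URL (first seen wins)."""
    seen = {}
    # Sorting by file name puts dated files in chronological order.
    for path in sorted(paths):
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        for item in data.get("results", []):
            seen.setdefault(item["url"], item)
    return list(seen.values())
```

A typical call would be `merge_runs(Path("/path/to/results").glob("daily_*.json"))`, using the same placeholder directory as the crontab entries.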
中文:
在脚本中添加新的网站配置。
English:
Add new website configurations in the script.
EDU_WEBSITES = {
    "新增网站": {
        "base_url": "https://example.gov.cn",
        "policy_url": "https://example.gov.cn/policy/",
        "selectors": {
            "title": "a[title]",
            "date": ".date",
            "link": "a[href]"
        },
        "keywords": ["职业教育", "政策"]
    }
}
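Before adding an entry, it can help to check that it has the same shape as the example above. The sketch below is a hypothetical helper, not part of the script: `check_site_config` and the two sets of required keys are assumptions inferred from the single example entry shown in this document.

```python
from urllib.parse import urlparse

# Assumed to be required, based only on the example entry in this document.
REQUIRED_KEYS = {"base_url", "policy_url", "selectors", "keywords"}
REQUIRED_SELECTORS = {"title", "date", "link"}


def check_site_config(name, cfg):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    missing = REQUIRED_KEYS - set(cfg)
    if missing:
        problems.append(f"{name}: missing keys {sorted(missing)}")
    missing_sel = REQUIRED_SELECTORS - set(cfg.get("selectors", {}))
    if missing_sel:
        problems.append(f"{name}: missing selectors {sorted(missing_sel)}")
    base = urlparse(cfg.get("base_url", ""))
    policy = urlparse(cfg.get("policy_url", ""))
    if base.scheme not in ("http", "https"):
        problems.append(f"{name}: base_url must be http(s)")
    elif policy.netloc != base.netloc:
        problems.append(f"{name}: policy_url is not on the base_url host")
    return problems


new_site = {
    "base_url": "https://example.gov.cn",
    "policy_url": "https://example.gov.cn/policy/",
    "selectors": {"title": "a[title]", "date": ".date", "link": "a[href]"},
    "keywords": ["职业教育", "政策"],
}
print(check_site_config("新增网站", new_site))
```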
中文:
将JSON结果转换为其他格式(CSV、Markdown、HTML)。
English:
Convert JSON results to other formats (CSV, Markdown, HTML).
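# 导出为CSV | Export to CSV
import json
import pandas as pd

# 读取抓取脚本通过 --output 生成的JSON | Load the JSON written via --output
with open('results.json', encoding='utf-8') as f:
    results = json.load(f)

df = pd.DataFrame(results['results'])
# utf-8-sig 可让 Excel 正确显示中文 | utf-8-sig lets Excel display Chinese correctly
df.to_csv('results.csv', index=False, encoding='utf-8-sig')

# 导出为Markdown | Export to Markdown
def to_markdown(results):
    md = "# 职业教育政策抓取结果\n\n"
    for item in results['results']:
        md += f"## {item['title']}\n"
        md += f"- **来源**: {item['source']}\n"
        md += f"- **日期**: {item['date']}\n"
        md += f"- **链接**: {item['url']}\n\n"
    return md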
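The HTML export mentioned above could follow the same pattern. This is a minimal sketch using only the standard library; the `to_html` name and the list layout are assumptions, and the input is assumed to follow the documented output format.

```python
from html import escape


def to_html(results):
    """Render scraped results as a minimal HTML list."""
    rows = []
    for item in results["results"]:
        # Escape every field so titles containing & or < stay valid HTML.
        rows.append(
            '<li><a href="{url}">{title}</a> ({source}, {date})</li>'.format(
                url=escape(item["url"], quote=True),
                title=escape(item["title"]),
                source=escape(item["source"]),
                date=escape(item["date"]),
            )
        )
    return "<ul>\n" + "\n".join(rows) + "\n</ul>\n"
```

Writing the returned string to a file opened with `encoding="utf-8"` keeps the Chinese titles intact.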
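核心抓取脚本,支持: | Core scraping script, which supports:
教育部、人社部及各省教育厅官网列表,包含: | List of the official websites of the Ministry of Education, the Ministry of Human Resources, and provincial education departments, which includes:
国际化辅助模块,支持: | Internationalization helper module, which supports:
翻译文件,包含: | Translation files, which include: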
中文:
- 在 references/edu_websites.md 中添加网站信息
- 在 scripts/scrape_voc_ed_policy.py 的 EDU_WEBSITES 字典中添加配置

English:
- Add the website information to references/edu_websites.md
- Add the configuration to the EDU_WEBSITES dictionary in scripts/scrape_voc_ed_policy.py
中文:
欢迎提交问题和改进建议。在提交PR之前,请确保:
English:
Issues and improvement suggestions are welcome. Before submitting a PR, ensure:
版本: 1.0.0 | Version: 1.0.0
最后更新: 2024年 | Last Updated: 2024