Hfpclawer Paper Search

An agent skill for research scenarios.

SKILL.md

name: hfpclawer-paper-search
description: >
  Discover, download, and organize academic papers from arXiv, HuggingFace Papers,
  and OpenReview. Multi-source search → dedup → PDF download → Markdown conversion →
  optional wiki sync. Designed for researchers who want to monitor new papers daily.
category: research
author: Li Shen
version: 1.0.0
metadata:
  hermes:
    tags: [paper, search, pdf, download, research, arxiv, monitoring]
    related_skills: [hfpclawer-citation-audit]

hfpclawer Paper Search & Download

A multi-source academic paper pipeline: search across arXiv / HuggingFace Papers /
OpenReview / PapersWithCode, deduplicate by title, download PDFs, convert to
Markdown, and optionally sync to a wiki.

Who this is for: Researchers who want a daily "new papers on my topic" feed
without manually checking multiple websites.

Overview

Typical workflow, one command per stage:

hfpclawer search           # Discover new papers across sources
   └── ranked by relevance to your keywords
hfpclawer download         # Download PDFs for matched papers  
   └── 8 concurrent streams
hfpclawer convert --to-wiki # PDF → readable Markdown + wiki sync

Or run the full pipeline at once:

hfpclawer full --max-pages 3 --to-wiki
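
The "8 concurrent streams" of the download stage can be pictured as a bounded worker pool. The sketch below is an illustration only, not hfpclawer's actual implementation: `download_all` and the `fetch_pdf` callable are hypothetical names, and the real tool may use a different concurrency model.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def download_all(
    paper_ids: list[str],
    fetch_pdf: Callable[[str], bytes],  # hypothetical: returns PDF bytes for one ID
    max_workers: int = 8,               # mirrors the "8 concurrent streams" above
) -> tuple[dict[str, bytes], dict[str, Exception]]:
    """Fetch PDFs with at most max_workers downloads in flight.

    Returns (done, failed): PDF bytes by ID, and the exception by ID
    for every download that raised.
    """
    done: dict[str, bytes] = {}
    failed: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_pdf, pid): pid for pid in paper_ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                done[pid] = future.result()
            except Exception as exc:  # one failed paper should not stop the batch
                failed[pid] = exc
    return done, failed
```

Collecting failures instead of raising on the first one lets a batch finish even when a single source is unavailable.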

Prerequisites

pip install "hfpclawer>=0.5.0"      # Quote the spec so the shell does not treat >= as a redirect
hfpclawer init                      # Creates config.yaml in current directory

Edit config.yaml with your search interests (see Configuration section below).

Quick Start

1. First-time Setup

# Create default config
hfpclawer init

# Edit the config to match your research interests
vim config.yaml
# → Change: search.queries, keywords.include_high, keywords.exclude

2. One-Shot Full Pipeline (daily use)

# Discover → Download → Convert → Wiki sync in one command
hfpclawer full

# Limit pages for a quick check
hfpclawer full --max-pages 3 --to-wiki

3. Step-by-Step (for debugging)

# Step 1: Search across all sources
hfpclawer search --max-pages 5

# Step 2: Download PDFs for matched papers
hfpclawer download

# Step 3: Convert PDFs to Markdown
hfpclawer convert

# Step 4: Sync to wiki directory
hfpclawer convert --to-wiki

4. Monitor New Papers Regularly

# Check what papers have been downloaded
hfpclawer list

# Show paper store statistics
hfpclawer store stats

# Start the real-time download monitor
hfpclawer monitor start

Configuration

The config file config.yaml controls what papers are searched and downloaded:

search:
  max_per_dim: 50           # Papers per search query per source
  queries:
    - query: "neural operator"
      category: neural-operator
    - query: "physics-informed"
      category: physics-informed
    - query: "PDE solver deep learning"
      category: pde-solver

keywords:
  include_high:              # Papers must match at least one of these (OR)
    - "neural operator"
    - "pde"
    - "deep learning"
  include_low:               # Optional bonus keywords
    - "fourier"
    - "self-attention"
  exclude:                   # Exclude these topics
    - "quantum"
    - "llm"

classification:
  threshold_pass: 30         # Relevance score threshold (0-100)
  title_similarity_min: 0.40 # Dedup threshold

paths:
  data_dir: "data"           # SQLite DB location
  pdf_dir: "pdfs"            # Downloaded PDFs
  md_dir: "mds"              # Converted Markdown files
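
The scoring formula behind `threshold_pass` is not documented here. As a rough sketch of how `include_high`, `include_low`, `exclude`, and `threshold_pass` could interact, the following assumes a hypothetical scheme: any excluded term gives a score of 0, at least one high-priority term is required, and the weights (30 per high term, 10 per low term) are invented for illustration. `KeywordConfig`, `relevance_score`, and `passes` are not part of hfpclawer.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordConfig:
    """Mirrors the keywords and classification sections of config.yaml."""
    include_high: list[str] = field(default_factory=list)
    include_low: list[str] = field(default_factory=list)
    exclude: list[str] = field(default_factory=list)
    threshold_pass: int = 30


def relevance_score(text: str, cfg: KeywordConfig) -> int:
    """Hypothetical 0-100 score; the 30/10 weights are assumptions."""
    lowered = text.lower()
    # Any excluded term rejects the paper outright.
    if any(term.lower() in lowered for term in cfg.exclude):
        return 0
    # At least one high-priority term is required (OR semantics).
    high_hits = sum(term.lower() in lowered for term in cfg.include_high)
    if high_hits == 0:
        return 0
    # Low-priority terms only add a bonus.
    low_hits = sum(term.lower() in lowered for term in cfg.include_low)
    return min(100, 30 * high_hits + 10 * low_hits)


def passes(text: str, cfg: KeywordConfig) -> bool:
    """True when the score reaches the configured threshold."""
    return relevance_score(text, cfg) >= cfg.threshold_pass
```

Matching here is by plain substring, so a short keyword such as "pde" also matches "PDEs"; a stricter variant could use word boundaries.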

Available Commands

| Command | Purpose | Common Flags |
|---------|---------|-------------|
| hfpclawer search | Discover new papers | --max-pages, --dry-run |
| hfpclawer download | Download PDFs | (runs from search results) |
| hfpclawer convert | Convert PDF → MD | --to-wiki syncs to raw/papers/ |
| hfpclawer full | All-in-one pipeline | --max-pages, --to-wiki |
| hfpclawer list | List downloaded papers | |
| hfpclawer store stats | Paper store statistics | |
| hfpclawer store export | Export store as JSON/CSV | --format json |
| hfpclawer store verify | Cross-verify paper metadata | --arxiv-id |
| hfpclawer config | Show current config | |
| hfpclawer mcp | Start MCP server | (for LLM integration) |
| hfpclawer monitor | Download daemon control | start, stop, status |
| hfpclawer dedup | Show dedup statistics | |
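
The dedup stage compares titles against `classification.title_similarity_min`. How hfpclawer measures similarity is not stated here; the sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in, and `normalize_title`, `title_similarity`, and `dedup_by_title` are hypothetical helper names.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", title.lower())
    return " ".join(cleaned.split())


def title_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def dedup_by_title(titles: list[str], threshold: float = 0.40) -> list[str]:
    """Keep the first title of each group whose similarity reaches threshold.

    The default mirrors title_similarity_min from config.yaml, but the real
    tool may use a different similarity measure.
    """
    kept: list[str] = []
    for title in titles:
        if all(title_similarity(title, seen) < threshold for seen in kept):
            kept.append(title)
    return kept
```

With a character-level ratio, a threshold as low as 0.40 can merge unrelated titles, so it is worth checking the numbers reported by `hfpclawer dedup` before lowering the setting further.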

Daily Routine Examples

Morning — Check What's New

# Quick scan (3 pages per query, ~50 papers)
hfpclawer search --max-pages 3

# View results
hfpclawer store stats

Afternoon — Download & Read

# Download all new papers
hfpclawer download

# Convert to readable markdown
hfpclawer convert

# Read the best one
head -n 80 mds/2010.08895.md

Weekly — Full Pipeline

# Full sweep with wiki sync
hfpclawer full --max-pages 10 --to-wiki

# Validate references in newly added papers
hfpclawer audit verify "Key cited paper" --source openalex

Data Storage

hfpclawer stores data in three tiers:

| Storage | Location | Content | Persistence |
|---------|----------|---------|-------------|
| SQLite | data/papers.db | Metadata, dedup, cross-ref | Persistent |
| PDFs | pdfs/ | Raw paper PDFs | Download once, keep |
| Markdown | mds/ | Converted text | Regeneratable from PDFs |

The paper store tracks:

arXiv ID, title, authors, abstract
Source of discovery (HF / arXiv / OpenReview)
Download status, conversion status
Wikified path (if synced)
Cross-verification with Crossref (DOI validation)
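
The schema of data/papers.db is not documented here. To illustrate how such a store can track the fields above and reject duplicate IDs, the sketch below assumes a hypothetical `papers` table; the table name, column names, and helper functions are all invented for illustration and will not match the real database.

```python
import sqlite3

# Hypothetical schema: not the real layout of data/papers.db.
SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    arxiv_id   TEXT PRIMARY KEY,
    title      TEXT NOT NULL,
    source     TEXT NOT NULL,
    downloaded INTEGER NOT NULL DEFAULT 0,
    converted  INTEGER NOT NULL DEFAULT 0
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_paper(conn: sqlite3.Connection, arxiv_id: str, title: str, source: str) -> bool:
    """Insert a paper; return False if the ID is already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO papers (arxiv_id, title, source) VALUES (?, ?, ?)",
        (arxiv_id, title, source),
    )
    conn.commit()
    return cur.rowcount == 1


def store_stats(conn: sqlite3.Connection) -> dict[str, int]:
    """Count stored, downloaded, and converted papers."""
    total, downloaded, converted = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(downloaded), 0), COALESCE(SUM(converted), 0) "
        "FROM papers"
    ).fetchone()
    return {"total": total, "downloaded": downloaded, "converted": converted}
```

Using the paper ID as the primary key makes repeated discoveries of the same paper from different sources a no-op at insert time.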

Common Pitfalls

pip install must run in the right venv. If the hfpclawer command is not found, check the active Python environment.

HuggingFace CLI rate limits. Too many queries per minute will trigger HTTP 429 responses. Reduce max_per_dim to 10 if this happens.

Scrapy spiders need the scrapy extra installed. If you see `ModuleNotFoundError: scrapy`, run `pip install "hfpclawer[scrapy]"`.

PDF conversion needs pymupdf4llm. Run pip install "hfpclawer[pdf]" if hfpclawer convert complains about missing pymupdf4llm.

Wiki sync defaults to raw/papers/. If you do not have a wiki directory, skip --to-wiki and read from mds/ directly.

First run creates a config.yaml. Edit it before running hfpclawer full, otherwise the default queries may not match your research area.
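
For scripts that call a rate-limited source directly, the usual complement to lowering max_per_dim is retrying with exponential backoff. The sketch below is a generic pattern, not something hfpclawer is known to expose: `RateLimited`, `with_backoff`, and the `fetch` callable are hypothetical names.

```python
import time
from typing import Callable


class RateLimited(Exception):
    """Raised by the caller's fetch function when the server answers 429."""


def with_backoff(
    fetch: Callable[[], str],             # hypothetical: performs one request
    max_retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch, waiting base_delay * 2**attempt seconds after each 429."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable: the loop returns or raises")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.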

Verification Checklist

[ ] hfpclawer init creates a valid config.yaml
[ ] hfpclawer search --dry-run validates config without network calls
[ ] hfpclawer search --max-pages 3 returns real papers
[ ] hfpclawer download downloads PDFs correctly
[ ] hfpclawer convert produces readable Markdown
[ ] hfpclawer store stats shows non-zero counts
[ ] hfpclawer store verify --arxiv-id 2010.08895 cross-checks via Crossref

Category: Research
Risk tags: network access
Files: 1 (SKILL.md, 6,782 B)
