# Webpage to Markdown

Fetch a URL and convert the main content to clean Markdown.

## What it does

- Downloads a web page and extracts the primary article content.
- Strips navigation, ads, cookie banners, and boilerplate.
- Converts to Markdown, preserving headings, links, lists, and code.
- Flags JS-heavy pages that need a renderer instead of guessing.

## Files

| File | Purpose |
|------|---------|
| `SKILL.md` | Instructions the agent receives on activation |
| `scripts/fetch_md.py` | Fetch a URL and emit clean Markdown |
| `references/extraction.md` | How extraction works; when to fall back |

## Requirements

Installs `requests`, `readability-lxml`, and `markdownify` on first use.

## License

Apache-2.0.

---
name: webpage-to-markdown
display_name: Webpage to Markdown
description: "Fetch a web page and convert it to clean, readable Markdown, stripping nav, ads, and boilerplate. Use when the user shares a URL and wants the article content, a readable copy, a summary source, or to save a page as Markdown. Do NOT use for pages behind a login or for downloading binary files."
license: Apache-2.0
---

# Webpage to Markdown

Fetch a URL and extract the main content as clean Markdown, dropping navigation, ads, cookie banners, and other boilerplate.

## When to use

The user shares a link and wants the readable article text: to read, summarize, archive, or feed into another step.

## Execution steps

1. **Fetch + extract**: `python scripts/fetch_md.py <url> -o out.md`. The script downloads the page, runs readability-style main-content extraction, and converts the result to Markdown (installs `requests`, `readability-lxml`, and `markdownify` on first use).
2. **Fallback**: if extraction yields too little (heavy-JS site), note that the page likely needs a JS renderer (e.g. the Firecrawl or Fetch MCP) and report what little was recovered rather than inventing content.
3. **Clean up**: collapse excessive blank lines and keep links/headings/code.
4. **Return** the Markdown (and the saved path). Include the source URL at the top.

## Rules

- Preserve headings, links, lists, and code blocks; drop nav/ads/footers.
- Never fabricate content that wasn't on the page.
- Respect the site: single fetch, honor an obvious paywall/login wall by stopping.

## Available resources

- `scripts/fetch_md.py`: fetch a URL and emit clean Markdown of the main content.
- `references/extraction.md`: how extraction works and when to fall back to a renderer.
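
The execution steps above can be sketched roughly as follows. This is an illustrative outline, not the contents of `scripts/fetch_md.py`: the function names, the `MIN_CHARS` threshold of 200 characters, and the `Source:` header format are all assumptions made for the example. The third-party imports are kept inside `fetch_main_markdown` so the cleanup helpers can be used without them.

```python
import re

# Hypothetical threshold: the skill does not state how little is "too little".
MIN_CHARS = 200


def fetch_main_markdown(url: str) -> str:
    """Step 1: fetch the page once and convert its main content to Markdown."""
    import requests
    from readability import Document
    from markdownify import markdownify

    resp = requests.get(url, timeout=30)  # single fetch, no retries
    resp.raise_for_status()
    main_html = Document(resp.text).summary()  # readability-style extraction
    return markdownify(main_html, heading_style="ATX")


def looks_too_thin(markdown: str, min_chars: int = MIN_CHARS) -> bool:
    """Step 2: very little text suggests the page is rendered by JavaScript."""
    text = re.sub(r"\s+", " ", markdown).strip()
    return len(text) < min_chars


def collapse_blank_lines(markdown: str) -> str:
    """Step 3: squeeze runs of blank lines, leaving fenced code untouched."""
    out = []
    in_fence = False
    blank_run = 0
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        if not in_fence and not line.strip():
            blank_run += 1
            if blank_run > 1:
                continue  # drop every blank line after the first in a run
        else:
            blank_run = 0
        out.append(line)
    return "\n".join(out).strip("\n") + "\n"


def build_output(url: str, markdown: str) -> str:
    """Step 4: put the source URL at the top and flag thin extractions."""
    body = collapse_blank_lines(markdown)
    header = f"Source: <{url}>\n\n"
    if looks_too_thin(body):
        note = (
            "> Note: little content was recovered; "
            "the page likely needs a JS renderer.\n\n"
        )
        return header + note + body
    return header + body
```

A caller would chain these as `build_output(url, fetch_main_markdown(url))` and write the result to the output path. When extraction is thin, the sketch still returns whatever was recovered and adds a note instead of filling in content.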
Webpage to Markdown by langbot-team