
I vibe-coded something like this for Markdown extraction just a week ago: https://github.com/promptware/readweb

Open-sourced it just now.

More specifically, it works like this:

  suggestPreset: HTML -> Preset (via LLM)
  applyPreset: HTML + Preset -> Markdown (programmatically)
Where preset is:

  type Preset = {
    // anchors to make this preset more fragile on purpose.
    // Elements that identify website engine layout go here.
    preset_match_detectors: CSSSelector[];
    // main content extractors
    main_content_selectors: CSSSelector[];
    // filter selectors to trim the main content.
    // banners, subscription forms, sponsor content, etc.
    main_content_filters: CSSSelector[];
  };
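
To make the applyPreset step concrete, here is a minimal sketch of how the three selector lists could interact. It is not the readweb code: the `El` toy node type, the `matches`/`findAll`/`collectText` helpers, the rule that every detector must match, and the plain-text output (instead of real Markdown) are all assumptions made to keep the example self-contained. Only `tag`, `.class` and `#id` selectors are handled.

```typescript
type CSSSelector = string;

type Preset = {
  preset_match_detectors: CSSSelector[];
  main_content_selectors: CSSSelector[];
  main_content_filters: CSSSelector[];
};

// Hypothetical toy DOM node; a real tool would parse HTML with a proper library.
type El = { tag: string; id?: string; classes?: string[]; text?: string; children?: El[] };

// Assumption: only "tag", ".class" and "#id" selectors are supported here.
function matches(el: El, sel: CSSSelector): boolean {
  if (sel.charAt(0) === "#") return el.id === sel.slice(1);
  if (sel.charAt(0) === ".") return (el.classes ?? []).indexOf(sel.slice(1)) !== -1;
  return el.tag === sel;
}

// Depth-first search for every node matching the selector.
function findAll(root: El, sel: CSSSelector, out: El[] = []): El[] {
  if (matches(root, sel)) out.push(root);
  for (const child of root.children ?? []) findAll(child, sel, out);
  return out;
}

// Collect text in document order, skipping any subtree hit by a filter.
function collectText(el: El, filters: CSSSelector[], out: string[]): void {
  if (filters.some((f) => matches(el, f))) return;
  if (el.text) out.push(el.text);
  for (const child of el.children ?? []) collectText(child, filters, out);
}

// Returns null when the page does not look like the layout the preset targets.
// Assumption: all detectors must match for the preset to apply.
function applyPreset(root: El, preset: Preset): string | null {
  const detected = preset.preset_match_detectors.every((d) => findAll(root, d).length > 0);
  if (!detected) return null;
  const paragraphs: string[] = [];
  for (const sel of preset.main_content_selectors) {
    for (const block of findAll(root, sel)) {
      collectText(block, preset.main_content_filters, paragraphs);
    }
  }
  return paragraphs.join("\n\n");
}
```

The null return is where the deliberate fragility shows up: if the site changes its layout and a detector stops matching, the preset refuses to apply rather than quietly extracting the wrong content.
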
suggestPreset runs a feedback loop: it refines the preset, applies it, and repeats until the resulting markdown is really clean.
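
The feedback loop could be sketched roughly like this. Everything beyond the `Preset` shape is an assumption for illustration: the `Steps` callbacks (`propose`, `apply`, `review`), the `Review` type, and the `maxRounds` cap are hypothetical names, and the callbacks are kept synchronous for brevity even though a real LLM call would be async.

```typescript
type CSSSelector = string;

type Preset = {
  preset_match_detectors: CSSSelector[];
  main_content_selectors: CSSSelector[];
  main_content_filters: CSSSelector[];
};

// Hypothetical verdict on the extracted markdown.
type Review = { clean: boolean; notes: string };

// The three steps are injected, so the loop assumes nothing about the LLM or parser.
type Steps = {
  // In practice an LLM call; receives the previous attempt and the reviewer's notes.
  propose: (html: string, previous: Preset | null, notes: string) => Preset;
  // Programmatic extraction, no LLM involved.
  apply: (html: string, preset: Preset) => string;
  // Could be an LLM or a heuristic; decides whether the markdown is clean enough.
  review: (markdown: string) => Review;
};

type Suggestion = { preset: Preset; clean: boolean; rounds: number };

// Propose a preset, apply it, review the output, and refine until clean
// or until the round budget runs out.
function suggestPreset(html: string, steps: Steps, maxRounds = 5): Suggestion {
  let preset = steps.propose(html, null, "");
  for (let round = 1; ; round++) {
    const review = steps.review(steps.apply(html, preset));
    if (review.clean || round >= maxRounds) {
      return { preset, clean: review.clean, rounds: round };
    }
    preset = steps.propose(html, preset, review.notes);
  }
}
```

Capping the rounds keeps LLM cost bounded, and returning the `clean` flag lets the caller decide what to do with a preset that never converged.
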

