suggestPreset: HTML -> Preset (via LLM)
applyPreset: HTML + Preset -> Markdown (programmatically)
Where preset is:
type Preset = {
  // Anchors that make this preset fragile on purpose:
  // elements that identify the website engine's layout go here.
  preset_match_detectors: CSSSelector[];
  // Selectors that extract the main content.
  main_content_selectors: CSSSelector[];
  // Filter selectors that trim the main content:
  // banners, subscription forms, sponsor content, etc.
  main_content_filters: CSSSelector[];
};
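To illustrate what "fragile on purpose" could mean in practice, here is a minimal sketch of a match check. The function name presetMatches, the Queryable interface, and the rule that every detector must match are assumptions for illustration, not taken from the released code.

```typescript
type CSSSelector = string; // assumption: selectors are plain strings

type Preset = {
  preset_match_detectors: CSSSelector[];
  main_content_selectors: CSSSelector[];
  main_content_filters: CSSSelector[];
};

// Structural subset of the standard DOM API; a browser `Document`
// or `Element` satisfies it, and so does a small fake in tests.
interface Queryable {
  querySelector(selector: string): unknown;
}

// Hypothetical helper. Assumption: a preset applies to a page only if
// EVERY detector finds an element, so a layout change makes the preset
// stop matching instead of silently extracting the wrong content.
// Assumption: a preset with no detectors matches nothing.
function presetMatches(root: Queryable, preset: Preset): boolean {
  const detectors = preset.preset_match_detectors;
  if (detectors.length === 0) {
    return false;
  }
  return detectors.every((selector) => root.querySelector(selector) !== null);
}
```

Requiring all detectors to match is the strict reading of the comment in the type; an "any detector matches" rule would make presets more forgiving but also more likely to be applied to the wrong layout.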
suggestPreset runs a feedback loop: it refines the preset, applies it, and repeats until the resulting Markdown is clean.
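The feedback loop might look roughly like the sketch below. Everything except the Preset shape is an assumption: the names suggestPresetLoop, proposePreset, isClean, and maxRounds are hypothetical, and the callbacks are synchronous only to keep the sketch short (a real LLM call would be asynchronous).

```typescript
type CSSSelector = string; // assumption: selectors are plain strings

type Preset = {
  preset_match_detectors: CSSSelector[];
  main_content_selectors: CSSSelector[];
  main_content_filters: CSSSelector[];
};

type Attempt = { preset: Preset; markdown: string };

// Hypothetical collaborators, injected so the loop can run without an LLM.
type LoopDeps = {
  // Asks the LLM for a preset; `previous` carries the last attempt as feedback.
  proposePreset: (html: string, previous?: Attempt) => Preset;
  // Programmatic step: HTML + Preset -> Markdown.
  applyPreset: (html: string, preset: Preset) => string;
  // Decides whether the Markdown is clean enough to stop.
  isClean: (markdown: string) => boolean;
};

function suggestPresetLoop(
  html: string,
  deps: LoopDeps,
  maxRounds: number = 5, // assumption: cap the rounds so the loop always ends
): Attempt & { rounds: number } {
  let preset = deps.proposePreset(html);
  let markdown = deps.applyPreset(html, preset);
  let rounds = 1;
  while (!deps.isClean(markdown) && rounds < maxRounds) {
    // Feed the previous preset and its output back in for refinement.
    preset = deps.proposePreset(html, { preset, markdown });
    markdown = deps.applyPreset(html, preset);
    rounds += 1;
  }
  return { preset, markdown, rounds };
}
```

Injecting the three callbacks keeps the loop independent of any particular LLM client or HTML parser, and the round cap guards against a page that never produces clean output.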
Open-sourced it just now.