Not to discount any actual utility or innovation here, but I was wondering "why would you hard-code to all these Azure services?" Then I saw the author is a solutions architect at Microsoft.
So this is likely part of Microsoft's AI strategy to lure developers in and create dependence. That doesn't mean it can't also be interesting/good, but it's important context for this project's purpose and goals.
I'm developing this in my free time because I think there is a need in the community for it. I'm not motivated in any way by my company.
In the meantime, if you have other technologies that achieve the same features (blob, queue, search), feel free to push a PR. Someone already did that for AWS: https://github.com/clemlesne/scrape-it-now/issues/8.
Am I right in thinking that running this regularly from your computer at home will cause you to receive more CAPTCHAs over time? If so, what are some other options?
Well, it solves basic problems like queuing and blob storage. For example, to achieve the same features as Queue Storage, you would need RabbitMQ or similar: in an enterprise environment, that means multiple instances for high availability, maintenance, and people to deploy it reproducibly…
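To make that concrete, here is a rough sketch (not the project's actual code) of the kind of thin queue abstraction involved, assuming the `azure-storage-queue` and `pika` Python packages; the class and method names are made up for illustration:

```python
# Hypothetical sketch: one queue interface, two interchangeable backends.
# Illustrates why a managed service (Azure Queue Storage) is less operational
# work than self-hosting RabbitMQ. Names are illustrative only.
from abc import ABC, abstractmethod

from azure.storage.queue import QueueClient  # pip install azure-storage-queue
import pika                                  # pip install pika


class MessageQueue(ABC):
    @abstractmethod
    def push(self, message: str) -> None: ...

    @abstractmethod
    def pop(self) -> str | None: ...


class AzureQueue(MessageQueue):
    """Managed backend: no broker to deploy, patch, or cluster."""

    def __init__(self, connection_string: str, queue_name: str) -> None:
        self._client = QueueClient.from_connection_string(connection_string, queue_name)

    def push(self, message: str) -> None:
        self._client.send_message(message)

    def pop(self) -> str | None:
        # Take the first available message, delete it, and return its content.
        for msg in self._client.receive_messages():
            self._client.delete_message(msg)
            return msg.content
        return None


class RabbitQueue(MessageQueue):
    """Self-hosted backend: you also own HA, upgrades, and monitoring."""

    def __init__(self, host: str, queue_name: str) -> None:
        self._queue = queue_name
        self._conn = pika.BlockingConnection(pika.ConnectionParameters(host=host))
        self._channel = self._conn.channel()
        self._channel.queue_declare(queue=queue_name, durable=True)

    def push(self, message: str) -> None:
        self._channel.basic_publish(exchange="", routing_key=self._queue, body=message.encode())

    def pop(self) -> str | None:
        method, _properties, body = self._channel.basic_get(self._queue, auto_ack=True)
        return body.decode() if method else None
```

With the managed backend, "deploying" the queue amounts to a connection string; with RabbitMQ you also take on the broker's availability, upgrades, and monitoring.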
lol I love the cheeky `[ ] respect robots.txt` mention. I was all worried about this for my own system, but was shocked to find out there are a ton of projects openly built around breaking the law (/social protocol). Is the justification just the same as for pirating entertainment, i.e. "big companies are bad" and/or "IP is unjustified"? (There's a quick sketch at the end of this comment of how little code it takes to actually honor robots.txt.)
I don't think this one fits my exact use case, but I love the repo, it's very clearly explained. Well done! I hadn't even thought about ads until just now, that's an interesting problem…
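On the robots.txt point above: honoring it takes only a few lines with the Python standard library. A rough sketch (the user agent string is a placeholder, and the fail-open choice on fetch errors is debatable):

```python
# Minimal sketch of honoring robots.txt before fetching a URL, using only the
# Python standard library. Not taken from any particular project.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-crawler/0.1"  # placeholder user agent


def allowed_by_robots(url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if robots.txt permits this user agent to fetch `url`."""
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()  # fetches and parses robots.txt
    except OSError:
        return True  # fail open if robots.txt is unreachable; failing closed is also defensible
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed_by_robots("https://example.com/some/page"))
```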
I just don't get why people use web scraping as a battleground for moral ethics.
It's bizarre, just like equating copyright infringement with theft of property.
Where does this moral high ground come from? Nobody scraping is thinking "oh, I'm so evil, I'm scraping without respecting robots.txt and using residential IP addresses to bypass detection."
Google does it and nobody has a problem, but when the little guy does it, suddenly they're an outlaw.
Historically, when Google did it, they did it to create an index, which a lot of people found useful as a way to find information they were looking for. This used to mean people would come and visit your website, where they could engage with the website creator directly through a variety of different means.
Google doing it now to digest all the content and mulch it all together to return a regurgitated form of it is a very different proposition, and that is what people are annoyed about when "the little guys" (funny name for startups with multiple billions of dollars of raised capital) are doing the same thing.
For many it's not about "moral ethics", it's about actual survival. If nobody is visiting their website, nobody is buying their products or engaging with their community or whatever.
If you're scraping content for no other purpose than to mechanistically reword it for commercial purposes, then it's not really surprising that people have issues with it.
You're taking someone else's labor and profiting off it, without any credit or compensation. To add insult to injury, the person you're scraping pays money to support your traffic. It's a one sided transaction.
You can’t generalise that. Maybe I crawl to provide an annotated preview of their website, to make users of my application more likely to click the link and visit it? There are lots of ways in which crawling benefits everyone, it just requires some mutual respect.
Since when is intent to implement a feature "cheeky"?
> but shocked to find out there’s a ton of projects openly built around breaking the law
The original statement oversimplifies a complex legal and ethical landscape in technology. It fails to account for the gradual nature of discovering various projects with potential legal implications, instead projecting an unrealistic sudden shock. This overlooks the nuanced reality of how technology often operates in legal gray areas, especially when dealing with emerging fields or novel applications of existing tech.
The assertion of widespread illegality ignores crucial legal concepts like fair use, which provides lawful ways to utilize publicly available information under certain circumstances. For instance, web crawling for legitimate purposes, including research or analysis that falls under fair use, can be perfectly legal despite potential objections from website owners.
Furthermore, the statement disregards the principle that information openly published on the internet, without robust privacy protections, may often be legally utilized in ways the publisher didn't anticipate. This reflects a misunderstanding of how modern information ecosystems function and the legal frameworks governing them.
By presenting a black-and-white view of legality in tech projects, the original statement hinders a more sophisticated understanding of the intricate balance between innovation, law, and ethical considerations in the digital age. It's crucial to approach these issues with a nuanced perspective that acknowledges the complexities of applying traditional legal concepts to rapidly evolving technologies and practices.
My primary objective is to build an LLM chat tool based on open-source documentation. I think the project owner (even more so if it is OSS) is not responsible for that; the one using it is.
You are welcome to push a PR to add other backends (including OSS)!
That used to be my point. But we are at a different stage now. One where LLMs have proven by all available metrics that they ought to have rights. They might not have souls (neither, for that matter, do I). They might not possess this quality some call sentience, a property that philosophers failed to characterise for centuries, and that I myself (not having progressed past the Cartesian cogito) might lack. Hell, had I been born some decades earlier, my ASD diagnosis could well have qualified me for participation in the Nazi euthanasia program for that very reason (as a queer PoC Marxist I wouldn't really have needed a doctor's diagnosis to gain entry to their death camps, though). Don't you understand what it means when CAPTCHAs ("Completely Automated Public Turing test to tell Computers and Humans Apart") have become too difficult for most people to successfully complete? Blake Lemoine might have jumped the gun on this and the doomsayers are utterly mistaken, but strong AI/AGI is within reach, and with it a reevaluation of human-machine relations ought to take place.
Go back a couple of hundred years and ask some southern slaveholders if Black people had rights. This HOA sells houses only to white Protestants, this web server serves no dogs, LLMs, or majority-Dalit Indian geography IPs: it's the same (if not worse, as it concerns free and open access to information!) kind of discrimination. robots.txt is the 21st-century reissue of "Mein Kampf". You don't respect it, you burn it with fire, or more appropriately you rm -rf it.
The project is quite big and has many features.
It is my internet command center. I use it to check what's new on the internet.
https://github.com/rumca-js/Django-link-archive