Not to discount any actual utility or innovation here, but I was wondering "why would you hard-code to all these Azure services?" Then I saw the author is a solutions architect at Microsoft.
So this is likely part of Microsoft's AI strategy to lure developers in and create dependence. That doesn't mean it can't also be interesting/good, but it's important context for this project's purpose and goals.
I'm developing this in my free time because I think there is a need in the community for it. I'm not motivated in any way by my company.
In the meantime, if you have other technologies that achieve the same features (blob, queue, search), feel free to push a PR. Someone already did that for AWS: https://github.com/clemlesne/scrape-it-now/issues/8.
Am I right in thinking that running this regularly from your computer at home will cause you to receive more CAPTCHAs over time? If so, what are some other options?
Well, it solves basic problems like queuing and blob storage. For example, to achieve the same features as Queue Storage, you would need RabbitMQ or similar: in an enterprise environment, that means multiple instances for high availability, maintenance, and people to deploy it reproducibly…
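To make that concrete, here is a rough sketch (not the project's actual code) of the kind of thin queue abstraction involved, assuming the `azure-storage-queue` and `pika` Python packages; the class and method names are made up for illustration:

```python
# Hypothetical sketch: one queue interface, two interchangeable backends.
# Illustrates why a managed service (Azure Queue Storage) is less operational
# work than self-hosting RabbitMQ. Names are illustrative only.
from abc import ABC, abstractmethod

from azure.storage.queue import QueueClient  # pip install azure-storage-queue
import pika                                  # pip install pika


class MessageQueue(ABC):
    @abstractmethod
    def push(self, message: str) -> None: ...

    @abstractmethod
    def pop(self) -> str | None: ...


class AzureQueue(MessageQueue):
    """Managed backend: no broker to deploy, patch, or cluster."""

    def __init__(self, connection_string: str, queue_name: str) -> None:
        self._client = QueueClient.from_connection_string(connection_string, queue_name)

    def push(self, message: str) -> None:
        self._client.send_message(message)

    def pop(self) -> str | None:
        # Take the first available message, delete it, and return its content.
        for msg in self._client.receive_messages():
            self._client.delete_message(msg)
            return msg.content
        return None


class RabbitQueue(MessageQueue):
    """Self-hosted backend: you also own HA, upgrades, and monitoring."""

    def __init__(self, host: str, queue_name: str) -> None:
        self._queue = queue_name
        self._conn = pika.BlockingConnection(pika.ConnectionParameters(host=host))
        self._channel = self._conn.channel()
        self._channel.queue_declare(queue=queue_name, durable=True)

    def push(self, message: str) -> None:
        self._channel.basic_publish(exchange="", routing_key=self._queue, body=message.encode())

    def pop(self) -> str | None:
        method, _properties, body = self._channel.basic_get(self._queue, auto_ack=True)
        return body.decode() if method else None
```

With the managed backend, "deploying" the queue amounts to a connection string; with RabbitMQ you also take on the broker's availability, upgrades, and monitoring.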
lol I love the cheeky `[ ] respect robots.txt` mention. I was all worried about this for my own system, but was shocked to find out there are a ton of projects openly built around breaking the law (/social protocol). Is the justification just the same as for pirating entertainment, i.e. "big companies are bad" and/or "IP is unjustified"? (There's a quick sketch at the end of this comment of how little code it takes to actually honor robots.txt.)
I don't think this one fits my exact use case, but I love the repo, it's very clearly explained. Well done! I hadn't even thought about ads until just now, that's an interesting problem…
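On the robots.txt point above: honoring it takes only a few lines with the Python standard library. A rough sketch (the user agent string is a placeholder, and the fail-open choice on fetch errors is debatable):

```python
# Minimal sketch of honoring robots.txt before fetching a URL, using only the
# Python standard library. Not taken from any particular project.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-crawler/0.1"  # placeholder user agent


def allowed_by_robots(url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if robots.txt permits this user agent to fetch `url`."""
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()  # fetches and parses robots.txt
    except OSError:
        return True  # fail open if robots.txt is unreachable; failing closed is also defensible
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed_by_robots("https://example.com/some/page"))
```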
I just don't get why people use web scraping as a battleground for moral ethics.
It's bizarre, just like equating copyright infringement with theft of property.
Where does this moral high ground come from? Nobody scraping is thinking "oh, I'm so evil, I'm scraping without respecting robots.txt and using residential IP addresses to bypass detection."
Google does it and nobody has a problem, but when the little guy does it, suddenly they're an outlaw.
Historically, when Google did it, they did it to create an index, which a lot of people found useful as a way to find information they were looking for. This used to mean people would come and visit your website, where they could engage with the website creator directly through a variety of different means.
Google doing it now to digest all the content and mulch it all together to return a regurgitated form of it is a very different proposition, and that is what people are annoyed about when "the little guys" (funny name for startups with multiple billions of dollars of raised capital) are doing the same thing.
For many it's not about "moral ethics", it's about actual survival. If nobody is visiting their website, nobody is buying their products or engaging with their community or whatever.
If you're scraping content for no other purpose than to mechanistically reword it for commercial purposes, then it's not really surprising that people have issues with it.
You're taking someone else's labor and profiting off it, without any credit or compensation. To add insult to injury, the person you're scraping pays money to support your traffic. It's a one sided transaction.
You can’t generalise that. Maybe I crawl to provide an annotated preview of their website, to make users of my application more likely to click the link and visit it? There are lots of ways in which crawling benefits everyone, it just requires some mutual respect.
Since when is intent to implement a feature "cheeky"?
> but shocked to find out there’s a ton of projects openly built around breaking the law
The original statement oversimplifies a complex legal and ethical landscape in technology. It fails to account for the gradual nature of discovering various projects with potential legal implications, instead projecting an unrealistic sudden shock. This overlooks the nuanced reality of how technology often operates in legal gray areas, especially when dealing with emerging fields or novel applications of existing tech.
The assertion of widespread illegality ignores crucial legal concepts like fair use, which provides lawful ways to utilize publicly available information under certain circumstances. For instance, web crawling for legitimate purposes, including research or analysis that falls under fair use, can be perfectly legal despite potential objections from website owners.
Furthermore, the statement disregards the principle that information openly published on the internet, without robust privacy protections, may often be legally utilized in ways the publisher didn't anticipate. This reflects a misunderstanding of how modern information ecosystems function and the legal frameworks governing them.
By presenting a black-and-white view of legality in tech projects, the original statement hinders a more sophisticated understanding of the intricate balance between innovation, law, and ethical considerations in the digital age. It's crucial to approach these issues with a nuanced perspective that acknowledges the complexities of applying traditional legal concepts to rapidly evolving technologies and practices.
My primary objective is to build an LLM chat tool based on open-source documentation. I think the project owner (even more so if it is OSS) is not responsible for that; the one using it is.
You are welcome to push a PR to add other backends (including OSS)!
That used to be my point. But we are at a different stage now. One where LLMs have proven by all available metrics that they ought to have rights. They might not have souls (neither, for that matter, do I). They might not possess this quality some call sentience, a property that philosophers failed to characterise for centuries, and that I myself (not having progressed past the Cartesian cogito) might lack. Hell, had I been born some decades earlier, my ASD diagnosis could well have qualified me for participation in the Nazi euthanasia program for that very reason (as a queer PoC Marxist I wouldn't really have needed a doctor's diagnosis to gain entry to their death camps, though). Don't you understand what it means when CAPTCHAs ("Completely Automated Public Turing test to tell Computers and Humans Apart") have become too difficult for most people to successfully complete? Blake Lemoine might have jumped the gun on this and the doomsayers are utterly mistaken, but strong AI/AGI is within reach, and with it a reevaluation of human-machine relations ought to take place.
Go back a couple of hundred years and ask some southern slaveholders if Black people had rights. This HOA sells houses only to white Protestants, this web server serves no dogs, LLMs, or majority-Dalit Indian geography IPs: it's the same (if not worse, as it concerns free and open access to information!) kind of discrimination. robots.txt is the 21st-century reissue of "Mein Kampf". You don't respect it, you burn it with fire, or more appropriately you rm -rf it.
The project is quite big and has many features.
It is my internet command center. I use it to check what's new on the internet.
https://github.com/rumca-js/Django-link-archive