We use it in our CI, just before deployment, to compare the DB structure of what's being tested with the DB structure of our staging or production environment.
It's a last-minute check that has prevented a lot of mistakes.
It absolutely should, though you might want a specific, ordered set of checks: things like column order can differ between production and development due to data sizes, dropping and recreating dev instead of migrating it, etc.
None of those things is actually a problem, but they could give you false positives, so you might want some minor shuffling.
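The column-order caveat above can be handled by normalizing each schema snapshot before diffing. A minimal sketch in Python, with hypothetical table/column names (this is not the actual tool's implementation):

```python
def normalize(schema):
    """Sort columns within each table so ordering differences
    between environments don't produce false positives."""
    return {table: sorted(cols) for table, cols in schema.items()}

def schema_diff(env_a, env_b):
    """Return the names of tables whose normalized column sets differ."""
    a, b = normalize(env_a), normalize(env_b)
    return {t for t in set(a) | set(b) if a.get(t) != b.get(t)}

# Same columns, different order: no diff reported.
prod = {"users": ["id", "email", "name"]}
dev = {"users": ["name", "id", "email"]}
assert schema_diff(prod, dev) == set()

# A genuinely missing column is still caught.
dev2 = {"users": ["id", "name"]}
assert schema_diff(prod, dev2) == {"users"}
```

Sorting the columns is the "minor shuffling": it discards ordering, which is noise here, while still catching real structural drift.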
It's similar in the sense that it's a listing of APIs. But we're providing non-official APIs that work by automating / scraping websites. Our platform runs Headless Chrome instances behind the scenes.
Well, it's up for debate. We automate websites on behalf of our users (that is, logged in as them). Which means the site knows at all times who's doing what and can take action in case of abuse.
Also, we see more and more rulings indicating that scraping is in fact legal. Websites can block users according to their ToS, but they can't take legal action against them or us. Maybe.
In any case, our platform also provides the tools for anyone to automate any website (make them into an API). That part is just a developer tool.
>We automate websites on behalf of our users (that is, logged in as them). Which means the site knows at all time who's doing what and can take action in case of abuse.
So you break the ToS on your users' accounts, thereby risking their accounts and not yours... Even better.
>Also, we see more and more ruling indicating that scraping is in fact legal.
It doesn't matter if it is legal. What matters more is whether the service considers it a violation of an implicit agreement not to abuse its servers with rapid API requests. (Big props if you are already throttling.)
---
Like, your service is a great idea, but breaking ToS on your users' accounts is super no-bueno in my opinion. I scrape too, but I am always under the complete understanding that the service can ban my account or IP at any time.
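For the "big props if you are already throttling" point: a minimal client-side throttle can be sketched as a sliding-window limiter. This is an illustrative sketch, not how any particular service does it:

```python
import time

class Throttle:
    """Allow at most `rate` requests per `per` seconds.
    A sliding-window sketch, not a production rate limiter."""

    def __init__(self, rate, per):
        self.rate = rate
        self.per = per
        self.timestamps = []

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop request timestamps that have fallen out of the window.
        self.timestamps = [t for t in self.timestamps if now - t < self.per]
        if len(self.timestamps) < self.rate:
            self.timestamps.append(now)
            return True
        return False

t = Throttle(rate=2, per=1.0)
assert t.allow(now=0.0)
assert t.allow(now=0.1)
assert not t.allow(now=0.2)   # third request inside the window: blocked
assert t.allow(now=1.5)       # window has slid past; allowed again
```

The caller wraps every outgoing request in `t.allow()` and sleeps (or queues) when it returns `False`, which keeps the automated traffic closer to human pacing.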
Obviously, it's up to each site's specific ToS. Tons of sites explicitly call out scrapers and non-human/automated means of accessing the site. You might debate over definitions and intent, but ultimately it's up to the site owners when they say, "you know what? X _is_ against the ToS and we're just gonna ban anyone doing it"; users won't/don't have any recourse to argue their point.
For example, here are a few relevant parts for the top sites on Phantom Buster:
>These terms govern your collection of data from Facebook through automated means, such as through harvesting bots, robots, spiders, or scrapers ("Automated Data Collection"), as well as your use of that data. You will not engage in Automated Data Collection without Facebook's express written permission.
>We prohibit crawling, scraping, caching or otherwise accessing any content on the Service via automated means, including but not limited to, user profiles and photos (except as may be the result of standard search engine protocols or technologies used by a search engine with Instagram's express consent).
> In order to protect our members’ data and our website, we don't permit the use of any third party software, including "crawlers", bots, browser plug-ins, or browser extensions (also called "add-ons"), that scrapes, modifies the appearance of, or automates activity on LinkedIn’s website.
I'd give a read through each of the APIs offered and make sure that users know 1) your service has the potential to get accounts banned for use, and 2) since the service is on behalf of the user's accounts, it'll be their accounts getting banned if the websites ban anyone.
FWIW I wrote this comment while watching a bot (that I wrote) play a game on my behalf on a second monitor. :)
On the LinkedIn example, though: I think an interesting argument could be made that they should then also be blocking accessibility extensions/tools, since these (to some extent) modify and automate the UX.
I guess the question in the end is not the terms, it's enforcement. Clearly ToS don't cover all cases, and even though LinkedIn's ToS say "Thou shalt not scrape", the courts adjudicated differently. So what matters is: what is enforceable and actually enforced?
The issue of acting as an "agent" for the user is very important. I don't think the way this tool currently does it is OK, because getting users banned is a bad thing. Maybe there is a better way to set it up. Or maybe I'm wrong.
Wow, you are located in San Francisco. I would have guessed in India or somewhere else where you might not fear being sued.
>> "is in fact legal"
There is a big difference between "legal" and "a court decision". If a court rules in hiQ Labs' favor vs. LinkedIn, it doesn't automagically make scraping LinkedIn legal for you.
Not sure what your link proves. In reality, no one knows whether this statement is true, because addresses and wallets can be shared. There was a Bloomberg article with lots of interviews on this topic with people in the crypto business, and no one seemed to deny this: https://news.ycombinator.com/item?id=15877838
You'd probably find that 40% of US equities are owned by fewer than 1000 asset managers as well, so I'm not sure that this is all that relevant. If Vanguard decided to liquidate their holdings overnight there would be pandemonium just as if these people decided to liquidate their BTC, but it's not going to happen in either case.
We're developing a library[1] for this use case. The goal is to have the same simple API for multiple browsers. Right now it supports both Headless Chrome and PhantomJS.
We think we'll begin work on Firefox headless soon. PRs welcome :)
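The "same simple API for multiple browsers" idea is essentially an adapter pattern: one uniform front-end dispatching to interchangeable backends. A sketch with hypothetical backend classes standing in for the real drivers (not the library's actual API):

```python
class ChromeBackend:
    """Hypothetical stand-in for a Headless Chrome driver."""
    def fetch(self, url):
        return f"<html via Chrome: {url}>"

class PhantomBackend:
    """Hypothetical stand-in for a PhantomJS driver."""
    def fetch(self, url):
        return f"<html via PhantomJS: {url}>"

class Browser:
    """Uniform front-end; callers never touch a backend directly,
    so adding a Firefox backend later is just one more entry here."""
    _backends = {"chrome": ChromeBackend, "phantomjs": PhantomBackend}

    def __init__(self, engine="chrome"):
        self.backend = self._backends[engine]()

    def get(self, url):
        return self.backend.fetch(url)

assert "Chrome" in Browser("chrome").get("https://example.com")
assert "PhantomJS" in Browser("phantomjs").get("https://example.com")
```

Because only the registry knows about concrete engines, user code written against `Browser` keeps working unchanged when a new engine lands.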
They're a hassle-free way of getting the data. No need to worry about CORS, sessions, cookies, CSRF and other modern web stuff. Just simulate a human and you’re in.
Yeah, I used to work for a company doing similar things. It was more expensive that way, but there were tons of sites where you could only get the data from inside a proper browser.