
So meta


Without "-data" meta struck me as corporate appropriation of slang. 'Hella' would be more entertaining.


My two cents on paid search engines: yes, there's a lot more awareness around privacy these days, so there might be a shift in user behavior that opens up a big market. However, I'm skeptical, because people are used to a free search engine that surfaces good results.

So privacy is a great value prop, but not so great that I'm willing to pay for it while living with worse search quality.


Interesting fact. Also all convos lead to bitcoin at some point these days...



lol they didn’t open source the model weights


If I did the math right it would be 3.12TB of weights, maybe they are trying to upload it to gdrive still. (/s, probably)


They released more data than that for their Google Books n-grams datasets:

https://storage.googleapis.com/books/ngrams/books/datasetsv3...

(I don't remember exactly how much it is, but I remember that the old version was already in the terabytes.)


Another example of Google giving much data away is 50 trillion digits of pi [1], which contains about 42 TB of data (decimal and hexadecimal combined).

[1] https://storage.googleapis.com/pi50t/index.html


The Waymo open dataset is about 1TB. I don't think releasing a 3TB dataset would present a technical challenge for Google.


Even a 3PB model would be very doable for Google...


The daily upload quota for a user is ~750 GB, so pushing ~3 TB of data to Google Drive would take four or five days!


Google Cloud Storage. The files could be dumped as tfrecords in a bucket with "requester pays" enabled, so anybody could reproduce the results using the open source code by paying for the cost of moving the data from GCS to their training nodes.
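Roughly, pulling the shards could look like this (the bucket name, prefix, and billing project are made up; assumes the google-cloud-storage package and billing-enabled credentials):

    # Sketch only: download tfrecord shards from a "requester pays" bucket.
    # The requester's project (user_project) gets billed for the egress.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("hypothetical-weights-bucket", user_project="my-billing-project")

    for blob in client.list_blobs(bucket, prefix="weights/"):
        blob.download_to_filename(blob.name.rsplit("/", 1)[-1])

gsutil supports the same thing via the -u flag (gsutil -u <billing-project> cp ...).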


Weights are just numbers (probably floats?), right?

This model has 3.12TB of floats??? That's insane. How do you load that into memory for inferencing?


Use x1e.32xlarge on AWS with 3TB of RAM. Just $12,742/mo - https://calculator.aws/#/estimate?id=7428fa81192c57087ac8cdf...

Alternatively order something like the HP Z8 with 3TB RAM configured, which is only $75k - https://zworkstations.com/configurations/2040422/

It's interesting. It would take ~six years for the Z8 to break even compared to AWS, but traffic into and out of the machine would be $0, and I don't think you're running directly on the metal with AWS, so the Z8's performance would probably be a bit higher. And then there's storage - I configured, uhh, 120TB of a mixture of SSDs and HDDs. I'm not even going to try and ask AWS for a comparable quote there.

I may or may not have added dual Xeon Platinum 8280s to the Z8 as well. :P


When you're spending that kind of money on a machine, there's no way you're paying retail price. Sales reps would give you a significant discount.

Also - I think you meant 6 months, not 6 years anyhow :)


Interesting. I'm very curious... 20%? 35%?

And I did mean 6 months, woops. Didn't even notice...


> It would take ~six years for the Z8 to break even

Do you mean six months?


Oh *dear*. I definitely tripped over there, and I didn't even notice.

Yup.


The Z8 sounds like fun, but I might just buy two Teslas (a Roadster and an X, or a Cybertruck) and a gaming PC. :D


Hate to spoil the party, but this model only loads a small part of itself into RAM when doing inference.
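(One way to touch only part of a huge weight file is to memory-map it; a minimal numpy sketch, with a dummy file standing in for the real multi-TB dump:)

    # Sketch: memory-map a raw float32 dump and read only a slice of it.
    import numpy as np

    # Dummy stand-in for a multi-TB weight file.
    np.arange(1_000_000, dtype=np.float32).tofile("dummy_weights.f32")

    # Only the pages backing this slice get read into RAM.
    weights = np.memmap("dummy_weights.f32", dtype=np.float32, mode="r")
    print(weights[500_000:500_010])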


That's a good thing. Less compute means more energy for interestingness, and less expense means more accessibility.


(They are definitely going to exceed their storage quotas.)

I want to see how well weights for these models compress, but it will take me some time to run this code and generate some. I'm guessing they won't compress well, but I can't articulate a reason why.
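A cheap sanity check is possible even without the real weights, e.g. zlib on stand-in float32 arrays (sizes are arbitrary):

    # Sketch: compare compressed size of random vs. trivially structured float32 data.
    # Real trained weights presumably land somewhere in between.
    import zlib
    import numpy as np

    candidates = {
        "random": np.random.randn(1_000_000).astype(np.float32),  # high-entropy stand-in
        "zeros": np.zeros(1_000_000, dtype=np.float32),           # maximally compressible stand-in
    }

    for name, w in candidates.items():
        raw = w.tobytes()
        ratio = len(zlib.compress(raw, 6)) / len(raw)
        print(f"{name}: compressed to {ratio:.1%} of original size")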


If the weights compress well, they carry little information, which would suggest they're either useless or the architecture is bad.


In the source code it says "I have discovered truly marvelous weights for this, which this header file is too small to contain"


Data is the new oil. So what's the data industry's analogue of climate change, in terms of impact on society?




Is this because they're afraid of the model being misused, e.g. for generating fake reviews? It's frustrating that I keep hearing great news in NLP but can't try any of it myself.


It's because the model weights are the valuable thing here. The fancy new architectures are nice and everything, but transformer models are a dime a dozen these days. Seems like they're using this as an example to point at and say "Hey, look at us, we support open source!", whereas unless you're willing to go ahead and spend a small fortune on compute (possibly using their GPUs), these models are somewhat useless.


hah! yeah that's what I was looking for too


is the "retrofit" strategy living in the past? living in 2021 it seems a bad choice to buy gasoline cars. Most new cars coming out will have some kind of driving assist (L2 autopilot).


It's the opposite. I already own a compatible gas car. Instead of wasting resources on a new car, I can just retrofit the one I already own.

Also, I desperately want an electric car. But I need a minivan because (post-pandemic) I'm often driving six people around who are elderly or children and can't climb into an SUV.

There is no such electric van. This is the only way I can get "autopilot" in a van.


I wonder how Clubhouse would monetize its traffic. Most social networks present ads in the same format as the content, e.g. Twitter, Instagram, TikTok. But I can't imagine clicking on and listening to a "conversation" that advertises a product...


That only works for static pages, though. Many modern pages require you to run Selenium or Puppeteer to scrape the content.


For these sites, I crawl using a JS-powered engine and just save the relevant page content to disk.

Then I can craft my regexes/selectors/etc. once I have the data stored locally.

This helps if you get caught and shut down - it doesn't halt your development effort, and you can create a separate task to proxy requests.
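A rough sketch of that split, assuming Selenium with a local headless Chrome and BeautifulSoup for the offline pass (the URL and selector are placeholders):

    # Step 1: render with a headless browser and persist the HTML to disk.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    driver.get("https://example.com/listing")  # placeholder URL
    with open("page.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)
    driver.quit()

    # Step 2, later and fully offline: iterate on selectors against the saved copy.
    from bs4 import BeautifulSoup

    with open("page.html", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    for item in soup.select("div.listing-title"):  # placeholder selector
        print(item.get_text(strip=True))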


I did web scraping professionally for two years, on the order of 10M pages per day. Performance with a browser is abysmal and it requires tons of memory, so it isn't financially viable at that scale. We used browsers for some jobs, but rendered content isn't really a problem: you can usually simulate the API calls (common) and read the JSON, or regex the inline script and work with that.

I'd say 99% of the time you can get by without a browser.
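For the JSON case it's usually just replaying whatever the network tab shows (the endpoint, params, and headers here are made up):

    # Sketch: hit the site's own JSON endpoint instead of rendering the page.
    import requests

    resp = requests.get(
        "https://example.com/api/v1/search",  # placeholder endpoint
        params={"q": "widgets", "page": 1},
        headers={"User-Agent": "Mozilla/5.0", "Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    for row in resp.json().get("results", []):
        print(row.get("title"), row.get("price"))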


Fully agree. It takes some thought :)


That's never required; the data shows up in the web page because you requested it from somewhere. You can do the same thing in your scraper.


> You can do the same thing in your scraper

Rendering the page in Puppeteer / Selenium and then scraping it from there sounds a lot easier than somehow trying to replicate that in your scraper?


Sure. How does that relate to the claim that your scraper is actually unable to make the same requests your browser does?


How are you going to deal with values generated by JS and used to sign requests?


If they're really being generated client-side, you're free to generate them yourself by any means you want. But also, that's a strange thing for the website to do, since it's applying a security feature (signatures) in a way that prevents it from providing any security.

If they're generated server-side like you would expect, and sent to the client, you'd get them the same way you get anything else, by asking for them.


I'm not sure what your point is. Of course you can replicate every request in your scraper / with curl if you know all the input variables.

Doing that for web scraping, where everything changes all the time and you have more than one target website, just isn't feasible if you have to reverse engineer custom JS for every site. Using some kind of headless browser for modern websites is way easier and more reliable.


As someone who has done a good bit of scraping, how a website is designed dictates how I scrape it.

If it's a static website with consistently structured HTML where it's easy to enumerate all the pages I'm looking for, then simple Python requests code will work.

The less clear case is when to use a headless browser vs. reverse engineering the JS/server-side APIs. Typically I'll do a 10-minute dive into the client-side JS and monitor the AJAX requests to see whether it would be easy to hit some API that returns JSON to get my data. If reverse engineering looks too hairy, I'll just use a headless browser.

I have a really strong preference for hitting JSON APIs directly because, well, you get JSON! You also usually get more data than you even knew existed.

Then again, if I were creating a spider to recursively crawl a non-static website, headless is the path of least resistance. But usually I'm after specific data in the HTML, not the whole document.
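For the static, easily enumerable case, the whole thing can be as small as a loop over requests (the URL pattern and regex are placeholders):

    # Sketch: a static site with predictable page URLs often needs nothing fancier.
    import re
    import requests

    title_re = re.compile(r'<h2 class="title">(.*?)</h2>')

    for page in range(1, 6):
        resp = requests.get(f"https://example.com/catalog?page={page}", timeout=30)
        resp.raise_for_status()
        for title in title_re.findall(resp.text):
            print(page, title)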


I’ve been doing web scraping for the past 5 years and this is exactly the approach I take as well!


>If they're really being generated client-side, you're free to generate them yourself by any means you want. But also, that's a strange thing for the website to do

what??

Page loads -> Javascript sends request to backend -> it returns data -> javascript does stuff with it and renders it.


Sure, that's the model from several comments up. It doesn't involve signing anything.


Following a normal SaaS growth curve, this will IPO in 25 years!


I wonder if the world will ever converge on an open standard for social networks: https://news.stanford.edu/news/2014/march/privacy-economy-ap...

