
Okay, this is super impressive. I just downloaded Whisper and fed it a random FLAC file I had handy, and it did a really good job. It's also impressive that it works on my weak CPU:

A 3m07s flac took 5m to transcribe:

  $ whisper --device cpu 'BLACKPINK - BORN PINK/01 Pink Venom.flac'
  Detecting language using up to the first 30 seconds. Use `--language` to specify the language
  Detected language: korean
  [00:00.000 --> 00:10.000]  Blackpink
  [00:11.000 --> 00:14.000]  Kick in the door, wave in the coco
  [00:14.000 --> 00:16.000]  팝콘이는 친게 껴들 생각 말고
  [00:16.000 --> 00:19.000]  I talk to talk, run ways I walk walk
  [00:19.000 --> 00:21.000]  힘 감고 팝 팝 안 봐도 척
  [00:21.000 --> 00:24.000]  By one and two by two
  [00:24.000 --> 00:26.000]  내 손끝 두 하나에 타면 아지은 중
  [00:26.000 --> 00:30.000]  갓 자쇼 지금 화려해 T makes no sense
  [00:30.000 --> 00:32.000]  You couldn't get a dollar out of me
  [00:33.000 --> 00:38.000]  자 오늘 밤이야 눈톱을 품고
  [00:38.000 --> 00:41.000]  미혼을 뺏음 down
  [00:41.000 --> 00:43.000]  Look what you made us do
  [00:43.000 --> 00:47.000]  천천히 널 잠재울 파이어
  [00:48.000 --> 00:52.000]  잠이 날 만큼 아름다워
  [00:52.000 --> 00:53.000]  I bring the pain like
  [00:53.000 --> 00:57.000]  디스탑, 팽팽, 디스탑, 팽팽, 디스탑, 팽팽, 팽팽
  [00:57.000 --> 00:58.000]  Get em, get em, get em
  [00:58.000 --> 01:00.000]  Straight till you don't like
  [01:00.000 --> 01:01.000]  Whoa, whoa, whoa
  [01:01.000 --> 01:03.000]  Straight till you don't like
  [01:03.000 --> 01:04.000]  Ah, ah, ah
  [01:04.000 --> 01:05.000]  Taste that, pink venom
  [01:05.000 --> 01:06.000]  Taste that, pink venom
  [01:06.000 --> 01:08.000]  Taste that, pink venom
  [01:08.000 --> 01:09.000]  Get em, get em, get em
  [01:09.000 --> 01:11.000]  Straight till you don't like
  [01:11.000 --> 01:12.000]  Whoa, whoa, whoa
  [01:12.000 --> 01:13.000]  Straight till you don't like
  [01:13.000 --> 01:14.000]  Ah, ah, ah
  [01:14.000 --> 01:15.000]  Blackpink and Amo
  [01:15.000 --> 01:17.000]  Got it by the smack ram
  [01:17.000 --> 01:18.000]  But rest in peace
  [01:18.000 --> 01:19.000]  Please light up a candle
  [01:19.000 --> 01:20.000]  This the knife of a vando
  [01:20.000 --> 01:22.000]  Messed up and I'm still in saline
  …SNIP…
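
To put that CPU run in numbers: 3m07s of audio in about 5m of processing is a real-time factor of roughly 1.6, i.e. slower than playback. Here is a minimal sketch of the arithmetic. The helper name `real_time_factor` is made up for illustration, and the "5m" figure is rounded, so the result is only approximate.

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Processing time divided by audio duration (below 1.0 means faster than playback)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


audio = 3 * 60 + 7    # 3m07s of audio, from the comment above
processing = 5 * 60   # "5m" to transcribe; a rounded figure, so only approximate

rtf = real_time_factor(audio, processing)
print(f"real-time factor: {rtf:.2f}")
print(f"speed relative to playback: {1 / rtf:.2f}x")
```

A factor above 1.0 means the transcription takes longer than listening to the track would.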


Looks like it defaults to the model called "small".
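
If you'd rather not rely on the default, you can pass the model explicitly. Here is a small sketch that only builds the command line and does not run it. `build_whisper_command` is a hypothetical helper, `--device` comes from the invocation earlier in the thread, and `--model` is the flag I believe selects the model, so check `whisper --help` on your install before trusting it.

```python
import shlex

# Model names as they appear in the benchmark comment in this thread.
KNOWN_MODELS = ("tiny", "base", "small", "medium", "large")


def build_whisper_command(audio_path, model="small", device="cpu"):
    """Return the argv list for a whisper CLI call with an explicit model.

    Hypothetical helper; the flag names are assumptions to verify locally.
    """
    if model not in KNOWN_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    return ["whisper", "--model", model, "--device", device, audio_path]


cmd = build_whisper_command("BLACKPINK - BORN PINK/01 Pink Venom.flac", model="tiny")
print(shlex.join(cmd))  # shell-quoted form, handy for copy/paste
# To actually run it: subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids quoting problems with paths that contain spaces, like the one above.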

I just ran some benchmarks on an M1 Max with PyTorch, using a 1.29-second FLAC file (it looks like the matrix math was running on a single thread):

    tiny
    146.522ms detect_lang
    549.131ms decode_one
    0.057ms tokenizer

    base
    354.885ms detect_lang
    1046.679ms decode_one
    0.011ms tokenizer

    small
    803.892ms detect_lang
    3194.503ms decode_one
    0.017ms tokenizer

    medium
    2279.689ms detect_lang
    10128.255ms decode_one
    0.023ms tokenizer

    large
    3656.478ms detect_lang
    17249.024ms decode_one
    0.016ms tokenizer
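
To make the listing above easier to compare, here is a rough sketch that sums the two main stages per model and relates them to the tiny model and to the 1.29 s clip. It assumes `detect_lang` and `decode_one` run one after the other so that their sum is a fair total, which the comment does not actually state. The tokenizer timings are all under 0.1 ms and are left out.

```python
CLIP_SECONDS = 1.29  # length of the test clip, from the comment above

# Timings in milliseconds, copied from the benchmark listing above.
TIMINGS_MS = {
    "tiny":   {"detect_lang": 146.522,  "decode_one": 549.131},
    "base":   {"detect_lang": 354.885,  "decode_one": 1046.679},
    "small":  {"detect_lang": 803.892,  "decode_one": 3194.503},
    "medium": {"detect_lang": 2279.689, "decode_one": 10128.255},
    "large":  {"detect_lang": 3656.478, "decode_one": 17249.024},
}


def total_ms(stages):
    """Sum the stage timings (assumes the stages run sequentially)."""
    return sum(stages.values())


totals = {name: total_ms(stages) for name, stages in TIMINGS_MS.items()}
baseline = totals["tiny"]

for name, total in totals.items():
    slowdown = total / baseline               # how much slower than tiny
    vs_clip = (total / 1000) / CLIP_SECONDS   # processing time / audio length
    print(f"{name:<7}{total:>11.3f} ms  {slowdown:>5.1f}x tiny  {vs_clip:>5.2f}x clip length")
```

Under that assumption, large comes out at about 30x the cost of tiny on this machine, and only tiny finishes in less time than the clip itself lasts.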


As another data point: on an RTX 2060 (6 GB), the "small" model is roughly 10x real-time for me, and the "tiny" model is 30x real-time.
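
For a sense of scale, here is what those factors would mean for the 3m07s track from the top of the thread. This is only a sketch: it reads "10x real-time" as ten times faster than playback, which is my interpretation, and `estimated_seconds` is a made-up helper name.

```python
def estimated_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Wall-clock estimate for audio processed at `speed_factor` times playback speed."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


track = 3 * 60 + 7  # the 3m07s track transcribed earlier in the thread

# Speed factors as reported for the RTX 2060; assumed to mean "faster than playback".
for model, factor in {"small": 10, "tiny": 30}.items():
    print(f"{model}: ~{estimated_seconds(track, factor):.1f} s for a {track} s track")
```

Actual times will vary with the audio and with model load time, which this estimate ignores.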



