Hacker News | nihit-desai's comments

Function calling, as I understand it, makes LLM outputs easier to consume by downstream APIs/functions (https://openai.com/blog/function-calling-and-other-api-updat...).
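
A minimal sketch of the idea, with the schema and values invented for illustration (not from this thread): the model is given a function signature and returns a structured call instead of free-form text, so downstream code can parse it reliably.

```python
import json

# A function schema in roughly the shape OpenAI's function-calling API
# expects (the function name and fields here are illustrative).
weather_fn = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Instead of prose, the model's response contains a structured call:
response = {"function_call": {"name": "get_weather",
                              "arguments": '{"city": "Berlin"}'}}

# ...which downstream code consumes without brittle string parsing:
call = response["function_call"]
args = json.loads(call["arguments"])
print(call["name"], args["city"])  # get_weather Berlin
```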

Autolabel is quite orthogonal to this - it's a library that makes it easy to use LLMs to label text datasets for NLP tasks.
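
To make that concrete, here is a rough sketch of what an Autolabel labeling config looks like. Field names follow the examples in the repo, but treat the exact schema (and the commented `LabelingAgent` calls) as an approximation - check the docs before relying on them. The labels are from the banking77 dataset.

```python
import json

# Sketch of an Autolabel config for a classification task.
config = {
    "task_name": "BankingQueryClassification",
    "task_type": "classification",
    "model": {"provider": "openai", "name": "gpt-3.5-turbo"},
    "prompt": {
        "task_guidelines": "Classify the customer query into one of the labels below.",
        "labels": ["card_arrival", "card_linking", "lost_or_stolen_card"],
        "example_template": "Query: {example}\nLabel: {label}",
    },
}

# With the library installed, labeling would look roughly like:
#   from autolabel import LabelingAgent
#   agent = LabelingAgent(config)
#   agent.plan("dataset.csv")  # dry run: cost estimate + example prompts
#   agent.run("dataset.csv")   # actual labeling
print(json.dumps(config["model"]))
```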

We are actively looking at integrating function calling into Autolabel though, to improve label quality and support downstream processing.


Yep! I totally understand the concerns around not being able to share data externally - the library currently supports open source, self-hosted LLMs through Hugging Face pipelines (https://github.com/refuel-ai/autolabel/blob/main/src/autolab...), and we plan to add more support here for models like llama.cpp that can be run without many constraints on hardware.
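
As a sketch of the self-hosted path: build the labeling prompt locally and feed it to a local model, so no data leaves your machine. The prompt helper below is plain Python; the model name in the commented `pipeline` call is an illustrative assumption, not something from this thread.

```python
def make_label_prompt(text: str, labels: list[str]) -> str:
    """Build a simple zero-shot classification prompt for a local model."""
    return (
        f"Classify the text into one of: {', '.join(labels)}.\n"
        f"Text: {text}\n"
        f"Label:"
    )

if __name__ == "__main__":
    prompt = make_label_prompt("great product!", ["positive", "negative"])
    print(prompt)
    # With transformers installed, a self-hosted model keeps data local:
    #   from transformers import pipeline
    #   generator = pipeline("text-generation",
    #                        model="tiiuae/falcon-7b-instruct")
    #   print(generator(prompt, max_new_tokens=5)[0]["generated_text"])
```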


Very interesting. I’ll most certainly favorite this and keep an eye on it. I think that sort of thing will be the future of LLMs for many of us.


Hi!

The earlier post was a report summarizing LLM labeling benchmarking results. This post shares the open source library.

Neither is intended to be an ad. Our hope with sharing these is to demonstrate how LLMs can be used for data labeling, and to get feedback from the community.


>> don't trust that there was no funny business going on in generating the results for this blog

All the datasets and labeling configs used for these experiments are available in our Github repo (https://github.com/refuel-ai/autolabel) as mentioned in the report. Hope these are useful!


Thank you, I appreciate your transparency with this work.


Partially agree, but it's a continuous value rather than a boolean. We've seen LLM performance largely follow this story: https://twitter.com/karpathy/status/1655994367033884672/phot...

From benchmarking, we've been positively surprised by how effective few-shot learning and PEFT are at closing the domain gap.
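
Few-shot learning here just means prepending a handful of labeled in-domain examples to the prompt so the model picks up the task format and label space. A minimal sketch (the seed examples and labels are invented):

```python
# Hand-labeled seed examples from the target domain.
SEED = [
    ("I can't log into my account", "login_issue"),
    ("When will my card arrive?", "card_arrival"),
]

def few_shot_prompt(query: str) -> str:
    """Prepend labeled examples so the model infers format and label set."""
    shots = "\n".join(f"Query: {q}\nLabel: {label}" for q, label in SEED)
    return f"{shots}\nQuery: {query}\nLabel:"

print(few_shot_prompt("My card never showed up"))
```

PEFT goes one step further and actually updates a small set of adapter weights on those examples (e.g. via the `peft` library's LoRA support), rather than relying on in-context examples alone.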

"When it encounters novel data (value) it will likely perform poorly" -- is that not true of human annotators too? :)


> "When it encounters novel data (value) it will likely perform poorly" -- is that not true of human annotators too? :)

Some humans have intelligence and reasoning abilities. No LLMs do :)


I always regret making that assumption


Good question - one follow-up question there is: value for whom? If it is to train the LLM that is doing the labeling, then I agree. If it is to train a smaller downstream model (e.g. fine-tune a pretrained BERT model), then the value is as good as labels coming from any human annotator, and only a function of label quality.


Why retrain that smaller model from scratch, though? Just do a little transfer learning, or get creative and see if you can prune down to a smaller model algorithmically instead of doing the whole label-and-train rigmarole from scratch on what is effectively regurgitation.

I’m not sold this has directional value.


Hmm, I'm not suggesting training a smaller model from scratch - in most cases you'd want to fine-tune a pretrained model (i.e., transfer learning) for your specific use case/problem domain.
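
The pipeline being described - annotator labels (human or LLM) feeding a fine-tune of a pretrained model - might look roughly like this. The label-encoding helper is plain Python; the commented fine-tuning step uses Hugging Face transformers and is a sketch, not a complete recipe.

```python
def encode_labels(rows: list[tuple[str, str]]) -> tuple[list[str], list[int], dict]:
    """Map string labels from the annotator to integer class ids."""
    label2id = {label: i for i, label in enumerate(sorted({l for _, l in rows}))}
    texts = [t for t, _ in rows]
    ids = [label2id[label] for _, label in rows]
    return texts, ids, label2id

if __name__ == "__main__":
    rows = [("great product", "positive"), ("broke in a day", "negative")]
    texts, ids, label2id = encode_labels(rows)
    print(ids)  # [1, 0]
    # With transformers installed, fine-tuning a pretrained BERT is roughly:
    #   from transformers import AutoModelForSequenceClassification
    #   model = AutoModelForSequenceClassification.from_pretrained(
    #       "bert-base-uncased", num_labels=len(label2id))
    #   ...tokenize `texts`, pair with `ids` in a Dataset, then Trainer(...).train()
```

Note that nothing in this downstream step cares whether the labels came from a human or an LLM - only how accurate they are.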

The need for labeled data for any kind of training is a constant though :)


Hi, one of the authors here. Good question! For this benchmarking, we evaluated performance on popular open source text datasets across a few different NLP tasks (details in the report).

For each of these datasets, we specify task guidelines/prompts for the LLM and human annotators, and compare each of their performance against ground truth labels.
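
The comparison itself is just agreement with the ground-truth labels; a minimal sketch (the label sequences below are invented, not benchmark results):

```python
def accuracy(pred: list[str], gold: list[str]) -> float:
    """Fraction of predicted labels matching the ground-truth labels."""
    assert len(pred) == len(gold)
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

llm_labels   = ["spam", "ham", "spam", "ham"]
human_labels = ["spam", "ham", "ham",  "ham"]
gold         = ["spam", "ham", "spam", "spam"]

print(accuracy(llm_labels, gold), accuracy(human_labels, gold))  # 0.75 0.5
```

The actual report uses task-appropriate metrics per dataset, but the shape of the comparison is the same: both annotator types scored against the same reference labels.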


You didn't answer the question at all, although to be fair the answer is both obvious and completely undermines your claim so I can see why you wouldn't.


>compare each of their performance against supposed ground truth labels.

Fixed it for you.


I mean, sure. For ground truth, we are using the labels that are part of the original dataset:

* https://huggingface.co/datasets/banking77
* https://huggingface.co/datasets/lex_glue/viewer/ledgar/train
* https://huggingface.co/datasets/squad_v2

... (exhaustive set of links at the end of the report).

Is there some noise in these labels? Sure! But the relative performance with respect to these is still a valid evaluation.


Agreed, thanks for highlighting these links!


A comprehensive list of GPU options and pricing from cloud vendors. Very useful if you're looking to train or deploy large machine learning/deep learning models.


Very neat! I was looking for something exactly like this for a library I'm building - will try it out


This upcoming course covers topics such as bootstrapping datasets and labels, model experimentation, model evaluation, deployment and observability.

The format is 4 weeks of project-driven learning with a peer cohort of motivated, interesting learners. It takes about 10 hours per week, including interactive discussion time and project work. The first iteration of the course starts July 11th. We are offering a limited number of scholarships for the course (details on the course page).

