Apple releases adapted SlowFast-LLaVA model for long-form video analysis

Terretta · 2025-08-23T16:45:22

Paper: "SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding"

“We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding... Experimental results demonstrate that SF-LLaVA-1.5 achieves superior performance on a wide range of video and image tasks, with robust results at all model sizes (ranging from 1B to 7B). Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales across various video benchmarks.” -- https://arxiv.org/abs/2503.18943

GitHub: https://github.com/apple/ml-slowfast-llava

Hugging Face: https://huggingface.co/papers/2503.18943