Benchmarks don't necessarily reflect real-world performance, especially since they do a poor job of measuring the more esoteric aspects of a model that, for now, can only be judged qualitatively. I would wait a bit to see what the community comes up with before writing off Falcon-180B.
There are fine-tuned 70B models that have higher benchmark scores than GPT-3.5, and when quantized they can run on a 64GB MacBook.
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
https://huggingface.co/spaces/gsaivinay/open_llm_leaderboard
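
For a rough sense of why a quantized 70B model can fit in 64GB, here is a back-of-the-envelope sketch in Python. The helper name and the bit widths (16, 8, 4) are my own assumptions for illustration, not something the leaderboards specify, and the estimate covers the weights only: it ignores the KV cache, activations, per-block quantization metadata, and whatever the OS itself needs.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes).

    Hypothetical helper for illustration; real quantization formats
    add some overhead for scales and other metadata.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Assumed bit widths: 16-bit (unquantized half precision), 8-bit, 4-bit.
for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: ~{weight_memory_gb(70e9, bits):.0f} GB")
```

By this estimate only the 4-bit case leaves meaningful headroom on a 64GB machine, so "quantized" here most likely means something around 4 to 5 bits per weight.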