I've literally built a dynamic benchmark where I test reasoning models on how well they derive conclusions from assumptions using sequent calculus.
o3-mini (high effort) can derive chains that are 8 inference rules deep with >95% confidence; I didn't have the money to test it further. That is better than the average logic professor working with pen and paper.
At this point, it seems like a course critiquing 5-year-old technology.