Can you provide an example of where this has been successful?
I've spoken with many researchers and grad students only to find that there was a critical typo in the algorithm, or an undescribed setting in the code (e.g., it only converges when a learning rate scheduler is applied). I'll see the same algorithm implemented differently in different repositories. This is even the case for papers I've found with thousands of citations. It can be tremendously difficult to reproduce the results of papers, especially when they require large amounts of compute that smaller research groups don't have. And I don't know whether it's a bug in my code, the paper, or the algorithm.
I mean, what you describe is an unfortunate and unavoidable issue in academia (and in the world in general). GPT4 doesn't work magic here, of course.
You still have to:
1) Understand the work and the motivation. (GPT4 can help by playing the role of a junior PhD student if you can play the role of the astute advisor.)
2) Sniff out things that are underspecified or seem wrong. (GPT4 can also help here; see above.)
3) Email the authors with questions, compare against shitty published codebases, etc.
depending upon how gnarly/rushed the prose is.
With that said, it's also "research smell" (compare "code smell") if a paper is so hastily written and undercited that you're the first person replicating it. And maybe, instead of going for "my new bleeding-edge approach that scores 0.1% better than the boring old model with 50 cites", you should just implement the boring old model.
So, where this has been successful for me is in implementing denoising diffusion for different problem domains. Given that there is a broad literature on denoising diffusion, when something is underspecified you can fall back on the best practices other researchers have converged on.
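To make "fall back on the broad literature" concrete for the diffusion case, here's a minimal sketch (my own illustration, not code from any particular paper) of the standard DDPM forward-noising step and epsilon-prediction loss from Ho et al. 2020, with the usual linear beta schedule; the helper names and the model(x_t, t) denoiser signature are placeholders I'm assuming, not anything from a specific codebase.

    import torch

    def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
        # Linear beta schedule (the common default); alpha_bar_t is the
        # cumulative product of (1 - beta_t).
        betas = torch.linspace(beta_start, beta_end, num_steps)
        return torch.cumprod(1.0 - betas, dim=0)

    def q_sample(x0, t, alpha_bar):
        # Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I).
        noise = torch.randn_like(x0)
        a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over batch
        return a.sqrt() * x0 + (1.0 - a).sqrt() * noise, noise

    def ddpm_loss(model, x0, alpha_bar):
        # The "simple" epsilon-prediction objective: predict the injected noise.
        # model(x_t, t) is a stand-in for whatever denoiser a given paper uses.
        alpha_bar = alpha_bar.to(x0.device)
        t = torch.randint(0, alpha_bar.shape[0], (x0.shape[0],), device=x0.device)
        x_t, noise = q_sample(x0, t, alpha_bar)
        return torch.nn.functional.mse_loss(model(x_t, t), noise)

The point isn't that this snippet is what any one paper did; it's that every term in it has a well-documented default, so when a paper forgets to state its schedule or loss, there's a sensible baseline to try before assuming the bug is yours.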
The same goes for other well-trodden things, like specific transformer architectures.
Basically what I'm saying is that if you're trying to reimplement something extremely niche, it's like catching butterflies. A better research agenda involves working within a particular field of study where there is supporting evidence and there are approaches to compare and contrast against. This goes without saying, regardless of whether an AI is involved or not.