You're right to be wary. I re-read your use case, and hand tracking alone won't cut it. However, I still think that hand tracking _combined_ with audio input can be precise enough (above 99%).
Regardless of the feasibility, it won't be as reliable as direct MIDI input, that's for sure! I think your goal of getting things working for a simpler case first is right, and I wish you the best of luck :)
Here is an example of raw audio to MIDI transcription: https://magenta.tensorflow.org/onsets-frames