It's not strange at all to get something working before you start optimizing. If you can only run small models, how would you even know what you're losing by optimizing for size? Heck, you wouldn't even know how the model behaves, so you wouldn't know where to start shaving things away.
I'm not saying it's impossible, but if resources allow, it makes a lot of sense to start with the biggest model you can still train. Especially since, for whatever reason, things seem to get a lot easier if you simply throw more compute at the problem (kind of like how, no matter how advanced your caching algorithm, it can't beat the simplest LRU policy with double the amount of cache by more than a factor of 2 in misses, per the classic Sleator–Tarjan competitive-analysis result).
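For concreteness, "the simplest LRU algorithm" really is as dumb as it sounds. Here's a minimal sketch in Python (the `LRUCache` name and `capacity` parameter are just illustrative, not from any particular library):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()  # insertion order doubles as recency order

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
```

That's the whole policy, and per the bound above, no amount of cleverness in the eviction logic buys you more than what doubling `capacity` does.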