Hi, I was assigned to a new team as a lead 2 weeks ago. I am not familiar with the team's existing system, and there are many bits and pieces that I find non-ideal.
For example, the existing monitoring and alerting are lacking and noisy at the same time, integration tests are non-existent, and unit test coverage is low, with many important code paths untested.
How should I tackle this tech debt and still deliver business features? Happy to hear your thoughts.
Beyond that, the standard exercise of evaluating where you are, where you're going, and how to evolve the system to get there is always necessary.
For monitoring and alerting - aim for actionable alerts first (paging on-call for an event, whether known or unknown), then warnings second (a Slack channel message, tied to SLOs, not SLAs). It's better to start out more aware than unaware, as that will help your team quantify the situation, so play with the thresholds.
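To make the two tiers concrete, here is a minimal sketch of how that split might look in code. Everything in it is an assumption for illustration: the `Thresholds` fields, the choice of error rate as the signal, and the example numbers are hypothetical, not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    WARN = "warn"  # hypothetical: post a message to the team's chat channel
    PAGE = "page"  # hypothetical: page whoever is on call


@dataclass(frozen=True)
class Thresholds:
    # Both values are illustrative and meant to be tuned over time.
    warn_error_rate: float  # SLO-style level worth a heads-up
    page_error_rate: float  # level that needs a human to act now


def classify(errors: int, requests: int, t: Thresholds) -> Action:
    """Map an error count over a window to the alert tier it warrants."""
    if requests == 0:
        return Action.NONE  # no traffic in the window, nothing to judge
    rate = errors / requests
    if rate >= t.page_error_rate:
        return Action.PAGE
    if rate >= t.warn_error_rate:
        return Action.WARN
    return Action.NONE


if __name__ == "__main__":
    thresholds = Thresholds(warn_error_rate=0.01, page_error_rate=0.05)
    for errors in (0, 2, 10):
        print(errors, classify(errors, 100, thresholds).value)
```

Starting with a low `warn_error_rate` errs on the side of awareness; once the team has a feel for normal behaviour, both thresholds can be tightened or loosened to cut the noise.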
Testing is always a balance - I'd suggest ensuring that the critical paths of your system are covered first, and only then striving to do better from there.
As for scheduling the tech debt work, aim to include it in the requirements for future work: you must address X to build Y, and so on. Addressing technical debt as a standalone project that leads to no new end result (eliminating toil counts as one) has little, but not zero, value.
[1] https://landing.google.com/sre/sre-book/chapters/eliminating...