I'm curious about the long-context performance: did you evaluate on benchmarks such as RULER/HELMET, or just check perplexity? We've evaluated the 1B on HELMET at 32k and the results are worse than Qwen/Llama or SmolLM-16k.

Also, did you only extend the context during finetuning, or did you run a long-context extension stage at the end of pre-training? It seems like the former works better, but I'm not sure that holds for small models.
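
For clarity, this is roughly what I mean by the "just check perplexity" option, as opposed to a RULER/HELMET-style benchmark. It's only an illustrative sketch with placeholder model/file names, not a claim about your setup:

```python
# Rough sketch of a "just check perplexity" long-context eval (placeholder names:
# swap in the actual checkpoint and a long text file). Assumes transformers + torch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-1b-model"          # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda").eval()

# Take a single 32k-token window from a long document.
text = open("long_document.txt").read()      # placeholder corpus
ids = tok(text, return_tensors="pt").input_ids[:, :32768].to("cuda")

# Causal-LM loss with labels = inputs is the mean token-level cross-entropy;
# exp(loss) is the perplexity over the window.
with torch.no_grad():
    loss = model(ids, labels=ids).loss
print(f"ppl@32k: {torch.exp(loss).item():.2f}")
```

A window like this can show flat perplexity even when retrieval/recall tasks of the RULER/HELMET kind still degrade, which is why I'm asking which of the two you looked at.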