The 12 samples were enough to teach the 8B model just the 'format' of our desired outputs; we didn't train it on any factual knowledge base.
By using LoRA, we were training only about 0.8% of the 8B (quantized) model's total parameters, which comes out to roughly 65 million. Since that's a small number, I think 12 samples were sufficient.
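For context, here's a minimal sketch of a QLoRA-style setup that lands in that ballpark of trainable parameters. The base-model name, rank, and target modules below are illustrative assumptions, not necessarily the exact config we used:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Meta-Llama-3.1-8B"  # assumed base model

# Load the base model in 4-bit so only the small LoRA adapters are trained.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Illustrative LoRA config; rank/alpha/target modules are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Reports the trainable-parameter count and its percentage of the total.
model.print_trainable_parameters()
```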
But if we talk about 405B params, even 0.8% of those works out to about 3.24 billion params, which is nowhere near 65M. To effectively tune that many params, I think we would definitely need a larger number of training samples.
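A quick back-of-the-envelope check of those two figures:

```python
# 0.8% of each base model's parameter count.
for total in (8e9, 405e9):
    trainable = 0.008 * total
    print(f"{total / 1e9:.0f}B model -> ~{trainable / 1e6:,.0f}M trainable params")

# 8B model   -> ~64M trainable params   (the ~65M figure above)
# 405B model -> ~3,240M trainable params (i.e. ~3.24B)
```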