SEAI Language Model
Software Engineering Language Model
Training
...
Transparent Reproducible Training
...
Reproducibility
...
Data
- All data used to train the LM, including human feedback data, should be collected, aggregated, and stored in a structured, consistently formatted way (following a defined data format) so that it can be reused in later training runs.
- Data must be versioned and released as a dataset with git tags so that training stays reproducible. This makes it possible to train one model now and another later and compare their performance, or to build your own model from scratch on exactly the same data. This requirement motivates Transparent Reproducible Training.
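As a rough illustration of the two points above, the sketch below serializes records in one fixed format and derives a content fingerprint that could be recorded next to a release tag. It is only a sketch under assumptions: the `Record` fields, the JSON Lines layout, and the helper names `to_jsonl` and `fingerprint` are hypothetical and are not the format this project defines.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Record:
    """One training example. All field names here are hypothetical."""

    source: str                     # where the example was collected
    prompt: str                     # input text
    response: str                   # target or model output
    feedback: Optional[str] = None  # optional human feedback label


def to_jsonl(records: Iterable[Record]) -> str:
    """Serialize records as JSON Lines with sorted keys, so the same
    records always produce byte-identical output."""
    return "".join(
        json.dumps(asdict(r), sort_keys=True, ensure_ascii=False) + "\n"
        for r in records
    )


def fingerprint(records: Iterable[Record]) -> str:
    """Return a SHA-256 hex digest of the serialized dataset."""
    return hashlib.sha256(to_jsonl(records).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    data = [
        Record(source="manual", prompt="Reverse a list", response="xs[::-1]"),
        Record(source="review", prompt="Add two ints", response="a + b",
               feedback="accepted"),
    ]
    print(to_jsonl(data), end="")
    print("fingerprint:", fingerprint(data))
```

Storing the fingerprint alongside a release tag is one possible way to check later that a training run used exactly the tagged data. The digest depends on record order, so the order would need to be fixed as part of the format.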
Format
...
Collection
...
Aggregation
...
Storing
Hugging Face repository
Versioning
...
Releasing
...