SEAI Language Model

Software Engineering Language Model

Training

...

...

...

All data used to train the LM, including human feedback data, should be collected, aggregated, and stored in a structured, consistently formatted way (following a defined data format) so that it can be reused in later training runs.
The data has to be versioned and released as a dataset with git tags so that training stays reproducible. (What if I want to train one model now and another one later, then compare their performance? Or build my own model from scratch?) Introduce Transparent Reproducible Training.
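The storage and versioning requirements above can be sketched roughly as follows. This is a minimal illustration, not the project's actual data format: the `TrainingRecord` fields, the JSON Lines layout, and the manifest keys are all assumptions made for the example. The idea is that each dataset release is serialized deterministically and described by a manifest whose version matches the git tag, so a training run can pin and verify the exact data it used.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TrainingRecord:
    # Hypothetical schema: the source does not specify field names.
    prompt: str
    response: str
    source: str
    human_feedback: Optional[str] = None


def serialize_dataset(records: List[TrainingRecord]) -> str:
    """Serialize records as JSON Lines with sorted keys, so identical
    records always produce byte-identical output."""
    return "".join(
        json.dumps(asdict(r), sort_keys=True, ensure_ascii=False) + "\n"
        for r in records
    )


def build_manifest(version: str, records: List[TrainingRecord]) -> dict:
    """Describe one dataset release. `version` is assumed to match the
    git tag under which the dataset is released."""
    payload = serialize_dataset(records).encode("utf-8")
    return {
        "version": version,
        "num_records": len(records),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


records = [
    TrainingRecord("Reverse a list in Python", "Use reversed(xs)", "manual"),
    TrainingRecord("Explain git tags", "A tag names a commit", "manual",
                   human_feedback="accepted"),
]
manifest = build_manifest("v0.1.0", records)
```

In this sketch, the serialized file and the manifest would be committed together and the commit tagged with the same version string (for example with `git tag -a v0.1.0`). A training run that records the tag and the `sha256` value can later be compared against, or repeated on, exactly the same data.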

...

...

...

Hugging Face repository

...

...