Data Commit Types

Consider which commit types would best track changes to a dataset under development in a changelog, then define a commit type standard for datasets.

Commit type standard for datasets

ChatGPT Q/A

Here are the commit types used for application development:

{ message = ".*BREAKING", group = "<!-- 0 -->🟡 BREAKING CHANGES" },
{ message = "^feat", group = "<!-- 1 -->🚀  Features" },
{ message = "^fix", group = "<!-- 2 -->💊  Bug Fixes" },
{ message = "^test", group = "<!-- 3 -->🧪  Testing" },
{ message = "^ci", group = "<!-- 4 -->🛠  Continuous Integration" },
{ message = "^perf", group = "<!-- 5 -->⚡️  Performance" },
{ body = ".*security", group = "<!-- 6 -->🛡️  Security" },
{ message = "^refactor", group = "<!-- 7 -->🚜  Refactor" },
{ message = "^chore", group = "<!-- 8 -->⚙️  Chore" },
{ message = "^style", group = "<!-- 9 -->🎨 Styling" },
{ message = "^doc", group = "<!-- 10 -->📚 Documentation" },
{ message = "^revert", group = "<!-- 11 -->⏪ Revert" },

What commit types can you suggest for dataset development (for training a language model)?

For dataset development, especially for training a language model, you can use commit types similar to those in application development, but tailored to dataset-specific changes. Here are some commit types that might be relevant:

data - For changes or updates to the dataset:
- Group: 📊 Dataset Updates
- Example: { message = "^data", group = "📊 Dataset Updates" }
schema - For modifications to the dataset's structure (descriptive metadata has its own type):
- Group: 🗂 Schema Changes
- Example: { message = "^schema", group = "🗂 Schema Changes" }
label - For adding or modifying labels or annotations in the dataset:
- Group: 🏷️ Label Updates
- Example: { message = "^label", group = "🏷️ Label Updates" }
source - For changes to the dataset's source, such as fetching data from new locations:
- Group: 🌍 Source Changes
- Example: { message = "^source", group = "🌍 Source Changes" }
preprocess - For changes to data preprocessing steps, excluding augmentation, which has its own type:
- Group: 🔧 Preprocessing
- Example: { message = "^preprocess", group = "🔧 Preprocessing" }
augment - For applying data augmentation techniques:
- Group: 📈 Data Augmentation
- Example: { message = "^augment", group = "📈 Data Augmentation" }
sample - For adding or modifying samples in the dataset:
- Group: 🧪 Sample Updates
- Example: { message = "^sample", group = "🧪 Sample Updates" }
eval - For changes related to evaluation data or metrics:
- Group: 📊 Evaluation Data
- Example: { message = "^eval", group = "📊 Evaluation Data" }
metadata - For changes to metadata such as dataset description, versioning, etc.:
- Group: 📝 Metadata Updates
- Example: { message = "^metadata", group = "📝 Metadata Updates" }
split - For changes in the train/validation/test splits of the dataset:
- Group: 🔀 Data Splits
- Example: { message = "^split", group = "🔀 Data Splits" }

These commit types should help organize and track changes in dataset development, especially when working on a project involving training language models.
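
Collected into one parser list, the suggested types might look like the sketch below. It assumes the same changelog generator and `{ message, group }` parser syntax as the application list above, placed under a `[git]` table as `commit_parsers`. The `<!-- N -->` ordering prefixes are an assumption added here to mirror the application config; the suggested examples above do not include them, and the chosen order is arbitrary.

```toml
# Sketch of a dataset-oriented parser list (assumed file: cliff.toml).
# The <!-- N --> prefixes only control section order in the changelog;
# their numbering is an assumption, not part of the suggested standard.
[git]
conventional_commits = true
commit_parsers = [
  { message = "^data", group = "<!-- 0 -->📊 Dataset Updates" },
  { message = "^schema", group = "<!-- 1 -->🗂 Schema Changes" },
  { message = "^label", group = "<!-- 2 -->🏷️ Label Updates" },
  { message = "^source", group = "<!-- 3 -->🌍 Source Changes" },
  { message = "^preprocess", group = "<!-- 4 -->🔧 Preprocessing" },
  { message = "^augment", group = "<!-- 5 -->📈 Data Augmentation" },
  { message = "^sample", group = "<!-- 6 -->🧪 Sample Updates" },
  { message = "^eval", group = "<!-- 7 -->📊 Evaluation Data" },
  # "^metadata" cannot be captured by "^data" because both patterns are anchored at the start.
  { message = "^metadata", group = "<!-- 8 -->📝 Metadata Updates" },
  { message = "^split", group = "<!-- 9 -->🔀 Data Splits" },
]
```

With a list like this, a commit such as `split: move 5% of train into validation` (a made-up message) would be expected to land under "🔀 Data Splits".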
