Data Commit Types
Reflect on which commit types would work best for tracking changes to a dataset under development in a changelog. Define a standard set of commit types for datasets.
Commit type standard for datasets
ChatGPT Q/A
Here are commit types for application development:
{ message = ".*BREAKING", group = "<!-- 0 -->🟡 BREAKING CHANGES" },
{ message = "^feat", group = "<!-- 1 -->🚀 Features" },
{ message = "^fix", group = "<!-- 2 -->💊 Bug Fixes" },
{ message = "^test", group = "<!-- 3 -->🧪 Testing" },
{ message = "^ci", group = "<!-- 4 -->🛠 Continuous Integration" },
{ message = "^perf", group = "<!-- 5 -->⚡️ Performance" },
{ body = ".*security", group = "<!-- 6 -->🛡️ Security" },
{ message = "^refactor", group = "<!-- 7 -->🚜 Refactor" },
{ message = "^chore", group = "<!-- 8 -->⚙️ Chore" },
{ message = "^style", group = "<!-- 9 -->🎨 Styling" },
{ message = "^doc", group = "<!-- 10 -->📚 Documentation" },
{ message = "^revert", group = "<!-- 11 -->⏪ Revert" },
What commit types can you suggest for dataset development (for training a language model)?
For dataset development, especially for training a language model, you can use commit types similar to those in application development, but tailored to dataset-specific changes. Here are some commit types that might be relevant:
- data - For changes or updates to the dataset:
  - Group: 📊 Dataset Updates
  - Example: { message = "^data", group = "📊 Dataset Updates" }
- schema - For modifications to the dataset structure:
  - Group: 🗂 Schema Changes
  - Example: { message = "^schema", group = "🗂 Schema Changes" }
- label - For adding or modifying labels or annotations in the dataset:
  - Group: 🏷️ Label Updates
  - Example: { message = "^label", group = "🏷️ Label Updates" }
- source - For changes to the dataset's source, such as fetching data from new locations:
  - Group: 🌍 Source Changes
  - Example: { message = "^source", group = "🌍 Source Changes" }
- preprocess - For changes related to data preprocessing:
  - Group: 🔧 Preprocessing
  - Example: { message = "^preprocess", group = "🔧 Preprocessing" }
- augment - For applying data augmentation techniques:
  - Group: 📈 Data Augmentation
  - Example: { message = "^augment", group = "📈 Data Augmentation" }
- sample - For adding or modifying samples in the dataset:
  - Group: 🧪 Sample Updates
  - Example: { message = "^sample", group = "🧪 Sample Updates" }
- eval - For changes related to evaluation data or metrics:
  - Group: 📊 Evaluation Data
  - Example: { message = "^eval", group = "📊 Evaluation Data" }
- metadata - For changes to metadata such as dataset description, versioning, etc.:
  - Group: 📝 Metadata Updates
  - Example: { message = "^metadata", group = "📝 Metadata Updates" }
- split - For changes in the train/validation/test splits of the dataset:
  - Group: 🔀 Data Splits
  - Example: { message = "^split", group = "🔀 Data Splits" }
These commit types should help organize and track changes in dataset development, especially when working on a project involving training language models.
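Each entry above pairs a regular expression with a changelog group. As a rough illustration of how such patterns sort commit messages into groups, here is a minimal Python sketch. It is not how any particular changelog tool is implemented; the function names `classify` and `group_commits` and the commit messages in the usage note are hypothetical.

```python
import re
from collections import defaultdict

# (pattern, group) pairs copied from the dataset commit types listed above.
COMMIT_PARSERS = [
    (r"^data", "📊 Dataset Updates"),
    (r"^schema", "🗂 Schema Changes"),
    (r"^label", "🏷️ Label Updates"),
    (r"^source", "🌍 Source Changes"),
    (r"^preprocess", "🔧 Preprocessing"),
    (r"^augment", "📈 Data Augmentation"),
    (r"^sample", "🧪 Sample Updates"),
    (r"^eval", "📊 Evaluation Data"),
    (r"^metadata", "📝 Metadata Updates"),
    (r"^split", "🔀 Data Splits"),
]


def classify(message):
    """Return the group of the first pattern that matches, or None."""
    for pattern, group in COMMIT_PARSERS:
        if re.search(pattern, message):
            return group
    return None


def group_commits(messages):
    """Bucket commit messages by group; unmatched messages are skipped."""
    grouped = defaultdict(list)
    for message in messages:
        group = classify(message)
        if group is not None:
            grouped[group].append(message)
    return dict(grouped)
```

For example, `classify("split: move 5% of train to validation")` returns the Data Splits group, while `classify("chore: bump tooling")` returns `None`. Because `^data` is only a prefix match, a message starting with `dataset:` would also fall under Dataset Updates; a stricter pattern such as `^data(\(.+\))?:` is one way to avoid that if it is not wanted.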