Data Commit Types

Reflect on which commit types would be best for tracking changes to a dataset under development in a changelog. Define a commit type standard for datasets.

Commit type standard for datasets

ChatGPT Q/A

Here are commit types for application development:

{ message = ".*BREAKING", group = "<!-- 0 -->🟡 BREAKING CHANGES" },
{ message = "^feat", group = "<!-- 1 -->🚀  Features" },
{ message = "^fix", group = "<!-- 2 -->💊  Bug Fixes" },
{ message = "^test", group = "<!-- 3 -->🧪  Testing" },
{ message = "^ci", group = "<!-- 4 -->🛠  Continuous Integration" },
{ message = "^perf", group = "<!-- 5 -->⚡️  Performance" },
{ body = ".*security", group = "<!-- 6 -->🛡️  Security" },
{ message = "^refactor", group = "<!-- 7 -->🚜  Refactor" },
{ message = "^chore", group = "<!-- 8 -->⚙️  Chore" },
{ message = "^style", group = "<!-- 9 -->🎨 Styling" },
{ message = "^doc", group = "<!-- 10 -->📚 Documentation" },
{ message = "^revert", group = "<!-- 11 -->⏪ Revert" },

What commit types can you suggest for dataset development (for training a language model)?


For dataset development, especially for training a language model, you can use commit types similar to those in application development, but tailored to dataset-specific changes. Here are some commit types that might be relevant:

  1. data - For changes or updates to the dataset:

    • Group: 📊 Dataset Updates
    • Example: { message = "^data", group = "📊 Dataset Updates" }
  2. schema - For modifications to the dataset's structure, such as its fields and their types:

    • Group: 🗂 Schema Changes
    • Example: { message = "^schema", group = "🗂 Schema Changes" }
  3. label - For adding or modifying labels or annotations in the dataset:

    • Group: 🏷️ Label Updates
    • Example: { message = "^label", group = "🏷️ Label Updates" }
  4. source - For changes to the dataset's source, such as fetching data from new locations:

    • Group: 🌍 Source Changes
    • Example: { message = "^source", group = "🌍 Source Changes" }
  5. preprocess - For changes to data preprocessing steps (augmentation is covered by its own type, augment):

    • Group: 🔧 Preprocessing
    • Example: { message = "^preprocess", group = "🔧 Preprocessing" }
  6. augment - For applying data augmentation techniques:

    • Group: 📈 Data Augmentation
    • Example: { message = "^augment", group = "📈 Data Augmentation" }
  7. sample - For adding or modifying samples in the dataset:

    • Group: 🧪 Sample Updates
    • Example: { message = "^sample", group = "🧪 Sample Updates" }
  8. eval - For changes related to evaluation data or metrics:

    • Group: 📊 Evaluation Data
    • Example: { message = "^eval", group = "📊 Evaluation Data" }
  9. metadata - For changes to dataset metadata, such as the description or version information:

    • Group: 📝 Metadata Updates
    • Example: { message = "^metadata", group = "📝 Metadata Updates" }
  10. split - For changes in the train/validation/test splits of the dataset:

    • Group: 🔀 Data Splits
    • Example: { message = "^split", group = "🔀 Data Splits" }

These commit types should help organize and track changes in dataset development, especially when working on a project involving training language models.
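
To make the mapping concrete, here is a minimal Python sketch of how a parser table like the one above could sort commit messages into changelog groups. It is an illustration only, not the behavior of any particular changelog tool: the names `DATASET_PARSERS`, `classify`, and `build_changelog` are hypothetical, and the "first matching pattern wins" rule is an assumption.

```python
import re
from collections import defaultdict

# Hypothetical parser table mirroring the ten dataset commit types above.
# Each entry is (regex applied to the commit message, changelog group).
DATASET_PARSERS = [
    (r"^data", "📊 Dataset Updates"),
    (r"^schema", "🗂 Schema Changes"),
    (r"^label", "🏷️ Label Updates"),
    (r"^source", "🌍 Source Changes"),
    (r"^preprocess", "🔧 Preprocessing"),
    (r"^augment", "📈 Data Augmentation"),
    (r"^sample", "🧪 Sample Updates"),
    (r"^eval", "📊 Evaluation Data"),
    (r"^metadata", "📝 Metadata Updates"),
    (r"^split", "🔀 Data Splits"),
]


def classify(message, parsers=DATASET_PARSERS):
    """Return the group of the first parser whose regex matches, else None."""
    for pattern, group in parsers:
        if re.search(pattern, message):
            return group
    return None


def build_changelog(messages, parsers=DATASET_PARSERS):
    """Group commit messages by changelog section, in parser-table order."""
    grouped = defaultdict(list)
    for message in messages:
        group = classify(message, parsers)
        if group is not None:  # messages with no matching type are skipped
            grouped[group].append(message)
    # Emit sections in the order the parsers are declared.
    ordered = {}
    for _, group in parsers:
        if group in grouped and group not in ordered:
            ordered[group] = grouped[group]
    return ordered
```

With this sketch, `classify("split: move 5% of train to validation")` returns the Data Splits group, and a message with no recognized prefix returns `None`. Note that a bare prefix pattern such as `^data` also matches messages starting with `dataset` or `database`; a stricter pattern such as `^data(\(.+\))?:` avoids that if it matters for your history.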