Feature Engineering

Data leakage from poor train/validation/test splitting

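When data is correlated with time, split it by time into train/valid/test splits instead of splitting randomly, so that the model is never trained on information from the future.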

  • Global statistics (mean, variance, etc.) computed before splitting contain information from the test data. Compute them from the train split only.
  • If you oversample your data, do it after splitting.
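  • Data duplication before splitting: the same example can end up in both the train and the test split. Oversampling can also duplicate examples, so always check for duplicates.
  • Group leakage: examples with strongly correlated labels (for example, several records from the same source) are divided across different splits.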
  • Leakage from data generation process. Keep track of your data's lineage.
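
The points above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a prescribed pipeline: the row layout (`day`, `x`, `y`), the cutoff dates, and the helper names `time_split`, `drop_duplicates`, and `oversample_minority` are all hypothetical.

```python
from datetime import date

def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence."""
    seen, unique = set(), []
    for r in rows:
        key = (r["day"], r["x"], r["y"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def time_split(rows, train_end, valid_end):
    """Chronological split: train <= train_end < valid <= valid_end < test."""
    train = [r for r in rows if r["day"] <= train_end]
    valid = [r for r in rows if train_end < r["day"] <= valid_end]
    test = [r for r in rows if r["day"] > valid_end]
    return train, valid, test

def oversample_minority(train, factor):
    """Repeat positive examples so each appears `factor` times (train only)."""
    minority = [r for r in train if r["y"] == 1]
    return train + minority * (factor - 1)

# Toy data: one row per day in January 2024 (hypothetical).
rows = [
    {"day": date(2024, 1, d), "x": float(d), "y": int(d % 5 == 0)}
    for d in range(1, 31)
]
rows.append(dict(rows[0]))  # simulate an accidental duplicate

rows = drop_duplicates(rows)                       # 1. deduplicate first
train, valid, test = time_split(                   # 2. split by time
    rows, date(2024, 1, 20), date(2024, 1, 25)
)
train_mean = sum(r["x"] for r in train) / len(train)  # 3. statistics from train only
train = oversample_minority(train, factor=3)       # 4. oversample after splitting
```

Because oversampling happens last and touches only `train`, the repeated examples cannot end up in the validation or test split.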

Detecting Data Leakage

  • Manual ablation tests: remove a feature (or a set of features) and measure how much the model's performance changes.
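
One way to picture an ablation test is sketched below. It is only an illustration under assumptions: the toy dataset, the feature names (`f0`, `f1`, `leak`), and the tiny nearest-centroid classifier are all made up for the example, and any real model and metric could take their place. The idea is to retrain without each feature in turn and compare the validation accuracy against the baseline; a feature whose removal causes an outsized drop deserves a closer look.

```python
def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for label in sorted(set(y)):
        members = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def predict(centroids, x):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(train, valid, keep):
    """Train on the columns in `keep` and score on the validation rows."""
    X_tr = [[row[i] for i in keep] for row, _ in train]
    y_tr = [label for _, label in train]
    centroids = fit_centroids(X_tr, y_tr)
    hits = sum(
        predict(centroids, [row[i] for i in keep]) == label
        for row, label in valid
    )
    return hits / len(valid)

def make_example(i):
    """Hypothetical row: two uninformative features plus a copy of the label."""
    label = i % 2
    return [(i % 7) / 7, (i % 3) / 3, float(label)], label

feature_names = ["f0", "f1", "leak"]
train = [make_example(i) for i in range(0, 42)]
valid = [make_example(i) for i in range(42, 84)]

all_columns = list(range(len(feature_names)))
baseline = accuracy(train, valid, all_columns)

# Ablate one feature at a time and record the drop in validation accuracy.
drops = {}
for j, name in enumerate(feature_names):
    keep = [k for k in all_columns if k != j]
    drops[name] = baseline - accuracy(train, valid, keep)
```

In this toy setup only removing `leak` lowers the accuracy, which is the pattern to investigate: either the feature is genuinely strong or it carries label information.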

Handling Missing Values

  • Data leakage from filling in missing values with statistics computed from the test split. Compute the fill values from the train split only.
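
A minimal sketch of leakage-free imputation, assuming missing values are represented as `None` and the median is the chosen fill statistic (both are assumptions for illustration, as are the helper names `fit_imputer` and `impute`):

```python
from statistics import median

def fit_imputer(train_values):
    """Compute the fill value from the observed train values only."""
    observed = [v for v in train_values if v is not None]
    return median(observed)

def impute(values, fill):
    """Replace missing entries with a precomputed fill value."""
    return [fill if v is None else v for v in values]

train_income = [40.0, None, 55.0, 60.0, None, 45.0]
test_income = [None, 500.0, 52.0]

fill = fit_imputer(train_income)           # learned from train only
train_filled = impute(train_income, fill)
test_filled = impute(test_income, fill)    # test reuses the train statistic
```

The test split's own values (including the outlier 500.0) never influence the fill value.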

Scaling and Normalization

  • Always split before scaling.
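  • Why do we need it? Without scaling, a value such as 150,000 is numerically much bigger than a value such as 40, so the model may treat the larger-scaled feature as more important.
  • Different scaling techniques: scale to a fixed range such as [0, 1] or [-1, 1], or standardize to zero mean and unit variance (standardization, if the feature is roughly normally distributed), etc.
  • Con 1: a common source of data leakage.
  • Con 2: it relies on global statistics, and new data may fall outside the statistics computed during training.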
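
Both scaling techniques and both cons can be seen in a small sketch. It assumes a single numeric feature held in plain lists; the values and the helper names (`fit_minmax`, `minmax_scale`, `fit_standard`, `standardize`) are hypothetical.

```python
from statistics import mean, pstdev

def fit_minmax(train):
    """Learn the range from the train split only."""
    return min(train), max(train)

def minmax_scale(values, lo, hi):
    """Map values so that the train range [lo, hi] becomes [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

def fit_standard(train):
    """Learn mean and (population) standard deviation from the train split."""
    return mean(train), pstdev(train)

def standardize(values, mu, sigma):
    """Shift and scale so the train split has zero mean and unit variance."""
    return [(v - mu) / sigma for v in values]

train_income = [40_000.0, 80_000.0, 120_000.0, 150_000.0]
test_income = [100_000.0, 200_000.0]

# Split first, then fit the scaler on train and reuse it on test.
lo, hi = fit_minmax(train_income)
train_scaled = minmax_scale(train_income, lo, hi)
test_scaled = minmax_scale(test_income, lo, hi)

mu, sigma = fit_standard(train_income)
train_std = standardize(train_income, mu, sigma)
test_std = standardize(test_income, mu, sigma)
```

Here the test value 200,000 lies above the largest train value, so its min-max scaled value exceeds 1. That is Con 2 in practice: statistics learned at training time can become stale when new data looks different.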

Encoding Categorical Features

Feature Crossing

Good feature selection

Feature generalization

Feature coverage

Distribution of Feature Values

  • Try not to add a new feature until it is proven necessary. If adding a new feature significantly improves your model's performance, either that feature is really good or it contains leaked information about the labels.
  • The more features you have, the more room there is for data leakage.