Feature Engineering

Data leakage from poor train/validation/test splitting

When data is correlated with time, split it by time into train/validation/test sets instead of splitting randomly.
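
A minimal sketch of a chronological split, using only the Python standard library. The row layout, the `ts` field, and the cutoff dates are assumptions made for illustration; they are not from the original notes.

```python
from datetime import date

def split_by_time(rows, train_end, valid_end):
    """Chronological split: train < train_end <= valid < valid_end <= test."""
    train, valid, test = [], [], []
    for row in rows:
        ts = row["ts"]  # hypothetical timestamp field
        if ts < train_end:
            train.append(row)
        elif ts < valid_end:
            valid.append(row)
        else:
            test.append(row)
    return train, valid, test

# Ten toy rows, one per day (illustrative data only).
rows = [{"ts": date(2024, 1, d), "x": d} for d in range(1, 11)]
train, valid, test = split_by_time(rows, date(2024, 1, 7), date(2024, 1, 9))
```

Every validation and test row is later than every training row, so the model is never evaluated on data that precedes what it was trained on.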

Global statistics computed before splitting: statistics that should come from the train split alone then also carry information from the test split.
If you oversample your data, do it after splitting.
Data duplication before splitting: duplicates can land in both the train and test splits. Oversampling can also duplicate examples, so always check for duplicates.
Group leakage: strongly correlated examples from the same group end up in different splits.
Leakage from the data generation process: keep track of your data's lineage.
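
The ordering described above (deduplicate, split with groups kept whole, then oversample the train split only) can be sketched as follows. This is an illustration under assumptions: rows are tuples of `(group_id, feature, label)`, and the helper names `dedupe`, `split_by_group`, and `oversample` are hypothetical.

```python
import random

def dedupe(rows):
    """Drop exact duplicate rows, keeping the first occurrence."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def split_by_group(rows, group_of, test_fraction, seed=0):
    """Assign whole groups to train or test so no group spans both splits."""
    groups = sorted({group_of(r) for r in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, int(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if group_of(r) not in test_groups]
    test = [r for r in rows if group_of(r) in test_groups]
    return train, test

def oversample(train, label_of, seed=0):
    """Randomly repeat minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    by_label = {}
    for r in train:
        by_label.setdefault(label_of(r), []).append(r)
    target = max(len(members) for members in by_label.values())
    out = list(train)
    for members in by_label.values():
        out.extend(rng.choice(members) for _ in range(target - len(members)))
    return out

# Toy rows: (group_id, feature, label); the fourth row is an exact duplicate.
rows = [
    ("p1", 0.2, 0), ("p1", 0.3, 0), ("p2", 0.9, 1), ("p2", 0.9, 1),
    ("p3", 0.1, 0), ("p4", 0.4, 0), ("p5", 0.8, 1), ("p6", 0.5, 0),
]
unique = dedupe(rows)                                 # 1. remove duplicates first
train, test = split_by_group(unique, lambda r: r[0], test_fraction=0.3)  # 2. split
train_balanced = oversample(train, lambda r: r[2])    # 3. oversample train only
```

Because oversampling happens last and touches only `train`, the repeated rows it creates cannot end up in the test split.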

Detecting Data Leakage

Manual ablation tests
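
One way to run such a test is to remove one feature at a time and record how much a validation metric drops; an unusually large drop is a reason to check that feature for leaked label information. The sketch below is an illustration only: the nearest-centroid classifier, the helper names, and the toy data (where feature 1 is deliberately a copy of the label) are all assumptions, not part of the original notes.

```python
def centroid_accuracy(train, valid, feature_idx):
    """Accuracy of a nearest-centroid classifier restricted to feature_idx."""
    sums, counts = {}, {}
    for features, label in train:
        vec = [features[i] for i in feature_idx]
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, v in enumerate(vec):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

    def predict(features):
        vec = [features[i] for i in feature_idx]
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, centroids[lab])),
        )

    return sum(predict(f) == y for f, y in valid) / len(valid)

def ablation_report(train, valid, n_features):
    """Return baseline accuracy and the accuracy drop from removing each feature."""
    all_idx = list(range(n_features))
    baseline = centroid_accuracy(train, valid, all_idx)
    drops = {}
    for i in all_idx:
        kept = [j for j in all_idx if j != i]
        drops[i] = baseline - centroid_accuracy(train, valid, kept)
    return baseline, drops

# Toy data: feature 0 is uninformative, feature 1 is a leaked copy of the label.
train = [((1, 0), 0), ((9, 0), 0), ((2, 1), 1), ((8, 1), 1)]
valid = [((3, 0), 0), ((7, 1), 1), ((6, 0), 0), ((4, 1), 1)]
baseline, drops = ablation_report(train, valid, n_features=2)
```

Here removing feature 1 costs far more accuracy than removing feature 0, which is the pattern that should prompt a closer look at where feature 1 comes from.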

Handling Missing Values

Data leakage from filling in missing data with statistics from the test split
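
A minimal sketch of leakage-free mean imputation: the fill value is computed from the train split only and then reused for every other split. Representing missing values as `None`, and the helper names, are assumptions made for the example.

```python
from statistics import mean

def fit_mean(train_values):
    """Mean of the observed (non-None) training values."""
    return mean(v for v in train_values if v is not None)

def impute(values, fill):
    """Replace missing entries with a fill value computed elsewhere."""
    return [fill if v is None else v for v in values]

train_age = [20, None, 40, 30]   # illustrative values
test_age = [None, 100]

fill = fit_mean(train_age)              # statistic comes from train only
train_filled = impute(train_age, fill)
test_filled = impute(test_age, fill)    # test values never influence `fill`
```

The test value 100 does not affect the fill value, which is exactly what computing the mean over the full dataset before splitting would get wrong.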

Scaling and Normalization

Always split before scaling.

---	Explanation
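Why do we need it?	150,000 is much bigger than 40, so without scaling a model may give the larger-valued feature more importance
Different scaling techniques	Scale to a fixed range such as [0, 1] or [-1, 1], or standardize to zero mean and unit variance if the feature is roughly normally distributed
Con 1	Common source of data leakage
Con 2	New data may differ significantly from the training data, so the statistics used for scaling go stale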
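
A small sketch of standardization done after splitting, with the mean and standard deviation fitted on the train split only. The helper names and the numbers are illustrative assumptions.

```python
from statistics import mean, pstdev

def fit_standardizer(train_values):
    """Return (mean, std) computed from the train split only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        sigma = 1.0  # constant feature: avoid division by zero
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply previously fitted statistics to any split."""
    return [(v - mu) / sigma for v in values]

train_income = [10_000.0, 40_000.0, 150_000.0]
test_income = [500_000.0]   # far outside the training range

mu, sigma = fit_standardizer(train_income)
train_scaled = standardize(train_income, mu, sigma)
test_scaled = standardize(test_income, mu, sigma)
```

The scaled test value falls well outside the range seen in training, which illustrates Con 2: statistics fitted on old data describe new data poorly once the distribution shifts.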

Encoding Categorical Features

Feature Crossing

Good feature selection

Feature generalization

Feature coverage

Distribution of Feature Values

Try not to add a new feature until it is proven necessary. If adding a new feature significantly improves your model's performance, either that feature is really good or it contains leaked information about the labels.
The more features you have, the more room there is for data leakage.
