Feature Engineering
Data leakage from poor train/validation/test splitting
When the data is time-correlated, split it by time into train/valid/test splits instead of splitting randomly.
- Global statistics (e.g., mean, variance) computed before splitting also contain information from the test data. Compute them from the train split only, after splitting.
- If you oversample your data, do it after splitting.
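- Data duplication before splitting: duplicates or near-duplicates can end up in both the train and the valid/test splits. Oversampling can also duplicate examples, so always check for duplicates.
- Group leakage: strongly correlated examples (e.g., several records from the same source) end up in different splits.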
- Leakage from data generation process. Keep track of your data's lineage.
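
The splitting advice above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the toy `rows`, the field names (`timestamp`, `x`, `label`), the cutoff values, and the helpers `split_by_time`, `oversample_minority`, and `row_key` are all hypothetical names chosen for the example.

```python
import random
from statistics import mean


def split_by_time(rows, valid_start, test_start, key="timestamp"):
    """Chronological split: oldest rows go to train, newest rows to test."""
    train = [r for r in rows if r[key] < valid_start]
    valid = [r for r in rows if valid_start <= r[key] < test_start]
    test = [r for r in rows if r[key] >= test_start]
    return train, valid, test


def oversample_minority(rows, label_key="label", seed=0):
    """Randomly repeat minority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(rows)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(rows) + extra


def row_key(row):
    """Hashable representation of a row, used to detect exact duplicates."""
    return tuple(sorted(row.items()))


# Hypothetical toy data: one record per day.
rows = [
    {"timestamp": day, "x": float(day), "label": 1 if day % 5 == 0 else 0}
    for day in range(10)
]

# 1. Split by time first.
train, valid, test = split_by_time(rows, valid_start=6, test_start=8)

# 2. Check that no identical record sits in train and in valid/test.
train_keys = {row_key(r) for r in train}
overlap = [r for r in valid + test if row_key(r) in train_keys]

# 3. Compute global statistics from the train split only.
train_mean_x = mean(r["x"] for r in train)

# 4. Oversample the train split only, after splitting.
train_balanced = oversample_minority(train)
```

The order of the steps is the point of the sketch: anything that looks at the data (statistics, duplicate removal, oversampling) happens after the split, and valid/test are never touched by oversampling.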
Detecting Data Leakage
- Manual ablation tests
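
One way to picture an ablation test: remove one feature at a time, retrain, and see how much the validation score drops. A feature whose removal causes an unusually large drop deserves a closer look. The sketch below assumes a caller-supplied `train_and_score` function; `toy_score`, the feature names, and the 20-point threshold are made up purely for illustration and stand in for real model training.

```python
def ablation_report(feature_names, train_and_score):
    """Return {feature: score drop when that feature is removed}.

    train_and_score is a caller-supplied function that trains a model on the
    given feature subset and returns a validation score.
    """
    baseline = train_and_score(list(feature_names))
    drops = {}
    for name in feature_names:
        remaining = [f for f in feature_names if f != name]
        drops[name] = baseline - train_and_score(remaining)
    return drops


def toy_score(features):
    """Hypothetical stand-in for training: accuracy in percent."""
    contribution = {"age": 5, "income": 10, "leaky_id": 30}
    return 50 + sum(contribution[f] for f in features)


drops = ablation_report(["age", "income", "leaky_id"], toy_score)

# Flag features whose removal costs more than an (arbitrary) threshold.
suspects = [name for name, drop in drops.items() if drop >= 20]
```

A flagged feature is not necessarily leaked; it may simply be a very good feature. The test only tells you where to investigate.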
Handling Missing Values
- Data leakage from filling in missing data with statistics that include the test split. Use statistics from the train split only.
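
As a minimal sketch of leak-free imputation, the fill value is computed from the train split and then reused unchanged for the test split. Mean imputation is just one possible choice here, and `fit_mean_imputer`, `impute`, and the toy values are hypothetical names for the example.

```python
from statistics import mean


def fit_mean_imputer(train_values):
    """Compute the fill value from the observed (non-missing) train values only."""
    observed = [v for v in train_values if v is not None]
    return mean(observed)


def impute(values, fill_value):
    """Replace missing entries (None) with a previously fitted fill value."""
    return [fill_value if v is None else v for v in values]


# Hypothetical toy splits; None marks a missing value.
train_x = [1.0, None, 3.0, 5.0]
test_x = [None, 100.0]

fill = fit_mean_imputer(train_x)      # fitted on train only
train_filled = impute(train_x, fill)
test_filled = impute(test_x, fill)    # test reuses the train statistic
```

Note that the test value `100.0` never influences `fill`; computing the mean over train and test together would have pulled the fill value toward it.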
Scaling and Normalization
- Always split before scaling.
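Topic | Explanation |
---|---|
Why do we need it? | A value like 150,000 is numerically much bigger than a value like 40, so without scaling the model may give the larger-scaled feature more importance |
Different scaling techniques | Min-max scaling to a fixed range such as [0, 1] or [-1, 1]; standardization to zero mean and unit variance (useful when the feature is roughly normally distributed), etc. |
Con 1 | Common source of data leakage |
Con 2 | Statistics computed from the training data may no longer fit new data whose distribution has changed |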
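
The two scaling techniques in the table can be sketched as follows, with the statistics fitted on the train split and then applied to new data. The function names and toy values are hypothetical, and the population standard deviation (`pstdev`) is an assumption of this sketch, not something the notes specify.

```python
from statistics import mean, pstdev


def fit_min_max(train_values):
    """Min and max of the train split."""
    return min(train_values), max(train_values)


def min_max_scale(values, lo, hi, a=0.0, b=1.0):
    """Map the train range [lo, hi] to [a, b].

    Values outside the train range land outside [a, b].
    """
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]


def fit_standardizer(train_values):
    """Mean and (population) standard deviation of the train split."""
    return mean(train_values), pstdev(train_values)


def standardize(values, mu, sigma):
    """Shift to zero mean and scale to unit variance using train statistics."""
    return [(v - mu) / sigma for v in values]


# Hypothetical toy data.
train_x = [0.0, 50.0, 100.0]
new_x = [150.0]  # arrives after training, outside the train range

lo, hi = fit_min_max(train_x)
train_01 = min_max_scale(train_x, lo, hi)              # range [0, 1]
train_pm1 = min_max_scale(train_x, lo, hi, -1.0, 1.0)  # range [-1, 1]
new_01 = min_max_scale(new_x, lo, hi)                  # exceeds 1

mu, sigma = fit_standardizer(train_x)
train_std = standardize(train_x, mu, sigma)
```

`new_01` illustrates Con 2: the new value falls outside the range seen during training, so the train-time statistics no longer describe it well.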
Encoding Categorical Features
Feature Crossing
Good feature selection
Feature generalization
Feature coverage
Distribution of Feature Values
- Try not to add a new feature until it is proven necessary. If adding a new feature significantly improves your model's performance, either that feature is really good or it contains leaked information about the labels.
- The more features you have, the more room there is for data leakage.