Self notes
- The analysis itself is secondary. The base data, meaning the columns and key facts you collect, matters much more.
- E.g., when predicting the sales of a list of stores in a city: count of items in store, store area (sq ft), outlet type, city type
Preparations
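- Deal with null values (substitute them with the column mean for numeric columns)
- Convert categorical columns into one-hot encoding. Note that LabelEncoder (imported below) maps each category to an integer label rather than producing one-hot columns.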
from sklearn.preprocessing import LabelEncoder
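A minimal sketch of the two preparation steps above, assuming a toy DataFrame whose column names (item_count, outlet_type, sales) are made up for illustration. It uses pandas' get_dummies for the one-hot step instead of LabelEncoder, since get_dummies gives one 0/1 column per category directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "item_count": [120.0, np.nan, 80.0, 100.0],
    "outlet_type": ["grocery", "supermarket", "grocery", "supermarket"],
    "sales": [2500.0, 4100.0, 1900.0, 3900.0],
})

# Fill nulls in numeric columns with that column's mean.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# One-hot encode the categorical column: one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["outlet_type"], dtype=int)
```

Mean imputation is the simplest option; for heavily skewed columns the median may be a safer substitute.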
Definitions
Target variable
- The variable you want to predict with your analysis
ID columns
- Remove ‘ID’ type columns from your DF, as they do not add any value to the predictive model
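A small sketch of the two definitions above: separating the target variable from the features and dropping the ID column. The column names (store_id, store_area_sqft, sales) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "store_id": ["S001", "S002", "S003"],   # ID column: unique per row, no predictive value
    "store_area_sqft": [1200, 3400, 800],
    "sales": [2500.0, 4100.0, 1900.0],      # target variable
})

# y is the target; X holds the features with the ID and target columns removed.
y = df["sales"]
X = df.drop(columns=["store_id", "sales"])
```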
Process
- PreProcess
- Bring all columns to the same non-null value count
- Explore
- Displot each column (see the value distribution)
Explore
Counts
In this phase, we want to make sure the values are uniformly distributed
Displot
- Here we want to see how the values in the column are distributed
- Its skew (center, left, right)
- If right-skewed (long right tail), apply a log transformation. A log transform compresses large values, so it does not fix a left skew.
np.log(1+df['sales'])
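A quick sketch for checking the effect of the log transform, assuming a made-up right-skewed series. np.log1p(x) computes log(1 + x), the same as the expression above, and np.expm1 reverses it so predictions can be mapped back to the original scale.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sales values (a few very large stores).
sales = pd.Series([10, 20, 40, 80, 160, 320, 640, 1280], dtype=float)

skew_before = sales.skew()        # clearly positive: long right tail

log_sales = np.log1p(sales)       # same as np.log(1 + sales)
skew_after = log_sales.skew()     # much closer to 0 after the transform

# expm1 is the inverse of log1p; use it to get back to the original units.
restored = np.expm1(log_sales)
```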
Countplot
sns.countplot(data=df, x='outlet_type')  # x = any categorical column
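The numbers behind a countplot can also be read off directly with pandas. A minimal sketch, assuming a hypothetical outlet_type column; this is a way to inspect the counts, not a replacement for the plot.

```python
import pandas as pd

# Hypothetical categorical column; the values are illustrative only.
df = pd.DataFrame({
    "outlet_type": ["grocery", "supermarket", "grocery",
                    "grocery", "supermarket", "kiosk"],
})

# Rows per category, most frequent first (what the countplot draws as bars).
counts = df["outlet_type"].value_counts()

# The same as fractions of the total, handy for spotting rare categories.
shares = df["outlet_type"].value_counts(normalize=True)
```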
Correlation
sns.heatmap(df.corr())
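A sketch of the correlation step on made-up data, with assumed column names. Recent pandas versions raise an error when df.corr() meets text columns, so this version selects the numeric columns first; the resulting matrix is what sns.heatmap would then draw.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "store_area_sqft": [800, 1200, 2000, 3400],
    "item_count": [60, 90, 150, 240],
    "outlet_type": ["grocery", "grocery", "supermarket", "supermarket"],
    "sales": [1900.0, 2500.0, 3300.0, 4100.0],
})

# Pearson correlation between numeric columns only (text columns are skipped).
corr = df.select_dtypes(include="number").corr()

# Correlation of each feature with the target, strongest first.
target_corr = corr["sales"].drop("sales").sort_values(ascending=False)
```

Sorting the target's column of the matrix is often quicker to read than the full heatmap when there are many features.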