Predicting_Income_Project¶

1. Introduction to the Dataset¶

In [1]:
import pandas as pd
In [2]:
df = pd.read_csv('income.csv')
In [3]:
for col in df.columns:
    counts = df[col].value_counts()
    print(f'dataframe[{col}]')
    print(counts)
    print('\n')
dataframe[age]
36    1348
35    1337
33    1335
23    1329
31    1325
      ... 
88       6
85       5
87       3
89       2
86       1
Name: age, Length: 74, dtype: int64


dataframe[workclass]
Private             33906
Self-emp-not-inc     3862
Local-gov            3136
?                    2799
State-gov            1981
Self-emp-inc         1695
Federal-gov          1432
Without-pay            21
Never-worked           10
Name: workclass, dtype: int64


dataframe[fnlwgt]
203488    21
190290    19
120277    19
125892    18
126569    18
          ..
188488     1
285290     1
293579     1
114874     1
257302     1
Name: fnlwgt, Length: 28523, dtype: int64


dataframe[education]
HS-grad         15784
Some-college    10878
Bachelors        8025
Masters          2657
Assoc-voc        2061
11th             1812
Assoc-acdm       1601
10th             1389
7th-8th           955
Prof-school       834
9th               756
12th              657
Doctorate         594
5th-6th           509
1st-4th           247
Preschool          83
Name: education, dtype: int64


dataframe[educational-num]
9     15784
10    10878
13     8025
14     2657
11     2061
7      1812
12     1601
6      1389
4       955
15      834
5       756
8       657
16      594
3       509
2       247
1        83
Name: educational-num, dtype: int64


dataframe[marital-status]
Married-civ-spouse       22379
Never-married            16117
Divorced                  6633
Separated                 1530
Widowed                   1518
Married-spouse-absent      628
Married-AF-spouse           37
Name: marital-status, dtype: int64


dataframe[occupation]
Prof-specialty       6172
Craft-repair         6112
Exec-managerial      6086
Adm-clerical         5611
Sales                5504
Other-service        4923
Machine-op-inspct    3022
?                    2809
Transport-moving     2355
Handlers-cleaners    2072
Farming-fishing      1490
Tech-support         1446
Protective-serv       983
Priv-house-serv       242
Armed-Forces           15
Name: occupation, dtype: int64


dataframe[relationship]
Husband           19716
Not-in-family     12583
Own-child          7581
Unmarried          5125
Wife               2331
Other-relative     1506
Name: relationship, dtype: int64


dataframe[race]
White                 41762
Black                  4685
Asian-Pac-Islander     1519
Amer-Indian-Eskimo      470
Other                   406
Name: race, dtype: int64


dataframe[gender]
Male      32650
Female    16192
Name: gender, dtype: int64


dataframe[capital-gain]
0        44807
15024      513
7688       410
7298       364
99999      244
         ...  
1111         1
7262         1
22040        1
1639         1
2387         1
Name: capital-gain, Length: 123, dtype: int64


dataframe[capital-loss]
0       46560
1902      304
1977      253
1887      233
2415       72
        ...  
2465        1
2080        1
155         1
1911        1
2201        1
Name: capital-loss, Length: 99, dtype: int64


dataframe[hours-per-week]
40    22803
50     4246
45     2717
60     2177
35     1937
      ...  
69        1
87        1
94        1
82        1
79        1
Name: hours-per-week, Length: 96, dtype: int64


dataframe[native-country]
United-States                 43832
Mexico                          951
?                               857
Philippines                     295
Germany                         206
Puerto-Rico                     184
Canada                          182
El-Salvador                     155
India                           151
Cuba                            138
England                         127
China                           122
South                           115
Jamaica                         106
Italy                           105
Dominican-Republic              103
Japan                            92
Guatemala                        88
Poland                           87
Vietnam                          86
Columbia                         85
Haiti                            75
Portugal                         67
Taiwan                           65
Iran                             59
Greece                           49
Nicaragua                        49
Peru                             46
Ecuador                          45
France                           38
Ireland                          37
Hong                             30
Thailand                         30
Cambodia                         28
Trinadad&Tobago                  27
Laos                             23
Yugoslavia                       23
Outlying-US(Guam-USVI-etc)       23
Scotland                         21
Honduras                         20
Hungary                          19
Holand-Netherlands                1
Name: native-country, dtype: int64


dataframe[income]
<=50K    37155
>50K     11687
Name: income, dtype: int64


From the output above, we can summarize the dataset as follows.

Dataset description¶

The dataset contains the following features:

  • age: continuous.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • fnlwgt: continuous.
  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • educational-num: continuous.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  • gender: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
  • income: >50K, <=50K

The income column is the target variable, with two possible values: >50K and <=50K.
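
The value counts above also show a ? category in workclass, occupation, and native-country, which the feature list does not mention. A minimal sketch for counting such placeholders per column follows; the `sample` frame is made-up illustration data, not rows from income.csv.

```python
import pandas as pd

# Hypothetical rows for illustration; the real data comes from income.csv.
sample = pd.DataFrame({
    'age': [39, 50, 38, 53],
    'workclass': ['Private', '?', 'State-gov', 'Private'],
    'occupation': ['Sales', '?', 'Adm-clerical', '?'],
    'native-country': ['United-States', 'Cuba', '?', 'United-States'],
})

# (sample == '?') is a boolean frame; summing it counts the placeholders per column.
placeholder_counts = (sample == '?').sum()

# Show only the columns that actually contain the placeholder.
print(placeholder_counts[placeholder_counts > 0])
```

Applied to the full DataFrame, the same two lines would list every column that contains ? together with its count, which can then be compared against the value counts printed earlier.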

Dropping fnlwgt column¶

The fnlwgt (final weight) column is a sampling weight: an estimate of how many people in the population each row represents. It describes the survey design rather than the individual, so it is not expected to help predict income and is dropped from the dataset before modelling.

In [4]:
df = df.drop('fnlwgt', axis=1)

2. One-Hot Encoding¶

From the dataset description, the categorical columns fall into two groups:

Multi-class columns¶

  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Binary-class columns¶

  • gender: Female, Male.
  • income: >50K, <=50K

One-Hot Encoding Multi-Class Columns¶

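Each multi-class column is replaced by one indicator column per category. The education column is dropped rather than encoded, because educational-num already carries the same information as a number (the two columns have identical value counts above).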
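
Dropping education in the next cell is safe only if every education label pairs with exactly one educational-num value and vice versa. One way to check this is sketched here on a few made-up rows; `sample` and the helper names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows; on the real data, use the two columns of df instead.
sample = pd.DataFrame({
    'education': ['HS-grad', 'Bachelors', 'HS-grad', 'Masters', 'Bachelors'],
    'educational-num': [9, 13, 9, 14, 13],
})

# Number of distinct codes per label, and distinct labels per code.
codes_per_label = sample.groupby('education')['educational-num'].nunique()
labels_per_code = sample.groupby('educational-num')['education'].nunique()

# The mapping is one-to-one when both counts are 1 everywhere.
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(is_one_to_one)
```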

In [5]:
df = pd.concat([df.drop('workclass', axis=1), pd.get_dummies(df['workclass']).add_prefix('workclass_')], axis=1)
df = pd.concat([df.drop('occupation', axis=1), pd.get_dummies(df['occupation']).add_prefix('occupation_')], axis=1)
df = df.drop('education', axis=1)
df = pd.concat([df.drop('marital-status', axis=1), pd.get_dummies(df['marital-status']).add_prefix('marital-status_')], axis=1)
df = pd.concat([df.drop('relationship', axis=1), pd.get_dummies(df['relationship']).add_prefix('relationship_')], axis=1)
df = pd.concat([df.drop('race', axis=1), pd.get_dummies(df['race']).add_prefix('race_')], axis=1)
df = pd.concat([df.drop('native-country', axis=1), pd.get_dummies(df['native-country']).add_prefix('native-country_')], axis=1)
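
The drop-and-concat pattern above can also be written as a single `pd.get_dummies` call with the `columns` argument, which prefixes each new column with the source column name and an underscore. The sketch below uses a small made-up frame; passing `dtype=int` is an optional choice here that yields 0/1 integers on pandas versions whose default dummy dtype is boolean.

```python
import pandas as pd

# Hypothetical rows for illustration only.
sample = pd.DataFrame({
    'age': [39, 50, 38],
    'workclass': ['Private', 'State-gov', 'Private'],
    'race': ['White', 'Black', 'White'],
})

# Encode both categorical columns in one call; untouched columns are kept.
encoded = pd.get_dummies(sample, columns=['workclass', 'race'], dtype=int)
print(encoded.columns.tolist())
```

The resulting names (for example `workclass_Private`) follow the same convention as the `add_prefix` calls in the cell above.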

Encoding Binary-Class Columns¶

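The two binary columns are mapped to 0/1: gender becomes 1 for Male and 0 otherwise, and income becomes 1 for >50K and 0 otherwise.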

In [6]:
df['gender'] = df['gender'].apply(lambda x: 1 if x == 'Male' else 0)
df['income'] = df['income'].apply(lambda x: 1 if x == '>50K' else 0)
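
Note that the lambda-based encoding maps every value other than the named one to 0, so a typo or an unexpected category would silently become 0. An alternative sketch using `Series.map` with an explicit dictionary turns unknown values into NaN instead, which makes them easy to count. The sample rows, including the deliberate typo, are made up.

```python
import pandas as pd

# Hypothetical rows; 'male' is a deliberate typo to show the behaviour.
sample = pd.DataFrame({
    'gender': ['Male', 'Female', 'male'],
    'income': ['>50K', '<=50K', '>50K'],
})

# Explicit mappings: anything not listed becomes NaN rather than 0.
gender_codes = sample['gender'].map({'Male': 1, 'Female': 0})
income_codes = sample['income'].map({'>50K': 1, '<=50K': 0})

# Count values that did not match any key.
print(gender_codes.isna().sum())
```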

3. Feature Selection, Most Correlated With Income¶

In [7]:
df.shape
Out[7]:
(48842, 91)

Filtering Top 20% Most Correlated Columns with the Income Column¶

After encoding, the dataset has 91 columns. To focus on the most relevant ones, we compute the absolute correlation of each column with the income column, sort the results, and keep the top 20%. Because income is ranked against itself (correlation 1), it is always among the 19 columns kept, which is convenient for the heatmap that follows.

In [8]:
income_corr = df.corr()['income'].abs()
sorted_income_corr = income_corr.sort_values()
num_cols_to_drop = int(0.8 * len(df.columns))
cols_to_keep = sorted_income_corr.iloc[num_cols_to_drop : ].index
df_most_corr = df[cols_to_keep]
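
When only the predictor columns are wanted, the target can be removed from the ranking before selecting the top share. The following is a minimal sketch on synthetic data; the column names `strong`, `weak`, and `noise` are invented for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: one column tracks the target closely, one loosely, one not at all.
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
sample = pd.DataFrame({
    'income': target,
    'strong': target + rng.normal(0, 0.1, size=n),
    'weak': target + rng.normal(0, 2.0, size=n),
    'noise': rng.normal(0, 1.0, size=n),
})

# Absolute correlation with the target, excluding the target itself.
corr_with_income = sample.corr()['income'].abs().drop('income')

# Keep the top 20% of predictors, but never fewer than one.
n_keep = max(1, int(0.2 * len(corr_with_income)))
top_features = corr_with_income.nlargest(n_keep).index.tolist()
print(top_features)
```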

Plotting The Correlation Heatmap¶

In [9]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize = (15, 10))
sns.heatmap(df_most_corr.corr(), annot=True, cmap='coolwarm');

Based on the heatmap analysis, the following features are found to be most correlated with the income column:

  • Marital status
  • Husband relationship
  • Education level
  • Age

This information can be useful for feature selection and building a predictive model for income.

4. Building Machine Learning Model¶

Why a Random Forest Model?¶

All features are now numeric, so tree-based models can be applied to them directly. A single decision tree is a natural starting point, but decision trees tend to overfit the training data. A Random Forest, an ensemble of decision trees, averages the predictions of many trees, which helps mitigate overfitting and usually gives more robust predictions. For that reason we use a Random Forest to predict income.
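
The comparison between a single tree and a forest can be checked empirically with cross-validation. The sketch below does so on synthetic data from `make_classification`, not on the income dataset, so the scores it prints say nothing about this project's results; it only illustrates the procedure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem used purely for illustration.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

models = {
    'decision tree': DecisionTreeClassifier(random_state=0),
    'random forest': RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in models.items()
}
for name, score in cv_scores.items():
    print(f'{name}: {score:.3f}')
```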

Dividing the Dataset into the Training Data and the Testing Data¶

In [10]:
from sklearn.model_selection import train_test_split

X = df.drop('income', axis=1)
y = df['income']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
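
The split above is random and unseeded, so the class balance of the test set and the resulting score vary slightly from run to run. A possible variant, sketched on synthetic labels with roughly the same class balance as income, passes `stratify` to preserve the class proportions and `random_state` to make the split reproducible. The seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 76% zeros and 24% ones, similar to the income column.
y_demo = np.array([0] * 760 + [1] * 240)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify keeps the class ratio in both parts; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Share of the positive class in each part.
print(y_tr.mean(), y_te.mean())
```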

Applying The Random Forest Classifier Model¶

In [11]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier().fit(X_train, y_train)
clf.score(X_test, y_test)
Out[11]:
0.8496263691268298

Based on the score, the model predicts income on the test set with an accuracy of 84.96%.
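
Because the two income classes are imbalanced, accuracy is easier to interpret next to a baseline that always predicts the majority class. The arithmetic below uses the class counts printed in section 1 and the score from the cell above. It is computed on the full dataset rather than on the test split, so it is only an approximate reference point.

```python
# Class counts taken from the value_counts output for the income column.
class_counts = {'<=50K': 37155, '>50K': 11687}

# Accuracy of a model that always predicts the most frequent class.
majority_baseline = max(class_counts.values()) / sum(class_counts.values())

# Test accuracy reported by clf.score in the cell above.
model_accuracy = 0.8496263691268298

print(f'baseline: {majority_baseline:.4f}')
print(f'gain over baseline: {model_accuracy - majority_baseline:.4f}')
```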

Analyzing The Features Importance¶

In [12]:
feature_imp = dict(zip(clf.feature_names_in_, clf.feature_importances_))
feature_imp = {k: v for k,v in sorted(feature_imp.items(), key = lambda x: x[1], reverse=True)}
feature_imp
Out[12]:
{'age': 0.22322233494447943,
 'educational-num': 0.1311826117279202,
 'hours-per-week': 0.11350407412031566,
 'capital-gain': 0.10854257519814174,
 'marital-status_Married-civ-spouse': 0.07285802131856205,
 'relationship_Husband': 0.054071865666143,
 'capital-loss': 0.03623408938158132,
 'marital-status_Never-married': 0.02570174793814733,
 'occupation_Exec-managerial': 0.021618843393805664,
 'occupation_Prof-specialty': 0.017875105317731954,
 'gender': 0.013889273602117912,
 'relationship_Not-in-family': 0.010563681520294483,
 'workclass_Private': 0.009653923427601437,
 'relationship_Own-child': 0.008426776216139093,
 'workclass_Self-emp-not-inc': 0.008300231271496155,
 'relationship_Wife': 0.00795649380744919,
 'occupation_Other-service': 0.007734290828360036,
 'native-country_United-States': 0.0064937340627310445,
 'marital-status_Divorced': 0.0063637295755381166,
 'race_White': 0.006223073287283066,
 'occupation_Sales': 0.006051784217098677,
 'workclass_Self-emp-inc': 0.00601461315741399,
 'occupation_Craft-repair': 0.005939820440652321,
 'relationship_Unmarried': 0.005749606215869169,
 'workclass_Local-gov': 0.005398525105848354,
 'occupation_Adm-clerical': 0.005145963881619278,
 'workclass_Federal-gov': 0.005086183145137154,
 'race_Black': 0.004796416481660459,
 'occupation_Farming-fishing': 0.0047604968012297825,
 'workclass_State-gov': 0.004412370339855017,
 'occupation_Tech-support': 0.0040771021673870105,
 'occupation_Machine-op-inspct': 0.004061759162365423,
 'occupation_Transport-moving': 0.003942784452512743,
 'occupation_Handlers-cleaners': 0.0033955311356353925,
 'race_Asian-Pac-Islander': 0.00293520185458387,
 'native-country_?': 0.0028921094649814356,
 'occupation_Protective-serv': 0.0026519234782636236,
 'native-country_Mexico': 0.0026285800906803384,
 'marital-status_Separated': 0.0020013038539824536,
 'relationship_Other-relative': 0.0019994140634655012,
 'occupation_?': 0.0019458118810433975,
 'workclass_?': 0.0016426835806499277,
 'marital-status_Widowed': 0.00157219247753646,
 'native-country_Canada': 0.0013575423796474877,
 'race_Amer-Indian-Eskimo': 0.0013434018787387844,
 'native-country_Philippines': 0.0011953237624546384,
 'race_Other': 0.0010785055357013873,
 'native-country_Germany': 0.0010094294037617407,
 'native-country_England': 0.0009540105100897136,
 'marital-status_Married-spouse-absent': 0.0009186510261147557,
 'native-country_India': 0.0008537887570261623,
 'native-country_Italy': 0.0007804564594581538,
 'native-country_Cuba': 0.0006789765423287363,
 'native-country_Japan': 0.0006178291065667383,
 'native-country_China': 0.0006116247447849556,
 'native-country_Poland': 0.0006085195896047573,
 'native-country_South': 0.000571696217536573,
 'native-country_Puerto-Rico': 0.0005669439950076394,
 'native-country_Jamaica': 0.0005244447392335239,
 'native-country_Ireland': 0.0004891875712529776,
 'native-country_Iran': 0.00047676226896973395,
 'native-country_Portugal': 0.00043557398118890494,
 'native-country_Greece': 0.00042426214741192475,
 'native-country_France': 0.0003981770948688261,
 'native-country_Cambodia': 0.000367642110478976,
 'marital-status_Married-AF-spouse': 0.000341186603122011,
 'native-country_Taiwan': 0.0003137402012422082,
 'native-country_Columbia': 0.00031055692301654435,
 'native-country_Yugoslavia': 0.0002969385079017758,
 'native-country_El-Salvador': 0.00028787357778905197,
 'native-country_Dominican-Republic': 0.00027879151115686035,
 'native-country_Vietnam': 0.0002737884854932835,
 'native-country_Peru': 0.00021806625005251905,
 'native-country_Ecuador': 0.0002097423646966491,
 'native-country_Hungary': 0.00019569670018537144,
 'native-country_Haiti': 0.00019330228615587253,
 'occupation_Priv-house-serv': 0.00018751479833655982,
 'native-country_Hong': 0.00013972805660000248,
 'native-country_Nicaragua': 0.000136058734038263,
 'native-country_Guatemala': 0.00013347680168018442,
 'native-country_Scotland': 0.00012227616818582268,
 'native-country_Trinadad&Tobago': 0.00011930826974066256,
 'workclass_Without-pay': 0.00011617143550319518,
 'native-country_Laos': 9.009014864171938e-05,
 'native-country_Thailand': 8.255386512834968e-05,
 'occupation_Armed-Forces': 7.978436171451954e-05,
 'native-country_Outlying-US(Guam-USVI-etc)': 5.077876426342873e-05,
 'native-country_Honduras': 3.551547200466243e-05,
 'workclass_Never-worked': 5.655835812791772e-06,
 'native-country_Holand-Netherlands': 0.0}

Feature Importance in Machine Learning Model¶

The cell above extracts the importance that the fitted Random Forest assigns to each feature and sorts the features from most to least important. In simpler terms, it tells us which characteristics of the data the model relies on most when making predictions.

Top 5 Features with the Highest Importance Scores¶

  1. Age (Importance: 22.32%): Age is the feature the model relies on most. This indicates that a person's age carries substantial information about whether their income exceeds 50K.

  2. Educational Number (Importance: 13.12%): This is the numeric encoding of a person's level of education, where a higher number corresponds to more education. It is the second most important feature.

  3. Hours per Week (Importance: 11.35%): The number of hours a person works per week is the third most important feature, which suggests that working time plays a significant role in predicting income.

  4. Capital Gain (Importance: 10.85%): Capital gain refers to the profit from selling an investment. It is the fourth most important feature, indicating that financial gains contribute to income predictions.

  5. Marital Status (Married-civ-spouse) (Importance: 7.29%): Being married to a civilian spouse is the fifth most important feature. It suggests that marital status, and this category in particular, has an impact on income predictions.

The percentage next to each feature is its share of the model's total importance; scikit-learn normalizes these scores so that they sum to 100% across all features. A higher share means the forest relies on that feature more when predicting income, but it does not indicate the direction of the effect.

In summary, this information helps us understand which aspects of the data are crucial for the model's predictions. It can be valuable for decision-makers and data analysts to focus on these top features when trying to explain or improve the model's performance.
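
One-hot encoding spreads the importance of a categorical column over several indicator columns, which can make that column look less important than it is. A rough way to compare original columns is to sum the importances by prefix, as sketched below with a handful of values rounded from the output above. Splitting on the first underscore is an assumption that holds here because the original column names contain hyphens but no underscores.

```python
from collections import defaultdict

# A few importances rounded from the output above (not the full list).
importances = {
    'age': 0.2232,
    'educational-num': 0.1312,
    'marital-status_Married-civ-spouse': 0.0729,
    'marital-status_Never-married': 0.0257,
    'relationship_Husband': 0.0541,
    'relationship_Not-in-family': 0.0106,
}

# Sum the importances of all indicator columns that share a source column.
grouped = defaultdict(float)
for name, value in importances.items():
    source_column = name.split('_', 1)[0]  # text before the first underscore
    grouped[source_column] += value

# Rank the original columns by their combined importance.
ranked = sorted(grouped.items(), key=lambda item: item[1], reverse=True)
for column, total in ranked:
    print(f'{column}: {total:.4f}')
```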

5. Hyperparameter Tuning¶

Using GridSearchCV from sklearn.model_selection, we search for the best parameter values for the Random Forest classifier. Specifically, we tune the following RandomForestClassifier parameters: n_estimators, max_depth, min_samples_split, and max_features.

In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'n_estimators': [200, 300, 400],           # number of trees in the forest
    'max_depth': [None, 20, 30, 40],           # maximum tree depth (None = no depth limit)
    'min_samples_split': [5, 8, 11],           # minimum samples required to split a node
    'max_features': ['sqrt', 'log2', None]     # features tried per split; 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
}

# Create the RandomForestClassifier
rf_classifier = RandomForestClassifier()

# Create the GridSearchCV object
grid_search = GridSearchCV(rf_classifier, param_grid=param_grid, verbose=10, n_jobs=-1)

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)
Fitting 5 folds for each of 108 candidates, totalling 540 fits
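
The numbers in the log line above follow directly from the grid: the four parameters have 3, 4, 3, and 3 candidate values, and GridSearchCV uses 5-fold cross-validation by default. A short arithmetic check:

```python
from math import prod

# Number of candidate values per parameter in param_grid.
grid_sizes = {
    'n_estimators': 3,
    'max_depth': 4,
    'min_samples_split': 3,
    'max_features': 3,
}

# Every combination of values is one candidate.
n_candidates = prod(grid_sizes.values())

# Each candidate is fitted once per cross-validation fold.
n_folds = 5
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)  # 108 540
```

Each of these fits trains a forest of 200 to 400 trees, so the search is expensive; trimming one parameter list shrinks the cost proportionally.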

Get The Best Parameters¶

In [ ]:
grid_search.best_params_

Get The Best Score¶

In [ ]:
best_estimator = grid_search.best_estimator_
best_estimator.score(X_test, y_test)
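
The fitted search object also exposes `best_score_`, the mean cross-validated accuracy of the best candidate on the training data, which is a different quantity from the held-out test accuracy computed above. The sketch below shows both on synthetic data with a deliberately tiny grid; the data, the grid, and the variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Small synthetic problem so the search finishes quickly.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# A two-candidate grid with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={'max_depth': [3, None]},
    cv=3,
)
search.fit(X_tr, y_tr)

# Mean cross-validated accuracy of the best candidate (training data only).
cv_accuracy = search.best_score_

# Accuracy of the refitted best model on data it has never seen.
test_accuracy = search.score(X_te, y_te)

print(search.best_params_, round(cv_accuracy, 3), round(test_accuracy, 3))
```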