import pandas as pd
df = pd.read_csv('income.csv')
for col in df.columns:
    counts = df[col].value_counts()
    print(f'dataframe[{col}]')
    print(counts)
    print('\n')
dataframe[age] 36 1348 35 1337 33 1335 23 1329 31 1325 ... 88 6 85 5 87 3 89 2 86 1 Name: age, Length: 74, dtype: int64 dataframe[workclass] Private 33906 Self-emp-not-inc 3862 Local-gov 3136 ? 2799 State-gov 1981 Self-emp-inc 1695 Federal-gov 1432 Without-pay 21 Never-worked 10 Name: workclass, dtype: int64 dataframe[fnlwgt] 203488 21 190290 19 120277 19 125892 18 126569 18 .. 188488 1 285290 1 293579 1 114874 1 257302 1 Name: fnlwgt, Length: 28523, dtype: int64 dataframe[education] HS-grad 15784 Some-college 10878 Bachelors 8025 Masters 2657 Assoc-voc 2061 11th 1812 Assoc-acdm 1601 10th 1389 7th-8th 955 Prof-school 834 9th 756 12th 657 Doctorate 594 5th-6th 509 1st-4th 247 Preschool 83 Name: education, dtype: int64 dataframe[educational-num] 9 15784 10 10878 13 8025 14 2657 11 2061 7 1812 12 1601 6 1389 4 955 15 834 5 756 8 657 16 594 3 509 2 247 1 83 Name: educational-num, dtype: int64 dataframe[marital-status] Married-civ-spouse 22379 Never-married 16117 Divorced 6633 Separated 1530 Widowed 1518 Married-spouse-absent 628 Married-AF-spouse 37 Name: marital-status, dtype: int64 dataframe[occupation] Prof-specialty 6172 Craft-repair 6112 Exec-managerial 6086 Adm-clerical 5611 Sales 5504 Other-service 4923 Machine-op-inspct 3022 ? 2809 Transport-moving 2355 Handlers-cleaners 2072 Farming-fishing 1490 Tech-support 1446 Protective-serv 983 Priv-house-serv 242 Armed-Forces 15 Name: occupation, dtype: int64 dataframe[relationship] Husband 19716 Not-in-family 12583 Own-child 7581 Unmarried 5125 Wife 2331 Other-relative 1506 Name: relationship, dtype: int64 dataframe[race] White 41762 Black 4685 Asian-Pac-Islander 1519 Amer-Indian-Eskimo 470 Other 406 Name: race, dtype: int64 dataframe[gender] Male 32650 Female 16192 Name: gender, dtype: int64 dataframe[capital-gain] 0 44807 15024 513 7688 410 7298 364 99999 244 ... 1111 1 7262 1 22040 1 1639 1 2387 1 Name: capital-gain, Length: 123, dtype: int64 dataframe[capital-loss] 0 46560 1902 304 1977 253 1887 233 2415 72 ... 
2465 1 2080 1 155 1 1911 1 2201 1 Name: capital-loss, Length: 99, dtype: int64 dataframe[hours-per-week] 40 22803 50 4246 45 2717 60 2177 35 1937 ... 69 1 87 1 94 1 82 1 79 1 Name: hours-per-week, Length: 96, dtype: int64 dataframe[native-country] United-States 43832 Mexico 951 ? 857 Philippines 295 Germany 206 Puerto-Rico 184 Canada 182 El-Salvador 155 India 151 Cuba 138 England 127 China 122 South 115 Jamaica 106 Italy 105 Dominican-Republic 103 Japan 92 Guatemala 88 Poland 87 Vietnam 86 Columbia 85 Haiti 75 Portugal 67 Taiwan 65 Iran 59 Greece 49 Nicaragua 49 Peru 46 Ecuador 45 France 38 Ireland 37 Hong 30 Thailand 30 Cambodia 28 Trinadad&Tobago 27 Laos 23 Yugoslavia 23 Outlying-US(Guam-USVI-etc) 23 Scotland 21 Honduras 20 Hungary 19 Holand-Netherlands 1 Name: native-country, dtype: int64 dataframe[income] <=50K 37155 >50K 11687 Name: income, dtype: int64
From the result above, we can summarize the dataset.
The dataset contains the following features:
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
class: >50K, <=50K
The class feature is the target variable, with two possible values: >50K and <=50K. Note that in income.csv, as the output above shows, the education-num, sex, and class features appear under the column names educational-num, gender, and income.
fnlwgt column
The fnlwgt (final weight) column estimates the number of people in the population that each row represents. It is a sampling weight rather than a characteristic of the individual, so it is not directly related to the target variable income. It is therefore unlikely to help with predictions, and we drop it from the dataset.
df = df.drop('fnlwgt', axis=1)
From the dataset description, we know that the following columns are categorical:
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
sex: Female, Male.
class: >50K, <=50K
Encoding Multi-Class Columns
df = pd.concat([df.drop('workclass', axis=1), pd.get_dummies(df['workclass']).add_prefix('workclass_')], axis=1)
df = pd.concat([df.drop('occupation', axis=1), pd.get_dummies(df['occupation']).add_prefix('occupation_')], axis=1)
df = df.drop('education', axis=1)  # redundant: educational-num has the same value counts, one number per education level
df = pd.concat([df.drop('marital-status', axis=1), pd.get_dummies(df['marital-status']).add_prefix('marital-status_')], axis=1)
df = pd.concat([df.drop('relationship', axis=1), pd.get_dummies(df['relationship']).add_prefix('relationship_')], axis=1)
df = pd.concat([df.drop('race', axis=1), pd.get_dummies(df['race']).add_prefix('race_')], axis=1)
df = pd.concat([df.drop('native-country', axis=1), pd.get_dummies(df['native-country']).add_prefix('native-country_')], axis=1)
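The repeated drop-and-concat pattern above can also be written as a single call. The following is a minimal sketch on a tiny made-up frame (the rows are illustrative and not taken from income.csv; only the column names mirror the dataset). It assumes the `columns` and `prefix_sep` parameters of `pd.get_dummies` and compares the result with the pattern used in this notebook.

```python
import pandas as pd

# Toy frame with made-up rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["Private", "State-gov", "Private"],
    "race": ["White", "Black", "White"],
})

# Pattern used above: drop the column, then append its prefixed dummies.
manual = toy.copy()
for col in ["workclass", "race"]:
    manual = pd.concat(
        [manual.drop(col, axis=1), pd.get_dummies(manual[col]).add_prefix(f"{col}_")],
        axis=1,
    )

# Single call: get_dummies encodes the listed columns and prefixes each
# new column with "<original name>_".
compact = pd.get_dummies(toy, columns=["workclass", "race"], prefix_sep="_")

print(list(compact.columns))
```

Either form should give the same columns; the single call is mainly easier to maintain when the list of categorical columns changes.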
Encoding Binary-Class Columns
df['gender'] = df['gender'].apply(lambda x: 1 if x == 'Male' else 0)
df['income'] = df['income'].apply(lambda x: 1 if x == '>50K' else 0)
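As an aside, the same 0/1 encoding can be expressed without a lambda by comparing the column to the positive label and casting the booleans to integers. This is a sketch on a made-up three-row frame, not the notebook's data.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "income": [">50K", "<=50K", "<=50K"],
})

# The comparison yields booleans; astype(int) maps True/False to 1/0.
encoded = pd.DataFrame({
    "gender": (toy["gender"] == "Male").astype(int),
    "income": (toy["income"] == ">50K").astype(int),
})

# The apply/lambda version used above, for comparison.
via_apply = toy["gender"].apply(lambda x: 1 if x == "Male" else 0)

print(encoded)
```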
df.shape
(48842, 91)
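The column count of 91 can be checked by hand. The sketch below adds up the number of distinct categories per one-hot encoded column, as read from the value_counts() output earlier (including the "?" category where present), plus the columns that were left as single columns.

```python
# Distinct categories per one-hot encoded column, read from the
# value_counts() output printed earlier in the notebook.
dummies_per_column = {
    "workclass": 9,
    "occupation": 15,
    "marital-status": 7,
    "relationship": 6,
    "race": 5,
    "native-country": 42,
}

# Columns that stay as single columns after dropping fnlwgt and education.
remaining = ["age", "educational-num", "gender", "capital-gain",
             "capital-loss", "hours-per-week", "income"]

total_columns = len(remaining) + sum(dummies_per_column.values())
print(total_columns)
```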
Income Column
Since the dataset now has 91 columns, we want to identify the 20% of columns that are most strongly correlated with the income column. To do this, we compute the absolute correlation between each column and income, and then keep the columns with the highest coefficients.
income_corr = df.corr()['income'].abs()
sorted_income_corr = income_corr.sort_values()
num_cols_to_drop = int(0.8 * len(df.columns))
cols_to_keep = sorted_income_corr.iloc[num_cols_to_drop : ].index
df_most_corr = df[cols_to_keep]
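It may help to work through the slice arithmetic in the cell above. With 91 columns, int() truncates 0.8 * 91 = 72.8 down to 72, so 19 columns are kept. One of those is income itself, since its correlation with itself is 1 and it therefore sorts last. A short sketch of that arithmetic:

```python
n_columns = 91                              # from df.shape above
num_cols_to_drop = int(0.8 * n_columns)     # int() truncates 72.8 to 72
num_cols_kept = n_columns - num_cols_to_drop

# income correlates perfectly with itself, so it occupies one kept slot.
num_predictors_kept = num_cols_kept - 1

print(num_cols_to_drop, num_cols_kept, num_predictors_kept)
```

If only predictors are wanted, one option would be to drop income from the correlation series before slicing.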
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize = (15, 10))
sns.heatmap(df_most_corr.corr(), annot=True, cmap='coolwarm');
The heatmap shows the columns kept in df_most_corr, that is, those with the highest absolute correlation with the income column. This information can be useful for feature selection and for building a predictive model for income.
Since the multi-class columns are one-hot encoded and the binary columns are encoded as 0/1, a decision tree is a suitable model for predicting income. However, decision trees tend to overfit the training data. A Random Forest, an ensemble of decision trees, mitigates overfitting and usually gives more robust predictions, so we use it to predict income.
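The overfitting argument above can be illustrated on synthetic data. The sketch below is not part of the original analysis: it uses scikit-learn's make_classification to generate a made-up noisy dataset, then compares training and test accuracy for a single unrestricted decision tree and for a random forest. All dataset parameters are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification problem (not the income data).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# A large gap between train and test accuracy is a sign of overfitting.
for name, model in [("tree", tree), ("forest", forest)]:
    print(name, "train:", model.score(X_tr, y_tr), "test:", model.score(X_te, y_te))
```

An unrestricted tree memorizes the training set, so its training accuracy says little about how it generalizes; the test scores are the ones to compare.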
from sklearn.model_selection import train_test_split
X = df.drop('income', axis=1)
y = df['income']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier().fit(X_train, y_train)
clf.score(X_test, y_test)
0.8496263691268298
Based on the score, this model predicts income with an accuracy of about 84.96% on the test set.
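To put this accuracy in context, it can be compared with a baseline that always predicts the majority class. The sketch below uses the income counts printed earlier for the full dataset; the class balance of the random test split may differ slightly, so treat the baseline as approximate.

```python
# Class counts from the value_counts() output for the income column.
counts = {"<=50K": 37155, ">50K": 11687}
total = sum(counts.values())

# Accuracy of always predicting the most frequent class.
baseline_accuracy = max(counts.values()) / total

# Test accuracy reported by clf.score above.
model_accuracy = 0.8496263691268298
improvement = model_accuracy - baseline_accuracy

print(f"baseline: {baseline_accuracy:.4f}, improvement: {improvement:.4f}")
```

The baseline is roughly 76%, so the model's gain over trivial prediction is closer to nine percentage points than the headline accuracy might suggest.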
feature_imp = dict(zip(clf.feature_names_in_, clf.feature_importances_))
feature_imp = {k: v for k,v in sorted(feature_imp.items(), key = lambda x: x[1], reverse=True)}
feature_imp
{'age': 0.22322233494447943, 'educational-num': 0.1311826117279202, 'hours-per-week': 0.11350407412031566, 'capital-gain': 0.10854257519814174, 'marital-status_Married-civ-spouse': 0.07285802131856205, 'relationship_Husband': 0.054071865666143, 'capital-loss': 0.03623408938158132, 'marital-status_Never-married': 0.02570174793814733, 'occupation_Exec-managerial': 0.021618843393805664, 'occupation_Prof-specialty': 0.017875105317731954, 'gender': 0.013889273602117912, 'relationship_Not-in-family': 0.010563681520294483, 'workclass_Private': 0.009653923427601437, 'relationship_Own-child': 0.008426776216139093, 'workclass_Self-emp-not-inc': 0.008300231271496155, 'relationship_Wife': 0.00795649380744919, 'occupation_Other-service': 0.007734290828360036, 'native-country_United-States': 0.0064937340627310445, 'marital-status_Divorced': 0.0063637295755381166, 'race_White': 0.006223073287283066, 'occupation_Sales': 0.006051784217098677, 'workclass_Self-emp-inc': 0.00601461315741399, 'occupation_Craft-repair': 0.005939820440652321, 'relationship_Unmarried': 0.005749606215869169, 'workclass_Local-gov': 0.005398525105848354, 'occupation_Adm-clerical': 0.005145963881619278, 'workclass_Federal-gov': 0.005086183145137154, 'race_Black': 0.004796416481660459, 'occupation_Farming-fishing': 0.0047604968012297825, 'workclass_State-gov': 0.004412370339855017, 'occupation_Tech-support': 0.0040771021673870105, 'occupation_Machine-op-inspct': 0.004061759162365423, 'occupation_Transport-moving': 0.003942784452512743, 'occupation_Handlers-cleaners': 0.0033955311356353925, 'race_Asian-Pac-Islander': 0.00293520185458387, 'native-country_?': 0.0028921094649814356, 'occupation_Protective-serv': 0.0026519234782636236, 'native-country_Mexico': 0.0026285800906803384, 'marital-status_Separated': 0.0020013038539824536, 'relationship_Other-relative': 0.0019994140634655012, 'occupation_?': 0.0019458118810433975, 'workclass_?': 0.0016426835806499277, 'marital-status_Widowed': 0.00157219247753646, 
'native-country_Canada': 0.0013575423796474877, 'race_Amer-Indian-Eskimo': 0.0013434018787387844, 'native-country_Philippines': 0.0011953237624546384, 'race_Other': 0.0010785055357013873, 'native-country_Germany': 0.0010094294037617407, 'native-country_England': 0.0009540105100897136, 'marital-status_Married-spouse-absent': 0.0009186510261147557, 'native-country_India': 0.0008537887570261623, 'native-country_Italy': 0.0007804564594581538, 'native-country_Cuba': 0.0006789765423287363, 'native-country_Japan': 0.0006178291065667383, 'native-country_China': 0.0006116247447849556, 'native-country_Poland': 0.0006085195896047573, 'native-country_South': 0.000571696217536573, 'native-country_Puerto-Rico': 0.0005669439950076394, 'native-country_Jamaica': 0.0005244447392335239, 'native-country_Ireland': 0.0004891875712529776, 'native-country_Iran': 0.00047676226896973395, 'native-country_Portugal': 0.00043557398118890494, 'native-country_Greece': 0.00042426214741192475, 'native-country_France': 0.0003981770948688261, 'native-country_Cambodia': 0.000367642110478976, 'marital-status_Married-AF-spouse': 0.000341186603122011, 'native-country_Taiwan': 0.0003137402012422082, 'native-country_Columbia': 0.00031055692301654435, 'native-country_Yugoslavia': 0.0002969385079017758, 'native-country_El-Salvador': 0.00028787357778905197, 'native-country_Dominican-Republic': 0.00027879151115686035, 'native-country_Vietnam': 0.0002737884854932835, 'native-country_Peru': 0.00021806625005251905, 'native-country_Ecuador': 0.0002097423646966491, 'native-country_Hungary': 0.00019569670018537144, 'native-country_Haiti': 0.00019330228615587253, 'occupation_Priv-house-serv': 0.00018751479833655982, 'native-country_Hong': 0.00013972805660000248, 'native-country_Nicaragua': 0.000136058734038263, 'native-country_Guatemala': 0.00013347680168018442, 'native-country_Scotland': 0.00012227616818582268, 'native-country_Trinadad&Tobago': 0.00011930826974066256, 'workclass_Without-pay': 0.00011617143550319518, 
'native-country_Laos': 9.009014864171938e-05, 'native-country_Thailand': 8.255386512834968e-05, 'occupation_Armed-Forces': 7.978436171451954e-05, 'native-country_Outlying-US(Guam-USVI-etc)': 5.077876426342873e-05, 'native-country_Honduras': 3.551547200466243e-05, 'workclass_Never-worked': 5.655835812791772e-06, 'native-country_Holand-Netherlands': 0.0}
The code above computes and displays the importance of each feature in the trained model. In simpler terms, it tells us which columns the model relies on most when making predictions. The top five features in the output above are:
Age (importance: 22.32%): Age is the most important feature, meaning the model relies on it more than on any other column when predicting income level.
Educational number (importance: 13.12%): This column encodes a person's level of education as a number. It is the second most important feature.
Hours per week (importance: 11.35%): The number of hours a person works per week is the third most important feature, so working hours play a significant role in predicting income.
Capital gain (importance: 10.85%): Capital gain is the profit from selling an investment. It is the fourth most important feature, indicating that investment gains contribute to income predictions.
Marital status (Married-civ-spouse) (importance: 7.29%): Being married to a civilian spouse is the fifth most important feature, so this particular marital status category has an impact on income predictions.
The percentage next to each feature is its share of the model's total impurity-based importance; the values across all features sum to 100%. Features with higher percentages have more influence on the model's predictions. Because the forest is trained without a fixed random_state, the exact values change slightly from run to run.
In summary, this information shows which columns the model depends on most. Decision-makers and data analysts can focus on these top features when explaining or improving the model.
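Because one-hot encoding splits a categorical column into many dummy columns, its importance is spread across them. One way to compare original columns is to sum the importances of each column's dummies. The sketch below uses a hypothetical helper, group_importances, and a small subset of the values from the output above; it is an illustration rather than part of the original analysis.

```python
from collections import defaultdict

def group_importances(feature_imp, encoded_columns):
    """Sum dummy-column importances back onto their original column."""
    grouped = defaultdict(float)
    for name, value in feature_imp.items():
        # Map e.g. "marital-status_Divorced" back to "marital-status";
        # columns that were not one-hot encoded keep their own name.
        original = next(
            (c for c in encoded_columns if name.startswith(c + "_")), name
        )
        grouped[original] += value
    return dict(sorted(grouped.items(), key=lambda kv: kv[1], reverse=True))

# Subset of the importances printed above (not the full dictionary).
subset = {
    "age": 0.22322233494447943,
    "educational-num": 0.1311826117279202,
    "marital-status_Married-civ-spouse": 0.07285802131856205,
    "marital-status_Never-married": 0.02570174793814733,
    "relationship_Husband": 0.054071865666143,
    "relationship_Not-in-family": 0.010563681520294483,
}

grouped = group_importances(subset, ["marital-status", "relationship"])
print(grouped)
```

In the notebook, the same helper could be applied to the full feature_imp dictionary with all six one-hot encoded column names.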
Using GridSearchCV from sklearn.model_selection, we aim to determine the optimal parameter values for the Random Forest classification model. Specifically, we are seeking the best values for the following RandomForestClassifier parameters: n_estimators, max_depth, min_samples_split, and max_features.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Define the parameter grid for hyperparameter tuning
param_grid = {
    'n_estimators': [200, 300, 400],  # number of trees in the forest
    'max_depth': [None, 20, 30, 40],  # maximum tree depth (None = unlimited)
    'min_samples_split': [5, 8, 11],  # minimum samples required to split a node
    # Note: 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
    'max_features': ['auto', 'sqrt', 'log2']  # features considered at each split
}
# Create the RandomForestClassifier
rf_classifier = RandomForestClassifier()
# Create the GridSearchCV object
grid_search = GridSearchCV(rf_classifier, param_grid=param_grid, verbose=10, n_jobs=-1)
# Fit the grid search to the training data
grid_search.fit(X_train, y_train)
Fitting 5 folds for each of 108 candidates, totalling 540 fits
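The numbers in this log line follow from the grid: the candidate count is the product of the list lengths in param_grid, and each candidate is fitted once per cross-validation fold. A small sketch of that arithmetic, assuming the default of 5 folds that GridSearchCV uses when cv is not given:

```python
from math import prod

# Same grid as in the cell above.
param_grid = {
    "n_estimators": [200, 300, 400],
    "max_depth": [None, 20, 30, 40],
    "min_samples_split": [5, 8, 11],
    "max_features": ["auto", "sqrt", "log2"],
}

# One candidate per combination of parameter values.
n_candidates = prod(len(values) for values in param_grid.values())

cv_folds = 5  # GridSearchCV default when cv is not specified
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)
```

Since the cost grows multiplicatively with each added value, trimming even one list shortens the search noticeably.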
grid_search.best_params_
best_estimator = grid_search.best_estimator_
best_estimator.score(X_test, y_test)