Solving a Tabular Style Datathon

First, we load the train and test features:

import pandas as pd

train = pd.read_csv("train.csv")
testFeatures = pd.read_csv("test.csv")
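
A quick look at the shapes and the target distribution (damage_grade, the label we will predict) tells us what we are dealing with:

print(train.shape, testFeatures.shape)
print(train['damage_grade'].value_counts())
train.head()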

After looking at train.head(), we can see this is a tabular binary prediction problem. For tabular datasets, gradient-boosted trees usually outperform neural networks due to the limited amount of training data, so we decided to use AutoGluon, an AutoML tool that trains and ensembles many popular gradient-boosting models. Besides AutoGluon, we also used OpenFE, an automated feature generator. So we load them in:

!pip install "autogluon.tabular[all]"
!pip install openfe
import numpy as np
import matplotlib.pyplot as plt
from autogluon.tabular import TabularDataset, TabularPredictor
from openfe import OpenFE, transform

We use OpenFE to generate about 2000 features automatically. It works by first generating combinations of the original features, such as x * y or x / y. It then estimates each candidate feature's importance and discards the ones with zero importance.

train_x = train.drop(columns=['building_id', 'damage_grade'])
train_y = train['damage_grade']

ofe = OpenFE()
features = ofe.fit(data=train_x, label=train_y, n_jobs=20)
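
To sanity-check the candidates, we can print the formulas of the top-ranked features; this assumes the tree_to_formula helper shown in OpenFE's examples:

from openfe import tree_to_formula

# Show the ten highest-ranked candidate features, e.g. (x * y) or (x / y)
for feat in ofe.new_features_list[:10]:
    print(tree_to_formula(feat))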

Now we add the features proposed by OpenFE, but in a forward-selection manner: new_features_list is already sorted by importance, so we keep increasing the number of new features we add until the validation loss worsens. This is how we arrived at the 32 features used below; a sketch of the selection loop follows.
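
Below is a minimal sketch of that selection loop, not the exact script we ran: it assumes LightGBM as a cheap proxy model and a step size of 8 features per round, both of which are illustrative choices.

from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
import lightgbm as lgb

best_k, best_loss = 0, float('inf')
for k in range(8, 65, 8):  # grow the candidate set: 8, 16, ..., 64 new features
    tr, _ = transform(train.copy(), testFeatures.copy(), ofe.new_features_list[:k], n_jobs=20)
    X = tr.drop(columns=['building_id', 'damage_grade'])
    y = tr['damage_grade']
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    model = lgb.LGBMClassifier(n_estimators=200).fit(X_tr, y_tr)
    loss = log_loss(y_val, model.predict_proba(X_val))
    if loss < best_loss:
        best_k, best_loss = k, loss
    else:
        break  # validation loss worsened, so stop adding features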

train_data, test_data = transform(train, testFeatures, ofe.new_features_list[:32], n_jobs=20)

Before fitting the model, we drop building_id, because an arbitrary ID is unrelated to how badly a building will be damaged. There is no need to split off a separate validation set, because AutoGluon handles validation splitting internally. We use eval_metric='log_loss' instead of 'accuracy', since the competition is scored with log loss.

train_data = TabularDataset(train_data.drop(columns=['building_id']))
predictor = TabularPredictor(label='damage_grade', eval_metric='log_loss').fit(train_data=train_data, num_gpus=0)
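
Once training finishes, it is worth inspecting AutoGluon's leaderboard to see which models made the final ensemble:

# Per-model validation scores; AutoGluon reports higher-is-better,
# so log_loss values appear negated
predictor.leaderboard()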

After the model is fitted, we get the predicted class probabilities for the test set:

predictions = predictor.predict_proba(test_data)
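
Finally, we write the submission file. The exact format depends on the competition's sample submission; this sketch assumes one probability column per class, prefixed with building_id:

# predict_proba returns a DataFrame with one column per class,
# row-aligned with testFeatures (hypothetical submission layout)
submission = predictions.copy()
submission.insert(0, 'building_id', testFeatures['building_id'])
submission.to_csv("submission.csv", index=False)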