Blue Book for Bulldozers: A Kaggle Machine Learning Competition

Harder than counting sheep:
Predicting the price of dozing bulls.

After coming in the top 10% of the Stack Overflow competition in October 2012, I recently competed in a team in another Machine Learning competition on Kaggle, this time to predict the auction sale-price of heavy machinery.

The competition provided data on 400,000 machines sold at auction, including the size, type, weight and features of each machine, when it was manufactured, how much use it had had, where it was auctioned (anonymised), and the final sale price. The public leaderboard validation set was made up of 40,000 rows with the sale price hidden. The challenge was to predict the sale price for these rows from the non-hidden features.

The first step was to set up a training / test harness. Kaggle provide a “benchmark solution”, so we modified that to split the training set into parts, train on one part, then test on the other, using 10-fold cross-validation. Once we had the benchmark solution working, we had something to test against: we’d run once without any changes, then make a change, run again, and see if there was any improvement in the error score.
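
A harness like this can be sketched in a few lines. This is not the Kaggle benchmark code: `k_fold_splits`, `cross_validate` and the mean-predicting stand-in model are hypothetical names made up for illustration, and the error metric is passed in so that any score can be plugged in.

```python
import random
from statistics import mean


def k_fold_splits(n_rows, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, test


def cross_validate(rows, targets, fit, predict, score, k=10):
    """Return the mean score over k folds (lower is better).

    fit(train_rows, train_targets) -> model
    predict(model, test_rows)      -> list of predictions
    score(actual, predicted)       -> float
    """
    fold_scores = []
    for train, test in k_fold_splits(len(rows), k):
        model = fit([rows[i] for i in train], [targets[i] for i in train])
        predicted = predict(model, [rows[i] for i in test])
        fold_scores.append(score([targets[i] for i in test], predicted))
    return mean(fold_scores)


# Stand-in "model" (an assumption for the demo): always predict the
# mean training price. A real run would fit a random forest here.
def fit_mean(train_rows, train_targets):
    return mean(train_targets)


def predict_mean(model, test_rows):
    return [model] * len(test_rows)


def mean_abs_error(actual, predicted):
    return mean(abs(a - p) for a, p in zip(actual, predicted))
```

The workflow is then: call `cross_validate` once with the unchanged pipeline, change one thing, call it again with the same seed, and compare the two numbers.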

To give you an idea of what we tried over the month that we competed, here is a (non-exhaustive) list of techniques:
  • Append the appendix data (an additional file, provided to fix several problems with the originally released data), with more information about each machine. This improved our score (as expected), although appending the appendix columns seemed to perform better than using them to replace the original “incorrect” columns.
  • Dummy coding categorical variables. Create new true / false columns like “size = large” / “size = medium” / “size = small”. Performed worse than just feeding the original columns with the benchmark coding into a Random Forest.
  • Random coding of the nominal variables. The benchmark code iterated through the nominal (textual) features giving the first value encountered “0”, the subsequent one “1”, etc. We tried coding them in a random order to see if it worsened performance. It did, which isn’t really surprising since you’re destroying data in some sense by removing any effect the order of the rows had on the coding.
  • Coding based on frequency of the nominal variables. Code the most common value as 0, the next most common 1, etc. Was better than random-coding, but worse than the benchmark coding. I’m still not sure why.
  • Add “is null-like” columns for each feature. True if the value was null, 0, 1000, empty, unknown, etc. Did not improve score.
  • Principal Component Analysis / Multiple Correspondence Analysis to try to reduce the number of attributes to just the most important ones. Performed worse than just using the original columns (although for a given number of components N, with N relatively small, it generally performed better than any subset of N columns; but there was never a great reason to keep the number of columns down, so this wasn’t much use).
  • Forward selection on columns. Selected a few columns, but was not as good as feeding in ALL the columns and letting the random-forest decide.
  • Split “ProductClassDesc” into 2 features (split on the “-”). Improved score.
  • Treat datasource and auctioneerID as nominal values, not integers, and code as above. Performed worse.
  • Add columns: “Age at auction”, “years since manufacture”, “appendix differs (true/false?)”. Did not significantly affect performance.
  • Remove invalid values. Set values that were obviously wrong (like those machines that were sold before they were manufactured) to 0. Did not significantly affect performance.
  • Remove invalid rows. Remove any rows with invalid values from the training set. Worsened performance. Presumably because these rows still appeared in the test set, but the model had now never seen “invalid” values before.
  • Bagging - Train on 10 different subsets of the training data (90% of the training data at each round), to produce 10 different models. Each model ran over the test/validation set and the results were averaged (mean) to produce the final submission. Not sure if this improved performance or not. Untested.
  • Using different models than the random forest. Nothing better found.
  • Tuning Random Forest parameters: Nothing better than default found.
  • Using external data to augment the data we had. No useful external data found.
  • Creating additional features from the data. Nothing found that improved the RMSLE.
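
The three codings of nominal variables compared in the list can be sketched as below. The function names are made up for this example and none of this is the benchmark's actual code; it just follows the descriptions above (order of first appearance, frequency rank, and true / false dummy columns).

```python
from collections import Counter


def appearance_codes(values):
    """First distinct value seen -> 0, the next new one -> 1, and so on."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values]


def frequency_codes(values):
    """Most common value -> 0, next most common -> 1, and so on."""
    ranked = [v for v, _ in Counter(values).most_common()]
    mapping = {v: rank for rank, v in enumerate(ranked)}
    return [mapping[v] for v in values]


def dummy_columns(name, values):
    """One true/false column per distinct value, e.g. 'size=large'."""
    return {
        f"{name}={level}": [v == level for v in values]
        for level in sorted(set(values))
    }


# Illustrative data only, not taken from the competition files.
sizes = ["small", "large", "large", "medium", "large", "small"]
print(appearance_codes(sizes))
print(frequency_codes(sizes))
print(dummy_columns("size", sizes))
```

Note that the appearance coding depends on row order, which is exactly the information the random coding throws away.
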
Since most of these attempts didn’t improve our score, in the end we basically ran the benchmark code on the training set + appendix, with a couple of extra features, more trees, and some of the less important features removed.
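
For a flavour of what the extra features look like, here is a minimal sketch of two of them: splitting a product class description on the hyphen, and computing age at auction. The helper names and the example description string are assumptions for illustration, and splitting on the first hyphen only is a guess at the simplest reading of “split on the -”.

```python
from datetime import date


def split_product_class(desc):
    """Split on the first '-' into (class, detail), trimming whitespace."""
    head, _, tail = desc.partition("-")
    return head.strip(), tail.strip()


def age_at_auction(sale_date, year_made):
    """Whole years between manufacture and sale.

    Comes out negative for rows where the machine was apparently sold
    before it was made, which is one way to spot invalid values.
    """
    return sale_date.year - year_made


# Illustrative values only, not rows from the competition data.
print(split_product_class("Wheel Loader - 110.0 to 120.0 Horsepower"))
print(age_at_auction(date(2011, 5, 26), 2004))
```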

The final set of features we used (with reported importances from the random forest) were:

('ProductSize', 0.41652172001561916)
('ageatauction', 0.19948251510717771)
('apx_PrimaryLower', 0.13406362618392142)
('fiBaseModel', 0.12148613718214749)
('Coupler_System', 0.051217663198178283)
('apx_MfgYear', 0.021145885459952378)
('apx_fiManufacturerID', 0.0097494283775320933)
('Pushblock', 0.0089972298725232661)
('Engine_Horsepower', 0.008846433447453169)
('apx_ModelID', 0.0042763625747894753)
('apx_PrimaryUpper', 0.0036585522523700344)
('Scarifier', 0.0036413421737741384)
('yearsSinceManufacture', 0.0028904643776582117)
('Blade_Width', 0.0028731453236328121)
('fiSecondaryDesc', 0.0022618679097654356)
('YearMade', 0.0019901130435210253)
('apx_fiManufacturerDesc', 0.0014476744929245078)
('ProdCapacity', 0.0012626674871026706)
('apx_fiSecondaryDesc', 0.0010425325600432653)
('fiProductClassDesc', 0.00093630558400319901)
('fiModelDesc', 0.00056719203319523953)
('apx_fiModelDesc', 0.00029437944727283606)
('SaleYear', 0.00022935103892434739)
('apx_ProdCapacity', 0.00021697456333554102)
('Drive_System', 0.00020643489848706439)
('Enclosure', 0.00010215641151035445)
('Forks', 9.5813103086392245e-05)
('Blade_Extension', 8.0352144873435045e-05)
('Tip_Control', 5.9438105305499288e-05)
('apx_fiBaseModel', 4.5516313762997594e-05)
('apx_fiProductClassDesc', 4.33268809663658e-05)
('fiModelSeries', 3.4079419042772179e-05)
('Enclosure_Type', 3.1262092602450725e-05)
('apx_fiModelSeries', 2.6341749513859476e-05)
('fiModelDescriptor', 2.2807850943366345e-05)
('Ride_Control', 2.2505958064719364e-05)
('SaleMonth', 2.1849036919695667e-05)
('Ripper', 1.8778815914797885e-05)
('Hydraulics', 1.4807454596585348e-05)
('Coupler', 1.2188403143554827e-05)
('apx_fiModelDescriptor', 9.4231488471798423e-06)
('apx_MachineID', 9.3998985325911364e-06)
('state', 6.7549014932989303e-06)
('SaleDay', 5.8738279187689774e-06)
('Blade_Type', 5.4057744430051251e-06)
('auctioneerID', 3.9776719234206763e-06)
('MachineHoursCurrentMeter', 3.9611219174820141e-06)
('Tire_Size', 3.4886586776409558e-06)
('Travel_Controls', 2.0961153892056989e-06)
('Stick_Length', 1.4384744234822175e-06)
('Stick', 1.2726766881212763e-06)
('datasource', 1.1001817337094394e-06)
('apx_ProductGroup', 1.0266089810096326e-06)
('UsageBand', 9.4128083421217395e-07)
('Thumb', 9.2267259440680872e-07)
('Transmission', 9.040701186308483e-07)
('Backhoe_Mounting', 8.1541235667766097e-07)
('ProdClass', 7.9739045406991217e-07)
('Undercarriage_Pad_Width', 6.9253297534224442e-07)
('Differential_Type', 4.7160145738810293e-07)
('Grouser_Type', 4.1470891159320132e-07)
('appendixdiffers', 3.8711509850675122e-07)
('Track_Type', 3.4502876528014946e-07)
('Pad_Type', 3.3943373773663519e-07)
('apx_PrimarySizeBasis', 1.6448148840859249e-07)
('Turbocharged', 1.1104579477061898e-07)
('Steering_Controls', 1.0653425857162143e-07)
('Pattern_Changer', 9.3398440962281264e-08)
('ProductGroupDesc', 3.7256255581379795e-08)
('apx_ProductGroupDesc', 8.6086322380585466e-09)
('Grouser_Tracks', 3.326743211015174e-09)
('Hydraulics_Flow', 2.6945620128168258e-09)
('MachineID', 0.0)
('Transmission', 0.0)
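
One rough way to read a list like this is to rank the importances and accumulate them, to see how much of the total the top few features carry. A minimal sketch using the first five entries above; `cumulative_share` is a hypothetical helper, not part of the benchmark code, and this is only one plausible way of deciding which features to drop.

```python
# First five (feature, importance) pairs copied from the list above.
importances = [
    ("ProductSize", 0.41652172001561916),
    ("ageatauction", 0.19948251510717771),
    ("apx_PrimaryLower", 0.13406362618392142),
    ("fiBaseModel", 0.12148613718214749),
    ("Coupler_System", 0.051217663198178283),
]


def cumulative_share(pairs):
    """Return (name, running_total) pairs, most important feature first."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    running = 0.0
    result = []
    for name, importance in ranked:
        running += importance
        result.append((name, running))
    return result


for name, total in cumulative_share(importances):
    print(f"{name:20s} {total:.4f}")
```

For these five values the running total reaches roughly 0.92, so the top five features carry about 92% of the reported importance.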

We got an RMSLE of 0.26231 (beating the benchmark of 0.26704 but not approaching the winning score of 0.22910).
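
For reference, RMSLE is the root mean squared logarithmic error. Below is a minimal sketch of the usual definition, which takes log(1 + x) of both the actual and predicted prices; I'm assuming that is the exact variant the competition scored with.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error, using log(1 + x)."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Illustrative prices only.
print(rmsle([10000, 25000, 60000], [12000, 24000, 45000]))
```

Because the error is measured in log space, it penalises being off by a given ratio rather than a given number of dollars. Very roughly, a score of 0.26231 corresponds to predictions being off by a factor of about exp(0.26231) ≈ 1.30.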

Unfortunately, this scattergun approach is the best way I know of going about these competitions. I attacked the Stack Overflow competition in essentially the same way (kept adding more and more features, and trying different techniques), but just had more luck there. I think with experience it should be possible to learn which techniques work best in which scenarios, but without being taught what works well where, or being in a team learning these things from people who know what they are doing, it really takes a long time to try all these different methods.

If anyone has any advice (or has attempted other Kaggle competitions and feels the same way!), please please let me know!