通过对最好的模型进行检查,你经常会得到关于这个问题的深刻认知。例如,随机森林回归[RandomForestRegressor]可以指出每个属性的相对重要性,以便作出准确的预测:
>>> feature_importances = grid_search.best_estimator_.feature_importances_
>>> feature_importances
array([ 7.14156423e-02, 6.76139189e-02, 4.44260894e-02,
1.66308583e-02, 1.66076861e-02, 1.82402545e-02,
1.63458761e-02, 3.26497987e-01, 6.04365775e-02,
1.13055290e-01, 7.79324766e-02, 1.12166442e-02,
1.53344918e-01, 8.41308969e-05, 2.68483884e-03,
3.46681181e-03])
让我们在其相应的属性名称旁边显示这些重要性评分:
>>> extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
>>> cat_one_hot_attribs = list(encoder.classes_)
>>> attributes = num_attribs + extra_attribs + cat_one_hot_attribs
>>> sorted(zip(feature_importances, attributes), reverse=True)
[(0.32649798665134971, 'median_income'),
(0.15334491760305854, 'INLAND'),
(0.11305529021187399, 'pop_per_hhold'),
(0.07793247662544775, 'bedrooms_per_room'),
(0.071415642259275158, 'longitude'),
(0.067613918945568688, 'latitude'),
(0.060436577499703222, 'rooms_per_hhold'),
(0.04442608939578685, 'housing_median_age'),
(0.018240254462909437, 'population'),
(0.01663085833886218, 'total_rooms'),
(0.016607686091288865, 'total_bedrooms'),
(0.016345876147580776, 'households'),
(0.011216644219017424, '<1H OCEAN'),
(0.0034668118081117387, 'NEAR OCEAN'),
(0.0026848388432755429, 'NEAR BAY'),
(8.4130896890070617e-05, 'ISLAND')]
有了这些信息,您可能想尝试删除一些不太有用的特性(例如,显然只有一个ocean_proximity类别是非常有用的,所以您可以尝试删除其他的特性)。
您还应该了解系统所犯的特定错误,然后试着理解它为什么会产生这些错误,以及如何修复这个问题(添加额外的特征,或者相反减少特征,清除那些信息不丰富的数据,清除异常值等等)。