Preface
This article summarizes the APIs covered so far, as a review aid.
1. K-Nearest Neighbors (KNN) API
sklearn.neighbors.KNeighborsClassifier(n_neighbors=5)
Code demo:
```python
from sklearn.neighbors import KNeighborsClassifier

x = [[1], [2], [0], [0]]
y = [1, 1, 0, 0]
estimator = KNeighborsClassifier(n_neighbors=2)
estimator.fit(x, y)
ret = estimator.predict([[1]])
print(ret)
```
Run result:
Feature Preprocessing API
Because min-max normalization rescales each feature from its minimum and maximum, the result is not robust when the dataset contains many outliers (a single extreme value shifts the whole scale), so it is a poor fit for large modern datasets.
For that reason we usually standardize the data instead.
Code demo:
```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = pd.read_csv('./data/dating.txt')

transfer = MinMaxScaler(feature_range=(0, 1))
MinMax_data = transfer.fit_transform(data.iloc[:, :3])
print("归一化处理后: \n", MinMax_data)

transfer1 = StandardScaler()
stander_data = transfer1.fit_transform(data.iloc[:, :3])
print("标准化处理后: \n", stander_data)
```
Part of the raw data before processing:
```
milage,Liters,Consumtime,target
40920,8.326976,0.953952,3
14488,7.153469,1.673904,2
26052,1.441871,0.805124,1
75136,13.147394,0.428964,1
38344,1.669788,0.134296,1
72993,10.141740,1.032955,1
35948,6.830792,1.213192,3
```
Results after processing:
```
归一化处理后: 
 [[0.44832535 0.39805139 0.56233353]
 [0.15873259 0.34195467 0.98724416]
 [0.28542943 0.06892523 0.47449629]
 ...
 [0.29115949 0.50910294 0.51079493]
 [0.52711097 0.43665451 0.4290048 ]
 [0.47940793 0.3768091  0.78571804]]
标准化处理后: 
 [[ 0.33193158  0.41660188  0.24523407]
 [-0.87247784  0.13992897  1.69385734]
 [-0.34554872 -1.20667094 -0.05422437]
 ...
 [-0.32171752  0.96431572  0.06952649]
 [ 0.65959911  0.60699509 -0.20931587]
 [ 0.46120328  0.31183342  1.00680598]]
```
Dataset Splitting API
from sklearn.model_selection import train_test_split
Parameters:
train_size: size of the training set
float: a value between 0 and 1, the proportion of the data used for training
int: the absolute number of training samples
None: automatically the complement of the test set, i.e., the original dataset minus the test set
test_size: size of the test set; the default is 0.25
float: a value between 0 and 1, the proportion of the data used for testing
int: the absolute number of test samples
None: automatically the complement of the training set, i.e., the original dataset minus the training set
**random_state:** essentially a random seed, set mainly so that results can be reproduced
**shuffle:** whether to shuffle the data before splitting, True or False; the default is True
**stratify:** whether to split the data while preserving class proportions. For example, if the original dataset has class A : class B = 75% : 25%, then both the resulting training and test sets keep an A : B ratio of 75% : 25%. This is useful when class sizes differ a lot; the usual form is stratify=y, i.e., stratify by the label array y (see the sketch below).
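A minimal sketch of the stratify behavior, using made-up toy labels:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# toy data: 8 samples of class 0 and 4 of class 1 (a 2:1 ratio)
X = [[i] for i in range(12)]
y = [0] * 8 + [1] * 4

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
# both splits keep the 2:1 ratio, e.g. Counter({0: 6, 1: 3}) / Counter({0: 2, 1: 1})
print(Counter(y_train), Counter(y_test))
```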
We use this API to split our dataset into a training set and a test set.
In the iris example, we pass in the iris dataset's feature values and target values,
and it returns the training-set features, test-set features, training-set targets, and test-set targets used for model training.
Code demo:
(dataset: iris)
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    test_size=0.2, random_state=22)
print("训练集的特征值是: \n", x_train)
print("测试集的特征值是: \n", x_test)
print("训练集的目标值是: \n", y_train)
print("测试集的目标值是: \n", y_test)
```
Run result:
训练集的特征值是: [[6.2 2.8 4.8 1.8] [5.1 3.3 1.7 0.5] [5.6 2.9 3.6 1.3] [7.7 3.8 6.7 2.2] [5.4 3. 4.5 1.5] [5.8 4. 1.2 0.2] [6.4 2.8 5.6 2.2] [6.1 3. 4.6 1.4] [5.5 2.3 4. 1.3] [6.9 3.1 5.1 2.3] [6. 2.9 4.5 1.5] [6.2 2.9 4.3 1.3] [6.8 3.2 5.9 2.3] [5. 2.3 3.3 1. ] [4.8 3.4 1.6 0.2] [6.1 2.6 5.6 1.4] [5.2 3.4 1.4 0.2] [6.7 3.1 4.4 1.4] [5.1 3.5 1.4 0.2] [5.2 3.5 1.5 0.2] [5.5 3.5 1.3 0.2] [4.9 2.5 4.5 1.7] [6.2 3.4 5.4 2.3] [7.9 3.8 6.4 2. ] [5.4 3.4 1.7 0.2] [6.7 3.1 5.6 2.4] [6.3 3.4 5.6 2.4] [7.6 3. 6.6 2.1] [6. 2.2 5. 1.5] [4.3 3. 1.1 0.1] [4.8 3.1 1.6 0.2] [5.8 2.7 5.1 1.9] [5.7 2.8 4.1 1.3] [5.2 2.7 3.9 1.4] [7.7 3. 6.1 2.3] [6.3 2.7 4.9 1.8] [6.1 2.8 4. 1.3] [5.1 3.7 1.5 0.4] [5.7 2.8 4.5 1.3] [5.4 3.9 1.3 0.4] [5.8 2.8 5.1 2.4] [5.8 2.6 4. 1.2] [5.1 2.5 3. 1.1] [5.7 3.8 1.7 0.3] [5.5 2.4 3.7 1. ] [5.9 3. 4.2 1.5] [6.7 3.1 4.7 1.5] [7.7 2.8 6.7 2. ] [4.9 3. 1.4 0.2] [6.3 3.3 4.7 1.6] [5.1 3.8 1.5 0.3] [5.8 2.7 3.9 1.2] [6.9 3.2 5.7 2.3] [4.9 3.1 1.5 0.1] [5. 2. 3.5 1. ] [4.9 3.1 1.5 0.2] [5. 3.5 1.3 0.3] [5.4 3.7 1.5 0.2] [6.8 3. 5.5 2.1] [6.3 3.3 6. 2.5] [5. 3.4 1.6 0.4] [5.2 4.1 1.5 0.1] [6.3 2.5 5. 1.9] [7.7 2.6 6.9 2.3] [6. 2.2 4. 1. ] [7.2 3.6 6.1 2.5] [4.9 2.4 3.3 1. ] [6.1 2.8 4.7 1.2] [6.5 3. 5.2 2. ] [5.1 3.5 1.4 0.3] [7.4 2.8 6.1 1.9] [5.9 3. 5.1 1.8] [6.4 2.7 5.3 1.9] [4.4 2.9 1.4 0.2] [5.6 2.8 4.9 2. ] [5.1 3.4 1.5 0.2] [5. 3.3 1.4 0.2] [5.7 2.6 3.5 1. ] [6.9 3.1 5.4 2.1] [5.5 2.6 4.4 1.2] [6.3 2.8 5.1 1.5] [7. 3.2 4.7 1.4] [6.8 2.8 4.8 1.4] [6.5 3.2 5.1 2. ] [6.9 3.1 4.9 1.5] [5.5 2.4 3.8 1.1] [5.6 3. 4.5 1.5] [6. 3. 4.8 1.8] [6. 2.7 5.1 1.6] [5.8 2.7 5.1 1.9] [5.9 3.2 4.8 1.8] [5.1 3.8 1.6 0.2] [6.2 2.2 4.5 1.5] [5.6 3. 4.1 1.3] [5.6 2.5 3.9 1.1] [5.8 2.7 4.1 1. ] [6.4 3.1 5.5 1.8] [6.6 2.9 4.6 1.3] [5.5 4.2 1.4 0.2] [4.4 3. 1.3 0.2] [6.3 2.9 5.6 1.8] [6.4 3.2 4.5 1.5] [7.3 2.9 6.3 1.8] [5. 3.6 1.4 0.2] [7.1 3. 5.9 2.1] [4.9 3.6 1.4 0.1] [6.5 3. 5.5 1.8] [6.7 3.3 5.7 2.1] [5.4 3.4 1.5 0.4] [6.1 2.9 4.7 1.4] [4.6 3.2 1.4 0.2] [6.7 3. 5.2 2.3] [5.7 3. 4.2 1.2] [5. 3.4 1.5 0.2] [6.5 3. 5.8 2.2] [6.6 3. 4.4 1.4] [5. 3.5 1.6 0.6] [4.6 3.6 1. 0.2] [6.3 2.5 4.9 1.5] [5.7 4.4 1.5 0.4]] 测试集的特征值是: [[4.6 3.4 1.4 0.3] [4.6 3.1 1.5 0.2] [5.7 2.5 5. 2. ] [4.8 3. 1.4 0.1] [4.8 3.4 1.9 0.2] [7.2 3. 5.8 1.6] [5. 3. 1.6 0.2] [6.7 2.5 5.8 1.8] [6.4 2.8 5.6 2.1] [4.8 3. 1.4 0.3] [5.3 3.7 1.5 0.2] [4.4 3.2 1.3 0.2] [5. 3.2 1.2 0.2] [5.4 3.9 1.7 0.4] [6. 3.4 4.5 1.6] [6.5 2.8 4.6 1.5] [4.5 2.3 1.3 0.3] [5.7 2.9 4.2 1.3] [6.7 3.3 5.7 2.5] [5.5 2.5 4. 1.3] [6.7 3. 5. 1.7] [6.4 2.9 4.3 1.3] [6.4 3.2 5.3 2.3] [5.6 2.7 4.2 1.3] [6.3 2.3 4.4 1.3] [4.7 3.2 1.6 0.2] [4.7 3.2 1.3 0.2] [6.1 3. 4.9 1.8] [5.1 3.8 1.9 0.4] [7.2 3.2 6. 1.8]] 训练集的目标值是: [2 0 1 2 1 0 2 1 1 2 1 1 2 1 0 2 0 1 0 0 0 2 2 2 0 2 2 2 2 0 0 2 1 1 2 2 1 0 1 0 2 1 1 0 1 1 1 2 0 1 0 1 2 0 1 0 0 0 2 2 0 0 2 2 1 2 1 1 2 0 2 2 2 0 2 0 0 1 2 1 2 1 1 2 1 1 1 2 1 2 1 0 1 1 1 1 2 1 0 0 2 1 2 0 2 0 2 2 0 1 0 2 1 0 2 1 0 0 1 0] 测试集的目标值是: [0 0 2 0 0 2 0 2 2 0 0 0 0 0 1 1 0 1 2 1 1 1 2 1 1 0 0 2 0 2]
Cross-Validation and Grid Search API
sklearn.model_selection.GridSearchCV(estimator, param_grid=None, cv=None)
Performs an exhaustive search over the specified parameter values of an estimator.
estimator: the estimator object
param_grid: the estimator parameters as a dict, e.g. {"n_neighbors": [1, 3, 5]}
cv: the number of cross-validation folds
fit: feed in the training data
score: accuracy
Result attributes:
best_score_: the best score achieved during cross-validation
best_estimator_: the estimator with the best parameters
cv_results_: the validation-set and training-set accuracy for each round of cross-validation
This API is mainly used to pick a model's optimal hyperparameters; for example, we can use it to find the KNN model with the best value of k,
which the iris example below shows clearly.
```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, train_size=0.2)

transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
# use transform (not fit_transform) so the test set reuses the training-set statistics
x_test = transfer.transform(x_test)

# grid-search over candidate k values with 5-fold cross-validation
estimator = KNeighborsClassifier()
param_dict = {"n_neighbors": [1, 3, 5, 7, 9]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=5)
estimator.fit(x_train, y_train)

y_pre = estimator.predict(x_test)
print("对比预测结果和真实值: \n", y_pre == y_test)
score = estimator.score(x_test, y_test)
print("准确率 \n", score)
print("在交叉验证中验证的最好结果:\n", estimator.best_score_)
print("最好的参数模型:\n", estimator.best_estimator_)
print("每次交叉验证后的准确率结果:\n", estimator.cv_results_)
```
Run result:
对比预测结果和真实值: [ True False True True True False False True True True True True True False True True True True False True True True True True True True True True False True True False False True True True True True False True True False False True False True True True False False True False True True True True True True True True False True True True False False True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False True True True True False True True True True True True True True True False True True True True True False True True True] 准确率 0.8166666666666667 在交叉验证中验证的最好结果: 0.9666666666666668 最好的参数模型: KNeighborsClassifier(n_neighbors=1) 每次交叉验证后的准确率结果: {'mean_fit_time': array([0.00039964, 0.00019946, 0.00039964, 0.00039892, 0.00039849]), 'std_fit_time': array([0.00048945, 0.00039892, 0.00048946, 0.00048858, 0.00048805]), 'mean_score_time': array([0.00119624, 0.00099826, 0.00099659, 0.00099702, 0.00079846]), 'std_score_time': array([4.00023964e-04, 2.03425684e-06, 6.29846620e-04, 6.30751181e-04, 3.99234513e-04]), 'param_n_neighbors': masked_array(data=[1, 3, 5, 7, 9], mask=[False, False, False, False, False], fill_value='?', dtype=object), 'params': [{'n_neighbors': 1}, {'n_neighbors': 3}, {'n_neighbors': 5}, {'n_neighbors': 7}, {'n_neighbors': 9}], 'split0_test_score': array([0.83333333, 0.83333333, 1. , 0.83333333, 0.83333333]), 'split1_test_score': array([1. , 1. , 0.83333333, 1. , 1. ]), 'split2_test_score': array([1., 1., 1., 1., 1.]), 'split3_test_score': array([1., 1., 1., 1., 1.]), 'split4_test_score': array([1., 1., 1., 1., 1.]), 'mean_test_score': array([0.96666667, 0.96666667, 0.96666667, 0.96666667, 0.96666667]), 'std_test_score': array([0.06666667, 0.06666667, 0.06666667, 0.06666667, 0.06666667]), 'rank_test_score': array([1, 1, 1, 1, 1])}
From the output above we can see that the optimal k here is 1: when using GridSearchCV, we pass in our model together with the hyperparameter grid, and it searches out the model with the best k value for us.
2. Linear Regression API
sklearn.linear_model.LinearRegression(fit_intercept=True)
Optimized via the normal equation
fit_intercept: whether to compute the intercept (bias)
LinearRegression.coef_: the regression coefficients
LinearRegression.intercept_: the intercept (bias)
Code demo:
```python
from sklearn.linear_model import LinearRegression

x = [[80, 86], [82, 80], [85, 78], [90, 90],
     [86, 82], [82, 90], [78, 80], [92, 94]]
y = [84.2, 80.6, 80.1, 90, 83.2, 87.6, 79.4, 93.4]

estimator = LinearRegression()
estimator.fit(x, y)
print("系数是 \n", estimator.coef_)
y_pre = estimator.predict([[100, 60]])
print("预测的最终成绩 \n", y_pre)
```
Run result:
```
系数是 
 [0.3 0.7]
预测的最终成绩 
 [72.]
```
Linear Regression Optimization API
Gradient descent
sklearn.linear_model.SGDRegressor(loss="squared_loss", fit_intercept=True, learning_rate='invscaling', eta0=0.01)
The SGDRegressor class implements stochastic gradient descent learning; it supports different loss functions and regularization penalties for fitting linear regression models.
loss: the loss type
loss="squared_loss": ordinary least squares
fit_intercept: whether to compute the intercept
learning_rate : string, optional
The learning-rate schedule
'constant': eta = eta0
'optimal': eta = 1.0 / (alpha * (t + t0))
'invscaling': eta = eta0 / pow(t, power_t) [the default for SGDRegressor]
For a constant learning rate, use learning_rate='constant' and specify the rate via eta0.
SGDRegressor.coef_: the regression coefficients
SGDRegressor.intercept_: the intercept (bias)
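As a quick, self-contained illustration of these parameters (the toy data below are invented; the Boston housing demo further down applies SGDRegressor to real data):

```python
from sklearn.linear_model import SGDRegressor

# toy data roughly following y = 2x
x = [[1.0], [2.0], [3.0], [4.0]]
y = [2.1, 3.9, 6.2, 7.8]

# constant schedule: eta stays at eta0 for every update
# (newer scikit-learn versions spell the loss "squared_error")
estimator = SGDRegressor(loss="squared_loss", learning_rate="constant",
                         eta0=0.01, max_iter=1000)
estimator.fit(x, y)
print(estimator.coef_, estimator.intercept_)
```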
Regression Performance Evaluation API
sklearn.metrics.mean_squared_error(y_true, y_pred)
Mean squared error regression loss
y_true: the true values
y_pred: the predicted values
return: a float
Code demo:
(dataset: Boston housing)
```python
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error


def line_model1():
    # linear regression via the normal equation
    boston = load_boston()
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2)
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)  # reuse the training-set statistics
    estimator = LinearRegression()
    estimator.fit(x_train, y_train)
    y_pre = estimator.predict(x_test)
    print("预测值\n", y_pre)
    print("系数\n", estimator.coef_)
    print("准确率 \n", estimator.score(x_test, y_test))  # R^2 score
    print("偏置\n", estimator.intercept_)
    error = mean_squared_error(y_test, y_pre)
    print("回归误差\n", error)


def line_model2():
    # linear regression via stochastic gradient descent
    boston = load_boston()
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2)
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    estimator = SGDRegressor(max_iter=1000, learning_rate='constant', eta0=0.001, penalty="l2")
    estimator.fit(x_train, y_train)
    y_pre = estimator.predict(x_test)
    print("预测值\n", y_pre)
    print("系数\n", estimator.coef_)
    print("准确率 \n", estimator.score(x_test, y_test))
    print("偏置\n", estimator.intercept_)
    error = mean_squared_error(y_test, y_pre)
    print("回归误差\n", error)


if __name__ == '__main__':
    line_model1()
    print("----------")
    line_model2()
```
Run result:
预测值 [15.69281708 15.60607551 22.31177172 9.79777519 32.00221586 40.10161418 19.94011241 20.74246294 21.48518439 30.48520552 20.39978969 25.04166372 28.08300791 11.78332454 20.63561262 29.89531582 20.37314771 30.31422588 21.64751123 28.47355414 33.53589598 37.66749504 17.68295323 33.36918539 28.63002574 21.03322709 15.0384725 16.79145899 33.62807826 27.14229387 13.06248108 19.40785796 22.29885164 16.65282174 12.24378835 16.28891882 18.19222902 22.13357388 22.07569494 15.0815945 18.23662886 24.14784132 18.00690586 21.07802009 34.33780194 21.47902247 6.68112277 18.56996792 21.57538055 17.60363462 22.59480213 25.56799234 29.20690236 17.90162353 26.98975494 30.23606361 4.4897138 25.65537934 24.18648981 24.32316046 18.77870203 17.51216944 37.49041986 24.15746128 33.74693009 18.25229237 20.85217311 20.90589937 13.55357772 28.67324426 5.51189516 9.01921008 35.34329167 26.11243276 16.48342158 22.25250803 12.64776473 21.64854402 4.05782391 24.8512574 39.51878212 32.03050057 20.29504628 27.60485709 23.81182922 20.87088906 19.57752992 11.64715411 15.70474816 27.95454556 16.18374635 5.15133836 31.13429957 20.23116795 26.81146854 18.61413205 36.22408796 25.53406166 19.32957632 18.94178824 17.00882559 31.5355671 ] 系数 [-0.94636854 0.95952433 0.51606731 0.57193601 -1.87046793 3.08337865 -0.41319793 -2.62571876 2.28143837 -2.24935004 -2.23076044 0.93303594 -2.84608188] 准确率 0.5769869713050476 偏置 22.129455445544572 回归误差 39.58927420993645 ---------- 预测值 [13.49186027 25.504558 26.85136413 20.41815308 21.13959552 20.26916695 20.25187422 20.71100327 22.10361591 20.86208864 27.69058263 16.84565454 23.1962553 27.72225096 37.910782 20.23060971 17.02781194 18.87322649 17.69020341 14.66895098 28.67676648 14.3772249 21.56619808 20.01347812 19.31445623 17.8437261 17.38324619 22.04745202 21.97111618 15.00772809 18.94599197 25.37435555 28.1082041 31.12965967 37.57303017 44.80191689 16.53556745 24.09384002 36.8709077 19.17673628 16.52712413 14.23977448 5.59462638 17.70090591 26.51458495 19.39278929 19.0322606 30.58290131 19.1019216 9.34116774 12.9045293 29.29311584 21.42543614 21.52096337 25.31770195 41.0620208 33.2764697 23.28640673 15.48265946 16.23251281 36.01545631 17.96328453 32.16047679 30.14808262 31.79525668 27.9722376 10.79118391 23.51939577 41.36217739 14.56201891 16.00070189 18.9889247 20.23388718 31.48427425 3.5585817 17.59610333 33.70130271 17.85070691 24.68518801 27.89839823 30.24982773 33.68003882 16.41916282 25.13915645 21.99819736 14.42301084 24.66098363 28.62481404 22.58388204 19.38915251 23.82787937 14.28316523 29.1965419 28.60842088 40.57363673 19.52499872 27.40106199 10.58893198 13.68963952 21.32472939 27.51893247 20.21780463] 系数 [-0.6471845 0.60758869 -0.02538613 0.7397836 -1.53741589 3.05876641 -0.15258714 -2.68946359 1.81747311 -1.02835943 -1.86174619 0.86895181 -3.83261592] 准确率 0.6986164807848243 偏置 [22.78737932] 回归误差 26.76961909594243
Improving Linear Regression: Ridge Regression
sklearn.linear_model.Ridge(alpha=1.0, fit_intercept=True, solver="auto", normalize=False)
Linear regression with L2 regularization
alpha: the regularization strength, also written λ
solver: automatically chooses an optimization method based on the data
sag: if the dataset and the number of features are both large, this stochastic average gradient method is chosen
normalize: whether to standardize the data
normalize=False: you can instead call preprocessing.StandardScaler before fit to standardize the data
Ridge.coef_: the regression weights
Ridge.intercept_: the regression intercept (bias)
Ridge is equivalent to SGDRegressor(penalty='l2', loss="squared_loss"), except that SGDRegressor implements plain stochastic gradient descent; Ridge (which implements SAG) is the recommended choice.
sklearn.linear_model.RidgeCV(alphas=(0.1, 1.0, 10.0), ...)
Linear regression with L2 regularization that also performs cross-validation
coef_: the regression coefficients
So once Ridge is available, the SGDRegressor covered earlier is rarely needed,
while RidgeCV is the hyperparameter-selection (alpha-selection) variant of Ridge.
Code demo:
(dataset: Boston housing)
```python
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge

data = load_boston()
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target,
                                                    random_state=22, test_size=0.2)
estimator = Ridge(alpha=0, normalize=True)
estimator.fit(x_train, y_train)
y_pre = estimator.predict(x_test)
score = estimator.score(x_test, y_test)
print("准确率 \n", score)
print("系数 \n", estimator.coef_)
print("偏置 \n", estimator.intercept_)
error = mean_squared_error(y_test, y_pre)
print("误差为:\n", error)
```
Run result:
```
准确率 
 0.7657465943591123
系数 
 [-1.01199845e-01  4.67962110e-02 -2.06902678e-02  3.58072311e+00
 -1.71288922e+01  3.92207267e+00 -5.67997339e-03 -1.54862273e+00
  2.97156958e-01 -1.00709587e-02 -7.78761318e-01  9.87125185e-03
 -5.25319199e-01]
偏置 
 32.42825286699124
误差为:
 20.770684784270024
```
Ridge with cross-validation
Code demo:
```python
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
from sklearn.linear_model import RidgeCV

data = load_boston()
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target,
                                                    random_state=22, test_size=0.2)
estimator = RidgeCV(alphas=(1, 0.1, 10, 0.55, 9), normalize=True)
estimator.fit(x_train, y_train)
y_pre = estimator.predict(x_test)
score = estimator.score(x_test, y_test)
print("准确率 \n", score)
print("系数 \n", estimator.coef_)
print("偏置 \n", estimator.intercept_)
error = mean_squared_error(y_test, y_pre)
print("误差为:\n", error)
```
Run result:
```
准确率 
 0.755640759192834
系数 
 [-7.10636292e-02  2.81853030e-02 -7.07760829e-02  3.72392633e+00
 -1.05710401e+01  4.09961387e+00 -9.32546827e-03 -1.06310611e+00
  1.37548657e-01 -3.58696599e-03 -6.85595220e-01  9.37914747e-03
 -4.66378501e-01]
偏置 
 23.328276195586888
误差为:
 21.666744827223443
```
Model Saving and Loading API
from sklearn.externals import joblib (in newer scikit-learn versions, joblib is a standalone package: import joblib, which is what the demos below use)
Save: joblib.dump(estimator, 'test.pkl')
Load: estimator = joblib.load('test.pkl')
Code examples:
Saving a model
(dataset: Boston housing)
```python
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import joblib


def dump_load_demo():
    boston = load_boston()
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target,
                                                        train_size=0.2, random_state=22)
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)  # reuse the training-set statistics
    estimator = Ridge()
    estimator.fit(x_train, y_train)
    # save the fitted model to disk
    joblib.dump(estimator, './data/test.pkl')
    y_pre = estimator.predict(x_test)
    print("预测值 \n", y_pre)
    score = estimator.score(x_test, y_test)
    print("准确率\n", score)
    print("偏置\n", estimator.intercept_)
    print("系数\n", estimator.coef_)
    ret = mean_squared_error(y_test, y_pre)
    print("均方误差\n", ret)


if __name__ == '__main__':
    dump_load_demo()
```
Note: the model is saved in pkl format.
Run result:
预测值 [23.64469576 27.97409384 17.55168799 28.80358367 17.99377237 17.16911389 18.08838244 18.5484773 18.39159086 30.11882442 19.02379724 26.14281853 12.91653465 16.42264835 33.05693686 16.30877287 6.37751888 16.82204542 28.50466918 22.99403455 16.57065921 28.10613493 27.20987256 15.88197605 29.39165138 23.83772485 28.28820203 26.13717053 15.34475384 17.2592326 24.96024219 11.92509187 31.5877362 4.22575655 13.85279407 15.45154778 4.20339965 19.19572767 36.43206568 27.38697842 23.16484782 15.50800593 32.61396957 4.76928343 20.61280174 22.98552067 18.00938256 17.47797758 15.65941555 21.53036514 8.07350341 23.7631832 27.89948024 12.60928859 6.00939238 29.66164389 25.73749604 22.21461849 15.86568871 21.70506535 22.32167003 21.54997175 18.75814107 34.35036868 23.93330189 16.4650795 10.29497693 4.81134268 37.24939583 21.04504805 14.40856753 21.70178409 33.82983682 21.38429774 30.8300285 23.97655944 21.06166321 17.23171262 24.16127528 22.93301994 27.46787702 19.09761525 21.58895637 26.70861218 23.91545352 18.99584879 25.89840907 21.06068455 22.76976898 16.54404802 22.34618208 20.5520392 17.00269605 11.11408577 10.80942164 16.88787311 21.36466922 14.08334209 18.34541038 22.49992866 16.6009481 14.76856955 22.98225728 20.82099466 18.09167202 31.25449524 14.7748573 21.37510721 29.48724174 30.27555742 19.35717062 23.41591865 18.92675356 14.08811569 19.50876796 20.75844963 24.66902078 23.28548242 21.37184963 12.61360164 11.84374119 1.97389286 26.75034249 18.36677866 20.22284006 25.65136456 26.76853678 21.01965862 24.76838538 19.2719079 0.20758938 33.66047857 33.62514941 24.10607586 25.27714009 19.36420085 32.79293209 12.7253308 16.4697096 10.07572615 34.63725968 17.51742591 30.81214014 22.7464453 7.47271236 22.0566537 26.5021439 29.02030002 14.90958575 6.34319256 21.72255731 15.312686 26.12292152 28.03322847 23.81482954 14.76050315 26.25298872 33.36534552 5.38378434 19.48867011 17.22952014 26.73584653 13.75318646 32.2262328 30.80102394 30.85734971 19.13265617 18.97112331 23.78360179 18.36212696 11.02997263 20.49077695 20.0010755 33.14860384 23.48461798 25.93285901 33.6286644 14.55281723 5.71533379 23.80222741 12.74528738 29.87644788 17.36057156 17.68386619 24.48905778 15.40371636 17.1681389 14.56256338 22.20362029 24.07938957 16.13989928 20.30403757 13.11742038 19.76011695 10.15959589 5.05451631 24.98773317 22.05933162 17.38635448 17.95889724 11.85983575 23.09467263 10.5421983 12.56199688 20.46920769 22.89551792 11.35987849 28.36158041 22.46150768 22.07052047 18.96479228 15.94762742 29.55993025 18.18280727 26.60883362 28.91338874 29.0243314 27.11588917 23.41267873 22.16089044 15.87275538 19.55923911 10.01955695 25.36241049 37.09399692 9.21076403 28.7250581 23.1210652 16.10488274 20.98532986 11.92937281 16.63409543 13.39946236 18.10252834 30.20768813 27.01016297 23.78388552 15.90408559 25.88867096 23.46240636 22.98550277 17.90243108 30.69975106 20.54447287 17.24878101 22.6405577 18.78662547 26.43378753 6.1415722 13.46113754 11.01274483 27.46925322 31.75050402 11.44302638 9.15277255 30.32661424 22.23993372 3.59762111 21.18431804 22.32873001 24.91183653 31.66081008 20.11883814 23.58860016 23.39144498 20.95717749 28.62287194 30.25468985 18.60827558 17.29159623 20.8459336 29.53135771 29.1735858 25.86670544 20.47761831 22.00175334 25.46210862 20.76591231 18.48931061 23.21360354 28.04039153 14.76240119 5.56366719 19.02971742 23.60960253 25.61011124 17.93010868 18.35299157 14.99314051 24.55852344 15.73398244 17.9936638 15.78653015 26.86927323 21.83288484 17.09786581 24.1455792 23.21781793 4.89715057 14.55190726 
29.53097939 16.96942402 29.97092011 20.56134909 23.61695306 13.90547522 28.94524778 13.9490235 25.14367973 17.81456951 26.92882448 21.97947553 16.37959719 25.58285606 18.4004159 16.81711119 21.41056759 20.52708367 10.18563418 24.72279293 7.65824257 -5.75882408 28.92725713 15.23172694 -0.80822785 1.1885204 16.05592378 31.02687874 22.11833608 25.257107 21.11962573 27.17319156 14.23418317 35.29815094 16.76025589 23.88193103 11.49969608 11.005055 11.00317386 18.27016267 24.9635937 14.54733798 22.23443933 21.72653137 18.11347747 22.50164981 30.66773175 17.84627447 19.36076926 27.85693006 33.87880818 25.66176459 19.14387897 22.66753296 14.84930104 27.39448909 33.97151416 18.97839843 16.93696433 23.93610382 27.80296883 10.97564404 33.33567959 18.90876462 21.50046582 28.38223075 15.20285187 16.59362028 19.48364863 27.52004471 20.99718401 24.00002741 27.2015801 22.39667384 23.24447674 23.08634542 11.51804889 14.09160002 30.71290202 30.53124258 1.59429213 19.2223747 23.92469427 30.52913878 13.84632587 17.93078843 28.58855707 16.2168317 8.57194758 19.94632304 21.58821976 20.44036427 27.38200144 24.42298241 6.31193462 26.26904066 28.46064924 21.90708324 11.52479134 15.93255738 25.56393581 23.51065591 31.34782385 7.78903776 21.15999469] 准确率 0.6384683499205781 偏置 20.605940594059394 系数 [-0.91782309 -0.17382552 0.26572243 0.6906218 -2.23367615 1.81562646 0.39996906 -1.14481067 2.18240602 -0.79484838 -2.12801148 0.82343198 -4.1643924 ] 均方误差 32.38410204009493
Loading a model
(dataset: Boston housing)
```python
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import joblib


def dump_load_demo():
    boston = load_boston()
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target,
                                                        train_size=0.2, random_state=22)
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)  # reuse the training-set statistics
    estimator = Ridge()
    estimator.fit(x_train, y_train)
    # replace the freshly fitted model with the one saved to disk earlier
    estimator = joblib.load('./data/test.pkl')
    y_pre = estimator.predict(x_test)
    print("预测值 \n", y_pre)
    score = estimator.score(x_test, y_test)
    print("准确率\n", score)
    print("偏置\n", estimator.intercept_)
    print("系数\n", estimator.coef_)
    ret = mean_squared_error(y_test, y_pre)
    print("均方误差\n", ret)


if __name__ == '__main__':
    dump_load_demo()
```
Run result:
预测值 [23.64469576 27.97409384 17.55168799 28.80358367 17.99377237 17.16911389 18.08838244 18.5484773 18.39159086 30.11882442 19.02379724 26.14281853 12.91653465 16.42264835 33.05693686 16.30877287 6.37751888 16.82204542 28.50466918 22.99403455 16.57065921 28.10613493 27.20987256 15.88197605 29.39165138 23.83772485 28.28820203 26.13717053 15.34475384 17.2592326 24.96024219 11.92509187 31.5877362 4.22575655 13.85279407 15.45154778 4.20339965 19.19572767 36.43206568 27.38697842 23.16484782 15.50800593 32.61396957 4.76928343 20.61280174 22.98552067 18.00938256 17.47797758 15.65941555 21.53036514 8.07350341 23.7631832 27.89948024 12.60928859 6.00939238 29.66164389 25.73749604 22.21461849 15.86568871 21.70506535 22.32167003 21.54997175 18.75814107 34.35036868 23.93330189 16.4650795 10.29497693 4.81134268 37.24939583 21.04504805 14.40856753 21.70178409 33.82983682 21.38429774 30.8300285 23.97655944 21.06166321 17.23171262 24.16127528 22.93301994 27.46787702 19.09761525 21.58895637 26.70861218 23.91545352 18.99584879 25.89840907 21.06068455 22.76976898 16.54404802 22.34618208 20.5520392 17.00269605 11.11408577 10.80942164 16.88787311 21.36466922 14.08334209 18.34541038 22.49992866 16.6009481 14.76856955 22.98225728 20.82099466 18.09167202 31.25449524 14.7748573 21.37510721 29.48724174 30.27555742 19.35717062 23.41591865 18.92675356 14.08811569 19.50876796 20.75844963 24.66902078 23.28548242 21.37184963 12.61360164 11.84374119 1.97389286 26.75034249 18.36677866 20.22284006 25.65136456 26.76853678 21.01965862 24.76838538 19.2719079 0.20758938 33.66047857 33.62514941 24.10607586 25.27714009 19.36420085 32.79293209 12.7253308 16.4697096 10.07572615 34.63725968 17.51742591 30.81214014 22.7464453 7.47271236 22.0566537 26.5021439 29.02030002 14.90958575 6.34319256 21.72255731 15.312686 26.12292152 28.03322847 23.81482954 14.76050315 26.25298872 33.36534552 5.38378434 19.48867011 17.22952014 26.73584653 13.75318646 32.2262328 30.80102394 30.85734971 19.13265617 18.97112331 23.78360179 18.36212696 11.02997263 20.49077695 20.0010755 33.14860384 23.48461798 25.93285901 33.6286644 14.55281723 5.71533379 23.80222741 12.74528738 29.87644788 17.36057156 17.68386619 24.48905778 15.40371636 17.1681389 14.56256338 22.20362029 24.07938957 16.13989928 20.30403757 13.11742038 19.76011695 10.15959589 5.05451631 24.98773317 22.05933162 17.38635448 17.95889724 11.85983575 23.09467263 10.5421983 12.56199688 20.46920769 22.89551792 11.35987849 28.36158041 22.46150768 22.07052047 18.96479228 15.94762742 29.55993025 18.18280727 26.60883362 28.91338874 29.0243314 27.11588917 23.41267873 22.16089044 15.87275538 19.55923911 10.01955695 25.36241049 37.09399692 9.21076403 28.7250581 23.1210652 16.10488274 20.98532986 11.92937281 16.63409543 13.39946236 18.10252834 30.20768813 27.01016297 23.78388552 15.90408559 25.88867096 23.46240636 22.98550277 17.90243108 30.69975106 20.54447287 17.24878101 22.6405577 18.78662547 26.43378753 6.1415722 13.46113754 11.01274483 27.46925322 31.75050402 11.44302638 9.15277255 30.32661424 22.23993372 3.59762111 21.18431804 22.32873001 24.91183653 31.66081008 20.11883814 23.58860016 23.39144498 20.95717749 28.62287194 30.25468985 18.60827558 17.29159623 20.8459336 29.53135771 29.1735858 25.86670544 20.47761831 22.00175334 25.46210862 20.76591231 18.48931061 23.21360354 28.04039153 14.76240119 5.56366719 19.02971742 23.60960253 25.61011124 17.93010868 18.35299157 14.99314051 24.55852344 15.73398244 17.9936638 15.78653015 26.86927323 21.83288484 17.09786581 24.1455792 23.21781793 4.89715057 14.55190726 
29.53097939 16.96942402 29.97092011 20.56134909 23.61695306 13.90547522 28.94524778 13.9490235 25.14367973 17.81456951 26.92882448 21.97947553 16.37959719 25.58285606 18.4004159 16.81711119 21.41056759 20.52708367 10.18563418 24.72279293 7.65824257 -5.75882408 28.92725713 15.23172694 -0.80822785 1.1885204 16.05592378 31.02687874 22.11833608 25.257107 21.11962573 27.17319156 14.23418317 35.29815094 16.76025589 23.88193103 11.49969608 11.005055 11.00317386 18.27016267 24.9635937 14.54733798 22.23443933 21.72653137 18.11347747 22.50164981 30.66773175 17.84627447 19.36076926 27.85693006 33.87880818 25.66176459 19.14387897 22.66753296 14.84930104 27.39448909 33.97151416 18.97839843 16.93696433 23.93610382 27.80296883 10.97564404 33.33567959 18.90876462 21.50046582 28.38223075 15.20285187 16.59362028 19.48364863 27.52004471 20.99718401 24.00002741 27.2015801 22.39667384 23.24447674 23.08634542 11.51804889 14.09160002 30.71290202 30.53124258 1.59429213 19.2223747 23.92469427 30.52913878 13.84632587 17.93078843 28.58855707 16.2168317 8.57194758 19.94632304 21.58821976 20.44036427 27.38200144 24.42298241 6.31193462 26.26904066 28.46064924 21.90708324 11.52479134 15.93255738 25.56393581 23.51065591 31.34782385 7.78903776 21.15999469] 准确率 0.6384683499205781 偏置 20.605940594059394 系数 [-0.91782309 -0.17382552 0.26572243 0.6906218 -2.23367615 1.81562646 0.39996906 -1.14481067 2.18240602 -0.79484838 -2.12801148 0.82343198 -4.1643924 ] 均方误差 32.38410204009493
Both runs produce exactly the same results.
3. Logistic Regression API
sklearn.linear_model.LogisticRegression(solver='liblinear', penalty='l2', C=1.0)
solver: optional, one of {'liblinear', 'sag', 'saga', 'newton-cg', 'lbfgs'}
Default: 'liblinear'; the algorithm used for the optimization problem.
For small datasets, 'liblinear' is a good choice, while 'sag' and 'saga' are faster on large ones.
For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' can handle multinomial loss; 'liblinear' is limited to one-versus-rest classification.
penalty: the type of regularization
C: the regularization strength (in sklearn, C is actually the inverse of the regularization strength: smaller values mean stronger regularization)
By default, the class with fewer samples is treated as the positive class.
LogisticRegression is equivalent to SGDClassifier(loss="log", penalty=...), except that SGDClassifier implements plain stochastic gradient descent, whereas LogisticRegression implements SAG.
Classification Evaluation API
sklearn.metrics.classification_report(y_true, y_pred, labels=[], target_names=None)
y_true: the true target values
y_pred: the target values predicted by the estimator
labels: the numbers corresponding to the specified classes
target_names: the display names of the target classes
return: precision and recall for each class (see the sketch below)
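A minimal sketch with hand-written labels (the class names are made up for illustration):

```python
from sklearn.metrics import classification_report

# toy labels purely for illustration
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

# map the numeric labels to readable class names
report = classification_report(y_true, y_pred,
                               labels=[0, 1],
                               target_names=["negative", "positive"])
print(report)  # per-class precision, recall, f1-score and support
```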
AUC Metric API
from sklearn.metrics import roc_auc_score
sklearn.metrics.roc_auc_score(y_true, y_score)
Computes the area under the ROC curve, i.e., the AUC value
y_true: the true class of each sample, labeled 0 (negative) or 1 (positive)
y_score: the predicted scores; this can be the estimated probability of the positive class, a confidence value, or the return value of a classifier method
AUC effectively ranges over [0.5, 1] (0.5 corresponds to random guessing), and the closer to 1 the better
AUC can only be used to evaluate binary classification
AUC is well suited to evaluating classifier performance on imbalanced data
Code demo:
(dataset: cancer classification prediction)
数据集来源:https://archive.ics.uci.edu/ml/machine-learning-databases/
Data description
(1) 699 samples with 11 columns: the first column is an id used for lookup, the next 9 columns are medical features related to the tumor, and the last column is a number indicating the tumor type.
(2) The data contain 16 missing values, marked with "?".
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
         'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size',
         'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
data = pd.read_csv('./data/breast-cancer-wisconsin.data', names=names)

# replace the "?" markers with NaN, then fill each incomplete column with its mean
data = data.replace(to_replace="?", value=np.nan)
data['Bare Nuclei'] = data["Bare Nuclei"].astype("float64")
for i in data.columns:
    if np.all(pd.notnull(data[i])) == False:
        print(i)
        data[i].fillna(data[i].mean(), inplace=True)
data['Bare Nuclei'] = data["Bare Nuclei"].astype("int64")

x = data.iloc[:, 1:10]
y = data["Class"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=22)

transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)  # reuse the training-set statistics

estimator = LogisticRegression()
estimator.fit(x_train, y_train)
y_pre = estimator.predict(x_test)
score = estimator.score(x_test, y_test)
roc = roc_auc_score(y_true=y_test, y_score=y_pre)
print("正确率 \n", score)
print("auc\n", roc)
```
Run result:
```
正确率 
 0.9571428571428572
auc
 0.9489795918367347
```
4. Decision Tree API
class sklearn.tree.DecisionTreeClassifier(criterion='gini', max_depth=None, random_state=None)
criterion
The feature-selection criterion
"gini" or "entropy": the former uses the Gini coefficient, the latter information gain. The default is "gini", i.e., the CART algorithm.
min_samples_split
The minimum number of samples an internal node needs before it can be split
This value limits how far a subtree keeps splitting: if a node holds fewer than min_samples_split samples, no further attempt is made to pick the best feature and split on it. The default is 2. For small sample sizes, leave it alone; for very large sample sizes, increase it. In an earlier project of mine with roughly 100,000 samples, I used min_samples_split=10 when building the tree, which may serve as a reference.
min_samples_leaf
The minimum number of samples at a leaf node
This value limits the smallest allowed leaf: if a leaf would end up with fewer than min_samples_leaf samples, it is pruned together with its sibling. The default is 1; you can pass an integer minimum count or a fraction of the total sample count. For small sample sizes, leave it alone; for very large sample sizes, increase it. The 100,000-sample project above used min_samples_leaf=5, for reference only.
max_depth
The maximum depth of the decision tree
By default no value is given, in which case the tree does not limit the depth of its subtrees. In general, with little data or few features this can be ignored. If the model has many samples and many features, limiting the maximum depth is recommended; the exact value depends on the data distribution, with common values between 10 and 100.
random_state: the random seed (controls the randomness of the estimator)
Feature Extraction API
Dictionary feature extraction
sklearn.feature_extraction.DictVectorizer(sparse=True, ...)
DictVectorizer.fit_transform(X)
X: a dict or an iterable of dicts
Returns a sparse matrix
DictVectorizer.get_feature_names(): returns the feature names (see the sketch below)
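A minimal sketch with invented city records:

```python
from sklearn.feature_extraction import DictVectorizer

# made-up records purely for illustration
data = [{"city": "Beijing", "temperature": 100},
        {"city": "Shanghai", "temperature": 60},
        {"city": "Shenzhen", "temperature": 30}]

transfer = DictVectorizer(sparse=False)  # sparse=False returns a dense array
result = transfer.fit_transform(data)
print(transfer.get_feature_names())  # one-hot "city=" columns plus "temperature"
print(result)
```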
Text feature extraction
sklearn.feature_extraction.text.CountVectorizer(stop_words=[])
Returns a word-count matrix
CountVectorizer.fit_transform(X)
X: text or an iterable of text strings
Return value: a sparse matrix
CountVectorizer.get_feature_names(): returns the word list
sklearn.feature_extraction.text.TfidfVectorizer
Returns a TF-IDF weight matrix
TfidfVectorizer.fit_transform(X)
X: text or an iterable of text strings
Return value: a sparse matrix
TfidfVectorizer.get_feature_names(): returns the word list (both vectorizers are sketched below)
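A minimal sketch on two made-up English sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["life is short, i like python",
         "life is too long, i dislike python"]

# raw word counts, with a small stop-word list removing uninformative words
count = CountVectorizer(stop_words=["is", "too"])
print(count.fit_transform(texts).toarray())
print(count.get_feature_names())

# TF-IDF weights instead of raw counts
tfidf = TfidfVectorizer(stop_words=["is", "too"])
print(tfidf.fit_transform(texts).toarray())
```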
Chinese text is a special case, however: since words are not delimited by spaces, we may need a third-party library to segment the text before vectorizing it.
jieba word segmentation
Use jieba.cut() directly to segment Chinese text
https://github.com/fxsjy/jieba
Basic usage is documented at that site; a minimal sketch follows.
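A minimal sketch of segmenting Chinese text with jieba before passing it to CountVectorizer (assumes jieba is installed; the sentences are made up):

```python
import jieba
from sklearn.feature_extraction.text import CountVectorizer

texts = ["人生苦短,我喜欢Python", "人生漫长,不用Python"]

# jieba.cut returns a generator of tokens; join them with spaces
# so that CountVectorizer can split on whitespace
segmented = [" ".join(jieba.cut(t)) for t in texts]

transfer = CountVectorizer()
result = transfer.fit_transform(segmented)
print(transfer.get_feature_names())
print(result.toarray())
```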
Code demo:
(dataset: Titanic)
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt
```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv('./titanic.txt')
# fill missing ages with the mean age
data["age"].fillna(data["age"].mean(), inplace=True)

x = data[["pclass", "age", "sex"]]
y = data["survived"]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)

# one-hot encode the categorical features
transfer = DictVectorizer()
x_train = transfer.fit_transform(x_train.to_dict(orient='records'))
x_test = transfer.transform(x_test.to_dict(orient='records'))

estimator = DecisionTreeClassifier(criterion='entropy', max_depth=3)
estimator.fit(x_train, y_train)
y_pre = estimator.predict(x_test)
score = estimator.score(x_test, y_test)
print("准确率: \n", score)
```
Run result:
5. Ensemble Learning API
Random Forest API (bagging)
sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, bootstrap=True, random_state=None, min_samples_split=2)
n_estimators: integer, optional (default=10); the number of trees in the forest, e.g. 120, 200, 300, 500, 800, 1200
criterion: string, optional (default="gini"); the function that measures the quality of a split
max_depth: integer or None, optional (default=None); the maximum depth of the trees, e.g. 5, 8, 15, 25, 30
max_features="auto": the maximum number of features per decision tree
If "auto", then max_features=sqrt(n_features).
If "sqrt", then max_features=sqrt(n_features) (same as "auto").
If "log2", then max_features=log2(n_features).
If None, then max_features=n_features.
bootstrap: boolean, optional (default=True); whether to use bootstrap sampling (with replacement) when building trees
min_samples_split: the minimum number of samples required to split a node
min_samples_leaf: the minimum number of samples at a leaf node
Hyperparameters: n_estimators, max_depth, min_samples_split, min_samples_leaf
Code demo:
(dataset: Titanic)
Usage is essentially the same as for the decision tree; the random forest just has more hyperparameters, and we can use the cross-validation API (GridSearchCV) to choose good values.
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction import DictVectorizer

data = pd.read_csv('./titanic.txt')
data['age'].fillna(data["age"].mean(), inplace=True)

x = data[["pclass", "age", "sex"]]
y = data["survived"]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)

transfer = DictVectorizer()
x_train = transfer.fit_transform(x_train.to_dict(orient='records'))
x_test = transfer.transform(x_test.to_dict(orient='records'))

# grid-search over the number of trees and the maximum depth
rf = RandomForestClassifier()
param = {"n_estimators": [120, 200, 300, 500, 800, 1200],
         "max_depth": [5, 8, 15, 25, 30]}
gc = GridSearchCV(rf, param_grid=param, cv=2)
gc.fit(x_train, y_train)
print("随机森林预测的准确率为:", gc.score(x_test, y_test))
print("最好模型:", gc.best_estimator_)
```
Run result:
```
随机森林预测的准确率为: 0.7908745247148289
最好模型: RandomForestClassifier(max_depth=5, n_estimators=300)
```
AdaBoost (boosting)
Code demo:
```python
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

data = pd.read_csv("titanic.txt")
data["age"].fillna(data["age"].mean(), inplace=True)

x = data[["age", "sex", "pclass"]]
y = data["survived"]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)

transfer = DictVectorizer()
x_train = transfer.fit_transform(x_train.to_dict(orient="records"))
x_test = transfer.transform(x_test.to_dict(orient="records"))

estimator = AdaBoostClassifier(n_estimators=100, random_state=0)
estimator.fit(x_train, y_train)
score = estimator.score(x_test, y_test)
print("准确率为", score)
```
Run result:
6. Clustering API
sklearn.cluster.KMeans(n_clusters=8)
Parameters:
n_clusters: the number of cluster centers to start with
Integer, default=8; the number of clusters to form, i.e., the number of centroids.
Methods:
estimator.fit(x)
estimator.predict(x)
estimator.fit_predict(x)
Computes the cluster centers and predicts which cluster each sample belongs to; equivalent to calling fit(x) followed by predict(x)
Code demo:
```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
# in scikit-learn >= 0.23 this function is spelled calinski_harabasz_score
from sklearn.metrics import calinski_harabaz_score

# generate four blobs of 2-d points
x, y = datasets.make_blobs(n_samples=1000, n_features=2,
                           centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
                           cluster_std=[0.4, 0.1, 0.1, 0.1], random_state=1)
plt.scatter(x[:, 0], x[:, 1])
plt.show()

# fit the model and predict each sample's cluster in one call
y_pre = KMeans(n_clusters=10, random_state=9).fit_predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y_pre)
plt.show()

# evaluate the clustering with the Calinski-Harabasz index
print(calinski_harabaz_score(x, y_pre))
```
The run results of this demo:
Figure 1: the original scatter plot
Figure 2: the points colored by cluster
It also prints the CH (Calinski-Harabasz) score; the larger the CH score, the better.
Feature Dimensionality Reduction API
sklearn.feature_selection.VarianceThreshold(threshold = 0.0)
Removes all low-variance features
Variance.fit_transform(X)
X: data as a numpy array of shape [n_samples, n_features]
Return value: features whose training-set variance is below threshold are removed. The default keeps all features with non-zero variance, i.e., removes only the features that take the same value in every sample.
Code demo:
```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def variance_demo():
    data = pd.read_csv('./data/factor_returns.csv')
    print(data.shape)
    print("------------------------------------")
    # drop every feature whose variance is below the threshold
    transfer = VarianceThreshold(threshold=100)
    data = transfer.fit_transform(data.iloc[:, 1:10])
    print(data.shape)


if __name__ == '__main__':
    variance_demo()
```
Part of the raw data:
```
index,pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense,date,return
0,000001.XSHE,5.9572,1.1818,85252550922.0,0.8008,14.9403,1211444855670.0,2.01,20701401000.0,10882540000.0,2012-01-31,0.027657228229937388
1,000002.XSHE,7.0289,1.588,84113358168.0,1.6463,7.8656,300252061695.0,0.326,29308369223.2,23783476901.2,2012-01-31,0.08235182370820669
2,000008.XSHE,-262.7461,7.0003,517045520.0,-0.5678,-0.5943,770517752.56,-0.006,11679829.03,12030080.04,2012-01-31,0.09978900335112327
3,000060.XSHE,16.476,3.7146,19680455995.0,5.6036,14.617,28009159184.6,0.35,9189386877.65,7935542726.05,2012-01-31,0.12159482758620697
```
Run result:
```
(2318, 12)
------------------------------------
(2318, 5)
```
Some low-variance features have now been removed from the dataset.
Correlation Coefficient API
The correlation coefficient reflects how closely two variables are related.
Pearson correlation coefficient
from scipy.stats import pearsonr
x : (N,) array_like
y : (N,) array_like
Returns: (Pearson's correlation coefficient, p-value)
Code demo:
```python
from scipy.stats import pearsonr

x1 = [12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9]
x2 = [21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5]

result = pearsonr(x1, x2)  # (Pearson's correlation coefficient, p-value)
print(result)
```
Run result:
(0.9941983762371883, 4.9220899554573455e-09)
The closer the first value (the correlation coefficient) is to 1, the stronger the correlation.
Spearman correlation coefficient
from scipy.stats import spearmanr
Code demo:
```python
from scipy.stats import spearmanr

x1 = [12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9]
x2 = [21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5]

result = spearmanr(x1, x2)
print(result)
```
Run result:
SpearmanrResult(correlation=0.9999999999999999, pvalue=6.646897422032013e-64)
Principal Component Analysis (PCA) API
sklearn.decomposition.PCA(n_components=None)
Projects the data into a lower-dimensional space
n_components:
float: the fraction of the information (variance) to retain
int: the number of features to reduce to
PCA.fit_transform(X) — X: data as a numpy array of shape [n_samples, n_features]
Return value: an array transformed to the specified dimensionality
Code demo:
```python
from sklearn.decomposition import PCA

data = [[2, 8, 4, 5],
        [6, 3, 0, 8],
        [5, 4, 9, 1]]

# integer n_components: reduce to exactly 2 features
transfer = PCA(n_components=2)
data1 = transfer.fit_transform(data)
print(data1)
print(data1.shape)
print("---------")

# float n_components: keep enough components to retain 99% of the information
transfer = PCA(n_components=0.99)
data2 = transfer.fit_transform(data)
print(data2)
print(data2.shape)
```
Run result:
```
[[-3.13587302e-16  3.82970843e+00]
 [-5.74456265e+00 -1.91485422e+00]
 [ 5.74456265e+00 -1.91485422e+00]]
(3, 2)
---------
[[ 1.80389890e-15  3.82970843e+00]
 [ 5.74456265e+00 -1.91485422e+00]
 [-5.74456265e+00 -1.91485422e+00]]
(3, 2)
```
Case study: dimensionality reduction to segment users' preferences by product category
The data are as follows:
order_products__prior.csv: order and product information
Fields: order_id, product_id, add_to_cart_order, reordered
products.csv: product information
Fields: product_id, product_name, aisle_id, department_id
orders.csv: users' order information
Fields: order_id, user_id, eval_set, order_number, ...
aisles.csv: the concrete category (aisle) each product belongs to
```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

order_products = pd.read_csv("./data/order_products__prior.csv")
products = pd.read_csv('./data/products.csv')
orders = pd.read_csv('./data/orders.csv')
aisles = pd.read_csv('./data/aisles.csv')

# join the four tables into one
table1 = pd.merge(order_products, products, on="product_id")
table2 = pd.merge(table1, orders, on="order_id")
table3 = pd.merge(table2, aisles, on="aisle_id")

# cross-tabulate users against aisles, then keep the first 1000 users
table = pd.crosstab(table3["user_id"], table3["aisle"])
new_data = table[:1000]

# PCA keeping 90% of the information, then cluster into 5 groups
transfer = PCA(n_components=0.9)
trans_data = transfer.fit_transform(new_data)
estimator = KMeans(n_clusters=5)
y_pre = estimator.fit_predict(trans_data)

# evaluate the clustering with the silhouette score
sl_score = silhouette_score(trans_data, y_pre)
print(sl_score)
```
Run result:
The silhouette score (sl_score) ranges from -1 (worst) to 1 (best).