Preface

This article summarizes the scikit-learn APIs covered in earlier posts, as a review reference.

1. K-Nearest Neighbors (KNN) API

sklearn.neighbors.KNeighborsClassifier(n_neighbors=5)

Code demo:

from sklearn.neighbors import KNeighborsClassifier

# toy training data: 4 samples with a single feature
x = [[1], [2], [0], [0]]
y = [1, 1, 0, 0]
# build a KNN classifier with k=2 and fit it
estimator = KNeighborsClassifier(n_neighbors=2)
estimator.fit(x, y)
# predict the class of a new sample
ret = estimator.predict([[1]])
print(ret)

Output:

[1]

Feature Preprocessing API

  • from sklearn.preprocessing import MinMaxScaler (normalization)

  • from sklearn.preprocessing import StandardScaler (standardization)

Because min-max normalization is computed from the minimum and maximum of each feature, its output is badly distorted when the dataset contains many outliers; it is not robust, so it is rarely a good fit for large modern datasets.

So in practice we usually standardize the data instead.

Code demo:

from sklearn.preprocessing import MinMaxScaler, StandardScaler
import pandas as pd
# read the data
data = pd.read_csv('./data/dating.txt')
# create a MinMaxScaler transformer
transfer = MinMaxScaler(feature_range=(0, 1))

MinMax_data = transfer.fit_transform(data.iloc[:, :3])
print("归一化处理后: \n", MinMax_data)
# create a StandardScaler transformer
transfer1 = StandardScaler()
stander_data = transfer1.fit_transform(data.iloc[:, :3])
print("标准化处理后: \n", stander_data)

A few rows of the raw data (before processing):

milage,Liters,Consumtime,target
40920,8.326976,0.953952,3
14488,7.153469,1.673904,2
26052,1.441871,0.805124,1
75136,13.147394,0.428964,1
38344,1.669788,0.134296,1
72993,10.141740,1.032955,1
35948,6.830792,1.213192,3

Output:

归一化处理后: 
[[0.44832535 0.39805139 0.56233353]
[0.15873259 0.34195467 0.98724416]
[0.28542943 0.06892523 0.47449629]
...
[0.29115949 0.50910294 0.51079493]
[0.52711097 0.43665451 0.4290048 ]
[0.47940793 0.3768091 0.78571804]]
标准化处理后:
[[ 0.33193158 0.41660188 0.24523407]
[-0.87247784 0.13992897 1.69385734]
[-0.34554872 -1.20667094 -0.05422437]
...
[-0.32171752 0.96431572 0.06952649]
[ 0.65959911 0.60699509 -0.20931587]
[ 0.46120328 0.31183342 1.00680598]]

Dataset Splitting API

from sklearn.model_selection import train_test_split

Parameters:

train_size: size of the training set

float: between 0 and 1, the proportion of the dataset used for training

int: the absolute number of training samples

None: automatically the complement of the test set, i.e. the original dataset minus the test set

test_size: size of the test set, default 0.25

float: between 0 and 1, the proportion of the dataset used for testing

int: the absolute number of test samples

None: automatically the complement of the training set, i.e. the original dataset minus the training set

**random_state:** the random seed; set it to make the split reproducible

**shuffle:** whether to shuffle the data before splitting, True or False, default True

**stratify:** whether to split while preserving the class proportions. For example, if the original data has class A : class B = 75% : 25%, then both the resulting training and test sets keep that 75% : 25% ratio. This is useful when the classes are very imbalanced; the usual call is stratify=y, i.e. stratify on the label array y (see the sketch below).
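
A minimal sketch of a stratified, reproducible split (using the iris data purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
# stratify keeps the class proportions identical in the train and test splits,
# and random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, stratify=iris.target, random_state=0)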

We use this API to split our dataset into a training set and a test set.

In the iris example, we pass in the iris dataset's feature values and target values, and get back the training-set features, test-set features, training-set targets and test-set targets used for model training.

Code demo:

(dataset: iris)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# 1. load the dataset
iris = load_iris()

# 2. basic data handling
# x_train, x_test, y_train, y_test are the training features, test features, training targets and test targets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=22)

print("训练集的特征值是: \n",x_train)
print("测试集的特征值是: \n",x_test)
print("训练集的目标值是: \n",y_train)
print("测试集的目标值是: \n",y_test)

Output:

训练集的特征值是: 
[[6.2 2.8 4.8 1.8]
[5.1 3.3 1.7 0.5]
[5.6 2.9 3.6 1.3]
[7.7 3.8 6.7 2.2]
[5.4 3. 4.5 1.5]
[5.8 4. 1.2 0.2]
[6.4 2.8 5.6 2.2]
[6.1 3. 4.6 1.4]
[5.5 2.3 4. 1.3]
[6.9 3.1 5.1 2.3]
[6. 2.9 4.5 1.5]
[6.2 2.9 4.3 1.3]
[6.8 3.2 5.9 2.3]
[5. 2.3 3.3 1. ]
[4.8 3.4 1.6 0.2]
[6.1 2.6 5.6 1.4]
[5.2 3.4 1.4 0.2]
[6.7 3.1 4.4 1.4]
[5.1 3.5 1.4 0.2]
[5.2 3.5 1.5 0.2]
[5.5 3.5 1.3 0.2]
[4.9 2.5 4.5 1.7]
[6.2 3.4 5.4 2.3]
[7.9 3.8 6.4 2. ]
[5.4 3.4 1.7 0.2]
[6.7 3.1 5.6 2.4]
[6.3 3.4 5.6 2.4]
[7.6 3. 6.6 2.1]
[6. 2.2 5. 1.5]
[4.3 3. 1.1 0.1]
[4.8 3.1 1.6 0.2]
[5.8 2.7 5.1 1.9]
[5.7 2.8 4.1 1.3]
[5.2 2.7 3.9 1.4]
[7.7 3. 6.1 2.3]
[6.3 2.7 4.9 1.8]
[6.1 2.8 4. 1.3]
[5.1 3.7 1.5 0.4]
[5.7 2.8 4.5 1.3]
[5.4 3.9 1.3 0.4]
[5.8 2.8 5.1 2.4]
[5.8 2.6 4. 1.2]
[5.1 2.5 3. 1.1]
[5.7 3.8 1.7 0.3]
[5.5 2.4 3.7 1. ]
[5.9 3. 4.2 1.5]
[6.7 3.1 4.7 1.5]
[7.7 2.8 6.7 2. ]
[4.9 3. 1.4 0.2]
[6.3 3.3 4.7 1.6]
[5.1 3.8 1.5 0.3]
[5.8 2.7 3.9 1.2]
[6.9 3.2 5.7 2.3]
[4.9 3.1 1.5 0.1]
[5. 2. 3.5 1. ]
[4.9 3.1 1.5 0.2]
[5. 3.5 1.3 0.3]
[5.4 3.7 1.5 0.2]
[6.8 3. 5.5 2.1]
[6.3 3.3 6. 2.5]
[5. 3.4 1.6 0.4]
[5.2 4.1 1.5 0.1]
[6.3 2.5 5. 1.9]
[7.7 2.6 6.9 2.3]
[6. 2.2 4. 1. ]
[7.2 3.6 6.1 2.5]
[4.9 2.4 3.3 1. ]
[6.1 2.8 4.7 1.2]
[6.5 3. 5.2 2. ]
[5.1 3.5 1.4 0.3]
[7.4 2.8 6.1 1.9]
[5.9 3. 5.1 1.8]
[6.4 2.7 5.3 1.9]
[4.4 2.9 1.4 0.2]
[5.6 2.8 4.9 2. ]
[5.1 3.4 1.5 0.2]
[5. 3.3 1.4 0.2]
[5.7 2.6 3.5 1. ]
[6.9 3.1 5.4 2.1]
[5.5 2.6 4.4 1.2]
[6.3 2.8 5.1 1.5]
[7. 3.2 4.7 1.4]
[6.8 2.8 4.8 1.4]
[6.5 3.2 5.1 2. ]
[6.9 3.1 4.9 1.5]
[5.5 2.4 3.8 1.1]
[5.6 3. 4.5 1.5]
[6. 3. 4.8 1.8]
[6. 2.7 5.1 1.6]
[5.8 2.7 5.1 1.9]
[5.9 3.2 4.8 1.8]
[5.1 3.8 1.6 0.2]
[6.2 2.2 4.5 1.5]
[5.6 3. 4.1 1.3]
[5.6 2.5 3.9 1.1]
[5.8 2.7 4.1 1. ]
[6.4 3.1 5.5 1.8]
[6.6 2.9 4.6 1.3]
[5.5 4.2 1.4 0.2]
[4.4 3. 1.3 0.2]
[6.3 2.9 5.6 1.8]
[6.4 3.2 4.5 1.5]
[7.3 2.9 6.3 1.8]
[5. 3.6 1.4 0.2]
[7.1 3. 5.9 2.1]
[4.9 3.6 1.4 0.1]
[6.5 3. 5.5 1.8]
[6.7 3.3 5.7 2.1]
[5.4 3.4 1.5 0.4]
[6.1 2.9 4.7 1.4]
[4.6 3.2 1.4 0.2]
[6.7 3. 5.2 2.3]
[5.7 3. 4.2 1.2]
[5. 3.4 1.5 0.2]
[6.5 3. 5.8 2.2]
[6.6 3. 4.4 1.4]
[5. 3.5 1.6 0.6]
[4.6 3.6 1. 0.2]
[6.3 2.5 4.9 1.5]
[5.7 4.4 1.5 0.4]]
测试集的特征值是:
[[4.6 3.4 1.4 0.3]
[4.6 3.1 1.5 0.2]
[5.7 2.5 5. 2. ]
[4.8 3. 1.4 0.1]
[4.8 3.4 1.9 0.2]
[7.2 3. 5.8 1.6]
[5. 3. 1.6 0.2]
[6.7 2.5 5.8 1.8]
[6.4 2.8 5.6 2.1]
[4.8 3. 1.4 0.3]
[5.3 3.7 1.5 0.2]
[4.4 3.2 1.3 0.2]
[5. 3.2 1.2 0.2]
[5.4 3.9 1.7 0.4]
[6. 3.4 4.5 1.6]
[6.5 2.8 4.6 1.5]
[4.5 2.3 1.3 0.3]
[5.7 2.9 4.2 1.3]
[6.7 3.3 5.7 2.5]
[5.5 2.5 4. 1.3]
[6.7 3. 5. 1.7]
[6.4 2.9 4.3 1.3]
[6.4 3.2 5.3 2.3]
[5.6 2.7 4.2 1.3]
[6.3 2.3 4.4 1.3]
[4.7 3.2 1.6 0.2]
[4.7 3.2 1.3 0.2]
[6.1 3. 4.9 1.8]
[5.1 3.8 1.9 0.4]
[7.2 3.2 6. 1.8]]
训练集的目标值是:
[2 0 1 2 1 0 2 1 1 2 1 1 2 1 0 2 0 1 0 0 0 2 2 2 0 2 2 2 2 0 0 2 1 1 2 2 1
0 1 0 2 1 1 0 1 1 1 2 0 1 0 1 2 0 1 0 0 0 2 2 0 0 2 2 1 2 1 1 2 0 2 2 2 0
2 0 0 1 2 1 2 1 1 2 1 1 1 2 1 2 1 0 1 1 1 1 2 1 0 0 2 1 2 0 2 0 2 2 0 1 0
2 1 0 2 1 0 0 1 0]
测试集的目标值是:
[0 0 2 0 0 2 0 2 2 0 0 0 0 0 1 1 0 1 2 1 1 1 2 1 1 0 0 2 0 2]

Cross-Validation and Grid Search API

sklearn.model_selection.GridSearchCV(estimator, param_grid=None, cv=None)
  • Exhaustively searches over the specified parameter values of an estimator
  • estimator: the estimator object
  • param_grid: the estimator parameters (dict), e.g. {"n_neighbors": [1, 3, 5]}
  • cv: the number of cross-validation folds
  • fit: train on the input data
  • score: accuracy
  • Inspecting the results:
    • best_score_: the best score found during cross-validation
    • best_estimator_: the best parameterized model
    • cv_results_: the validation-set and training-set accuracy of every cross-validation run

This API is mainly used for choosing the best model hyperparameters; for example, we can use it to find the k value that gives the best KNN model.

This shows up clearly in our iris example:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

iris = load_iris()

x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, train_size=0.2)

transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
# the test set should reuse the scaler fitted on the training set
x_test = transfer.transform(x_test)

estimator = KNeighborsClassifier()

param_dict = {"n_neighbors": [1, 3, 5, 7, 9]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=5)

estimator.fit(x_train, y_train)

y_pre = estimator.predict(x_test)
print("对比预测结果和真实值: \n", y_pre == y_test)

score = estimator.score(x_test, y_test)
print("准确率 \n", score)

print("在交叉验证中验证的最好结果:\n", estimator.best_score_)
print("最好的参数模型:\n", estimator.best_estimator_)
print("每次交叉验证后的准确率结果:\n", estimator.cv_results_)

Output:

对比预测结果和真实值: 
[ True False True True True False False True True True True True
True False True True True True False True True True True True
True True True True False True True False False True True True
True True False True True False False True False True True True
False False True False True True True True True True True True
False True True True False False True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True False
True True True True False True True True True True True True
True True False True True True True True False True True True]
准确率
0.8166666666666667
在交叉验证中验证的最好结果:
0.9666666666666668
最好的参数模型:
KNeighborsClassifier(n_neighbors=1)
每次交叉验证后的准确率结果:
{'mean_fit_time': array([0.00039964, 0.00019946, 0.00039964, 0.00039892, 0.00039849]), 'std_fit_time': array([0.00048945, 0.00039892, 0.00048946, 0.00048858, 0.00048805]), 'mean_score_time': array([0.00119624, 0.00099826, 0.00099659, 0.00099702, 0.00079846]), 'std_score_time': array([4.00023964e-04, 2.03425684e-06, 6.29846620e-04, 6.30751181e-04,
3.99234513e-04]), 'param_n_neighbors': masked_array(data=[1, 3, 5, 7, 9],
mask=[False, False, False, False, False],
fill_value='?',
dtype=object), 'params': [{'n_neighbors': 1}, {'n_neighbors': 3}, {'n_neighbors': 5}, {'n_neighbors': 7}, {'n_neighbors': 9}], 'split0_test_score': array([0.83333333, 0.83333333, 1. , 0.83333333, 0.83333333]), 'split1_test_score': array([1. , 1. , 0.83333333, 1. , 1. ]), 'split2_test_score': array([1., 1., 1., 1., 1.]), 'split3_test_score': array([1., 1., 1., 1., 1.]), 'split4_test_score': array([1., 1., 1., 1., 1.]), 'mean_test_score': array([0.96666667, 0.96666667, 0.96666667, 0.96666667, 0.96666667]), 'std_test_score': array([0.06666667, 0.06666667, 0.06666667, 0.06666667, 0.06666667]), 'rank_test_score': array([1, 1, 1, 1, 1])}

From the output above we can see that the best k is 1. When using GridSearchCV we pass in our model together with the candidate hyperparameters, and it searches for the k value that gives the best result.

2. Linear Regression API

sklearn.linear_model.LinearRegression(fit_intercept=True)

  • optimized via the normal equation
  • fit_intercept: whether to fit the intercept
  • LinearRegression.coef_: regression coefficients
  • LinearRegression.intercept_: intercept

Code demo:

from sklearn.linear_model import LinearRegression
# build the dataset
x = [[80, 86],
[82, 80],
[85, 78],
[90, 90],
[86, 82],
[82, 90],
[78, 80],
[92, 94]]
y = [84.2, 80.6, 80.1, 90, 83.2, 87.6, 79.4, 93.4]

estimator = LinearRegression()
estimator.fit(x, y)
print("系数是 \n", estimator.coef_)

y_pre = estimator.predict([[100, 60]])
print("预测的最终成绩 \n", y_pre)

Output:

系数是 
[0.3 0.7]
预测的最终成绩
[72.]

Linear Regression Optimization API

Gradient descent

sklearn.linear_model.SGDRegressor(loss="squared_loss", fit_intercept=True, learning_rate='invscaling', eta0=0.01)

  • SGDRegressor implements stochastic gradient descent learning; it supports different loss functions and regularization penalties for fitting linear regression models.
  • loss: the loss type
    • loss="squared_loss": ordinary least squares
  • fit_intercept: whether to fit the intercept
  • learning_rate : string, optional
    • the learning-rate schedule
    • 'constant': eta = eta0
    • 'optimal': eta = 1.0 / (alpha * (t + t0))
    • 'invscaling': eta = eta0 / pow(t, power_t) [default for SGDRegressor]
      • power_t=0.25: defined in the parent class
    • For a constant learning rate, use learning_rate='constant' and set it with eta0 (see the sketch below).
  • SGDRegressor.coef_: regression coefficients
  • SGDRegressor.intercept_: intercept
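
A minimal sketch of how these parameters fit together, on made-up toy data:

import numpy as np
from sklearn.linear_model import SGDRegressor

# toy data: y is roughly 3*x plus a little noise
rng = np.random.RandomState(0)
x = rng.rand(100, 1)
y = 3 * x.ravel() + rng.randn(100) * 0.1

# constant learning rate of 0.01, set through eta0
estimator = SGDRegressor(fit_intercept=True, learning_rate='constant', eta0=0.01, max_iter=1000)
estimator.fit(x, y)
print(estimator.coef_, estimator.intercept_)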

Regression Evaluation API

sklearn.metrics.mean_squared_error(y_true, y_pred)

  • mean squared error regression loss
  • y_true: true values
  • y_pred: predicted values
  • return: a float

Code demo:

(dataset: Boston housing; note that load_boston was removed in scikit-learn 1.2, so these demos require an older scikit-learn version)

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge, RidgeCV
from sklearn.metrics import mean_squared_error


def line_model1():
    # load the data
    boston = load_boston()
    # split the data
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2)
    # standardize; the test set reuses the scaler fitted on the training set
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # create the linear regression model
    estimator = LinearRegression()
    # train the model
    estimator.fit(x_train, y_train)
    # predict
    y_Pre = estimator.predict(x_test)

    # model evaluation
    print("预测值\n", y_Pre)
    print("系数\n", estimator.coef_)
    print("准确率 \n", estimator.score(x_test, y_test))
    print("偏置\n", estimator.intercept_)
    # regression error
    error = mean_squared_error(y_test, y_Pre)
    print("回归误差\n", error)


def line_model2():
    # load the data
    boston = load_boston()
    # split the data
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2)
    # standardize; the test set reuses the scaler fitted on the training set
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # linear regression fitted by stochastic gradient descent
    estimator = SGDRegressor(max_iter=1000, learning_rate='constant', eta0=0.001, penalty="l2")
    # train the model
    estimator.fit(x_train, y_train)
    # predict
    y_Pre = estimator.predict(x_test)

    # model evaluation
    print("预测值\n", y_Pre)
    print("系数\n", estimator.coef_)
    print("准确率 \n", estimator.score(x_test, y_test))
    print("偏置\n", estimator.intercept_)
    # regression error
    error = mean_squared_error(y_test, y_Pre)
    print("回归误差\n", error)


if __name__ == '__main__':
    line_model1()
    print("----------")
    line_model2()

Output:

预测值
[15.69281708 15.60607551 22.31177172 9.79777519 32.00221586 40.10161418
19.94011241 20.74246294 21.48518439 30.48520552 20.39978969 25.04166372
28.08300791 11.78332454 20.63561262 29.89531582 20.37314771 30.31422588
21.64751123 28.47355414 33.53589598 37.66749504 17.68295323 33.36918539
28.63002574 21.03322709 15.0384725 16.79145899 33.62807826 27.14229387
13.06248108 19.40785796 22.29885164 16.65282174 12.24378835 16.28891882
18.19222902 22.13357388 22.07569494 15.0815945 18.23662886 24.14784132
18.00690586 21.07802009 34.33780194 21.47902247 6.68112277 18.56996792
21.57538055 17.60363462 22.59480213 25.56799234 29.20690236 17.90162353
26.98975494 30.23606361 4.4897138 25.65537934 24.18648981 24.32316046
18.77870203 17.51216944 37.49041986 24.15746128 33.74693009 18.25229237
20.85217311 20.90589937 13.55357772 28.67324426 5.51189516 9.01921008
35.34329167 26.11243276 16.48342158 22.25250803 12.64776473 21.64854402
4.05782391 24.8512574 39.51878212 32.03050057 20.29504628 27.60485709
23.81182922 20.87088906 19.57752992 11.64715411 15.70474816 27.95454556
16.18374635 5.15133836 31.13429957 20.23116795 26.81146854 18.61413205
36.22408796 25.53406166 19.32957632 18.94178824 17.00882559 31.5355671 ]
系数
[-0.94636854 0.95952433 0.51606731 0.57193601 -1.87046793 3.08337865
-0.41319793 -2.62571876 2.28143837 -2.24935004 -2.23076044 0.93303594
-2.84608188]
准确率
0.5769869713050476
偏置
22.129455445544572
回归误差
39.58927420993645
----------
预测值
[13.49186027 25.504558 26.85136413 20.41815308 21.13959552 20.26916695
20.25187422 20.71100327 22.10361591 20.86208864 27.69058263 16.84565454
23.1962553 27.72225096 37.910782 20.23060971 17.02781194 18.87322649
17.69020341 14.66895098 28.67676648 14.3772249 21.56619808 20.01347812
19.31445623 17.8437261 17.38324619 22.04745202 21.97111618 15.00772809
18.94599197 25.37435555 28.1082041 31.12965967 37.57303017 44.80191689
16.53556745 24.09384002 36.8709077 19.17673628 16.52712413 14.23977448
5.59462638 17.70090591 26.51458495 19.39278929 19.0322606 30.58290131
19.1019216 9.34116774 12.9045293 29.29311584 21.42543614 21.52096337
25.31770195 41.0620208 33.2764697 23.28640673 15.48265946 16.23251281
36.01545631 17.96328453 32.16047679 30.14808262 31.79525668 27.9722376
10.79118391 23.51939577 41.36217739 14.56201891 16.00070189 18.9889247
20.23388718 31.48427425 3.5585817 17.59610333 33.70130271 17.85070691
24.68518801 27.89839823 30.24982773 33.68003882 16.41916282 25.13915645
21.99819736 14.42301084 24.66098363 28.62481404 22.58388204 19.38915251
23.82787937 14.28316523 29.1965419 28.60842088 40.57363673 19.52499872
27.40106199 10.58893198 13.68963952 21.32472939 27.51893247 20.21780463]
系数
[-0.6471845 0.60758869 -0.02538613 0.7397836 -1.53741589 3.05876641
-0.15258714 -2.68946359 1.81747311 -1.02835943 -1.86174619 0.86895181
-3.83261592]
准确率
0.6986164807848243
偏置
[22.78737932]
回归误差
26.76961909594243

Improving Linear Regression: Ridge Regression

sklearn.linear_model.Ridge(alpha=1.0, fit_intercept=True, solver="auto", normalize=False)

  • linear regression with L2 regularization
  • alpha: regularization strength, also called λ
    • typical values of λ: 0~1 or 1~10
  • solver: the optimization method is chosen automatically based on the data
    • sag: chosen when both the dataset and the number of features are large (stochastic average gradient descent)
  • normalize: whether to standardize the data
    • normalize=False: you can instead call preprocessing.StandardScaler before fit to standardize the data
  • Ridge.coef_: regression weights
  • Ridge.intercept_: regression intercept

Ridge is equivalent to SGDRegressor(penalty='l2', loss="squared_loss"), except that SGDRegressor implements plain stochastic gradient descent while Ridge implements SAG, so Ridge is recommended.

  • sklearn.linear_model.RidgeCV(_BaseRidgeCV, RegressorMixin)
    • linear regression with L2 regularization and built-in cross-validation
    • coef_: regression coefficients

So once we know about Ridge, the SGDRegressor we learned earlier is generally no longer needed.

RidgeCV is the hyperparameter-selection API for Ridge (it picks alpha by cross-validation).

Code demo:

(dataset: Boston housing)

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge

# load the dataset
data = load_boston()
# split the data
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, random_state=22, test_size=0.2)
# train the model
estimator = Ridge(alpha=0, normalize=True)
estimator.fit(x_train, y_train)
# predictions
y_pre = estimator.predict(x_test)
score = estimator.score(x_test, y_test)
# model evaluation
print("准确率 \n", score)
print("系数 \n", estimator.coef_)
print("偏置 \n", estimator.intercept_)
error = mean_squared_error(y_test, y_pre)
print("误差为:\n", error)

Output:

准确率 
0.7657465943591123
系数
[-1.01199845e-01 4.67962110e-02 -2.06902678e-02 3.58072311e+00
-1.71288922e+01 3.92207267e+00 -5.67997339e-03 -1.54862273e+00
2.97156958e-01 -1.00709587e-02 -7.78761318e-01 9.87125185e-03
-5.25319199e-01]
偏置
32.42825286699124
误差为:
20.770684784270024

Ridge with cross-validation (RidgeCV)

Code demo:

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
from sklearn.linear_model import RidgeCV

# load the dataset
data = load_boston()
# split the dataset
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, random_state=22, test_size=0.2)

# train the model, letting RidgeCV pick the best alpha from the candidates
estimator = RidgeCV(alphas=(1, 0.1, 10, 0.55, 9), normalize=True)
estimator.fit(x_train, y_train)
# predictions
y_pre = estimator.predict(x_test)
score = estimator.score(x_test, y_test)
# model evaluation
print("准确率 \n", score)
print("系数 \n", estimator.coef_)
print("偏置 \n", estimator.intercept_)
error = mean_squared_error(y_test, y_pre)
print("误差为:\n", error)

Output:

准确率 
0.755640759192834
系数
[-7.10636292e-02 2.81853030e-02 -7.07760829e-02 3.72392633e+00
-1.05710401e+01 4.09961387e+00 -9.32546827e-03 -1.06310611e+00
1.37548657e-01 -3.58696599e-03 -6.85595220e-01 9.37914747e-03
-4.66378501e-01]
偏置
23.328276195586888
误差为:
21.666744827223443

Model Saving and Loading API

from sklearn.externals import joblib

  • save: joblib.dump(estimator, 'test.pkl')
  • load: estimator = joblib.load('test.pkl')

(In newer scikit-learn versions sklearn.externals.joblib has been removed; install and import joblib directly, as the demos below do.)

Code example:

Saving a model

(dataset: Boston housing)

from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston
from sklearn.linear_model import RidgeCV, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import joblib


def dump_load_demo():
    # load the data
    boston = load_boston()
    # split the data; a fixed random_state keeps the split consistent between runs
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, train_size=0.2, random_state=22)
    # standardize; the test set reuses the scaler fitted on the training set
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # train the model
    estimator = Ridge()
    estimator.fit(x_train, y_train)

    # save the model
    joblib.dump(estimator, './data/test.pkl')

    # predictions
    y_pre = estimator.predict(x_test)
    # model evaluation
    print("预测值 \n", y_pre)
    score = estimator.score(x_test, y_test)
    print("准确率\n", score)
    print("偏置\n", estimator.intercept_)
    print("系数\n", estimator.coef_)
    ret = mean_squared_error(y_test, y_pre)
    print("均方误差\n", ret)


if __name__ == '__main__':
    dump_load_demo()

Note: the model is saved in pkl format.

Output:

预测值 
[23.64469576 27.97409384 17.55168799 28.80358367 17.99377237 17.16911389
18.08838244 18.5484773 18.39159086 30.11882442 19.02379724 26.14281853
12.91653465 16.42264835 33.05693686 16.30877287 6.37751888 16.82204542
28.50466918 22.99403455 16.57065921 28.10613493 27.20987256 15.88197605
29.39165138 23.83772485 28.28820203 26.13717053 15.34475384 17.2592326
24.96024219 11.92509187 31.5877362 4.22575655 13.85279407 15.45154778
4.20339965 19.19572767 36.43206568 27.38697842 23.16484782 15.50800593
32.61396957 4.76928343 20.61280174 22.98552067 18.00938256 17.47797758
15.65941555 21.53036514 8.07350341 23.7631832 27.89948024 12.60928859
6.00939238 29.66164389 25.73749604 22.21461849 15.86568871 21.70506535
22.32167003 21.54997175 18.75814107 34.35036868 23.93330189 16.4650795
10.29497693 4.81134268 37.24939583 21.04504805 14.40856753 21.70178409
33.82983682 21.38429774 30.8300285 23.97655944 21.06166321 17.23171262
24.16127528 22.93301994 27.46787702 19.09761525 21.58895637 26.70861218
23.91545352 18.99584879 25.89840907 21.06068455 22.76976898 16.54404802
22.34618208 20.5520392 17.00269605 11.11408577 10.80942164 16.88787311
21.36466922 14.08334209 18.34541038 22.49992866 16.6009481 14.76856955
22.98225728 20.82099466 18.09167202 31.25449524 14.7748573 21.37510721
29.48724174 30.27555742 19.35717062 23.41591865 18.92675356 14.08811569
19.50876796 20.75844963 24.66902078 23.28548242 21.37184963 12.61360164
11.84374119 1.97389286 26.75034249 18.36677866 20.22284006 25.65136456
26.76853678 21.01965862 24.76838538 19.2719079 0.20758938 33.66047857
33.62514941 24.10607586 25.27714009 19.36420085 32.79293209 12.7253308
16.4697096 10.07572615 34.63725968 17.51742591 30.81214014 22.7464453
7.47271236 22.0566537 26.5021439 29.02030002 14.90958575 6.34319256
21.72255731 15.312686 26.12292152 28.03322847 23.81482954 14.76050315
26.25298872 33.36534552 5.38378434 19.48867011 17.22952014 26.73584653
13.75318646 32.2262328 30.80102394 30.85734971 19.13265617 18.97112331
23.78360179 18.36212696 11.02997263 20.49077695 20.0010755 33.14860384
23.48461798 25.93285901 33.6286644 14.55281723 5.71533379 23.80222741
12.74528738 29.87644788 17.36057156 17.68386619 24.48905778 15.40371636
17.1681389 14.56256338 22.20362029 24.07938957 16.13989928 20.30403757
13.11742038 19.76011695 10.15959589 5.05451631 24.98773317 22.05933162
17.38635448 17.95889724 11.85983575 23.09467263 10.5421983 12.56199688
20.46920769 22.89551792 11.35987849 28.36158041 22.46150768 22.07052047
18.96479228 15.94762742 29.55993025 18.18280727 26.60883362 28.91338874
29.0243314 27.11588917 23.41267873 22.16089044 15.87275538 19.55923911
10.01955695 25.36241049 37.09399692 9.21076403 28.7250581 23.1210652
16.10488274 20.98532986 11.92937281 16.63409543 13.39946236 18.10252834
30.20768813 27.01016297 23.78388552 15.90408559 25.88867096 23.46240636
22.98550277 17.90243108 30.69975106 20.54447287 17.24878101 22.6405577
18.78662547 26.43378753 6.1415722 13.46113754 11.01274483 27.46925322
31.75050402 11.44302638 9.15277255 30.32661424 22.23993372 3.59762111
21.18431804 22.32873001 24.91183653 31.66081008 20.11883814 23.58860016
23.39144498 20.95717749 28.62287194 30.25468985 18.60827558 17.29159623
20.8459336 29.53135771 29.1735858 25.86670544 20.47761831 22.00175334
25.46210862 20.76591231 18.48931061 23.21360354 28.04039153 14.76240119
5.56366719 19.02971742 23.60960253 25.61011124 17.93010868 18.35299157
14.99314051 24.55852344 15.73398244 17.9936638 15.78653015 26.86927323
21.83288484 17.09786581 24.1455792 23.21781793 4.89715057 14.55190726
29.53097939 16.96942402 29.97092011 20.56134909 23.61695306 13.90547522
28.94524778 13.9490235 25.14367973 17.81456951 26.92882448 21.97947553
16.37959719 25.58285606 18.4004159 16.81711119 21.41056759 20.52708367
10.18563418 24.72279293 7.65824257 -5.75882408 28.92725713 15.23172694
-0.80822785 1.1885204 16.05592378 31.02687874 22.11833608 25.257107
21.11962573 27.17319156 14.23418317 35.29815094 16.76025589 23.88193103
11.49969608 11.005055 11.00317386 18.27016267 24.9635937 14.54733798
22.23443933 21.72653137 18.11347747 22.50164981 30.66773175 17.84627447
19.36076926 27.85693006 33.87880818 25.66176459 19.14387897 22.66753296
14.84930104 27.39448909 33.97151416 18.97839843 16.93696433 23.93610382
27.80296883 10.97564404 33.33567959 18.90876462 21.50046582 28.38223075
15.20285187 16.59362028 19.48364863 27.52004471 20.99718401 24.00002741
27.2015801 22.39667384 23.24447674 23.08634542 11.51804889 14.09160002
30.71290202 30.53124258 1.59429213 19.2223747 23.92469427 30.52913878
13.84632587 17.93078843 28.58855707 16.2168317 8.57194758 19.94632304
21.58821976 20.44036427 27.38200144 24.42298241 6.31193462 26.26904066
28.46064924 21.90708324 11.52479134 15.93255738 25.56393581 23.51065591
31.34782385 7.78903776 21.15999469]
准确率
0.6384683499205781
偏置
20.605940594059394
系数
[-0.91782309 -0.17382552 0.26572243 0.6906218 -2.23367615 1.81562646
0.39996906 -1.14481067 2.18240602 -0.79484838 -2.12801148 0.82343198
-4.1643924 ]
均方误差
32.38410204009493

Loading a model

(dataset: Boston housing)

from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston
from sklearn.linear_model import RidgeCV, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import joblib


def dump_load_demo():
    # load the data
    boston = load_boston()
    # split the data; the same random_state as before keeps the split consistent
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, train_size=0.2, random_state=22)
    # standardize; the test set reuses the scaler fitted on the training set
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # train the model
    estimator = Ridge()
    estimator.fit(x_train, y_train)

    # load the saved model (this overwrites the freshly trained one)
    estimator = joblib.load('./data/test.pkl')
    # predictions
    y_pre = estimator.predict(x_test)
    # model evaluation
    print("预测值 \n", y_pre)
    score = estimator.score(x_test, y_test)
    print("准确率\n", score)
    print("偏置\n", estimator.intercept_)
    print("系数\n", estimator.coef_)
    ret = mean_squared_error(y_test, y_pre)
    print("均方误差\n", ret)


if __name__ == '__main__':
    dump_load_demo()

Output:

预测值 
[23.64469576 27.97409384 17.55168799 28.80358367 17.99377237 17.16911389
18.08838244 18.5484773 18.39159086 30.11882442 19.02379724 26.14281853
12.91653465 16.42264835 33.05693686 16.30877287 6.37751888 16.82204542
28.50466918 22.99403455 16.57065921 28.10613493 27.20987256 15.88197605
29.39165138 23.83772485 28.28820203 26.13717053 15.34475384 17.2592326
24.96024219 11.92509187 31.5877362 4.22575655 13.85279407 15.45154778
4.20339965 19.19572767 36.43206568 27.38697842 23.16484782 15.50800593
32.61396957 4.76928343 20.61280174 22.98552067 18.00938256 17.47797758
15.65941555 21.53036514 8.07350341 23.7631832 27.89948024 12.60928859
6.00939238 29.66164389 25.73749604 22.21461849 15.86568871 21.70506535
22.32167003 21.54997175 18.75814107 34.35036868 23.93330189 16.4650795
10.29497693 4.81134268 37.24939583 21.04504805 14.40856753 21.70178409
33.82983682 21.38429774 30.8300285 23.97655944 21.06166321 17.23171262
24.16127528 22.93301994 27.46787702 19.09761525 21.58895637 26.70861218
23.91545352 18.99584879 25.89840907 21.06068455 22.76976898 16.54404802
22.34618208 20.5520392 17.00269605 11.11408577 10.80942164 16.88787311
21.36466922 14.08334209 18.34541038 22.49992866 16.6009481 14.76856955
22.98225728 20.82099466 18.09167202 31.25449524 14.7748573 21.37510721
29.48724174 30.27555742 19.35717062 23.41591865 18.92675356 14.08811569
19.50876796 20.75844963 24.66902078 23.28548242 21.37184963 12.61360164
11.84374119 1.97389286 26.75034249 18.36677866 20.22284006 25.65136456
26.76853678 21.01965862 24.76838538 19.2719079 0.20758938 33.66047857
33.62514941 24.10607586 25.27714009 19.36420085 32.79293209 12.7253308
16.4697096 10.07572615 34.63725968 17.51742591 30.81214014 22.7464453
7.47271236 22.0566537 26.5021439 29.02030002 14.90958575 6.34319256
21.72255731 15.312686 26.12292152 28.03322847 23.81482954 14.76050315
26.25298872 33.36534552 5.38378434 19.48867011 17.22952014 26.73584653
13.75318646 32.2262328 30.80102394 30.85734971 19.13265617 18.97112331
23.78360179 18.36212696 11.02997263 20.49077695 20.0010755 33.14860384
23.48461798 25.93285901 33.6286644 14.55281723 5.71533379 23.80222741
12.74528738 29.87644788 17.36057156 17.68386619 24.48905778 15.40371636
17.1681389 14.56256338 22.20362029 24.07938957 16.13989928 20.30403757
13.11742038 19.76011695 10.15959589 5.05451631 24.98773317 22.05933162
17.38635448 17.95889724 11.85983575 23.09467263 10.5421983 12.56199688
20.46920769 22.89551792 11.35987849 28.36158041 22.46150768 22.07052047
18.96479228 15.94762742 29.55993025 18.18280727 26.60883362 28.91338874
29.0243314 27.11588917 23.41267873 22.16089044 15.87275538 19.55923911
10.01955695 25.36241049 37.09399692 9.21076403 28.7250581 23.1210652
16.10488274 20.98532986 11.92937281 16.63409543 13.39946236 18.10252834
30.20768813 27.01016297 23.78388552 15.90408559 25.88867096 23.46240636
22.98550277 17.90243108 30.69975106 20.54447287 17.24878101 22.6405577
18.78662547 26.43378753 6.1415722 13.46113754 11.01274483 27.46925322
31.75050402 11.44302638 9.15277255 30.32661424 22.23993372 3.59762111
21.18431804 22.32873001 24.91183653 31.66081008 20.11883814 23.58860016
23.39144498 20.95717749 28.62287194 30.25468985 18.60827558 17.29159623
20.8459336 29.53135771 29.1735858 25.86670544 20.47761831 22.00175334
25.46210862 20.76591231 18.48931061 23.21360354 28.04039153 14.76240119
5.56366719 19.02971742 23.60960253 25.61011124 17.93010868 18.35299157
14.99314051 24.55852344 15.73398244 17.9936638 15.78653015 26.86927323
21.83288484 17.09786581 24.1455792 23.21781793 4.89715057 14.55190726
29.53097939 16.96942402 29.97092011 20.56134909 23.61695306 13.90547522
28.94524778 13.9490235 25.14367973 17.81456951 26.92882448 21.97947553
16.37959719 25.58285606 18.4004159 16.81711119 21.41056759 20.52708367
10.18563418 24.72279293 7.65824257 -5.75882408 28.92725713 15.23172694
-0.80822785 1.1885204 16.05592378 31.02687874 22.11833608 25.257107
21.11962573 27.17319156 14.23418317 35.29815094 16.76025589 23.88193103
11.49969608 11.005055 11.00317386 18.27016267 24.9635937 14.54733798
22.23443933 21.72653137 18.11347747 22.50164981 30.66773175 17.84627447
19.36076926 27.85693006 33.87880818 25.66176459 19.14387897 22.66753296
14.84930104 27.39448909 33.97151416 18.97839843 16.93696433 23.93610382
27.80296883 10.97564404 33.33567959 18.90876462 21.50046582 28.38223075
15.20285187 16.59362028 19.48364863 27.52004471 20.99718401 24.00002741
27.2015801 22.39667384 23.24447674 23.08634542 11.51804889 14.09160002
30.71290202 30.53124258 1.59429213 19.2223747 23.92469427 30.52913878
13.84632587 17.93078843 28.58855707 16.2168317 8.57194758 19.94632304
21.58821976 20.44036427 27.38200144 24.42298241 6.31193462 26.26904066
28.46064924 21.90708324 11.52479134 15.93255738 25.56393581 23.51065591
31.34782385 7.78903776 21.15999469]
准确率
0.6384683499205781
偏置
20.605940594059394
系数
[-0.91782309 -0.17382552 0.26572243 0.6906218 -2.23367615 1.81562646
0.39996906 -1.14481067 2.18240602 -0.79484838 -2.12801148 0.82343198
-4.1643924 ]
均方误差
32.38410204009493

This way, the saving run and the loading run produce exactly the same results.

3. Logistic Regression API

  • sklearn.linear_model.LogisticRegression(solver='liblinear', penalty='l2', C=1.0)
    • solver: one of {'liblinear', 'sag', 'saga', 'newton-cg', 'lbfgs'}
      • default: 'liblinear'; the algorithm used for the optimization problem.
      • For small datasets, 'liblinear' is a good choice, while 'sag' and 'saga' are faster on large datasets.
      • For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' can handle the multinomial loss; 'liblinear' is limited to one-versus-rest classification.
    • penalty: the type of regularization
    • C: regularization strength

By default, the minority class is treated as the positive class.

LogisticRegression is equivalent to SGDClassifier(loss="log", penalty=" "), except that SGDClassifier implements plain stochastic gradient descent, while LogisticRegression implements SAG.
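
A rough sketch of that equivalence on made-up toy data (note that scikit-learn 1.1+ spells the SGDClassifier loss "log_loss" instead of "log"):

from sklearn.linear_model import LogisticRegression, SGDClassifier

x = [[1], [2], [10], [20]]
y = [0, 0, 1, 1]

# logistic regression with its default solver
lr = LogisticRegression(C=1.0)
lr.fit(x, y)

# the same model family fitted by plain stochastic gradient descent
sgd = SGDClassifier(loss="log_loss", penalty="l2", max_iter=1000)
sgd.fit(x, y)

print(lr.predict([[15]]), sgd.predict([[15]]))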

Classification Evaluation API

  • sklearn.metrics.classification_report(y_true, y_pred, labels=[], target_names=None)
    • y_true: true target values
    • y_pred: target values predicted by the estimator
    • labels: the numeric codes of the classes
    • target_names: the display names of the classes
    • return: precision and recall for every class (see the sketch below)
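
A minimal sketch of how classification_report is called (the labels and names here are invented; in the cancer demo further down they would be something like labels=[2, 4] and target_names=["benign", "malignant"]):

from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
# per-class precision, recall, f1-score and support, printed as a small table
report = classification_report(y_true, y_pred, labels=[0, 1], target_names=["negative", "positive"])
print(report)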

AUC Metric API

from sklearn.metrics import roc_auc_score

  • sklearn.metrics.roc_auc_score(y_true, y_score)
    • computes the area under the ROC curve, i.e. the AUC value
    • y_true: the true class of each sample, which must be labeled 0 (negative) or 1 (positive)
    • y_score: the prediction score, which can be the estimated probability of the positive class, a confidence value, or the return value of the classifier's decision function

AUC ranges over [0.5, 1], and the closer to 1 the better.

  • AUC can only be used to evaluate binary classification.

  • AUC is particularly well suited to evaluating classifiers on imbalanced datasets.

Code demo:

(dataset: breast cancer classification)

Dataset source: https://archive.ics.uci.edu/ml/machine-learning-databases/

Data description

(1) 699 samples with 11 columns: the first column is an id used for lookup, the next 9 columns are tumor-related medical features, and the last column is a number indicating the tumor type.

(2) There are 16 missing values, marked with "?".

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import pandas as pd
import numpy as np

# read the data
names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
         'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
         'Normal Nucleoli', 'Mitoses', 'Class']
data = pd.read_csv('./data/breast-cancer-wisconsin.data', names=names)
data = data.replace(to_replace="?", value=np.nan)
# a column with missing values can only be converted to float64
data['Bare Nuclei'] = data["Bare Nuclei"].astype("float64")
# replace missing values with the column mean
for i in data.columns:
    if np.all(pd.notnull(data[i])) == False:
        print(i)
        data[i].fillna(data[i].mean(), inplace=True)
# once the missing values are handled, convert back to int64
data['Bare Nuclei'] = data["Bare Nuclei"].astype("int64")
# feature values
x = data.iloc[:, 1:10]
# target values
y = data["Class"]

# split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=22)

# build the scaler; the test set reuses the scaler fitted on the training set
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# build the logistic regression model
estimator = LogisticRegression()
# train the model
estimator.fit(x_train, y_train)
# predictions
y_pre = estimator.predict(x_test)
# model evaluation
score = estimator.score(x_test, y_test)
# AUC evaluation: roc_auc_score expects 0/1 labels, so map the class codes 2 (benign) / 4 (malignant) to 0/1
y_true_auc = np.where(y_test > 3, 1, 0)
roc = roc_auc_score(y_true=y_true_auc, y_score=y_pre)
print("正确率 \n", score)
print("auc \n", roc)

Output:

正确率 
0.9571428571428572
auc
0.9489795918367347

4. Decision Tree API

class sklearn.tree.DecisionTreeClassifier(criterion='gini', max_depth=None, random_state=None)

  • criterion
    • the feature-selection criterion
    • "gini" or "entropy": the former uses the Gini index, the latter information gain. The default is "gini", i.e. the CART algorithm.
  • min_samples_split
    • the minimum number of samples required to split an internal node
    • This value limits when a subtree may keep splitting: if a node has fewer samples than min_samples_split, no further attempt is made to pick a best feature to split on. The default is 2. If the sample size is small there is no need to touch this value; if the dataset is very large, it is recommended to increase it. In a past project of mine with roughly 100,000 samples I used min_samples_split=10, which can serve as a reference.
  • min_samples_leaf
    • the minimum number of samples in a leaf node
    • This value limits the minimum size of a leaf: if a leaf ends up with fewer samples than this, it is pruned together with its sibling. The default is 1; it can be given as an integer minimum count or as a fraction of the total sample count. If the sample size is small there is no need to touch this value; if the dataset is very large, it is recommended to increase it. The 100,000-sample project above used min_samples_leaf=5, for reference.
  • max_depth
    • the maximum depth of the decision tree
    • If not set, the tree is grown without any depth limit. With little data or few features this can usually be left unset; with many samples and many features it is recommended to limit the depth, with the exact value depending on the data distribution. Common values are between 10 and 100.
  • random_state
    • the random seed

Feature Extraction API

Dictionary feature extraction

sklearn.feature_extraction.DictVectorizer(sparse=True, …)

  • DictVectorizer.fit_transform(X)
    • X: a dict, or an iterable of dicts
    • returns a sparse matrix
  • DictVectorizer.get_feature_names() returns the feature names (see the sketch below)
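
A minimal sketch of dictionary feature extraction on a few made-up dicts:

from sklearn.feature_extraction import DictVectorizer

data = [{'city': 'Beijing', 'temperature': 100},
        {'city': 'Shanghai', 'temperature': 60},
        {'city': 'Shenzhen', 'temperature': 30}]

transfer = DictVectorizer(sparse=False)   # sparse=False returns a dense ndarray instead of a sparse matrix
result = transfer.fit_transform(data)
print(transfer.get_feature_names())       # generated feature names (get_feature_names_out() in newer sklearn)
print(result)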

Text feature extraction

  • sklearn.feature_extraction.text.CountVectorizer(stop_words=[])
    • returns a word-count matrix
    • CountVectorizer.fit_transform(X)
      • X: text, or an iterable of text strings
      • return: a sparse matrix
    • CountVectorizer.get_feature_names() return: the list of words
  • sklearn.feature_extraction.text.TfidfVectorizer
    • returns a TF-IDF weighted matrix
    • TfidfVectorizer.fit_transform(X)
      • X: text, or an iterable of text strings
      • return: a sparse matrix
    • TfidfVectorizer.get_feature_names() return: the list of words
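
A minimal sketch of both vectorizers on two made-up English sentences:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["life is short, i like python",
         "life is too long, i dislike python"]

count = CountVectorizer()
print(count.fit_transform(texts).toarray())   # raw word counts per document
print(count.get_feature_names())              # the vocabulary (get_feature_names_out() in newer sklearn)

tfidf = TfidfVectorizer()
print(tfidf.fit_transform(texts).toarray())   # TF-IDF weights per document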

Chinese text is a special case: because of the word-segmentation problem, we usually need a third-party library to pre-process the Chinese text first.

jieba word segmentation

Use jieba.cut() directly to segment Chinese text.

https://github.com/fxsjy/jieba

The basic usage can be found on that page (a small sketch follows).
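
A minimal sketch of feeding jieba-segmented Chinese text into CountVectorizer (the sentences are made up):

import jieba
from sklearn.feature_extraction.text import CountVectorizer

texts = ["人生苦短,我喜欢Python", "人生漫长,我不喜欢Python"]
# jieba.cut returns a generator of tokens; join them with spaces so CountVectorizer can split on whitespace
seg_texts = [" ".join(jieba.cut(t)) for t in texts]

transfer = CountVectorizer()
print(transfer.fit_transform(seg_texts).toarray())
print(transfer.get_feature_names())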

Code demo:

(dataset: Titanic)

http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt

import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

data = pd.read_csv('./titanic.txt')
# fill missing ages with the mean age
data["age"].fillna(data["age"].mean(), inplace=True)
# feature values
x = data[["pclass", "age", "sex"]]
# target value
y = data["survived"]

# split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)

# feature extraction
transfer = DictVectorizer()
# convert the rows to dicts and vectorize them; the test set reuses the fitted vectorizer
x_train = transfer.fit_transform(x_train.to_dict(orient='records'))
x_test = transfer.transform(x_test.to_dict(orient='records'))

# create the model
estimator = DecisionTreeClassifier(criterion='entropy', max_depth=3)
# train the model
estimator.fit(x_train, y_train)
# predictions
y_pre = estimator.predict(x_test)
# model evaluation
score = estimator.score(x_test, y_test)

print("准确率: \n", score)

Output:

准确率: 
0.7756653992395437
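
The demo imports export_graphviz but never calls it; a minimal sketch of exporting the fitted tree (the output file name is just an example):

from sklearn.tree import export_graphviz

# writes the tree structure as a .dot file that graphviz can render
export_graphviz(estimator, out_file="./tree.dot", feature_names=transfer.get_feature_names())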

5. Ensemble Learning API

Random Forest API (bagging)

  • sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, bootstrap=True, random_state=None, min_samples_split=2)
    • n_estimators: integer, optional (default = 10), the number of trees in the forest, e.g. 120, 200, 300, 500, 800, 1200
    • criterion: string, optional (default = "gini"), the split-quality measure
    • max_depth: integer or None, optional (default = None), the maximum tree depth, e.g. 5, 8, 15, 25, 30
    • max_features="auto", the maximum number of features considered per tree
      • If "auto", then max_features=sqrt(n_features).
      • If "sqrt", then max_features=sqrt(n_features) (same as "auto").
      • If "log2", then max_features=log2(n_features).
      • If None, then max_features=n_features.
    • bootstrap: boolean, optional (default = True), whether to use bootstrap sampling (with replacement) when building trees
    • min_samples_split: minimum number of samples required to split a node
    • min_samples_leaf: minimum number of samples in a leaf node
  • Hyperparameters: n_estimators, max_depth, min_samples_split, min_samples_leaf

Code demo:

(dataset: Titanic)

Usage is basically the same as for the decision tree; the random forest just has a few more hyperparameters, and we can again use the cross-validation API (GridSearchCV) to pick good parameter values.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
import numpy as np
from sklearn.model_selection import GridSearchCV

data = pd.read_csv('./titanic.txt')

# fill missing ages with the mean age
data['age'].fillna(data["age"].mean(), inplace=True)

# feature values
x = data[["pclass", "age", "sex"]]
# target value
y = data["survived"]

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)

transfer = DictVectorizer()
# convert the rows to dicts and vectorize them; the test set reuses the fitted vectorizer
x_train = transfer.fit_transform(x_train.to_dict(orient='records'))
x_test = transfer.transform(x_test.to_dict(orient='records'))

rf = RandomForestClassifier()
param = {"n_estimators": [120, 200, 300, 500, 800, 1200], "max_depth": [5, 8, 15, 25, 30]}

# hyperparameter tuning
gc = GridSearchCV(rf, param_grid=param, cv=2)

gc.fit(x_train, y_train)

print("随机森林预测的准确率为:", gc.score(x_test, y_test))

print("最好模型:", gc.best_estimator_)

Output:

随机森林预测的准确率为: 0.7908745247148289

最好模型: RandomForestClassifier(max_depth=5, n_estimators=300)

AdaBoost (boosting)

Code demo:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd
import numpy as np

# read the data
data = pd.read_csv("titanic.txt")
# fill missing ages with the mean age
data["age"].fillna(data["age"].mean(), inplace=True)

# feature values
x = data[["age", "sex", "pclass"]]
# target value
y = data["survived"]
# split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)
# feature extraction; the test set reuses the fitted vectorizer
transfer = DictVectorizer()
x_train = transfer.fit_transform(x_train.to_dict(orient="records"))
x_test = transfer.transform(x_test.to_dict(orient="records"))
# create the model
estimator = AdaBoostClassifier(n_estimators=100, random_state=0)
# train the model
estimator.fit(x_train, y_train)
# model evaluation
score = estimator.score(x_test, y_test)
print("准确率为", score)

Output:

准确率为 0.7946768060836502

6. Clustering API

sklearn.cluster.KMeans(n_clusters=8)

  • Parameters:
    • n_clusters: the number of cluster centers to start with
      • integer, default 8; the number of clusters to form, i.e. the number of centroids.
  • Methods:
    • estimator.fit(x)
    • estimator.predict(x)
    • estimator.fit_predict(x)
      • computes the cluster centers and predicts the cluster of each sample; equivalent to calling fit(x) followed by predict(x)

Code demo:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score  # spelled calinski_harabaz_score in old sklearn versions

# generate 4 blobs of points in 2 dimensions
x, y = datasets.make_blobs(n_samples=1000, n_features=2,
                           centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
                           cluster_std=[0.4, 0.1, 0.1, 0.1],
                           random_state=1)
plt.scatter(x[:, 0], x[:, 1])
plt.show()

# fit the model and predict the cluster of every sample in one step
y_pre = KMeans(n_clusters=10, random_state=9).fit_predict(x)

plt.scatter(x[:, 0], x[:, 1], c=y_pre)
plt.show()

# evaluate the clustering with the Calinski-Harabasz index
print(calinski_harabasz_score(x, y_pre))

Output of this demo:

(Figure 1: scatter plot of the raw data)

(Figure 2: the data colored by cluster after K-Means)

And the Calinski-Harabasz (CH) score:

3415.4240121338516

The larger the CH score, the better.

Feature Dimensionality Reduction API

sklearn.feature_selection.VarianceThreshold(threshold = 0.0)

  • removes all low-variance features
  • Variance.fit_transform(X)
    • X: data as a numpy array of shape [n_samples, n_features]
    • return value: features whose training-set variance is below threshold are removed. The default keeps all features with non-zero variance, i.e. it removes only features that have the same value in every sample.

Code demo:

import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def Variance():
    data = pd.read_csv('./data/factor_returns.csv')
    print(data.shape)
    print("------------------------------------")
    # remove every feature whose variance is below 100
    transfer = VarianceThreshold(threshold=100)
    data = transfer.fit_transform(data.iloc[:, 1:10])
    print(data.shape)


if __name__ == '__main__':
    Variance()

A few rows of the raw data:

index,pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense,date,return
0,000001.XSHE,5.9572,1.1818,85252550922.0,0.8008,14.9403,1211444855670.0,2.01,20701401000.0,10882540000.0,2012-01-31,0.027657228229937388
1,000002.XSHE,7.0289,1.588,84113358168.0,1.6463,7.8656,300252061695.0,0.326,29308369223.2,23783476901.2,2012-01-31,0.08235182370820669
2,000008.XSHE,-262.7461,7.0003,517045520.0,-0.5678,-0.5943,770517752.56,-0.006,11679829.03,12030080.04,2012-01-31,0.09978900335112327
3,000060.XSHE,16.476,3.7146,19680455995.0,5.6036,14.617,28009159184.6,0.35,9189386877.65,7935542726.05,2012-01-31,0.12159482758620697

Output:

(2318, 12)
------------------------------------
(2318, 5)

The low-variance features in the dataset have been removed.

Correlation Coefficient API

The correlation coefficient measures how closely two variables are related.

Pearson correlation coefficient

from scipy.stats import pearsonr

  • x : (N,) array_like
  • y : (N,) array_like
  • Returns: (Pearson's correlation coefficient, p-value)

Code demo:

from scipy.stats import pearsonr

x1 = [12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9]
x2 = [21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5]

# returns a (correlation coefficient, p-value) pair
ret = pearsonr(x1, x2)
print(ret)

Output:

(0.9941983762371883, 4.9220899554573455e-09)

The closer the first returned value (the correlation coefficient) is to 1, the stronger the correlation.

Spearman correlation coefficient
  • from scipy.stats import spearmanr

Code demo:

from scipy.stats import spearmanr

x1 = [12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9]
x2 = [21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5]

# returns a SpearmanrResult(correlation, pvalue)
ret = spearmanr(x1, x2)
print(ret)

Output:

SpearmanrResult(correlation=0.9999999999999999, pvalue=6.646897422032013e-64)

Principal Component Analysis (PCA) API

sklearn.decomposition.PCA(n_components=None)

  • projects the data into a lower-dimensional space
  • n_components:
    • a float: the fraction of information (variance) to keep
    • an integer: the number of features to reduce to
  • PCA.fit_transform(X)  X: data as a numpy array of shape [n_samples, n_features]
  • return value: an array in the requested lower dimension

Code demo:

from sklearn.decomposition import PCA

data = [[2, 8, 4, 5],
        [6, 3, 0, 8],
        [5, 4, 9, 1]]
# keep a fixed number of components
transfer = PCA(n_components=2)
data = transfer.fit_transform(data)
print(data)
print(data.shape)
print("---------")
# keep a percentage of the information (variance)
transfer = PCA(n_components=0.99)
data = transfer.fit_transform(data)
print(data)
print(data.shape)

Output:

[[-3.13587302e-16  3.82970843e+00]
[-5.74456265e+00 -1.91485422e+00]
[ 5.74456265e+00 -1.91485422e+00]]
(3, 2)
---------
[[ 1.80389890e-15 3.82970843e+00]
[ 5.74456265e+00 -1.91485422e+00]
[-5.74456265e+00 -1.91485422e+00]]
(3, 2)

Case study: segmenting users by their preference for product categories, with dimensionality reduction

The data:

  • order_products__prior.csv: order-product information
    • columns: order_id, product_id, add_to_cart_order, reordered
  • products.csv: product information
    • columns: product_id, product_name, aisle_id, department_id
  • orders.csv: the users' order information
    • columns: order_id, user_id, eval_set, order_number, …
  • aisles.csv: the concrete product category each product belongs to
    • columns: aisle_id, aisle

Code demo:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# read the data
order_products = pd.read_csv("./data/order_products__prior.csv")
products = pd.read_csv('./data/products.csv')
orders = pd.read_csv('./data/orders.csv')
aisles = pd.read_csv('./data/aisles.csv')

# join the four tables
table1 = pd.merge(order_products, products, on=["product_id", "product_id"])
table2 = pd.merge(table1, orders, on=["order_id", "order_id"])
table3 = pd.merge(table2, aisles, on=["aisle_id", "aisle_id"])
# at this point table3 has shape (32434489, 14)

# cross table: one row per user, one column per aisle
table = pd.crosstab(table3["user_id"], table3["aisle"])
# take a subset of the data
new_data = table[:1000]
# dimensionality reduction, keeping 90% of the variance
transfer = PCA(n_components=0.9)
trans_data = transfer.fit_transform(new_data)

# create the model
estimator = KMeans(n_clusters=5)
y_pre = estimator.fit_predict(trans_data)
# model evaluation
sl_score = silhouette_score(trans_data, y_pre)
print(sl_score)

Output:

0.48145604075255666

The silhouette score ranges from -1 (worst) to 1 (best).