Logistic regression and cross-validation in Python (with sklearn)
I am trying to solve a classification problem on a given dataset through logistic regression (and this is not the problem). To avoid overfitting, I am trying to implement it through cross-validation (and this is the problem): there is something I am missing when completing the program. My purpose here is to determine accuracy.
But let me be specific. This is what I have done.
The code is the following:
```python
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cross_validation import train_test_split
from sklearn import metrics, cross_validation
from sklearn.linear_model import LogisticRegression

# read training data in pandas dataframe
data = pd.read_csv("./dataset.csv", delimiter=';')

# last column is target, store in array t
t = data['TARGET']
# list of features, including target
features = data.columns
# item feature matrix in X
X = data[features[:-1]].as_matrix()
# remove first column because it is not necessary in the analysis
X = np.delete(X, 0, axis=1)

# divide in training and test set
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)

# define method
logreg = LogisticRegression()

# cross validation prediction
predicted = cross_validation.cross_val_predict(logreg, X_train, t_train, cv=10)
print(metrics.accuracy_score(t_train, predicted))
```
My questions:

As far as I understand, the test set should not be touched until training is complete, and cross-validation should be performed on the training set. That is why I passed X_train and t_train to cross_val_predict. However, I get an error saying:

```
ValueError: Found input variables with inconsistent numbers of samples: [6016, 4812]
```

where 6016 is the number of samples in the whole dataset and 4812 is the number of samples in the training set after the split.

After that, I do not know what to do. I mean: when do X_test and t_test come into play? I cannot figure out how I should use them after the cross-validation, and how to obtain the final accuracy.

Bonus question: I would also like to perform scaling and dimensionality reduction (through feature selection or PCA) within each step of the cross-validation. How can I do that? I have seen that defining a pipeline can help with scaling, but I do not know how to apply it to the second problem.

I would really appreciate your help :-)
Please take a look at scikit-learn's cross-validation documentation for more information.

Also, you are using the cross_validation module, which is deprecated; it has been replaced by the model_selection module in recent scikit-learn versions, so prefer that for new code.

Now, for your questions, here is a simple snippet describing the above, borrowing your code (please read the comments and ask if anything is unclear):
```python
# item feature matrix in X
X = data[features[:-1]].as_matrix()
# remove first column because it is not necessary in the analysis
X = np.delete(X, 0, axis=1)

# divide in training and test set
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)

# Until here everything is good.
# You keep away 20% of data for testing (test_size=0.2).
# This test data should be unseen by any of the below methods.

# define method
logreg = LogisticRegression()

# Ideally what you are doing here should be correct, unless you did anything
# wrong in the dataframe operations (which apparently has been solved).

# cross validation prediction
# This cross validation prediction will output the predicted values of 't_train'.
predicted = cross_validation.cross_val_predict(logreg, X_train, t_train, cv=10)

# Internal working of cross_val_predict:
# 1. Get the data and estimator (logreg, X_train, t_train)
# 2. From here on, we will use X_train as X_cv and t_train as t_cv
#    (because cross_val_predict doesn't know that it's our training data) - Doubts??
# 3. Split X_cv, t_cv into X_cv_train, X_cv_test, t_cv_train, t_cv_test by using its internal cv
# 4. Use X_cv_train, t_cv_train for fitting 'logreg'
# 5. Predict on X_cv_test (no use of t_cv_test)
# 6. Repeat steps 3 to 5 for cv=10 iterations, each time using different data
#    for training and different data for testing.

# So here you are correctly comparing 'predicted' and 't_train'.
print(metrics.accuracy_score(t_train, predicted))

# The above metric will show you how our estimator 'logreg' works on the
# 'X_train' data. If the accuracies are very high it may be because of overfitting.

# Now, what to do about the X_test and t_test above?
# Actually the correct data for the final metric is X_test and t_test:
# if you are satisfied with the accuracies on the training data, then you
# should fit the entire training data to the estimator and then predict on X_test.
logreg.fit(X_train, t_train)
t_pred = logreg.predict(X_test)

# Here is the final accuracy.
print(metrics.accuracy_score(t_test, t_pred))
# If this accuracy is good, then your model is good.
```
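Since the cross_validation module is deprecated, here is a minimal sketch of the same workflow using the current model_selection API (an assumption of scikit-learn >= 0.18; variable names are borrowed from your code):

```python
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Hold out 20% of the data for the final, one-time evaluation.
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)

logreg = LogisticRegression()

# Cross-validated predictions on the training data only.
predicted = cross_val_predict(logreg, X_train, t_train, cv=10)
print(metrics.accuracy_score(t_train, predicted))

# Fit on all training data, then evaluate once on the held-out test set.
logreg.fit(X_train, t_train)
print(metrics.accuracy_score(t_test, logreg.predict(X_test)))
```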
If you have little data, or you do not want to split the data into training and test sets, then you should use the approach suggested by @fuzzyhedge:
```python
# Use cross_val_score on all of your data
scores = model_selection.cross_val_score(logreg, X, y, cv=10)

# 'cross_val_score' works almost the same as steps 1 to 4 above, then:
# 5. t_cv_pred = logreg.predict(X_cv_test) and calculate accuracy with t_cv_test.
# 6. Repeat steps 1 to 5 for cv_iterations = 10.
# 7. Return the array of accuracies calculated in step 5.

# Average the returned accuracies to see the model performance.
scores = scores.mean()
```
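To make the numbered steps above concrete, here is a minimal sketch of roughly what cross_val_score does internally (a simplification, assuming the modern model_selection API, numpy-array inputs, and plain accuracy scoring):

```python
import numpy as np
from sklearn import metrics
from sklearn.model_selection import KFold

def manual_cross_val_score(estimator, X, y, n_splits=10):
    """Simplified re-implementation of cross_val_score, for illustration only."""
    accuracies = []
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
        # Fit on the training part of this fold ...
        estimator.fit(X[train_idx], y[train_idx])
        # ... then predict and score on the held-out part.
        y_pred = estimator.predict(X[test_idx])
        accuracies.append(metrics.accuracy_score(y[test_idx], y_pred))
    return np.array(accuracies)
```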
Note: cross-validation is best used together with a grid search, to find the parameters of the estimator that perform best for the given data. For example, LogisticRegression defines many parameters. But if you use
```python
logreg = LogisticRegression()
```
the model will be initialized with the default parameters only. A different set of parameter values, such as
```python
logreg = LogisticRegression(penalty='l1', solver='liblinear')
```
may perform better on your data. This search for better parameters is what grid search does.
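For example, a minimal grid-search sketch might look like this (the parameter grid here is only an illustration, not a recommendation for your data):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.01, 0.1, 1, 10],
    'solver': ['liblinear'],  # liblinear supports both l1 and l2 penalties
}

# Run the search on the training data only, with 10-fold CV per candidate.
grid = GridSearchCV(LogisticRegression(), param_grid, cv=10, scoring='accuracy')
grid.fit(X_train, t_train)
print(grid.best_params_, grid.best_score_)
```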
Now, for the second part of your question about scaling, dimensionality reduction, etc., use a pipeline. You can refer to the pipeline documentation and to the following examples, and see the sketch after the links:
- http://scikit-learn.org/stable/auto_examples/feature_stacker.html#sphx-glr-auto-examples-feature-stacker-py
- http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html#sphx-glr-auto-examples-plot-digits-pipe-py
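For the bonus question, a minimal pipeline sketch could look like the one below: because the scaler and PCA are part of the estimator passed to cross_val_score, they are re-fitted on the training part of every fold, so no information leaks from the validation part (n_components=10 is an arbitrary example value):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scale', StandardScaler()),      # scaling, fitted per fold
    ('pca', PCA(n_components=10)),    # dimensionality reduction, fitted per fold
    ('logreg', LogisticRegression()),
])

scores = cross_val_score(pipe, X_train, t_train, cv=10)
print(scores.mean())
```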
Feel free to ask if you need any help.
Here is working code, tested on an example dataframe. The first problem in your code is that the target array is not an np.array. You also should not have the target data among your features. Below I demonstrate how to manually split the training and testing data using train_test_split, and I also show how to use the wrapper cross_val_score to automatically split, fit, and score.
```python
import random
import string

import numpy as np
import pandas as pd
from sklearn import linear_model, model_selection

random.seed(42)  # note: this seeds Python's random module, not numpy

# Create example df with alphabetic col names.
alphabet_cols = list(string.ascii_uppercase)[:26]
df = pd.DataFrame(np.random.randint(1000, size=(1000, 26)),
                  columns=alphabet_cols)
df['Target'] = df['A']
df.drop(['A'], axis=1, inplace=True)
print(df.head())
y = df.Target.values  # df['Target'] is not an np.array.
feature_cols = [i for i in list(df.columns) if i != 'Target']
X = df.ix[:, feature_cols].as_matrix()

# Illustrated here: manual splitting of training and testing data.
X_train, X_test, y_train, y_test = \
    model_selection.train_test_split(X, y, test_size=0.2, random_state=0)

# Initialize model.
logreg = linear_model.LinearRegression()

# Use cross_val_score to automatically split, fit, and score.
scores = model_selection.cross_val_score(logreg, X, y, cv=10)
print(scores)
print('average score: {}'.format(scores.mean()))
```
Output:
```
     B    C    D    E    F    G    H    I    J    K  ...  Target
0   20   33  451    0  420  657  954  156  200  935  ...     253
1  427  533  801  183  894  822  303  623  455  668  ...     421
2  148  681  339  450  376  482  834   90   82  684  ...     903
3  289  612  472  105  515  845  752  389  532  306  ...     639
4  556  103  132  823  149  974  161  632  153  782  ...     347

[5 rows x 26 columns]
[-0.0367 -0.0874 -0.0094 -0.0469 -0.0279 -0.0694 -0.1002 -0.0399  0.0328 -0.0409]
average score: -0.04258093018969249
```
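Note that this example uses linear_model.LinearRegression on random data, so cross_val_score reports the estimator's default R² scores, which hover around zero here. For your actual classification problem you would swap in LogisticRegression (with a genuinely categorical target), and cross_val_score would then report accuracy by default:

```python
# With a classifier and a categorical target, cross_val_score
# reports accuracy by default instead of R².
logreg = linear_model.LogisticRegression()
scores = model_selection.cross_val_score(logreg, X, y, cv=10)
```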
Useful references:
- Converting a pandas DataFrame to a NumPy array
- Selecting all columns in a dataframe except a subset of columns
- sklearn.model_selection.train_test_split
- sklearn.model_selection.cross_val_score