Machine learning: logistic regression and cross-validation in Python (with sklearn)

I am trying to solve a classification problem on a given dataset, through logistic regression (and this is not the problem). To avoid overfitting, I am trying to implement it through cross-validation (and here is the problem): there is something I am missing to complete the program. My purpose here is to determine the accuracy.

But let me get specific. This is what I have done:

  • I split the dataset into a training set and a test set
  • I defined the logistic regression prediction model to be used
  • I made the predictions with the cross_val_predict method (from sklearn.cross_validation)
  • Finally, I measured the accuracy

The code is the following:

    import pandas as pd
    import numpy as np
    import seaborn as sns
    from sklearn.cross_validation import train_test_split
    from sklearn import metrics, cross_validation
    from sklearn.linear_model import LogisticRegression

    # read training data in pandas dataframe
    data = pd.read_csv("./dataset.csv", delimiter=';')
    # last column is target, store in array t
    t = data['TARGET']
    # list of features, including target
    features = data.columns
    # item feature matrix in X
    X = data[features[:-1]].as_matrix()
    # remove first column because it is not necessary in the analysis
    X = np.delete(X,0,axis=1)
    # divide in training and test set
    X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)

    # define method
    logreg=LogisticRegression()

    # cross validation prediction
    predicted = cross_validation.cross_val_predict(logreg, X_train, t_train, cv=10)
    print(metrics.accuracy_score(t_train, predicted))

My questions:

  • As far as I understand, the test set should not be considered until the very end, and cross-validation should be performed on the training set. That is why I passed X_train and t_train to the cross_val_predict method. However, I get an error saying:

      ValueError: Found input variables with inconsistent numbers of samples: [6016, 4812]

    where 6016 is the number of samples in the whole dataset and 4812 is the number of samples in the training set after the split.

  • After this, I do not know what to do. I mean: when do X_test and t_test come into play? I do not understand how I should use them after the cross-validation, and how to obtain the final accuracy.

Bonus question: I would also like to perform scaling and dimensionality reduction (via feature selection or PCA) inside every step of the cross-validation. How can I do that? I have seen that defining a pipeline can help with the scaling, but I do not know how to apply it to the second problem.

I would really appreciate your help :-)


Please have a look at the scikit-learn cross-validation documentation for more insight.

Also, you are using cross_val_predict incorrectly. What it does internally is call the cv you supplied (cv=10) to split the supplied data (i.e. your X_train, t_train) again into train and test folds, fit the estimator on the train folds, and predict on the data that remains in the test fold.

Now, for the usage of your X_test and t_test: you should first fit your estimator on the train data (cross_val_predict does not fit the estimator you pass in), then use it to predict on the test data, and then calculate the accuracy.

A simple code snippet describing the above, borrowing from your code (do read the comments and ask if anything is unclear):

    # item feature matrix in X
    X = data[features[:-1]].as_matrix()
    # remove first column because it is not necessary in the analysis
    X = np.delete(X,0,axis=1)
    # divide in training and test set
    X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)

    # Until here everything is good
    # You keep away 20% of data for testing (test_size=0.2)
    # This test data should be unseen by any of the below methods

    # define method
    logreg=LogisticRegression()

    # Ideally what you are doing here should be correct, unless you did something wrong
    # in the dataframe operations (which apparently has been solved). Your earlier
    # ValueError most likely came from passing arrays of different lengths somewhere,
    # e.g. the full target t (6016 samples) together with the split X_train (4812 samples).
    # cross validation prediction
    # This cross validation prediction will print the predicted values of 't_train'
    predicted = cross_validation.cross_val_predict(logreg, X_train, t_train, cv=10)
    # internal working of cross_val_predict:
      #1. Get the data and estimator (logreg, X_train, t_train)
      #2. From here on, we will use X_train as X_cv and t_train as t_cv (because cross_val_predict doesn't know that it's our training data) - any doubts?
      #3. Split X_cv, t_cv into X_cv_train, X_cv_test, t_cv_train, t_cv_test by using its internal cv
      #4. Use X_cv_train, t_cv_train for fitting 'logreg'
      #5. Predict on X_cv_test (No use of t_cv_test)
      #6. Repeat steps 3 to 5 for cv=10 iterations, each time using different folds for training and testing.

    # So here you are correctly comparing 'predicted' and 't_train'
    print(metrics.accuracy_score(t_train, predicted))

    # The above metrics will show you how our estimator 'logreg' works on 'X_train' data. If the accuracies are very high it may be because of overfitting.

    # Now what to do about the X_test and t_test above.
    # Actually the correct pair for the final metric is X_test and t_test
    # If you are satisfied with the accuracies on the training data, then you should fit the estimator on the entire training data and then predict on X_test

    logreg.fit(X_train, t_train)
    t_pred = logreg.predict(X_test)

    # Here is the final accuracy
    print(metrics.accuracy_score(t_test, t_pred))
    # If this accuracy is good, then your model is good.
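
To make steps 1 to 6 in the comments above concrete, here is a rough manual equivalent of cross_val_predict built on KFold. This is only a sketch: sklearn's actual implementation differs in details (for classifiers, for example, cross_val_predict defaults to stratified folds), and manual_cross_val_predict is a made-up name for illustration:

    import numpy as np
    from sklearn.model_selection import KFold  # sklearn.cross_validation.KFold in older versions

    def manual_cross_val_predict(estimator, X, t, n_splits=10):
        X, t = np.asarray(X), np.asarray(t)
        predicted = np.empty_like(t)
        kf = KFold(n_splits=n_splits)
        for train_index, test_index in kf.split(X):
            # steps 3-4: fit on the cv-train folds only
            estimator.fit(X[train_index], t[train_index])
            # step 5: predict the held-out fold (t[test_index] is never used for fitting)
            predicted[test_index] = estimator.predict(X[test_index])
        # after all folds, every sample has been predicted exactly once
        return predicted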

If you have less data, or do not want to split the data into training and testing sets, then you should use the approach suggested by @fuzzyhedge:

    from sklearn import model_selection

    # Use cross_val_score on all of your data
    scores = model_selection.cross_val_score(logreg, X, t, cv=10)

    # 'cross_val_score' works almost the same way as steps 1 to 4 above
      #5. t_cv_pred = logreg.predict(X_cv_test) and calculate accuracy with t_cv_test.
      #6. Repeat steps 1 to 5 for cv_iterations = 10
      #7. Return array of accuracies calculated in step 5.

    # Find out average of returned accuracies to see the model performance
    scores = scores.mean()

Note: cross-validation is best used together with grid search, to find out the parameters of the estimator that perform best for the given data. For example, LogisticRegression defines many parameters. But if you use

    logreg = LogisticRegression()

the model is initialized with the default parameters only. Maybe a different value of a parameter, such as

    logreg = LogisticRegression(penalty='l1', solver='liblinear')

may perform better for your data. This searching for better parameters is grid search.
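
As a sketch of what that could look like with GridSearchCV (the parameter grid below is illustrative only, not tuned for your data):

    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression

    # Illustrative grid; the values are placeholders, not recommendations.
    param_grid = {
        'penalty': ['l1', 'l2'],
        'C': [0.01, 0.1, 1, 10],
        'solver': ['liblinear'],  # liblinear supports both penalties
    }

    grid = GridSearchCV(LogisticRegression(), param_grid, cv=10, scoring='accuracy')
    grid.fit(X_train, t_train)  # cross-validates every parameter combination on the training data
    print(grid.best_params_)    # combination with the best mean cv accuracy
    print(grid.best_score_)

With the default refit=True, grid.best_estimator_ is refit on all of the passed data and can then be used to predict on X_test exactly as above.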

Now, for the second part about scaling, dimensionality reduction, etc., use a pipeline. You can refer to the pipeline documentation and the examples below; a short pipeline sketch follows the links.

  • http://scikit-learn.org/stable/auto_examples/feature_stacker.html#sphx-glr-auto-examples-feature-stacker-py
  • http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html#sphx-glr-auto-examples-plot-digits-pipe-py
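
To address the bonus question directly, here is a minimal sketch assuming StandardScaler for the scaling and PCA for the dimensionality reduction (the number of components is an arbitrary placeholder). Because both transformers live inside the pipeline, they are re-fit on the training folds of every cross-validation split, so no information from the held-out folds leaks into the preprocessing:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Each step is a (name, transformer/estimator) pair; the final step must be an estimator.
    pipe = Pipeline([
        ('scale', StandardScaler()),
        ('pca', PCA(n_components=10)),  # placeholder value
        ('logreg', LogisticRegression()),
    ])

    scores = cross_val_score(pipe, X_train, t_train, cv=10)
    print(scores.mean())

A feature-selection step such as sklearn.feature_selection.SelectKBest could be swapped in for PCA, and pipeline parameters can be tuned in the grid search above using the 'step__parameter' syntax (e.g. 'pca__n_components').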

Feel free to ask if you need any help.


Here is working code, tested on a sample dataframe. The first problem in your code is that the target array is not an np.array. You also should not have the target data among your features. Below I demonstrate how to split the training and testing data manually using train_test_split, and how to use the wrapper cross_val_score to automatically split, fit, and score.

    import string

    import numpy as np
    import pandas as pd
    from sklearn import linear_model, model_selection

    np.random.seed(42)  # seed numpy's RNG (random.seed would not affect np.random)
    # Create example df with alphabetic col names.
    alphabet_cols = list(string.ascii_uppercase)[:26]
    df = pd.DataFrame(np.random.randint(1000, size=(1000, 26)),
                      columns=alphabet_cols)
    df['Target'] = df['A']
    df.drop(['A'], axis=1, inplace=True)
    print(df.head())
    y = df.Target.values  # df['Target'] is not an np.array.
    feature_cols = [i for i in list(df.columns) if i != 'Target']
    X = df.loc[:, feature_cols].as_matrix()
    # Illustrated here for manual splitting of training and testing data.
    X_train, X_test, y_train, y_test = \
        model_selection.train_test_split(X, y, test_size=0.2, random_state=0)

    # Initialize model.
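    # Note: LinearRegression is used here just to illustrate the split/score
    # workflow on random data; for the classification problem in the question
    # you would use linear_model.LogisticRegression() instead. With a regressor,
    # cross_val_score returns R^2 scores, which is why the values below can be
    # negative.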
    logreg = linear_model.LinearRegression()

    # Use cross_val_score to automatically split, fit, and score.
    scores = model_selection.cross_val_score(logreg, X, y, cv=10)
    print(scores)
    print('average score: {}'.format(scores.mean()))

Output:

         B    C    D    E    F    G    H    I    J    K   ...    Target
    0   20   33  451    0  420  657  954  156  200  935   ...    253
    1  427  533  801  183  894  822  303  623  455  668   ...    421
    2  148  681  339  450  376  482  834   90   82  684   ...    903
    3  289  612  472  105  515  845  752  389  532  306   ...    639
    4  556  103  132  823  149  974  161  632  153  782   ...    347

    [5 rows x 26 columns]
    [-0.0367 -0.0874 -0.0094 -0.0469 -0.0279 -0.0694 -0.1002 -0.0399  0.0328
     -0.0409]
    average score: -0.04258093018969249

Useful references:

  • Convert a pandas dataframe to a numpy array
  • Select all columns in a dataframe except a subset of columns
  • sklearn.model_selection.train_test_split
  • sklearn.model_selection.cross_val_score