Machine learning: text classification with Python mlpy

Python mlpy Classification of text

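I've just started with the mlpy library and am looking for the best way to implement sentence classification. I was thinking of using the basic mlpy perceptron for it, but as far as I understand it uses a predefined vector size, whereas I need to grow the vector dynamically as the machine learns, because I don't want to create one huge vector (covering all English words). What I actually need to do is take a list of sentences and build a classifier vector from them; then, when the application gets a new sentence, it should try to classify it automatically into one of the labels (supervised learning).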

Any ideas, thoughts, and examples would be very helpful.

Thanks

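If you have all the sentences beforehand, you can prepare a dictionary of words (with stop words removed) that maps each word to a feature. The dimension of the vector is then the number of words in the dictionary.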
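The mapping described above could be sketched in Python roughly as follows. This is only an illustration and does not use mlpy: the `STOP_WORDS` set, the helper names `tokenize`, `build_vocabulary`, and `vectorize`, and the plain whitespace tokenization are all assumptions made for the example.

```python
# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and"}

def tokenize(sentence):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

def build_vocabulary(sentences):
    """Map each distinct word to a feature index, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in tokenize(sentence):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def vectorize(sentence, vocab):
    """Return a word-count vector whose length is the vocabulary size."""
    vector = [0] * len(vocab)
    for word in tokenize(sentence):
        if word in vocab:          # words outside the vocabulary are ignored
            vector[vocab[word]] += 1
    return vector

sentences = ["the cat sat", "the dog sat and sat"]
vocab = build_vocabulary(sentences)
print(vocab)                          # {'cat': 0, 'sat': 1, 'dog': 2}
print(vectorize("a dog sat", vocab))  # [0, 1, 1]
```

The vector length is fixed by the training sentences, so a word that first shows up at prediction time is simply skipped rather than enlarging the vector.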

Once you have that, you can train a perceptron.
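As a rough idea of what that training step involves, here is a minimal pure-Python perceptron over fixed-length count vectors. It is a sketch, not the mlpy API: the function names `predict` and `train_perceptron`, the `learning_rate` and `max_epochs` parameters, and the toy data are assumptions for illustration.

```python
def predict(weights, bias, x):
    """Return +1 or -1 depending on the sign of the linear score."""
    score = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > 0 else -1

def train_perceptron(X, y, learning_rate=0.1, max_epochs=100):
    """Perceptron rule: adjust the weights only on misclassified examples."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(X, y):
            if predict(weights, bias, x) != target:
                mistakes += 1
                bias += learning_rate * target
                for j, xi in enumerate(x):
                    weights[j] += learning_rate * target * xi
        if mistakes == 0:      # every training example is classified correctly
            break
    return weights, bias

# Toy word-count vectors (the columns are hypothetical features), labels +1 / -1.
X = [[2, 0, 1], [1, 0, 0], [0, 2, 1], [0, 1, 0]]
y = [1, 1, -1, -1]
weights, bias = train_perceptron(X, y)
print([predict(weights, bias, x) for x in X])  # [1, 1, -1, -1]
```

The `max_epochs` cap is there because the perceptron rule only converges when the training data are linearly separable.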

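Have a look at my code below, where I did the mapping in Perl and then used a perceptron implementation in Matlab, to get an idea of how it works, and then write a similar implementation in Python.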

Preparing the bag-of-words model (Perl)

use warnings;
use strict;

my %positions = ();   # word -> 1-based feature index, shared by both files
my $n = 0;            # number of distinct words seen so far

# Convert a data file to libsvm format: "<label> <index>:<count> ...".
# On each input line the first field is skipped, the second is the label
# ('spam' becomes 1, anything else -1), and the rest are word/count pairs.
sub convert {
    my ($infile, $outfile) = @_;
    open(my $in,  '<', $infile)  or die "Cannot open $infile: $!";
    open(my $out, '>', $outfile) or die "Cannot open $outfile: $!";
    while (<$in>) {
        chomp;
        my @values = split(' ', $_);
        next if @values < 2;    # skip blank or malformed lines
        my $spam = ($values[1] eq 'spam') ? 1 : -1;
        my %frequencies = ();
        for (my $i = 2; $i + 1 < scalar(@values); $i += 2) {
            $frequencies{$values[$i]} = $values[$i+1];
            if (!exists($positions{$values[$i]})) {
                $n++;
                $positions{$values[$i]} = $n;
            }
        }
        print $out $spam;
        # libsvm expects the feature indices in ascending order
        foreach my $word (sort { $positions{$a} <=> $positions{$b} } keys %frequencies) {
            print $out " ".$positions{$word}.":".$frequencies{$word};
        }
        print $out "\n";
    }
    close($in);
    close($out);
}

convert("q4train.dat", "q4train_mod.dat");
convert("q4test.dat",  "q4test_mod.dat");

# Write the vocabulary, one word per line, in feature-index order
open(my $out, '>', "wordlist.dat") or die "Cannot open wordlist.dat: $!";
foreach my $word (sort { $positions{$a} <=> $positions{$b} } keys %positions) {
    print $out $word."\n";
}
close($out);

Perceptron implementation (Matlab)

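clc; clear; close all;

% Load the libsvm-format files written by the Perl script
[Ytrain, Xtrain] = libsvmread('q4train_mod.dat');
[Ytest, Xtest] = libsvmread('q4test_mod.dat');

mtrain = size(Xtrain,1);
mtest = size(Xtest,1);
n = size(Xtrain,2);

% Words that occur only in the test file get indices above n, so make
% Xtest exactly n columns wide before applying the learned weights
if size(Xtest,2) > n
    Xtest = Xtest(:,1:n);
elseif size(Xtest,2) < n
    Xtest(mtest,n) = 0;
end

% part a
% learn perceptron
Xtrain_perceptron = [ones(mtrain,1) Xtrain];   % prepend a bias column
Xtest_perceptron = [ones(mtest,1) Xtest];
alpha = 0.1;                                   % learning rate
% initialize
theta_perceptron = zeros(n+1,1);
trainerror_mag = 100000;
iteration = 0;
% loop until the squared training error drops to 1000 or below; with
% labels of +/-1 a misclassified example typically contributes 4
while (trainerror_mag>1000)
    iteration = iteration+1;
    for i = 1 : mtrain
        % the update is zero whenever the prediction matches the label
        Ypredict_temp = sign(theta_perceptron'*Xtrain_perceptron(i,:)');
        theta_perceptron = theta_perceptron + alpha*(Ytrain(i)-Ypredict_temp)*Xtrain_perceptron(i,:)';
    end
    Ytrainpredict_perceptron = sign(theta_perceptron'*Xtrain_perceptron')';
    trainerror_mag = (Ytrainpredict_perceptron - Ytrain)'*(Ytrainpredict_perceptron - Ytrain)
end
% evaluate on the test set
Ytestpredict_perceptron = sign(theta_perceptron'*Xtest_perceptron')';
testerror_mag = (Ytestpredict_perceptron - Ytest)'*(Ytestpredict_perceptron - Ytest)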

I didn't want to write the same code again in Python, but this should give you a pointer on how to proceed.
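For the concern in the question about not fixing the vector size up front, one possible variation is to keep the weights in a dictionary keyed by word, so that the model gains an entry only when a word is first involved in an update. The sketch below is not a port of the code above and is not mlpy: the class name `SparsePerceptron`, its methods, the whitespace tokenization, and the toy sentences are all assumptions for illustration.

```python
from collections import defaultdict

class SparsePerceptron:
    """Binary perceptron whose weights live in a dict keyed by word."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = defaultdict(float)   # grows as new words are learned
        self.bias = 0.0

    def _features(self, sentence):
        """Count the words of a sentence (lowercased, split on whitespace)."""
        counts = defaultdict(int)
        for word in sentence.lower().split():
            counts[word] += 1
        return counts

    def predict(self, sentence):
        feats = self._features(sentence)
        # .get() treats unseen words as weight 0.0 without adding entries
        score = self.bias + sum(self.weights.get(w, 0.0) * c
                                for w, c in feats.items())
        return 1 if score > 0 else -1

    def update(self, sentence, label):
        """One online learning step; label must be +1 or -1."""
        if self.predict(sentence) != label:
            self.bias += self.learning_rate * label
            for word, count in self._features(sentence).items():
                self.weights[word] += self.learning_rate * label * count

# Hypothetical labeled sentences: +1 for spam, -1 for everything else.
training = [("buy cheap pills now", 1), ("meeting at noon", -1),
            ("cheap pills cheap", 1), ("lunch meeting today", -1)]
clf = SparsePerceptron()
for _ in range(10):
    for sentence, label in training:
        clf.update(sentence, label)
print(clf.predict("cheap pills"))     # 1
print(clf.predict("meeting today"))   # -1
```

Because updates touch only the words of the current sentence, memory use follows the vocabulary actually seen during training rather than the whole of English.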