Python mlpy Classification of text
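I've just started using the mlpy library and am looking for the best way to implement sentence classification. I was thinking of using mlpy's basic Perceptron, but as far as I understand it works with a predefined vector size, whereas I need the vector to grow dynamically as the machine learns, because I don't want to create one huge vector (covering every English word). What I actually need to do is take a list of sentences and build a classifier vector from them; then, when the application receives a new sentence, it should try to classify it automatically into one of the labels (supervised learning).
Any ideas, thoughts, and examples would be very helpful,
Thanks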
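If you have all the sentences beforehand, you can prepare a dictionary of words (with stop words removed) and map each word to a feature. The size of the vector is then the number of words in the dictionary.
Once you have that, you can train a perceptron.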
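To make the dictionary idea concrete, here is a minimal pure-Python sketch. It does not use mlpy; the stop-word list, the function names, and the sample sentences are all hypothetical and only there for illustration.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "to", "of"}

def tokenize(sentence):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

def build_vocabulary(sentences):
    """Map each distinct word to a feature index, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in tokenize(sentence):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def vectorize(sentence, vocab):
    """Return a word-count vector of length len(vocab); unknown words are ignored."""
    vector = [0] * len(vocab)
    for word, count in Counter(tokenize(sentence)).items():
        if word in vocab:
            vector[vocab[word]] = count
    return vector

sentences = ["the cat sat", "the dog sat on the cat"]
vocab = build_vocabulary(sentences)
vectors = [vectorize(s, vocab) for s in sentences]
print(vocab)    # {'cat': 0, 'sat': 1, 'dog': 2, 'on': 3}
print(vectors)  # [[1, 1, 0, 0], [1, 1, 1, 1]]
```

Because the vocabulary is built only from the sentences you already have, the vector length stays at the number of distinct words in your data rather than the size of the whole English language. Words that appear only in a new sentence are simply ignored at prediction time in this sketch.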
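Take a look at my code, in which I did the mapping in Perl and then used a perceptron implementation in MATLAB, to see how it works, and then write a similar implementation in Python.
Preparing the bag-of-words model (Perl)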
    use warnings;
    use strict;

    my %positions = ();   # word -> 1-based feature index
    my $n = 0;            # number of distinct words seen so far
    my $spam = -1;        # label of the current line: 1 = spam, -1 = not spam

    # Convert the training set to libsvm format: "label index:value index:value ..."
    open (INFILE, "q4train.dat") or die "Cannot open q4train.dat: $!";
    open (OUTFILE, ">q4train_mod.dat") or die "Cannot open q4train_mod.dat: $!";
    while (<INFILE>) {
        chomp;
        my @values = split(' ', $_);
        my %frequencies = ();
        for (my $i = 0; $i < scalar(@values); $i = $i+2) {
            if ($i==0) {
                # The second field holds the label.
                if ($values[1] eq 'spam') {
                    $spam = 1;
                }
                else {
                    $spam = -1;
                }
            }
            else {
                # The remaining fields are word/count pairs.
                $frequencies{$values[$i]} = $values[$i+1];
                if (!exists ($positions{$values[$i]})) {
                    $n++;
                    $positions{$values[$i]} = $n;
                }
            }
        }
        print OUTFILE $spam;
        my @keys = sort { $positions{$a} <=> $positions{$b} } keys %positions;
        foreach my $word (@keys) {
            if (exists ($frequencies{$word})) {
                print OUTFILE " ".$positions{$word}.":".$frequencies{$word};
            }
        }
        print OUTFILE "\n";
    }
    close (INFILE);
    close (OUTFILE);

    # Convert the test set the same way. Words first seen here get new
    # indices after the training vocabulary.
    open (INFILE, "q4test.dat") or die "Cannot open q4test.dat: $!";
    open (OUTFILE, ">q4test_mod.dat") or die "Cannot open q4test_mod.dat: $!";
    while (<INFILE>) {
        chomp;
        my @values = split(' ', $_);
        my %frequencies = ();
        for (my $i = 0; $i < scalar(@values); $i = $i+2) {
            if ($i==0) {
                if ($values[1] eq 'spam') {
                    $spam = 1;
                }
                else {
                    $spam = -1;
                }
            }
            else {
                $frequencies{$values[$i]} = $values[$i+1];
                if (!exists ($positions{$values[$i]})) {
                    $n++;
                    $positions{$values[$i]} = $n;
                }
            }
        }
        print OUTFILE $spam;
        my @keys = sort { $positions{$a} <=> $positions{$b} } keys %positions;
        foreach my $word (@keys) {
            if (exists ($frequencies{$word})) {
                print OUTFILE " ".$positions{$word}.":".$frequencies{$word};
            }
        }
        print OUTFILE "\n";
    }
    close (INFILE);
    close (OUTFILE);

    # Write the word list, one word per line, ordered by feature index.
    open (OUTFILE, ">wordlist.dat") or die "Cannot open wordlist.dat: $!";
    my @keys = sort { $positions{$a} <=> $positions{$b} } keys %positions;
    foreach my $word (@keys) {
        print OUTFILE $word."\n";
    }
    close (OUTFILE);
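For Python readers, the per-line conversion the Perl script performs can be sketched roughly as follows. The input layout ("id label word count word count ...") is inferred from how the script indexes its fields, and the function name and sample lines are hypothetical.

```python
def to_libsvm_line(line, positions):
    """Convert 'id label word count word count ...' into 'label idx:count ...'.

    `positions` maps words to 1-based feature indices and is extended in place
    when a new word is seen, much like %positions in the Perl script.
    """
    values = line.split()
    label = 1 if values[1] == "spam" else -1
    features = {}
    # Fields from the third onward are word/count pairs.
    for word, count in zip(values[2::2], values[3::2]):
        if word not in positions:
            positions[word] = len(positions) + 1
        features[positions[word]] = count
    # libsvm expects feature indices in ascending order.
    pairs = " ".join(f"{idx}:{features[idx]}" for idx in sorted(features))
    return f"{label} {pairs}".rstrip()

positions = {}
print(to_libsvm_line("msg1 spam free 2 money 1", positions))  # 1 1:2 2:1
print(to_libsvm_line("msg2 ham hello 1 free 1", positions))   # -1 1:1 3:1
```

Only the words present on a line are written out, so each line stays short even when the dictionary is large.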
Perceptron implementation (MATLAB)
    clc; clear; close all;

    [Ytrain, Xtrain] = libsvmread('q4train_mod.dat');
    [Ytest, Xtest] = libsvmread('q4test_mod.dat');
    mtrain = size(Xtrain,1);
    mtest = size(Xtest,1);
    n = size(Xtrain,2);

    % part a
    % learn perceptron
    Xtrain_perceptron = [ones(mtrain,1) Xtrain];
    Xtest_perceptron = [ones(mtest,1) Xtest];
    alpha = 0.1;
    %initialize
    theta_perceptron = zeros(n+1,1);
    trainerror_mag = 100000;
    iteration = 0;

    %loop
    while (trainerror_mag>1000)
        iteration = iteration+1;
        for i = 1 : mtrain
            Ypredict_temp = sign(theta_perceptron'*Xtrain_perceptron(i,:)');
            theta_perceptron = theta_perceptron + alpha*(Ytrain(i)-Ypredict_temp)*Xtrain_perceptron(i,:)';
        end
        Ytrainpredict_perceptron = sign(theta_perceptron'*Xtrain_perceptron')';
        trainerror_mag = (Ytrainpredict_perceptron - Ytrain)'*(Ytrainpredict_perceptron - Ytrain)
    end
    Ytestpredict_perceptron = sign(theta_perceptron'*Xtest_perceptron')';
    testerror_mag = (Ytestpredict_perceptron - Ytest)'*(Ytestpredict_perceptron - Ytest)
I'd rather not write the same code again in Python, but this should give you a guide on how to proceed.
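For readers who want a starting point anyway, the perceptron update rule can be sketched in plain Python along these lines. This is not a line-for-line port and does not use mlpy: the function names and the toy data are hypothetical, and as an assumption the loop stops when every training example is classified correctly or after `max_epochs` passes, instead of using a squared-error threshold.

```python
def sign(x):
    """Return 1, -1, or 0, matching the behaviour of MATLAB's sign()."""
    return 1 if x > 0 else -1 if x < 0 else 0

def predict(theta, x):
    """theta[0] is the bias weight; x is a feature vector without the bias term."""
    activation = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sign(activation)

def train_perceptron(X, y, alpha=0.1, max_epochs=100):
    """Train on labels in {-1, 1}; returns the weight vector including the bias."""
    theta = [0.0] * (len(X[0]) + 1)
    for _ in range(max_epochs):
        for xi, yi in zip(X, y):
            error = yi - predict(theta, xi)
            if error != 0:
                # Same update as the MATLAB loop: theta += alpha * error * [1, x]
                theta[0] += alpha * error
                for j, value in enumerate(xi):
                    theta[j + 1] += alpha * error * value
        # Stop early once every training example is classified correctly.
        if all(predict(theta, xi) == yi for xi, yi in zip(X, y)):
            break
    return theta

# Toy word-count vectors: features 0 and 2 suggest spam, feature 1 suggests not spam.
X = [[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 1, 0]]
y = [1, -1, 1, -1]
theta = train_perceptron(X, y)
print([predict(theta, x) for x in X])  # [1, -1, 1, -1]
```

The `max_epochs` cap matters because the perceptron only converges when the classes are linearly separable; without it, the loop would never end on data that is not.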