R - Document Term Matrix with comma separated text column entries
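I have a data frame in which one column (project_skills) holds a string listing the skills offered by a job (job_id). I want to split this string for each job to get a vector of that job's skills, and then build a document-term matrix showing which skills each job offers (out of all possible skills).
I have the following data frame: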
    job_id  project_skills
    107182  CSS,HTML,Joomla,PHP
    108169  XTCommerce,Magento,Prestashop,VirtueMart,osCommerce
    112969  Google Search Console,Google Analytics,Google Webmaster Central,C++,Java,C#
    114660  Marketing,Email Marketing
    118686  PHP
The result should look like this (essentially a document-term matrix whose terms are the comma-separated phrases):
                project_skills
    job_id      CSS  HTML  PHP  Google Search Console  Google Analytics  Java  ...
    107182      1    0     0    ...
    108169      0    0     0    0                      0
    112969      0    0     0    1                      1                 ...
    114660      0    0     0    ...
    118686      0    0     1    ...
I tried the following:
    library(tm)

    df <- data.frame(job_id = c(107182, 108169, 112969, 114660, 118686),
                     project_skills = c("CSS,HTML,Joomla,PHP",
                                        "XTCommerce,Magento,Prestashop,VirtueMart,osCommerce",
                                        "Google Search Console,Google Analytics,Google Webmaster Central,C++,Java,C#",
                                        "Marketing,Email Marketing",
                                        "PHP"))
    corpus <- Corpus(VectorSource(df$project_skills))
    # Attempt to split each document on commas before building the DTM
    corpus <- tm_map(corpus, function(x) {
      PlainTextDocument(
        strsplit(x, "\\,")[[1]],
        id = ID(x)
      )
    })
    inspect(corpus)

    dtm <- DocumentTermMatrix(corpus)
    as.matrix(dtm)
Unfortunately, this splits on every word rather than on the commas (for example, Google Search Console should be treated as a single term in the DTM).
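There are plenty of solutions, but strsplit is your friend. That is exactly what happens in the code below, where udpipe is told to split on commas through its split argument: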
    library(udpipe)

    df <- data.frame(job_id = c(107182, 108169, 112969, 114660, 118686),
                     project_skills = c("CSS,HTML,Joomla,PHP",
                                        "XTCommerce,Magento,Prestashop,VirtueMart,osCommerce",
                                        "Google Search Console,Google Analytics,Google Webmaster Central,C++,Java,C#",
                                        "Marketing,Email Marketing",
                                        "PHP"),
                     stringsAsFactors = FALSE)

    # Split each string on commas and count the terms per job
    dtm <- document_term_frequencies(x = df$project_skills, document = df$job_id, split = ",")
    # Cast the long term counts to a sparse document-term matrix
    dtm <- document_term_matrix(dtm)

    colnames(dtm)
    #>  [1] "C#"                       "C++"                      "CSS"                      "Email Marketing"
    #>  [5] "Google Analytics"         "Google Search Console"    "Google Webmaster Central" "HTML"
    #>  [9] "Java"                     "Joomla"                   "Magento"                  "Marketing"
    #> [13] "osCommerce"               "PHP"                      "Prestashop"               "VirtueMart"
    #> [17] "XTCommerce"
    rownames(dtm)
    #> [1] "107182" "108169" "112969" "114660" "118686"
    dim(dtm)
    #> [1]  5 17
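For comparison, the same split-then-count idea can be sketched in base R alone, with strsplit and table. This is only an illustration under the assumption that a dense matrix is acceptable for data of this size; the names skills, long and dtm_base are made up for the example and do not come from the answer above.

```r
df <- data.frame(
  job_id = c(107182, 108169, 112969, 114660, 118686),
  project_skills = c("CSS,HTML,Joomla,PHP",
                     "XTCommerce,Magento,Prestashop,VirtueMart,osCommerce",
                     "Google Search Console,Google Analytics,Google Webmaster Central,C++,Java,C#",
                     "Marketing,Email Marketing",
                     "PHP"),
  stringsAsFactors = FALSE
)

# Split on literal commas only, so multi-word skills stay in one piece
skills <- strsplit(df$project_skills, ",", fixed = TRUE)

# Long format: one row per (job, skill) pair
long <- data.frame(
  job_id = rep(df$job_id, lengths(skills)),
  skill  = trimws(unlist(skills)),
  stringsAsFactors = FALSE
)

# Cross-tabulate into a dense jobs x skills count matrix
dtm_base <- unclass(table(long$job_id, long$skill))

dim(dtm_base)
dtm_base["112969", "Google Search Console"]
```

Unlike the udpipe result, dtm_base is an ordinary dense matrix, which is fine for a handful of jobs but may not scale to a large skill vocabulary.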
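tm (like some other text-mining packages) splits on words (whitespace) and, unless you check the settings, tends to remove punctuation and characters such as #. The simplest option is to use tidyr and dplyr: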
    library(tidyr)
    library(dplyr)

    # df1 is the data frame from the question, with project_skills stored as character
    outcome <- df1 %>%
      group_by(job_id) %>%
      mutate(project_skills = strsplit(project_skills, ",")) %>%  # list column of skills
      unnest(project_skills) %>%                                  # one row per (job, skill)
      mutate(value = 1) %>%                                       # add 1 for every value
      spread(key = project_skills, value = value)                 # use fill = 0 if you don't want NA's

    head(outcome)
    #> # A tibble: 5 x 18
    #> # Groups:   job_id [5]
    #>   job_id  `C#` `C++`   CSS `Email Marketin~ `Google Analyti~ `Google Search ~ `Google Webmast~  HTML  Java Joomla Magento Marketing
    #>    <int> <dbl> <dbl> <dbl>            <dbl>            <dbl>            <dbl>            <dbl> <dbl> <dbl>  <dbl>   <dbl>     <dbl>
    #> 1 107182    NA    NA     1               NA               NA               NA               NA     1    NA      1      NA        NA
    #> 2 108169    NA    NA    NA               NA               NA               NA               NA    NA    NA     NA       1        NA
    #> 3 112969     1     1    NA               NA                1                1                1    NA     1     NA      NA        NA
    #> 4 114660    NA    NA    NA                1               NA               NA               NA    NA    NA     NA      NA         1
    #> 5 118686    NA    NA    NA               NA               NA               NA               NA    NA    NA     NA      NA        NA
    #> # ... with 5 more variables: osCommerce <dbl>, PHP <dbl>, Prestashop <dbl>, VirtueMart <dbl>, XTCommerce <dbl>
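If zeros are preferred over NA, a variant of the same pipeline can be sketched with separate_rows and pivot_wider. This assumes tidyr 1.0.0 or later (where pivot_wider is available) and is not part of the original answer; the name outcome_wide and the distinct() guard against a skill being listed twice for one job are additions made for the example.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  job_id = c(107182, 108169, 112969, 114660, 118686),
  project_skills = c("CSS,HTML,Joomla,PHP",
                     "XTCommerce,Magento,Prestashop,VirtueMart,osCommerce",
                     "Google Search Console,Google Analytics,Google Webmaster Central,C++,Java,C#",
                     "Marketing,Email Marketing",
                     "PHP"),
  stringsAsFactors = FALSE
)

outcome_wide <- df %>%
  separate_rows(project_skills, sep = ",") %>%   # one row per (job, skill)
  distinct() %>%                                 # keep each (job, skill) pair once
  mutate(value = 1) %>%                          # indicator that the job offers the skill
  pivot_wider(names_from = project_skills,
              values_from = value,
              values_fill = list(value = 0))     # 0 instead of NA for missing skills

dim(outcome_wide)
```

Here the skill columns appear in order of first occurrence rather than alphabetically, so select columns by name rather than by position.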