How can I use the complete Penn Treebank dataset in Python/NLTK?
I am trying to learn how to use the nltk package in Python. In particular, I need to use the Penn Treebank dataset in NLTK. As far as I know, if I call
If you have access to a full installation of the Penn Treebank, NLTK
can be configured to load it as well. Download the ptb package, and in
the directory nltk_data/corpora/ptb place the BROWN and WSJ
directories of the Treebank installation (symlinks work as well). Then
use the ptb module instead of treebank:
So I opened Python from the terminal, imported nltk, and typed
    $: pwd
    $: ~/nltk_data/corpora/ptb/WSJ
    $: ls
    $: 00 02 04 06 08 10 12 14 16 18 20 22 24
    01 03 05 07 09 11 13 15 17 19 21 23 merge.log
In each of the folders from 00 to 24, there are many
Now, on to my question. Again, according to here:
if I write the following, I should be able to get the file IDs:
    >>> from nltk.corpus import ptb
    >>> print(ptb.fileids()) # doctest: +SKIP
    ['BROWN/CF/CF01.MRG', 'BROWN/CF/CF02.MRG', 'BROWN/CF/CF03.MRG', 'BROWN/CF/CF04.MRG', ...]
Unfortunately, when I type this, the list comes back empty:
    >>> print(ptb.fileids())
    []
Can anyone help me?
Edit: here are the contents of my ptb directory and of the allcats.txt file:
    $: pwd
    $: ~/nltk_data/corpora/ptb
    $: ls
    $: allcats.txt WSJ
    $: cat allcats.txt
    $: WSJ/00/WSJ_0001.MRG news
    WSJ/00/WSJ_0002.MRG news
    WSJ/00/WSJ_0003.MRG news
    WSJ/00/WSJ_0004.MRG news
    WSJ/00/WSJ_0005.MRG news
    and so on ..
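One way to narrow down a mismatch like this is to check whether every file ID listed in allcats.txt has an exact-case counterpart on disk. The following is only a sketch: `missing_fileids` is a hypothetical helper, not part of NLTK, and it assumes each line of the category file starts with a file ID followed by whitespace, as in the listing above.

```python
import os


def missing_fileids(corpus_root, cat_file="allcats.txt"):
    """Return the file IDs listed in cat_file that have no exact-case match on disk."""
    # Collect every file below corpus_root as a '/'-separated relative path,
    # using the names exactly as the filesystem stores them.
    on_disk = set()
    for dirpath, _dirnames, filenames in os.walk(corpus_root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), corpus_root)
            on_disk.add(rel.replace(os.sep, "/"))

    missing = []
    with open(os.path.join(corpus_root, cat_file), encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            # Skip blank lines; the first field is assumed to be the file ID.
            if parts and parts[0] not in on_disk:
                missing.append(parts[0])
    return missing
```

Calling `missing_fileids(os.path.expanduser("~/nltk_data/corpora/ptb"))` would list the entries whose names differ from what is on disk, including differences in letter case only.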
The PTB corpus reader needs uppercase directory and file names (according to the ... included in the question
A quick way to fix this is to rename the folders and files to uppercase:
    find . -depth | \
    while read LONG
    do
        SHORT=$( basename "$LONG" | tr '[:lower:]' '[:upper:]' )
        DIR=$( dirname "$LONG" )
        if [ "${LONG}" != "${DIR}/${SHORT}" ]
        then
            mv "${LONG}" "${DIR}/${SHORT}"
        fi
    done
(Taken from this question.) It recursively changes directory and file names to uppercase.
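For anyone who would rather stay in Python, roughly the same recursive rename can be sketched with `os.walk`. This is an illustrative alternative, not the script above: `uppercase_tree` is a hypothetical helper name, and it leaves the top-level directory passed to it unchanged.

```python
import os


def uppercase_tree(root):
    """Rename every file and directory below root to its uppercase form."""
    # topdown=False visits the deepest entries first, so a directory is
    # renamed only after everything inside it has already been handled.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            upper = name.upper()
            if upper != name:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, upper))
```

Pointing it at the WSJ directory, for example `uppercase_tree(os.path.expanduser("~/nltk_data/corpora/ptb/WSJ"))`, rather than at ptb itself leaves allcats.txt untouched. Afterwards, `ptb.fileids()` can be called again in a fresh Python session to see whether the files are picked up.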