Bash Script To Find Dollar Words Not As Fast As Was Hoping

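I've created a bash script to find dollar words. For those who don't know, a dollar word is a word whose letter values add up to 100 when a is worth 1, b is worth 2, c is worth 3, and so on up to z, which is worth 26.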
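To make the definition concrete, here is a minimal sketch of the scoring rule on its own. The function name `word_value` is made up for illustration and is not part of the script in question; it assumes bash 4 or later and plain ASCII input.

```shell
#!/bin/bash
# Hypothetical helper: print the letter-value sum of a word (a=1 ... z=26).
word_value() {
    local word=${1,,}          # lowercase the argument (bash 4+)
    local sum=0 i code
    for (( i = 0; i < ${#word}; i++ )); do
        # A leading single quote makes printf emit the character's ASCII code.
        printf -v code '%d' "'${word:i:1}"
        # Only a-z (ASCII 97..122) count towards the total.
        if (( code >= 97 && code <= 122 )); then
            (( sum += code - 96 ))
        fi
    done
    echo "$sum"
}

word_value cat        # 3 + 1 + 20, prints 24
word_value zithern    # prints 100, so it is a dollar word
```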

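I'm new to programming, so I wrote a very crude script to do this, but it doesn't run as fast as I expected. Something in my code is slowing it down, but I don't know what. Here is my code.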


    #!/bin/bash

    #370101 total words in Words.txt

    line=$(cat line.txt)

    function wordcheck {
        letter=({a..z})
        i=0
        while [ "$i" -le 25 ]
        do
            occurences["$i"]=$(echo $word | grep ${letter["$i"]} -o | wc -l)

            ((i++))
        done
        ((line++))
    }

    until [ "$line" -ge "370102" ]
    do

        word=$(sed -n "$line"p Words.txt)
        wordcheck

        echo "$line" > line.txt

        x=0

        while [ "$x" -le '25' ]
        do
            y=$((x+1))
            charsum["$x"]=$((${occurences[x]} * $y))
            ((x++))
        done

        wordsum=0

        for n in ${charsum[@]}
        do
            (( wordsum += n ))
        done

        tput el

        printf "Word #"
        printf "$(($line - 1))"

        if [ "$wordsum" = '100' ]
        then
            echo $word >> DollarWords.txt
            printf "\n\n"
            printf "$word\n"
            printf '$$$DOLLAR WORD$$$\n\n'
        else
            printf " Not A Dollar Word $word\n"
            tput cuu1
        fi
    done

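I can only guess that it has something to do with the while loops, or with how it constantly writes the value of $line to a file.

I previously wrote a script that adds numbers to generate the Fibonacci sequence, and it finishes almost instantly.

So my question is: what are some ways to make my code run more efficiently? Apologies if this belongs on Code Review.

Any help is greatly appreciated.

Thanks

Edit:

Although I accepted Gordon Davisson's answer, the other answers are just as good if you want to do this. I'd suggest reading everyone else's answers before having a go. Also, as many users pointed out, bash is not a good language for this. Thanks again, everyone, for your suggestions.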

Related discussion

Given:

    $ wc -l words.txt
    370101 words.txt

(i.e., the 370,101-word file linked here)

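In bash alone, start with a loop that reads the file line by line:

    c=0
    while IFS= read -r word; do
        (( c+=1 ))
    done <words.txt
    echo "$c"
    # prints 370,101

Just counting the lines in bash (same file) takes 7.8 seconds on my computer. By comparison, wc executes in microseconds. So the bash version is going to take a while.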
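If you want to reproduce that comparison yourself, a rough sketch follows. It builds a small throwaway file with `seq` instead of using words.txt, and the function name `count_lines_bash` is made up for illustration, so the absolute timings will not match the figures above.

```shell
#!/bin/bash
# Throwaway input file (a stand-in for words.txt, which is assumed not to be present).
tmp=$(mktemp)
seq 1 20000 > "$tmp"

# Hypothetical helper: pure-bash line count, same loop shape as above.
count_lines_bash() {
    local c=0 line
    while IFS= read -r line; do
        (( c += 1 ))
    done < "$1"
    echo "$c"
}

# 'time' is a bash keyword, so it can time a shell function as well as a command.
time count_lines_bash "$tmp"
time wc -l < "$tmp"

rm -f "$tmp"
```

Both commands should report the same count; the interesting part is the gap between the two `real` times, which grows with the size of the file.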

Once you have the file word by word, you can read each word character by character and find that character's index in a string of the alphabet:

    lcl=' abcdefghijklmnopqrstuvwxyz'
    ucl=' ABCDEFGHIJKLMNOPQRSTUVWXYZ'

    while IFS= read -r word; do
        ws=0
        for (( i=0; i<${#word}; i++ )); do
            ch=${word:i:1}
            if [[ "$ch" == [a-z] ]]; then
                x="${lcl%%$ch*}"
                (( ws += ${#x} ))
            elif [[ "$ch" == [A-Z] ]]; then
                x="${ucl%%$ch*}"
                (( ws += ${#x} ))
            fi
        done
        if (( ws==100 )); then
            echo "$word"
        fi
    done <words.txt

Prints:

    abactinally
    abatements
    abbreviatable
    abettors
    abomasusi
    abreption
    ...
    zincifies
    zinkify
    zithern
    zoogleas
    zorgite

That takes about 1:55 on the 370,101-word file.
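The index lookup in that loop relies on the `%%` suffix-removal expansion, which may not be obvious at first glance. The sketch below isolates it; `letter_index` is a made-up name, and it assumes it is handed exactly one lowercase letter.

```shell
#!/bin/bash
# Leading space so that 'a' lands at position 1, as in the lcl string above.
lcl=' abcdefghijklmnopqrstuvwxyz'

# Hypothetical helper: print the 1-based alphabet position of one lowercase letter.
letter_index() {
    local ch=$1
    # Remove the letter and everything after it; what remains is the
    # leading space plus all the letters that come before it.
    local prefix=${lcl%%"$ch"*}
    echo "${#prefix}"
}

letter_index a   # prints 1
letter_index e   # prints 5
letter_index z   # prints 26
```

No subprocess is started for the lookup, which is why this stays within pure bash.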

For comparison, consider the same function in Python:

    import string

    # map each letter to its value: a/A=1 ... z/Z=26
    lets={k:v for v,k in enumerate(string.ascii_lowercase, 1)}
    lets.update({k:v for v,k in enumerate(string.ascii_uppercase, 1)})

    with open('/tmp/words.txt') as f:
        for word in f:
            word=word.strip()
            # characters that are not letters count as 0
            if sum(lets.get(c,0) for c in word)==100:
                print(word)

This is easier to understand and executes in 580 ms.

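Bash is great for gluing different tools together. It is not great at large processing tasks. For large tasks, use awk, perl, python, ruby, etc. They are easier to write, read, and understand, and they are faster.

Related discussion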

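Since you're looking for ways to speed up the processing, here is a tweak of the solution provided by user agc.

I pulled out the man/tr/sort pipeline and dumped the results into a file (Words.txt) to simulate the original problem, where the file already exists (i.e., I wanted to keep man/tr/sort out of the timings):

    man bash csh dash ksh busybox find file sed tr gcc perl python make |
    tr '[:upper:][ \t]' '[:lower:]\n' |
    sort -u > Words.txt

The gist of this tweak is to replace the eval/sed subprocess calls with a loop that steps through the characters of a valid word. [See the post "How to perform a for loop on each character in a string in Bash?" for more details, in particular the solutions provided by users Thunderbeef and Six.]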

    #!/bin/bash
    # make an Associative Array of the 26 letters and values.

    declare -A lval=\($(seq 26 | for i in \[{a..z}\] ; do read x ; echo $i=$x ; done)\)

    while read word
    do
        # skip words that contain a non-letter
        [[ ! "${word}" =~ ^[a-z]+$ ]] && continue

        sum=0

        # process ${word} one character at a time

        while read -n 1 char
        do
            # here string dumps a newline on the end of ${word}, so we'll
            # run a quick test to break out of the loop for a non-letter

            [[ "${char}" != [a-z] ]] && break

            sum=$(( sum + lval[${char}] ))

        # from the referenced SO link - see above - the solutions of interest
        # use process substitution and printf to pass the desired string into
        # the while loop; I've replaced this with the 'here' string and added
        # the test to break the loop when we see the newline character.

        #done < <(printf %s "${word}")
        done <<< "${word}"

        (( sum == 100 )) && \
        echo "${word}"

    done < Words.txt

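Timings for 3 different tests (first 10 strings), run in a Linux VM on an old i5:

agc's solution: 37 seconds
the above solution w/ process substitution: 11 seconds
the above solution w/ here string: 2.7 seconds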

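Edit: some notes on what the various commands are doing...

$(seq 26 | for/do/read/echo/done) generates the list of strings "[a]=1 [b]=2 ... [z]=26".
declare -A lval=\( $(seq...done) \): declares lval as an associative array and loads it with the 26 entries ([a]=1 [b]=2 ... [z]=26).
=~ tests against a specific pattern; ^ anchors the start of the string, $ anchors the end, [a-z] matches any character between a and z, and + matches 1 or more of them.
"${word}" =~ ^[a-z]+$ evaluates to true if $word a) consists only of the letters a-z and b) has at least one letter.
! negates the pattern test; in this case I'm looking for any word that has a non-letter character [note: there are many ways to test for a specific pattern; this just happens to be the one I chose for this script].
[[ ! "${word}" ... ]] && continue: if the word contains a non-letter, the test yields true and (&&) we then continue (i.e., we're not interested in this word, so we skip to the next one; in other words, jump to the next iteration of the loop).
while read -n 1 char: parses the input (in this case ${word}, passed in as a 'here' string) 1 character at a time, placing the resulting character in the variable named 'char'.
[[ "${char}" != [a-z] ]] && break: another/different way to do pattern matching; here we test the single-character $char variable to see whether it is NOT a letter, and if so (i.e., the test evaluates to true) we break out of the current loop; if $char is a letter (a-z), processing continues with the next command in the loop (sum=... in this case).
(( sum == 100 )) && \ echo "${word}": yet another way to run a test; in this case we test whether the sum of the letters is 100, and if that evaluates to true we also echo "${word}" [note: the backslash (\) continues the command on the next line].
done <<< "${word}": <<< is called a 'here' string; in this case it lets me pass the current string (${word}) as input to the while read -n 1 char loop.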
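To experiment with the pieces described in these notes without needing a word file, a minimal sketch follows. It swaps the seq pipeline for a plain loop when filling the table (a substitution made here for readability, not what the script above does), and the function name `sum_word` is made up. It assumes bash 4 or later.

```shell
#!/bin/bash
# Plain-loop alternative for filling the letter-value table (bash 4+).
declare -A lval
n=0
for c in {a..z}; do
    lval[$c]=$(( ++n ))
done

# Hypothetical helper: sum a word's letters by reading a here string
# one character at a time, as the inner loop of the script does.
sum_word() {
    local word=$1 sum=0 char
    [[ $word =~ ^[a-z]+$ ]] || return 1      # reject anything with a non-letter
    while IFS= read -r -n 1 char; do
        [[ $char != [a-z] ]] && break        # the trailing newline ends the word
        sum=$(( sum + lval[$char] ))
    done <<< "$word"
    echo "$sum"
}

sum_word abettors    # prints 100
```

Calling `sum_word` on a word with a digit or apostrophe prints nothing and returns a non-zero status, which mirrors the `continue` in the script.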

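Related discussion

Works great! Unfortunately, when I tried it with my original Words.txt (which has 370101 words), it did nothing. I got it from github.com/dwyl/english-words/blob/master/words_alpha.txt; it is the words_alpha.txt file. Also, do you know of any good places to learn more about bash syntax? I'm very new to bash, so I don't understand much of your code! Thanks!
@Erik: I downloaded that file and ran it through my script (the here-string version); it found the first 10 words in 3.6 seconds [abactinally, abatements, abbreviatable, abettors, abomasusi, abreption, abrogative, absconders, absinthol, absorbancy]; my copy of agc's script found the same 10 words after about 65 seconds; as for why your script 'does nothing' ... hard to say without seeing the code; perhaps you could update the original post with a new section showing the latest version of your script.
@Erik: as for learning bash ... google is your friend; [note: I normally code in ksh, so I had to do some googling of my own to figure out how to convert my ksh ideas to bash]; and yes, I do understand what all of these commands are doing in my script, but I'm not sure which commands you have questions about, so you'll have to ask
Added some notes about the various commands in the script
Oh! I wasn't asking whether you know what's going on in your code, I meant that I barely understand any of it! Sorry if I offended you. Also, what I meant is that I tried your code on my original dictionary and it just 'sits' there without outputting anything. Could there be a compatibility issue? I'm using CentOS. Thanks so much for your help.
Actually, I disagree that Google is a friend to anyone learning bash: it turns up a lot of bad practices and harmful examples (the Advanced Bash-Scripting Guide is full of them). It is much better to work from known-good references: the BashGuide, the bash-hackers wiki, etc.
@Erik, ...if you want to see whether it's really doing nothing, running bash -x scriptname (perhaps PS4=':$LINENO+' bash -x scriptname, to log the source line of each command) is a good place to start.
@Erik: no offense taken; I only mentioned it because, while I did say I used Google for some bash research, I made sure I understood what each command does (i.e., I didn't just cut-n-paste and cross my fingers) ;-)
The pure-bash approach of this answer works here, and it is about 3x as fast as my revised answer on an i3 (i.e., 14s vs. 46s on the man-based words.txt).
OK, thanks everyone for the help; I think I'll leave this alone for now. I may come back to these answers and try them out, but for now I think that's everything.
The reason this is so slow is the while read -n 1 char ... done <<< "${word}" loop. It creates a temporary file for every word fed into the loop.
@dawg: facepalm ... and, since this particular VM runs on top of a TrueCrypt partition (long story), those pesky disk I/Os are going to be slower than 'normal'; thanks for the reality check!
You could try replacing the here string done <<< "${word}" with done < <(printf '%s\n' "$word"), which creates an anonymous FIFO. Better still would be a C-style for loop that steps through the string and avoids the issue entirely.
@dawg: process substitution (aka anon FIFO?) eliminates the temp file but actually runs quite a bit slower than the here-string code; I'll look at the C-style for loop, but I'm not too fond of the repeated substringing (though the performance issue is greater for longer strings).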

Note: skip to method 3 for the faster approach.

Method 1: one loop, one (long) stream:

    # make an Associative Array of the 26 letters and values.
    declare -A lval=\($(seq 26 | for i in \[{a..z}\] ; do read x; echo $i=$x ; done)\)
    # spew out 240,000 words from some man pages.
    man bash csh dash ksh busybox find file sed tr gcc perl python make |
    tr '[:upper:][ \t]' '[:lower:]\n' | sort -u |
    while read x ; do
        [ "$x" = "${x//[^a-z]/}" ] &&
        (( 100 == $(sed 's/./lval[&]+/g' <<< $x) 0 )) &&
        echo "$x"
    done | head

Output, printing the first 10 words (takes about 13 seconds on an Intel Core i3-230M):

    accumulate
    activates
    addressing
    allmulti
    analysis
    applying
    augments
    backslashes
    bashopts
    boundary

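How it works:

Make all the words lowercase, then sort them uniquely.

If a word contains only lowercase letters, run the test and maybe print it.

The test uses sed to convert a word (say "foo") into bash code equivalent to (( ${lval[f]}+${lval[o]}+${lval[o]}+0 )), i.e. a list of associative array values to add up.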
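To see what sed actually hands to the arithmetic context, the transformation can be run on its own. This is a small sketch: the table is filled with a plain loop instead of the seq pipeline, and `word_to_expr` is a made-up name.

```shell
#!/bin/bash
# Letter-value table, filled with a plain loop (bash 4+).
declare -A lval
n=0
for c in {a..z}; do lval[$c]=$(( ++n )); done

# Hypothetical helper: emit the arithmetic text that sed builds for a word.
word_to_expr() {
    sed 's/./lval[&]+/g' <<< "$1"
}

expr_text=$(word_to_expr foo)
echo "${expr_text} 0"          # prints: lval[f]+lval[o]+lval[o]+ 0
echo $(( ${expr_text} 0 ))     # prints: 36
```

The trailing `0` is there only to give the last `+` something to add, so the generated text is a valid expression.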

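Method 2: much the same as the first method above, except that the part using sed is replaced with:

    (( 100 == $( hexdump -ve '/1 "(%3i - 96) + " ' <<< $x ;) 86 ))

Here hexdump dumps an equation built from the decimal ASCII codes (see man ascii and the "Examples" section of man hexdump). The input "foo" outputs:

    (102 - 96) + (111 - 96) + (111 - 96) + ( 10 - 96) +

The - 96 is an offset, but since hexdump also dumps the newline (ASCII 10 decimal), adding 86 at the end corrects for that.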
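The arithmetic behind that correction: the newline contributes (10 - 96) = -86, so a final + 86 cancels it. The sketch below rebuilds the same expression with printf instead of hexdump, purely so it runs where hexdump is not installed; `ascii_expr` is a made-up name and ASCII input is assumed.

```shell
#!/bin/bash
# Hypothetical helper: build the "(code - 96) + ..." text for a word,
# including the newline that a here string would append.
ascii_expr() {
    local word=$1 i code out=
    for (( i = 0; i < ${#word}; i++ )); do
        printf -v code '%d' "'${word:i:1}"   # ASCII code of the character
        out+="($code - 96) + "
    done
    out+="(10 - 96) + "                      # the trailing newline
    echo "$out"
}

ascii_expr foo                     # prints: (102 - 96) + (111 - 96) + (111 - 96) + (10 - 96) +
echo $(( $(ascii_expr foo) 86 ))   # 6 + 15 + 15 - 86 + 86, prints 36
```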

Code:

    while read x ; do
        [ "$x" = "${x//[^a-z]/}" ] &&
        (( 100 == $( hexdump -ve '/1 "(%3i - 96) + " ' <<< $x ;) 86 )) &&
        echo "$x"
    done < words.txt

It is about 20% faster than the associative array method.

Method 3: software tools pre-loop method, using paste and a single instance each of hexdump, sed, tr, and egrep. First make the word list (3 seconds), as in markp's answer:

    man bash csh dash ksh busybox find file sed tr gcc perl python make |
    tr '[:upper:][ \t]' '[:lower:]\n' | sort -u | egrep '^[a-z]+$' > words.txt

Then paste all the words next to their respective equations (see the previous method), feed those to the loop, and print the dollar words:

    paste words.txt \
        <(hexdump -ve '/1 "%3i " ' < words.txt |
          sed 's/ *[^12]10[^0-9] */\n/g;s/^ //;s/ $//' |
          sed 's/ \+\|$/ + -96 + /g;s/ + $//'
         ) |
    while read a b ; do (( 100 == $b )) && echo $a ; done

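Doing the processing before the loop is a big improvement. It takes about one second to print the whole list of dollar words.

How it works: what is wanted is a decdump (i.e. a decimal dump) with each word on its own line. Since hexdump can't do that, sed is used to turn all the 10s (i.e. the newline codes) into actual newlines, and then things continue as in method 2 above.

Related discussion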

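As @thatotherguy pointed out in a comment, there are two big problems here. First, the way you read lines from the file reads the entire file for every line. That is, to read the first line you run sed -n "1"p Words.txt, which reads the whole file and prints only the first line; then you run sed -n "2"p Words.txt, which reads the whole file again and prints only the second line; and so on. To fix this, use a while read loop:

    while read word; do
        ...
    done <Words.txt

Note that if anything inside the loop tries to read from standard input, it will steal some of the input from Words.txt. In that case, you can send the file over FD 3 instead of standard input, with while read -u3 ... done 3<Words.txt.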
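A small sketch of the FD 3 variant, using a throwaway file in place of Words.txt. The function name `read_via_fd3` and the sample data are made up for illustration.

```shell
#!/bin/bash
# Throwaway word list standing in for Words.txt (hypothetical data).
tmp=$(mktemp)
printf '%s\n' alpha beta gamma > "$tmp"

# Hypothetical helper: iterate over a file on FD 3 while the loop body
# is still free to read from standard input.
read_via_fd3() {
    local word extra
    while IFS= read -r -u3 word; do
        # This read consumes standard input, not the word list on FD 3.
        IFS= read -r extra
        echo "$word:$extra"
    done 3< "$1"
}

printf '%s\n' 1 2 3 | read_via_fd3 "$tmp"
# prints:
# alpha:1
# beta:2
# gamma:3

rm -f "$tmp"
```

With a plain `done < "$1"` instead, the inner `read` would swallow every other word from the list.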

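The second problem is this bit:

    occurences["$i"]=$(echo $word | grep ${letter["$i"]} -o | wc -l)

...which creates 3 subprocesses (echo, grep, and wc). That wouldn't be too bad, except that it runs 26 times for every word in the file. Creating processes is expensive compared with most shell operations, so you should do your best to avoid it, especially in loops that run many times. Try this:

    matches="${word//[^${letter[i]}]/}"
    occurences[i]="${#matches}"

This works by replacing every character that is not ${letter[i]} with "" (the empty string), then looking at the length of the resulting string. The parsing happens entirely within the shell process, so it should be much faster.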
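The same expansion, wrapped in a function so it can be tried on a single word. The name `count_letter` is made up, and it assumes the second argument is a single ordinary letter.

```shell
#!/bin/bash
# Hypothetical helper: count how often one letter occurs in a word
# without starting any subprocess.
count_letter() {
    local word=$1 letter=$2
    local matches=${word//[^$letter]/}   # delete every character that is not $letter
    echo "${#matches}"                   # what is left is one character per occurrence
}

count_letter banana a   # prints 3
count_letter banana n   # prints 2
count_letter banana z   # prints 0
```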

Related discussion

Let's try this with awk.

Note: I'm not a big awk user, so there may be ways to tweak this for more speed.

    awk '
    # initialize an array of character-to-number values
    BEGIN {
        # split our alphabet into an array: c[1]=a c[2]=b ... c[26]=z;
        # NOTE: assumes input is all lower case, otherwise we could either
        # add array values for upper case letters or modify processing to
        # convert all characters to lower case ...
        split("abcdefghijklmnopqrstuvwxyz", c, "")

        # build associative array to match letters w/ numeric values:
        # ord[a]=1 ord[b]=2 ... ord[z]=26
        for (i=1; i <= 26; i++) {
            ord[c[i]]=i
        }
    }
    # now process our file of words
    {
        # loop through words; just in case more than 1 word per line (ie, NF > 1)
        word=1
        while ( word <= NF ) {
            sum=0

            # split our word into an array of characters
            split($word, c, "")

            # loop through our array of characters
            for (i=1; i <= length($word); i++) {

                # if not a letter then break out of loop
                if ( c[i] !~ /[a-z]/ ) {
                    sum=999
                    break
                }

                # add letter to our running sum
                sum=sum + ord[c[i]]

                # if we go over 100 then break
                if ( sum >= 101 )
                    break
            } # end of character loop

            if ( sum == 100 )
                print $word

            word++
        } # end of word loop
    }' Words.txt

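I ran some tests against the whole words.txt file:

my earlier bash solution: let's not talk about how slow it is on my machine!
dawg's bash solution: 3 min 32 sec (about 2x slower than on dawg's machine)
the awk solution above: 3.5 seconds (it will be faster on anything other than my computer)

Related discussion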