Pass multiple files, based on date, in the same directory as input to MapReduce
I need to use multiple files from the same directory, selected by a specific date, as the input to a MapReduce job.
I don't know how I should do this.
hadoop jar EventLogsSW.jar EventSuspiciousWatch /user/hdfs/eventlog/*.snappy /user/hdfs/eventlog_output/op1
Example: from the eventlog directory, I need only the files for the given date to be processed.
The eventlog directory gets its log data from a FlumeLogger agent, so it receives around 1000 new files every day. I need only the current date's files for my process.
Thanks.
Regards, Mohan.
You can use the bash date command for this.
For example, running the job as below will look for the files named with today's date:
hadoop jar EventLogsSW.jar EventSuspiciousWatch /user/hdfs/eventlog/$(date +%Y-%m-%d).snappy /user/hdfs/eventlog_output/$(date +%Y-%m-%d)
To get a specific date format, see this answer or type man date in your terminal.
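For instance, with GNU date (the printed dates below are only illustrative):

$ date +%Y-%m-%d
2017-06-14
$ date +%d%m%Y
14062017
$ date --date='1 day ago' +%Y-%m-%d
2017-06-13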
Update after more details were provided:
1. Explanation:
$ file=$(hadoop fs -ls /user/cloudera/*.snappy | grep $(date +%Y-%m-%d) | awk '{print $NF}')
$ echo $file
/user/cloudera/xyz.snappy
$ file_out=$(echo $file | awk -F '/' '{print $NF}' | awk -F '.' '{print $1}')
$ echo $file_out
xyz
$ hadoop jar EventLogsSW.jar EventSuspiciousWatch $file /user/hdfs/eventlog_output/$file_out
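As a side note, the two chained awk calls can also be replaced with basename; a small equivalent sketch (this variant is mine, not part of the original commands):

$ file_out=$(basename $file .snappy)   # strips the path and the .snappy suffix
$ echo $file_out
xyz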
2. Make a shell script to reuse these commands every day… in a more logical way:
This script can process multiple files for the current system date in HDFS, all of them as one job:
#!/bin/sh

# get today's snappy files
files=$(hadoop fs -ls /user/hdfs/eventlog/*.snappy | grep $(date +%Y-%m-%d) | awk '{print $NF}')

counter=0

# only process if today's file(s) are available...
# (note: $? after the pipeline above is awk's exit status, which is 0 even
# when grep matches nothing, so test the variable itself instead)
if [ -n "$files" ]
then
    # file(s) found, now create today's dir
    hadoop fs -mkdir /user/hdfs/eventlog/$(date +%Y-%m-%d)
    # move each file to today's dir
    for file in $files
    do
        hadoop fs -mv $file /user/hdfs/eventlog/$(date +%Y-%m-%d)/
        counter=$(($counter + 1))
    done
    # run hadoop job on the whole directory
    hadoop jar EventLogsSW.jar EventSuspiciousWatch /user/hdfs/eventlog/$(date +%Y-%m-%d) /user/hdfs/eventlog_output/$(date +%Y-%m-%d)
fi

echo "Total processed file(s): $counter"
echo "Done processing today's file(s)..."
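To actually run this every day, the script can be scheduled with cron; a minimal sketch, assuming the script is saved as /home/hdfs/process_eventlog.sh (the path and the time are hypothetical):

# crontab entry: run the script daily at 00:05 and append its output to a log
5 0 * * * /home/hdfs/process_eventlog.sh >> /home/hdfs/process_eventlog.log 2>&1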
For the current system date, this script can process multiple files in HDFS one file at a time (one job per file):
#!/bin/sh

# get today's snappy files
files=$(hadoop fs -ls /user/hdfs/eventlog/*.snappy | grep $(date +%Y-%m-%d) | awk '{print $NF}')

counter=0

# only process if today's file(s) are available (see note in the previous script)
if [ -n "$files" ]
then
    for file in $files
    do
        echo "Processing file: $file ..."
        # derive the output dir name from the file name (strip path and extension)
        file_out=$(echo $file | awk -F '/' '{print $NF}' | awk -F '.' '{print $1}')
        # run hadoop job; $file already holds the full HDFS path
        hadoop jar EventLogsSW.jar EventSuspiciousWatch $file /user/hdfs/eventlog_output/$file_out
        counter=$(($counter + 1))
    done
fi

echo "Total processed file(s): $counter"
echo "Done processing today's file(s)..."
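If a past day ever needs to be reprocessed, the repeated $(date +%Y-%m-%d) calls can be replaced by a single variable taken from the command line; a minimal sketch of that change (the DAY variable and its default are my assumption, not part of the original script):

#!/bin/sh
# use the first argument as the date, defaulting to today
DAY=${1:-$(date +%Y-%m-%d)}
files=$(hadoop fs -ls /user/hdfs/eventlog/*.snappy | grep $DAY | awk '{print $NF}')
echo "Files for $DAY: $files"

Called as ./script.sh 2017-06-01, it would pick up that day's files instead of today's.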