How to download FTP URLs that meet certain conditions?
I have an FTP link that contains links to some files I am interested in downloading:
ftp://lidar.wustl.edu/phelps_rolla/
I can list all of the URLs with the following:
```python
import urllib2
import BeautifulSoup

request = urllib2.Request("ftp://lidar.wustl.edu/Phelps_Rolla/")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
```

```
>>> soup
drwxrwxrwx 1 user group      0 Nov  7 2012 .
drwxrwxrwx 1 user group      0 Nov  7 2012 ..
drwxrwxrwx 1 user group      0 Nov  7 2012 ESRI_Grids
drwxrwxrwx 1 user group      0 Nov  7 2012 ESRI_Shapefiles
drwxrwxrwx 1 user group      0 Nov  7 2012 LAS_Files
-rw-rw-rw- 1 user group 545700 May 27 2011 LiDAR Accuracy Report_Rolla.pdf
drwxrwxrwx 1 user group      0 Nov  7 2012 Rolla Survey
-rw-rw-rw- 1 user group   4865 May 26 2011 Rolla_SEMA_Tile_Index.dbf
-rw-rw-rw- 1 user group    503 May 26 2011 Rolla_SEMA_Tile_Index.prj
-rw-rw-rw- 1 user group    188 May 26 2011 Rolla_SEMA_Tile_Index.sbn
-rw-rw-rw- 1 user group    124 May 26 2011 Rolla_SEMA_Tile_Index.sbx
-rw-rw-rw- 1 user group   1100 May 26 2011 Rolla_SEMA_Tile_Index.shp
-rw-rw-rw- 1 user group  12682 May 31 2011 Rolla_SEMA_Tile_Index.shp.xml
-rw-rw-rw- 1 user group    140 May 26 2011 Rolla_SEMA_Tile_Index.shx
```
How can I download only the links whose names contain "tile" or "Tile" and that have the extension ".dbf", ".prj", ".shp", or ".shx"?
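You are using urllib and Beautiful Soup, but ftplib, the standard library module dedicated to FTP, is probably a better choice here. Head to its documentation and read how to connect to an FTP server, open a connection, and list a directory; there is a simple walkthrough there.

The next step is to work out how to filter the files. That comes down to a list comprehension that keeps only the strings containing certain substrings; see, for example, this question or this question. Finally, you need to search for how to download a file over FTP, and you will find this question. It turns out that a file is downloaded by calling a single method on the FTP connection (`retrbinary` in the script below).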
Here is a simple script that does everything mentioned above:
```python
from ftplib import FTP

ftp = FTP("lidar.wustl.edu")
ftp.login()  # no arguments: anonymous login
ftp.cwd("Phelps_Rolla")

# list the names in the current directory with ftplib
file_list = ftp.nlst()

wanted_exts = (".dbf", ".prj", ".shp", ".shx")
for f in file_list:
    # apply your filters: "tile" in any letter case, plus one of the wanted extensions
    name = f.lower()
    if "tile" in name and name.endswith(wanted_exts):
        # download the file by sending the "RETR <name of file>" command;
        # retrbinary calls out.write for each block of binary data it receives
        with open(f, "wb") as out:
            ftp.retrbinary("RETR {}".format(f), out.write)
        print("downloaded {}".format(f))

ftp.quit()
```
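If you want to check the filter before touching the server, you can pull it out into a small function and run it on the names from the listing in the question. This is only a sketch: `select_tile_files` and its parameters are names made up for illustration and are not part of ftplib.

```python
def select_tile_files(names, keyword="tile",
                      extensions=(".dbf", ".prj", ".shp", ".shx")):
    """Return the names that contain `keyword` and end with one of `extensions`.

    Both checks ignore letter case. This helper is hypothetical; it only
    mirrors the filter used in the download loop.
    """
    return [
        name for name in names
        if keyword in name.lower() and name.lower().endswith(extensions)
    ]


# Names copied from the directory listing shown in the question.
names = [
    "ESRI_Grids",
    "LiDAR Accuracy Report_Rolla.pdf",
    "Rolla_SEMA_Tile_Index.dbf",
    "Rolla_SEMA_Tile_Index.prj",
    "Rolla_SEMA_Tile_Index.sbn",
    "Rolla_SEMA_Tile_Index.sbx",
    "Rolla_SEMA_Tile_Index.shp",
    "Rolla_SEMA_Tile_Index.shp.xml",
    "Rolla_SEMA_Tile_Index.shx",
]

selected = select_tile_files(names)
for name in selected:
    print(name)
```

On this listing it keeps the four `Rolla_SEMA_Tile_Index` files ending in `.dbf`, `.prj`, `.shp`, and `.shx`. Note that `.shp.xml` is skipped because the name ends with `.xml`; if you want that file too, add `".xml"` to `extensions`.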