Adding a for loop to a working web scraper (Python and BeautifulSoup)
I have a question about adding a for loop to an already-working web scraper so that it runs over a list of webpages. It is probably only two or three simple lines of code that I am looking at.
I appreciate that this question may well have been asked and answered many times before, but I have been struggling to get this code to work for quite a while now. I am relatively new to Python and looking to improve.
Background information:
I have written a web scraper using Python and BeautifulSoup that successfully fetches a webpage from TransferMarkt.com and scrapes all the weblinks it needs. The script consists of two parts:
1. The first part takes a single league webpage, e.g. the Premier League, and extracts the weblinks for the individual teams in the league table, putting them into a list.
2. The second part then uses that list of team links to scrape individual player information from each team page.
My question is about how to add a for loop to the first part of this scraper so that it extracts team links not just from a single league webpage, but from a list of league webpages.
Below I have included an example of a football league webpage, my scraper code, and the output.
Example:
Example webpage to scrape (the Premier League - code GB1): https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/gb1/plus/?saison_id=2019
Code (part 1 of 2) - scraping the individual team links from the league webpage:
# Python libraries

## Data preprocessing
import pandas as pd

## Data scraping libraries
from bs4 import BeautifulSoup
import requests

# Assign a league by its code, e.g. Premier League = 'GB1', to the list_league_selected variable
list_league_selected = 'GB1'

# Assign a season by year to the season variable, e.g. 2014/15 season = 2014
season = '2019'

# Web scraper script

## Process the league table
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = 'https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/' + list_league_selected + '/plus/?saison_id=' + season
tree = requests.get(page, headers=headers)
soup = BeautifulSoup(tree.content, 'html.parser')

## Create an empty list to assign the team links to - team_links
team_links = []

## Extract all links with the correct CSS selector
links = soup.select("a.vereinprofil_tooltip")

## We need the location that each link points to, and we only need every third link
## (indices 1, 4, 7, ...), so loop through those only
for i in range(1, 59, 3):
    team_links.append(links[i].get("href"))

## For each link location, prepend the website address so it can be requested later
for i in range(len(team_links)):
    team_links[i] = "https://www.transfermarkt.co.uk" + team_links[i]

# View the list of team weblinks assigned to the variable team_links
team_links
Output:
Links extracted from the example webpage (the page has 20 links in total; only 4 are shown here):
team_links = ['https://www.transfermarkt.co.uk/manchester-city/startseite/verein/281/saison_id/2019',
              'https://www.transfermarkt.co.uk/fc-liverpool/startseite/verein/31/saison_id/2019',
              'https://www.transfermarkt.co.uk/tottenham-hotspur/startseite/verein/148/saison_id/2019',
              'https://www.transfermarkt.co.uk/fc-chelsea/startseite/verein/631/saison_id/2019',
              ...,
              'https://www.transfermarkt.co.uk/sheffield-united/startseite/verein/350/saison_id/2019']
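As a side note on part 1, the hard-coded range(1, 59, 3) only works because each team's link appears several times among the a.vereinprofil_tooltip matches. A rough sketch of an alternative (assuming the duplicate anchors for a team share the same href, and reusing the soup object from above) is to de-duplicate the hrefs instead of relying on a fixed index range:

# Sketch only (not the original code above): build team_links by de-duplicating
# the hrefs returned by the selector, instead of using a hard-coded index range.
base_url = "https://www.transfermarkt.co.uk"

seen = set()
team_links = []
for a in soup.select("a.vereinprofil_tooltip"):
    href = a.get("href")
    if href and href not in seen:   # keep each team page only once
        seen.add(href)
        team_links.append(base_url + href)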
Then, using this list of team weblinks:
Code (part 2 of 2) - scraping individual player information using the team_links list:
# Create an empty DataFrame for the data, df
df = pd.DataFrame()

# Run the scraper through each of the links in the team_links list
for i in range(len(team_links)):

    # Download and process the team page
    page = team_links[i]
    df_headers = ['position_number', 'position_description', 'name', 'dob', 'nationality', 'value']
    pageTree = requests.get(page, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'lxml')

    # Extract all data
    position_number = [item.text for item in pageSoup.select('.items .rn_nummer')]
    position_description = [item.text for item in pageSoup.select('.items td:not([class])')]
    name = [item.text for item in pageSoup.select('.hide-for-small .spielprofil_tooltip')]
    dob = [item.text for item in pageSoup.select('.zentriert:nth-of-type(4):not([id])')]
    nationality = ['/'.join([i['title'] for i in item.select('[title]')]) for item in pageSoup.select('.zentriert:nth-of-type(5):not([id])')]
    value = [item.text for item in pageSoup.select('.rechts.hauptlink')]

    df_temp = pd.DataFrame(list(zip(position_number, position_description, name, dob, nationality, value)), columns=df_headers)
    df = df.append(df_temp)  # This last line of code is mine; it appends the temporary data to the master DataFrame, df

# View the pandas DataFrame
df
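One note on this second part: DataFrame.append is deprecated in recent pandas releases (and removed in pandas 2.0), so on a newer install the usual pattern is to collect the per-team frames in a list and call pd.concat once at the end. A minimal sketch of that change, showing only two of the columns for brevity and reusing headers and team_links from above:

# Sketch only: collect the per-team frames in a list and concatenate once,
# instead of calling df = df.append(df_temp) inside the loop.
frames = []
for page in team_links:
    pageTree = requests.get(page, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'lxml')
    names = [item.text for item in pageSoup.select('.hide-for-small .spielprofil_tooltip')]
    values = [item.text for item in pageSoup.select('.rechts.hauptlink')]
    frames.append(pd.DataFrame(list(zip(names, values)), columns=['name', 'value']))

df = pd.concat(frames, ignore_index=True)   # one concat replaces the repeated appends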
My question - adding a for loop to run through all the leagues:
What I need to do is replace the single league code assigned to list_league_selected in the first part of my code with a list of league codes, for example:
list_all_leagues = ['L1', 'GB1', 'IT1', 'FR1', 'ES1']  # codes for the top 5 European leagues
I have read through several suggested solutions, but I am struggling to get the loop to work and to append the full list of team webpages in the right place. I believe I am really close to finishing my scraper now, and any advice on how to create this for loop would be greatly appreciated!
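For reference, this is roughly the shape of loop I have in mind: wrap the part 1 code in a small helper that returns the team links for one league code, then call it for every code in list_all_leagues (get_team_links is only an illustrative name, and I am assuming the every-third-link pattern from part 1 also holds for the other leagues):

# Rough sketch of the loop in question; get_team_links is a hypothetical wrapper
# around the part 1 code that returns the team links for a single league code.
def get_team_links(league_code, season='2019'):
    page = ('https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/'
            + league_code + '/plus/?saison_id=' + season)
    soup = BeautifulSoup(requests.get(page, headers=headers).content, 'html.parser')
    links = soup.select("a.vereinprofil_tooltip")
    return ["https://www.transfermarkt.co.uk" + links[i].get("href")
            for i in range(1, len(links), 3)]

list_all_leagues = ['L1', 'GB1', 'IT1', 'FR1', 'ES1']

all_team_links = []
for league in list_all_leagues:
    all_team_links.extend(get_team_links(league))   # one combined list across all leagues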
Thanks in advance for your help!
Actually, I have taken the time to clear up a lot of the mistakes in your code and to shorten it considerably. With the code below you can achieve your goal.
I run the whole scrape inside requests.Session() to maintain the same session throughout the loop; reusing one session helps prevent the site from blocking, refusing, or dropping my requests while scraping.
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'
}

leagues = ['L1', 'GB1', 'IT1', 'FR1', 'ES1']


def main(url):
    with requests.Session() as req:
        # Part 1: collect the team links for every league code
        links = []
        for lea in leagues:
            print(f"Fetching Links from {lea}")
            r = req.get(url.format(lea), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            link = [f"{url[:31]}{item.next_element.get('href')}" for item in soup.findAll(
                "td", class_="hauptlink no-border-links hide-for-small hide-for-pad")]
            links.extend(link)
        print(f"Collected {len(links)} Links")

        # Part 2: visit each team page and extract the player data
        goals = []
        for num, link in enumerate(links):
            print(f"Extracting Page# {num + 1}")
            r = req.get(link, headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            target = soup.find("table", class_="items")
            pn = [pn.text for pn in target.select("div.rn_nummer")]
            pos = [pos.text for pos in target.findAll("td", class_=False)]
            name = [name.text for name in target.select("td.hide")]
            dob = [date.find_next("td").text for date in target.select("td.hide")]
            nat = [" /".join([a.get("alt") for a in nat.find_all_next("td")[1] if a.get("alt")])
                   for nat in target.findAll("td", itemprop="athlete")]
            val = [val.get_text(strip=True) for val in target.select('td.rechts.hauptlink')]
            goal = zip(pn, pos, name, dob, nat, val)
            df = pd.DataFrame(goal, columns=['position_number', 'position_description',
                                             'name', 'dob', 'nationality', 'value'])
            goals.append(df)

        # Combine the per-team frames and write everything to a single CSV file
        new = pd.concat(goals)
        new.to_csv("data.csv", index=False)


main("https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id=2019")
Output: view online.
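As a follow-up usage note (not part of the answer itself), the data.csv file written by main() can be loaded back with pandas for a quick check:

# Small usage sketch: read the CSV written by the scraper and inspect it.
import pandas as pd

df = pd.read_csv("data.csv")
print(df.shape)     # rows = players scraped, columns = the 6 fields defined above
print(df.head())    # first rows: position_number, position_description, name, dob, nationality, value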