Spidering the Web
Spidering (or, more accurately, crawling) the Web is such a common task that plenty of good, ready-to-use tools are already available online. For example, check this page of Top 50 crawlers.
If, like me, you are interested in what these programs do at their core, you can find a simple spider in my GitHub account.
As you can see, with Python and only a few lines of code you can have an honest spider that saves a list of URL links to a text file while using all the cores of your machine. Parsing a page is a CPU-intensive process, so by spawning the work across multiple cores we can get production-level performance.
As a reference, check the code and comments below:
from multiprocessing import Pool
import bs4 as bs
import requests

urls = ['http://martinfowler.com', 'https://www.tutorialspoint.com']


def handle_local_links(url, link):
    # Normalize a raw href into an absolute URL (or drop it).
    print('Url: {}, Link: {}'.format(url, link))
    if link.startswith('//'):
        return url
    elif link.startswith('/'):
        return ''.join([url, link])
    elif link.startswith('http'):
        return link
    elif link.startswith('#'):
        return ''
    else:
        return ''.join([url, link])


def get_links(url):
    try:
        print('Start parsing URL: {}'.format(url))
        response = requests.get(url)
        soup = bs.BeautifulSoup(response.text, 'lxml')
        body = soup.body
        # Collect every href in the page body and normalize it.
        links = [link.get('href') for link in body.find_all('a')]
        links = [handle_local_links(url, link) for link in links]
        # Encode to plain ASCII strings.
        links = [str(link.encode('ascii')) for link in links]
        return links
    except TypeError as type_error:
        # Raised when we try to iterate over None.
        print(type_error)
        print('Got a TypeError, probably a None that we tried to iterate over')
        return []
    except IndexError as index_error:
        print(index_error)
        print('We probably did not find any useful links, returning an empty list')
        return []
    except AttributeError as attribute_error:
        print(attribute_error)
        print('Probably got None for links, so we return an empty list')
        return []
    except Exception as exception:
        print(str(exception))
        return []


def main():
    how_many = 100
    p = Pool(processes=how_many)
    # Parse each start URL in its own worker process.
    data = p.map(get_links, [link for link in urls])
    print('Data after multiprocessing: {}'.format(data))
    # data is now a list of lists; flatten it into a single list.
    data = [url for url_list in data for url in url_list]
    print('Data is {}'.format(data))
    sorted_list = sorted(data)
    print('Sorted list is {}'.format(sorted_list))
    p.close()
    with open('urls.txt', 'w') as f:
        f.write('\n'.join(sorted_list))


if __name__ == '__main__':
    main()
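One note on "using all cores": the pool size above is hard-coded to 100 workers, which is far more than the two start URLs need. If you want the pool to match the actual number of cores on the machine, the standard library can tell you how many are available. The following is a minimal sketch, not part of the original script; it assumes it runs alongside the get_links function and the urls list from the listing above:

import multiprocessing
from multiprocessing import Pool

def main():
    # Size the pool to the machine instead of hard-coding 100 workers.
    how_many = multiprocessing.cpu_count()
    with Pool(processes=how_many) as p:
        data = p.map(get_links, urls)
    # Flatten, sort, and persist the links exactly as before.
    flat = sorted(url for url_list in data for url in url_list)
    with open('urls.txt', 'w') as f:
        f.write('\n'.join(flat))

With only two start URLs the difference is negligible, but sizing the pool to cpu_count() keeps every core busy without paying the overhead of idle worker processes once the URL list grows.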