Web Scraping Tutorial
Some data is easily accessible through APIs. In cases where no API exists, you can still get the data you need with web scraping. In this tutorial, we will scrape a website to get data on cruise arrival and departure dates.
To get started, we need our imports.
The urllib2 module defines functions and classes which help in opening URLs. (In Python 3, this functionality lives in urllib.request.)
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
The csv module is for writing and reading CSV files.
The datetime module supplies classes for manipulating dates and times.
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
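As a rough sketch, the imports described above could look like this on Python 3. The urllib.request import is an assumption standing in for urllib2, which only exists on Python 2; the aliases are conventional rather than required.

```python
import csv                            # reading and writing CSV files
from datetime import datetime         # turning date strings into date objects
from urllib.request import urlopen    # Python 3 replacement for urllib2.urlopen

import pandas as pd                   # DataFrame for the final table
from bs4 import BeautifulSoup         # HTML parsing
```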
We then need to specify which page we want to scrape data from.
Now we can use the urllib2 module to query the website and store the returned HTML in a variable.
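Here is a minimal sketch of these two steps, written for Python 3 with urllib.request in place of urllib2. The URL is a hypothetical placeholder, and fetch_html is an illustrative helper name, not something from the original code.

```python
from urllib.request import urlopen

# Hypothetical placeholder; substitute the schedule page you want to scrape.
PAGE_URL = "https://example.com/cruise-schedule"


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (performs a real network request):
# html = fetch_html(PAGE_URL)
```

Setting a timeout keeps the script from hanging indefinitely if the site is slow to respond.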
With our HTML saved to a variable, we can now use the BeautifulSoup library to find data. In this case we are looking for cruise ship arrival and departure dates. By viewing the HTML, you will notice that a single date row can list multiple cruise ships, so it wouldn’t be easy to connect a date to each corresponding cruise ship from the row alone. But you will also notice that each link tag has a title attribute that includes the ship name and the ship’s arrival and departure dates. With that, we will parse the HTML using BeautifulSoup and then use find_all to get all the link tags.
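The parsing step might be sketched as follows. The sample markup, the ship names, and the title layout are all invented for illustration; the real page will differ, so inspect it first and adjust the tag name passed to find_all.

```python
from bs4 import BeautifulSoup

# Invented sample markup standing in for the downloaded page:
# one date row that lists two ships.
html = """
<table>
  <tr>
    <td>Jan 15</td>
    <td>
      <a href="/ship/1" title="Ocean Star - Arrival: 01/15/2018 - Departure: 01/15/2018">Ocean Star</a>
      <a href="/ship/2" title="Sea Breeze - Arrival: 01/15/2018 - Departure: 01/16/2018">Sea Breeze</a>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# title=True keeps only the <a> tags that actually carry a title attribute.
links = soup.find_all("a", title=True)
print(len(links))
```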
Once we have all the tags, we can use a for loop to add each tag’s title to our all_links list.
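A small sketch of that loop, again using made-up markup. Using link.get("title") rather than link["title"] is a defensive choice here, since navigation links on a real page often have no title.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the last link has no title and should be skipped.
html = (
    '<a title="Ocean Star - Arrival: 01/15/2018 - Departure: 01/15/2018">Ocean Star</a>'
    '<a title="Sea Breeze - Arrival: 01/15/2018 - Departure: 01/16/2018">Sea Breeze</a>'
    '<a href="/about">About</a>'
)
soup = BeautifulSoup(html, "html.parser")

all_links = []
for link in soup.find_all("a"):
    title = link.get("title")  # returns None when the attribute is missing
    if title:
        all_links.append(title)

print(all_links)
```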
We have all our titles, but not in a format that can easily be uploaded into a database. We will want to run through each of those titles and extract the cruise ship and date. The rest of the text within the link isn’t needed for our purposes.
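How you pull the ship and dates out depends entirely on how the site words its titles. The sketch below assumes a hypothetical layout of "SHIP - Arrival: MM/DD/YYYY - Departure: MM/DD/YYYY"; the pattern and the list names are illustrative and will need adapting to the real titles.

```python
import re

# Assumed title layout (hypothetical): "<ship> - Arrival: <date> - Departure: <date>"
TITLE_PATTERN = re.compile(
    r"^(?P<ship>.+?) - Arrival: (?P<arrival>\d{2}/\d{2}/\d{4})"
    r" - Departure: (?P<departure>\d{2}/\d{2}/\d{4})$"
)

all_links = [
    "Ocean Star - Arrival: 01/15/2018 - Departure: 01/15/2018",
    "Sea Breeze - Arrival: 01/15/2018 - Departure: 01/16/2018",
    "Some unrelated link title",
]

ships, arrivals, departures = [], [], []
for title in all_links:
    match = TITLE_PATTERN.match(title.strip())
    if match is None:
        continue  # skip titles that do not describe a ship visit
    ships.append(match.group("ship"))
    arrivals.append(match.group("arrival"))
    departures.append(match.group("departure"))

print(list(zip(ships, arrivals, departures)))
```

Skipping non-matching titles, instead of raising an error, keeps one odd link from stopping the whole run.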
The dates are still strings, and we want to convert them to date objects. We can run another for loop through our dates list and convert each one.
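The conversion can be sketched with datetime.strptime. The "%m/%d/%Y" format string is an assumption about how the site writes its dates; change it to match what the titles actually contain.

```python
from datetime import datetime

# Sample date strings in an assumed MM/DD/YYYY layout.
dates = ["01/15/2018", "01/16/2018"]

parsed_dates = []
for text in dates:
    # strptime raises ValueError if the text does not match the format.
    parsed_dates.append(datetime.strptime(text, "%m/%d/%Y").date())

print(parsed_dates)
```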
With all our data ready to go, we can now add it to a Pandas DataFrame and then generate a CSV file.
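One way this last step might look is shown below. The column names, the sample rows, and the output filename mentioned in the comment are all illustrative choices, not taken from the original code.

```python
from datetime import date

import pandas as pd

# Sample values standing in for the lists built in the earlier steps.
ships = ["Ocean Star", "Sea Breeze"]
arrivals = [date(2018, 1, 15), date(2018, 1, 15)]
departures = [date(2018, 1, 15), date(2018, 1, 16)]

df = pd.DataFrame({"ship": ships, "arrival": arrivals, "departure": departures})

# With no path, to_csv returns the CSV text; pass a filename such as
# "cruise_dates.csv" to write a file instead. index=False drops the row numbers.
csv_text = df.to_csv(index=False)
print(csv_text)
```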
There is much more that can be done with this data. Right now the script only queries one page, but you will probably need cruise dates for a full year. To query all dates for 2018, you can get the completed code from my GitHub.