Web Scraping Tutorial

2018-01-05

Some data is easily accessible through APIs. In cases where no API exists, you can still get the data you need with web scraping. In this tutorial, we will scrape a website to get data on cruise arrival and departure dates.

To get started we need our imports.

import urllib2

The urllib2 module defines functions and classes which help in opening URLs

from bs4 import BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files

import csv

Module for reading and writing CSV files

from datetime import datetime

The datetime module supplies classes for manipulating dates and times

import pandas as pd

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language

# import libraries
import urllib2
from bs4 import BeautifulSoup
import csv
from datetime import datetime
import pandas as pd
from datetime import date

We then need to specify which page we want to scrape data from.

url = 'http://ports.cruisett.com/schedule/United_States_Of_America/534-Port_Canaveral_Florida/January_2018/'

Now we can use the urllib2 module to query the website and return the HTML to a variable.

page = urllib2.urlopen(url)

With our HTML saved to a variable, we can now use the BeautifulSoup library to find our data. In this case we are looking for cruise ship arrival and departure dates. By viewing the HTML, you will notice that a single date row can contain multiple cruise ships. In that case, it wouldn't be easy for us to connect a date to each corresponding cruise ship. But you will also notice that each a tag has a title attribute that includes the ship name along with the ship's arrival and departure dates. With that, we will parse the HTML using BeautifulSoup and then use find_all to get all the a tags.

# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
rows = soup.find_all('a')

Once we have all the tags, we can use a for loop to add each tag's title to our all_links list. Tags with no title attribute raise a KeyError, which we catch and skip.

all_links = []
for html in rows:
    try:
        x = html['title']
        all_links.append(x)
    except KeyError:
        # skip tags that have no title attribute
        pass

We have all our titles, but not in a format that can easily be uploaded into a database for use. We will want to run through each of those titles and pull out the cruise ship and date. The other text within the link title isn't needed for our purposes.

# query saved links and get cruises and dates from link title
cruise = []
dates = []
for a in all_links:
    b = a.rsplit(' to', 1)[0]
    c = a.rsplit(' to', 1)[1]
    # arrival date
    d = b.rsplit(' from ')[1]
    e = d[:-6].strip()
    # depart date
    f = c[:-7].strip()
    g = b.replace(' is in the port ', ' ')
    h = g[:-6]
    # cruise ship name
    i = h.rsplit(' from ', 1)[0]
    cruise.append(i)
    dates.append(e)
    # if arrival is not equal to depart, add depart date to list
    if e != f:
        cruise.append(i)
        dates.append(f)
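
To see what the slicing is doing, here is the same logic applied to a single title string. The title below is a hypothetical example of the format the slicing assumes (ship name, "is in the port from", an arrival date and time, "to", and a departure date and time); the real titles on the site may differ slightly.

```python
# hypothetical title format assumed by the slicing above
title = 'Disney Dream is in the port from 05-Jan 08:00 to 07-Jan 16:00.'

before_to = title.rsplit(' to', 1)[0]  # 'Disney Dream is in the port from 05-Jan 08:00'
after_to = title.rsplit(' to', 1)[1]   # ' 07-Jan 16:00.'

arrival = before_to.rsplit(' from ')[1][:-6].strip()  # drop ' 08:00' -> '05-Jan'
depart = after_to[:-7].strip()                        # drop ' 16:00.' -> '07-Jan'

ship = before_to.replace(' is in the port ', ' ')[:-6].rsplit(' from ', 1)[0]

print(ship, arrival, depart)  # Disney Dream 05-Jan 07-Jan
```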

The dates are still strings, and we want to convert them to a date format. We can run another for loop through our dates list and convert each one.

# format date
for l in range(len(dates)):
    m = dates[l].replace('-', ' ')
    n = m + ' 2018'
    o = datetime.strptime(n, '%d %b %Y').date()
    dates[l] = o
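
As a quick check of the conversion, a single date string like '05-Jan' becomes a datetime.date object once the year is appended (2018 is hard-coded here, just as it is in the loop above):

```python
from datetime import datetime

# '05-Jan' -> '05 Jan 2018' -> datetime.date(2018, 1, 5)
s = '05-Jan'.replace('-', ' ') + ' 2018'
d = datetime.strptime(s, '%d %b %Y').date()
print(d)  # 2018-01-05
```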

With all our data ready to go, we can now add it to a Pandas DataFrame and then generate a CSV file.

#generate csv
df = pd.DataFrame({'cruise': cruise, 'dates': dates})
df.to_csv('cruise_dates_t.csv')
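
Note that the csv module imported at the top isn't actually needed for this step, since pandas writes the file itself. If you'd rather skip the pandas dependency, the same file could be written with csv directly; a minimal sketch with sample data in place of the scraped lists:

```python
import csv

# sample data standing in for the scraped cruise and dates lists
cruise = ['Disney Dream', 'Disney Dream']
dates = ['2018-01-05', '2018-01-07']

with open('cruise_dates_t.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['cruise', 'dates'])  # header row
    for ship, day in zip(cruise, dates):
        writer.writerow([ship, day])
```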

There is much more that can be done with this data. And right now it's only querying one page, but you will probably need cruise dates for a whole year. To query all dates for 2018, you can get the completed code from my GitHub.
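
As a sketch of the idea, the month and year in the URL can be substituted in a loop to build one URL per month of 2018. The URL pattern below mirrors the January URL used earlier and is assumed to hold for the other months.

```python
# build one schedule URL per month of 2018, assuming the other months
# follow the same URL pattern as the January page used above
base = ('http://ports.cruisett.com/schedule/United_States_Of_America/'
        '534-Port_Canaveral_Florida/{}_2018/')
months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']

urls = [base.format(month) for month in months]
print(urls[0])
# http://ports.cruisett.com/schedule/United_States_Of_America/534-Port_Canaveral_Florida/January_2018/
```

Each URL can then be fetched and parsed with the same urlopen / BeautifulSoup steps shown above, appending to the same cruise and dates lists.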