Consequently, I opened the page in the Firefox DOM inspector and noticed that each title sat in an element with the HTML class 'title'. Surely, the data element of interest could be extracted using some higher-level language!
I did some research and discovered that I could solve this problem easily using Python 2.7 and BeautifulSoup.
Having never used BeautifulSoup before, I was surprised by how unbelievably simple the resulting script turned out to be:
# Python 2.7; needs the requests, beautifulsoup4 and html5lib packages installed
from requests import get
from bs4 import BeautifulSoup

url = 'https://rainforroots.bandcamp.com/album/the-kingdom-of-heaven-is-like-this'
htmlString = get(url).text
html = BeautifulSoup(htmlString, 'html5lib')
# grab every <div> whose class is 'title'
tags = html.find_all('div', {'class':'title'})
text = [t.get_text() for t in tags]
print str(len(text)) + ' items matched:\n'
# ' '.join(s.split()) is a quick hack to remove excess whitespace
for title in text:
    print ' '.join(title.split())
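In case the join/split hack in the script above looks like magic, here is a small sketch of what it does. Calling split() with no arguments breaks a string on any run of whitespace (spaces, tabs, newlines) and drops the empty pieces at the ends, so joining the pieces back with a single space leaves a clean one-line string. The helper name squeeze_whitespace and the sample string are made up for illustration; they are not part of the original script.

```python
def squeeze_whitespace(s):
    # Hypothetical helper: collapse every run of whitespace into one space
    # and trim the ends. split() with no arguments does the heavy lifting.
    return ' '.join(s.split())

# Made-up sample resembling the padded text that get_text() can return
raw = '\n      The   Sower\n\t  03:12   \n'
clean = squeeze_whitespace(raw)
print(repr(clean))
```

The same expression works unchanged in Python 2.7 and Python 3.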
WOW! Clearly, this is a useful library.
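For anyone who wants to try the same idea without hitting the network, here is a minimal sketch that runs the same find-by-class lookup against an inline HTML string. Several things here are my assumptions and not what the script above does: the sample markup is invented (it is not the real Bandcamp page), the function name extract_titles is hypothetical, the built-in 'html.parser' is swapped in for html5lib so there is one less package to install, and print() is written so that it also runs on Python 3.

```python
from bs4 import BeautifulSoup

# Invented sample markup, only loosely shaped like a track listing
SAMPLE_HTML = """
<div class="title">
    <span>First   Song</span>
    <span>03:12</span>
</div>
<div class="title">
    <span>Second Song</span>
</div>
<div class="other">Not a title</div>
"""

def extract_titles(markup):
    # Hypothetical helper: return the cleaned text of every <div class="title">
    soup = BeautifulSoup(markup, 'html.parser')
    tags = soup.find_all('div', {'class': 'title'})
    # Collapse excess whitespace in each match
    return [' '.join(tag.get_text().split()) for tag in tags]

titles = extract_titles(SAMPLE_HTML)
print('%d items matched:' % len(titles))
for title in titles:
    print(title)
```

Only the two divs with class 'title' should come back; the third div is ignored because its class does not match.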