We conducted our first scraping exercises this week. After reviewing some command line and Python basics, we installed pip, a Python package manager, and virtualenv to create isolated Python environments—useful if projects require different libraries and/or versions of Python.
For our assignment to generate a looong list from scraped web text, I thought about search engines as both oracles and confessionals. Initially, I hoped to scrape the headlines from the returns of search queries, specifically the results to “the answer is.” I mean, aren’t ALL the answers online? Seriously, where do you turn if you have a question? Your phone, a person, the card catalog? More importantly, what might you ask if you thought you were anonymous or perhaps didn't realize that your query was being logged for future publication by an algorithm or someone like me?
Scraping Google’s search page proved a different animal from the comparatively straightforward class examples with Craigslist and my own experiments with the NYTimes and Reddit. After my many repeated attempts to solve the puzzle, it didn’t take long for Google to block my IP. Sam suggested I try Bing or DuckDuckGo, and while exploring those options, we couldn’t help but notice the search engines’ autocomplete suggestions for my query. Though I could not locate the specifics of DuckDuckGo's auto-suggest algorithm, Google's autocomplete predictions are "based on several factors, like how often others have searched for a term" as well as trending, popular topics.
With Sam's help and the browser’s Developer Tools, we figured out that DuckDuckGo returns these suggestions as JSON. Fortunately, there’s a JSON parser built directly into Python (so no need for the Beautiful Soup library used to parse HTML), and together we walked through writing the initial lines of code for scraping this kind of response.
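To give a feel for the JSON step, here is a minimal sketch using Python's built-in json module. It is not the code from autocompleteme.py: the sample response body and its "phrase" key are assumptions made for illustration, and the real response may be shaped differently.

```python
import json

# Hypothetical response body, made up for this example.
# The real autocomplete response may use different keys.
sample_body = '[{"phrase": "the answer is 42"}, {"phrase": "the answer is always no"}]'


def extract_phrases(body):
    """Parse a JSON array of objects and return each object's "phrase" value."""
    items = json.loads(body)
    # Skip any object that lacks the (assumed) "phrase" key.
    return [item["phrase"] for item in items if "phrase" in item]


phrases = extract_phrases(sample_body)
for phrase in phrases:
    print(phrase)
```

Because the response is already structured data, there is no markup to dig through: json.loads turns it straight into Python lists and dictionaries.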
My program passes any phrase into the URL request parameters along with each letter of the alphabet, so “the answer is a…” followed by “the answer is b…”. Each individual pass generates a list of auto-suggestions. After it exhausts all 26 letters, it passes the phrase plus two-letter combinations to increase the number of possible returns. So again, “the answer is aa…” followed by “the answer is ab…”. Once I figured out the code and created a working template, I could request results with any phrase I wished. SO MUCH FUN! Thank you, Sam!!
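The letter-by-letter idea can be sketched roughly like this. Again, this is an illustration and not the actual autocompleteme.py: the function names are made up, and the endpoint URL, the "q" parameter, and the "phrase" key in fetch_suggestions are assumptions based on what one might see in the browser's Developer Tools.

```python
import string
from itertools import product


def suffixes(max_len=2):
    """Yield 'a'..'z', then 'aa'..'zz', up to max_len letters."""
    for length in range(1, max_len + 1):
        for combo in product(string.ascii_lowercase, repeat=length):
            yield "".join(combo)


def build_queries(phrase, max_len=2):
    """Return the phrase followed by each letter suffix, in order."""
    return [f"{phrase} {suffix}" for suffix in suffixes(max_len)]


def fetch_suggestions(query, url="https://duckduckgo.com/ac/"):
    """Request suggestions for one query (URL and JSON shape are assumptions)."""
    import requests  # imported here so the rest runs without the library installed

    response = requests.get(url, params={"q": query}, timeout=10)
    response.raise_for_status()
    return [item["phrase"] for item in response.json()]


queries = build_queries("the answer is")
print(len(queries))   # 26 single letters + 676 pairs = 702
print(queries[0])     # the answer is a
print(queries[26])    # the answer is aa
```

A full run would loop over queries, call fetch_suggestions on each, and print every suggestion on its own line. Given how quickly Google blocked my IP, pausing between requests (for example with time.sleep) seems wise.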
Here are my steps for this process:
1. Create a project directory
2. Within that directory, create a virtual environment and activate it
3. Install the requests library
4. Run my Python program and save the results to a new file:
python autocompleteme.py > rawtext.txt
5. Sort the results, remove duplicate lines, and save to a new file:
sort rawtext.txt | uniq > sorted_noduplicates.txt
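If you would rather do step 5 in Python than with sort and uniq, a rough equivalent might look like the sketch below. The function name is made up for illustration, and the ordering can differ slightly from the sort command, which follows your system's locale settings.

```python
def sorted_unique(lines):
    """Return the distinct lines in sorted order, without trailing newlines."""
    return sorted({line.rstrip("\n") for line in lines})


# Small in-memory demonstration; with real files you would pass an open
# file object, e.g. sorted_unique(open("rawtext.txt", encoding="utf-8")).
raw = ["the answer is no\n", "the answer is 42\n", "the answer is no\n"]
for line in sorted_unique(raw):
    print(line)
```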
The code for autocompleteme.py is on GitHub, along with my favorite and the most poignant auto-suggest lists so far: