What is Web Scraping and What is it Used For?
Web scraping, also called web data extraction or harvesting, is a technique in which a computer program collects data from websites using a bot or web crawler (sometimes called a spider). The crawler systematically scans the web and copies specific pages or data points for processing and indexing. The copied data is typically loaded into a local database or spreadsheet.
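To make the "copies specific pages/data" step concrete, here is a minimal sketch using only Python's standard library. The HTML snippet, class names, and product data below are hypothetical; a real crawler would first fetch pages over HTTP before parsing them like this.

```python
# Minimal sketch: extract (product, price) pairs from an HTML snippet
# using only the standard library. Structure and data are hypothetical.
from html.parser import HTMLParser

class PriceScraper(HTMLParser):
    """Collects (name, price) pairs from <span class="name"> / <span class="price">."""
    def __init__(self):
        super().__init__()
        self._field = None      # which field we are currently inside, if any
        self._current = {}      # fields gathered for the product in progress
        self.rows = []          # completed (name, price) tuples

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
            if len(self._current) == 2:     # both fields seen: one row done
                self.rows.append((self._current["name"], self._current["price"]))
                self._current = {}

page = """
<div class="product"><span class="name">Widget</span><span class="price">$9.99</span></div>
<div class="product"><span class="name">Gadget</span><span class="price">$24.50</span></div>
"""
scraper = PriceScraper()
scraper.feed(page)
print(scraper.rows)  # [('Widget', '$9.99'), ('Gadget', '$24.50')]
```

In practice the parsed rows would be appended to the local database or spreadsheet mentioned above; libraries like BeautifulSoup or Scrapy do the same job with far less boilerplate.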
Why Scrape the Web?
Investment analysts and data vendors use web scraping for one main reason: to speed up data collection. Given the volume of data typically collected, the manual process of copying, pasting, clicking, and scrolling is a huge waste of time. Data scientists who know how to code will conduct the scrape in-house at the fund if there is enough capacity. For large-volume scrapes, the fund will either purchase an off-the-shelf offering or request a bespoke scrape from one of the vendors in this space, such as Thinknum, Quandl, or Yipit.
What Type of Data is an Investment Firm Looking to Collect?
Some popular use cases include:
- Price Monitoring – investors want to know which products are selling at which prices and discounts/promotions, whether an item is sold out, and how many people are viewing an item or have it in their cart. All of this feeds into pricing and revenue optimization, competitor monitoring, product trend monitoring, etc.
- Company Filings – any data about a covered company is crucial when building an investment model. Scraped data and updates typically come from SEC filings, company fundamentals, public sentiment, and monitoring of news, media, and social platforms.
- Job Listings – increases in job postings and headcount signal whether a company is experiencing growth.
- Company Ratings – any platform that lets employees rate their employer, such as Glassdoor or LinkedIn, provides insight into rising or falling ratings, commentary about the CEO or management, benefits, and other growth indicators.
- Online Retail Data – product rankings, discounts, and promotions on online retailers provide insight into whether sales are strong or weak.
- Social Media Sentiment – access to a Twitter or Facebook stream provides data on number of followers, campaign monitoring, comments/reactions, sentiment analysis, number of shares/reposts, trends, historical data, and very often, news.
Web-scraped data can provide important new information about a covered company's business and outlook. The data typically generates alpha when combined with other data sets and some qualitative analysis.
Compliance Checklist for Web Scraping
Whether a fund conducts web scraping internally or uses a third-party vendor to obtain the data set, important points to review from a legal and compliance perspective include:
- What does the robots.txt file say? – Are there specific instructions in the site's robots.txt file that would prevent a crawler from collecting a particular type of data (for example, which items are in shopping carts)?
- Is the website openly accessible? – Or is the data collected from behind a login or password-protected page? Data that requires a login to be accessed is usually meant to be private, not publicly available.
- Is the crawler programmed to evade CAPTCHAs? – If so, what is the legal/ethical reasoning behind this? The entire purpose of a CAPTCHA is to verify that a bot is not being used, and some CAPTCHAs carry legal implications similar to signing a binding contract or document.
- Are you monitoring site load/burden? – Excessive load on a host site may result in a site block or a cease-and-desist letter.
- What data is being collected? – Does the data collected via web scraping contain any PII (first name, last name, username, location, address, health information, etc.)? Any data that can be tied back to an individual should be avoided completely.
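The first and fourth checks above (robots.txt and site burden) can be partly automated. Below is a sketch using Python's standard-library `urllib.robotparser`; the robots.txt content, URLs, and bot name are hypothetical examples, not a definitive compliance implementation.

```python
# Sketch of two pre-scrape compliance checks from the checklist above:
# honoring robots.txt directives and throttling request rate.
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: carts are off-limits, 5-second crawl delay.
robots_txt = """\
User-agent: *
Disallow: /cart/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# 1. robots.txt check: skip any path the site disallows (e.g. cart data).
for url in ("https://example.com/products", "https://example.com/cart/items"):
    allowed = parser.can_fetch("my-research-bot", url)
    print(url, "->", "fetch" if allowed else "skip (disallowed)")

# 2. Site-burden check: respect the site's declared crawl delay so the
#    scrape does not place excessive load on the host.
delay = parser.crawl_delay("my-research-bot") or 1  # fall back to 1 second

def polite_fetch(url):
    """Placeholder fetch that waits before each request."""
    time.sleep(delay)   # real code would issue the HTTP GET after the pause
    return url
```

This only covers machine-readable directives; a site's terms of service, login walls, and PII handling still require human legal review.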