- Dave approaches us through the contact form on our website. He advises that he would like to process a competitor's eCommerce store, recording the following information for each product. There are roughly 20,000 products to be crawled. The data needs to be updated each Friday for competitive research. This allows him to update his pricing each week and gives his business an edge.
- We clarify a few of Dave's queries.
- We send Dave a quote, which he accepts.
- Dave signs up to our billing management system. This system is used for billing, support and secure data access. He pays the deposit.
- We start work on Dave's project, and finish within three days.
- Dave receives an email with a secure link - allowing him to download the data in CSV format.
- We set up his custom software solution to run on our servers each week, delivering a new email link every Friday at approximately the same time.
- After a few months Dave's business becomes more profitable, as he has used the data not only to beat his competitor's pricing but also to raise his prices when his competitor raises theirs. Dave reaches out again and asks whether we can scrape data from four more competitors' sites.
- As the data structures for these new sites are the same, our setup fee for each is lower, since we can integrate these sites into our existing solution.
- In less than a week we set up the scrapers for these new sites.
- Now every Friday he receives a list of products for five of his competitors' businesses. He uses this data to make decisions that affect his business positively.
Most of our scrapers are written in Go (golang), which is our preferred language for web scraping. It is a compiled language, which can provide significant performance benefits over traditional web scraping tools. We use existing packages as well as our own to create an efficient scraper that fits your requirements.
Once we've unit tested the Go code, we deploy cloud server instances to run the scraping work. The number of server instances we deploy depends on the amount of work required (quantity of data to be collected, complexity of websites to be crawled, etc.). Often we rotate through proxies and use concurrency (goroutines) to control the rate at which we process data.
The collected data is then stored in a database, usually PostgreSQL or MySQL.
Once the crawl is complete, the database is analyzed and processed into the format the customer requires. Often this will be a simple CSV file, which may be shared via Dropbox, FTP, AWS S3 or Google Cloud, or uploaded directly to a client's server via rsync, scp or similar.
Web scraping is the process of extracting information from websites and turning that information into structured data. This data may be loaded into a database, spreadsheet, API or another format for further use. Uses may include: competitive analysis, lead generation, price monitoring, research, brand monitoring and a huge range of other business use cases.
Web scraping involves programming a bot to crawl a website and process its data. This data may come from various endpoints, using various techniques including APIs, GET and POST requests, parsed HTML, autocomplete forms and more.
This type of information gathering allows a business to gain competitive intelligence, as the collected data can be processed using programs, applications, spreadsheets and more. From a research, intelligence, and business perspective you gain a huge advantage over browsing the internet as a human.
Scraping websites is simply a form of data gathering, using a bot or program, to save information in a structured manner for later analysis.
Depending on the project a range of website scraping techniques may be utilized:
This is the most commonly thought of web scraping technique: simply downloading an HTML page and parsing it with code. For example, a program may save all `<h1>` tags to a database.
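To make the `<h1>` example concrete, here is a small sketch that collects the text of every `<h1>` element. It uses the standard library's `encoding/xml` decoder in its documented non-strict HTML mode so the example has no external dependencies; this is an assumption for illustration, and a dedicated HTML parser is usually a better fit for messy real-world pages. The `extractH1` name and the sample page are made up, and saving to a database is left out.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// extractH1 returns the trimmed text of every <h1> element in the page.
func extractH1(page string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(page))
	// Settings suggested by the encoding/xml docs for HTML-like input.
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	var headings []string
	var current strings.Builder
	inH1 := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return headings, nil
		}
		if err != nil {
			return headings, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if strings.EqualFold(t.Name.Local, "h1") {
				inH1 = true
				current.Reset()
			}
		case xml.CharData:
			if inH1 {
				current.Write(t) // copy now; token bytes are reused by the decoder
			}
		case xml.EndElement:
			if inH1 && strings.EqualFold(t.Name.Local, "h1") {
				headings = append(headings, strings.TrimSpace(current.String()))
				inH1 = false
			}
		}
	}
}

func main() {
	page := "<html><body><h1>Widget A</h1><p>Intro</p><h1>Widget B</h1></body></html>"
	headings, err := extractH1(page)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for _, h := range headings {
		fmt.Println(h)
	}
}
```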
Direct HTTP interfacing
Many publicly available API endpoints (usually HTTP methods GET and POST) already return well-structured data, which may give a bot an advantage. For example, many client-side applications interface with a JSON API which is publicly available. This data can often be read directly into a database.
The four main costs associated with web scraping are: software development, volume, quantity of sites and frequency.
Software Development Cost
Web scraping software can cost anywhere from $250 for a tiny project to tens of thousands of dollars. The average project involves a small level of complexity, such as working around rate limiting by routing through proxies, handling error messages and dealing with badly structured code. The average project may cost around $450 to $1,600 for initial software development.
We group the next three cost components under 'compute', as they all influence the amount of compute power required to complete the project. Compute power is the number and size of servers required to collect and process the data.
How much data is required? Are we crawling hundreds, thousands, hundreds of thousands or millions of endpoints?
The answer to this question will greatly influence the compute cost. The more data points, and the more data required, the higher the cost.
Let's give an example. Say you wish to crawl pricing for an eCommerce provider weekly. The eCommerce provider may have rate limiting set up that allows us only one request per second. Now, let's say you only want 1,000 data points saved each week. This would be easy enough: respecting their rate limit, and assuming one request per data point, your project would take 1,000 seconds (roughly 17 minutes) to complete each week.
Using the same example, let's say you require 1 million data points saved each week. Your bot will now take 1 million seconds to complete its crawl. This is roughly 278 hours, or about 11.6 days: more than a week!
As you can see, the volume of data to be crawled will greatly influence compute cost. In the latter case above, there may be ways to decrease crawl time by rotating proxies, randomizing user agents and more. However, if your project requires a large volume, perhaps consider less frequent updates, or consider a smaller data set. Perhaps the research you plan to undertake can be completed with 20,000 rows, not 1,000,000.
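One way proxy rotation might be wired into a Go HTTP client is sketched below. The `Rotator` type and the proxy addresses are hypothetical; the only library hook relied on is the `Proxy` field of `http.Transport`, which is called for each request to pick a proxy. User agent randomization is not shown.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"sync"
)

// Rotator hands out proxies in round-robin order and is safe for concurrent use.
type Rotator struct {
	mu      sync.Mutex
	next    int
	proxies []*url.URL
}

// NewRotator parses the given proxy addresses (hypothetical values in this example).
func NewRotator(rawProxies []string) (*Rotator, error) {
	r := &Rotator{}
	for _, raw := range rawProxies {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, err
		}
		r.proxies = append(r.proxies, u)
	}
	if len(r.proxies) == 0 {
		return nil, errors.New("no proxies configured")
	}
	return r, nil
}

// Proxy has the signature expected by http.Transport.Proxy and returns
// the next proxy in the rotation, ignoring the request itself.
func (r *Rotator) Proxy(*http.Request) (*url.URL, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	p := r.proxies[r.next%len(r.proxies)]
	r.next++
	return p, nil
}

func main() {
	rot, err := NewRotator([]string{
		"http://proxy-a.example:8080",
		"http://proxy-b.example:8080",
	})
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	// Requests made with this client would be routed through the rotating proxies.
	client := &http.Client{Transport: &http.Transport{Proxy: rot.Proxy}}
	_ = client

	for i := 0; i < 3; i++ {
		p, _ := rot.Proxy(nil)
		fmt.Println(p.Host)
	}
}
```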
Quantity of sites
The quantity of sites to be crawled, including subdomains and data points with different representations, will greatly impact compute cost and software cost. If we need to crawl 10 websites, each with different HTML, POST and GET endpoints, we basically need to rewrite the bot for each, and the compute power required scales with volume multiplied by the number of sites.
Also, if your data set requires multiple data types (for example 'products' and 'documentation'), the bot will need to include different parsing functionality for each data type.
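The need for separate parsing code per data type can be pictured with a small interface. Everything here is illustrative: the `Parser` interface, the two parsers and the toy input formats ("name|price" and "title on the first line") are assumptions, not a description of any real site.

```go
package main

import (
	"fmt"
	"strings"
)

// Record is a generic row produced by any parser.
type Record map[string]string

// Parser turns one raw page into a record. Each data type (or site)
// gets its own implementation.
type Parser interface {
	Parse(raw string) (Record, error)
}

type productParser struct{}

// Parse expects the hypothetical format "name|price".
func (productParser) Parse(raw string) (Record, error) {
	name, price, ok := strings.Cut(raw, "|")
	if !ok {
		return nil, fmt.Errorf("malformed product %q", raw)
	}
	return Record{"type": "product", "name": name, "price": price}, nil
}

type docParser struct{}

// Parse treats the first line of the page as the document title.
func (docParser) Parse(raw string) (Record, error) {
	title, _, _ := strings.Cut(raw, "\n")
	return Record{"type": "documentation", "title": strings.TrimSpace(title)}, nil
}

// parsers maps each data type to the code that understands it.
var parsers = map[string]Parser{
	"products":      productParser{},
	"documentation": docParser{},
}

// parse dispatches a raw page to the parser registered for its data type.
func parse(dataType, raw string) (Record, error) {
	p, ok := parsers[dataType]
	if !ok {
		return nil, fmt.Errorf("no parser registered for %q", dataType)
	}
	return p.Parse(raw)
}

func main() {
	rec, err := parse("products", "Widget|19.99")
	fmt.Println(rec, err)
	rec, err = parse("documentation", "Install guide\nStep one...")
	fmt.Println(rec, err)
}
```

Each new site or data type adds another entry to the map, which is where the extra software cost comes from.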
Frequency is the time interval between updates of your data. If your bot needs to crawl a website daily, the compute cost will of course be far higher than for monthly updates. If your bot is simply a one-off research project, this cost will be a one-off.
To summarize, the cost of your website crawling project depends on a number of factors including website complexity, quantity of websites and data points, frequency of updates and volume of data to be collected.