Most large modern websites have one or more backend APIs that feed data to their frontend apps, whether that's the website itself or a mobile application.

When collecting data from these sites, you'll often have a look at how the site loads its data and find that the data you need is being loaded by JavaScript. So you try to set up a headless browser to render the JavaScript and then parse the HTML once it has loaded.

Generally speaking, if your target site has a backend API that contains all the data you require, you can interface with it directly. If possible, do so: you'll save a ton of compute power, meaning your entire collection process will be cheaper - less bandwidth, a smaller server, and a less intrusive footprint (hence less likely to get blocked).
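As a rough illustration, here is a minimal sketch of what hitting such an API directly looks like. The endpoint URL and the response fields are hypothetical placeholders, not a real site:

```python
import requests

# Hypothetical JSON endpoint discovered in the browser's network tab.
API_URL = "https://example.com/api/v1/products"

# A single lightweight HTTP request replaces loading the full page,
# its JavaScript bundles, images and CSS in a headless browser.
response = requests.get(API_URL, params={"page": 1}, timeout=10)
response.raise_for_status()

data = response.json()  # the API already returns structured data
for item in data.get("items", []):
    print(item.get("name"), item.get("price"))
```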

Running a headless browser is a fairly compute-intensive task, especially if you are running multiple threads and want to collect data quickly.

Hence, finding a hidden API is valuable and worth spending a bit of time trying to figure out.

Where is the Hidden API?

If all you do is load the webpage, download it via curl or similar, and realize the data isn't there, don't jump straight to setting up a resource-intensive headless browser to run JavaScript. Dive a little deeper into the application and try to reverse engineer it first. Some web scraping projects genuinely require JavaScript, but if you can avoid it you can collect data far more efficiently, potentially cutting bandwidth and other costs by a factor of 100 or more. On top of that, public APIs are often designed to be queried fairly quickly, so it can actually be easier to avoid getting blocked.

This can be done fairly simply using the 'network' tab inside any web browser's developer tools:

Firefox Network Tab

Generally, all you'll need to do is reload the page (perhaps clearing the cache as well) with the network tab open and watch the various requests come in. Filtering by 'XHR' will often reveal an API endpoint, but be sure to check them all.

Be sure to check the 'response' section and eventually you might find something like this:

Network tab response (json) example

This could be JSON, XML or similar. Of course, your web scraping project may require querying several different API endpoints to get all the data you need. Often the simplest approach is to navigate the site like a normal user with the network tab open, stepping through each request manually.
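For instance, here's a sketch of stepping through two related endpoints the way you might click through the site in a browser. Both URLs and the 'results'/'id' field names are assumptions for illustration:

```python
import requests

session = requests.Session()  # reuse cookies and connections across calls

# Hypothetical listing endpoint spotted while browsing with the network tab open.
listing = session.get(
    "https://example.com/api/v1/listings", params={"page": 1}, timeout=10
)
listing.raise_for_status()

for entry in listing.json().get("results", []):
    # Hypothetical detail endpoint observed when clicking into a single item.
    detail = session.get(
        f"https://example.com/api/v1/listings/{entry['id']}", timeout=10
    )
    detail.raise_for_status()
    print(detail.json())
```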

Once you've found the target endpoint, try right clicking it and opening it in a new tab. If it fails to load, you'll need to set up your scraper with the correct HTTP headers. This is easy enough: right click the request, select 'Copy as cURL' and extract the required headers from the command. For example, many APIs require that you send the correct 'Accept' header.
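Here's a sketch of replicating those headers in a scraper. The header values below are placeholders standing in for whatever 'Copy as cURL' gives you on your target site; the exact set needed varies by site:

```python
import requests

# Headers lifted from the browser's 'Copy as cURL' output (placeholder values).
headers = {
    "Accept": "application/json",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "X-Requested-With": "XMLHttpRequest",  # some APIs check this for XHR calls
    "Referer": "https://example.com/products",
}

response = requests.get(
    "https://example.com/api/v1/products", headers=headers, timeout=10
)
response.raise_for_status()
print(response.json())
```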

That's it! Now you can write a scraper without loading bulky, unnecessary JavaScript and other assets.