Web scraping Amazon can be challenging due to its sophisticated anti-bot measures. This guide will show you how to build a reliable Amazon scraper using Playwright, a modern automation library, and Bright Data, the internet's most trusted web data platform.
Prefer learning with video material? Check out our in-depth course: Web Scraping for Developers that Just Works. In it we do everything below, but also dive deeper into topics like selector best practices and using AI when writing scrapers. Best of all, it's FREE!
All the code below uses modern async/await syntax, so some familiarity with asynchronous JavaScript will help.
First, install Playwright:
npm install playwright
If you plan to run a local browser, also download the browser binaries with `npx playwright install chromium`. And if you're using Bright Data's Scraping Browser (recommended to avoid blocks), you'll need your authentication credentials.
Our scraper will consist of three main components:
async function main() {
  // ...
  // visit the site and wait for the data to load

  // pagination handler to extract the data for all result pages
  await paginateResults(page, async () => {
    // product extraction function to get the data
    await getBooks(page);
    // ...
  });
}
To prevent getting blocked by Amazon, you can connect via Bright Data's Scraping Browser:
const browser = await pw.chromium.connectOverCDP(SBR_CDP);
Or you can use a local browser (though this is more likely to get blocked):
const browser = await pw.chromium.launch({ headless: false });
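For reference, the SBR_CDP endpoint passed to connectOverCDP above is just a WebSocket URL with your Bright Data credentials embedded. A minimal sketch of how it is built (the auth string below is a placeholder, not a real credential):

```javascript
// Build the Scraping Browser CDP endpoint from a Bright Data auth string.
// 'USER:PASS' is a placeholder; use the credentials from your dashboard.
function sbrEndpoint(auth) {
  return `wss://${auth}@brd.superproxy.io:9222`;
}

console.log(sbrEndpoint('USER:PASS')); // wss://USER:PASS@brd.superproxy.io:9222
```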
Using Bright Data's Scraping Browser gives you a managed remote browser with proxying handled for you, which greatly reduces the chance of being blocked compared with a local browser.
Now that we have a browser initialized, we can visit the Amazon page we want to scrape. We'll scrape data about books on the topic "live on mars".
async function main() {
  try {
    const page = await browser.newPage();
    const booksSearch = '"live on mars" books';
    await page.goto(`https://amazon.com/s?k=${encodeURIComponent(booksSearch)}`, { timeout: 2 * 60 * 1000 });
    await page.waitForSelector('[data-component-type="s-search-result"]');
    // scraping will happen here ...
  } finally {
    await browser.close();
  }
}
if (require.main === module) {
  main().catch(err => {
    console.error(err.stack || err);
    process.exit(1);
  });
}
Notice how in the code block above, we do as little work as possible to get to the data we want. Instead of using the scraper to visit the homepage and then interact with the search input to find books about living on Mars, we go directly to the search results page, relying on the k query parameter, which is highly unlikely to change.
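That direct-to-results approach is just string building plus URL encoding, and can be checked on its own:

```javascript
// Jump straight to the results page by encoding the search phrase
// into the `k` query parameter, skipping the homepage entirely.
const booksSearch = '"live on mars" books';
const url = `https://amazon.com/s?k=${encodeURIComponent(booksSearch)}`;

console.log(url); // https://amazon.com/s?k=%22live%20on%20mars%22%20books
```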
Next, we'll define a getBooks function and use Playwright to target the HTML elements containing the data we want to extract. Selectors are determined simply by inspecting the page with the browser dev tools.
async function getBooks(page) {
  const books = await page.$$('[data-component-type="s-search-result"]');
  const results = [];
  for (let i = 0; i < books.length; i++) {
    const titleElement = await books[i].$('h2 a span');
    const title = titleElement ? await titleElement.innerText() : '';
    const priceWholeElement = await books[i].$('span.a-price-whole');
    const priceWhole = !priceWholeElement ? '' : (await priceWholeElement.innerText()).replace('.', '');
    const priceFractionElement = await books[i].$('span.a-price-fraction');
    const priceFraction = !priceFractionElement ? '' : await priceFractionElement.innerText();
    const book = {
      title,
      price: Number(`${priceWhole.trim()}.${priceFraction.trim()}`) || null,
    };
    results.push(book);
  }
  return results;
}
This function demonstrates several important scraping techniques:
- Scoping all queries to a single result card via the [data-component-type="s-search-result"] selector, so each field is read relative to one product.
- Guarding against missing elements before reading them, e.g. const priceWhole = !priceWholeElement ? '' : (await priceWholeElement.innerText()); since not every result displays a price.
- Converting the extracted strings into a usable number with Number(`${priceWhole.trim()}.${priceFraction.trim()}`) || null, which falls back to null when no price was found.
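The price-combining expression can be pulled out and sanity-checked in isolation. A small sketch, assuming the same trimming behavior as in getBooks:

```javascript
// Combine the whole and fractional price strings into a number.
// When both parts are missing, Number('.') is NaN, so || null kicks in.
function parsePrice(priceWhole, priceFraction) {
  return Number(`${priceWhole.trim()}.${priceFraction.trim()}`) || null;
}

console.log(parsePrice('12', '99')); // 12.99
console.log(parsePrice('', ''));     // null
```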
Using the getBooks function to scrape the data for the first page is then straightforward.
// all the same above ...
await page.waitForSelector('[data-component-type="s-search-result"]')
await getBooks(page)
Finally, in order to get data for all the results, we must handle pagination. This is done with a paginateResults function that looks like this:
async function paginateResults(page, processPage) {
  let currentPage = 1;
  let hasNextPage = true;
  while (hasNextPage) {
    console.log(`\nScraping page ${currentPage}...`);
    // Wait for results to load
    await page.waitForSelector(`[aria-label="Current page, page ${currentPage}"]`);
    // Execute the callback function for this page
    await processPage();
    // Check for next page button
    const nextButton = await page.$('a.s-pagination-next');
    if (!nextButton) {
      console.log('\nReached the last page.');
      hasNextPage = false;
    } else {
      await nextButton.click();
      currentPage++;
    }
  }
  return currentPage;
}
There are several key takeaways from this function. Waiting for the "Current page" indicator ensures each page's results have loaded before extraction begins, and accepting a processPage callback makes paginateResults more flexible and reusable. To put it to use, we alter the main function slightly:
// all the same above ...
await page.waitForSelector('[data-component-type="s-search-result"]')

// provide an array to store the data from ALL the pages in
let resultsForAllPages = [];
await paginateResults(page, async () => {
  // move the call to getBooks inside the callback function
  // so it will work per page
  const resultsPerPage = await getBooks(page);
  // push all the parsed data into the resultsForAllPages array
  resultsForAllPages = [...resultsForAllPages, ...resultsPerPage];
});
This implementation provides a robust foundation for scraping Amazon product data. By following these practices and using tools like Playwright and Bright Data's scraping browser, you can build reliable and scalable scraping solutions while minimizing the risk of blocks and errors.
If you'd like a more step-by-step walkthrough of this project, plus more in-depth guidance and best practices on how to scrape the web, you can dive deeper with our course Web Scraping for Developers That Just Works.
Finally, here is the full scraper code in one go:
const pw = require('playwright');

const AUTH = 'YOUR_BRIGHT_DATA_AUTH_STRING';
const SBR_CDP = `wss://${AUTH}@brd.superproxy.io:9222`;

async function main() {
  console.log('Connecting to Scraping Browser...');
  const browser = await pw.chromium.connectOverCDP(SBR_CDP);
  // const browser = await pw.chromium.launch({ headless: false });
  try {
    console.log('Connected! Navigating...');
    const page = await browser.newPage();
    const booksSearch = '"live on mars" books';
    await page.goto(`https://amazon.com/s?k=${encodeURIComponent(booksSearch)}`, { timeout: 2 * 60 * 1000 });
    await page.waitForSelector('[data-component-type="s-search-result"]');

    let resultsForAllPages = [];
    await paginateResults(page, async () => {
      const resultsPerPage = await getBooks(page);
      resultsForAllPages = [...resultsForAllPages, ...resultsPerPage];
    });

    console.log(resultsForAllPages);
    await page.screenshot({ path: './page.png', fullPage: true });
  } finally {
    await browser.close();
  }
}

if (require.main === module) {
  main().catch(err => {
    console.error(err.stack || err);
    process.exit(1);
  });
}

async function getBooks(page) {
  const books = await page.$$('[data-component-type="s-search-result"]');
  const results = [];
  for (let i = 0; i < books.length; i++) {
    const titleElement = await books[i].$('h2 a span');
    const title = titleElement ? await titleElement.innerText() : '';
    const priceWholeElement = await books[i].$('span.a-price-whole');
    const priceWhole = !priceWholeElement ? '' : (await priceWholeElement.innerText()).replace('.', '');
    const priceFractionElement = await books[i].$('span.a-price-fraction');
    const priceFraction = !priceFractionElement ? '' : await priceFractionElement.innerText();
    const book = {
      title,
      price: Number(`${priceWhole.trim()}.${priceFraction.trim()}`) || null,
    };
    results.push(book);
  }
  return results;
}

async function paginateResults(page, processPage) {
  let currentPage = 1;
  let hasNextPage = true;
  while (hasNextPage) {
    console.log(`\nScraping page ${currentPage}...`);
    // Wait for results to load
    await page.waitForSelector(`[aria-label="Current page, page ${currentPage}"]`);
    // Execute the callback function for this page
    await processPage();
    // Check for next page button
    const nextButton = await page.$('a.s-pagination-next');
    if (!nextButton) {
      console.log('\nReached the last page.');
      hasNextPage = false;
    } else {
      await nextButton.click();
      currentPage++;
    }
  }
  return currentPage;
}