  • 1
    ECommerceCrawlers

    Collection of Python ecommerce and website crawler example projects

    ECommerceCrawlers is a collection of practical Python web crawler projects designed to gather data from a variety of ecommerce platforms, websites, and online services. It aggregates many independent crawler examples created by contributors and organized into separate subprojects that target specific sites or data sources. These examples demonstrate how to build and operate web scrapers capable of collecting structured information such as product listings, news content, job postings, social media data, and other publicly available web data. It aims to help developers understand the full workflow of web scraping, including request simulation, data extraction, storage, and handling anti-scraping techniques. It includes crawlers for platforms such as ecommerce marketplaces, blogging platforms, recruitment sites, and social networks, providing real-world practice scenarios. Developers can study the individual project documentation to understand the analysis process.
    Downloads: 8 This Week
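
    The subprojects differ by target site, but most follow the request/extract/store workflow described above. As a generic, hedged illustration of that workflow (not code from this repository; the URL and CSS selector are placeholders), a minimal fetch-and-extract step in Python might look like this:

      # Generic fetch/extract/store skeleton (illustrative only).
      # Requires: pip install requests beautifulsoup4
      import json
      import requests
      from bs4 import BeautifulSoup

      headers = {"User-Agent": "Mozilla/5.0"}  # simulate a normal browser request
      resp = requests.get("https://example.com/products", headers=headers, timeout=10)
      soup = BeautifulSoup(resp.text, "html.parser")

      items = [{"title": a.get_text(strip=True), "url": a.get("href")}
               for a in soup.select("a.product")]  # placeholder selector
      with open("products.json", "w", encoding="utf-8") as f:
          json.dump(items, f, ensure_ascii=False)  # the storage step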
  • 2
    UI.Vision RPA

    Open-Source RPA Software (formerly Kantu)

    The UI.Vision RPA software is a tool for visual process automation, codeless UI test automation, web scraping, and screen scraping. It automates tasks on Windows, Mac, and Linux. The UI.Vision RPA core is open source with enterprise security, and the free, open-source browser extension can be extended with local apps for desktop UI automation. Its computer-vision visual UI testing commands let you write automated visual tests, making UI.Vision RPA the first and only Chrome and Firefox extension (and Selenium IDE) that has "👁👁 eyes". A huge benefit of visual testing is that you are not checking just one or two elements at a time; a single visual assertion covers a whole section or page. The visual UI testing and browser automation commands of UI.Vision RPA help web designers and developers verify and validate the layout of websites and canvas elements.
    Downloads: 8 This Week
  • 3
    crwlr

    Library for Rapid (Web) Crawler and Scraper Development

    This library provides a framework and a lot of ready-to-use building blocks, called steps, that you can combine to build your own crawlers and scrapers. Before diving into the library, let's look at the terms crawling and scraping. For most real-world use cases, the two go hand in hand, which is why this library helps with and combines both. A (web) crawler is a program that loads documents and follows the links in them to load those as well. A crawler could load every link it finds (and is allowed to load according to the robots.txt file); provided its start URLs are not dead ends, it would eventually load the whole internet. Alternatively, it can be restricted to links matching certain criteria (same domain/host, URL path starts with "/foo", ...) or to a certain depth. A depth of 3 means 3 levels deep: links found on the initial URLs provided to the crawler are level 1, and so on.
    Downloads: 8 This Week
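
    crwlr itself is a PHP library, so as a language-neutral sketch of the depth-limited, same-host crawling just described (example.com is a placeholder), here is the idea in Python:

      # Depth-limited breadth-first crawl (illustration of the concept,
      # not the crwlr API). Requires: pip install requests beautifulsoup4
      from collections import deque
      from urllib.parse import urljoin, urlparse

      import requests
      from bs4 import BeautifulSoup

      def crawl(start_url, max_depth=3):
          seen = {start_url}
          queue = deque([(start_url, 0)])  # links found on start pages will be level 1
          while queue:
              url, depth = queue.popleft()
              try:
                  html = requests.get(url, timeout=10).text
              except requests.RequestException:
                  continue
              yield url, depth
              if depth >= max_depth:
                  continue  # do not follow links beyond the depth limit
              for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                  link = urljoin(url, a["href"])
                  # restrict to the same host, one of the criteria mentioned above
                  if urlparse(link).netloc == urlparse(start_url).netloc and link not in seen:
                      seen.add(link)
                      queue.append((link, depth + 1))

      for url, depth in crawl("https://example.com", max_depth=3):
          print(depth, url)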
  • 4
    dxy-covid-19-crawler

    Realtime crawler for COVID-19 outbreak statistics from DXY data

    DXY-COVID-19-Crawler is a Python-based project designed to collect real-time COVID-19 infection data from the public dataset provided by Ding Xiang Yuan (DXY). The crawler periodically retrieves pandemic statistics and stores them in a database so that historical changes in the outbreak can be preserved and analyzed later. It was created to make up-to-date infection data more accessible for developers, researchers, and analysts who wanted to build visualizations or conduct data analysis during the early stages of the pandemic. DXY-COVID-19-Crawler automatically crawls data at regular intervals, typically every minute, ensuring that newly published statistics are captured as quickly as possible. Retrieved data is stored in MongoDB and archived so that the entire progression of the outbreak can be traced over time. It also provided an API that allowed developers to easily access the collected data for building dashboards, visualizations, and other analytical tools.
    Downloads: 8 This Week
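
    The project's actual crawler targets DXY's published pages; as a hedged sketch of the crawl-every-minute-and-archive pattern described above (the URL and field names are placeholders, not the project's code), the loop looks roughly like this:

      # Periodic crawl-and-archive loop (placeholder URL and fields).
      # Requires: pip install requests pymongo
      import time
      import requests
      from pymongo import MongoClient

      collection = MongoClient("mongodb://localhost:27017")["covid"]["snapshots"]

      while True:
          resp = requests.get("https://example.com/stats.json", timeout=10)  # placeholder source
          if resp.ok:
              doc = resp.json()
              doc["crawl_time"] = time.time()  # keep every snapshot so history is preserved
              collection.insert_one(doc)
          time.sleep(60)  # the project crawls roughly every minute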
  • 5
    fess

    Open source enterprise search server for websites, files, and data

    Fess is an open source enterprise search server designed to provide powerful full-text search capabilities across multiple data sources. It enables organizations to quickly deploy a scalable search environment without requiring deep knowledge of underlying search technologies. Fess is built on top of OpenSearch and offers an integrated solution for crawling, indexing, and searching documents from websites, file systems, and various data stores. Fess includes a built-in crawler that can collect content from sources such as databases, CSV files, and shared storage, making it suitable for centralized knowledge discovery. It supports indexing and searching across many document formats including office documents, PDFs, and compressed archives. It also provides a web-based administrative interface that allows administrators to configure crawling targets, manage indexing tasks, and adjust search settings from a graphical dashboard.
    Downloads: 7 This Week
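
    Once a Fess instance is running, its search results can be consumed over HTTP as JSON. A minimal sketch follows; the endpoint path and response fields are assumptions based on Fess's documented JSON response API, so verify them against your version:

      # Query a running Fess instance's JSON search API (endpoint path and
      # response fields are assumptions; check the Fess docs for your version).
      import requests

      resp = requests.get("http://localhost:8080/json/", params={"q": "opensearch"}, timeout=10)
      resp.raise_for_status()
      for hit in resp.json().get("response", {}).get("result", []):
          print(hit.get("title"), hit.get("url"))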
  • 6
    sqliv

    Massive SQL injection vulnerability scanner for automated web testing

    SQLiv is a command-line security tool designed to identify SQL injection vulnerabilities in web applications through automated scanning techniques. Written primarily in Python, the project focuses on discovering potentially vulnerable web pages by analyzing URLs that contain database query parameters. It can perform large-scale scanning by using search engine queries known as SQL injection dorks to collect candidate websites and then test them for vulnerabilities. In addition to bulk scanning, SQLiv supports targeted analysis of specific domains or individual URLs, allowing security researchers to focus on particular web applications. When a domain is supplied, the scanner can crawl the site to gather URLs with parameters and evaluate them for potential SQL injection weaknesses. SQLiv also supports reverse domain scanning to locate other websites hosted on the same server, which can then be examined for similar vulnerabilities.
    Downloads: 7 This Week
  • 7
    Crawlab

    Distributed web crawler admin platform for spiders management

    Crawlab is a Golang-based distributed web crawler management platform supporting various languages, including Python, NodeJS, Go, Java, and PHP, and various web crawler frameworks, including Scrapy, Puppeteer, and Selenium. Use docker-compose to start it up with one click; you don't even have to configure the MongoDB database. The frontend app interacts with the master node, which communicates with other components such as MongoDB, SeaweedFS, and worker nodes. The master node and worker nodes communicate with each other via gRPC (an RPC framework). Tasks are scheduled by the task scheduler module in the master node and received by the task handler module in worker nodes, which executes them in task runners. Task runners are processes running spider or crawler programs, and they can also send data through gRPC (integrated in the SDK) to other data sources, e.g. MongoDB.
    Downloads: 6 This Week
  • 8
    GoSpider

    Gospider - Fast web spider written in Go

    GoSpider is a fast web spider written in Go. Features include fast web crawling; brute-forcing and parsing sitemap.xml; parsing robots.txt; generating and verifying links from JavaScript files; a link finder; finding AWS S3 buckets in response sources; finding subdomains in response sources; getting URLs from the Wayback Machine, Common Crawl, VirusTotal, and AlienVault; grep-friendly output formatting; Burp input support; and crawling multiple sites in parallel.
    Downloads: 6 This Week
  • 9
    Pydoll

    Async Python library for automating Chromium browsers without WebDriver

    Pydoll is a Python library designed for automating Chromium-based web browsers such as Chrome and Edge without relying on a traditional WebDriver layer. Instead of using external drivers, it connects directly to the Chrome DevTools Protocol through WebSocket, allowing scripts to control browser behavior more efficiently and with fewer compatibility issues. It provides a high-level API that simplifies common browser automation tasks while still offering access to low-level protocol features for advanced control. Its architecture is built around asynchronous programming using Python's asyncio framework, enabling concurrent automation of multiple tabs and browser contexts. Pydoll also includes tools for monitoring and intercepting network traffic, allowing developers to analyze or modify requests and responses during automation workflows. It emphasizes realistic interactions and fingerprint management to reduce the likelihood of automated actions being detected.
    Downloads: 6 This Week
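
    A minimal sketch of the async, WebDriver-free flow described above. The import path and method names are assumptions modeled on the project's documentation, so treat this as an outline and check the Pydoll docs before relying on it:

      # Hedged Pydoll-style sketch (import path and method names are
      # assumptions; verify against the Pydoll documentation).
      import asyncio
      from pydoll.browser import Chrome  # assumed import path

      async def main():
          async with Chrome() as browser:
              tab = await browser.start()  # assumed to return the initial tab
              await tab.go_to("https://example.com")  # navigation over CDP, no WebDriver

      asyncio.run(main())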
  • 10
    Spider

    High-performance Rust web crawler and scraper for large-scale data

    Spider is a high-performance web crawler and web scraping library written in Rust that enables developers to crawl and index websites efficiently. It focuses on speed, concurrency, and reliability by using asynchronous and multi-threaded processing to handle large volumes of web pages. It can rapidly crawl websites to collect links, retrieve page content, and extract structured information from HTML documents. Spider can operate concurrently across many pages, allowing it to gather large datasets in a short period of time. Spider also provides mechanisms for subscribing to crawl events so developers can process page data such as URLs, status codes, or HTML content as it is discovered. It supports advanced capabilities such as headless browser rendering, background crawling tasks, and configurable rules that control crawl depth or ignored paths. These capabilities make the project suitable for building search indexers, data extraction pipelines, and SEO analysis tools.
    Downloads: 6 This Week
  • 11
    dirhunt

    Web crawler that finds hidden web directories without brute force

    Dirhunt is an open source security tool designed to discover web directories and analyze website structures without relying on brute-force techniques. Instead of sending large numbers of guess-based requests, it operates as a specialized crawler that intelligently explores websites to identify accessible or hidden directories. Dirhunt can detect directories that expose “Index Of” listings, which may reveal files and other resources that were not intended to be publicly visible. It can also identify situations where directories are intentionally hidden through empty index files or servers that return misleading responses such as fake 404 errors. Dirhunt processes HTML pages and other available sources to discover additional paths and directories while minimizing the number of requests sent to the server, making scans faster and less intrusive. It supports scanning multiple targets at the same time and allows results to be filtered, analyzed, and exported for further review.
    Downloads: 6 This Week
  • 12
    douyin

    Open source Douyin crawler for collecting and downloading public data

    DouyinCrawler is an open source data collection tool designed to gather publicly available information from the Douyin platform. It demonstrates how to build a Python-based web crawler combined with a graphical interface and command line functionality. It allows users to collect data from various types of Douyin content, including user profiles, videos, hashtags, and music pages. DouyinCrawler supports both automated scraping and batch operations to process multiple targets efficiently. It also integrates with the Aria2 download utility to enable large-scale downloading of videos and images associated with collected content. It includes multiple usage modes such as a desktop GUI, a web service interface, and a command line tool for flexible deployment. In addition to data collection, it supports incremental updates so users can track and gather newly published content without reprocessing previously collected data.
    Downloads: 6 This Week
  • 13
    pandora-box

    Lightweight cross-platform desktop client for managing Mihomo proxies

    Pandora-Box is a lightweight desktop client designed to provide a graphical interface for the Mihomo proxy core. It allows users to manage proxy configurations and subscriptions through a simple and user-friendly interface rather than working directly with configuration files. Pandora-Box supports multiple proxy protocols and provides tools to organize and control network routing rules. It is designed to work for both casual users who want an easy setup and advanced users who need more control over proxy behavior. It also supports automatic rule grouping and features such as TUN mode to enable system-wide proxy routing. Pandora-Box focuses on delivering a clean interface with practical features for importing, managing, and converting proxy subscriptions. Pandora-Box combines a desktop interface with backend components to create a functional proxy management environment that simplifies complex networking configurations.
    Downloads: 6 This Week
  • 14
    ruia

    Async Python framework for fast and flexible web scraping spiders

    Ruia is an asynchronous web scraping micro-framework built for Python that focuses on simplicity, speed, and flexibility when creating web crawlers. Ruia is powered by Python’s asyncio library along with aiohttp, enabling developers to perform concurrent network requests efficiently and scrape data from websites with minimal overhead. Ruia follows a “write less, run faster” philosophy, emphasizing concise code and streamlined spider development. It provides a structured approach to building scraping projects through components such as data items, spiders, middleware, and plugins. Developers can define structured fields to extract information from HTML content and process responses asynchronously to improve crawling performance. It also supports middleware and plugin systems that allow customization of request handling, response processing, and additional functionality.
    Downloads: 6 This Week
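
    A small spider in the style of ruia's documented examples; the start URL and CSS selectors are taken from the project's Hacker News example and may be stale, so treat them as placeholders:

      # Minimal ruia spider (selectors/URL follow the project's Hacker News
      # example and may need updating for the current site markup).
      from ruia import AttrField, TextField, Item, Spider

      class HackerNewsItem(Item):
          target_item = TextField(css_select="tr.athing")
          title = TextField(css_select="a.storylink")
          url = AttrField(css_select="a.storylink", attr="href")

      class HackerNewsSpider(Spider):
          start_urls = ["https://news.ycombinator.com"]

          async def parse(self, response):
              async for item in HackerNewsItem.get_items(html=await response.text()):
                  print(item.title, item.url)

      if __name__ == "__main__":
          HackerNewsSpider.start()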
  • 15
    Roach

    The complete web scraping toolkit for PHP

    Roach is a complete web scraping toolkit for PHP. It is a shameless clone heavily inspired by the popular Scrapy package for Python. Roach allows us to define spiders that crawl and scrape web documents. But wait, there’s more. Roach isn’t just a simple crawler, but includes an entire pipeline to clean, persist and otherwise process extracted data as well. It’s your all-in-one resource for web scraping in PHP. Roach doesn’t depend on a specific framework. Instead, you can use the core package on its own or install one of the framework-specific adapters. Currently, there’s a first-party adapter available to use Roach in your Laravel projects with more coming. Roach is built from the ground up with extensibility in mind. In fact, most of Roach’s built-in behavior works the exact same way that any custom extensions or middleware works.
    Downloads: 5 This Week
  • 16
    bt-btt

    Guide and resources for accessing and using the U3C3 BitTorrent site

    BT-btt is a repository that provides information and guidance related to the U3C3 BitTorrent resource site and its magnet-based content search ecosystem. It primarily serves as documentation describing how the site works, how users can access it, and how magnet link resources are organized and retrieved. It explains how BitTorrent and magnet link downloads operate, including the role of trackers and distributed hash table (DHT) networks in locating peers and downloading files. BT-btt also discusses different ways users can search for torrent resources, including strategies for improving search results when dealing with multiple language variants or different character encodings. Additional documentation explains why certain statistics such as active download counts may not always be accurate when DHT-based downloads are used.
    Downloads: 5 This Week
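
    As a small, generic illustration of the magnet-link anatomy the documentation covers (this is not code from the repository), a magnet URI's info-hash, display name, and optional tracker list can be pulled apart like this:

      # Split a magnet URI into its parts (generic Python, standard library only).
      from urllib.parse import urlparse, parse_qs

      magnet = ("magnet:?xt=urn:btih:0123456789abcdef0123456789abcdef01234567"
                "&dn=example-file&tr=udp%3A%2F%2Ftracker.example%3A6969")
      params = parse_qs(urlparse(magnet).query)
      print(params["xt"][0])   # urn:btih:... info-hash, used for DHT peer lookup
      print(params["dn"][0])   # display name
      print(params.get("tr"))  # tracker list; optional when DHT can find peers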
  • 17
    crawlee

    A web scraping and browser automation library for Node.js

    Crawlee is a web scraping and browser automation library. It helps you build reliable crawlers. Fast. Crawlee won't fix broken selectors for you (yet), but it helps you build and maintain your crawlers faster. When a website adds JavaScript rendering, you don't have to rewrite everything, only switch to one of the browser crawlers. When you later find a great API to speed up your crawls, flip the switch back. It keeps your proxies healthy by rotating them smartly with good fingerprints that make your crawlers look human-like. It's not unblockable, but it will save you money in the long run. Crawlee is built by people who scrape for a living and use it every day to scrape millions of pages. Meet our community on Discord. We believe websites are best scraped in the language they're written in. Crawlee runs on Node.js and it's built in TypeScript to improve code completion in your IDE, even if you don't use TypeScript yourself.
    Downloads: 5 This Week
  • 18
    DotnetSpider

    Lightweight .NET framework for fast web crawling and data scraping

    DotnetSpider is a web crawling and data extraction framework built on the .NET Standard platform. It is designed to help developers create efficient and scalable crawlers for collecting structured data from websites. It provides a high-level API that simplifies the process of defining spiders, managing requests, and extracting content from web pages. Developers can create custom spiders by extending base classes and configuring pipelines that handle downloading, parsing, and storing collected data. DotnetSpider is modular, allowing different components such as request schedulers, downloaders, and storage systems to work together in a flexible workflow. DotnetSpider also supports distributed crawling environments, making it possible to scale data collection across multiple agents and machines. With support for various storage backends and extensible parsing mechanisms, it is suitable for building complex scraping systems or automated data gathering pipelines.
    Downloads: 4 This Week
  • 19
    Selectolax

    Python binding to Modest and Lexbor engines

    A fast HTML5 parser with CSS selectors, using the Modest and Lexbor engines. Selectolax supports two backends: Modest and Lexbor. By default, all examples use the Modest backend. Most features are almost identical between the backends, but there are still some differences. Currently, the Lexbor backend is in beta and missing some features. To use Lexbor, just import its parser and use it in a similar way to the HTMLParser.
    Downloads: 4 This Week
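
    A minimal usage sketch showing both backends (the HTML string is a placeholder):

      # Parse HTML and run CSS selectors with both Selectolax backends.
      from selectolax.parser import HTMLParser        # Modest backend (default)
      from selectolax.lexbor import LexborHTMLParser  # Lexbor backend (beta)

      html = '<div id="main"><a href="/docs">Docs</a><a href="/blog">Blog</a></div>'

      for node in HTMLParser(html).css("div#main > a"):
          print(node.text(), node.attributes.get("href"))

      # The Lexbor backend exposes a near-identical interface:
      print(LexborHTMLParser(html).css_first("a").text())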
  • 20
    WeChatSogou

    Python library to crawl and retrieve data from WeChat accounts

    WechatSogou is an open source Python library designed to retrieve data from WeChat official accounts by using the Sogou WeChat search service as its data source. It provides developers with a programmatic way to search for public accounts and collect article information without manually browsing the search interface. It functions as a crawler interface that sends requests to the search engine, retrieves results, and converts the returned pages into structured data that can be used in applications or analysis pipelines. Internally, the library separates its functionality into several layers including an API interface, request handling, and response parsing components to organize the crawling workflow. These components work together to process HTTP requests, handle verification mechanisms, and transform HTML or JSON responses into usable objects. Developers can integrate the library into scripts or larger data collection systems to automate gathering content from public accounts.
    Downloads: 4 This Week
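
    A short sketch in the style of the library's README; the method name and result keys follow its documented examples but should be treated as assumptions for your installed version:

      # Search official accounts via the Sogou WeChat search (method name and
      # result keys follow the project's README; verify for your version).
      import wechatsogou

      ws_api = wechatsogou.WechatSogouAPI()
      for gzh in ws_api.search_gzh("python"):
          print(gzh.get("wechat_name"), gzh.get("wechat_id"))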
  • 21
    Web Scraping for Laravel

    Laravel adapter for Roach, the complete web scraping toolkit for PHP

    This is the Laravel adapter for Roach, the complete web scraping toolkit for PHP. Easily integrate Roach into any Laravel application. The Laravel adapter mostly provides the necessary container bindings for the various services Roach uses, as well as making certain configuration options available via a config file. The Laravel adapter of Roach also registers a few Artisan commands to make our development experience as pleasant as possible. Roach ships with an interactive shell (often called a read-evaluate-print loop, or REPL for short) which makes prototyping our spiders a breeze. We can use the provided roach:shell command to launch a new REPL session.
    Downloads: 4 This Week
  • 22
    WebMagic

    A scalable web crawler framework for Java

    WebMagic is a scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction, and persistence. It can simplify the development of a specific crawler. WebMagic is a simple but scalable framework: you can easily develop a crawler based on it. It has a simple core with high flexibility and a simple API for HTML extraction. It also provides annotations with POJOs to customize a crawler, with no configuration needed. Other features include multi-threading and distribution support. WebMagic is very easy to integrate; just add the dependencies to your pom.xml. WebMagic uses slf4j with the slf4j-log4j12 implementation; if you use a custom slf4j implementation, please exclude slf4j-log4j12. To build a crawler, you write a class implementing PageProcessor.
    Downloads: 4 This Week
  • 23
    crawly

    High-level web crawling and scraping framework for Elixir apps

    Crawly is a high-level application framework for crawling websites and extracting structured data using the Elixir programming language. It provides a complete environment for building web crawlers that systematically visit pages, collect information, and transform that data into structured formats for further processing. Crawly is designed for tasks such as data mining, information processing, and building historical archives of web content. Crawly follows the Elixir and OTP architecture model, enabling concurrent and fault-tolerant crawling processes that can handle many requests efficiently. Developers define specialized components called spiders to control how pages are visited and how information is extracted from them. It also supports extensibility through middlewares, pipelines, and fetchers that allow customization of request handling, data processing, and crawling behavior.
    Downloads: 4 This Week
  • 24
    img2dataset

    Easily turn large sets of image URLs into an image dataset

    Easily turn large sets of image URLs into an image dataset. It can download, resize, and package 100M URLs in 20h on one machine, and it also supports saving captions for url+caption datasets. Opt-out directives: websites can pass the HTTP headers X-Robots-Tag: noai, X-Robots-Tag: noindex, X-Robots-Tag: noimageai, and X-Robots-Tag: noimageindex; by default, img2dataset will ignore images with such headers.
    Downloads: 4 This Week
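
    A minimal invocation in the library's documented Python style; the file paths are placeholders and the parameter names follow the README, so double-check them for your version:

      # Download and resize an image dataset from a URL list (paths are
      # placeholders; parameter names follow img2dataset's README).
      from img2dataset import download

      download(
          url_list="urls.txt",         # one image URL per line (placeholder path)
          output_folder="images",      # where output shards are written
          thread_count=64,             # parallel downloads
          image_size=256,              # resize target in pixels
          output_format="webdataset",  # tar shards suited to ML training
      )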
  • 25
    watercrawl

    AI-ready web crawler that extracts and structures website content

    WaterCrawl is an open source web crawling and data extraction platform designed to transform website content into structured data suitable for machine learning and AI workflows. It enables developers and researchers to crawl web pages, extract meaningful information, and convert it into formats that are easier to process and analyze. It provides a modern crawling system that can automatically navigate links, control crawl depth, and collect content from targeted sections of a website. WaterCrawl supports customizable extraction rules so users can focus only on relevant elements while ignoring unnecessary page components. WaterCrawl also offers real-time monitoring capabilities, allowing users to track crawling progress, performance metrics, and errors during large data collection jobs. Developers can integrate the tool into applications through a REST API and multiple client SDKs, enabling automated data pipelines and AI data preparation workflows.
    Downloads: 4 This Week