Key Features
| Key | Value |
| --- | --- |
| Language | C# |
| Framework | .NET 9 |
| Project type | Library |
| Status | Proof of concept |
| License | MIT |
| NuGet | |
| Downloads | |
| GitHub | Laraue.Crawling |
What Is Crawling
Crawling, also known as web scraping, is the automated process of systematically browsing the internet to extract data from websites. It involves a program, called a crawler or spider, following links and downloading content from pages. This extracted data can then be analyzed, stored, or used for various purposes like market research or price comparison.
Initial Implementation Vision
There were two ideas for how to implement the crawling program:
Application Implementation
This is the user-friendly way to let non-programmers work with the software. It should have an interface where users create a sequence of blocks to build a crawling schema.
Pros
- User-friendly: easy to start for users without programming knowledge
- Easier to promote: when the interface is ready, it is straightforward to make GIFs and videos with it, create tutorials, etc.
Cons
- Hard implementation: making an interface means not only creating a frontend (which is hard in itself), but also building an additional architecture that transforms the human-oriented view into programming code. Any edit on the backend can lead to edits on this layer and on the frontend. This part could be added in the future, once the backend is stable.
- Limitations: not everything that can be described in a general-purpose programming language can be described with an interface; in fact, almost nothing can (though it may be enough for the most common cases). Since the first goal was to build something that can grab data in any situation, this limitation is a serious drawback.
Library Implementation
The programmer-friendly way to work with the software. The library is written in a specific language, so only users of that language can work with it.
Pros
- Flexible: when the library follows sound software principles, it allows users to do almost everything they want
- Less development time: the product can be made in plain C#
- Cheaper support: no need to have a database or a domain to share development results
- Can be self-hosted: the MIT License allows using the library for any purpose
Cons
- Limited audience: sharply reduces possible users to C# programmers
- Harder to start: users have to download the repo and write some code to get a working example
The Main Problems This Library Tries to Solve
- Decrease the amount of routine work: writing a typical crawler is not hard, but it takes a lot of an engineer's time.
- Simplify support: sometimes the requested resources change their structure, which leads to code rewriting. Quickly developed crawlers often have a bad architecture, and it is easier to rewrite them completely than to make the changes.
- Better testability: a strongly typed library surfaces type errors as early as possible, and the model's defined properties can be tested like usual C# classes.
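To illustrate the testability point, here is a sketch of a regular unit test built on the mini-example shown later in this document. xUnit is assumed as the test framework, and the surrounding `using` directives are illustrative:

```csharp
using Microsoft.Extensions.Logging.Abstractions;
using Xunit;

public record OnePage(string Title) : ICrawlingModel;

public class OnePageSchemaTests
{
    [Fact]
    public async Task Schema_ParsesTitle_FromStaticHtml()
    {
        // Type errors in the mapping fail at compile time,
        // because the builder is bound to the OnePage record.
        var schema = new AngleSharpSchemaBuilder<OnePage>()
            .HasProperty(x => x.Title, ".title")
            .Build();

        var parser = new AngleSharpParser(new NullLoggerFactory());

        // The result is a plain C# record, so it can be asserted directly.
        var model = await parser.RunAsync(schema, "<html><p class='title'>Hi</p></html>");
        Assert.Equal("Hi", model.Title);
    }
}
```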
How To Use The Library
- Users should choose the type of crawling schema builder. The chosen class defines which actions are available for nodes. Built-in builders exist for static HTML, dynamic HTML, and even static XML. In fact, an implementation for almost any tree structure can be added: the user implements a parser class for the required node type and a set of related classes that define the node methods.
- Then the user builds a schema, as shown in these examples for static HTML and dynamic HTML
- The schema can be run via the parser class for the specified schema: static HTML or dynamic HTML. Or see the mini-example:
```csharp
// The model the parser fills; ICrawlingModel marks it as a crawling result.
public record OnePage(string Title) : ICrawlingModel;

// Describe how to map the HTML document onto the model.
var schema = new AngleSharpSchemaBuilder<OnePage>()
    .HasProperty(x => x.Title, ".title")
    .Build();

// Run the schema against raw HTML.
var parser = new AngleSharpParser(new NullLoggerFactory());
var model = await parser.RunAsync(schema, "<html><p class='title'>Hi</p></html>");
Assert.Equal("Hi", model.Title);
```
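Coming back to the point about supporting almost any tree structure: the sketch below hints at what a parser for a new node type could look like. Every name in it is hypothetical and illustrative only; it is not the library's real extension API, just a picture of the navigation logic such a parser would have to provide.

```csharp
using System.Text.Json;

// Hypothetical sketch: JsonTreeParser and SelectNode are illustrative
// names, NOT part of the library's actual API.
public class JsonTreeParser
{
    // Resolve a dotted path like "page.title" against a JSON element,
    // returning null when any segment is missing.
    public JsonElement? SelectNode(JsonElement root, string path)
    {
        var current = root;
        foreach (var segment in path.Split('.'))
        {
            if (!current.TryGetProperty(segment, out current))
                return null;
        }
        return current;
    }
}

// The related classes mentioned above would then expose node methods
// (reading a string, a number, a child list, ...) on top of this navigation.
```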
Challenges
The main solved problems will be described in separate articles:
- How the library can be strongly typed for the client despite the inner layer using untyped delegates and working with `object`
- How to make a common API for sync and async cases without increasing the amount of code
- How to make the API flexible enough to allow the customer to do even what the developer did not cover
Timeline
- Jun 2022: Base version with dynamic and static HTML parsing support
- Oct 2022: Refactoring that separated the common crawling logic from mode-specific details
- Apr 2023: Added the base class for a crawler job: an ASP.NET host that runs crawling on a schedule
- May 2024: Refactoring that made it possible to add new tree-structure crawlers in one or two hours
Real Use Cases
The library is widely used in the SPB Real Estate project. Crawlers for the main advertisement sites, Avito and Cian, are made with the library and launched as jobs.