Key Features
| Key | Value |
| --- | --- |
| Language | C# |
| Framework | .NET 9 |
| Project type | Library |
| Status | Proof of concept |
| License | MIT |
| NuGet | |
| Downloads | |
| GitHub | Laraue.Crawling |
What Is Crawling
Crawling, also known as web scraping, is the automated process of systematically browsing the internet to extract data from websites. It involves a program, called a crawler or spider, following links and downloading content from pages. This extracted data can then be analyzed, stored, or used for various purposes like market research or price comparison.
Initial Implementation Vision
There were two ideas for how to implement the crawling program:
Application Implementation
This is the user-friendly way to let non-programmers work with the software. It should have an interface where users create a sequence of blocks to build a crawling schema.
Pros
- User-friendly: easy to start for users without programming knowledge
- Easier to promote: when the interface is ready, it is straightforward to make GIFs and videos with it, create tutorials, etc.
Cons
- Hard implementation: making an interface means not only creating a frontend (which is hard in itself), but also building an additional architecture that transforms the human-oriented view into programming code. Any edit on the backend can lead to edits on this layer and on the frontend. This part could be added in the future, once the backend is stable.
- Limitations: not everything that can be described in a general-purpose programming language can be described with an interface; in fact, almost nothing can (though it may be enough for the most common cases). Since the first goal was to build something that can grab data in any situation, this limitation is a serious drawback.
Library Implementation
The programmer-friendly way to work with the software. The library is written in a specific language, so only users of that language can work with it.
Pros
- Flexible: when the library follows sound software principles, it allows users to do almost everything they want
- Less development time: the product can be made in plain C#
- Cheaper support: no need to have a database or a domain to share development results
- Can be self-hosted: the MIT License allows using the library for any purpose
Cons
- Limited audience: sharply reduces possible users to C# programmers
- Harder to start: users have to download the repo and write some code to get a working example
The Main Problems This Library Tries to Solve
- Decrease the amount of routine work: writing a typical crawler is not hard, but it takes a lot of an engineer's time.
- Simplify support: sometimes the requested resources change their structure, which leads to code rewriting. Quickly developed crawlers often have a bad architecture, and it is easier to rewrite them completely than to make the changes.
- Better testability: a strongly typed library surfaces type errors as early as possible, and the model's defined properties can be tested like usual C# classes.
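To illustrate the testability point, here is a sketch of a regular unit test built on the mini-example shown later in this document. xUnit is assumed as the test framework, and the surrounding `using` directives are illustrative:

```csharp
using Microsoft.Extensions.Logging.Abstractions;
using Xunit;

public record OnePage(string Title) : ICrawlingModel;

public class OnePageSchemaTests
{
    [Fact]
    public async Task Schema_ParsesTitle_FromStaticHtml()
    {
        // Type errors in the mapping fail at compile time,
        // because the builder is bound to the OnePage record.
        var schema = new AngleSharpSchemaBuilder<OnePage>()
            .HasProperty(x => x.Title, ".title")
            .Build();

        var parser = new AngleSharpParser(new NullLoggerFactory());

        // The result is a plain C# record, so it can be asserted directly.
        var model = await parser.RunAsync(schema, "<html><p class='title'>Hi</p></html>");
        Assert.Equal("Hi", model.Title);
    }
}
```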
How To Use The Library
- Users should choose the type of crawling schema builder. The chosen class defines which actions are available for nodes. Built-in builders exist for static HTML, dynamic HTML, and even static XML. In fact, an implementation for almost any tree structure can be added: the user implements a parser class for the required node type and a set of related classes that define the node methods.
- Then the user builds a schema, as shown in these examples for static HTML and dynamic HTML
- The schema can be run via the parser class for the specified schema: static HTML or dynamic HTML. Or see the mini-example:
```csharp
// The model the parser fills; ICrawlingModel marks it as a crawling result.
public record OnePage(string Title) : ICrawlingModel;

// Describe how to map the HTML document onto the model.
var schema = new AngleSharpSchemaBuilder<OnePage>()
    .HasProperty(x => x.Title, ".title")
    .Build();

// Run the schema against raw HTML.
var parser = new AngleSharpParser(new NullLoggerFactory());
var model = await parser.RunAsync(schema, "<html><p class='title'>Hi</p></html>");
Assert.Equal("Hi", model.Title);
```
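Coming back to the point about supporting almost any tree structure: the sketch below hints at what a parser for a new node type could look like. Every name in it is hypothetical and illustrative only; it is not the library's real extension API, just a picture of the navigation logic such a parser would have to provide.

```csharp
using System.Text.Json;

// Hypothetical sketch: JsonTreeParser and SelectNode are illustrative
// names, NOT part of the library's actual API.
public class JsonTreeParser
{
    // Resolve a dotted path like "page.title" against a JSON element,
    // returning null when any segment is missing.
    public JsonElement? SelectNode(JsonElement root, string path)
    {
        var current = root;
        foreach (var segment in path.Split('.'))
        {
            if (!current.TryGetProperty(segment, out current))
                return null;
        }
        return current;
    }
}

// The related classes mentioned above would then expose node methods
// (reading a string, a number, a child list, ...) on top of this navigation.
```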
Challenges
The main solved problems will be described in separate articles:
- How the library can be strongly typed for the client despite the inner layer using untyped delegates and working with `object`
- How to make a common API for sync and async cases without increasing the amount of code
- How to make the API flexible enough to allow the customer to do even what the developer did not cover
Timeline
- Jun 2022: Base version with dynamic and static HTML parsing support
- Oct 2022: Refactoring that separated the common crawling logic from mode-specific details
- Apr 2023: Added the base class for a crawler job: an ASP.NET host that runs crawling on a schedule
- May 2024: Refactoring that made it possible to add new tree-structure crawlers in one or two hours
Real Use Cases
The library is widely used in the SPB Real Estate project. Crawlers for the main advertisement sites, Avito and Cian, are made with the library and launched as jobs.