MadStore is made up of three main components: the crawler, the repository and the server.
Let's take a look at each one.
The crawler component is responsible for crawling one or more target sites, applying a breadth-first algorithm to each crawled site.
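As an illustration of what breadth-first crawling means here, the following is a minimal sketch and not MadStore's actual implementation. The function name `crawl_breadth_first`, the `get_links` callback and the `max_pages` limit are all hypothetical, and a small in-memory dictionary stands in for a real web site so that the example runs without network access.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_breadth_first(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 100,
) -> List[str]:
    """Visit pages level by level from start_url and return the visit order.

    get_links is a hypothetical callback returning the links found on a page.
    """
    visited = {start_url}       # pages already queued, to avoid revisiting
    queue = deque([start_url])  # FIFO queue gives breadth-first order
    order: List[str] = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order


# Toy site used in place of real HTTP requests (an assumption for the sketch).
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": ["/c", "/d"],
    "/c": [],
    "/d": ["/a"],
}

order = crawl_breadth_first("/", lambda url: SITE.get(url, []))
print(order)
```

The FIFO queue is what makes the traversal breadth-first: every page linked from the start page is visited before any page that is two links away.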
For each visited page, the crawler will execute the following operations:
Different target sites are crawled in parallel, and so are the crawler operations described above, where possible. Several aspects of the crawler behaviour can be controlled by setting the following parameters in the MadStore configuration file:
In addition, the crawler can run in two distinct modes: local and grid.
In local mode, the crawler will work in the same process as the MadStore web application.
In grid mode, the crawler works as a distributed application: the master node runs in the same process as the MadStore web application, and one or more grid nodes run as separate web applications, either on the same host or on different ones.
The MadStore main application, which contains the master crawler node, and the MadStore crawler grid nodes communicate through IP multicast, dispatching to one another tasks that represent parts of the whole crawling process.
The local crawler is a lightweight multithreaded implementation suited to crawling a few small-to-medium web sites.
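To give a rough idea of how several sites can be crawled in parallel inside a single process, here is a minimal sketch based on a thread pool. It is only an illustration under stated assumptions, not MadStore's code: `crawl_site`, `crawl_sites_in_parallel`, `get_links` and `max_workers` are hypothetical names, and the links come from an in-memory dictionary rather than from real HTTP requests.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def crawl_site(start_url: str, get_links: Callable[[str], Iterable[str]]) -> List[str]:
    """Breadth-first crawl of a single site; returns pages in visit order."""
    visited = {start_url}
    queue = deque([start_url])
    order: List[str] = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order


def crawl_sites_in_parallel(
    start_urls: List[str],
    get_links: Callable[[str], Iterable[str]],
    max_workers: int = 4,
) -> Dict[str, List[str]]:
    """Crawl each target site in its own worker thread."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of start_urls in its results.
        results = pool.map(lambda url: crawl_site(url, get_links), start_urls)
        return dict(zip(start_urls, results))


# Two toy sites standing in for real target sites (an assumption for the sketch).
LINKS = {
    "siteA/": ["siteA/1", "siteA/2"],
    "siteA/1": [],
    "siteA/2": ["siteA/1"],
    "siteB/": ["siteB/x"],
    "siteB/x": ["siteB/"],
}

pages = crawl_sites_in_parallel(["siteA/", "siteB/"], lambda url: LINKS.get(url, []))
print(pages)
```

Each site keeps its own queue and visited set, so the worker threads share no mutable state and need no locking in this sketch.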
The grid crawler is implemented as a distributed application that provides horizontal scalability and optimal performance when crawling several large web sites.
However, thanks to MadStore's ease of use, switching between the two only requires changing a single configuration parameter and, where needed, deploying or undeploying the external grid nodes.
The repository component is where all extracted Atom content is stored. It is responsible for:
It is possible to control several aspects of the repository behaviour by setting the following parameters:
All configurations are applied, as always, by editing the MadStore configuration file.
The server component is responsible for implementing the Atom Publishing Protocol and OpenSearch specifications in order to serve the Atom web feeds to end users.
It provides the following capabilities: