In this talk he will introduce Frontera (https://github.com/scrapinghub/frontera), a new open source framework for building real-time, large-scale, distributed web crawlers as well as crawlers focused on specific websites. It offers:
- customizable storage (RDBMS or Key-Value based),
- crawling strategies management,
- transport layer abstraction,
- fetcher abstraction.
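
To illustrate how such pluggable pieces might fit together, here is a minimal, framework-independent sketch. It does not use Frontera's actual API: `Storage`, `InMemoryStorage`, `crawl`, `fake_fetch` and `FAKE_WEB` are hypothetical names invented for this example, the crawling strategy is assumed to be plain breadth-first with a depth limit, and an in-process queue stands in for a real transport such as a message bus.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List, Protocol, Tuple


class Storage(Protocol):
    """Hypothetical storage interface; any backend could implement it."""

    def add(self, url: str, depth: int) -> bool:
        """Record the URL; return False if it was already known."""
        ...


class InMemoryStorage:
    """Key-value style backend (URL -> depth), kept in a dict."""

    def __init__(self) -> None:
        self._depths: Dict[str, int] = {}

    def add(self, url: str, depth: int) -> bool:
        if url in self._depths:
            return False
        self._depths[url] = depth
        return True


def crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], List[str]],  # fetcher abstraction: URL -> links
    storage: Storage,                   # customizable storage
    max_depth: int,
) -> List[str]:
    """Breadth-first strategy (an assumption made for this sketch)."""
    # The deque stands in for the transport layer between the
    # strategy and the fetcher.
    queue: Deque[Tuple[str, int]] = deque()
    visited: List[str] = []
    for seed in seeds:
        if storage.add(seed, 0):
            queue.append((seed, 0))
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in fetch(url):
            if storage.add(link, depth + 1):
                queue.append((link, depth + 1))
    return visited


# A tiny fake link graph so the sketch runs without network access.
FAKE_WEB: Dict[str, List[str]] = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["e"],
}


def fake_fetch(url: str) -> List[str]:
    return FAKE_WEB.get(url, [])


if __name__ == "__main__":
    print(crawl(["a"], fake_fetch, InMemoryStorage(), max_depth=2))
```

Because the storage and fetcher are passed in as arguments, either can be swapped (for example, a database-backed storage or a real HTTP fetcher) without touching the strategy code.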
Along with a description of the framework, he'll demonstrate how to build a distributed crawler using Scrapy, Apache Kafka and HBase, and hopefully present some statistics on the Spanish internet collected with the newly built crawler.