Blog

Web Scraping Tools

In a previous post, we explained the basics of this topic: What is web scraping, and how would you write software that performs it?

Programming, however, is a specialized skill that takes time and effort to master. In another post, we introduced the declarative web scraping language OXPath, which can help non-programmers get a web scraper up and running in less time.

In Smart Harvesting II, we asked ourselves: What kind of tool would a librarian need in order to extract bibliographic metadata from the Web? We initially focused on OXPath, but we soon realized that, even though this declarative language is easier to read and write than a script in a full-blown programming language, it still poses hurdles that make it less than ideal for our user group.

We also realized that a good number of web scraping tools suitable for laypeople have become available in the meantime.

In this post, we give an overview of the web scraping tools that we consider most promising.

Integrating OXPath into the DDA

In this post, we present one interesting outcome of the Smart Harvesting project: the integration of an OXPath-based web scraping module into the Document Deposit Assistant (DDA).

OXPath for Web Scraping

In our previous blog post, we discussed the options programmers have for creating web scrapers in several programming languages. This approach to web scraping works well for people with a solid background in programming. Unfortunately, far more people need to extract data from the web but lack the necessary programming skills.

Web Scraping

A main focus of the Smart Harvesting II project was web scraping. In this post, we explain what web scraping (also called web data extraction) really is, and how you would write software that performs this task.

© IR-Group, TH Köln. All rights reserved.  