Web scraping refers to any process designed to automatically gather information from the Internet (specifically, the World Wide Web). It is an innovative and rapidly evolving field, since several breakthroughs still stand between the current state of the art and the ability to accurately and automatically convert web content into structured data that can be used effectively. At present, fully automated web scraping systems are quite limited in their capabilities, and full accuracy still requires human intervention. This may change as advances are made in artificial intelligence, semantic interpretation, and text processing.

Web Scraping Techniques

Human Copy-Paste

This is exactly what it sounds like. There are certain situations where no automated technology can stand in for an astute human who identifies and then copies the sought-after information by hand. This technique is especially necessary for recovering information from websites that are set up specifically to block automated data acquisition.

Grepping Text / Regular Expression Matching

One simple technique that can potentially yield vast amounts of data is regular expression matching. This is still commonly known as “grepping” after the grep command from UNIX. Many other programming languages (e.g. Python, Perl) have similar regular expression matching capabilities.
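
As a minimal sketch of the technique in Python, the snippet below fetches a page with the standard library and greps it for email-address-like strings. The URL and the pattern are placeholders; real-world patterns usually need tuning for the site being scraped.

```python
import re
import urllib.request

# Fetch a page (example.com is a placeholder target).
html = urllib.request.urlopen("http://example.com").read().decode("utf-8", errors="replace")

# Pull out anything that looks like an email address.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html)
for address in set(emails):
    print(address)
```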

HTTP Programming

A very easy way to archive complete data from a site is to request pages (either static or dynamic) from the site’s server via socket programming.
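
To make the idea concrete, here is a hedged Python sketch that retrieves a page over a raw socket using a hand-written HTTP/1.1 request; example.com stands in for whatever site is being archived.

```python
import socket

host = "example.com"  # placeholder host
request = (
    "GET / HTTP/1.1\r\n"
    f"Host: {host}\r\n"
    "Connection: close\r\n"
    "\r\n"
)

# Open a TCP connection, send the request, and read until the server closes.
with socket.create_connection((host, 80)) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while chunk := sock.recv(4096):
        response += chunk

# The archived page is everything after the blank line that ends the headers.
headers, _, body = response.partition(b"\r\n\r\n")
print(body.decode("utf-8", errors="replace"))
```

In practice, a higher-level HTTP client is usually preferable, but working at the socket level shows that nothing more than a TCP connection and a well-formed request is required.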

Parsing HTML

On many websites, individual pages are generated on the fly from an underlying source of structured data, typically a database. Data mining often involves a program, called a wrapper, that infers the format of the underlying data source from the generated HTML pages, categorizes the data, and converts it into usable information. The algorithms that govern wrappers operate on the assumption that the pages they’re fed all have a similar organization and are addressed by a common URL scheme. There are some languages designed explicitly for data query that perform similar HTML-parsing functions. Two of the most common are XQuery and HTQL.
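
As an illustration, the following Python sketch implements a toy wrapper with the standard library’s HTMLParser. The assumption that every record appears as a td element with class "price" is invented for the example; a real wrapper would be built around whatever convention the target pages actually share.

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Toy wrapper: collects the text of every <td class="price"> cell."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

parser = PriceExtractor()
parser.feed('<table><tr><td class="price">$9.99</td></tr></table>')
print(parser.prices)  # ['$9.99']
```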

DOM Parsing

A program that embeds a complete web browser (e.g. Mozilla Firefox or Internet Explorer) is capable of extracting dynamic content generated by client-side scripting. Such a program can also parse the retrieved pages into a DOM tree, from which other programs can pick out the specific pieces of information they need.
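
As a hedged sketch, the snippet below drives Firefox through Selenium, one common way to control a full browser from Python; it assumes the selenium package and a Firefox driver are installed, and the URL is a placeholder.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # launches a real browser; requires geckodriver
try:
    driver.get("http://example.com")  # placeholder URL
    # By now the browser has executed any client-side scripts, so the DOM
    # reflects dynamically generated content that plain HTTP requests miss.
    for heading in driver.find_elements(By.TAG_NAME, "h1"):
        print(heading.text)
finally:
    driver.quit()
```

Driving a full browser is slower and heavier than direct HTTP requests, which is the usual trade-off for access to script-generated content.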

Software For Web-Scraping

Many software tools are available that allow their operators to build customized web-scraping solutions. Common features include action recording (macros), so that writing scraping code isn’t necessary; data-structure recognition; scripting facilities that can automatically extract and convert data; and interfaces for exporting scraped information to common database software.

Vertical Aggregation

Several organizations have invested in building their own vertically focused data harvesters. These systems create and then manage multiple independent “bots”, each tasked with scraping a specific vertical. These harvesting platforms operate without direct human involvement and without code tied to any specific target site. In order to be effective, they need a robust knowledge base that covers the whole vertical comprehensively. Any given harvesting platform can be judged according to two factors: the quality of the data it gathers (e.g. how many fields it can pick up) and how well it scales across large numbers of different sites. Harvesters with good scalability excel at scraping data from the long tail of sites, where ordinary aggregators cannot extract meaningful data without excessive human involvement.
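
The following Python sketch is purely illustrative of that bot-per-vertical architecture; every name in it (KNOWLEDGE_BASE, VerticalBot, harvest) is hypothetical rather than taken from any real platform.

```python
# Hypothetical knowledge base: the fields each vertical's bot should recover.
KNOWLEDGE_BASE = {
    "real_estate": ["price", "bedrooms", "address"],
    "used_cars":   ["price", "mileage", "model"],
}

class VerticalBot:
    """One independent bot per vertical, with no site-specific code."""

    def __init__(self, vertical, fields):
        self.vertical = vertical
        self.fields = fields  # the fields this bot tries to pick up

    def harvest(self, page_text):
        # A real bot would locate each field in the page; this stub only
        # reports which fields it would look for.
        return {field: None for field in self.fields}

# The platform spawns the bots and runs them with no human in the loop.
bots = [VerticalBot(v, f) for v, f in KNOWLEDGE_BASE.items()]
for bot in bots:
    print(bot.vertical, "->", bot.harvest("<html>...</html>"))
```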

Recognition Of Semantic Annotation

Sites targeted for scraping sometimes feature rich metadata or semantic markup, which makes it much easier to locate certain types of data. Annotation systems that embed markup directly into web pages (e.g. Microformats) can be scraped in a way that more or less amounts to another form of DOM parsing. Other scraping systems collect and store semantic annotations separately from the page data.
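
As a minimal sketch, the parser below pulls hCard-style “fn” (formatted name) values out of Microformats markup using Python’s standard library; a production Microformats parser would handle many more properties.

```python
from html.parser import HTMLParser

class HCardNames(HTMLParser):
    """Collects text from elements carrying the hCard "fn" class."""

    def __init__(self):
        super().__init__()
        self.capture = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if "fn" in classes:
            self.capture = True

    def handle_data(self, data):
        if self.capture:
            self.names.append(data.strip())
            self.capture = False

parser = HCardNames()
parser.feed('<div class="vcard"><span class="fn">Jane Doe</span></div>')
print(parser.names)  # ['Jane Doe']
```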

Page Analyzers Using Computer Vision

Currently experimental, these systems use machine learning to replicate the way human visitors visually interpret web pages.