How we get data for text analysis

How we get data for text analysis

event_note 25.04.2019

Data for text analysis is often available only in web presentations in an unstructured form. How to get the data as easily as possible?

For the purpose of downloading text from websites, there are specialised tools called scrapers or crawlers. For some programming languages, there are frameworks which considerably simplify the creation of the scraper tool for individual websites. We use one of the most popular frameworks called Scrapy written in Python.

As a practical example, we can mention a tool for collecting subsidy incentives as supporting documents for Dotační manager (Grant Manager), the largest portal about grants in the Czech Republic, which unites “calls” from various public sources. The tool automatically looks through the structure of a portal, such as Agency for business and innovations, finds a website of s subsidy call and mechanically processes it into a structured format. The tool can be run repeatedly so that it can also catch newly published calls. The whole tool including the source code is available at http://git.pef.mendelu.cz/MTA/oppik-scraper/.

The above example is fairly simple, however, the practice tends to be more complicated. Website structure of every portal varies, often, it is not unified even within one portal, it changes in time etc. In order not to write similar tools for individual sources again and again, we are developing our own robust crawler which is able to automatically get text data from various sources.

Vladimír Vacula