"When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be."
Lord Kelvin

Extracting data from public procurement documents 1998-2004

Although the Hungarian Public Procurement Authority has made all information on public procurement tenders between 1998 and 2004 publicly available online, the data format is unsuitable for statistical analysis. The information is stored in basic HTML files, which provide no interface for sorting or searching the data. In this technical paper we describe the data extraction process we used to turn the HTML-based information into a database by extracting the relevant fields. We used the Python programming language for data cleaning and extraction, which resulted in a database suitable for further analysis. At the end of the paper, basic statistics about the dataset are presented to illustrate how it can be used.
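To give a flavour of this kind of extraction, here is a minimal Python sketch. It is not the code used in the paper. It assumes a hypothetical page layout in which each field appears as a two-cell table row (label, value), and the field labels, the `tender` table and its column names are invented for illustration. The real tender pages may be structured quite differently. Only the standard library (`html.parser`, `sqlite3`) is used.

```python
import sqlite3
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Normalise whitespace inside the cell (basic data cleaning).
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_fields(html):
    """Return a {label: value} dict from two-cell rows (assumed layout)."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return {row[0].rstrip(":"): row[1] for row in parser.rows if len(row) == 2}


# Hypothetical input: invented labels and values, not real tender data.
SAMPLE = """
<html><body><table>
<tr><td>Contracting authority:</td><td>Example  Municipality</td></tr>
<tr><td>Winner:</td><td>Example Ltd.</td></tr>
<tr><td>Contract value:</td><td>1 500 000</td></tr>
</table></body></html>
"""

fields = extract_fields(SAMPLE)

# Store the cleaned fields in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tender (authority TEXT, winner TEXT, contract_value INTEGER)"
)
conn.execute(
    "INSERT INTO tender VALUES (?, ?, ?)",
    (
        fields["Contracting authority"],
        fields["Winner"],
        int(fields["Contract value"].replace(" ", "")),  # strip thousands spaces
    ),
)
rows = conn.execute(
    "SELECT authority, winner, contract_value FROM tender"
).fetchall()
print(rows)
```

Once the fields are in a table like this, sorting, searching and simple statistics become ordinary SQL queries, which is exactly what the raw HTML files do not allow.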

We are aware that this work should be the job of the Hungarian state, not ours.

PDF (in Hungarian with English summary)