Description of the work package
The aim of this pilot is to demonstrate, through concrete estimates, which approaches (techniques, methodology, etc.) are most suitable for producing statistical estimates in the domain of job vacancies, and under which conditions these approaches can be used in the ESS. The aim is not to develop a system suitable for production. The pilot will focus on feasibility and will explore a mix of sources, including job portals, job adverts on enterprise websites, and job vacancy data already collected by third parties, from whom the data may be acquired.
For SGA-1, the work package will focus on job portals and third party sources. For SGA-2 the intention is to explore the potential of capturing vacancies from enterprise websites using the approaches developed by WP 2 (subject to progress and SGA-2 funding). These two work packages will co-ordinate their work.
Methodological, quality and technical results of the work package, including intermediate findings, will be used as inputs for the envisaged WP 8 of SGA-2, if SGA-2 goes ahead. When carrying out the tasks listed below, care will be taken to store these results for later use, using the facilities described under WP 9.
Tasks
Task 1 – Data access
- Prepare an inventory of relevant job portals in each participating country. For this, the method to compile and maintain a list/register of job portals will be investigated.
- Literature review on previous research into producing job vacancy statistics from web data.
- Qualitative assessment of the information available (e.g. kind of information provided regarding: job title, occupation, economic activity, location, etc.).
- Conceptual analysis in comparison with the conceptual standards of current job vacancy statistics and identification of variables to be used for testing.
- Assess the coverage of potential job portal data against existing job vacancy statistics and identify country-specific gaps (i.e. industry and/or occupation sectors that are not well covered by job portals).
- Investigate the feasibility of accessing relevant web scraped data already collected by others (e.g. EURES, Wanted Analytics, Textkernel). This will also include the feasibility of gaining access directly from the owners of job portals.
- Evaluate the role of business and/or administrative registers to support the collection and quality assurance of web scraped data.
- Evaluate the availability of information on the enterprise publishing the vacancy, and the possible use for linkage with business registers, which could be used as input for the investigation of enterprise characteristics by WP 2.
- The task of obtaining a list of URLs for scraping job vacancies from enterprise websites will be taken forward by WP 2.
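The coverage assessment described above could start from a simple comparison of advert counts with official vacancy counts per sector. The following is only a minimal sketch under stated assumptions: the sector names, all figures, the function names and the 0.5 threshold are invented for illustration and are not taken from any real source. It also ignores the conceptual difference between a job advert and a job vacancy, which the conceptual analysis is meant to address.

```python
# Toy figures, invented purely for illustration; these are not real statistics.
official_vacancies = {"Manufacturing": 1200, "Construction": 800, "Health": 1500}
portal_adverts = {"Manufacturing": 900, "Construction": 200, "Health": 1350}


def coverage_ratios(portal, official):
    """Return portal adverts as a share of official vacancies, per sector."""
    ratios = {}
    for sector, official_count in official.items():
        if official_count > 0:  # skip sectors with no official vacancies
            ratios[sector] = portal.get(sector, 0) / official_count
    return ratios


def coverage_gaps(ratios, threshold=0.5):
    """List sectors whose ratio falls below the (assumed) threshold."""
    return sorted(sector for sector, ratio in ratios.items() if ratio < threshold)


ratios = coverage_ratios(portal_adverts, official_vacancies)
gaps = coverage_gaps(ratios)

for sector, ratio in sorted(ratios.items()):
    print(f"{sector}: {ratio:.2f}")
print("Possible coverage gaps:", gaps)
```

A ratio above 1 is possible in practice, for instance when the same vacancy is advertised several times, so such a ratio should be read as a first indicator rather than as a coverage rate.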
Task 2 – Data handling
- Study the technical aspects of webscraping job portals (e.g. dealing with blocking mechanisms), and evaluate legal aspects. It is envisaged that the collection of data from job portals will be achieved mainly through the use of simple point-and-click tools.
- Design a set of webscraping experiments of job portals with the aim of exploring different aspects including coverage of different employment sectors, and stability of data over time.
- Identify technical requirements for webscraping, prepare a suitable IT environment and design and build a database for storing and processing web scraped data.
- In addition to the point-and-click tools, build and test robots in accordance with the specified design criteria.
- Deploy a data collection system for job portals and maintain the prototype data collection system in line with the design of the experiment.
- Where appropriate, obtain data directly from owners of job portals or other third parties and design systems to store, process and analyse the data. When buying web scraped data from third parties, quality assurance activities will be carried out.
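As an illustration of the database mentioned above, the sketch below stores each scraped advert together with the date on which it was observed, so that the stability of the data over time can be examined later. It is an assumption-laden example rather than the design the work package will adopt: the table name `job_advert`, its columns, and the functions `create_store`, `save_advert` and `adverts_per_day` are all hypothetical, and SQLite (from the Python standard library) merely stands in for whatever IT environment is chosen.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open a database and create the (hypothetical) advert table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS job_advert (
            portal     TEXT NOT NULL,  -- name of the job portal
            advert_id  TEXT NOT NULL,  -- identifier used by the portal
            scraped_on TEXT NOT NULL,  -- observation date, ISO 8601
            job_title  TEXT,
            employer   TEXT,
            location   TEXT,
            raw_text   TEXT,           -- unprocessed advert text
            PRIMARY KEY (portal, advert_id, scraped_on)
        )
        """
    )
    return conn


def save_advert(conn, advert):
    """Insert one advert; re-scraping the same advert on the same day overwrites it."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO job_advert VALUES "
            "(:portal, :advert_id, :scraped_on, :job_title, "
            ":employer, :location, :raw_text)",
            advert,
        )


def adverts_per_day(conn):
    """Count stored adverts per observation date."""
    rows = conn.execute(
        "SELECT scraped_on, COUNT(*) FROM job_advert "
        "GROUP BY scraped_on ORDER BY scraped_on"
    )
    return rows.fetchall()


# Invented example records.
conn = create_store()
base = {"employer": "Acme Ltd", "location": "Leeds", "raw_text": "..."}
save_advert(conn, {**base, "portal": "portal_a", "advert_id": "1",
                   "scraped_on": "2016-03-01", "job_title": "Data analyst"})
save_advert(conn, {**base, "portal": "portal_a", "advert_id": "2",
                   "scraped_on": "2016-03-01", "job_title": "Nurse"})
save_advert(conn, {**base, "portal": "portal_a", "advert_id": "1",
                   "scraped_on": "2016-03-02", "job_title": "Data analyst"})
counts = adverts_per_day(conn)
```

Keeping the observation date in the primary key means that an advert seen on several days is stored once per day, which is one possible way to track how long adverts remain online.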
Task 3 – Methodology for output production
- Develop the necessary data processing steps to transform semi-structured web data from job portals into a structure suitable for analysis. This is expected to include steps such as:
  - Data cleaning, correction of formatting errors, evaluation and treatment of missing data.
  - De-duplication of job adverts both within and across job portals.
  - Classification of data (e.g. occupation and geography coding).
- Explore the feasibility of linking employer information obtained from job portals back to business registers.
- Initial quality assessment of job portal data (e.g. assessment of job adverts advertised by agencies versus direct employers; evaluation of coverage by job portal/employment sector/industry sector).
- Visualise results for at least 4 Member States to show potential of the methods.
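To make the cleaning and de-duplication steps listed under this task more concrete, the sketch below normalises a few text fields and treats adverts with the same normalised job title, employer and location as duplicates, both within and across portals. This is one simple, exact-match approach chosen for illustration, not the methodology the work package will develop. The field names and the functions `normalise`, `advert_key` and `deduplicate` are hypothetical, and the example adverts are invented.

```python
import hashlib
import re
import unicodedata


def normalise(text):
    """Lower-case, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text or "")
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def advert_key(advert):
    """Build a comparison key from the normalised title, employer and location."""
    parts = (
        normalise(advert.get("job_title")),
        normalise(advert.get("employer")),
        normalise(advert.get("location")),
    )
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


def deduplicate(adverts):
    """Keep the first advert seen for each key, preserving input order."""
    seen = {}
    for advert in adverts:
        seen.setdefault(advert_key(advert), advert)
    return list(seen.values())


# Invented example adverts from two hypothetical portals.
adverts = [
    {"portal": "portal_a", "job_title": "Data Analyst",
     "employer": "Acme Ltd.", "location": "Leeds"},
    {"portal": "portal_b", "job_title": "  data analyst ",
     "employer": "ACME Ltd", "location": "Leeds"},
    {"portal": "portal_a", "job_title": "Nurse",
     "employer": "City Hospital", "location": "Leeds"},
]
unique = deduplicate(adverts)
print(len(adverts), "adverts,", len(unique), "after de-duplication")
```

Exact matching on a key of this kind will miss near-duplicates, such as adverts whose titles are worded differently, so some form of approximate matching would probably be needed in practice.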
Task 4 – Future perspectives (foreseen for SGA-2)
- Explore the feasibility of webscraping job vacancies from enterprise websites using the approach developed by WP 2. [1]
- Depending on feasibility, develop a methodology to produce experimental job vacancy estimates for 5-6 Member States. Compare the different web scraped estimates with selected current job vacancy estimates in terms of quality, timeliness, level of detail of job specifications, and level of detail of the output produced.
- Explore new statistical products in the domain of job vacancies.
- Explore whether the findings of this pilot can be used for new applications, for example relating vacancies to the demand for skills.
- Explore whether the findings of this pilot may lead to better international comparability.
- Preparation of final technical report.
Deliverables (SGA-1 only)
1.1 | Inventory and qualitative assessment of job portals | month 6 |
1.2 | Interim feasibility report | month 10 |
1.3 | Final technical report of SGA-1 period | month 18 |
Milestones (SGA-1 only)
1.4 | Progress and technical report of 1st internal work package meeting | month 4 |
1.5 | Progress and technical report of 2nd internal work package meeting | month 9 |
1.6 | Simulated production system deployed | month 12 |
1.7 | Simulated production system decommissioned | month 18 |
[1] This approach will not be fully developed by WP 2 during SGA-1 and so there is a risk that there will not be sufficient time to investigate this option.