
WP1 Meeting 2017 02 01 Virtual Sprint

Introduction

A third WP1 “virtual sprint” was held on 1 February 2017, focusing on quality. We decided to explore two different quality frameworks:

 

  1. The Quality Assessment Framework used by Statistics New Zealand as a reporting tool for administrative data quality. The aim was to test the suitability of this framework for web scraping of on-line job advertisements, partly in response to some initial proposals put forward by WP8 for approaching big data quality.
  2. The Framework for the Quality of Big Data developed by the UNECE Big Data Quality Task Team.

 

Sweden, Germany, Greece and the United Kingdom focused on the first, while Slovenia focused on the second.

 

Statistics New Zealand Quality Framework for Administrative Data

The framework serves as a tool to help identify imperfections and errors that can arise when stepping down from ideal concepts and populations to the captured data we obtain in practice. It is split into two phases, allowing a single dataset to be evaluated in isolation against the purpose for which the data were collected (phase 1), as well as the process of combining variables and objects from several datasets to measure a target statistical concept (phase 2). In what follows we explain the individual phases in more detail and give examples relevant to our scenario of web scraping for job vacancies (JV).

 

Phase 1

Phase 1 of the framework applies to a single dataset in isolation and considers errors in terms of measurement (variables) and representation (objects). 

The measurement side describes the errors occurring when stepping down from the abstract target concept we would ideally measure to the concrete variable we eventually work with. The representation side, on the other hand, describes the types of errors arising when narrowing down the ideal (target) population to the set of objects we eventually observe. The output is a single-source micro dataset that could be used for different statistical purposes. The framework is designed to support the measurement of total “survey” error.

[Figure 1: https://sites.google.com/site/jvpilotons/_/rsrc/1498560195342/home/virtual-sprint-1st-february-2017/figure-1.png]

As an example, consider the following scenario of web scraping for job vacancy counts from a given job portal.

 

Measurement:

The measurement side describes the steps from the abstract target concept through to an edited value for a concrete variable.

  • Target concept: Number of job vacancies per company
    • This concept is not directly measurable (job vacancies are "abstract"); however, we can measure the number of job ads per company. This gives rise to a validity error, since a job advertisement is an indicator of, but not a direct representation of, a job vacancy, and there are various ways in which they may differ (e.g. one job ad representing several vacancies)
  • Target measure: Number of job ads per company
    • Although this is now measurable, we may still make a number of measurement errors, e.g. the scraper incorrectly counting the number of job ads on the portal
  • Obtained measure: Scraped number of job ads per company
    • In some cases, we may wish to process the scraped counts, e.g. smooth the values as a rolling average over the past few days (see the sketch after this list). By doing this, we may produce a processing error.
  • Edited measure: Number of job ads per company, scraped and processed
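
To make the processing step concrete, the following is a minimal sketch (in Python, using pandas) of smoothing scraped daily counts with a rolling average, and of how such processing can itself introduce error. The data and column names are hypothetical, not the processing actually used in the pilot.

```python
# A minimal sketch of the processing step: smoothing daily scraped job-ad
# counts with a rolling average. Data and column names are hypothetical.
import pandas as pd

counts = pd.DataFrame({
    "date": pd.date_range("2017-02-01", periods=7, freq="D"),
    "job_ads": [120, 118, 0, 125, 130, 128, 131],  # the 0 may be a failed scrape
})

# 3-day rolling mean: note how the (possibly erroneous) zero drags down the
# neighbouring smoothed values -- a concrete example of a processing error.
counts["job_ads_smoothed"] = counts["job_ads"].rolling(window=3, min_periods=1).mean()
print(counts)
```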

Representation:

The representation side describes the definition and creation of the elements of the population being measured, or ‘objects’.

  • Target set: All company JV counts on the given portal
    • It is not always possible to access the full dataset. This may be because the portal in question restricts scraping, or because it is technologically infeasible to scrape it in its entirety. Narrowing down the target population to what is actually accessible results in a frame error
  • Accessible set: e.g. (if this is what the portal permits) top 1000 JV counts from each industry sector on the portal
    • Not all the accessible data will eventually be accessed. This may be due to a conscious decision (e.g. lack of resources, scraping just a sample) or a mistake (e.g. incorrectly interpreting the portal's T&Cs). These kinds of errors are selection errors.
  • Accessed set: e.g. only top 100 JV counts from each industry sector on the portal
    • Finally, just as in the "measurement" branch, a missing/redundancy error may be introduced by processing the data, for example by removing zero entries, duplicates or outliers (see the sketch after this list).
  • Observed set: processed top 100 JV counts from each industry sector on the portal
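
Analogously, here is a minimal sketch of the representation-side processing step, dropping zero entries and duplicate rows from the accessed set (again with hypothetical data and column names); each dropped row is a potential missing/redundancy error if the zero or the "duplicate" was in fact a genuine observation.

```python
# A minimal sketch of processing the accessed set: removing zero entries and
# duplicate rows. Data and column names are hypothetical.
import pandas as pd

accessed = pd.DataFrame({
    "company": ["A", "B", "B", "C", "D"],
    "job_ads": [12, 7, 7, 0, 340],
})

observed = (
    accessed[accessed["job_ads"] > 0]  # zero entries removed
    .drop_duplicates()                 # duplicate rows removed
)
print(observed)
```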

It is important to note that the examples of the errors are driven by the choice of the individual measures and sets, which effectively serve as an input to the framework. For example, if we chose a different target concept and target measure, there would be different validity errors (see the examples below).

 

Other examples of errors in phase 1:

 

Measurement errors

Validity error (target concept = JV; target measure = job ad):

  • Location data in the ad may lack precision
  • Employer uploading incorrect data
  • One advert may represent more than one vacancy (sub-types: the ad does not make it clear that there is more than one vacancy; the ad indicates that there is more than one vacancy but does not say how many)

Measurement error (in general):

  • Scraper downloading incorrect data from the web page

Processing error (obtained measure = scraped job ad; edited measure = cleaned job ad):

  • Errors that may result from cleaning of variables such as job titles
  • Coding/classification errors
  • Imputation errors

Representation errors

Frame error (target set = all JVs in a country; accessible set = all job ads online):

  • Vacancies advertised through other channels (e.g. by word of mouth) are not accessible
  • Old ads that are not removed from portals
  • Ghost vacancies (i.e. jobs posted by employers that do not actually exist)

Frame error (target set = all job ads online; accessible set = all job ads from job portals):

  • Not all online ads are advertised on job portals (they may also appear e.g. on company websites)

Selection error (accessible set = all job ads from all online portals; accessed set = job ads from selected portals):

  • If the selected portals are e.g. highly specialized, the results may be skewed

Missing/redundancy error (in general):

  • E.g. removing outliers

 

Phase 2

Phase 2 involves combining data from several different sources in order to measure a statistical target concept and population. In this phase, the measurement side focuses on variables and the kinds of errors that may arise when combining and aligning datasets. The representation side focuses on the coverage error of linked datasets, identification error (the alignment between base units and composite linked units) and unit error, which may be relevant if the output involves the creation of new statistical units.

To continue the previous example, consider now the scenario in which we want to compare the counts from the scraped portal with those from a survey.

 

Measurement:

  • Target concept: difference between JV counts from portal and survey for a company
    • The "companies" appearing on the portal do not necessarily align with the "companies" from the survey: while the former are likely to be based on company names as commonly known to the public, the latter are more likely to correspond to entries from a business register (at the ONS in the UK these are called Reporting Units, and there can be many-to-many relationships with the set of "commonly known companies"). This inability to get the required values from our datasets results in a relevance error.
  • Harmonized measure: difference between JV counts for the matching pairs of entries
    • Actually carrying out the matching can produce errors that need to be taken into account, for example the matching algorithm producing a wrong match, or failing to detect a matching pair (see the sketch after this list). This type of error is called a mapping error.
  • Re-classified measure: difference between JV counts for the matched pairs of entries
    • It is possible that the matching will produce multiple matches for a given entry. We may then need to decide how to break the chains of matched entries. Such decisions may lead to comparability errors.
  • Adjusted measure: difference between JV counts for the matched pairs of entries, after post-processing
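
As an illustration of the matching step, here is a minimal sketch of pairing portal company names with survey entries by string similarity. The names and the 0.7 threshold are hypothetical, and real record linkage would use cleaned identifiers and dedicated tooling; the point is only to show where wrong or missed matches (mapping errors) can enter.

```python
# A minimal sketch of matching portal company names to survey entries by
# string similarity. Names and the threshold are hypothetical.
from difflib import SequenceMatcher

portal_names = ["ACME Ltd.", "Foo Bar plc", "Widgets & Co"]
survey_names = ["Acme Limited", "Foobar PLC", "Gadgets Ltd"]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for name in portal_names:
    best = max(survey_names, key=lambda s: similarity(name, s))
    score = similarity(name, best)
    # A wrong best match, or a true pair falling below the threshold, is a
    # mapping error in the framework's terms.
    print(name, "->", best if score >= 0.7 else "unmatched", round(score, 2))
```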

Representation:

  • Target population: differences between JV counts from portal and survey for each company from the survey
    • Naturally, not every entry from the survey has a matching counterpart in the scraped data. This produces a coverage error.
  • Linked sets: differences between JV counts from portal and survey for matching companies
    • Because of the misalignment mentioned above (companies vs. reporting units) between the types of entries in our datasets, we may want to link the datasets at a less granular level, e.g. by parent enterprise. Incorrectly identifying the parent enterprise for individual companies/reporting units may lead to an identification error.
  • Aligned sets: differences between JV counts from portal and survey for matching enterprises
    • Finally, a unit error may occur due to the creation of new statistical units. We were not able to identify errors of this kind.
  • Statistical units: differences between JV counts from portal and survey for matching enterprises

Other examples of errors in phase 2:

 

Measurement errors

Relevance error (target concept = job vacancy in a given format; harmonized measure = job vacancy in a different format):

  • Different levels of information may be present on different portals, resulting in a need to consolidate the final format of the job vacancy when combining these datasets.

Mapping error (harmonized measure = industry sector for a JV; re-classified measure = mapped SIC code):

  • Errors when mapping the information about the industry sector occurring in the ad to a SIC code (see the sketch after this table)

Comparability error (re-classified measure = job vacancy; adjusted measure = job vacancy after deduplication):

  • If multiple portals are scraped, the same vacancy may be found several times, with e.g. conflicting information on salary. Resolving such conflicts can result in comparability errors.

Representation errors

Coverage error (in general):

  • Imperfect matching between entries

Identification error (linked sets = total number of JVs per location; aligned sets = JV counts per county):

  • Suppose we want to produce estimates of JV counts per location, but the amount of location information in individual ads varies (ZIP codes, cities, ...). We may choose to map each location to a low-granularity level such as county. Errors arising at this stage are identification errors.

Unit error:

  • No examples found
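
As an illustration of the mapping error above, the following is a minimal sketch of assigning a SIC code to the free-text industry sector found in an ad. The keyword rules and codes are illustrative only and far cruder than a production coding tool; any ad the rules do not cover, or cover wrongly, is a candidate mapping error.

```python
# A minimal sketch of mapping a free-text industry sector to a SIC code via
# keyword rules. The rules below are illustrative only.
from typing import Optional

SIC_RULES = {
    "software": "62.01",      # computer programming activities
    "restaurant": "56.10",    # restaurants and mobile food service activities
    "construction": "41.20",  # construction of buildings
}

def map_to_sic(sector_text: str) -> Optional[str]:
    text = sector_text.lower()
    for keyword, code in SIC_RULES.items():
        if keyword in text:
            return code
    return None  # unmapped -> candidate mapping error, needs manual coding

print(map_to_sic("Software development company"))  # 62.01
print(map_to_sic("Logistics and haulage"))         # None
```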


Discussion:

We were able to successfully apply the quality framework for administrative data to the scenario of web scraping for job vacancies, which identified several possible sources of error. Phase 1 proved easier to apply and seems to fit well when evaluating a single web-scraped dataset, or the job vacancy survey. Phase 2 was slightly more complex, possibly because in our scenario there appear to be two distinct integration steps. The first corresponds to the integration of data from different job portals, while the second involves the integration of the resulting composite micro data source with the JVS to produce a final statistical output. To a certain extent, phase 2 can be applied in both cases. However, Reid et al. (2017) propose a “Phase 3” that focuses on potential errors resulting from the creation of final estimates from the composite micro-data, suggesting that this may be where integration with survey data should be considered.

 

 

UNECE Framework for the Quality of Big Data

 

Input Phase

HYPERDIMENSION: SOURCE

  • Institutional environment

Type of BD: www data

BD supplier: BD is pulled from “free” sources (depending on legislation issues), e.g. job portal websites. Because IT robots are used for web scraping, there is always a chance of being blocked by the owners of the websites.

 

  • Reliability

 

There is always a possibility that the enterprises which run job portals cease to exist. Under the assumption that no permission is needed to access the JV data, the risk to sustainability over time is low: in that case we would find substitutes (a new job portal owner) providing the same services. If permission to access the JV data were needed from the new owner, the risk would be higher.

On the scale “1: high risk, 2: medium risk, 3: low risk, 0: don't know”, we estimate a medium risk for sustainability over time.

  • Relevance

The relevance of JV data is very high. JV data could be used as the sole source, or as one of several sources, for producing existing statistics. This source could also be used for creating new types of statistics.

  • Privacy and security

JV data are in general not sensitive to privacy and security issues. However, there could be some privacy issues in the case of specialized agencies which advertise vacancies on behalf of other parties (enterprises). They may not be authorized to reveal the name of the enterprise that is searching for a certain employee.

  • Availability/Delivery

Currently there is no written agreement with the owners of job portals which would set out the terms and conditions for accessing their JV data.

Periodicity: periodic and repeated (data scraped weekly)

Punctuality of delivery: /

Cost: /

  • Procedures

SURS uses IT robots (Web scraping studio) for the web collection of JV data…

  • Usage

Owner: usage for advertising purposes, …

NSI: for creating official statistics in the JV domain, and as an additional source in creating early economic indicators.

 

HYPERDIMENSION: METADATA

  • Complexity  

Technical constraints:

JV data are collected by Web scraping studio and stored in Excel files.

 

Structure:

Some of the JV data are well structured (e.g. position, name of the advertiser, date of the advertisement, …). Some of the data are unstructured (e.g. number of employees needed).

For some of the variables the name is specified (name of enterprise, job title, date of advertisement, …); for others the name is not specified (e.g. deadline of application, number of employees needed). There is no description of the variables we are interested in.

 

Readability:

Some of the JV advertisements are in formats (pdf, image, …) which do not allow us to scrape them.

  • Completeness

There is almost no metadata available.

  • Usability

New skills are needed in order to collect and process this kind of data. 

  • Time related factors

Timeliness: / (relates to data provided by a data provider)

Periodicity: / (relates to data provided by a data provider)

Changes through time: possible changes of websites; possible changes of the structure of certain websites

 

  • Linkability

Presence and quality of linking variables:

There are no keys (variables) available which could be used directly as a linking key.

There are, however, some variables which could be used in a record linkage (matching) procedure (see the sketch below):

Name of enterprise + location of JV = id of enterprise from BR

Linking level: /
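
A minimal sketch of this idea, using a normalised (name of enterprise, location of JV) pair as a composite key into the Business Register; the register entries and the normalisation rules are hypothetical:

```python
# A minimal sketch of record linkage via a composite key of normalised
# enterprise name and JV location. Register entries are hypothetical.
def normalise(text: str) -> str:
    return " ".join(text.lower().replace(".", "").split())

business_register = {
    ("acme doo", "ljubljana"): "SI-0001",
    ("foo bar dd", "maribor"): "SI-0002",
}

def link_ad(enterprise_name: str, location: str):
    key = (normalise(enterprise_name), normalise(location))
    return business_register.get(key)  # None means the record stays unlinked

print(link_ad("ACME d.o.o.", "Ljubljana"))  # SI-0001
print(link_ad("Unknown Ltd", "Celje"))      # None
```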

  • Coherence - consistency 

Linking: there are potential (indirect) linking variables, such as the name of the enterprise, which could be used for linkage with the BR.

Consistency: there are some inconsistencies (anomalies) in the data (values out of range, …) and missing values (number of employees, deadline for applications, …).

  • Validity

Transparency of methods and processes:

All methods and processes for the phases of data are documented (…).

The methods and processes by which the JV data are generated are not known.

 

HYPERDIMENSION: DATA

Accuracy and Selectivity

Overcoverage: overcoverage occurs because, when we scrape data, we also obtain job ads for work outside Slovenia as well as job ads for student work; these represent about 17 % of all scraped job ads (a filtering sketch follows at the end of this section).

Undercoverage: this occurs if we do not collect job ads for companies from certain SKD (Slovenian Standard Classification of Activities) activities because they are not present on the portals (perhaps they are advertised somewhere else). We have not yet assessed the amount of undercoverage.

Under/over-representation, exclusion of sub-populations: under-representation and exclusion are possible for some SKD activities; we have not assessed this yet. It is also possible that some occupational groups are not advertised (as much) on our biggest job portals, from which we obtain our data.

Missing data: we face missing data for two variables:

  • Name of the company: specialized recruitment agencies recruit people via job portals, but they do not say for which company they are searching for new staff.
  • Deadline: on some job portals the deadline is not published at all; on portals where this item does exist, there are some ads where it is nevertheless missing.

Reference datasets: we do not have any reference dataset.

Duplicates: these take the form of the same job vacancy being published on more than one job portal. We have not detected any duplicates within any single job portal.

Acceptable range of data: most of the data are within the acceptable range; however, the place of work is not always inside that range, as some ads advertise jobs abroad. We would also like to have the place of work at the level of the municipality, but in reality that is not always the case.
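
As referenced under overcoverage above, here is a minimal sketch of filtering out-of-scope ads (work outside Slovenia and student work); the flags and column names are hypothetical, and the toy data does not reproduce the real 17 % share.

```python
# A minimal sketch of removing overcoverage from scraped ads: jobs outside
# Slovenia and student work. Column names and flags are hypothetical.
import pandas as pd

ads = pd.DataFrame({
    "title": ["Developer", "Waiter (Austria)", "Student help", "Accountant"],
    "country": ["SI", "AT", "SI", "SI"],
    "student_work": [False, False, True, False],
})

in_scope = ads[(ads["country"] == "SI") & (~ads["student_work"])]
print(f"overcoverage share: {1 - len(in_scope) / len(ads):.0%}")
print(in_scope)
```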

 

Linkability

Variables being linked: we link the company name with the Slovenian Business Register, and we will link the place of work with NUTS and the title of the job post with the Standard Classification of Occupations.

Percentage of linkage: about 95 % of company names are linked unambiguously, about 3.5 % are linked ambiguously, and about 1.5 % of company names are not linked (see the sketch below).

Data integration with other files: we could use the title of the job post for easier removal of duplicates between job portals and between different data sources (data from company websites, data from an administrative source), which would make the integration of data sources more accurate.
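
A minimal sketch of computing the linkage rates reported above; the per-record match outcomes are hypothetical, chosen only to reproduce the reported shares:

```python
# A minimal sketch of summarising record-level match outcomes into linkage
# percentages. The outcome list is hypothetical.
from collections import Counter

outcomes = ["unambiguous"] * 950 + ["ambiguous"] * 35 + ["unlinked"] * 15

counts = Counter(outcomes)
total = sum(counts.values())
for category in ("unambiguous", "ambiguous", "unlinked"):
    print(f"{category}: {counts[category] / total:.1%}")
# unambiguous: 95.0%, ambiguous: 3.5%, unlinked: 1.5%
```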

Consistency:

Validity:

The data currently measure the number of job ads, not the number of job vacancies that are advertised via job portals.

Discussion:

The assessment using the quality framework was not completed due to time constraints. However, the work to date suggests that this framework is suitable.

 

Conclusion:

Overall, the UNECE framework seems more intuitive and could therefore be a better starting point for identifying quality issues with on-line job advertisement data. However, the Statistics New Zealand quality framework is designed to support a total survey error approach, which could further deepen the accuracy and selectivity dimension of the UNECE framework. These elements would become more important when considering how on-line job advertisements could be moved into statistical production.
