WP1 Sprint 2016 07 28-29 Virtual Notes

Notes from Virtual Sprint on Deduplicating Job Ads (28-29 July)

1. UK Report

Introduction:

The main objective of the ONS/UK team was to explore approaches for using dedupe (https://github.com/datamade/dedupe) to identify potential duplicates within job portal data.

The participants were: Nigel Swier, Robert Breton, Liz Metcalfe and Vidhya Shekar

Overview of Dedupe:

Dedupe is a Python library for scalable data deduplication and entity resolution. It uses the results of active learning to derive the parameters used to identify potential duplicates, so it can find records that are very similar but not identical. The first step is to prepare the data set and select the variables to be used for matching. Dedupe then makes an initial matching run using logistic regression and presents back pairs of potential duplicates, skewed towards those the matching algorithm is most uncertain about. These pairs are manually labelled as duplicates (y), non-duplicates (n) or uncertain (u). The resulting training data are used to produce a new set of weights, which are then rerun against the dataset. The output is the original dataset with records grouped into clusters, together with a similarity score for each potential set of matches.
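
To make this workflow concrete, a minimal sketch using the dedupe library is shown below. The field names, records and threshold are illustrative, the exact function names differ somewhat between dedupe versions, and this is not the code used in the sprint.

    import dedupe

    # Records keyed by an id, restricted to the fields chosen for matching
    data = {
        0: {'job_title': 'net developer', 'company': 'acme recruitment',
            'location_city': 'stoke-on-trent'},
        1: {'job_title': 'net developer', 'company': 'acme ltd',
            'location_city': 'stoke on trent'},
    }

    # Declare which fields to compare and how
    fields = [
        {'field': 'job_title', 'type': 'String'},
        {'field': 'company', 'type': 'String'},
        {'field': 'location_city', 'type': 'String'},
    ]

    deduper = dedupe.Dedupe(fields)
    deduper.prepare_training(data)

    # Active learning: uncertain pairs are shown and labelled y/n/u at the console
    dedupe.console_label(deduper)
    deduper.train()

    # Cluster the records; each cluster comes with a similarity score per record
    clusters = deduper.partition(data, threshold=0.5)
    for record_ids, scores in clusters:
        print(record_ids, scores)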

Data Sources:

In preparation for the virtual sprint the UK collected a sample of IT and construction jobs from the following job portals/search engines all taken at about the same time:

  • Adzuna (using their API)
  • Indeed (using their API)
  • Totaljobs (using Import.io)
  • Monster (using Import.io)
  • Universal Jobmatch (using their API)

For each job portal, we did not collect the full job description but rather a “snippet” of 40-60 words, along with key variables such as the job title, location, company name and application closing date. This is the information that is most readily available from the API search query or simple web scraping. Although it should be possible to collect the full text description, this is quite complicated and we felt that this summary information would be sufficient for this initial experiment.

Data Wrangling/Cleaning:

The first task of the virtual sprint was to agree a schema of common variables from each job portal to be used for matching. These variables are often labelled differently and some data may need to be reformatted or derived.

Some data fields are quite messy and require a considerable amount of data wrangling (i.e. cleaning and manipulation) before any matching can be attempted. A particular problem is that the job title field is often polluted with other information, for example:

 " .NET Developer - Stoke-On-Trent - £35-£40K "

Job portals often deliberately include additional information in the job title field, presumably to get the attention of job seekers. These data need to be cleaned before any deduplication step. Regex techniques were applied to identify and remove text strings containing salary information, and to remove all spaces and punctuation, leaving just text and numerals. We considered different approaches for removing location information from the job title field. One option would be to develop a taxonomy of locations and use it as a reference set for matching and removing location strings. However, we noticed that when location information is incorporated into the job title it is nearly always also present in the separate location field. So instead we wrote some simple code that uses the text string in the location field to match and remove any corresponding strings in the job title field. This is a very simple yet highly effective approach that could work in the same way for job portals in different countries and different languages.
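
A minimal sketch of this kind of title cleaning is shown below. The regular expression and the exact rules are illustrative, not the code written during the sprint.

    import re

    def clean_job_title(title, location):
        """Remove salary information, the advertised location and punctuation
        from a job title, keeping only letters, digits and spaces."""
        t = title.lower()
        loc = (location or '').lower()

        # Drop salary strings such as "£35-£40K" or "£35,000"
        t = re.sub(r'£\s?\d[\d,.]*k?(\s?-\s?£?\d[\d,.]*k?)?', ' ', t)

        # Remove any string from the location field that also appears in the title
        for token in loc.split():
            t = t.replace(token, ' ')

        # Keep letters and numerals only, collapse whitespace
        t = re.sub(r'[^a-z0-9 ]', ' ', t)
        return re.sub(r'\s+', ' ', t).strip()

    print(clean_job_title('.NET Developer - Stoke-On-Trent - £35-£40K',
                          'Stoke-on-Trent'))
    # -> 'net developer'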

We also did some experimental text processing on the job description field. Although some job descriptions for duplicated jobs are copied verbatim (and are therefore easy to identify as duplicates), some job agencies adapt the job description to their particular house style, which makes it more difficult to identify the ads as the same job. We experimented with reducing the text to a set of key words that would hopefully be fairly similar between job ads relating to the same vacancy. The effectiveness of this was constrained by the use of “snippets” rather than the full description, but it is a general approach that could be explored further.
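
A rough illustration of this idea (a hypothetical keyword reduction, not the code used in the sprint): strip stop words and punctuation and keep the remaining set of words, so that two re-worded descriptions of the same vacancy end up with similar keyword sets.

    import re

    # Small illustrative stop word list; a fuller list (e.g. from NLTK) would be used in practice
    STOP_WORDS = {'a', 'an', 'and', 'are', 'for', 'in', 'is', 'of', 'on',
                  'our', 'the', 'this', 'to', 'we', 'with', 'you', 'your'}

    def keywords(snippet):
        """Reduce a job description snippet to a sorted set of key words."""
        words = re.findall(r'[a-z0-9]+', snippet.lower())
        return sorted(set(w for w in words if w not in STOP_WORDS))

    a = keywords('We are looking for a .NET developer to join our team in Stoke')
    b = keywords('A .NET developer is required to join the team, Stoke area')
    print(set(a) & set(b))   # shared keywords suggest the same vacancy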

Another idea we discussed was to assign SOC codes from the job title and then use these as an additional feature to help identify potential duplicates. This would need to be done at some point anyway, but there could possibly be some advantage in doing it prior to the deduplication step. We did not get very far with this idea, but we may explore it later.

Methodology:

The problem of deduplicating job ad data is multi-dimensional. Job portals (and specifically job search engines) can have the same vacancy listed more than once. The same vacancy may then also appear across multiple job portals.

One approach could be to concatenate all records into a single large file and then run the deduplication step as a single process. However, we considered it likely that particular patterns of duplication are portal specific, and that it might therefore be better to train machine learning algorithms tailored to each portal, followed by a separate step focused on duplication across portals. Apart from this, we thought this approach would give us a better sense of the level of duplication within individual job portals, how job ads are spawned across the whole on-line job vacancy eco-system, and therefore which portals we should prioritise.

We therefore decided to pursue the approach of deduplicating individual job portals first and then deduplicating across portals in a second iteration. However, deduplicating in a single step is something that we could explore another time.

 For each portal we cleaned data for the following variables:

  • Job_title
  • Job_description
  • Location_city
  • Location_region
  • Date_posted
  • Company

We decided to stop working with the Monster data at this point because the date posted information was too difficult to derive. We also discarded the data from Universal Jobmatch because the API limits each search query to 50 job vacancies, which makes it very complicated to get the complete data. The data cleaning performed on the remaining portals was far from perfect, but given the time constraints we thought it was sufficient for this initial experiment.

Some job advertisements in Adzuna and Indeed share a job vacancy id number. These duplicates can therefore be identified and removed very easily as a first step.
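
Assuming the scraped adverts are held in a pandas DataFrame with a vacancy id column (the file and column names below are illustrative), this first step amounts to:

    import pandas as pd

    ads = pd.read_csv('adzuna_it_jobs.csv')          # illustrative file name
    # Keep the first advert for each vacancy id; exact-id duplicates are dropped
    ads = ads.drop_duplicates(subset='vacancy_id', keep='first')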

The next step was to run the dedupe program against each job portal file, using the active learning step to train the matching algorithm to mimic the human decision-making process. This step was mostly performed independently by individuals in the ONS team, just once per data set, and was generally quite rushed due to time pressures. We aimed to produce a training data set of at least 100 records, although this was sometimes smaller, again because of time pressures.

We did not discuss as a group, or agree, the logic for deciding whether two advertisements relate to the same vacancy. In some cases the decision is straightforward: for example, if all the fields are exactly the same apart from the company name, this indicates that the same vacancy has been advertised by more than one agency and it can be marked as a duplicate. Other cases, however, can be much more ambiguous, and it is likely that two individuals could have made different judgements. We agreed that an important future task would be to either:

i) agree some broad and consistently applied principles about how to recognise duplicate job ads

ii) identify a methodology for building training datasets for deduplication (e.g. multiple people labelling the same data and focusing attention on the cases where their decisions differ).

For these reasons, one should not read too much into the specific findings as the deduplication algorithms may not have been trained to a sufficiently high standard.

Results:

The percentage of duplicates found for each job portal was as follows:

  • Totaljobs: 8%
  • Indeed: 16%
  • Adzuna: 22%

Indeed and Adzuna are job search engines, so the higher volume of duplicates is not surprising. Totaljobs.com is a job board where most jobs are uploaded directly by the company or agency, which explains the lower proportion of duplicates. With Adzuna, we found that nearly all the duplicates were identified simply by direct matching on the job vacancy id. This suggests that Adzuna has good systems for identifying duplicate vacancies within its own system.

The deduplicated files were then concatenated into a single file of about 93,000 records. The initial match run and the active learning step were repeated on this file. The final match run found that about 10 per cent of all records were duplicates.

Conclusion:

Dedupe proved to be a very useful tool for identifying duplicate job ads where the ads are similar but not necessarily identical. Although some of the steps were quite rushed, we were able to develop an approach that could identify duplicate records both within job portals and across them. However, a lot more work is needed to refine the approach, in terms of cleaning the data and then properly training the dedupe algorithm.

 

2. Sweden Report

Introduction:

This is a summary of the virtual sprint from Sweden.

 Participants:

  • Dan Wu carried out the task.
  • Billius Åsa and Varlakova Olga attended the web meetings.

Task:

To identify duplicate adverts in the ONS data by calculating the Jaccard similarity of pairs of adverts. The higher the score, the more similar the adverts.
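
For two adverts represented as sets of word tokens, the Jaccard similarity is the size of the intersection divided by the size of the union. A minimal Python illustration follows (the actual sprint work was done in R, as described below); the example adverts are invented.

    def jaccard_similarity(text_a, text_b):
        """Jaccard similarity of two adverts based on their word sets."""
        a, b = set(text_a.lower().split()), set(text_b.lower().split())
        if not a and not b:
            return 0.0
        return len(a & b) / len(a | b)

    print(jaccard_similarity('site manager wanted in leeds',
                             'site manager wanted in york'))   # 4 shared / 6 total = 0.67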

Data and Tools:

ONS provided data from several different sources. The data sets used in this sprint were Adzunaconstructions-jobs, monsterCon, totaljobsITCon and universaljobmatch-construction-jobs, as presented in the table below.

Source | Total number of ads | Main data features
Adzunaconstructions | 26 | Category_label, company_canonical_name, company_display_name, contract_time, contract_type, description, id, location_area, location_display_name, salary, title, region, etc.
MonsterCon | 1000 | url, companylogo.image, jobtitle.link, jobtitle.link_link, company_link, company.link_link, locationarticle.value, locationarticle.value_link, locationblock.value, preview.description, etc.
totaljobsITCon | 1950 | url, jobtitle.link, jobtitle.link_link, label.value, location.content, location.link_link, salary.value, recruiter.image, permanent.link, permanent.link_link, dateposted.value, jobintr.value, jobintro.description, hidden.descriptions, see.link, etc.
Universaljobmatch-construction-jobs | 2162 | Activedate_end, activedate_start, company, id, location_area, location_city, summary, etc.

 

Tools: R with the packages “tm” and “textreuse”

Methodology:

  1. Read in the data files;
  2. Decide which features of each data set are used to represent the adverts: the description for Adzunaconstructions, Preview.description for MonsterCon, Jobintro.description and Hidden.descriptions for totaljobsITCon, and title and summary for Universaljobmatch-construction-jobs were retrieved for the text analysis;
  3. Process the text with the normalisation functions of the tm package and save the cleaned text in a folder (the data are attached);
  4. Generate candidate pairs of similar texts from the cleaned text;
  5. Calculate the Jaccard similarity scores of the candidate pairs (see the sketch after this list);
  6. Evaluate the result by checking samples of the cleaned text manually.
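
Steps 4 and 5 were carried out in R with textreuse, which supports minhashing and locality-sensitive hashing so that not every pair of documents has to be compared. A rough Python analogue of the same idea, using the datasketch library, is sketched below; the library choice, documents and parameters are illustrative, not what was run in the sprint.

    from datasketch import MinHash, MinHashLSH

    docs = {
        'ad1': 'site manager required for a construction project in leeds',
        'ad2': 'site manager required for construction project leeds area',
        'ad3': 'junior java developer wanted in stockholm',
    }

    def minhash(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for token in set(text.split()):
            m.update(token.encode('utf8'))
        return m

    # Locality-sensitive hashing returns candidate pairs above the threshold
    lsh = MinHashLSH(threshold=0.6, num_perm=128)
    hashes = {key: minhash(text) for key, text in docs.items()}
    for key, m in hashes.items():
        lsh.insert(key, m)

    for key, m in hashes.items():
        candidates = [c for c in lsh.query(m) if c != key]
        if candidates:
            # Estimated Jaccard similarity for each candidate pair
            print(key, [(c, m.jaccard(hashes[c])) for c in candidates])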

Results:

Of the total of 5,138 text documents representing 5,138 adverts, 5,135 were included in the comparison; three documents were excluded because their text was too short. Of all the pairwise combinations, 29,526 pairs were identified as similar.

The results for the different sources are shown in the table below.

Source | Number of identified pairs | Feature used
Adzunaconstructions | 2 | description
MonsterCon | 4,127 | Preview.description
Universaljobmatch | 2,763 | title and summary
totaljobsITCon | 22,634 | Jobintro.description and Hidden.descriptions

 

The mean of the documents’ Jaccard similarity scores is 0.9831.

The result is attached.

Conclusion:

It is possible to identify duplicate adverts by calculating Jaccard similarity scores. Hashing the text saves computing time.

Many similar pairs share the same text, but the jobs are located in different cities. It seems better to check feature values such as company name, location and publication date, if they are available, and to combine the result with the text comparison. The definition of a duplicate advert at the semantic level is important; in this sprint, the definition was simplified to calculating Jaccard similarity scores of certain features’ values.

It was a surprise to find so few similar text documents. This may be due to data quality, as many documents contain incomplete descriptions of the adverts. Another surprise is that no similar pairs were identified across two different sources. At the outset I expected most duplicate adverts to come from different sources, e.g. companies publishing adverts on several sources.

3. Greece Report

Introduction:

The main objective of ELSTAT team was to conduct web scraping/data collection and then to explore the coverage of occupations and sectors in the sample and the percentage of companies covered in relation to the Statistical Business Register.   

The participants were: Christina Pierrakou and Eleni Bisioti

Data Sources:

In preparation for the virtual sprint, ELSTAT collected the ads for July from the following job portals all taken at about the same time:

  • Kariera  
  • Skywalker  

For each job portal, we did not collect the full job description but rather a “snippet” of 40-60 words, along with key variables such as the job title, location, company name, posted date, salary and job type (full time/temporary).

Methodology - Results:

A good knowledge of the structure of the job portal site is very useful, in order to use tools like import.io in an efficient way.

Our first step was to study the structure of each job portal, in order to understand how the information is distributed, by using an XML sitemap generator (https://www.xml-sitemaps.com) to create site maps of the above-mentioned portals.

The next step was to prioritise the findings for each site map in order to get all the ads from each site without splitting them into specific categories. These results were then used in import.io to create one data set for each portal without duplicates.

We tried to match the company names from the data sets to the company names in the Statistical Business Register. However, the matching rate was very low. The main problem was the language: the Statistical Business Register contains the official company names, written mainly in Greek characters, while the company names in the data sets mainly use Latin characters for commercial reasons. Some code is needed to solve this problem and increase the matching rate. Due to time constraints, we are going to explore this issue later on. After matching the companies to the Statistical Business Register, we are going to explore the coverage of occupations and sectors.
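
One possible direction is sketched below, under the assumption that transliteration plus fuzzy string matching would be acceptable. The transliteration map, function names and cutoff are purely illustrative (a library such as unidecode could replace the hand-made map), and this is not code that was written during the sprint.

    import difflib
    import unicodedata

    # Small illustrative Greek-to-Latin transliteration map (deliberately incomplete)
    GREEK_TO_LATIN = str.maketrans({
        'α': 'a', 'β': 'v', 'γ': 'g', 'δ': 'd', 'ε': 'e', 'ζ': 'z', 'η': 'i',
        'θ': 'th', 'ι': 'i', 'κ': 'k', 'λ': 'l', 'μ': 'm', 'ν': 'n', 'ξ': 'x',
        'ο': 'o', 'π': 'p', 'ρ': 'r', 'σ': 's', 'ς': 's', 'τ': 't', 'υ': 'y',
        'φ': 'f', 'χ': 'ch', 'ψ': 'ps', 'ω': 'o',
    })

    def normalise(name):
        """Lowercase, strip accents and transliterate Greek characters to Latin."""
        s = unicodedata.normalize('NFD', name.lower())
        s = ''.join(c for c in s if not unicodedata.combining(c))
        return s.translate(GREEK_TO_LATIN)

    def best_match(portal_name, register_names, cutoff=0.8):
        """Return the Business Register name closest to the portal name, or None."""
        target = normalise(portal_name)
        scored = [(difflib.SequenceMatcher(None, target, normalise(r)).ratio(), r)
                  for r in register_names]
        score, name = max(scored)
        return name if score >= cutoff else None

    # Usage: best_match(portal_company, list_of_register_companies) compares the
    # transliterated portal name against the transliterated register names.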



4. Germany Report 

Introduction

The main objective of the Destatis team was to explore the prevalence of duplicates on and between job portals and the possibilities for detecting them. The approach was an exploratory one: for a subset of data obtained by web scraping, the prevalence of (possible) duplicates was assessed manually. In suspicious cases, the relevant job advertisements were checked to investigate whether there actually was a duplicate. The results of the exercise were subsequently analysed with regard to the possibilities of defining rules for web scraping based on the structured information.

The participants were: Thomas Körner, Martina Rengers and Holger Ostermann

 

Data Sources

In analogy to the sprint in the UK, in preparation for the virtual sprint in Germany we collected a sample of IT and construction jobs from two job portals identified for the project during the job portal inventory, stepstone.de and de.gigajob.com. Table 1 gives an overview of the data obtained. The data were collected via web scraping, using the free software Selenium and Firefox together with a custom Java program as the technical basis. The Java program controls Selenium, which in turn controls the Firefox browser; based on the specified criteria, it searches for jobs, collects the relevant data found and, after retrieval, writes the results to Microsoft Excel (xlsx) format.
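
For illustration only, the same scraping pattern expressed in Python with Selenium might look roughly as follows. The sprint actually used a custom Java program, and the URL and CSS selectors below are placeholders, not the real page structure.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import pandas as pd

    driver = webdriver.Firefox()                  # Selenium drives the Firefox browser
    driver.get('https://www.stepstone.de/')       # placeholder: a real search URL is needed here

    rows = []
    # Placeholder CSS selectors: the real hit-list structure has to be inspected first
    for ad in driver.find_elements(By.CSS_SELECTOR, '.job-element'):
        rows.append({
            'URL': ad.find_element(By.CSS_SELECTOR, 'a').get_attribute('href'),
            'Job Name': ad.find_element(By.CSS_SELECTOR, '.job-title').text,
            'Company': ad.find_element(By.CSS_SELECTOR, '.company').text,
            'Location': ad.find_element(By.CSS_SELECTOR, '.location').text,
            'Date': ad.find_element(By.CSS_SELECTOR, '.date').text,
        })
    driver.quit()

    # Write the results to Excel, as the Java program does
    pd.DataFrame(rows).to_excel('stepstone_hits.xlsx', index=False)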

Table 1: Job portals included for the sprint

Job portal | Type | Area | No. of advertisements | No. of different job names and companies | Variables collected
Stepstone.de | Job board | “Sector building and construction” | 1,716 | 1,515 different job titles; 698 different companies | URL, Job Name, Company, Location, Date
Stepstone.de | Job board | “Sector IT & internet” | 7,150 | 6,609 different job titles; 1,745 different companies | URL, Job Name, Company, Location, Date
De.gigajob.com | Hybrid | “Topic Construction” | 12,678 | 4,902 different job titles; 3,885 different companies | URL, Job Name, Company, Location, Age, Date, Source, ID of Job Offer
De.gigajob.com | Hybrid | “Topic IT” | 49,824 | 24,597 different job titles; 5,793 different companies | URL, Job Name, Company, Location, Age, Date, Source, ID of Job Offer

Note: No advertisements older than 30 days were considered.

For both job portals, we did not collect the full job advertisements as plain text but only the structured information available in the overview hit list provided by the portal. As can be seen in table 1, this information turned out to be rather limited. We considered including the information that appears when clicking on a job advertisement in the hit list, but finally decided against it. The reason was that only little additional information was available at that “second level”, and it was judged to be of limited interest for the detection of duplicates. At Stepstone, the “second level” contained structured information on the contract type and the full-time/part-time status; at Gigajob no standardised additional information was available at all (as the hit list links directly to the advertisement at the original job portal). Using further information would require processing the plain text of the job advertisements, which is usually presented in the corporate design of the employer and not in a pre-defined structure (with the exception of Gigajob’s “own” advertisements, which do follow a pre-defined structure).

Figure 1: Example of a hit list at stepstone.de


The allocation of a job advertisement to the IT or construction area relied on the classification provided by the job portals. As it later turned out, the “sector” at Stepstone and the “topic” at Gigajob are fundamentally different concepts, the first one being closer to the economic activity and the latter closer to the occupation. However, neither classification seemed to fully match the standard classifications of official statistics.

 

Figure 2: Example of a hit list at de.gigajob.com


 

Compared to the hit list of Stepstone, the hit list of Gigajob contains two additional variables as structured information (see figure 2): (1) The job portal where the job advertisement originally appeared (as a hybrid job portal, Gigajob contains job advertisements from 36 different job portals, the share of “own” advertisements could roughly be estimated to be around 15%), and (2) an ID number of the job advertisement. While the first information was highly useful to identify the advertisements carried over from Stepstone (in order to check possible approaches to identify these duplicates), the ID number is of limited use as it is a number specific to the Gigajob web site and cannot be found in the job advertisement itself (i.e., even duplicates on the Gigajob site have different ID numbers). Some entries in the Gigajob hit list contain a small preview text, which is just the first line of text that can be found in the plain text job advertisement, including also technical information from the web site like “Aktuelle Stellenangebote - Detailansicht Jobs suchen! Aktuelle Stellenangebote -…“ [“Current job offers – detail view search for jobs! – current job offers…”] (having not necessarily any relation to the specific job advertisement). As it quickly turned out that this preview text was different also in obvious cases of duplicates, the information was not used for the sprint exercise.

 

Data analysis

The approach to the data analysis was an exploratory one, i.e. the objective was to identify possible approaches to deduplication rather than to implement deduplication technically or to produce estimates of the percentage of duplicates. To this end, the scraped data were sorted and checked manually, mainly with the help of Microsoft Excel. Cases for which a duplicate was considered likely were reconciled by comparing the plain text of the relevant job advertisements.

In a first step, the job advertisements found at Stepstone were analysed for duplicates within the job portal, using the following approach:

1. Sorting the Excel tables collected by web scraping alphabetically by company, then job name, then location

2. Identification of advertisements with identical (or at least similar) entries regarding company, job name and location

3. Investigation whether the identified cases were really duplicates by a comparison of the (full) job advertisements.

(Note: This approach does not allow the identification of cases in which the same job is advertised by two different companies (e.g. the company itself and a recruitment agency); although one such case was found in the Stepstone data by chance, it was deemed rather rare for a company and a recruitment agency to advertise the same post on the same job portal.)
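
Carried out programmatically rather than in Excel, steps 1 and 2 amount to something like the following pandas sketch (the file name is illustrative; the column names follow Table 1):

    import pandas as pd

    ads = pd.read_excel('stepstone_construction.xlsx')   # illustrative file name

    # Step 1: sort alphabetically by company, then job name, then location
    ads = ads.sort_values(['Company', 'Job Name', 'Location'])

    # Step 2: flag advertisements whose company, job name and location all recur
    suspects = ads[ads.duplicated(subset=['Company', 'Job Name', 'Location'], keep=False)]

    # Step 3 (manual): compare the full advertisements behind suspects['URL']
    print(len(suspects), 'advertisements to check manually')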

 

In a second step, we analysed the possibilities for identifying duplicates of jobs that are advertised at Stepstone and at Gigajob at the same time. From the establishment of the inventory of job portals we knew that practically all jobs advertised at Stepstone can also be found at Gigajob. Given the large number of duplicates, we tried to find out whether there are characteristic patterns for duplicates:

1. Identification of duplicates at Stepstone and Gigajob (by the information available at Gigajob)

2. Comparison of the information provided for company, job name, location and publication date

3. Investigation whether differences in these variables exist for duplicates.

 

Main results

a) Duplicates within Stepstone

Table 2: Results of analysis of duplicates found at Stepstone

Sector | Advertisements checked | Possible duplicates | Could not be decided by comparing plain text ads | Duplicate confirmed | Duplicate not confirmed
Building and construction | 500 | 8 | 4 | 1 | 3
IT & internet | 1,000 | 11 | 1 | 1 | 9

 

A duplicate was marked as “not confirmed” if the job descriptions in the ads differed and had different reference numbers (in the ad). Practically no duplicates were found in either the construction or the IT sector at Stepstone (see table 2). In the construction sector, eight pairs of job advertisements had (nearly) identical information regarding job name, company and location (1.6%); in the IT sector this applied to 11 pairs of job advertisements (1.1%).[1] After manually checking the suspected job advertisements, actual duplicates were found in only one case each in the construction and the IT sector. This result shows that similar and even identical values for job name, company and location do not always reliably identify duplicates, and would lead to removing job advertisements that are actually not duplicates. (The overall impression was also that the Stepstone data were remarkably consistent and clean, which makes the job portal a good candidate for the remaining steps of the project.)

The analysis of the data collected from Stepstone also revealed cases analogous to duplicates, i.e. job advertisements that appear in the list only once but actually contain more than one job: about 170 of the 1,751 job advertisements in the construction sector and around 1,200 of the 7,150 in the IT sector contained multiple locations. In these cases, even after studying the full job advertisement, it was often not possible to know whether more than one post was actually offered. Some job advertisements indicate that the same post is available at different locations (and would have to be counted more than once); in the majority of cases, however, it was simply not possible to find reliable information on the actual number of posts. As the share of such cases is considerable, it would be important to develop approaches for dealing with such (possible) multi-job advertisements.

 

b) Stepstone vs. Gigajob

As indicated before, a different approach was chosen for the analysis of the data collected from Gigajob. As practically all Stepstone job advertisements can be found at Gigajob, it was no major problem to identify the job advertisements included in both job portals. These pairs were subsequently analysed to see whether the duplicates could be identified with the help of the available structured information. Due to time restrictions, this exercise was limited to the job advertisements in construction and to the first 200 advertisements found at Stepstone (ordered alphabetically by company, job name and location). Despite this limited sample, a number of interesting findings were made:

  • The “topics” at Gigajob and the “sectors” at Stepstone are conceptually distinct. Out of the 200 job advertisements checked (all allocated to the sector “building and construction” at Stepstone), only 46 were allocated to the topic “construction” at Gigajob. (The remaining 154 job advertisements were, at least for the vast majority, also included in Gigajob, but allocated to a different topic.)
  • The 46 common job advertisements were identified as duplicates, but the structural information available at Stepstone was not always identical with that at Gigajob (see table 3). Only three of the duplicates had identical information for company, job name, location and date. The date differed in 39/46 cases and seems to be of no use for the identification of duplicates (all the more so as there was no apparent system behind the difference; e.g. the publication date on Stepstone did not always precede the one on Gigajob). The location differed in 20/46 cases, mostly due to the omission of multiple locations at Gigajob (13/20); still, in 7/20 cases an entirely different location was indicated at Gigajob. While there were no deviations regarding the company, the job name differed in 9/46 cases, always because Gigajob seems to limit the length of the job name to about 50 characters and cuts off longer names. In summary, only the variables company and job name are sufficiently reliable to be used for the identification of duplicates (but, as many cases in the Stepstone data show, a match on these two variables alone is probably not sufficient to identify a duplicate).

Table 3: Structural information of the 46 duplicates identified

Type of identical structural information | Number of job advertisements
Company, job name, location and date | 3
Company, job name and location | 17
Company and job name | 18
Job name only | 0
Company only | 4
Company and location | 4

 

Conclusions

The results of the sprint are of major importance for the further work in the project. In particular, the following points should be emphasised:

  • While practically no duplicates were discovered at Stepstone, there is some evidence that many duplicates can be found at Gigajob. (This needs to be further analysed, but the first impression is that the portal owner of Gigajob runs at most very limited deduplication procedures. A considerable number of identical job advertisements can be found that originally come from different job boards covered by Gigajob.)[2]
  • Deduplication is hardly feasible on the basis of the structured information given in the hit lists of the job portals studied in the sprint (neither within Stepstone nor between Stepstone and Gigajob). A reliable deduplication necessarily involves considering the plain text of the job advertisements as well.
  • When including hybrid portals, consideration should be given to using only their “own” job advertisements in order to reduce deduplication issues.
  • It should be noted that, as different data cleaning procedures are used by different job portals, an analysis similar to the present one needs to be done for each job portal intended to be used for statistics production.
 

[1] This is in line with a quantitative analysis of all the job advertisements collected via web scraping at Stepstone: in construction, 12 (pairs) out of 1,751 job advertisements had (exactly) identical values for company, job name and location (IT: 45/7,150). Note that these cases are not necessarily duplicates.

[2] This is also suggested by the fact that 4,033/12,678 Gigajob advertisements in the topic “construction” and 19,497/49,823 in the topic “IT” have a perfect match of company and job name with another job advertisement.