Summary
For a number of variables in questionnaires, one wants the answer in closed form, e.g., “city”; this is a relatively simple classification task. Sometimes the task is much harder, e.g., when trying to obtain a code for occupation. One approach is to ask an open question (“What is your occupation?”) and then to code the answer text at the statistical office. For the sake of efficiency, that coding process starts with an automatic step.
In some cases, no previously coded material is available in electronic form. The starting point then consists of the data to be coded and a classification with a textual description per code. In this situation, one can either build the informative base, by transforming the classification manual so that it is ‘processable’ by a computerised system and by ensuring that pre-coded descriptions are available, or one must try to code the open-text answers based on the texts themselves and the associated semantics, to enable the approach from the module “Coding – Automatic Coding Based on Pre-coded Datasets”.
Although an informative base can be constructed from expert knowledge, pre-coded answers may also be added to it to increase the coding rate. This makes the distinction with the module “Coding – Automatic Coding Based on Pre-coded Datasets” less strict. The main difference between that module and this one is the amount of manual work needed to construct an informative base: the methods in the other module are based on machine learning, which requires much less manual work. As described in the module “Coding – How to Build the Informative Base”, the informative base can contain:
- the classification manual descriptions, transformed so as to be ‘processable’ by a computerised system;
- pre-coded descriptions collected in previous surveys;
- different kinds of synonyms, hypernyms and hyponyms.
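As an illustration of how the three kinds of content listed above might sit together in one structure, here is a minimal Python sketch. The class name `InformativeBase`, its field names, the codes `C1` and `C2` and the sample texts are all hypothetical and do not come from any of the modules or tools mentioned here; for simplicity, synonyms, hypernyms and hyponyms are collapsed into a single word-to-preferred-term map.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class InformativeBase:
    """Hypothetical container for the three kinds of content in the list."""
    # code -> processable description taken from the classification manual
    manual_descriptions: Dict[str, str] = field(default_factory=dict)
    # normalised answer text -> code assigned in a previous survey
    precoded: Dict[str, str] = field(default_factory=dict)
    # word -> preferred term (synonyms, hypernyms, hyponyms; simplified)
    term_map: Dict[str, str] = field(default_factory=dict)

    def normalise(self, text: str) -> str:
        """Lower-case the text and replace each word by its preferred term."""
        words = text.lower().split()
        return " ".join(self.term_map.get(w, w) for w in words)

    def lookup(self, text: str) -> Optional[str]:
        """Return a code on an exact match, or None if nothing matches."""
        key = self.normalise(text)
        if key in self.precoded:
            return self.precoded[key]
        for code, description in self.manual_descriptions.items():
            if key == description:
                return code
        return None


# Made-up example content; the codes are placeholders, not real codes.
base = InformativeBase(
    manual_descriptions={"C1": "primary school teacher"},
    precoded={"nurse": "C2"},
    term_map={"schoolteacher": "teacher"},
)
```

With this made-up content, `base.lookup("Primary school Teacher")` matches the manual description for `C1`, while an answer that appears nowhere in the base yields `None` and would be left for manual coding. Exact matching is used only to keep the sketch short; real systems match far more flexibly.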
There are general systems (ACTR, now G-CODE (Wenzowski, 1988), and Cascot (Cascot)) that use the elements above to code text in a number of steps, such as pre-processing the text, replacing words and finally assigning a code. Alternatively, most of these steps can be combined into a so-called semantic network (Hacking and Janssen-Jansen, 2009). In the following section we will describe the “spreading activation” search method in the semantic network in more detail as an example; at certain points we will describe the link with the “processing approach” in the ACTR tool.
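To give a rough feel for the idea of spreading activation, the following Python sketch propagates activation from the words of an answer through a tiny weighted network towards code nodes. It is a generic textbook-style illustration under stated assumptions, not the method of Hacking and Janssen-Jansen (2009) nor the processing used in ACTR or Cascot: the network `EDGES`, the node names, the weights, the `decay` factor and the `threshold` are all invented for the example.

```python
from collections import defaultdict

# Hypothetical toy network: weighted links from words to concepts to codes.
EDGES = {
    "teacher": {"concept:education": 0.9},
    "school": {"concept:education": 0.6},
    "nurse": {"concept:health": 0.9},
    "hospital": {"concept:health": 0.6},
    "concept:education": {"code:EDU": 1.0},
    "concept:health": {"code:HEA": 1.0},
}


def spread_activation(words, edges, decay=0.8, steps=2):
    """Propagate activation from the answer words through the network.

    Each step passes a node's activation to its neighbours, scaled by the
    link weight and an (assumed) decay factor. Returns the activation that
    has accumulated on code nodes.
    """
    activation = defaultdict(float)
    frontier = {w: 1.0 for w in words if w in edges}
    for _ in range(steps):
        next_frontier = defaultdict(float)
        for node, value in frontier.items():
            for neighbour, weight in edges.get(node, {}).items():
                next_frontier[neighbour] += value * weight * decay
        for node, value in next_frontier.items():
            activation[node] += value
        frontier = next_frontier
    return {n: a for n, a in activation.items() if n.startswith("code:")}


def assign_code(text, edges=EDGES, threshold=0.5):
    """Return the most activated code, or None if it stays below threshold."""
    scores = spread_activation(text.lower().split(), edges)
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

In this toy network, `assign_code("school teacher")` returns `"code:EDU"` because two words reinforce the same concept, whereas `assign_code("hospital")` returns `None`: a single weakly linked word does not reach the threshold, so the answer would be passed on for manual coding.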