In CAPI, rather than being numbered, questions are given a unique name – this name is usually
derived from the content of the question. In most cases the first letter of the question name
indicates in which module the question was asked. Questions are identified by their bold
formatting. The text of the question that should be read out by the interviewer is displayed in
lower case, with the end of the question usually indicated by a question mark. This may involve
the interviewer reading through a list of pre-defined answers. At most questions, the respondent
chooses his/her answer(s) from a pre-defined list which is either read out to him/her by the
interviewer or which he/she reads from a card given to him/her by the interviewer. At other
times the respondent is not offered a pre-defined choice of answer categories, instead the
interviewer codes his/her spontaneous response to a pre-defined list of answers. Alternatively the
interviewer may be asked to type in the answer given verbatim. At ‘text’ questions of this type, the
number of characters allowed is limited (although interviewers can, where necessary, enter more
characters in an electronic memo). At other ‘OPEN’ questions, there is no limit on the number of
characters. Interviewers may also be asked to enter answers in the form of a date, time or number.
Notes to help or instruct the interviewer are shown in upper case.
Questions at which a pre-defined list of answers is given can be split into two types: single-coded
and multi-coded. Single-coded questions allow only one answer category to be chosen – unless
otherwise stated, assume that the question is single-coded. Multi-coded questions are usually
identified by a note to the interviewer to ‘CODE ALL THAT APPLY’. At some multi-coded
questions, the maximum number of codes allowed is less than the number offered. If this is the
case, the maximum is stated below e.g. ‘[maximum 4 codes]’. In addition, at some multi-coded
questions, one of the answers may be an ‘exclusive code’. This means that if this answer is chosen,
no others may be. These codes are identified by ‘[exclusive code]’ after the answer category. Unless
otherwise stated all questions also allow ‘Don’t Know’ and ‘Refusal’ answers to be entered. Where
these are not allowed, it is stated below the answers.
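The annotations described above ('CODE ALL THAT APPLY', '[maximum 4 codes]', '[exclusive code]') are regular enough that they could probably be picked up with simple pattern matching. The following is only a minimal sketch: the function name `describe_answer_block` and the sample answer lines are hypothetical, and it assumes the annotations survive PDF-to-text conversion exactly as they are quoted here.

```python
import re

# Patterns for the annotations quoted in the documentation text above.
MAX_CODES_RE = re.compile(r"\[maximum (\d+) codes?\]", re.IGNORECASE)
EXCLUSIVE_RE = re.compile(r"\[exclusive code\]", re.IGNORECASE)


def describe_answer_block(block_lines):
    """Summarise the coding rules found in the lines of one question (hypothetical helper)."""
    text = "\n".join(block_lines)
    max_match = MAX_CODES_RE.search(text)
    return {
        # Single-coded is the documented default unless this note is present.
        "multi_coded": "CODE ALL THAT APPLY" in text,
        # None means no maximum was stated for the question.
        "max_codes": int(max_match.group(1)) if max_match else None,
        # Answer categories flagged as exclusive, with the flag removed.
        "exclusive_answers": [
            EXCLUSIVE_RE.sub("", line).strip()
            for line in block_lines
            if EXCLUSIVE_RE.search(line)
        ],
    }


# Made-up answer block, for illustration only.
sample_block = [
    "CODE ALL THAT APPLY",
    "[maximum 4 codes]",
    "1 Option A",
    "2 Option B",
    "3 None of these [exclusive code]",
]
summary = describe_answer_block(sample_block)
```

A question with none of these annotations comes back as single-coded with no maximum and no exclusive answers, which matches the default stated in the documentation.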
At various points in the CAPI program it was necessary to derive variables in order primarily to
route respondents correctly to subsequent questions. In particular, several variables were derived
from answers given to the household grid. These are listed at the end of the section which
documents the household module. In addition, there are some derived variables listed at the
beginning of Module A which indicate key information about the main/partner respondent
which is needed for routing through the questionnaire. Derived variables also occur in a number
of other modules and are identified as such.
Routing instructions
Routing instructions are fully detailed in italics at appropriate points. The routing condition is
both explained in words and given in terms of the logical command. The expressions ‘<’, ‘=’, ‘>’
are used to denote ‘less than’, ‘equal to’ and ‘more than’. The term ‘<>’ means ‘not equal to’. The
routing condition is displayed immediately before the first question to which it applies and is
indicated by an ‘IF’ statement. The end of the influence of a particular routing condition is
indicated by an ‘ENDIF’. In some cases, where the routing is more complex a routing box is used
instead of ‘IF’ and ‘ENDIF’ statements to explain when the questions are asked. For example:
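I have an idea for handling CAPI files: look for the capitalised variable names and take the text that follows each one. This needs some refinement and testing on many different files, but potentially we can handle CAPI without needing the usual machine learning component...
import re

# Path to the CAPI PDF after conversion to plain text (placeholder value; set it to your own file).
capi_file_converted_to_text = "capi.txt"

with open(capi_file_converted_to_text, "r", encoding="utf-8") as f:
    lines = f.read().split("\n")

# A candidate variable name: three or more capitals/digits at the start of a line, then a space.
re_code = re.compile(r'^[A-Z][A-Z0-9][A-Z0-9]+ ')

# Map each candidate name to the indices of the lines that start with it.
capi_code_lookup = {}
for idx, line in enumerate(lines):
    match = re_code.match(line)
    if match:
        capi_code_lookup.setdefault(match.group().strip(), []).append(idx)

# Names that occur exactly once are treated as question names;
# the question text is assumed to be on the following line.
lines_to_start_question = set()
for code, indices in sorted(capi_code_lookup.items(), key=lambda x: len(x[1])):
    print(code, indices)
    if len(indices) == 1:
        lines_to_start_question.add(indices[0] + 1)

# Print the candidate question lines, skipping any that themselves start with a variable name.
# (The original test, idx not in capi_code_lookup, compared an int with string keys and was always true.)
for idx, line in enumerate(lines):
    if idx in lines_to_start_question and not re_code.match(line):
        print(line)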
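One possible refinement, going by the routing description quoted above: since routing conditions are opened by an 'IF' statement and closed by an 'ENDIF', each extracted line could be tagged with the conditions in force at that point. This is only a sketch under assumptions. It supposes that every 'IF' and 'ENDIF' sits at the start of its own line in the converted text, which would need checking against real files. The function name `attach_routing` and the sample lines are made up for illustration.

```python
def attach_routing(lines):
    """Pair each non-routing line with the IF conditions active when it appears."""
    active_conditions = []  # stack of currently open IF conditions
    tagged = []
    for line in lines:
        stripped = line.strip()
        if stripped == "ENDIF" or stripped.startswith("ENDIF "):
            # Close the most recently opened condition, if there is one.
            if active_conditions:
                active_conditions.pop()
        elif stripped.startswith("IF "):
            active_conditions.append(stripped[3:].strip())
        elif stripped:
            tagged.append((stripped, tuple(active_conditions)))
    return tagged


# Made-up lines, for illustration only.
sample_lines = [
    "ABCD",
    "Question one?",
    "IF ABCD = 1",
    "EFGH",
    "Question two?",
    "ENDIF",
    "IJKL",
]
tagged_lines = attach_routing(sample_lines)
```

Routing boxes, which the documentation says replace 'IF' and 'ENDIF' in the more complex cases, are not handled by this sketch.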
Description
Many large studies publish their questionnaire documentation as a PDF in CAPI (Computer-Assisted Personal Interviewing) format. These documents can run to hundreds of pages.
https://dimewiki.worldbank.org/Computer-Assisted_Personal_Interviews_(CAPI)
Description of CAPI below from MCS:
Example: Millennium Cohort Study
https://cls.ucl.ac.uk/wp-content/uploads/2017/07/MCS2_CAPI_Questionnaire_Documentation_June_2006_v1-2.pdf
Environment
Web Harmony
How to Reproduce
Expected Behavior
Harmony should find the correct questions.
Since CAPI is a special case, maybe we can write some Python code to do this instead of using the standard machine learning approach.