Make separate parsing code for CAPI formatted PDF files #57

woodthom2 · 2024-10-26T10:31:06Z

Description

Many large studies have a PDF report in a format called CAPI. This can run to hundreds of pages.

https://dimewiki.worldbank.org/Computer-Assisted_Personal_Interviews_(CAPI)

Description of CAPI below, taken from the Millennium Cohort Study (MCS) documentation:

In CAPI, rather than being numbered, questions are given a unique name – this name is usually 
derived from the content of the question. In most cases the first letter of the question name 
indicates in which module the question was asked. Questions are identified by their bold 
formatting. The text of the question that should be read out by the interviewer is displayed in 
lower case, with the end of the question usually indicated by a question mark. This may involve 
the interviewer reading through a list of pre-defined answers. At most questions, the respondent 
chooses his/her answer(s) from a pre-defined list which is either read out to him/her by the 
interviewer or which he/she reads from a card given to him/her by the interviewer. At other 
times the respondent is not offered a pre-defined choice of answer categories, instead the 
interviewer codes his/her spontaneous response to a pre-defined list of answers. Alternatively the 
interviewer may be asked to type in the answer given verbatim. At ‘text’ questions of this type, the 
number of characters allowed is limited (although interviewers can, where necessary, enter more 
characters in an electronic memo). At other ‘OPEN’ questions, there is no limit on the number of 
characters. Interviewer may also be asked to enter answers in the form of a date, time or number. 
Notes to help or instruct the interviewer are shown in upper case.    
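
As a rough illustration of the case conventions described above, a converted text file could be split into line types along these lines. This is only a sketch: bold formatting is lost once the PDF is converted to text, so it assumes a variable name shows up as a single upper-case token on its own line, and `classify_line` and its labels are hypothetical names, not part of Harmony.

```python
import re

# Assumption: a variable name is one token made of a capital letter followed
# by at least two capitals or digits, alone on its line.
VARIABLE_NAME = re.compile(r"^[A-Z][A-Z0-9]{2,}$")


def classify_line(line: str) -> str:
    """Label a line as 'blank', 'variable_name', 'interviewer_note' or 'other'."""
    text = line.strip()
    if not text:
        return "blank"
    if VARIABLE_NAME.match(text):
        return "variable_name"
    letters = [c for c in text if c.isalpha()]
    # Interviewer notes are shown in upper case, so every letter is a capital.
    if letters and all(c.isupper() for c in letters):
        return "interviewer_note"
    # Question text, answer categories and everything else.
    return "other"


for example in ["HHSIZE", "CODE ALL THAT APPLY", "How many people live here?"]:
    print(example, "->", classify_line(example))
```

A one-word interviewer note (for example `PROBE`) would be mislabelled as a variable name by this rule, so it would need checking against real files.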
 
Questions at which a pre-defined list of answers is given can be split into two types: single-coded 
and multi-coded. Single-coded questions allow only one answer category to be chosen – unless 
otherwise stated, assume that the question is single-coded. Multi-coded questions are usually 
identified by a note to the interviewer to ‘CODE ALL THAT APPLY’. At some multi-coded 
questions, the maximum number of codes allowed is less than the number offered. If this is the 
case, the maximum is stated below e.g. ‘[maximum 4 codes]’. In addition, at some multi-coded 
questions, one of the answers may be an ‘exclusive code’. This means that if this answer is chosen, 
no others may be. These codes are identified by ‘[exclusive code]’ after the answer category. Unless 
otherwise stated all questions also allow ‘Don’t Know’ and ‘Refusal’ answers to be entered. Where 
these are not allowed, it is stated below the answers.  
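
The markers quoted above (‘CODE ALL THAT APPLY’, ‘[maximum 4 codes]’, ‘[exclusive code]’) are regular enough that they could probably be picked up with plain string matching. A minimal sketch, assuming the lines belonging to one question have already been grouped together; `describe_question` and the sample question are made up for illustration.

```python
import re

MAX_CODES = re.compile(r"\[maximum (\d+) codes?\]", re.IGNORECASE)
EXCLUSIVE = re.compile(r"\[exclusive code\]", re.IGNORECASE)


def describe_question(block_lines):
    """Summarise the coding rules found in the lines of a single question."""
    multi_coded = False  # single-coded unless stated otherwise
    max_codes = None
    exclusive_answers = []
    for line in block_lines:
        if "CODE ALL THAT APPLY" in line:
            multi_coded = True
        found = MAX_CODES.search(line)
        if found:
            max_codes = int(found.group(1))
        if EXCLUSIVE.search(line):
            # Keep the answer text without the marker itself.
            exclusive_answers.append(EXCLUSIVE.sub("", line).strip())
    return {
        "multi_coded": multi_coded,
        "max_codes": max_codes,
        "exclusive_answers": exclusive_answers,
    }


# Hypothetical question block, not taken from a real questionnaire.
block = [
    "Which of these apply to you?",
    "CODE ALL THAT APPLY",
    "1 Work",
    "2 Study",
    "3 None of these [exclusive code]",
    "[maximum 2 codes]",
]
print(describe_question(block))
```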
 
At various points in the CAPI program it was necessary to derive variables in order primarily to 
route respondents correctly to subsequent questions. In particular, several variables were derived 
from answers given to the household grid. These are listed at the end of the section which 
documents the household module. In addition, there are some derived variables listed at the 
beginning of the Module A which indicate key information about the main/partner respondent 
which is needed for routing through the questionnaire. Derived variables also occur in a number 
of other modules and are identified as such. 
 
Routing instructions 
Routing instructions are fully detailed in italics at appropriate points. The routing condition is 
both explained in words and given in terms of the logical command. The expressions ‘<’, ‘=’, ‘>’ 
are used to denote ‘less than’, ‘equal to’ and ‘more than’. The term ‘<>’ means ‘not equal to’. The 
routing condition is displayed immediately before the first question to which it applies and is 
indicated by an ‘IF’ statement. The end of the influence of a particular routing condition is 
indicated by an ‘ENDIF’. In some cases, where the routing is more complex a routing box is used 
instead of ‘IF’ and ‘ENDIF’ statements to explain when the questions are asked.  For example:

Example: Millennium Cohort Study

https://cls.ucl.ac.uk/wp-content/uploads/2017/07/MCS2_CAPI_Questionnaire_Documentation_June_2006_v1-2.pdf
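
The ‘IF’ / ‘ENDIF’ routing convention described in the MCS text could be tracked with a small stack, so that each line is tagged with the conditions currently in force. This is a sketch under assumptions: it supposes each condition fits on one line beginning with `IF `, that conditions may nest, and that routing boxes are ignored. `attach_routing` and the sample lines are hypothetical.

```python
def attach_routing(lines):
    """Pair each non-routing line with the tuple of IF conditions that apply to it."""
    stack = []   # currently open routing conditions, outermost first
    tagged = []
    for line in lines:
        text = line.strip()
        if text.startswith("ENDIF"):
            if stack:
                stack.pop()  # close the most recently opened condition
        elif text.startswith("IF "):
            stack.append(text[3:].strip())
        elif text:
            tagged.append((text, tuple(stack)))
    return tagged


# Hypothetical variable names and conditions, for illustration only.
sample = [
    "AGE",
    "IF AGE > 16",
    "WORK",
    "IF WORK = 1",
    "HOURS",
    "ENDIF",
    "PAY",
    "ENDIF",
    "END",
]
for text, conditions in attach_routing(sample):
    print(text, conditions)
```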

Environment

Web Harmony

How to Reproduce

Take a CAPI file such as that for Millennium Cohort Study and upload it into Harmony
Harmony does not find the questions

Expected Behavior

Harmony should find the correct questions

Maybe we can write some Python code to do this instead of using the standard machine learning approach, since CAPI is a special case.


woodthom2 · 2024-10-26T10:31:55Z

@bmoltrecht tagging you as this is the issue you have flagged for me

woodthom2 · 2024-10-26T10:40:04Z

I have an idea of how to handle CAPI files: look for the capitalised variable names and take the text that follows each one. This needs some refinement and testing on many different files, but potentially we can handle CAPI without needing the usual machine learning component...

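import re

# Path to the CAPI PDF after conversion to plain text (placeholder: set this to your own file).
capi_file_converted_to_text = "capi_questionnaire.txt"

with open(capi_file_converted_to_text, "r", encoding="utf-8") as f:
    file_content = f.read()

lines = file_content.split("\n")

# A CAPI variable name: a capital letter followed by at least two capitals or digits, then a space.
re_code = re.compile(r'^([A-Z][A-Z0-9]{2,}) ')

# Map each variable name (without the trailing space) to the indices of the lines that start with it.
capi_code_lookup = {}
for idx, line in enumerate(lines):
    match = re_code.match(line)
    if match:
        capi_code_lookup.setdefault(match.group(1), []).append(idx)

# Indices of all lines that start with a variable name.
code_line_indices = {idx for indices in capi_code_lookup.values() for idx in indices}

# A name that occurs exactly once is assumed to introduce a question
# whose text starts on the following line.
lines_to_start_question = set()
for code, indices in sorted(capi_code_lookup.items(), key=lambda x: len(x[1])):
    print(code, indices)
    if len(indices) == 1:
        lines_to_start_question.add(indices[0] + 1)

# Print the candidate question lines, skipping any that themselves start with a variable name.
for idx, line in enumerate(lines):
    if idx in lines_to_start_question and idx not in code_line_indices:
        print(line)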
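
One possible refinement, sketched below, is to keep collecting lines after a unique variable name until a line ends with a question mark, since the MCS description says that is usually where a question ends. This is an untested assumption about how the PDFs convert to text: `extract_questions` is a hypothetical helper, the regex also accepts a name at the very end of a line, repeated upper-case tokens (such as the start of "CODE ALL THAT APPLY") are treated as interviewer notes and skipped, and the sample text is made up.

```python
import re
from collections import Counter

# Assumption: a variable name starts the line and is followed by a space or end of line.
RE_CODE = re.compile(r"^([A-Z][A-Z0-9]{2,})(?: |$)")


def extract_questions(text):
    """Return {variable_name: question_text} for names that occur exactly once."""
    lines = text.split("\n")
    codes = [RE_CODE.match(line) for line in lines]
    counts = Counter(m.group(1) for m in codes if m)
    questions = {}
    for idx, m in enumerate(codes):
        if m is None or counts[m.group(1)] != 1:
            continue
        parts = []
        for following, following_code in zip(lines[idx + 1:], codes[idx + 1:]):
            if following_code is not None:
                if counts[following_code.group(1)] == 1:
                    break      # the next question starts here
                continue       # repeated token: treat as an interviewer note
            stripped = following.strip()
            if stripped:
                parts.append(stripped)
            if stripped.endswith("?"):
                break          # a question mark usually ends the question
        if parts:
            questions[m.group(1)] = " ".join(parts)
    return questions


# Made-up sample in the assumed layout, not taken from a real questionnaire.
sample = (
    "FCIN \nHave you ever\nbeen in paid work?\n1 Yes\n2 No\n"
    "CODE ALL THAT APPLY \nFCJB \nCODE ALL THAT APPLY \nWhat is your job?\n"
)
print(extract_questions(sample))
```

Question text that happens to start a line with an upper-case acronym would be cut short by this rule, so it would still need testing on several real CAPI files.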

woodthom2 added the "bug" label (Something isn't working) Oct 26, 2024

woodthom2 assigned woodthom2 and bmoltrecht Oct 26, 2024

woodthom2 unassigned bmoltrecht Oct 26, 2024
