Make separate parsing code for CAPI formatted PDF files #57

Open
woodthom2 opened this issue Oct 26, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@woodthom2
Contributor

Description

Many large studies publish their questionnaire as a PDF report laid out in CAPI (Computer-Assisted Personal Interviewing) format. These reports can run to hundreds of pages.

https://dimewiki.worldbank.org/Computer-Assisted_Personal_Interviews_(CAPI)

Description of CAPI below from MCS:

In CAPI, rather than being numbered, questions are given a unique name – this name is usually 
derived from the content of the question. In most cases the first letter of the question name 
indicates in which module the question was asked. Questions are identified by their bold 
formatting. The text of the question that should be read out by the interviewer is displayed in 
lower case, with the end of the question usually indicated by a question mark. This may involve 
the interviewer reading through a list of pre-defined answers. At most questions, the respondent 
chooses his/her answer(s) from a pre-defined list which is either read out to him/her by the 
interviewer or which he/she reads from a card given to him/her by the interviewer. At other 
times the respondent is not offered a pre-defined choice of answer categories, instead the 
interviewer codes his/her spontaneous response to a pre-defined list of answers. Alternatively the 
interviewer may be asked to type in the answer given verbatim. At ‘text’ questions of this type, the 
number of characters allowed is limited (although interviewers can, where necessary, enter more 
characters in an electronic memo). At other ‘OPEN’ questions, there is no limit on the number of 
characters. Interviewer may also be asked to enter answers in the form of a date, time or number. 
Notes to help or instruct the interviewer are shown in upper case.    
 
Questions at which a pre-defined list of answers is given can be split into two types: single-coded 
and multi-coded. Single-coded questions allow only one answer category to be chosen – unless 
otherwise stated, assume that the question is single-coded. Multi-coded questions are usually 
identified by a note to the interviewer to ‘CODE ALL THAT APPLY’. At some multi-coded 
questions, the maximum number of codes allowed is less than the number offered. If this is the 
case, the maximum is stated below e.g. ‘[maximum 4 codes]’. In addition, at some multi-coded 
questions, one of the answers may be an ‘exclusive code’. This means that if this answer is chosen, 
no others may be. These codes are identified by ‘[exclusive code]’ after the answer category. Unless 
otherwise stated all questions also allow ‘Don’t Know’ and ‘Refusal’ answers to be entered. Where 
these are not allowed, it is stated below the answers.  
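
The coding conventions described in the paragraph above (‘CODE ALL THAT APPLY’, ‘[maximum 4 codes]’, ‘[exclusive code]’) look regular enough to pick up with a few patterns. The following is only a minimal sketch, assuming the markers survive PDF-to-text conversion exactly as quoted; `parse_answer_block` and the keys of the returned dictionary are hypothetical names, not part of Harmony.

```python
import re

# Markers as quoted in the MCS documentation; exact spelling in other
# studies' PDFs is an assumption.
RE_MAX_CODES = re.compile(r"\[maximum (\d+) codes?\]", re.IGNORECASE)
RE_EXCLUSIVE = re.compile(r"\[exclusive code\]", re.IGNORECASE)


def parse_answer_block(block_lines):
    """Summarise the coding rules found in the lines below one question."""
    # Questions are single-coded unless a note says otherwise.
    info = {"multi_coded": False, "max_codes": None, "exclusive": []}
    for line in block_lines:
        text = line.strip()
        if "CODE ALL THAT APPLY" in text.upper():
            info["multi_coded"] = True
        m = RE_MAX_CODES.search(text)
        if m:
            info["max_codes"] = int(m.group(1))
        if RE_EXCLUSIVE.search(text):
            # Store the answer label without the trailing marker.
            info["exclusive"].append(RE_EXCLUSIVE.sub("", text).strip())
    return info
```

The answer lines passed in would need to be isolated first, for example as the lines between one question name and the next.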
 
At various points in the CAPI program it was necessary to derive variables in order primarily to 
route respondents correctly to subsequent questions. In particular, several variables were derived 
from answers given to the household grid. These are listed at the end of the section which 
documents the household module. In addition, there are some derived variables listed at the 
beginning of the Module A which indicate key information about the main/partner respondent 
which is needed for routing through the questionnaire. Derived variables also occur in a number 
of other modules and are identified as such. 
 
Routing instructions 
Routing instructions are fully detailed in italics at appropriate points. The routing condition is 
both explained in words and given in terms of the logical command. The expressions ‘<’, ‘=’, ‘>’ 
are used to denote ‘less than’, ‘equal to’ and ‘more than’. The term ‘<>’ means ‘not equal to’. The 
routing condition is displayed immediately before the first question to which it applies and is 
indicated by an ‘IF’ statement. The end of the influence of a particular routing condition is 
indicated by an ‘ENDIF’. In some cases, where the routing is more complex a routing box is used 
instead of ‘IF’ and ‘ENDIF’ statements to explain when the questions are asked.  For example: 

Example: Millennium Cohort Study

https://cls.ucl.ac.uk/wp-content/uploads/2017/07/MCS2_CAPI_Questionnaire_Documentation_June_2006_v1-2.pdf

Environment

Web Harmony

How to Reproduce

  1. Take a CAPI file such as that for Millennium Cohort Study and upload it into Harmony
  2. Harmony does not find the questions

Expected Behavior

Harmony should find the correct questions

Since CAPI is a special case, maybe we can write some Python code to do this instead of using the standard machine learning approach.
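
As one illustration of what rule-based handling might look like, the ‘IF’ / ‘ENDIF’ routing convention quoted in the description could be tracked with a simple stack, so that each question is tagged with the conditions under which it is asked. This is a rough sketch, assuming each ‘IF’ and ‘ENDIF’ starts its own line in the converted text; `routing_conditions` is a hypothetical helper, and routing boxes are not handled.

```python
def routing_conditions(lines):
    """Yield (line, active_conditions) pairs, tracking IF/ENDIF nesting."""
    stack = []
    for line in lines:
        text = line.strip()
        if text.startswith("ENDIF"):
            # Close the most recent routing condition, if any is open.
            if stack:
                stack.pop()
        elif text.startswith("IF "):
            # Open a new routing condition; keep only the expression.
            stack.append(text[3:].strip())
        else:
            yield text, tuple(stack)
```

For a made-up fragment such as `["IF Age < 16", "SCHL Are you at school?", "ENDIF"]`, the question line is yielded together with the condition `("Age < 16",)`.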

@woodthom2 woodthom2 added the bug Something isn't working label Oct 26, 2024
@woodthom2
Contributor Author

@bmoltrecht tagging you as this is the issue you have flagged for me

@woodthom2
Contributor Author

I have an idea for handling CAPI files: look for the capitalised variable names and take the text that follows each one. This needs some refinement and testing on many different files, but potentially we can handle CAPI without needing the usual machine learning component.

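import re

# Path to the CAPI PDF after conversion to plain text.
# Placeholder value: replace with the real file path.
capi_file_converted_to_text = "capi_questionnaire.txt"

with open(capi_file_converted_to_text, "r", encoding="utf-8") as f:
    lines = f.read().split("\n")

# Candidate variable name: three or more capitals/digits at the start of a line, then a space
re_code = re.compile(r'^[A-Z][A-Z0-9][A-Z0-9]+ ')

# Map each candidate name to the indices of the lines where it occurs
capi_code_lookup = {}
for idx, line in enumerate(lines):
    m = re_code.match(line)
    if m:
        capi_code_lookup.setdefault(m.group(0), []).append(idx)

# Indices of all lines that start with a candidate name
code_line_indices = {i for indices in capi_code_lookup.values() for i in indices}

# A name that occurs exactly once is treated as a question name;
# the question text is assumed to start on the following line
lines_to_start_question = set()
for code, indices in sorted(capi_code_lookup.items(), key=lambda item: len(item[1])):
    print(code, indices)
    if len(indices) == 1:
        lines_to_start_question.add(indices[0] + 1)

for idx, line in enumerate(lines):
    # Skip a following line that itself starts with a candidate name
    if idx in lines_to_start_question and idx not in code_line_indices:
        print(line)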
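
The loop above prints only the single line after each variable name. One possible refinement, going by the MCS description (question text in lower case, usually ending in a question mark, interviewer notes in upper case), is to keep reading until a line ends in ‘?’. This is an untested-on-real-files sketch; `collect_question_text` and its `max_lines` cut-off are hypothetical.

```python
def collect_question_text(lines, start, max_lines=10):
    """Join lines from `start` until one ends with '?' or max_lines is reached."""
    parts = []
    for line in lines[start:start + max_lines]:
        text = line.strip()
        if not text:
            continue
        # Entirely upper-case lines are treated as interviewer notes and skipped.
        if text.isupper():
            continue
        parts.append(text)
        if text.endswith("?"):
            break
    return " ".join(parts)
```

It could be called with each index in `lines_to_start_question` as `start`. Questions that do not end in a question mark would need the `max_lines` cut-off or some other stopping rule.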
