what are your train.json formats of genia in datasets? #8

Open
fantao766 opened this issue Jun 14, 2023 · 5 comments

Comments

@fantao766

Hello, I'm very impressed by your insight and very interested in the model and the innovations in this paper. I hit some errors when running the code from GitHub, and I don't know how to resolve them. So I would like to ask one question: what is the format of your train.json files for GENIA in datasets? I couldn't run data_preprocess to begin with, so I tried to generate train.json with my own preprocessing code, but unfortunately it doesn't work. Hope to get an answer.
Thanks~!

@kristinlindquist

kristinlindquist commented Jul 11, 2023

Hi! If I understand your question, you ran into the same problem that I did (all the links to the mysteriously preprocessed genia dataset are 404s). I found some files here - https://github.com/yhcc/CNN_Nested_NER/tree/master/preprocess/outputs/genia - which I was able to convert with this adjusted method:

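import json
import os

# `convert` is assumed to be defined elsewhere in the preprocessing script
# (it maps one source record to one output record).
def format_data(input_path, output_path):
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    entities, docs = 0, 0
    with open(output_path, 'w', encoding='utf-8') as fw, open(input_path, encoding='utf-8') as fr:
        ids = set()
        for idx, ln in enumerate(fr):
            # Skip blank lines between records.
            if ln == '\n':
                continue
            example = json.loads(ln)
            try:
                example = convert(example)
            except Exception as e:
                # Some rows in the dataset are flawed; report and skip them.
                print(f"Could not convert line {idx} ({e}); skipping.")
                continue
            entities += len(example["entity_types"])
            docs += 1
            # Every record must have a unique id.
            assert example['id'] not in ids
            ids.add(example['id'])
            fw.write(json.dumps(example) + '\n')
    print(f"Entities: {entities}")
    print(f"Docs: {docs}")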

(Edit: I was off by one in the previous code. There are some flawed rows in that dataset, but now I'm just skipping them.)
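Before converting, it may help to check which fields the downloaded files actually contain. The helper below is only a sketch: `summarize_jsonlines` and its `limit` parameter are hypothetical names, not part of the repository, and it assumes one JSON object per line.

```python
import json
from collections import Counter


def summarize_jsonlines(path, limit=100):
    """Count the top-level keys seen in the first `limit` non-empty lines."""
    key_counts = Counter()
    seen = 0
    with open(path, encoding="utf-8") as fr:
        for ln in fr:
            if not ln.strip():
                continue  # ignore blank lines between records
            key_counts.update(json.loads(ln).keys())
            seen += 1
            if seen >= limit:
                break
    return key_counts


# Usage (the path is a placeholder):
# print(summarize_jsonlines("genia/train.jsonlines"))
```

A key that shows up in fewer records than the others points at the kind of flawed rows that the conversion loop skips.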

@Veranchos

Hi @kristinlindquist ! Thank you for your help! I have a question: what was your convert function for GENIA .jsonlines in that case?

@helleuch

Hello, thank you very much for your help, I would also like to ask about the function used to convert the jsonlines, please.
Thank you again !

@Veranchos

Veranchos commented Jul 17, 2023

@helleuch
The following function worked for me:

from typing import Dict

def convert_genia(example: Dict) -> Dict:
    # Join the tokens with single spaces and record each token's
    # (start, end) character span in the joined text.
    offset_mapping = []
    text = ''
    for token in example['tokens']:
        if text:
            text += ' '
        offset_mapping.append((len(text), len(text) + len(token)))
        text += token
    entity_types, entity_start_chars, entity_end_chars = [], [], []
    for ann in example['entity_mentions']:
        start = ann["start"]
        end = ann["end"]
        entity_type = ann["entity_type"]
        # NOTE: the `- 1` on both indices treats start and end as 1-based,
        # inclusive token positions. If the source file instead uses a 0-based
        # start with an exclusive end, the start lookup should be
        # offset_mapping[start][0]. Check a few mentions against the text.
        start, end = offset_mapping[start - 1][0], offset_mapping[end - 1][1]
        entity_types.append(entity_type)
        entity_start_chars.append(start)
        entity_end_chars.append(end)
    # Split the per-token spans into parallel start/end sequences.
    start_words, end_words = zip(*offset_mapping)
    return {
        'text': text,
        'entity_types': entity_types,
        'entity_start_chars': entity_start_chars,
        'entity_end_chars': entity_end_chars,
        'id': example['sent_id'],
        'word_start_chars': start_words,
        'word_end_chars': end_words
    }
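To make the target format concrete, here is a toy record built by hand with the same output keys. Everything in it is made up for illustration: the sentence, the `DNA` label, the `toy-0` id, and the helper name `char_spans`. The mention indices use a 0-based start and an exclusive end, which is an assumption for this sketch and may not match the source files.

```python
import json


def char_spans(tokens):
    """Return the space-joined text and one (start, end) character span per token."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok) + 1  # +1 for the joining space
    return " ".join(tokens), spans


tokens = ["IL-2", "gene", "expression"]
text, spans = char_spans(tokens)

# Hypothetical mention over tokens 0..1 (0-based start, exclusive end).
tok_start, tok_end = 0, 2

record = {
    "text": text,
    "entity_types": ["DNA"],
    "entity_start_chars": [spans[tok_start][0]],
    "entity_end_chars": [spans[tok_end - 1][1]],
    "id": "toy-0",
    "word_start_chars": [s for s, _ in spans],
    "word_end_chars": [e for _, e in spans],
}

# One line of the output file is one such JSON object.
print(json.dumps(record))
```

Slicing `record["text"]` with the entity's start and end characters should give back the mention, here `IL-2 gene`.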

... and the corresponding changes in the main function:

import json
import os

# convert_conll2003 and convert_default come from the existing preprocessing script.
def main(args):
    # Pick the conversion function for the requested task.
    if args.task == "conll2003":
        convert = convert_conll2003
    elif args.task == "genia":
        convert = convert_genia
    else:
        convert = convert_default
    os.makedirs(os.path.dirname(args.output), exist_ok=True)
    entities, docs = 0, 0
    with open(args.output, 'w', encoding='utf-8') as fw, open(args.input, encoding='utf-8') as fr:
        ids = set()
        for idx, ln in enumerate(fr):
            # Skip blank lines between records.
            if ln == '\n':
                continue
            example = json.loads(ln)
            try:
                example = convert(example)
            except Exception as e:
                # Report flawed rows and move on.
                print(f"Could not convert line {idx} ({e}); skipping.")
                continue
            entities += len(example["entity_types"])
            docs += 1
            # Every record must have a unique id.
            assert example['id'] not in ids
            ids.add(example['id'])
            fw.write(json.dumps(example) + '\n')
    print(f"Entities: {entities}")
    print(f"Docs: {docs}")
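Since the thread mentions an off-by-one problem, a quick sanity check on the converted file may be worthwhile. The sketch below is not part of the repository, and `check_record` and `check_file` are hypothetical names. It assumes only the output keys shown above, and that every entity should start on a word start and end on a word end.

```python
import json


def check_record(record):
    """Return a list of problems found in one converted record (empty if none)."""
    problems = []
    word_starts = set(record["word_start_chars"])
    word_ends = set(record["word_end_chars"])
    spans = zip(record["entity_types"],
                record["entity_start_chars"],
                record["entity_end_chars"])
    for etype, start, end in spans:
        if start >= end:
            problems.append(f"{etype}: start {start} is not before end {end}")
        if start not in word_starts:
            problems.append(f"{etype}: start {start} is not a word start")
        if end not in word_ends:
            problems.append(f"{etype}: end {end} is not a word end")
    return problems


def check_file(path):
    """Print the problems per line and return the number of records with problems."""
    bad = 0
    with open(path, encoding="utf-8") as fr:
        for idx, ln in enumerate(fr):
            if not ln.strip():
                continue
            problems = check_record(json.loads(ln))
            if problems:
                bad += 1
                print(f"line {idx}: {problems}")
    return bad
```

An index shifted by one token can still land on a word boundary, so this check only catches some errors. Comparing a few sliced spans against the original mentions by eye is still a good idea.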

@helleuch

@Veranchos Thank you very much !
