unified function to check if a character is valid in an identifier #149

s311354 · 2025-01-01T08:44:30Z

Problem 

The current is_ident1 and is_ident2 functions work without errors, but they split a single question across two functions: whether a character is valid in an identifier depends only on whether it is the first character or a subsequent one.

// This function returns true if a given character is acceptable as the first character of an identifier.

bool is_ident1(uint32_t c) {
  static uint32_t range[] = {
    '_', '_', 'a', 'z', 'A', 'Z', '$', '$',
    0x00A8, 0x00A8, 0x00AA, 0x00AA, 0x00AD, 0x00AD, 0x00AF, 0x00AF,
    0x00B2, 0x00B5, 0x00B7, 0x00BA, 0x00BC, 0x00BE, 0x00C0, 0x00D6,
    0x00D8, 0x00F6, 0x00F8, 0x00FF, 0x0100, 0x02FF, 0x0370, 0x167F,
    0x1681, 0x180D, 0x180F, 0x1DBF, 0x1E00, 0x1FFF, 0x200B, 0x200D,
    0x202A, 0x202E, 0x203F, 0x2040, 0x2054, 0x2054, 0x2060, 0x206F,
    0x2070, 0x20CF, 0x2100, 0x218F, 0x2460, 0x24FF, 0x2776, 0x2793,
    0x2C00, 0x2DFF, 0x2E80, 0x2FFF, 0x3004, 0x3007, 0x3021, 0x302F,
    0x3031, 0x303F, 0x3040, 0xD7FF, 0xF900, 0xFD3D, 0xFD40, 0xFDCF,
    0xFDF0, 0xFE1F, 0xFE30, 0xFE44, 0xFE47, 0xFFFD,
    0x10000, 0x1FFFD, 0x20000, 0x2FFFD, 0x30000, 0x3FFFD, 0x40000, 0x4FFFD,
    0x50000, 0x5FFFD, 0x60000, 0x6FFFD, 0x70000, 0x7FFFD, 0x80000, 0x8FFFD,
    0x90000, 0x9FFFD, 0xA0000, 0xAFFFD, 0xB0000, 0xBFFFD, 0xC0000, 0xCFFFD,
    0xD0000, 0xDFFFD, 0xE0000, 0xEFFFD, -1,
  };

  return in_range(range, c);
}

// This function returns true if a given character is acceptable as a non-first character of an identifier.

bool is_ident2(uint32_t c) {
  static uint32_t range[] = {
    '0', '9', '$', '$', 0x0300, 0x036F, 0x1DC0, 0x1DFF, 0x20D0, 0x20FF,
    0xFE20, 0xFE2F, -1,
  };

  return is_ident1(c) || in_range(range, c);
}
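Both functions rely on in_range, which is not shown in this description. As a minimal sketch, assuming each table is a flat list of inclusive [low, high] pairs terminated by -1 (which becomes 0xFFFFFFFF when stored as uint32_t), the helper could look like this:

```c
#include <stdbool.h>
#include <stdint.h>

// Sketch of the lookup helper, written from the table layout shown above.
// `range` holds inclusive [low, high] pairs and ends with (uint32_t)-1.
static bool in_range(const uint32_t *range, uint32_t c) {
  // Step by 2 so that range[i] is always a lower bound.
  for (int i = 0; range[i] != (uint32_t)-1; i += 2)
    if (range[i] <= c && c <= range[i + 1])
      return true;
  return false;
}
```

With this layout, a single code point such as '_' is written as the pair '_', '_'.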

Proposal

I combined the is_ident1 and is_ident2 code into a single logic block in tokenize.c. This change streamlines the check for valid characters while reading identifiers.

The combined logic still distinguishes valid first characters from valid subsequent characters, which is the critical part, while reducing code duplication and improving maintainability. Consolidating the logic might also help caching efficiency and CPU branch prediction. I would even propose eliminating the redundant calls to separate functions, but this really depends on your focus in this project.
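One way the consolidation could look is sketched here. The names is_ident_char, is_first, and RANGE_END are hypothetical and do not come from the commits, the in_range helper is an assumed reimplementation, and both tables are abbreviated for illustration. A real version would carry the full tables from the Problem section.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical name for the -1 sentinel that ends each range table.
#define RANGE_END ((uint32_t)-1)

// Assumed helper: tables are inclusive [low, high] pairs ending in RANGE_END.
static bool in_range(const uint32_t *range, uint32_t c) {
  for (int i = 0; range[i] != RANGE_END; i += 2)
    if (range[i] <= c && c <= range[i + 1])
      return true;
  return false;
}

// Hypothetical unified check. Pass is_first = true for the first character
// of an identifier and false for any later character.
bool is_ident_char(uint32_t c, bool is_first) {
  // Abbreviated: only a few ranges from the full first-character table.
  static const uint32_t first[] = {
    '_', '_', 'a', 'z', 'A', 'Z', '$', '$',
    0x00C0, 0x00D6, 0x00D8, 0x00F6, RANGE_END,
  };
  // Extra ranges accepted only after the first character.
  static const uint32_t rest[] = {
    '0', '9', 0x0300, 0x036F, 0x1DC0, 0x1DFF, 0x20D0, 0x20FF,
    0xFE20, 0xFE2F, RANGE_END,
  };

  if (in_range(first, c))
    return true;
  return !is_first && in_range(rest, c);
}

// Thin wrappers keep existing call sites compiling unchanged.
bool is_ident1(uint32_t c) { return is_ident_char(c, true); }
bool is_ident2(uint32_t c) { return is_ident_char(c, false); }
```

Keeping is_ident1 and is_ident2 as wrappers is a design choice, not a requirement: callers could instead pass the flag directly, which is what removing the separate function calls would amount to.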

s311354 added 3 commits January 1, 2025 16:12

unified function to check if a character is valid in an identifier

a327851

handle NULL safety

77a55ba

avoid behave unpredictably

2898ce7
