What is the difference between preprocessing tokenizer and input_features tokenizer? #2230
Hi, I'm trying to understand the difference between the following two configurations:

"input_features": [
    {
        "name": "both_text",
        "type": "text",
        "preprocessing": {
            "tokenizer": "space",
        },
        "encoder": "rnn",
        "cell_type": "lstm",
        "num_layers": 8,
        "reduce_output": None,
    }
],
"preprocessing": {
    "split_probabilities": [0.8, 0.1, 0.1],
},

and

"input_features": [
    {
        "name": "both_text",
        "type": "text",
        "encoder": "rnn",
        "cell_type": "lstm",
        "num_layers": 8,
        "reduce_output": None,
    }
],
"preprocessing": {
    "split_probabilities": [0.8, 0.1, 0.1],
    "text": {
        "tokenizer": "space",
    },
},

They both appear to be tokenizing the input text on spaces. Can anyone please tell me what the difference is between setting the tokenizer in the feature's own preprocessing section versus in the global preprocessing's "text" section? Also, if I set the tokenizer in both places, which one is used?
Thanks
Hi @farazk86!
A couple of questions for you regarding the error you are running into:
The two configurations you provided should be the same regarding preprocessing. Feature-specific preprocessing configuration parameters can override global preprocessing configuration parameters. This is documented here.
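To make the precedence concrete, here is a minimal sketch (not from the original thread) that sets a tokenizer in both places. Following the explanation above, the feature-level setting would apply to "both_text", while the global "text" section would only apply to text features that don't define their own tokenizer; the "characters" tokenizer below is just an illustrative second value.

# Hypothetical config sketch: feature-level preprocessing overrides the global "text" defaults.
config = {
    "input_features": [
        {
            "name": "both_text",
            "type": "text",
            "preprocessing": {
                "tokenizer": "space",  # used for "both_text" (overrides the global setting)
            },
            "encoder": "rnn",
            "cell_type": "lstm",
            "num_layers": 8,
            "reduce_output": None,
        }
    ],
    "preprocessing": {
        "split_probabilities": [0.8, 0.1, 0.1],
        "text": {
            "tokenizer": "characters",  # default for any other text feature without its own tokenizer
        },
    },
}

In your two original configurations only one tokenizer is ever specified, so the resulting preprocessing is identical either way.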