The language used for tokenization is specified at the column level, or can be identified within 用于标记化的语言将在列级指定,或者也可以通过筛选器组件在
tokenization is important because it defines the units within the data that are compared to each other 因为标记化在要进行比较的数据内定义相关单元,所以标记化是非常重要的操作。
If you are unsure, a general best bet is to use the neutral word breaker, which performs its tokenization purely on white space and punctuation 如果您对此不能确定,常用的最好办法是使用非特定语言断字符,这种断字符只基于空格和标点进行标记化。
The transformation provides a default set of delimiters used to tokenize the data, but you can add new delimiters that improve the tokenization of your data 该转换提供了一组用于标记化数据的默认分隔符,但您可以添加新的分隔符以改善您数据的标记化。