support `DefineTokens` with AloneTokenOption and IgnoreCaseTokenOption by einsitang · Pull Request #38 · bzick/tokenizer

einsitang · 2025-06-30T11:49:34Z

`AloneTokenOption`

Currently, once a token is defined, it is matched even when more letters follow it directly, so the input is split right after the token value:

...
tokenizer.DefineTokens(HelloKey, []string{"hello"})
...
input := "helloworld"
stream := tokenizer.ParseString(input)

for stream.IsValid() {
  token := stream.CurrentToken()
  // yields two tokens: hello, world
  stream.GoNext()
}

With `AloneTokenOption`:

...
tokenizer.DefineTokens(HelloKey, []string{"hello"}, AloneTokenOption)
...
input := "helloworld"
stream := tokenizer.ParseString(input)

for stream.IsValid() {
  token := stream.CurrentToken()
  // yields one token: helloworld
  stream.GoNext()
}

With this option, `hello` is matched only when it stands alone.
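To make the "stands alone" rule concrete, here is a minimal sketch of the check. `matchAlone` is a hypothetical helper written for this illustration, not the code in this PR, and it assumes that "alone" means "not immediately followed by a letter".

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// matchAlone reports whether input starts with token and the token is not
// immediately followed by another letter. Hypothetical helper for
// illustration only; it is not the tokenizer's implementation.
func matchAlone(input, token string) bool {
	if !strings.HasPrefix(input, token) {
		return false
	}
	rest := input[len(token):]
	if rest == "" {
		return true // the token reaches the end of the input
	}
	next, _ := utf8.DecodeRuneInString(rest)
	return !unicode.IsLetter(next)
}

func main() {
	fmt.Println(matchAlone("helloworld", "hello"))  // false
	fmt.Println(matchAlone("hello world", "hello")) // true
	fmt.Println(matchAlone("hello", "hello"))       // true
}
```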

`IgnoreCaseTokenOption`

Makes the token value match case-insensitively (#12).

Usage example:

tokenizer.DefineTokens(HelloKey, []string{"hello"}, IgnoreCaseTokenOption)
...
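As a rough illustration of what case-insensitive matching means for a token like `hello`, the sketch below uses the standard library's `strings.EqualFold`. `hasPrefixFold` is a hypothetical helper, not the PR's implementation, and slicing by `len(token)` assumes an ASCII token.

```go
package main

import (
	"fmt"
	"strings"
)

// hasPrefixFold reports whether input starts with token, ignoring case.
// Hypothetical helper; slicing by len(token) assumes an ASCII token.
func hasPrefixFold(input, token string) bool {
	return len(input) >= len(token) && strings.EqualFold(input[:len(token)], token)
}

func main() {
	for _, in := range []string{"hello", "HELLO world", "HeLLo", "help"} {
		fmt.Printf("%q -> %v\n", in, hasPrefixFold(in, "hello"))
	}
}
```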

Non-breaking Change: tokenizer.DefineTokens API

refactor: rename ignoreCaseAlphabet to upperCaseAlphabet for clarity

bzick · 2025-06-30T12:22:44Z

It doesn't work with Unicode, but Unicode support is one of the main features.

einsitang · 2025-06-30T12:55:51Z

> It doesn't work with Unicode, but Unicode support is one of the main features.

Do you mean that IgnoreCase does not work with Unicode?

I checked the encoding table. In both Unicode and ASCII, the lowercase and uppercase letters differ by 32.

einsitang · 2025-07-07T13:24:36Z

> but Unicode support is one of the main features.

How is the Unicode feature progressing at the moment?

bzick · 2025-07-07T20:27:03Z

I mean:
• that the shift by 32 doesn’t work with alphabets of other languages
• you’re working with only 1 byte and as a result…
• … isAlphabet checks 1 byte, but it should work with runes (multi-byte)
and so on and so forth
See https://symbl.cc/en/unicode-table/
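A small sketch of the first point, using only the Go standard library. `shift32` and `sameAsUnicode` are hypothetical helpers written for this illustration; they compare the subtract-32 trick with `unicode.ToUpper` for a few lowercase letters.

```go
package main

import (
	"fmt"
	"unicode"
)

// shift32 is the ASCII-only trick: subtract 32 from a lowercase letter.
func shift32(r rune) rune { return r - 32 }

// sameAsUnicode reports whether the trick agrees with unicode.ToUpper.
func sameAsUnicode(r rune) bool { return shift32(r) == unicode.ToUpper(r) }

func main() {
	for _, r := range []rune{'a', 'ё', 'ÿ', 'ā'} {
		fmt.Printf("%c: upper=%c (U+%04X), shift by 32 correct: %v, UTF-8 bytes: %d\n",
			r, unicode.ToUpper(r), unicode.ToUpper(r), sameAsUnicode(r), len(string(r)))
	}
}
```

Only `a` passes here: `ё` maps to `Ё` 80 code points lower, `ā` maps to `Ā` one code point lower, and `ÿ` maps to `Ÿ`, which has a higher code point. The last three are also two bytes each in UTF-8, so a check on a single byte never sees the whole letter.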

einsitang · 2025-07-09T10:43:02Z

The problem you mentioned does indeed exist.

However, when the tokenizer performs the defineToken operation, it indexes tokens by their first byte, and during parsing it also advances byte by byte. The indexing and advancing would need to be changed to work on runes before I can do this check correctly.
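To illustrate the difference between the two units, here is a small sketch. `firstByte` and `firstRune` are hypothetical helpers, not the tokenizer's internals. It shows that two different CJK characters share the same first byte, while `utf8.DecodeRuneInString` tells them apart and reports how many bytes to advance.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstByte returns the first byte of s, the key a byte-indexed lookup would use.
// It assumes s is non-empty.
func firstByte(s string) byte { return s[0] }

// firstRune returns the first rune of s and its width in bytes.
func firstRune(s string) (rune, int) { return utf8.DecodeRuneInString(s) }

func main() {
	for _, s := range []string{"hello", "哈喽", "喽"} {
		r, size := firstRune(s)
		fmt.Printf("%q: first byte 0x%X, first rune %q (%d bytes)\n", s, firstByte(s), r, size)
	}
}
```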

einsitang · 2025-07-09T14:03:23Z

There is a new implementation.

`IgnoreCaseTokenOption` only supports alphabetic tokens; defining a token that contains any other characters will panic.

tokenizer.DefineTokens(HelloKey, []string{"hello", "哈喽"}, IgnoreCaseTokenOption) // panics, because "哈喽" is not alphabetic
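A minimal sketch of such a check, assuming that "alphabet" here means ASCII letters only (the comment does not spell this out). `isASCIILetters` and `checkIgnoreCaseTokens` are hypothetical names, not functions from this PR.

```go
package main

import "fmt"

// isASCIILetters reports whether s is non-empty and consists only of a-z or A-Z.
// Assumption: this is what "alphabet token" means in the comment above.
func isASCIILetters(s string) bool {
	if s == "" {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		if !((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
			return false
		}
	}
	return true
}

// checkIgnoreCaseTokens panics on the first token that is not ASCII letters only.
func checkIgnoreCaseTokens(tokens []string) {
	for _, t := range tokens {
		if !isASCIILetters(t) {
			panic(fmt.Sprintf("token %q is not alphabetic", t))
		}
	}
}

func main() {
	defer func() { fmt.Println("recovered:", recover()) }()
	checkIgnoreCaseTokens([]string{"hello", "哈喽"})
}
```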

n-peugnet · 2025-08-03T09:18:51Z

> Non-breaking Change: tokenizer.DefineTokens API

This is a breaking API change, though (see: https://go.dev/blog/module-compatibility#adding-to-a-function).

You should add a new function like DefineTokensOptions() and make DefineTokens() call it underneath for example.
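Roughly like this. It is a sketch with toy stand-in types: `Tokenizer`, `TokenOption`, the `int` key and the `options` map are assumptions made for the illustration, not the library's real definitions. Only the option names and `DefineTokensOptions` come from this thread.

```go
package main

import "fmt"

// TokenOption and Tokenizer are toy stand-ins, not the library's real types.
type TokenOption int

const (
	AloneTokenOption TokenOption = iota + 1
	IgnoreCaseTokenOption
)

type Tokenizer struct {
	options map[string][]TokenOption
}

func New() *Tokenizer { return &Tokenizer{options: map[string][]TokenOption{}} }

// DefineTokens keeps its original signature, so existing callers and any
// function values or interfaces typed against it continue to compile.
func (t *Tokenizer) DefineTokens(key int, tokens []string) *Tokenizer {
	return t.DefineTokensOptions(key, tokens)
}

// DefineTokensOptions is the new entry point that accepts options.
func (t *Tokenizer) DefineTokensOptions(key int, tokens []string, opts ...TokenOption) *Tokenizer {
	for _, tok := range tokens {
		t.options[tok] = opts
	}
	return t
}

func main() {
	t := New()
	// An assignment like this stops compiling if a variadic parameter is
	// added to DefineTokens itself, because the method's type changes.
	var define func(int, []string) *Tokenizer = t.DefineTokens
	define(1, []string{"hello"})
	t.DefineTokensOptions(2, []string{"select"}, IgnoreCaseTokenOption)
	fmt.Println(len(t.options["hello"]), len(t.options["select"])) // 0 1
}
```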

einsitang added 4 commits June 30, 2025 02:04

feat:token case insensitivity and support for separate splitting

f28e465

Initial commit

296e054

feat: add ignore case support for token parsing and enhance tests

982a7a9

refactor: rename ignoreCaseAlphabet to upperCaseAlphabet for clarity

update test case

547b4ff

einsitang wants to merge 4 commits into bzick:master from einsitang:codespace-improved-waffle-r74qv7vjx7c5v54
