Lexical analysis
Conversion of character sequences into token sequences in computer science / From Wikipedia, the free encyclopedia
Dear Wikiwand AI, let's keep it short by simply answering these key questions:
Can you list the top facts and stats about Lexeme (computer science)?
Summarize this article for a 10 year old
Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a "lexer" program. In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc. In case of a programming language, the categories include identifiers, operators, grouping symbols and data types. Lexical tokenization is related to the type of tokenization used in Large language models (LLMs), but with two differences. First, lexical tokenization is usually based on a lexical grammar, whereas LLM tokenizers are usually probability-based. Second, LLM tokenizers perform a second step that converts the tokens into numerical values.