Lexer

The lexer (tokenizer) is responsible for converting raw source code into a sequence of tokens. These tokens form the input to the parser.

Overview

The Mars lexer is implemented as a single-pass, regex-based tokenizer. It scans the source code from left to right and matches patterns using a compiled master regular expression.

Each match produces a Token object:

@dataclass
class Token:
    type: str
    value: str
    position: int

Tokenization Strategy

All token patterns are defined in TOKEN_SPEC and combined into a single regular expression:

MASTER_REGEX = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

The lexer repeatedly:

Matches the next token at the current position
Identifies the token type via named groups
Advances the cursor
Appends the token to the output list

Token Categories

The lexer recognizes several categories of tokens:

Keywords

Types: int, float, bool, string, void, dict
Control flow: if, else, while, for, break, continue
Structural: class, component, extends

Literals

Integers: 123
Floats: 3.14
Strings: "hello"
Booleans: true, false

Operators

Arithmetic: +, -, *, /, ^
Assignment: =, +=, -=, *=, /=
Comparison: ==, !=, <, >, <=, >=
Logical: &&, ||, !

Punctuation

Braces: { }
Parentheses: ( )
Separators: ;, ,, :
Access: .

Identifiers

Pattern: [A-Za-z_][A-Za-z0-9_]*
Used for variable names, function names, etc.

Order Matters

Token patterns are matched in order, which is critical for correctness.

Examples:

FLOAT must appear before INT
→ ensures 3.14 is not parsed as 3 and .14
== must appear before =
→ ensures equality is not split into two tokens
Keywords must appear before ID
→ ensures if is not treated as an identifier

Skipped Tokens

The lexer ignores:

Whitespace (SKIP)
Line comments (# comment)
Block comments (/* comment */)

These tokens are matched but not emitted.

Error Handling

If no token matches the current input:

if not m:
    raise SyntaxError(...)

The lexer:

Stops immediately
Reports the exact position
Uses format_source_error for detailed diagnostics

End of File

After processing all input, the lexer appends:

Token("EOF", "", pos)

This marks the end of the token stream for the parser.

Example

Input:

int x = 5;

Output:

INT_KW(int)
ID(x)
ASSIGN(=)
INT(5)
SEMI(;)
EOF

Design Notes

The lexer is deterministic and single-pass
It does not perform semantic validation
All structure is deferred to the parser and later stages

Future Improvements

Track line/column numbers (currently only position)
Support escape sequences in strings
Improve error recovery (continue after failure)

Overview​

Tokenization Strategy​

Token Categories​

Keywords​

Literals​

Operators​

Punctuation​

Identifiers​

Order Matters​

Skipped Tokens​

Error Handling​

End of File​

Example​

Design Notes​

Future Improvements​