

@jfmcdowell jfmcdowell commented Jan 29, 2026

Note

AI Assistance Disclosure: This PR was developed with assistance from Claude Code.

Summary

This PR significantly improves the markdown parser's CommonMark spec conformance and reorganizes the parser module structure for better maintainability.

Conformance improvements:

  • Fixed link/reference parsing edge cases (nested brackets, escaped characters, parenthesis depth limits)
  • Enhanced emphasis parsing with proper delimiter stack algorithm and flanking rules
  • Improved block structure handling (setext headings, list interruption rules, lazy continuation)
  • Added Unicode case folding for link reference matching (CommonMark requirement)
  • Fixed HTML block detection and inline HTML parsing
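The emphasis work above hinges on CommonMark's flanking rules for delimiter runs. As a rough, self-contained sketch (not code from this PR — it simplifies to the characters adjacent to the run and uses ASCII punctuation where the spec uses Unicode punctuation):

```rust
/// Simplified CommonMark left-flanking check for an emphasis delimiter run.
/// `before`/`after` are the characters adjacent to the run; `None` means the
/// run sits at the start/end of the line (treated as whitespace by the spec).
fn is_left_flanking(before: Option<char>, after: Option<char>) -> bool {
    let after_ws = after.map_or(true, char::is_whitespace);
    let after_punct = after.map_or(false, |c| c.is_ascii_punctuation());
    let before_ws = before.map_or(true, char::is_whitespace);
    let before_punct = before.map_or(false, |c| c.is_ascii_punctuation());
    // Left-flanking: not followed by whitespace, and either not followed by
    // punctuation, or preceded by whitespace/punctuation.
    !after_ws && (!after_punct || before_ws || before_punct)
}
```

So `*foo` opens emphasis, `foo * bar` does not, and `a*"b"` does not (the `*` is followed by punctuation without whitespace/punctuation before it).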

Structural improvements:

  • Split monolithic inline.rs (2600+ lines) into focused submodules:
    • inline/code_span.rs — backtick-delimited code
    • inline/emphasis.rs — bold/italic with delimiter stack
    • inline/links.rs — links and images
    • inline/html.rs — autolinks and inline HTML
    • inline/entities.rs — character references
  • Adopted syntax/mod.rs convention (consistent with HTML/CSS parsers)
  • Extracted helpers to simplify the main parsing loop (handle_inline_newline, classify_quote_break_after_newline)

Dependency:
CommonMark requires full Unicode case folding for reference label matching (e.g., ß → ss, ẞ → ss).
caseless (v0.2.2) is:

  • Pure Rust, WASM-compatible
  • Provides full case folding via default_case_fold_str()
  • MIT licensed, maintained by unicode-rs (same org as unicode-normalization)
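A std-only sketch of why the dependency is needed: Rust's one-way case mapping cannot equate `ß` with `SS` (`'ß'.to_lowercase()` is still `"ß"`), but a lowercase-then-uppercase round trip approximates the fold for such labels — the same trick `normalize_reference_label` applies in this diff. `fold_label` here is illustrative, not the PR's actual helper:

```rust
/// CommonMark matches reference labels under Unicode case folding, so
/// `[ß]`, `[SS]`, and `[ẞ]` must all resolve to the same definition.
/// Lowercasing alone leaves 'ß' unchanged; uppercasing afterwards expands
/// it to "SS", giving a usable canonical form for these cases.
fn fold_label(label: &str) -> String {
    label.to_lowercase().to_uppercase()
}
```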

Test Plan

- just test-crate biome_markdown_parser
- just test-markdown-conformance
- just f && just l
- Snapshot tests updated for improved parsing

#3718

…nderer

  Decode HTML entities and percent-encode URLs in destinations/titles for
  CommonMark conformance. Apply the same normalization to autolink hrefs,
  unescape info strings, and preserve tab/NBSP characters in paragraph
  trimming. Add percent-encoding as a test_utils dependency.

This commit consolidates several fixes to link parsing, reference definitions, and URL validation, aligning them with the CommonMark specification.

Changes include:
- Enforce strict parenthesis depth counting in link destinations to correctly handle nested parens.
- Tighten link title parsing to handle edge cases with escapes and mismatched delimiters.
- Improve validation of link destinations, specifically handling whitespace and control characters.
- Fix inline link parsing fallback mechanisms when patterns do not match.
- Correctly handle whitespace in inline links and truncated destinations.
- Resolve multiple CommonMark compliance failures related to link reference definitions and general link rendering.

  Includes regression tests for percent-encoding and complex link structures.
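On the renderer side, the href normalization described above amounts to percent-encoding bytes outside a safe set. A simplified illustration — the actual change uses the percent-encoding crate, and this sketch (unlike cmark) does not preserve pre-existing `%XX` escapes:

```rust
/// Percent-encode an href the way CommonMark renderers escape output URLs
/// (simplified). ASCII alphanumerics and common URL punctuation pass
/// through; everything else — including each byte of multi-byte UTF-8
/// sequences — is emitted as %XX.
fn percent_encode_href(input: &str) -> String {
    const KEEP: &[u8] = b"-_.~!*'();:@&=+$,/?#[]";
    let mut out = String::with_capacity(input.len());
    for b in input.bytes() {
        if b.is_ascii_alphanumeric() || KEEP.contains(&b) {
            out.push(b as char);
        } else {
            out.push_str(&format!("%{:02X}", b));
        }
    }
    out
}
```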

diff --git a/Cargo.lock b/Cargo.lock
index bfc6c1c8c5..72c683148d 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -1295,6 +1295,7 @@ dependencies = [
  "biome_markdown_syntax",
  "biome_parser",
  "biome_rowan",
+ "biome_string_case",
  "biome_test_utils",
  "biome_unicode_table",
  "htmlize",
diff --git a/crates/biome_markdown_parser/Cargo.toml b/crates/biome_markdown_parser/Cargo.toml
index c3087a08e7..3f100686af 100644
--- a/crates/biome_markdown_parser/Cargo.toml
+++ b/crates/biome_markdown_parser/Cargo.toml
@@ -22,6 +22,7 @@ biome_markdown_factory = { workspace = true }
 biome_markdown_syntax  = { workspace = true }
 biome_parser           = { workspace = true }
 biome_rowan            = { workspace = true }
+biome_string_case      = { workspace = true }
 biome_unicode_table    = { workspace = true }
 # Optional dependency for test_utils feature (HTML rendering for spec tests)
 htmlize                = { version = "1.0.6", features = ["unescape"], optional = true }
diff --git a/crates/biome_markdown_parser/src/link_reference.rs b/crates/biome_markdown_parser/src/link_reference.rs
index 7edfb0b9db..9bd74f48f7 100644
--- a/crates/biome_markdown_parser/src/link_reference.rs
+++ b/crates/biome_markdown_parser/src/link_reference.rs
@@ -1,5 +1,7 @@
 use std::collections::HashSet;

+use biome_string_case::StrOnlyExtension;
+
 use biome_markdown_syntax::{MdLinkLabel, MdLinkReferenceDefinition};
 use biome_rowan::{AstNode, Direction};

@@ -29,7 +31,7 @@ pub(crate) fn normalize_reference_label(text: &str) -> String {
         push_normalized_char(&mut out, c, &mut saw_whitespace);
     }

-    out
+    out.as_str().to_lowercase_cow().to_uppercase()
 }

 fn push_normalized_char(out: &mut String, c: char, saw_whitespace: &mut bool) {
@@ -37,9 +39,7 @@ fn push_normalized_char(out: &mut String, c: char, saw_whitespace: &mut bool) {
         out.push(' ');
     }
     *saw_whitespace = false;
-    for lower in c.to_lowercase() {
-        out.push(lower);
-    }
+    out.push(c);
 }

 pub(crate) fn collect_link_reference_definitions(
diff --git a/crates/biome_markdown_parser/src/syntax.rs b/crates/biome_markdown_parser/src/syntax.rs
index 1c9bdac152..9ceac84a0f 100644
--- a/crates/biome_markdown_parser/src/syntax.rs
+++ b/crates/biome_markdown_parser/src/syntax.rs
@@ -60,6 +60,9 @@ use thematic_break_block::{at_thematic_break_block, parse_thematic_break_block};

 use crate::MarkdownParser;

+/// Maximum paren nesting allowed in link destinations per CommonMark.
+pub(crate) const MAX_LINK_DESTINATION_PAREN_DEPTH: i32 = 32;
+
 /// CommonMark requires 4 or more spaces for indented code blocks.
 const INDENT_CODE_BLOCK_SPACES: usize = 4;

@@ -71,6 +74,98 @@ pub(crate) fn parse_document(p: &mut MarkdownParser) {
     m.complete(p, MD_DOCUMENT);
 }

+/// Result of updating parenthesis depth when scanning link destinations.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub(crate) enum ParenDepthResult {
+    /// Depth updated successfully, contains new depth value
+    Ok(i32),
+    /// Depth would exceed the maximum (too many nested opening parens).
+    /// Per cmark, this truncates the destination at this point.
+    DepthExceeded,
+    /// Unmatched closing paren (would go below 0).
+    /// This typically means the `)` belongs to the enclosing construct.
+    UnmatchedClose,
+}
+
+pub(crate) fn try_update_paren_depth(text: &str, depth: i32, max: i32) -> ParenDepthResult {
+    let mut depth = depth;
+    let mut chars = text.chars().peekable();
+
+    while let Some(c) = chars.next() {
+        if c == '\\' && matches!(chars.peek(), Some('(' | ')')) {
+            chars.next();
+            continue;
+        }
+
+        if c == '(' {
+            if depth == max {
+                return ParenDepthResult::DepthExceeded;
+            }
+            depth += 1;
+        } else if c == ')' {
+            if depth == 0 {
+                return ParenDepthResult::UnmatchedClose;
+            }
+            depth -= 1;
+        }
+    }
+
+    ParenDepthResult::Ok(depth)
+}
+
+pub(crate) enum LinkDestinationKind {
+    Enclosed,
+    Raw,
+}
+
+pub(crate) fn validate_link_destination_text(
+    text: &str,
+    kind: LinkDestinationKind,
+    pending_escape: &mut bool,
+) -> bool {
+    for c in text.chars() {
+        if *pending_escape {
+            if c.is_ascii_punctuation() {
+                *pending_escape = false;
+                continue;
+            }
+            *pending_escape = false;
+        }
+
+        if c == '\\' {
+            *pending_escape = true;
+            continue;
+        }
+
+        if c.is_ascii_control() {
+            return false;
+        }
+
+        if matches!(kind, LinkDestinationKind::Enclosed) && c == '<' {
+            return false;
+        }
+    }
+
+    true
+}
+
+pub(crate) fn ends_with_unescaped_close(text: &str, close_char: char) -> bool {
+    if !text.ends_with(close_char) {
+        return false;
+    }
+
+    let mut backslashes = 0;
+    for c in text.chars().rev().skip(1) {
+        if c == '\\' {
+            backslashes += 1;
+        } else {
+            break;
+        }
+    }
+
+    backslashes % 2 == 0
+}
+
 pub(crate) fn parse_block_list(p: &mut MarkdownParser) -> ParsedSyntax {
     let mut list = DocumentBlockList;
     Present(list.parse_list(p))
diff --git a/crates/biome_markdown_parser/src/syntax/inline.rs b/crates/biome_markdown_parser/src/syntax/inline.rs
index 275a09ddb6..44f0eed223 100644
--- a/crates/biome_markdown_parser/src/syntax/inline.rs
+++ b/crates/biome_markdown_parser/src/syntax/inline.rs
@@ -611,15 +611,35 @@ pub(crate) fn parse_inline_italic(p: &mut MarkdownParser) -> ParsedSyntax {
     parse_emphasis_from_context(p, false)
 }

-fn parse_inline_item_list_until_no_links(p: &mut MarkdownParser, stop: MarkdownSyntaxKind) {
+fn parse_inline_item_list_until_no_links(p: &mut MarkdownParser, stop: MarkdownSyntaxKind) -> bool {
     let m = p.start();
     let prev_context = set_inline_emphasis_context_until(p, stop);
+    let mut bracket_depth = 0usize;
+    let mut has_nested_link = false;

     loop {
-        if p.at(stop) || p.at_inline_end() {
+        if p.at_inline_end() {
             break;
         }

+        if p.at(stop) {
+            if bracket_depth == 0 {
+                break;
+            }
+            bracket_depth = bracket_depth.saturating_sub(1);
+            let _ = super::parse_textual(p);
+            continue;
+        }
+
+        if p.at(L_BRACK) {
+            if !has_nested_link && nested_link_starts_here(p) {
+                has_nested_link = true;
+            }
+            bracket_depth += 1;
+            let _ = super::parse_textual(p);
+            continue;
+        }
+
         if parse_any_inline_no_links(p).is_absent() {
             break;
         }
@@ -627,13 +647,53 @@ fn parse_inline_item_list_until_no_links(p: &mut MarkdownParser, stop: MarkdownS

     m.complete(p, MD_INLINE_ITEM_LIST);
     p.set_emphasis_context(prev_context);
+    has_nested_link
+}
+
+fn nested_link_starts_here(p: &mut MarkdownParser) -> bool {
+    p.lookahead(|p| {
+        if !p.at(L_BRACK) {
+            return false;
+        }
+
+        p.bump(L_BRACK);
+        let mut depth = 0usize;
+
+        loop {
+            if p.at(EOF) || p.at_inline_end() {
+                return false;
+            }
+
+            if p.at(L_BRACK) {
+                depth += 1;
+                p.bump(L_BRACK);
+                continue;
+            }
+
+            if p.at(R_BRACK) {
+                if depth > 0 {
+                    depth -= 1;
+                    p.bump(R_BRACK);
+                    continue;
+                }
+                p.bump(R_BRACK);
+                return p.at(L_PAREN) || p.at(L_BRACK);
+            }
+
+            p.bump(p.cur());
+        }
+    })
 }

 fn parse_any_inline_no_links(p: &mut MarkdownParser) -> ParsedSyntax {
-    if (p.at(BANG) && p.nth_at(1, L_BRACK)) || p.at(L_BRACK) {
+    if p.at(L_BRACK) {
         return super::parse_textual(p);
     }

+    if p.at(BANG) && p.nth_at(1, L_BRACK) {
+        return parse_inline_image(p);
+    }
+
     parse_any_inline(p)
 }

@@ -826,10 +886,15 @@ fn parse_link_or_image(p: &mut MarkdownParser, kind: LinkParseKind) -> ParsedSyn
     kind.bump_opening(p);

     // Link text / alt text
-    parse_inline_item_list_until_no_links(p, R_BRACK);
+    let has_nested_link = parse_inline_item_list_until_no_links(p, R_BRACK);

     // ] - if missing at inline end, emit diagnostic; otherwise rewind
     if !p.eat(R_BRACK) {
+        if matches!(kind, LinkParseKind::Link) && has_nested_link {
+            m.abandon(p);
+            p.rewind(checkpoint);
+            return Absent;
+        }
         if p.at_inline_end() {
             // Unclosed link/image at end of inline content - emit diagnostic
             // Expand range to include the text content, not just the opening bracket
@@ -843,19 +908,50 @@ fn parse_link_or_image(p: &mut MarkdownParser, kind: LinkParseKind) -> ParsedSyn
         p.rewind(checkpoint);
         return Absent;
     }
+    let text_end_offset = p.cur_range().start();
+
+    if matches!(kind, LinkParseKind::Link) && has_nested_link {
+        m.abandon(p);
+        p.rewind(checkpoint);
+        return Absent;
+    }

     // Now decide based on what follows ]
-    if p.at(L_PAREN) {
+    let link_validation = if p.at(L_PAREN) {
+        inline_link_is_valid(p)
+    } else {
+        InlineLinkValidation::Invalid
+    };
+
+    if matches!(
+        link_validation,
+        InlineLinkValidation::Valid | InlineLinkValidation::DepthExceeded
+    ) {
         // Inline link/image: [text](url) or ![alt](url)
         // Bump past ( and lex the following tokens in LinkDefinition context
         // so whitespace separates destination and title.
         p.expect_with_context(L_PAREN, crate::lexer::MarkdownLexContext::LinkDefinition);

         let destination = p.start();
-        parse_inline_link_destination_tokens(p);
+        let destination_result = parse_inline_link_destination_tokens(p);
+
+        // When depth exceeded, destination is truncated but link is still valid.
+        // Complete the destination and link immediately without looking for closing paren.
+        if destination_result == DestinationScanResult::DepthExceeded {
+            destination.complete(p, MD_INLINE_ITEM_LIST);
+            return Present(m.complete(p, kind.inline_kind()));
+        }
+
         let has_title = inline_title_starts_after_whitespace_tokens(p);
-        while is_whitespace_token(p) {
-            bump_textual_link_def(p);
+        while is_title_separator_token(p) {
+            bump_link_def_separator(p);
+        }
+        if destination_result == DestinationScanResult::Invalid {
+            destination.abandon(p);
+            m.abandon(p);
+            p.rewind(checkpoint);
+            p.force_relex_regular();
+            return Absent;
         }
         destination.complete(p, MD_INLINE_ITEM_LIST);

@@ -867,8 +963,18 @@ fn parse_link_or_image(p: &mut MarkdownParser, kind: LinkParseKind) -> ParsedSyn
             title_m.complete(p, MD_LINK_TITLE);
         }

+        while is_title_separator_token(p) {
+            bump_link_def_separator(p);
+        }
+
         if !p.eat(R_PAREN) {
-            kind.report_unclosed_destination(p, opening_range);
+            if p.at_inline_end() {
+                kind.report_unclosed_destination(p, opening_range);
+            }
+            m.abandon(p);
+            p.rewind(checkpoint);
+            p.force_relex_regular();
+            return Absent;
         }

         Present(m.complete(p, kind.inline_kind()))
@@ -888,7 +994,7 @@ fn parse_link_or_image(p: &mut MarkdownParser, kind: LinkParseKind) -> ParsedSyn
         {
             m.abandon(p);
             p.rewind(checkpoint);
-            return consume_textual_until_offset(p, reference.end_offset);
+            return consume_textual_until_offset(p, text_end_offset);
         }

         Present(m.complete(p, kind.reference_kind()))
@@ -901,14 +1007,13 @@ fn parse_link_or_image(p: &mut MarkdownParser, kind: LinkParseKind) -> ParsedSyn
         {
             m.abandon(p);
             p.rewind(checkpoint);
-            return consume_textual_until_offset(p, reference.end_offset);
+            return consume_textual_until_offset(p, text_end_offset);
         }
         Present(m.complete(p, kind.reference_kind()))
     }
 }

 struct ReferenceLinkLookahead {
-    end_offset: TextSize,
     label_raw: String,
     is_shortcut: bool,
 }
@@ -947,7 +1052,6 @@ fn lookahead_reference_common(
         p.bump(L_BRACK);

         let link_text = collect_bracket_text(p)?;
-        let end_offset = p.cur_range().end();
         p.bump(R_BRACK);

         if p.at(L_PAREN) {
@@ -963,10 +1067,8 @@ fn lookahead_reference_common(
                 } else {
                     label_text
                 };
-                let end_offset = p.cur_range().end();
                 p.bump(R_BRACK);
                 return Some(ReferenceLinkLookahead {
-                    end_offset,
                     label_raw: label,
                     is_shortcut: false,
                 });
@@ -974,7 +1076,6 @@ fn lookahead_reference_common(
         }

         Some(ReferenceLinkLookahead {
-            end_offset,
             label_raw: link_text,
             is_shortcut: true,
         })
@@ -1025,48 +1126,296 @@ fn is_whitespace_token(p: &MarkdownParser) -> bool {

 fn inline_title_starts_after_whitespace_tokens(p: &mut MarkdownParser) -> bool {
     p.lookahead(|p| {
-        while is_whitespace_token(p) {
-            bump_textual_link_def(p);
+        let mut saw_whitespace = false;
+        while is_title_separator_token(p) {
+            bump_link_def_separator(p);
+            saw_whitespace = true;
         }
-        get_title_close_char(p).is_some()
+        saw_whitespace && get_title_close_char(p).is_some()
     })
 }

-fn parse_inline_link_destination_tokens(p: &mut MarkdownParser) {
+/// Result of validating an inline link.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum InlineLinkValidation {
+    /// Link is valid with complete destination
+    Valid,
+    /// Link is invalid
+    Invalid,
+    /// Link is valid but destination was truncated due to paren depth limit.
+    /// The link should be closed immediately without looking for `)`.
+    DepthExceeded,
+}
+
+fn inline_link_is_valid(p: &mut MarkdownParser) -> InlineLinkValidation {
+    p.lookahead(|p| {
+        if !p.at(L_PAREN) {
+            return InlineLinkValidation::Invalid;
+        }
+
+        p.bump(L_PAREN);
+        p.re_lex_link_definition();
+
+        let destination_result = scan_inline_link_destination_tokens(p);
+
+        // If depth exceeded, link is valid but truncated - no need to check for closing paren
+        if destination_result == DestinationScanResult::DepthExceeded {
+            return InlineLinkValidation::DepthExceeded;
+        }
+
+        if destination_result == DestinationScanResult::Invalid {
+            return InlineLinkValidation::Invalid;
+        }
+
+        let mut saw_separator = false;
+        while is_title_separator_token(p) {
+            skip_link_def_separator_tokens(p);
+            saw_separator = true;
+        }
+        let has_title = saw_separator && get_title_close_char(p).is_some();
+        while is_title_separator_token(p) {
+            skip_link_def_separator_tokens(p);
+        }
+
+        if has_title {
+            scan_title_content(p, get_title_close_char(p));
+        }
+
+        while is_title_separator_token(p) {
+            skip_link_def_separator_tokens(p);
+        }
+
+        if p.at(R_PAREN) {
+            InlineLinkValidation::Valid
+        } else {
+            InlineLinkValidation::Invalid
+        }
+    })
+}
+
+/// Result of scanning a link destination.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum DestinationScanResult {
+    /// Destination is valid and complete
+    Valid,
+    /// Destination is invalid (contains invalid characters, etc.)
+    Invalid,
+    /// Destination was truncated because paren depth exceeded the limit.
+    /// In this case, the link is considered valid but closed at the truncation point.
+    DepthExceeded,
+}
+
+fn scan_inline_link_destination_tokens(p: &mut MarkdownParser) -> DestinationScanResult {
+    const MAX_PAREN_DEPTH: i32 = super::MAX_LINK_DESTINATION_PAREN_DEPTH;
+    if p.at(L_ANGLE) {
+        p.bump_link_definition();
+        let mut pending_escape = false;
+        loop {
+            if p.at(EOF) || p.at(NEWLINE) {
+                return DestinationScanResult::Invalid;
+            }
+            if p.at(R_ANGLE) {
+                if pending_escape {
+                    if !super::validate_link_destination_text(
+                        p.cur_text(),
+                        super::LinkDestinationKind::Enclosed,
+                        &mut pending_escape,
+                    ) {
+                        return DestinationScanResult::Invalid;
+                    }
+                    p.bump_link_definition();
+                    continue;
+                }
+                p.bump_link_definition();
+                return DestinationScanResult::Valid;
+            }
+            if !super::validate_link_destination_text(
+                p.cur_text(),
+                super::LinkDestinationKind::Enclosed,
+                &mut pending_escape,
+            ) {
+                return DestinationScanResult::Invalid;
+            }
+            p.bump_link_definition();
+        }
+    }
+
+    let mut paren_depth: i32 = 0;
+    let mut pending_escape = false;
+    while !p.at(EOF) && !p.at(NEWLINE) {
+        if is_whitespace_token(p) {
+            break;
+        }
+        let text = p.cur_text();
+        if !super::validate_link_destination_text(
+            text,
+            super::LinkDestinationKind::Raw,
+            &mut pending_escape,
+        ) {
+            return DestinationScanResult::Invalid;
+        }
+        match super::try_update_paren_depth(text, paren_depth, MAX_PAREN_DEPTH) {
+            super::ParenDepthResult::Ok(next_depth) => {
+                paren_depth = next_depth;
+                p.bump_link_definition();
+            }
+            super::ParenDepthResult::DepthExceeded => {
+                // Paren depth exceeded - destination is truncated at this point.
+                // Per CommonMark/cmark, the link is still valid but closed here.
+                return DestinationScanResult::DepthExceeded;
+            }
+            super::ParenDepthResult::UnmatchedClose => {
+                // Unmatched closing paren - destination ends here normally.
+                // The `)` belongs to the enclosing construct (inline link closer).
+                break;
+            }
+        }
+    }
+    if p.at(EOF) {
+        return DestinationScanResult::Invalid;
+    }
+    if p.at(NEWLINE) {
+        return if p.at_blank_line() {
+            DestinationScanResult::Invalid
+        } else {
+            DestinationScanResult::Valid
+        };
+    }
+    DestinationScanResult::Valid
+}
+
+fn scan_title_content(p: &mut MarkdownParser, close_char: Option<char>) {
+    let Some(close_char) = close_char else {
+        return;
+    };
+
+    let text = p.cur_text();
+    let is_complete = text.len() >= 2 && super::ends_with_unescaped_close(text, close_char);
+
+    p.bump_link_definition();
+    if is_complete {
+        return;
+    }
+
+    loop {
+        if p.at(EOF) || p.at(NEWLINE) {
+            return;
+        }
+
+        let text = p.cur_text();
+        if super::ends_with_unescaped_close(text, close_char) {
+            p.bump_link_definition();
+            return;
+        }
+
+        p.bump_link_definition();
+    }
+}
+
+fn skip_link_def_separator_tokens(p: &mut MarkdownParser) {
+    if p.at(NEWLINE) {
+        p.bump(NEWLINE);
+    } else {
+        p.bump_link_definition();
+    }
+}
+
+fn is_title_separator_token(p: &MarkdownParser) -> bool {
+    is_whitespace_token(p) || (p.at(NEWLINE) && !p.at_blank_line())
+}
+
+fn bump_link_def_separator(p: &mut MarkdownParser) {
+    if p.at(NEWLINE) {
+        let item = p.start();
+        p.bump_remap(MD_TEXTUAL_LITERAL);
+        item.complete(p, MD_TEXTUAL);
+    } else {
+        bump_textual_link_def(p);
+    }
+}
+
+fn parse_inline_link_destination_tokens(p: &mut MarkdownParser) -> DestinationScanResult {
     p.re_lex_link_definition();
+    const MAX_PAREN_DEPTH: i32 = super::MAX_LINK_DESTINATION_PAREN_DEPTH;

     if p.at(L_ANGLE) {
         bump_textual_link_def(p);
-        while !p.at(EOF) && !p.at(NEWLINE) {
+        let mut pending_escape = false;
+        loop {
+            if p.at(EOF) || p.at(NEWLINE) {
+                return DestinationScanResult::Invalid;
+            }
             if p.at(R_ANGLE) {
+                if pending_escape {
+                    if !super::validate_link_destination_text(
+                        p.cur_text(),
+                        super::LinkDestinationKind::Enclosed,
+                        &mut pending_escape,
+                    ) {
+                        return DestinationScanResult::Invalid;
+                    }
+                    bump_textual_link_def(p);
+                    continue;
+                }
                 bump_textual_link_def(p);
-                break;
+                return DestinationScanResult::Valid;
             }
-            if is_whitespace_token(p) {
-                break;
+            if !super::validate_link_destination_text(
+                p.cur_text(),
+                super::LinkDestinationKind::Enclosed,
+                &mut pending_escape,
+            ) {
+                return DestinationScanResult::Invalid;
             }
             bump_textual_link_def(p);
         }
-        return;
     }

     let mut paren_depth: i32 = 0;
+    let mut pending_escape = false;
+    while is_title_separator_token(p) {
+        bump_link_def_separator(p);
+    }
     while !p.at(EOF) && !p.at(NEWLINE) {
         if is_whitespace_token(p) {
             break;
         }

-        if p.at(L_PAREN) {
-            paren_depth += 1;
-        } else if p.at(R_PAREN) {
-            if paren_depth == 0 {
+        let text = p.cur_text();
+        if !super::validate_link_destination_text(
+            text,
+            super::LinkDestinationKind::Raw,
+            &mut pending_escape,
+        ) {
+            return DestinationScanResult::Invalid;
+        }
+        match super::try_update_paren_depth(text, paren_depth, MAX_PAREN_DEPTH) {
+            super::ParenDepthResult::Ok(next_depth) => {
+                paren_depth = next_depth;
+                bump_textual_link_def(p);
+            }
+            super::ParenDepthResult::DepthExceeded => {
+                // Paren depth exceeded - destination is truncated at this point.
+                return DestinationScanResult::DepthExceeded;
+            }
+            super::ParenDepthResult::UnmatchedClose => {
+                // Unmatched closing paren - destination ends here normally.
+                // The `)` belongs to the enclosing construct (inline link closer).
                 break;
             }
-            paren_depth -= 1;
         }
-
-        bump_textual_link_def(p);
     }
+    if p.at(EOF) {
+        return DestinationScanResult::Invalid;
+    }
+    if p.at(NEWLINE) {
+        return if p.at_blank_line() {
+            DestinationScanResult::Invalid
+        } else {
+            DestinationScanResult::Valid
+        };
+    }
+    DestinationScanResult::Valid
 }

 fn get_title_close_char(p: &MarkdownParser) -> Option<char> {
@@ -1088,9 +1437,7 @@ fn parse_title_content(p: &mut MarkdownParser, close_char: Option<char>) {
     };

     let text = p.cur_text();
-    let is_complete = text.len() >= 2
-        && ((close_char == ')' && text.ends_with(')'))
-            || (close_char != ')' && text.ends_with(close_char)));
+    let is_complete = text.len() >= 2 && super::ends_with_unescaped_close(text, close_char);

     bump_textual_link_def(p);
     if is_complete {
@@ -1103,7 +1450,7 @@ fn parse_title_content(p: &mut MarkdownParser, close_char: Option<char>) {
         }

         let text = p.cur_text();
-        if text.ends_with(close_char) {
+        if super::ends_with_unescaped_close(text, close_char) {
             bump_textual_link_def(p);
             return;
         }
diff --git a/crates/biome_markdown_parser/src/syntax/link_block.rs b/crates/biome_markdown_parser/src/syntax/link_block.rs
index 86d9f57354..7422a055fb 100644
--- a/crates/biome_markdown_parser/src/syntax/link_block.rs
+++ b/crates/biome_markdown_parser/src/syntax/link_block.rs
@@ -184,15 +184,35 @@ fn skip_destination_tokens(p: &mut MarkdownParser) -> bool {
     if p.at(L_ANGLE) {
         // Angle-bracketed destination
         p.bump_link_definition();
+        let mut pending_escape = false;
         loop {
             if p.at(EOF) || p.at(NEWLINE) {
                 return false; // Unterminated angle bracket
             }
             if p.at(R_ANGLE) {
-                p.bump_link_definition();
-                // Consume separator whitespace into destination
-                skip_whitespace_tokens(p);
-                return true;
+                if pending_escape {
+                    if !crate::syntax::validate_link_destination_text(
+                        p.cur_text(),
+                        crate::syntax::LinkDestinationKind::Enclosed,
+                        &mut pending_escape,
+                    ) {
+                        return false;
+                    }
+                    p.bump_link_definition();
+                    continue;
+                } else {
+                    p.bump_link_definition();
+                    // Consume separator whitespace into destination
+                    skip_whitespace_tokens(p);
+                    return true;
+                }
+            }
+            if !crate::syntax::validate_link_destination_text(
+                p.cur_text(),
+                crate::syntax::LinkDestinationKind::Enclosed,
+                &mut pending_escape,
+            ) {
+                return false;
             }
             p.bump_link_definition();
         }
@@ -201,6 +221,7 @@ fn skip_destination_tokens(p: &mut MarkdownParser) -> bool {
         let mut paren_depth = 0i32;
         let mut has_content = false;
         let mut saw_separator = false;
+        let mut pending_escape = false;

         while !p.at(EOF) && !p.at(NEWLINE) {
             let text = p.cur_text();
@@ -217,19 +238,31 @@ fn skip_destination_tokens(p: &mut MarkdownParser) -> bool {
                 break;
             }

-            if p.at(L_PAREN) {
-                paren_depth += 1;
-            } else if p.at(R_PAREN) {
-                if paren_depth > 0 {
-                    paren_depth -= 1;
-                } else {
-                    break; // Unbalanced ) ends destination
-                }
+            if !crate::syntax::validate_link_destination_text(
+                text,
+                crate::syntax::LinkDestinationKind::Raw,
+                &mut pending_escape,
+            ) {
+                return false;
             }

-            has_content = true;
-            saw_separator = false;
-            p.bump_link_definition();
+            match crate::syntax::try_update_paren_depth(
+                text,
+                paren_depth,
+                crate::syntax::MAX_LINK_DESTINATION_PAREN_DEPTH,
+            ) {
+                crate::syntax::ParenDepthResult::Ok(next_depth) => {
+                    has_content = true;
+                    saw_separator = false;
+                    paren_depth = next_depth;
+                    p.bump_link_definition();
+                }
+                crate::syntax::ParenDepthResult::DepthExceeded
+                | crate::syntax::ParenDepthResult::UnmatchedClose => {
+                    // For link reference definitions, both cases end the destination
+                    break;
+                }
+            }
         }
         has_content
     }
@@ -249,17 +282,10 @@ fn skip_title_tokens(p: &mut MarkdownParser) -> bool {

     // Check if first token is complete (e.g., `"title"`)
     let first_text = p.cur_text();
-    if first_text.len() >= 2 {
-        let is_complete = if close_char == ')' {
-            first_text.ends_with(')')
-        } else {
-            first_text.ends_with(close_char)
-        };
-        if is_complete {
-            p.bump_link_definition();
-            skip_whitespace_tokens(p);
-            return p.at(EOF) || p.at(NEWLINE);
-        }
+    if first_text.len() >= 2 && crate::syntax::ends_with_unescaped_close(first_text, close_char) {
+        p.bump_link_definition();
+        skip_whitespace_tokens(p);
+        return p.at(EOF) || p.at(NEWLINE);
     }

     p.bump_link_definition();
@@ -271,11 +297,7 @@ fn skip_title_tokens(p: &mut MarkdownParser) -> bool {
         }

         // Check for closing delimiter
-        let is_close = if close_char == ')' {
-            p.at(R_PAREN)
-        } else {
-            p.cur_text().ends_with(close_char)
-        };
+        let is_close = crate::syntax::ends_with_unescaped_close(p.cur_text(), close_char);

         if is_close {
             p.bump_link_definition();
@@ -393,17 +415,21 @@ fn parse_link_destination(p: &mut MarkdownParser) {
                 break; // Bare destination stops at first whitespace
             }

-            if p.at(L_PAREN) {
-                paren_depth += 1;
-            } else if p.at(R_PAREN) {
-                if paren_depth > 0 {
-                    paren_depth -= 1;
-                } else {
-                    break; // Unbalanced ) ends bare destination
+            let text = p.cur_text();
+            match crate::syntax::try_update_paren_depth(
+                text,
+                paren_depth,
+                crate::syntax::MAX_LINK_DESTINATION_PAREN_DEPTH,
+            ) {
+                crate::syntax::ParenDepthResult::Ok(next_depth) => {
+                    paren_depth = next_depth;
+                    bump_textual_link_def(p);
+                }
+                crate::syntax::ParenDepthResult::DepthExceeded
+                | crate::syntax::ParenDepthResult::UnmatchedClose => {
+                    break;
                 }
             }
-
-            bump_textual_link_def(p);
         }
     }

diff --git a/crates/biome_markdown_parser/src/to_html.rs b/crates/biome_markdown_parser/src/to_html.rs
index f3dee25a4a..0512db55fe 100644
--- a/crates/biome_markdown_parser/src/to_html.rs
+++ b/crates/biome_markdown_parser/src/to_html.rs
@@ -447,10 +447,8 @@ fn render_paragraph(
     }
     // Trim both ends - leading whitespace can appear from parser including
     // the space after list markers in the paragraph content
-    let content = strip_paragraph_indent(
-        content
-            .trim_matches(|c| c == ' ' || c == '\n' || c == '\r')
-    );
+    let content =
+        strip_paragraph_indent(content.trim_matches(|c| c == ' ' || c == '\n' || c == '\r'));

     if in_tight_list {
         // In tight lists, paragraphs are rendered without <p> tags
@@ -1160,7 +1158,11 @@ where
 {
     if let Some(node) = label_node {
         let text = label_text(&node);
-        (text.clone(), Some(text))
+        if text.trim().is_empty() {
+            (fallback, None)
+        } else {
+            (text.clone(), Some(text))
+        }
     } else {
         (fallback, None)
     }
@@ -1602,4 +1604,64 @@ mod tests {
         // U+0000 should become replacement character
         assert_eq!(decode_entity("&#0;"), Some("\u{FFFD}".to_string()));
     }
+
+    #[test]
+    fn test_percent_encode_uri() {
+        let input = format!("https://a{}b.c/%20/%", '\u{1F44D}');
+        let encoded = percent_encode_uri(&input);
+        assert_eq!(encoded, "https://a%F0%9F%91%8Db.c/%20/%25");
+    }
+
+    #[test]
+    fn test_process_link_destination_decodes_entities() {
+        let encoded = process_link_destination("https://example.com/&lt;");
+        assert_eq!(encoded, "https://example.com/%3C");
+    }
+
+    #[test]
+    fn test_paren_depth_limit_in_destination() {
+        let dest = format!("x{}y{}", "(".repeat(32), ")".repeat(32));
+        let input = format!("[a]({dest})\n");
+        let parsed = parse_markdown(&input);
+        let html = document_to_html(
+            &parsed.tree(),
+            parsed.list_tightness(),
+            parsed.list_item_indents(),
+            parsed.quote_indents(),
+        );
+        let expected = format!("<p><a href=\"{dest}\">a</a></p>\n");
+        assert_eq!(html, expected);
+    }
+
+    #[test]
+    fn test_paren_depth_limit_exceeded_in_destination() {
+        let dest = format!("x{}y{}", "(".repeat(33), ")".repeat(33));
+        let input = format!("[a]({dest})\n");
+        let parsed = parse_markdown(&input);
+        let html = document_to_html(
+            &parsed.tree(),
+            parsed.list_tightness(),
+            parsed.list_item_indents(),
+            parsed.quote_indents(),
+        );
+        let expected_dest = format!("x{}", "(".repeat(32));
+        let trailing = ")".repeat(34);
+        let expected = format!("<p><a href=\"{expected_dest}\">a</a>(y{trailing}</p>\n");
+        assert_eq!(html, expected);
+    }
+
+    #[test]
+    fn test_title_with_escaped_closing_quote() {
+        let parsed = parse_markdown("[a](/url \"title with \\\" quote\")\n");
+        let html = document_to_html(
+            &parsed.tree(),
+            parsed.list_tightness(),
+            parsed.list_item_indents(),
+            parsed.quote_indents(),
+        );
+        assert_eq!(
+            html,
+            "<p><a href=\"/url\" title=\"title with &quot; quote\">a</a></p>\n"
+        );
+    }
 }
diff --git a/crates/biome_markdown_parser/tests/md_test_suite/ok/paren_depth_limit.md b/crates/biome_markdown_parser/tests/md_test_suite/ok/paren_depth_limit.md
new file mode 100644
index 0000000000..3cbf1f91d3
--- /dev/null
+++ b/crates/biome_markdown_parser/tests/md_test_suite/ok/paren_depth_limit.md
@@ -0,0 +1 @@
+[a](x((((((((((((((((((((((((((((((((y)))))))))))))))))))))))))))))))))
diff --git a/crates/biome_markdown_parser/tests/md_test_suite/ok/paren_depth_limit.md.snap b/crates/biome_markdown_parser/tests/md_test_suite/ok/paren_depth_limit.md.snap
new file mode 100644
index 0000000000..236bd2046f
--- /dev/null
+++ b/crates/biome_markdown_parser/tests/md_test_suite/ok/paren_depth_limit.md.snap
@@ -0,0 +1,399 @@
+---
+source: crates/biome_markdown_parser/tests/spec_test.rs
+expression: snapshot
+---
+## Input
+
+```
+[a](x((((((((((((((((((((((((((((((((y)))))))))))))))))))))))))))))))))
+
+```
+
+
+## AST
+
+```
+MdDocument {
+    bom_token: missing (optional),
+    value: MdBlockList [
+        MdParagraph {
+            list: MdInlineItemList [
+                MdInlineLink {
+                    l_brack_token: L_BRACK@0..1 "[" [] [],
+                    text: MdInlineItemList [
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@1..2 "a" [] [],
+                        },
+                    ],
+                    r_brack_token: R_BRACK@2..3 "]" [] [],
+                    l_paren_token: L_PAREN@3..4 "(" [] [],
+                    destination: MdInlineItemList [
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@4..5 "x" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@5..6 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@6..7 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@7..8 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@8..9 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@9..10 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@10..11 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@11..12 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@12..13 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@13..14 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@14..15 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@15..16 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@16..17 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@17..18 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@18..19 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@19..20 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@20..21 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@21..22 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@22..23 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@23..24 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@24..25 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@25..26 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@26..27 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@27..28 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@28..29 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@29..30 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@30..31 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@31..32 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@32..33 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@33..34 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@34..35 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@35..36 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@36..37 "(" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@37..38 "y" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@38..39 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@39..40 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@40..41 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@41..42 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@42..43 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@43..44 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@44..45 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@45..46 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@46..47 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@47..48 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@48..49 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@49..50 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@50..51 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@51..52 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@52..53 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@53..54 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@54..55 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@55..56 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@56..57 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@57..58 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@58..59 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@59..60 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@60..61 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@61..62 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@62..63 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@63..64 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@64..65 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@65..66 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@66..67 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@67..68 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@68..69 ")" [] [],
+                        },
+                        MdTextual {
+                            value_token: MD_TEXTUAL_LITERAL@69..70 ")" [] [],
+                        },
+                    ],
+                    title: missing (optional),
+                    r_paren_token: R_PAREN@70..71 ")" [] [],
+                },
+                MdTextual {
+                    value_token: MD_TEXTUAL_LITERAL@71..72 "\n" [] [],
+                },
+            ],
+            hard_line: missing (optional),
+        },
+    ],
+    eof_token: EOF@72..72 "" [] [],
+}
+```
+
+## CST
+
+```
+0: MD_DOCUMENT@0..72
+  0: (empty)
+  1: MD_BLOCK_LIST@0..72
+    0: MD_PARAGRAPH@0..72
+      0: MD_INLINE_ITEM_LIST@0..72
+        0: MD_INLINE_LINK@0..71
+          0: L_BRACK@0..1 "[" [] []
+          1: MD_INLINE_ITEM_LIST@1..2
+            0: MD_TEXTUAL@1..2
+              0: MD_TEXTUAL_LITERAL@1..2 "a" [] []
+          2: R_BRACK@2..3 "]" [] []
+          3: L_PAREN@3..4 "(" [] []
+          4: MD_INLINE_ITEM_LIST@4..70
+            0: MD_TEXTUAL@4..5
+              0: MD_TEXTUAL_LITERAL@4..5 "x" [] []
+            1: MD_TEXTUAL@5..6
+              0: MD_TEXTUAL_LITERAL@5..6 "(" [] []
+            2: MD_TEXTUAL@6..7
+              0: MD_TEXTUAL_LITERAL@6..7 "(" [] []
+            3: MD_TEXTUAL@7..8
+              0: MD_TEXTUAL_LITERAL@7..8 "(" [] []
+            4: MD_TEXTUAL@8..9
+              0: MD_TEXTUAL_LITERAL@8..9 "(" [] []
+            5: MD_TEXTUAL@9..10
+              0: MD_TEXTUAL_LITERAL@9..10 "(" [] []
+            6: MD_TEXTUAL@10..11
+              0: MD_TEXTUAL_LITERAL@10..11 "(" [] []
+            7: MD_TEXTUAL@11..12
+              0: MD_TEXTUAL_LITERAL@11..12 "(" [] []
+            8: MD_TEXTUAL@12..13
+              0: MD_TEXTUAL_LITERAL@12..13 "(" [] []
+            9: MD_TEXTUAL@13..14
+              0: MD_TEXTUAL_LITERAL@13..14 "(" [] []
+            10: MD_TEXTUAL@14..15
+              0: MD_TEXTUAL_LITERAL@14..15 "(" [] []
+            11: MD_TEXTUAL@15..16
+              0: MD_TEXTUAL_LITERAL@15..16 "(" [] []
+            12: MD_TEXTUAL@16..17
+              0: MD_TEXTUAL_LITERAL@16..17 "(" [] []
+            13: MD_TEXTUAL@17..18
+              0: MD_TEXTUAL_LITERAL@17..18 "(" [] []
+            14: MD_TEXTUAL@18..19
+              0: MD_TEXTUAL_LITERAL@18..19 "(" [] []
+            15: MD_TEXTUAL@19..20
+              0: MD_TEXTUAL_LITERAL@19..20 "(" [] []
+            16: MD_TEXTUAL@20..21
+              0: MD_TEXTUAL_LITERAL@20..21 "(" [] []
+            17: MD_TEXTUAL@21..22
+              0: MD_TEXTUAL_LITERAL@21..22 "(" [] []
+            18: MD_TEXTUAL@22..23
+              0: MD_TEXTUAL_LITERAL@22..23 "(" [] []
+            19: MD_TEXTUAL@23..24
+              0: MD_TEXTUAL_LITERAL@23..24 "(" [] []
+            20: MD_TEXTUAL@24..25
+              0: MD_TEXTUAL_LITERAL@24..25 "(" [] []
+            21: MD_TEXTUAL@25..26
+              0: MD_TEXTUAL_LITERAL@25..26 "(" [] []
+            22: MD_TEXTUAL@26..27
+              0: MD_TEXTUAL_LITERAL@26..27 "(" [] []
+            23: MD_TEXTUAL@27..28
+              0: MD_TEXTUAL_LITERAL@27..28 "(" [] []
+            24: MD_TEXTUAL@28..29
+              0: MD_TEXTUAL_LITERAL@28..29 "(" [] []
+            25: MD_TEXTUAL@29..30
+              0: MD_TEXTUAL_LITERAL@29..30 "(" [] []
+            26: MD_TEXTUAL@30..31
+              0: MD_TEXTUAL_LITERAL@30..31 "(" [] []
+            27: MD_TEXTUAL@31..32
+              0: MD_TEXTUAL_LITERAL@31..32 "(" [] []
+            28: MD_TEXTUAL@32..33
+              0: MD_TEXTUAL_LITERAL@32..33 "(" [] []
+            29: MD_TEXTUAL@33..34
+              0: MD_TEXTUAL_LITERAL@33..34 "(" [] []
+            30: MD_TEXTUAL@34..35
+              0: MD_TEXTUAL_LITERAL@34..35 "(" [] []
+            31: MD_TEXTUAL@35..36
+              0: MD_TEXTUAL_LITERAL@35..36 "(" [] []
+            32: MD_TEXTUAL@36..37
+              0: MD_TEXTUAL_LITERAL@36..37 "(" [] []
+            33: MD_TEXTUAL@37..38
+              0: MD_TEXTUAL_LITERAL@37..38 "y" [] []
+            34: MD_TEXTUAL@38..39
+              0: MD_TEXTUAL_LITERAL@38..39 ")" [] []
+            35: MD_TEXTUAL@39..40
+              0: MD_TEXTUAL_LITERAL@39..40 ")" [] []
+            36: MD_TEXTUAL@40..41
+              0: MD_TEXTUAL_LITERAL@40..41 ")" [] []
+            37: MD_TEXTUAL@41..42
+              0: MD_TEXTUAL_LITERAL@41..42 ")" [] []
+            38: MD_TEXTUAL@42..43
+              0: MD_TEXTUAL_LITERAL@42..43 ")" [] []
+            39: MD_TEXTUAL@43..44
+              0: MD_TEXTUAL_LITERAL@43..44 ")" [] []
+            40: MD_TEXTUAL@44..45
+              0: MD_TEXTUAL_LITERAL@44..45 ")" [] []
+            41: MD_TEXTUAL@45..46
+              0: MD_TEXTUAL_LITERAL@45..46 ")" [] []
+            42: MD_TEXTUAL@46..47
+              0: MD_TEXTUAL_LITERAL@46..47 ")" [] []
+            43: MD_TEXTUAL@47..48
+              0: MD_TEXTUAL_LITERAL@47..48 ")" [] []
+            44: MD_TEXTUAL@48..49
+              0: MD_TEXTUAL_LITERAL@48..49 ")" [] []
+            45: MD_TEXTUAL@49..50
+              0: MD_TEXTUAL_LITERAL@49..50 ")" [] []
+            46: MD_TEXTUAL@50..51
+              0: MD_TEXTUAL_LITERAL@50..51 ")" [] []
+            47: MD_TEXTUAL@51..52
+              0: MD_TEXTUAL_LITERAL@51..52 ")" [] []
+            48: MD_TEXTUAL@52..53
+              0: MD_TEXTUAL_LITERAL@52..53 ")" [] []
+            49: MD_TEXTUAL@53..54
+              0: MD_TEXTUAL_LITERAL@53..54 ")" [] []
+            50: MD_TEXTUAL@54..55
+              0: MD_TEXTUAL_LITERAL@54..55 ")" [] []
+            51: MD_TEXTUAL@55..56
+              0: MD_TEXTUAL_LITERAL@55..56 ")" [] []
+            52: MD_TEXTUAL@56..57
+              0: MD_TEXTUAL_LITERAL@56..57 ")" [] []
+            53: MD_TEXTUAL@57..58
+              0: MD_TEXTUAL_LITERAL@57..58 ")" [] []
+            54: MD_TEXTUAL@58..59
+              0: MD_TEXTUAL_LITERAL@58..59 ")" [] []
+            55: MD_TEXTUAL@59..60
+              0: MD_TEXTUAL_LITERAL@59..60 ")" [] []
+            56: MD_TEXTUAL@60..61
+              0: MD_TEXTUAL_LITERAL@60..61 ")" [] []
+            57: MD_TEXTUAL@61..62
+              0: MD_TEXTUAL_LITERAL@61..62 ")" [] []
+            58: MD_TEXTUAL@62..63
+              0: MD_TEXTUAL_LITERAL@62..63 ")" [] []
+            59: MD_TEXTUAL@63..64
+              0: MD_TEXTUAL_LITERAL@63..64 ")" [] []
+            60: MD_TEXTUAL@64..65
+              0: MD_TEXTUAL_LITERAL@64..65 ")" [] []
+            61: MD_TEXTUAL@65..66
+              0: MD_TEXTUAL_LITERAL@65..66 ")" [] []
+            62: MD_TEXTUAL@66..67
+              0: MD_TEXTUAL_LITERAL@66..67 ")" [] []
+            63: MD_TEXTUAL@67..68
+              0: MD_TEXTUAL_LITERAL@67..68 ")" [] []
+            64: MD_TEXTUAL@68..69
+              0: MD_TEXTUAL_LITERAL@68..69 ")" [] []
+            65: MD_TEXTUAL@69..70
+              0: MD_TEXTUAL_LITERAL@69..70 ")" [] []
+          5: (empty)
+          6: R_PAREN@70..71 ")" [] []
+        1: MD_TEXTUAL@71..72
+          0: MD_TEXTUAL_LITERAL@71..72 "\n" [] []
+      1: (empty)
+  2: EOF@72..72 "" [] []
+
+```

diff --git a/Cargo.lock b/Cargo.lock
index bfc6c1c8c5..72c683148d 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -1295,6 +1295,7 @@ dependencies = [
  "biome_markdown_syntax",
  "biome_parser",
  "biome_rowan",
+ "biome_string_case",
  "biome_test_utils",
  "biome_unicode_table",
  "htmlize",
diff --git a/crates/biome_markdown_parser/Cargo.toml b/crates/biome_markdown_parser/Cargo.toml
index c3087a08e7..3f100686af 100644
--- a/crates/biome_markdown_parser/Cargo.toml
+++ b/crates/biome_markdown_parser/Cargo.toml
@@ -22,6 +22,7 @@ biome_markdown_factory = { workspace = true }
 biome_markdown_syntax  = { workspace = true }
 biome_parser           = { workspace = true }
 biome_rowan            = { workspace = true }
+biome_string_case      = { workspace = true }
 biome_unicode_table    = { workspace = true }
 # Optional dependency for test_utils feature (HTML rendering for spec tests)
 htmlize                = { version = "1.0.6", features = ["unescape"], optional = true }
diff --git a/crates/biome_markdown_parser/src/link_reference.rs b/crates/biome_markdown_parser/src/link_reference.rs
index 7edfb0b9db..e67f38759a 100644
--- a/crates/biome_markdown_parser/src/link_reference.rs
+++ b/crates/biome_markdown_parser/src/link_reference.rs
@@ -1,5 +1,7 @@
 use std::collections::HashSet;

+use biome_string_case::StrOnlyExtension;
+
 use biome_markdown_syntax::{MdLinkLabel, MdLinkReferenceDefinition};
 use biome_rowan::{AstNode, Direction};

@@ -8,19 +10,20 @@ use crate::MarkdownParseOptions;
 use crate::parser::MarkdownParser;
 use crate::syntax::parse_document;

+/// Normalize a reference label per CommonMark spec.
+///
+/// Per CommonMark, label normalization involves:
+/// 1. Collapsing consecutive whitespace into a single space
+/// 2. Case-folding (case-insensitive matching)
+///
+/// IMPORTANT: Backslash escapes are NOT stripped during normalization.
+/// This means `[foo\!]` does NOT match `[foo!]` - the backslash is preserved.
+/// This matches cmark's reference implementation behavior.
 pub(crate) fn normalize_reference_label(text: &str) -> String {
     let mut out = String::new();
-    let mut chars = text.chars().peekable();
     let mut saw_whitespace = false;

-    while let Some(c) = chars.next() {
-        if c == '\\' {
-            if let Some(next) = chars.next() {
-                push_normalized_char(&mut out, next, &mut saw_whitespace);
-            }
-            continue;
-        }
-
+    for c in text.chars() {
         if c.is_whitespace() {
             saw_whitespace = true;
             continue;
@@ -29,7 +32,7 @@ pub(crate) fn normalize_reference_label(text: &str) -> String {
         push_normalized_char(&mut out, c, &mut saw_whitespace);
     }

-    out
+    // Lowercasing then uppercasing approximates full Unicode case folding
+    // (e.g. ß → SS), matching cmark's case-insensitive label matching.
+    out.as_str().to_lowercase_cow().to_uppercase()
 }

 fn push_normalized_char(out: &mut String, c: char, saw_whitespace: &mut bool) {
@@ -37,9 +40,7 @@ fn push_normalized_char(out: &mut String, c: char, saw_whitespace: &mut bool) {
         out.push(' ');
     }
     *saw_whitespace = false;
-    for lower in c.to_lowercase() {
-        out.push(lower);
-    }
+    out.push(c);
 }

 pub(crate) fn collect_link_reference_definitions(
diff --git a/crates/biome_markdown_parser/src/syntax.rs b/crates/biome_markdown_parser/src/syntax.rs
index 1c9bdac152..9ceac84a0f 100644
--- a/crates/biome_markdown_parser/src/syntax.rs
+++ b/crates/biome_markdown_parser/src/syntax.rs
@@ -60,6 +60,9 @@ use thematic_break_block::{at_thematic_break_block, parse_thematic_break_block};

 use crate::MarkdownParser;

+/// Maximum parenthesis nesting allowed in link destinations. The CommonMark
+/// spec sets no formal limit; this matches the cmark reference
+/// implementation, which truncates destinations past 32 nested parens.
+pub(crate) const MAX_LINK_DESTINATION_PAREN_DEPTH: i32 = 32;
+
 /// CommonMark requires 4 or more spaces for indented code blocks.
 const INDENT_CODE_BLOCK_SPACES: usize = 4;

@@ -71,6 +74,98 @@ pub(crate) fn parse_document(p: &mut MarkdownParser) {
     m.complete(p, MD_DOCUMENT);
 }

+/// Result of updating parenthesis depth when scanning link destinations.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub(crate) enum ParenDepthResult {
+    /// Depth updated successfully, contains new depth value
+    Ok(i32),
+    /// Depth would exceed the maximum (too many nested opening parens).
+    /// Per cmark, this truncates the destination at this point.
+    DepthExceeded,
+    /// Unmatched closing paren (would go below 0).
+    /// This typically means the `)` belongs to the enclosing construct.
+    UnmatchedClose,
+}
+
+/// Scan `text`, updating the running parenthesis depth. Backslash-escaped
+/// `(` and `)` are skipped and do not affect the depth.
+pub(crate) fn try_update_paren_depth(text: &str, depth: i32, max: i32) -> ParenDepthResult {
+    let mut depth = depth;
+    let mut chars = text.chars().peekable();
+
+    while let Some(c) = chars.next() {
+        if c == '\\' && matches!(chars.peek(), Some('(' | ')')) {
+            chars.next();
+            continue;
+        }
+
+        if c == '(' {
+            if depth == max {
+                return ParenDepthResult::DepthExceeded;
+            }
+            depth += 1;
+        } else if c == ')' {
+            if depth == 0 {
+                return ParenDepthResult::UnmatchedClose;
+            }
+            depth -= 1;
+        }
+    }
+
+    ParenDepthResult::Ok(depth)
+}
+
+/// Whether a link destination is `<...>`-enclosed or bare.
+pub(crate) enum LinkDestinationKind {
+    Enclosed,
+    Raw,
+}
+
+/// Validate one token of a link destination: reject ASCII control
+/// characters and, for enclosed destinations, an unescaped `<`.
+/// `pending_escape` carries backslash state across token boundaries.
+pub(crate) fn validate_link_destination_text(
+    text: &str,
+    kind: LinkDestinationKind,
+    pending_escape: &mut bool,
+) -> bool {
+    for c in text.chars() {
+        if *pending_escape {
+            if c.is_ascii_punctuation() {
+                *pending_escape = false;
+                continue;
+            }
+            *pending_escape = false;
+        }
+
+        if c == '\\' {
+            *pending_escape = true;
+            continue;
+        }
+
+        if c.is_ascii_control() {
+            return false;
+        }
+
+        if matches!(kind, LinkDestinationKind::Enclosed) && c == '<' {
+            return false;
+        }
+    }
+
+    true
+}
+
+/// Returns `true` when `text` ends with `close_char` preceded by an even
+/// number of backslashes, i.e. the closing delimiter is not escaped.
+pub(crate) fn ends_with_unescaped_close(text: &str, close_char: char) -> bool {
+    if !text.ends_with(close_char) {
+        return false;
+    }
+
+    let mut backslashes = 0;
+    for c in text.chars().rev().skip(1) {
+        if c == '\\' {
+            backslashes += 1;
+        } else {
+            break;
+        }
+    }
+
+    backslashes % 2 == 0
+}
+
 pub(crate) fn parse_block_list(p: &mut MarkdownParser) -> ParsedSyntax {
     let mut list = DocumentBlockList;
     Present(list.parse_list(p))
diff --git a/crates/biome_markdown_parser/src/syntax/inline.rs b/crates/biome_markdown_parser/src/syntax/inline.rs
index 275a09ddb6..f336b37b33 100644
--- a/crates/biome_markdown_parser/src/syntax/inline.rs
+++ b/crates/biome_markdown_parser/src/syntax/inline.rs
@@ -611,15 +611,46 @@ pub(crate) fn parse_inline_italic(p: &mut MarkdownParser) -> ParsedSyntax {
     parse_emphasis_from_context(p, false)
 }

-fn parse_inline_item_list_until_no_links(p: &mut MarkdownParser, stop: MarkdownSyntaxKind) {
+fn parse_inline_item_list_until_no_links(p: &mut MarkdownParser, stop: MarkdownSyntaxKind) -> bool {
     let m = p.start();
     let prev_context = set_inline_emphasis_context_until(p, stop);
+    let mut bracket_depth = 0usize;
+    let mut has_nested_link = false;

     loop {
-        if p.at(stop) || p.at_inline_end() {
+        // Per CommonMark, link text can span lines, but blank lines end the link.
+        // Check for blank line (NEWLINE followed by NEWLINE or EOF after optional whitespace)
+        if p.at(NEWLINE) {
+            if p.at_blank_line() {
+                break; // Blank line ends link text
+            }
+            // Single newline inside link text - consume and continue
+            let _ = super::parse_textual(p);
+            continue;
+        }
+
+        if p.at(T![EOF]) {
             break;
         }

+        if p.at(stop) {
+            if bracket_depth == 0 {
+                break;
+            }
+            bracket_depth = bracket_depth.saturating_sub(1);
+            let _ = super::parse_textual(p);
+            continue;
+        }
+
+        if p.at(L_BRACK) {
+            if !has_nested_link && nested_link_starts_here(p) {
+                has_nested_link = true;
+            }
+            bracket_depth += 1;
+            let _ = super::parse_textual(p);
+            continue;
+        }
+
         if parse_any_inline_no_links(p).is_absent() {
             break;
         }
@@ -627,13 +658,53 @@ fn parse_inline_item_list_until_no_links(p: &mut MarkdownParser, stop: MarkdownS

     m.complete(p, MD_INLINE_ITEM_LIST);
     p.set_emphasis_context(prev_context);
+    has_nested_link
+}
+
+/// Lookahead: does a complete inner link (`[text](` or `[text][`) begin at
+/// the current `[`? Used to reject the outer link, since links cannot nest.
+fn nested_link_starts_here(p: &mut MarkdownParser) -> bool {
+    p.lookahead(|p| {
+        if !p.at(L_BRACK) {
+            return false;
+        }
+
+        p.bump(L_BRACK);
+        let mut depth = 0usize;
+
+        loop {
+            if p.at(EOF) || p.at_inline_end() {
+                return false;
+            }
+
+            if p.at(L_BRACK) {
+                depth += 1;
+                p.bump(L_BRACK);
+                continue;
+            }
+
+            if p.at(R_BRACK) {
+                if depth > 0 {
+                    depth -= 1;
+                    p.bump(R_BRACK);
+                    continue;
+                }
+                p.bump(R_BRACK);
+                return p.at(L_PAREN) || p.at(L_BRACK);
+            }
+
+            p.bump(p.cur());
+        }
+    })
 }

 fn parse_any_inline_no_links(p: &mut MarkdownParser) -> ParsedSyntax {
-    if (p.at(BANG) && p.nth_at(1, L_BRACK)) || p.at(L_BRACK) {
+    if p.at(L_BRACK) {
         return super::parse_textual(p);
     }

+    if p.at(BANG) && p.nth_at(1, L_BRACK) {
+        return parse_inline_image(p);
+    }
+
     parse_any_inline(p)
 }

@@ -826,10 +897,15 @@ fn parse_link_or_image(p: &mut MarkdownParser, kind: LinkParseKind) -> ParsedSyn
     kind.bump_opening(p);

     // Link text / alt text
-    parse_inline_item_list_until_no_links(p, R_BRACK);
+    let has_nested_link = parse_inline_item_list_until_no_links(p, R_BRACK);

     // ] - if missing at inline end, emit diagnostic; otherwise rewind
     if !p.eat(R_BRACK) {
+        if matches!(kind, LinkParseKind::Link) && has_nested_link {
+            m.abandon(p);
+            p.rewind(checkpoint);
+            return Absent;
+        }
         if p.at_inline_end() {
             // Unclosed link/image at end of inline content - emit diagnostic
             // Expand range to include the text content, not just the opening bracket
@@ -843,19 +919,50 @@ fn parse_link_or_image(p: &mut MarkdownParser, kind: LinkParseKind) -> ParsedSyn
         p.rewind(checkpoint);
         return Absent;
     }
+    let text_end_offset = p.cur_range().start();
+
+    if matches!(kind, LinkParseKind::Link) && has_nested_link {
+        m.abandon(p);
+        p.rewind(checkpoint);
+        return Absent;
+    }

     // Now decide based on what follows ]
-    if p.at(L_PAREN) {
+    let link_validation = if p.at(L_PAREN) {
+        inline_link_is_valid(p)
+    } else {
+        InlineLinkValidation::Invalid
+    };
+
+    if matches!(
+        link_validation,
+        InlineLinkValidation::Valid | InlineLinkValidation::DepthExceeded
+    ) {
         // Inline link/image: [text](url) or ![alt](url)
         // Bump past ( and lex the following tokens in LinkDefinition context
         // so whitespace separates destination and title.
         p.expect_with_context(L_PAREN, crate::lexer::MarkdownLexContext::LinkDefinition);

         let destination = p.start();
-        parse_inline_link_destination_tokens(p);
+        let destination_result = parse_inline_link_destination_tokens(p);
+
+        // When depth exceeded, destination is truncated but link is still valid.
+        // Complete the destination and link immediately without looking for closing paren.
+        if destination_result == DestinationScanResult::DepthExceeded {
+            destination.complete(p, MD_INLINE_ITEM_LIST);
+            return Present(m.complete(p, kind.inline_kind()));
+        }
+
         let has_title = inline_title_starts_after_whitespace_tokens(p);
-        while is_whitespace_token(p) {
-            bump_textual_link_def(p);
+        while is_title_separator_token(p) {
+            bump_link_def_separator(p);
+        }
+        if destination_result == DestinationScanResult::Invalid {
+            destination.abandon(p);
+            m.abandon(p);
+            p.rewind(checkpoint);
+            p.force_relex_regular();
+            return Absent;
         }
         destination.complete(p, MD_INLINE_ITEM_LIST);

@@ -867,8 +974,20 @@ fn parse_link_or_image(p: &mut MarkdownParser, kind: LinkParseKind) -> ParsedSyn
             title_m.complete(p, MD_LINK_TITLE);
         }

+        // Skip trailing whitespace/newlines before closing …
Refines the parsing logic for inline elements, focusing on emphasis delimiters, code spans, and line breaks.

Changes include:
- Refactor emphasis matching to strictly follow the "rule of 3" (left-flanking/right-flanking) logic.
- Allow emphasis and code spans to span multiple lines correctly.
- Treat unmatched code span openers strictly as literal text.
- Introduce a shared CommonMark punctuation table for consistent character handling.
- Fix edge cases for hard line breaks and whitespace handling within inline elements.
- Address specific CommonMark examples (518–603, 636–646) related to inline nesting and precedence.
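
The "rule of 3" is the subtlest of these changes, so here is a minimal, self-contained sketch of the check (the function name and signature are illustrative, not the parser's actual API): a match between an opener run and a closer run is rejected when either run could serve both roles, their combined length is a multiple of 3, and the two lengths are not both multiples of 3.

```rust
/// Illustrative sketch of CommonMark's "rule of 3" for emphasis matching.
/// Returns true if the opener/closer pair is allowed to match.
fn rule_of_three_allows(
    opener_len: usize,
    closer_len: usize,
    opener_can_close: bool,
    closer_can_open: bool,
) -> bool {
    // Reject when one delimiter run can both open and close, the combined
    // length is a multiple of 3, and the lengths are not both multiples of 3.
    if (opener_can_close || closer_can_open)
        && (opener_len + closer_len) % 3 == 0
        && !(opener_len % 3 == 0 && closer_len % 3 == 0)
    {
        return false;
    }
    true
}

fn main() {
    // `*foo**bar*`: 1 + 2 = 3 and the `**` run can also open, so the
    // inner match is rejected and the `**` stays literal.
    assert!(!rule_of_three_allows(1, 2, false, true));
    // 3 + 3 = 6, but both lengths are multiples of 3, so the match stands.
    assert!(rule_of_three_allows(3, 3, true, true));
    // Plain `*foo*`: 1 + 1 = 2 is not a multiple of 3, so it matches.
    assert!(rule_of_three_allows(1, 1, false, false));
}
```

This is why, per the CommonMark spec, `*foo**bar*` parses as `<em>foo**bar</em>` rather than nesting the delimiters.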

diff --git a/crates/biome_markdown_parser/src/lexer/mod.rs b/crates/biome_markdown_parser/src/lexer/mod.rs
index ec5a9ae0f2..8122d861de 100644
--- a/crates/biome_markdown_parser/src/lexer/mod.rs
+++ b/crates/biome_markdown_parser/src/lexer/mod.rs
@@ -20,6 +20,8 @@ use biome_unicode_table::lookup_byte;
 /// - `FencedCodeBlock`: Inside fenced code block, no markdown parsing
 /// - `HtmlBlock`: Inside HTML block, minimal markdown parsing
 /// - `LinkDefinition`: Inside link reference definition, whitespace separates tokens
+/// - `CodeSpan`: Inside inline code span, backslashes are literal (no escapes)
+/// - `EmphasisInline`: Emit single STAR/UNDERSCORE tokens for partial delimiter consumption
 #[derive(Debug, Copy, Clone, Eq, PartialEq, Default)]
 pub enum MarkdownLexContext {
     /// Normal markdown parsing with full inline element detection.
@@ -39,6 +41,16 @@ pub enum MarkdownLexContext {
     /// In this context, whitespace is significant and separates destination from title.
     /// Text tokens stop at whitespace to allow proper parsing.
     LinkDefinition,
+    /// Inside an inline code span.
+    /// Per CommonMark §6.1, backslash escapes are not processed inside code spans.
+    /// Backslash is treated as a literal character, not an escape.
+    CodeSpan,
+    /// Inside emphasis delimiter processing.
+    /// In this context, `*` and `_` are always emitted as single-character tokens
+    /// (STAR, UNDERSCORE) rather than double tokens (DOUBLE_STAR, DOUBLE_UNDERSCORE).
+    /// This allows partial consumption of delimiter runs when the match algorithm
+    /// determines only 1 char should be used from a 2-char run.
+    EmphasisInline,
 }

 impl LexContext for MarkdownLexContext {
@@ -57,6 +69,10 @@ pub enum MarkdownReLexContext {
     Regular,
     /// Re-lex for link definition context where whitespace is significant.
     LinkDefinition,
+    /// Re-lex for emphasis inline context where `*` and `_` emit single tokens.
+    /// Used when the emphasis matching algorithm needs to partially consume
+    /// a DOUBLE_STAR or DOUBLE_UNDERSCORE token.
+    EmphasisInline,
 }

 /// An extremely fast, lookup table based, lossless Markdown lexer
@@ -230,9 +246,14 @@ impl<'src> MarkdownLexer<'src> {
             // - In middle of line: whitespace is just text content, include in textual token
             // - Exception: 2+ spaces before newline is a hard line break
             // - In LinkDefinition context: whitespace is always significant (separates destination from title)
+            // - In CodeSpan context: whitespace is literal content, no hard-line-break detection
             WHS => {
                 if current == b'\n' || current == b'\r' {
                     self.consume_newline()
+                } else if matches!(context, MarkdownLexContext::CodeSpan) {
+                    // In code span context, whitespace is literal content.
+                    // No hard-line-break detection - the renderer normalizes line endings to spaces.
+                    self.consume_textual(context)
                 } else if matches!(context, MarkdownLexContext::LinkDefinition) {
                     // In link definition context, whitespace separates tokens.
                     // We consume it as textual literal so it's not treated as trivia by the parser.
@@ -267,7 +288,15 @@ impl<'src> MarkdownLexer<'src> {
             PNC => self.consume_byte(R_PAREN),
             COL => self.consume_byte(COLON),
             AMP => self.consume_entity_or_textual(context),
-            BSL => self.consume_escape(),
+            BSL => {
+                // Per CommonMark §6.1, backslash escapes are NOT processed inside code spans.
+                // Backslash is literal, so `\`` produces a literal backslash followed by backtick.
+                if matches!(context, MarkdownLexContext::CodeSpan) {
+                    self.consume_textual(context)
+                } else {
+                    self.consume_escape()
+                }
+            }
             // = at line start could be setext heading underline
             EQL if self.after_newline => self.consume_setext_underline_or_textual(),
             _ => {
@@ -753,6 +782,19 @@ impl<'src> MarkdownLexer<'src> {
         // Not a thematic break - restore position and consume as emphasis marker
         self.position = start_position;

+        // In EmphasisInline context, always emit single tokens for * and _.
+        // This allows partial consumption of delimiter runs when the match algorithm
+        // determines only 1 char should be used from a 2-char run.
+        if matches!(context, MarkdownLexContext::EmphasisInline) {
+            self.advance(1);
+            return match start_char {
+                b'*' => STAR,
+                b'_' => UNDERSCORE,
+                b'-' => MINUS,
+                _ => unreachable!(),
+            };
+        }
+
         // Check for double emphasis markers (**, __)
         // Note: -- is not valid markdown emphasis, so we don't check for it
         if start_char != b'-' && self.peek_byte() == Some(start_char) {
@@ -1200,6 +1242,7 @@ impl<'src> ReLexer<'src> for MarkdownLexer<'src> {
         let lex_context = match context {
             MarkdownReLexContext::Regular => MarkdownLexContext::Regular,
             MarkdownReLexContext::LinkDefinition => MarkdownLexContext::LinkDefinition,
+            MarkdownReLexContext::EmphasisInline => MarkdownLexContext::EmphasisInline,
         };

         let re_lexed_kind = match self.current_byte() {
diff --git a/crates/biome_markdown_parser/src/parser.rs b/crates/biome_markdown_parser/src/parser.rs
index 2ac966faa0..877750fa61 100644
--- a/crates/biome_markdown_parser/src/parser.rs
+++ b/crates/biome_markdown_parser/src/parser.rs
@@ -208,6 +208,28 @@ impl<'source> MarkdownParser<'source> {
             .force_relex_in_context(crate::lexer::MarkdownLexContext::Regular);
     }

+    /// Force re-lex the current token in CodeSpan context.
+    /// In this context, backslash is literal (not an escape character).
+    /// Used for autolinks where `\>` should be `\` + `>` as separate tokens.
+    pub(crate) fn force_relex_code_span(&mut self) {
+        self.source
+            .force_relex_in_context(crate::lexer::MarkdownLexContext::CodeSpan);
+    }
+
+    /// Re-lex the current token as single-char emphasis delimiter.
+    ///
+    /// Use this when the emphasis matching algorithm needs to partially consume
+    /// a DOUBLE_STAR or DOUBLE_UNDERSCORE token. After re-lexing, the token will
+    /// be either STAR or UNDERSCORE (single char).
+    ///
+    /// # Safety
+    /// Only call on the current token, NOT inside lookahead closures.
+    /// This invalidates any buffered lookahead, so ensure no lookahead is active.
+    pub(crate) fn force_relex_emphasis_inline(&mut self) -> MarkdownSyntaxKind {
+        self.source
+            .re_lex(crate::lexer::MarkdownReLexContext::EmphasisInline)
+    }
+
     pub(crate) fn set_force_ordered_list_marker(&mut self, value: bool) {
         self.source.set_force_ordered_list_marker(value);
     }
@@ -218,6 +240,7 @@ impl<'source> MarkdownParser<'source> {
         self.source.bump_link_definition();
     }

+
     pub fn checkpoint(&self) -> MarkdownParserCheckpoint {
         MarkdownParserCheckpoint {
             context: self.context.checkpoint(),
diff --git a/crates/biome_markdown_parser/src/syntax/inline.rs b/crates/biome_markdown_parser/src/syntax/inline.rs
index f336b37b33..87d649e360 100644
--- a/crates/biome_markdown_parser/src/syntax/inline.rs
+++ b/crates/biome_markdown_parser/src/syntax/inline.rs
@@ -41,8 +41,9 @@ use biome_markdown_syntax::T;
 use biome_markdown_syntax::kind::MarkdownSyntaxKind::*;
 use biome_parser::Parser;
 use biome_parser::prelude::ParsedSyntax::{self, *};
+use biome_unicode_table::is_unicode_punctuation;

-use biome_rowan::{TextRange, TextSize};
+use biome_rowan::TextRange;

 use crate::MarkdownParser;
 use crate::link_reference::normalize_reference_label;
@@ -71,6 +72,10 @@ struct DelimRun {
     can_close: bool,
     /// Byte offset in the source where this run starts
     start_offset: usize,
+    /// Bracket nesting depth for scoping emphasis within link text.
+    /// Delimiters inside brackets (links) should only match with each other,
+    /// not with delimiters outside the brackets. 0 = outside brackets.
+    label_id: usize,
 }

 /// A matched emphasis span (opener + closer)
@@ -89,45 +94,14 @@ fn is_whitespace(c: char) -> bool {
     c.is_whitespace()
 }

+fn is_emphasis_marker(c: char) -> bool {
+    matches!(c, '*' | '_')
+}
+
 /// Check if a character is Unicode punctuation for flanking rules.
 /// Per CommonMark spec, this includes ASCII punctuation and Unicode punctuation categories.
 fn is_punctuation(c: char) -> bool {
-    // ASCII punctuation + Unicode punctuation categories
-    matches!(
-        c,
-        '!' | '"'
-            | '#'
-            | '$'
-            | '%'
-            | '&'
-            | '\''
-            | '('
-            | ')'
-            | '*'
-            | '+'
-            | ','
-            | '-'
-            | '.'
-            | '/'
-            | ':'
-            | ';'
-            | '<'
-            | '='
-            | '>'
-            | '?'
-            | '@'
-            | '['
-            | '\\'
-            | ']'
-            | '^'
-            | '_'
-            | '`'
-            | '{'
-            | '|'
-            | '}'
-            | '~'
-    ) || c.is_ascii_punctuation()
-        || matches!(c, '\u{2000}'..='\u{206F}' | '\u{2E00}'..='\u{2E7F}')
+    is_unicode_punctuation(c)
 }

 /// Check if an opening delimiter is left-flanking per CommonMark rules.
@@ -138,6 +112,7 @@ fn is_left_flanking_delimiter(char_after: Option<char>, char_before: Option<char
     match char_after {
         None => false,                        // At end of input, can't be left-flanking
         Some(c) if is_whitespace(c) => false, // Followed by whitespace
+        Some(c) if is_emphasis_marker(c) => true,
         Some(c) if is_punctuation(c) => {
             // Followed by punctuation - only left-flanking if preceded by whitespace or punctuation
             match char_before {
@@ -157,6 +132,7 @@ fn is_right_flanking_delimiter(char_before: Option<char>, char_after: Option<cha
     match char_before {
         None => false,                        // At start of input, can't be right-flanking
         Some(c) if is_whitespace(c) => false, // Preceded by whitespace
+        Some(c) if is_emphasis_marker(c) => true,
         Some(c) if is_punctuation(c) => {
             // Preceded by punctuation - only right-flanking if followed by whitespace or punctuation
             match char_after {
@@ -209,14 +185,147 @@ fn can_underscore_close(char_before: Option<char>, char_after: Option<char>) ->
-/// This is the first pass of the CommonMark emphasis algorithm. It scans
-/// the source text and identifies all potential delimiter runs (sequences
-/// of `*` or `_`), computing their flanking status.
-fn collect_delimiter_runs(source: &str) -> Vec<DelimRun> {
+/// Result of checking if a bracket forms a valid link.
+/// Contains the closing bracket position if found.
+struct BracketCheckResult {
+    /// Position of the closing `]`
+    close_pos: usize,
+    /// Whether this is a valid inline link `[...](` or full reference `[...][`
+    is_inline_or_full_ref: bool,
+}
+
+/// Check if a bracket at position `start` forms a valid link pattern.
+/// Returns the closing bracket position and whether it's an inline link or full reference.
+fn check_bracket_pattern(bytes: &[u8], start: usize) -> Option<BracketCheckResult> {
+    if start >= bytes.len() || bytes[start] != b'[' {
+        return None;
+    }
+
+    // Find matching ] with proper nesting
+    let mut depth = 1;
+    let mut i = start + 1;
+    while i < bytes.len() && depth > 0 {
+        match bytes[i] {
+            b'[' => depth += 1,
+            b']' => depth -= 1,
+            b'\\' if i + 1 < bytes.len() => i += 1, // Skip escaped char
+            b'`' => {
+                // Skip code spans
+                let backtick_count = {
+                    let mut c = 1;
+                    while i + c < bytes.len() && bytes[i + c] == b'`' {
+                        c += 1;
+                    }
+                    c
+                };
+                i += backtick_count;
+                while i < bytes.len() {
+                    if bytes[i] == b'`' {
+                        let close_count = {
+                            let mut c = 1;
+                            while i + c < bytes.len() && bytes[i + c] == b'`' {
+                                c += 1;
+                            }
+                            c
+                        };
+                        i += close_count;
+                        if close_count == backtick_count {
+                            break;
+                        }
+                    } else {
+                        i += 1;
+                    }
+                }
+                continue;
+            }
+            b'<' => {
+                // Skip potential HTML/autolinks
+                i += 1;
+                while i < bytes.len() && bytes[i] != b'>' && bytes[i] != b'\n' {
+                    i += 1;
+                }
+                if i < bytes.len() && bytes[i] == b'>' {
+                    i += 1;
+                }
+                continue;
+            }
+            _ => {}
+        }
+        i += 1;
+    }
+
+    if depth != 0 {
+        return None;
+    }
+
+    // i now points to position after `]`
+    let close_pos = i - 1;
+    let is_inline_or_full_ref = i < bytes.len() && (bytes[i] == b'(' || bytes[i] == b'[');
+
+    Some(BracketCheckResult {
+        close_pos,
+        is_inline_or_full_ref,
+    })
+}
+
+/// Extract label text from a bracket pattern for reference lookup.
+fn extract_label_text(source: &str, start: usize, close_pos: usize) -> &str {
+    if start < close_pos && close_pos <= source.len() {
+        &source[start + 1..close_pos]
+    } else {
+        ""
+    }
+}
+
+/// Collect all potential delimiter runs in the source text.
+///
+/// This is the first pass of the CommonMark emphasis algorithm. It scans
+/// the source text and identifies all delimiter runs (sequences of `*` or
+/// `_`), computing their flanking status. The `reference_checker` callback
+/// reports whether a normalized label has a matching reference definition.
+fn collect_delimiter_runs(source: &str, reference_checker: impl Fn(&str) -> bool) -> Vec<DelimRun> {
     let mut runs = Vec::new();
     let bytes = source.as_bytes();
     let mut i = 0;

+    // Pre-compute valid link bracket positions.
+    // A bracket is considered a valid link if:
+    // 1. It's followed by `(` (inline link) or `[` (full reference), OR
+    // 2. It's a shortcut reference with a defined reference (checked via reference_checker)
+    let mut link_bracket_starts = Vec::new();
+    for pos in 0..bytes.len() {
+        if bytes[pos] == b'['
+            && let Some(result) = check_bracket_pattern(bytes, pos)
+        {
+            if result.is_inline_or_full_ref {
+                // Inline link or full reference link
+                link_bracket_starts.push(pos);
+            } else {
+                // Could be a shortcut reference - check if definition exists
+                let label = extract_label_text(source, pos, result.close_pos);
+                let normalized = normalize_reference_label(label);
+                if !normalized.is_empty() && reference_checker(&normalized) {
+                    link_bracket_starts.push(pos);
+                }
+            }
+        }
+    }
+
+    // Track bracket depth, but only for valid link brackets
+    let mut bracket_depth = 0usize;
+    let mut active_link_brackets: Vec<usize> = Vec::new();
+
     while i < bytes.len() {
         let b = bytes[i];

+        // Track bracket depth for valid links only
+        if b == b'[' && link_bracket_starts.contains(&i) {
+            bracket_depth += 1;
+            active_link_brackets.push(i);
+            i += 1;
+            continue;
+        }
+        if b == b']' && !active_link_brackets.is_empty() {
+            bracket_depth = bracket_depth.saturating_sub(1);
+            active_link_brackets.pop();
+            i += 1;
+            continue;
+        }
+
         // Check for delimiter characters
         if b == b'*' || b == b'_' {
             let kind = if b == b'*' {
@@ -265,6 +374,10 @@ fn collect_delimiter_runs(source: &str) -> Vec<DelimRun> {
                 can_open,
                 can_close,
                 start_offset,
+                // Only scope by bracket depth when inside a valid link pattern.
+                // This prevents emphasis from spanning link boundaries, but allows
+                // emphasis to span brackets that don't form valid links.
+                label_id: bracket_depth,
             });

             i = end_offset;
@@ -330,41 +443,38 @@ fn match_delimiters(runs: &mut [DelimRun]) -> Vec<EmphasisMatch> {
         if runs[idx].can_close && runs[idx].count > 0 {
             loop {
                 let mut opener_stack_pos = None;
-                let prefer_strong = runs[idx].count >= 2;
-
-                for pass in 0..2 {
-                    for (pos, &opener_idx) in opener_stack.iter().enumerate().rev() {
-                        let opener = &runs[opener_idx];
-                        let closer = &runs[idx];

-                        if opener.kind != closer.kind || !opener.can_open || opener.count == 0 {
-                            continue;
-                        }
-
-                        if prefer_strong && pass == 0 && opener.count < 2 {
-                            continue;
-                        }
-
-                        // Rule of 3: if (opener_count + closer_count) % 3 == 0 and
-                        // the closer can open or the opener can close, skip unless
-                        // both counts are divisible by 3
-                        let opener_count = opener.count;
-                        let closer_count = closer.count;
-                        if ((opener.can_open && opener.can_close)
-                            || (closer.can_open && closer.can_close))
-                            && (opener_count + closer_count).is_multiple_of(3)
-                            && (!opener_count.is_multiple_of(3) || !closer_count.is_multiple_of(3))
-                        {
-                            continue;
-                        }
+                // Search backward for the closest matching opener.
+                // Per CommonMark spec, we find any matching opener first,
+                // then determine strong vs regular based on both counts.
+                for (pos, &opener_idx) in opener_stack.iter().enumerate().rev() {
+                    let opener = &runs[opener_idx];
+                    let closer = &runs[idx];
+
+                    // Only match within same bracket scope (label_id).
+                    // This prevents emphasis from spanning link boundaries.
+                    if opener.label_id != closer.label_id {
+                        continue;
+                    }

-                        opener_stack_pos = Some(pos);
-                        break;
+                    if opener.kind != closer.kind || !opener.can_open || opener.count == 0 {
+                        continue;
                     }

-                    if opener_stack_pos.is_some() {
-                        break;
+                    // Rule of 3: if one delimiter run can both open and close
+                    // and (opener_count + closer_count) % 3 == 0, skip unless
+                    // both counts are divisible by 3. Checking only the closer
+                    // count suffices: when the sum is a multiple of 3, the
+                    // closer count is a multiple of 3 iff the opener count is.
+                    let opener_count = opener.count;
+                    let closer_count = closer.count;
+                    if (opener.can_close || closer.can_open)
+                        && !closer_count.is_multiple_of(3)
+                        && (opener_count + closer_count).is_multiple_of(3)
+                    {
+                        continue;
                     }
+
+                    opener_stack_pos = Some(pos);
+                    break;
                 }

                 let Some(pos) = opener_stack_pos else { break };
@@ -375,7 +485,11 @@ fn match_delimiters(runs: &mut [DelimRun]) -> Vec<EmphasisMatch> {
                     1
                 };

-                let opener_start = runs[opener_idx].start_offset;
+                // Openers consume from END of run (leftover stays at beginning).
+                // This ensures for `***foo***`, the inner `**` is consumed leaving `*` at start.
+                let opener_start =
+                    runs[opener_idx].start_offset + runs[opener_idx].count - use_count;
+                // Closers consume from BEGINNING of what remains.
                 let closer_start = runs[idx].start_offset;

                 matches.push(EmphasisMatch {
@@ -384,8 +498,9 @@ fn match_delimiters(runs: &mut [DelimRun]) -> Vec<EmphasisMatch> {
                     is_strong: use_count == 2,
                 });

+                // Opener: reduce count but keep start_offset (leftover is at beginning)
                 runs[opener_idx].count -= use_count;
-                runs[opener_idx].start_offset += use_count;
+                // Closer: reduce count and advance start_offset (leftover is at end)
                 runs[idx].count -= use_count;
                 runs[idx].start_offset += use_count;

@@ -395,10 +510,10 @@ fn match_delimiters(runs: &mut [DelimRun]) -> Vec<EmphasisMatch> {
                     opener_stack.pop();
                 }

-                if use_count == 2 && runs[opener_idx].count > 0 && runs[idx].count > 0 {
-                    // Avoid crossing matches from odd-length runs (e.g. ***foo***).
-                    break;
-                }
+                // Note: With the "consume from END" algorithm for openers,
+                // crossing matches are no longer an issue because the leftover
+                // chars end up at the beginning of the opener run (wrapping
+                // around the inner match), not at the end (which would cross).

                 if runs[idx].count == 0 {
                     break;
@@ -426,10 +541,26 @@ pub(crate) struct EmphasisContext {
     base_offset: usize,
 }

+/// Information about a match found within a token's range.
+/// Used when the opener doesn't start at the exact token boundary.
+#[derive(Debug)]
+struct OpenerMatch<'a> {
+    /// The matched emphasis span
+    matched: &'a EmphasisMatch,
+    /// How many chars before opener_start (literal prefix to emit)
+    prefix_len: usize,
+}
+
 impl EmphasisContext {
-    /// Create a new emphasis context by analyzing the source text
-    pub(crate) fn new(source: &str, base_offset: usize) -> Self {
-        let mut runs = collect_delimiter_runs(source);
+    /// Create a new emphasis context by analyzing the source text.
+    /// The reference_checker function is used to determine if a bracket pattern
+    /// is a valid shortcut reference link.
+    pub(crate) fn new(
+        source: &str,
+        base_offset: usize,
+        reference_checker: impl Fn(&str) -> bool,
+    ) -> Self {
+        let mut runs = collect_delimiter_runs(source, reference_checker);
         let matches = match_delimiters(&mut runs);
         Self {
             matches,
@@ -437,12 +568,44 @@ impl EmphasisContext {
         }
     }

-    /// Check if there's an emphasis opener at the given offset
-    fn opener_at(&self, offset: usize) -> Option<&EmphasisMatch> {
-        let abs_offset = offset;
-        self.matches
-            .iter()
-            .find(|m| m.opener_start + self.base_offset == abs_offset)
+    /// Find the *earliest* match whose opener_start is within [token_start, token_end)
+    /// and matches the expected `is_strong` value.
+    /// Returns None if no match found, or the match plus prefix length.
+    ///
+    /// This is used instead of exact offset matching because with the "consume from END"
+    /// algorithm, an opener might start in the middle of a DOUBLE_STAR token.
+    fn opener_within(
+        &self,
+        token_start: usize,
+        token_len: usize,
+        expect_strong: bool,
+    ) -> Option<OpenerMatch<'_>> {
+        let token_end = token_start + token_len;
+        let mut best: Option<OpenerMatch<'_>> = None;
+
+        for m in &self.matches {
+            // Filter by expected emphasis type
+            if m.is_strong != expect_strong {
+                continue;
+            }
+
+            let abs_opener = m.opener_start + self.base_offset;
+            if abs_opener >= token_start && abs_opener < token_end {
+                let candidate = OpenerMatch {
+                    matched: m,
+                    prefix_len: abs_opener - token_start,
+                };
+                // Pick the earliest match (smallest prefix_len)
+                if best
+                    .as_ref()
+                    .is_none_or(|b| candidate.prefix_len < b.prefix_len)
+                {
+                    best = Some(candidate);
+                }
+            }
+        }
+
+        best
     }
 }

@@ -458,72 +621,143 @@ pub(crate) fn parse_hard_line(p: &mut MarkdownParser) -> ParsedSyntax {
         return Absent;
     }

+    let ends_block = p.lookahead(|p| {
+        p.bump(MD_HARD_LINE_LITERAL);
+        p.at(NEWLINE) || p.at(EOF)
+    });
+
+    if ends_block {
+        return super::parse_textual(p);
+    }
+
     let m = p.start();
     p.bump(MD_HARD_LINE_LITERAL);
     Present(m.complete(p, MD_HARD_LINE))
 }

+/// Check if there's a matching closing backtick sequence before EOF/blank line.
+///
+/// Per CommonMark §6.1, a code span opener must have a matching closer with the
+/// same number of backticks. If no match exists, the opener should be treated
+/// as literal text, not an unclosed code span.
+///
+/// Returns false if no match found (opener should become literal text).
+fn has_matching_code_span_closer(p: &mut MarkdownParser, opening_count: usize) -> bool {
+    use crate::lexer::MarkdownLexContext;
+
+    p.lookahead(|p| {
+        // Skip the opening backticks
+        p.bump(BACKTICK);
+
+        loop {
+            // EOF = no matching closer found
+            if p.at(T![EOF]) {
+                return false;
+            }
+
+            // Blank line = paragraph boundary, terminates search
+            if p.at(NEWLINE) && p.at_blank_line() {
+                return false;
+            }
+
+            // Per CommonMark §4.3, setext heading underlines take priority over
+            // inline code spans. If crossing a newline would land on a setext
+            // underline, the code span is invalid — the underline forms a heading.
+            if p.at(NEWLINE) {
+                p.bump_remap_with_context(MD_TEXTUAL_LITERAL, MarkdownLexContext::CodeSpan);
+                if crate::syntax::at_setext_underline_after_newline(p).is_some() {
+                    return false;
+                }
+                continue;
+            }
+
+            // Found backticks - check if they match
+            if p.at(BACKTICK) {
+                let closing_count = p.cur_text().len();
+                if closing_count == opening_count {
+                    return true;
+                }
+                // Not matching - continue searching
+                p.bump(BACKTICK);
+                continue;
+            }
+
+            // Consume token and continue (use CodeSpan context for proper backslash handling)
+            p.bump_remap_with_context(MD_TEXTUAL_LITERAL, MarkdownLexContext::CodeSpan);
+        }
+    })
+}
+
 /// Parse inline code span (`` `code` `` or ``` `` `code` `` ```).
 ///
 /// Grammar: MdInlineCode = l_tick: '`' content: MdInlineItemList r_tick: '`'
 ///
-/// Per CommonMark, code spans can use multiple backticks to allow literal
-/// backticks inside: ``` `` `code` `` ``` wraps around code containing backticks.
-/// The opening and closing backtick strings must be the same length.
+/// Per CommonMark §6.1:
+/// - Code spans can use multiple backticks to allow literal backticks inside
+/// - The opening and closing backtick strings must be the same length
+/// - Backslash escapes are NOT processed inside code spans (a backslash before a backtick stays literal)
+/// - If no matching closer exists, the opener is treated as literal text
 pub(crate) fn parse_inline_code(p: &mut MarkdownParser) -> ParsedSyntax {
+    use crate::lexer::MarkdownLexContext;
+
     if !p.at(BACKTICK) {
         return Absent;
     }

-    let m = p.start();
-
-    // Count opening backticks from token text
     let opening_count = p.cur_text().len();
-    let opening_range = p.cur_range();
+
+    // DESIGN PRINCIPLE #2 & #4: Check for matching closer BEFORE creating any nodes.
+    // If no match exists, return Absent so backticks become literal text.
+    // This avoids synthesizing MD_INLINE_CODE with missing r_tick_token.
+    if !has_matching_code_span_closer(p, opening_count) {
+        return Absent; // Caller will treat backtick as literal MD_TEXTUAL
+    }
+
+    // We have a valid code span - now parse it
+    let m = p.start();

     // Opening backtick(s)
     p.bump(BACKTICK);

-    // Content - parse until we find a BACKTICK with matching count, or EOF
+    // Content - parse until we find matching closing backticks
+    // Per CommonMark, code spans can span multiple lines (newlines become spaces in output)
+    // All content is lexed in CodeSpan context to keep backslash literal and avoid
+    // hard-line-break detection.
     let content = p.start();
-    let mut found_closing = false;
     loop {
-        if p.at_inline_end() {
+        // EOF should not happen (lookahead guaranteed a closer), but handle defensively
+        if p.at(T![EOF]) {
             break;
         }

-        // Check for matching closing backticks
-        if p.at(BACKTICK) {
-            let closing_count = p.cur_text().len();
-            if closing_count == opening_count {
-                // Found matching closing backticks
-                found_closing = true;
-                break;
+        // DESIGN PRINCIPLE #3: Terminate on blank line (paragraph boundary)
+        if p.at(NEWLINE) {
+            if p.at_blank_line() {
+                break; // Paragraph boundary - stop
             }
-            // Not matching - consume as content
+            // Soft line break - consume NEWLINE as content and continue
+            // Use CodeSpan context so next token is also lexed without escape processing
             let text_m = p.start();
-            p.bump_remap(MD_TEXTUAL_LITERAL);
+            p.bump_remap_with_context(MD_TEXTUAL_LITERAL, MarkdownLexContext::CodeSpan);
             text_m.complete(p, MD_TEXTUAL);
             continue;
         }

-        // Regular content
+        // Found matching closing backticks
+        if p.at(BACKTICK) && p.cur_text().len() == opening_count {
+            break;
+        }
+
+        // DESIGN PRINCIPLE #1: Use CodeSpan context so backslash is literal
         let text_m = p.start();
-        p.bump_remap(MD_TEXTUAL_LITERAL);
+        p.bump_remap_with_context(MD_TEXTUAL_LITERAL, MarkdownLexContext::CodeSpan);
         text_m.complete(p, MD_TEXTUAL);
     }
     content.complete(p, MD_INLINE_ITEM_LIST);

-    // Closing backtick(s) - emit custom diagnostic if missing
-    if found_closing {
-        p.bump(BACKTICK);
-    } else {
-        p.error(super::parse_error::unclosed_code_span(
-            p,
-            opening_range,
-            opening_count,
-        ));
-    }
+    // Closing backticks (guaranteed to exist due to lookahead check)
+    p.bump(BACKTICK);

     Present(m.complete(p, MD_INLINE_CODE))
 }
@@ -535,46 +769,95 @@ fn parse_emphasis_from_context(p: &mut MarkdownParser, expect_strong: bool) -> P
         None => return Absent,
     };

-    let offset = u32::from(p.cur_range().start()) as usize;
-    let matched = match context.opener_at(offset) {
-        Some(matched) => matched,
+    // Must be at an emphasis token
+    if !p.at(DOUBLE_STAR) && !p.at(DOUBLE_UNDERSCORE) && !p.at(T![*]) && !p.at(UNDERSCORE) {
+        return Absent;
+    }
+
+    // Get current token info BEFORE any re-lex
+    let token_start = u32::from(p.cur_range().start()) as usize;
+    let token_len: usize = p.cur_range().len().into();
+
+    // Find match within current token's range that has the expected is_strong value
+    let opener_match = match context.opener_within(token_start, token_len, expect_strong) {
+        Some(m) => m,
         None => return Absent,
     };

-    if matched.is_strong != expect_strong {
+    // If the opener doesn't start at the exact token boundary, return Absent.
+    // The caller (parse_any_inline) will emit literal text, advancing the parser position.
+    // On subsequent calls, we'll eventually be at the correct position with prefix_len == 0.
+    if opener_match.prefix_len > 0 {
         return Absent;
     }

-    let (opener_kind, closer_kind, opener_text) = if expect_strong {
-        if p.at(DOUBLE_STAR) {
-            (DOUBLE_STAR, DOUBLE_STAR, "**")
-        } else if p.at(DOUBLE_UNDERSCORE) {
-            (DOUBLE_UNDERSCORE, DOUBLE_UNDERSCORE, "__")
-        } else {
-            return Absent;
-        }
-    } else if p.at(T![*]) {
-        (T![*], T![*], "*")
-    } else if p.at(UNDERSCORE) {
-        (UNDERSCORE, UNDERSCORE, "_")
-    } else {
-        return Absent;
+    // Extract values before dropping the borrow on context
+    let use_count = if expect_strong { 2 } else { 1 };
+    let closer_offset = opener_match.matched.closer_start + context.base_offset;
+    // Use the correct delimiter character for error messages
+    let is_underscore = p.at(DOUBLE_UNDERSCORE) || p.at(UNDERSCORE);
+    let opener_text = match (expect_strong, is_underscore) {
+        (true, true) => "__",
+        (true, false) => "**",
+        (false, true) => "_",
+        (false, false) => "*",
     };

-    let closer_offset = matched.closer_start + context.base_offset;
     let m = p.start();
     let opening_range = p.cur_range();

-    p.bump(opener_kind);
+    // Consume opener tokens
+    // For strong emphasis (use_count=2), we can bump DOUBLE_* directly if at one.
+    // Only re-lex when we need to consume a partial token or single chars.
+    if use_count == 2 && (p.at(DOUBLE_STAR) || p.at(DOUBLE_UNDERSCORE)) {
+        // Bump the double token as a single unit
+        p.bump_any();
+    } else {
+        // Consume individual tokens
+        for _ in 0..use_count {
+            if p.at(DOUBLE_STAR) || p.at(DOUBLE_UNDERSCORE) {
+                p.force_relex_emphasis_inline();
+            }
+            p.bump_any();
+        }
+    }

+    // Parse content until we reach the closer
     let content = p.start();
     loop {
-        if p.at_inline_end() {
+        // EOF always ends content
+        if p.at(T![EOF]) {
             break;
         }

         let current_offset = u32::from(p.cur_range().start()) as usize;
-        if current_offset == closer_offset {
+        let current_len: usize = p.cur_range().len().into();
+
+        // Check if closer is AT or WITHIN current token
+        if closer_offset >= current_offset && closer_offset < current_offset + current_len {
+            break;
+        }
+
+        // Check if we've passed the closer (can happen when link parsing consumes past it)
+        if current_offset > closer_offset {
+            break;
+        }
+
+        // Handle NEWLINE: emphasis can span multiple lines per CommonMark
+        // But blank lines end paragraphs, so stop there
+        if p.at(NEWLINE) {
+            if p.at_blank_line() {
+                // Blank line = paragraph boundary, emphasis is unclosed
+                break;
+            }
+            if closer_offset > current_offset {
+                // Soft line break - consume NEWLINE as textual content and continue
+                let text_m = p.start();
+                p.bump_remap(MD_TEXTUAL_LITERAL);
+                text_m.complete(p, MD_TEXTUAL);
+                continue;
+            }
+            // Closer should have been at or before this newline - stop
             break;
         }

@@ -584,9 +867,45 @@ fn parse_emphasis_from_context(p: &mut MarkdownParser, expect_strong: bool) -> P
     }
     content.complete(p, MD_INLINE_ITEM_LIST);

-    if p.at(closer_kind) && u32::from(p.cur_range().start()) as usize == closer_offset {
-        p.bump(closer_kind);
+    // Consume closer tokens (1 or 2)
+    // Handle partial closer consumption: the closer run may start partway into a
+    // longer delimiter token (e.g., a single-`*` closer that falls inside a DOUBLE_STAR token)
+    let current_offset = u32::from(p.cur_range().start()) as usize;
+    let closer_prefix_len = closer_offset.saturating_sub(current_offset);
+
+    if closer_prefix_len > 0 {
+        // Closer starts AFTER token start - emit prefix as literal
+        if p.at(DOUBLE_STAR) || p.at(DOUBLE_UNDERSCORE) {
+            p.force_relex_emphasis_inline();
+        }
+        for _ in 0..closer_prefix_len {
+            let text_m = p.start();
+            p.bump_remap(MD_TEXTUAL_LITERAL);
+            text_m.complete(p, MD_TEXTUAL);
+        }
+    }
+
+    // Now consume actual closer delimiters
+    // For strong emphasis (use_count=2), we can bump DOUBLE_* directly if at one.
+    let mut consumed_closer = 0;
+    if use_count == 2 && (p.at(DOUBLE_STAR) || p.at(DOUBLE_UNDERSCORE)) {
+        p.bump_any();
+        consumed_closer = 2;
     } else {
+        for _ in 0..use_count {
+            if p.at(DOUBLE_STAR) || p.at(DOUBLE_UNDERSCORE) {
+                p.force_relex_emphasis_inline();
+            }
+            if p.at(T![*]) || p.at(UNDERSCORE) || p.at(DOUBLE_STAR) || p.at(DOUBLE_UNDERSCORE) {
+                p.bump_any();
+                consumed_closer += 1;
+            } else {
+                break;
+            }
+        }
+    }
+
+    if consumed_closer < use_count {
         p.error(super::parse_error::unclosed_emphasis(
             p,
             opening_range,
@@ -633,6 +952,31 @@ fn parse_inline_item_list_until_no_links(p: &mut MarkdownParser, stop: MarkdownS
             break;
         }

+        // IMPORTANT: Parse constructs that can contain `]` BEFORE checking for stop token.
+        // Per CommonMark, `]` inside code spans, autolinks, and HTML doesn't terminate links.
+
+        // Code spans can contain `]`
+        if p.at(BACKTICK) {
+            if parse_inline_code(p).is_present() {
+                continue;
+            }
+            let _ = super::parse_textual(p);
+            continue;
+        }
+
+        // Autolinks and inline HTML can contain `]`
+        if p.at(L_ANGLE) {
+            if parse_autolink(p).is_present() {
+                continue;
+            }
+            if parse_inline_html(p).is_present() {
+                continue;
+            }
+            let _ = super::parse_textual(p);
+            continue;
+        }
+
+        // NOW check for stop token (after constructs that can contain it)
         if p.at(stop) {
             if bracket_depth == 0 {
                 break;
@@ -661,6 +1005,86 @@ fn parse_inline_item_list_until_no_links(p: &mut MarkdownParser, stop: MarkdownS
     has_nested_link
 }

+/// Parse inline items until `stop` token, allowing full inline parsing including links.
+/// Used for image alt text where nested links/images should be fully parsed
+/// so their text content can be extracted for the alt attribute.
+fn parse_inline_item_list_until(p: &mut MarkdownParser, stop: MarkdownSyntaxKind) {
+    let m = p.start();
+    let prev_context = set_inline_emphasis_context_until(p, stop);
+    let mut bracket_depth = 0usize;
+
+    loop {
+        if p.at(NEWLINE) {
+            if p.at_blank_line() {
+                break;
+            }
+            let _ = super::parse_textual(p);
+            continue;
+        }
+
+        if p.at(T![EOF]) {
+            break;
+        }
+
+        // Code spans can contain `]`
+        if p.at(BACKTICK) {
+            if parse_inline_code(p).is_present() {
+                continue;
+            }
+            let _ = super::parse_textual(p);
+            continue;
+        }
+
+        // Autolinks and inline HTML can contain `]`
+        if p.at(L_ANGLE) {
+            if parse_autolink(p).is_present() {
+                continue;
+            }
+            if parse_inline_html(p).is_present() {
+                continue;
+            }
+            let _ = super::parse_textual(p);
+            continue;
+        }
+
+        if p.at(stop) {
+            if bracket_depth == 0 {
+                break;
+            }
+            bracket_depth = bracket_depth.saturating_sub(1);
+            let _ = super::parse_textual(p);
+            continue;
+        }
+
+        // For image alt: allow full inline parsing including links and images
+        if p.at(L_BRACK) {
+            let result = parse_link_or_image(p, LinkParseKind::Link);
+            if result.is_present() {
+                continue;
+            }
+            bracket_depth += 1;
+            let _ = super::parse_textual(p);
+            continue;
+        }
+
+        if p.at(BANG) && p.nth_at(1, L_BRACK) {
+            let result = parse_link_or_image(p, LinkParseKind::Image);
+            if result.is_present() {
+                continue;
+            }
+            let _ = super::parse_textual(p);
+            continue;
+        }
+
+        if parse_any_inline(p).is_absent() {
+            break;
+        }
+    }
+
+    m.complete(p, MD_INLINE_ITEM_LIST);
+    p.set_emphasis_context(prev_context);
+}
+
 fn nested_link_starts_here(p: &mut MarkdownParser) -> bool {
     p.lookahead(|p| {
         if !p.at(L_BRACK) {
@@ -720,7 +1144,10 @@ fn set_inline_emphasis_context_until(
         source
     };
     let base_offset = u32::from(p.cur_range().start()) as usize;
-    let context = EmphasisContext::new(inline_source, base_offset);
+    // Create a reference checker closure that uses the parser's link reference definitions
+    let context = EmphasisContext::new(inline_source, base_offset, |label| {
+        p.has_link_reference_definition(label)
+    });
     p.set_emphasis_context(Some(context))
 }

@@ -851,21 +1278,6 @@ impl LinkParseKind {
         }
     }

-    fn report_unclosed_text(self, p: &mut MarkdownParser, opening_range: TextRange) {
-        match self {
-            Self::Link => p.error(super::parse_error::unclosed_link(
-                p,
-                opening_range,
-                "expected `]` to close link text",
-            )),
-            Self::Image => p.error(super::parse_error::unclosed_image(
-                p,
-                opening_range,
-                "expected `]` to close alt text",
-            )),
-        }
-    }
-
     fn report_unclosed_destination(self, p: &mut MarkdownParser, opening_range: TextRange) {
         match self {
             Self::Link => p.error(super::parse_error::unclosed_link(
@@ -897,30 +1309,28 @@ fn parse_link_or_image(p: &mut MarkdownParser, kind: LinkParseKind) -> ParsedSyn
     kind.bump_opening(p);

     // Link text / alt text
-    let has_nested_link = parse_inline_item_list_until_no_links(p, R_BRACK);
+    let has_nested_link = if matches!(kind, LinkParseKind::Image) {
+        // For images, allow full inline parsing (including links) in alt text.
+        // This lets nested links/images be parsed so their text can be extracted for alt.
+        parse_inline_item_list_until(p, R_BRACK);
+        false
+    } else {
+        parse_inline_item_list_until_no_links(p, R_BRACK)
+    };

-    // ] - if missing at inline end, emit diagnostic; otherwise rewind
+    // ] - if missing, rewind and treat [ as literal text.
+    // Per CommonMark, if there's no valid ] to close the link (e.g., all ]
+    // characters are inside code spans or HTML), the [ is literal text.
+    // NOTE: We intentionally do NOT emit an "unclosed link" diagnostic here.
+    // CommonMark treats unmatched `[` as literal text, not an error.
     if !p.eat(R_BRACK) {
-        if matches!(kind, LinkParseKind::Link) && has_nested_link {
-            m.abandon(p);
-            p.rewind(checkpoint);
-            return Absent;
-        }
-        if p.at_inline_end() {
-            // Unclosed link/image at end of inline content - emit diagnostic
-            // Expand range to include the text content, not just the opening bracket
-            let full_range = TextRange::new(opening_range.start(), p.cur_range().start());
-            kind.report_unclosed_text(p, full_range);
-            // Return as reference link/image (shortcut) with missing closing bracket
-            return Present(m.complete(p, kind.reference_kind()));
-        }
-        // Not at inline end but missing ] - rewind and treat as text
         m.abandon(p);
         p.rewind(checkpoint);
         return Absent;
     }
-    let text_end_offset = p.cur_range().start();

+    // Per CommonMark, a link (not image) whose text contains another link must fail.
+    // The inner link wins and the outer `[` becomes literal text.
     if matches!(kind, LinkParseKind::Link) && has_nested_link {
         m.abandon(p);
         p.rewind(checkpoint);
@@ -1007,7 +1417,10 @@ fn parse_link_or_image(p: &mut MarkdownParser, kind: LinkParseKind) -> ParsedSyn
         {
             m.abandon(p);
             p.rewind(checkpoint);
-            return consume_textual_until_offset(p, text_end_offset);
+            // Return Absent - the caller will treat `[` as textual.
+            // Don't consume the whole bracket sequence to avoid consuming
+            // past emphasis closers.
+            return Absent;
         }

         Present(m.complete(p, kind.reference_kind()))
@@ -1020,7 +1433,10 @@ fn parse_link_or_image(p: &mut MarkdownParser, kind: LinkParseKind) -> ParsedSyn
         {
             m.abandon(p);
             p.rewind(checkpoint);
-            return consume_textual_until_offset(p, text_end_offset);
+            // Return Absent - the caller will treat `[` as textual.
+            // Don't consume the whole bracket sequence to avoid consuming
+            // past emphasis closers.
+            return Absent;
         }
         Present(m.complete(p, kind.reference_kind()))
     }
@@ -1064,7 +1480,7 @@ fn lookahead_reference_common(

         p.bump(L_BRACK);

-        let link_text = collect_bracket_text(p)?;
+        let link_text = collect_link_text(p)?;

         // Link text must be non-empty after normalization (e.g., `[\n ]` normalizes to empty)
         let normalized_link = normalize_reference_label(&link_text);
@@ -1080,7 +1496,7 @@ fn lookahead_reference_common(

         if p.at(L_BRACK) {
             p.bump(L_BRACK);
-            let label_text = collect_bracket_text(p);
+            let label_text = collect_label_text_simple(p);
             if let Some(label_text) = label_text {
                 let label = if label_text.is_empty() {
                     link_text.clone()
@@ -1107,13 +1523,31 @@ fn lookahead_reference_common(
     })
 }

-fn collect_bracket_text(p: &mut MarkdownParser) -> Option<String> {
+/// Collect text for a link label (e.g., the `label` in `[text][label]`).
+///
+/// Per CommonMark §4.7, link labels have specific rules:
+/// - Unescaped square brackets are NOT allowed inside labels (see example 555)
+/// - Backslash escapes ARE allowed (e.g., `\]` is a literal `]` in the label)
+/// - No inline parsing (backticks, HTML, etc. are literal characters)
+///
+/// We stop at the first R_BRACK token (unescaped `]`). Escaped brackets like `\]`
+/// are lexed as MD_TEXTUAL_LITERAL, not R_BRACK, so they're included in the label.
+fn collect_label_text_simple(p: &mut MarkdownParser) -> Option<String> {
     let mut text = String::new();
+
     loop {
         if p.at(T![EOF]) || p.at_inline_end() {
             return None;
         }

+        // Blank lines terminate
+        if p.at(NEWLINE) && p.at_blank_line() {
+            return None;
+        }
+
+        // R_BRACK token = unescaped `]` closes the label.
+        // Note: Escaped brackets (`\]`) are lexed as MD_TEXTUAL_LITERAL,
+        // not R_BRACK, so they're correctly included in the label text.
         if p.at(R_BRACK) {
             return Some(text);
         }
@@ -1123,18 +1557,94 @@ fn collect_bracket_text(p: &mut MarkdownParser) -> Option<String> {
     }
 }

-fn consume_textual_until_offset(p: &mut MarkdownParser, end_offset: TextSize) -> ParsedSyntax {
-    let mut last = Absent;
+/// Collect text for link text (e.g., the `text` in `[text](url)` or `[text][label]`).
+/// Per CommonMark, link text CAN contain inline elements - code spans, autolinks, HTML.
+/// `]` inside these constructs does NOT close the link text.
+fn collect_link_text(p: &mut MarkdownParser) -> Option<String> {
+    let mut text = String::new();
+    let mut bracket_depth = 0usize;

-    while !p.at(T![EOF]) {
-        let end = p.cur_range().end();
-        last = super::parse_textual(p);
-        if end >= end_offset {
-            break;
+    loop {
+        if p.at(T![EOF]) || p.at_inline_end() {
+            return None;
         }
-    }

-    last
+        // Per CommonMark, blank lines terminate link text
+        if p.at(NEWLINE) && p.at_blank_line() {
+            return None;
+        }
+
+        // Code spans can contain `]` - skip them entirely.
+        // Per CommonMark, `]` inside code spans doesn't terminate link text.
+        if p.at(BACKTICK) {
+            let opening_count = p.cur_text().len();
+            text.push_str(p.cur_text());
+            p.bump(p.cur());
+
+            // Scan for matching closing backticks. If none are found before the
+            // text ends, the opening backticks are simply literal characters
+            // (they were already pushed to `text` above), so no cleanup is needed.
+            while !p.at(T![EOF]) && !p.at_inline_end() {
+                if p.at(NEWLINE) && p.at_blank_line() {
+                    break; // Blank line terminates
+                }
+                if p.at(BACKTICK) && p.cur_text().len() == opening_count {
+                    text.push_str(p.cur_text());
+                    p.bump(p.cur());
+                    break;
+                }
+                text.push_str(p.cur_text());
+                p.bump(p.cur());
+            }
+            continue;
+        }
+
+        // Autolinks and inline HTML can contain `]` - skip them entirely.
+        // Per CommonMark, `]` inside `<...>` constructs doesn't terminate link text.
+        if p.at(L_ANGLE) {
+            text.push_str(p.cur_text());
+            p.bump(p.cur());
+
+            // Consume until `>` or newline
+            while !p.at(T![EOF]) && !p.at_inline_end() && !p.at(R_ANGLE) {
+                if p.at(NEWLINE) {
+                    // Newlines end autolinks/HTML tags
+                    break;
+                }
+                text.push_str(p.cur_text());
+                p.bump(p.cur());
+            }
+            if p.at(R_ANGLE) {
+                text.push_str(p.cur_text());
+                p.bump(p.cur());
+            }
+            continue;
+        }
+
+        if p.at(L_BRACK) {
+            bracket_depth += 1;
+            text.push_str(p.cur_text());
+            p.bump(p.cur());
+            continue;
+        }
+
+        if p.at(R_BRACK) {
+            if bracket_depth == 0 {
+                return Some(text);
+            }
+            bracket_depth -= 1;
+            text.push_str(p.cur_text());
+            p.bump(p.cur());
+            continue;
+        }
+
+        text.push_str(p.cur_text());
+        p.bump(p.cur());
+    }
 }

 fn bump_textual_link_def(p: &mut MarkdownParser) {
@@ -1522,19 +2032,30 @@ pub(crate) fn parse_inline_image(p: &mut MarkdownParser) -> ParsedSyntax {
 /// - Processing instructions: `<? ... ?>`
 /// - Declarations: `<! ... >`
 /// - CDATA: `<![CDATA[ ... ]]>`
-fn is_inline_html(text: &str) -> Option<usize> {
+pub(crate) fn is_inline_html(text: &str) -> Option<usize> {
     let bytes = text.as_bytes();
     if bytes.len() < 2 || bytes[0] != b'<' {
         return None;
     }

     // HTML comment: <!-- ... -->
+    // Per CommonMark 0.31.2 §6.8, an HTML comment consists of `<!--` + text + `-->`,
+    // where text does not start with `>` or `->`, and does not end with `-`.
+    // Additionally, `<!-->` and `<!--->` are valid (degenerate) comments.
     if bytes.starts_with(b"<!--") {
-        // Find closing -->
+        let rest = &bytes[4..];
+        // Handle degenerate comments: <!-->  and  <!--->
+        if rest.starts_with(b">") {
+            return Some(5); // <!-->
+        }
+        if rest.starts_with(b"->") {
+            return Some(6); // <!--->
+        }
+        // Find closing --> after <!--
         if let Some(pos) = text[4..].find("-->") {
             let body = &text[4..4 + pos];
-            // CommonMark: comment cannot start with '>' or '->', and must not contain "--"
-            if body.starts_with('>') || body.starts_with("->") || body.contains("--") {
+            // Body must not end with '-'
+            if body.ends_with('-') {
                 return None;
             }
             return Some(4 + pos + 3);
@@ -1658,8 +2179,17 @@ fn is_inline_html(text: &str) -> Option<usize> {
     let is_attr_name_continue =
         |b: u8| b.is_ascii_alphanumeric() || b == b'_' || b == b':' || b == b'.' || b == b'-';

+    let mut need_space = true;
+    // We already know the boundary char was whitespace, so first iteration has space.
+    let mut had_space = true;
+
     loop {
-        let had_space = skip_spaces(&mut i)?;
+        if need_space {
+            let s = skip_spaces(&mut i)?;
+            had_space = had_space || s;
+        }
+        need_space = true;
+
         if i >= bytes.len() {
             return None;
         }
@@ -1690,7 +2220,7 @@ fn is_inline_html(text: &str) -> Option<usize> {
         }

         // Optional whitespace and value
-        skip_spaces(&mut i)?;
+        had_space = skip_spaces(&mut i)?;
         if i < bytes.len() && bytes[i] == b'=' {
             i += 1;
             skip_spaces(&mut i)?;
@@ -1740,7 +2270,11 @@ fn is_inline_html(text: &str) -> Option<usize> {
                     }
                 }
             }
+            // After value, need to find whitespace at top of loop
+            had_space = false;
         }
+        // If no '=' was found, `had_space` from skip_spaces above carries over
+        // as the separator for the next attribute (boolean attribute case).
     }
 }

@@ -1785,6 +2319,12 @@ pub(crate) fn parse_inline_html(p: &mut MarkdownParser) -> ParsedSyntax {
         None => return Absent,
     };

+    // Per CommonMark §4.3, setext heading underlines take priority over inline HTML.
+    // If this HTML tag spans across a line that is a setext underline, treat `<` as literal.
+    if crate::syntax::inline_span_crosses_setext(p, html_len) {
+        return Absent;
+    }
+
     // Valid inline HTML - create the node
     // Use checkpoint so we can rewind if token boundaries don't align
     let checkpoint = p.checkpoint();
@@ -1957,18 +2497,29 @@ pub(crate) fn parse_autolink(p: &mut MarkdownParser) -> ParsedSyntax {
     // <
     p.bump(L_ANGLE);

-    // Content as inline item list containing textual nodes
-    let content = p.start();
-    while !p.at(R_ANGLE) && !p.at_inline_end() {
+    // Content as inline item list containing textual nodes.
+    // Autolinks don't process backslash escapes, but the lexer may combine
+    // `\>` into a single escape token. We re-lex in CodeSpan context where
+    // backslash is literal, so `\` and `>` are separate tokens.
+    p.force_relex_code_span();
+
+    let content_m = p.start();
+    while !p.at(R_ANGLE) && !p.at(T![EOF]) && !p.at_inline_end() {
         let text_m = p.start();
-        p.bump_remap(MD_TEXTUAL_LITERAL);
+        p.bump_remap_with_context(
+            MD_TEXTUAL_LITERAL,
+            crate::lexer::MarkdownLexContext::CodeSpan,
+        );
         text_m.complete(p, MD_TEXTUAL);
     }
-    content.complete(p, MD_INLINE_ITEM_LIST);
+    content_m.complete(p, MD_INLINE_ITEM_LIST);

     // >
     p.expect(R_ANGLE);

+    // Re-lex back to regular context
+    p.force_relex_regular();
+
     Present(m.complete(p, MD_AUTOLINK))
 }

@@ -1977,15 +2528,29 @@ pub(crate) fn parse_any_inline(p: &mut MarkdownParser) -> ParsedSyntax {
     if p.at(MD_HARD_LINE_LITERAL) {
         parse_hard_line(p)
     } else if p.at(BACKTICK) {
-        parse_inline_code(p)
-    } else if p.at(DOUBLE_STAR) || p.at(DOUBLE_UNDERSCORE) {
-        // Try emphasis, fall back to literal text if flanking rules fail
-        let result = parse_inline_emphasis(p);
+        // Try code span, fall back to literal text if no matching closer exists
+        let result = parse_inline_code(p);
         if result.is_absent() {
             super::parse_textual(p)
         } else {
             result
         }
+    } else if p.at(DOUBLE_STAR) || p.at(DOUBLE_UNDERSCORE) {
+        // For cases like `***foo***`, the em match starts at the exact token boundary
+        // (prefix_len=0) while the strong match starts at offset 1 (prefix_len=1).
+        // Try italic first to handle nested emphasis correctly, then try strong.
+        let result = parse_inline_italic(p);
+        if result.is_present() {
+            return result;
+        }
+        let result = parse_inline_emphasis(p);
+        if result.is_present() {
+            return result;
+        }
+        // Neither matched - re-lex to single token and emit just one char as literal.
+        // This handles cases like `**foo*` where opener is at offset 1.
+        p.force_relex_emphasis_inline();
+        super::parse_textual(p)
     } else if p.at(T![*]) || p.at(UNDERSCORE) {
         // Try italic, fall back to literal text if flanking rules fail
         let result = parse_inline_italic(p);
diff --git a/crates/biome_markdown_parser/src/syntax/parse_error.rs b/crates/biome_markdown_parser/src/syntax/parse_error.rs
index 8f864cc0ac..97ce5288ac 100644
--- a/crates/biome_markdown_parser/src/syntax/parse_error.rs
+++ b/crates/biome_markdown_parser/src/syntax/parse_error.rs
@@ -29,26 +29,6 @@ pub(crate) fn unclosed_emphasis(
     ))
 }

-/// Unclosed inline code span.
-///
-/// ```markdown
-/// `code
-///     ^ expected closing `
-/// ```
-pub(crate) fn unclosed_code_span(
-    p: &MarkdownParser,
-    opening_range: TextRange,
-    backtick_count: usize,
-) -> ParseDiagnostic {
-    let backticks = "`".repeat(backtick_count);
-    p.err_builder(
-        format!("Unclosed code span, expected closing {backticks}."),
-        opening_range,
-    )
-    .with_detail(opening_range, "code span started here")
-    .with_hint(format!("Add closing {backticks} to close the code span."))
-}
-
 /// Unclosed inline link.
 ///
 /// ```markdown
diff --git a/crates/biome_markdown_parser/src/to_html.rs b/crates/biome_markdown_parser/src/to_html.rs
index 0512db55fe..90d75ad412 100644
--- a/crates/biome_markdown_parser/src/to_html.rs
+++ b/crates/biome_markdown_parser/src/to_html.rs
@@ -1047,9 +1047,7 @@ fn render_inline_link(link: &MdInlineLink, ctx: &HtmlRenderContext, out: &mut St

 /// Render an inline image.
 fn render_inline_image(img: &MdInlineImage, ctx: &HtmlRenderContext, out: &mut String) {
-    let alt = render_inline_list(&img.alt(), ctx);
-    // Strip HTML tags from alt text
-    let alt = strip_html_tags(&alt);
+    let alt = extract_alt_text(&img.alt(), ctx);

     let dest = collect_inline_text(&img.destination());
     let dest = process_link_destination(&dest);
@@ -1109,8 +1107,7 @@ fn render_reference_link(link: &MdReferenceLink, ctx: &HtmlRenderContext, out: &

 /// Render a reference image.
 fn render_reference_image(img: &MdReferenceImage, ctx: &HtmlRenderContext, out: &mut String) {
-    let alt = render_inline_list(&img.alt(), ctx);
-    let alt = strip_html_tags(&alt);
+    let alt = extract_alt_text(&img.alt(), ctx);
     let alt_raw = collect_inline_text(&img.alt());

     render_reference_common(
@@ -1198,10 +1195,12 @@ fn render_autolink(autolink: &MdAutolink, out: &mut String) {
     // Check if it's an email autolink
     let is_email = content.contains('@') && !content.contains(':');

+    // Autolinks must NOT process backslash escapes or entity decoding.
+    // Only percent-encode for URL safety.
     let href = if is_email {
         format!("mailto:{}", content)
     } else {
-        process_link_destination(&content)
+        percent_encode_uri(&content)
     };

     out.push_str("<a href=\"");
@@ -1460,22 +1459,74 @@ fn escape_html_attribute(text: &str) -> String {
     escape_html(text)
 }

-/// Strip HTML tags from text (for image alt text).
-fn strip_html_tags(text: &str) -> String {
+/// Extract plain text for image alt attribute.
+/// Per CommonMark, the alt text is the content with inline formatting stripped
+/// but text from nested links/images preserved (recursively extracting their text).
+fn extract_alt_text(
+    list: &biome_markdown_syntax::MdInlineItemList,
+    ctx: &HtmlRenderContext,
+) -> String {
     let mut result = String::new();
-    let mut in_tag = false;
+    for item in list.iter() {
+        extract_alt_text_inline(&item, ctx, &mut result);
+    }
+    result
+}

-    for c in text.chars() {
-        if c == '<' {
-            in_tag = true;
-        } else if c == '>' {
-            in_tag = false;
-        } else if !in_tag {
-            result.push(c);
+fn extract_alt_text_inline(inline: &AnyMdInline, ctx: &HtmlRenderContext, out: &mut String) {
+    match inline {
+        AnyMdInline::MdTextual(text) => {
+            render_textual(text, out);
+        }
+        AnyMdInline::MdInlineEmphasis(em) => {
+            out.push_str(&extract_alt_text(&em.content(), ctx));
+        }
+        AnyMdInline::MdInlineItalic(italic) => {
+            out.push_str(&extract_alt_text(&italic.content(), ctx));
+        }
+        AnyMdInline::MdInlineCode(code) => {
+            // Plain text only — no <code> tags for alt attribute
+            let content = collect_raw_inline_text(&code.content());
+            let content = content.replace('\n', " ");
+            let content = if content.starts_with(' ')
+                && content.ends_with(' ')
+                && content.len() > 2
+                && content.chars().any(|c| c != ' ')
+            {
+                content[1..content.len() - 1].to_string()
+            } else {
+                content
+            };
+            out.push_str(&escape_html(&content));
+        }
+        AnyMdInline::MdInlineLink(link) => {
+            // Extract text content from link text
+            out.push_str(&extract_alt_text(&link.text(), ctx));
+        }
+        AnyMdInline::MdInlineImage(img) => {
+            // Recursively extract alt text from nested image
+            out.push_str(&extract_alt_text(&img.alt(), ctx));
+        }
+        AnyMdInline::MdReferenceLink(link) => {
+            out.push_str(&extract_alt_text(&link.text(), ctx));
+        }
+        AnyMdInline::MdReferenceImage(img) => {
+            out.push_str(&extract_alt_text(&img.alt(), ctx));
+        }
+        AnyMdInline::MdAutolink(autolink) => {
+            let content = collect_raw_inline_text(&autolink.value());
+            out.push_str(&escape_html(&content));
+        }
+        AnyMdInline::MdHardLine(_) | AnyMdInline::MdSoftBreak(_) => {
+            out.push(' ');
+        }
+        AnyMdInline::MdEntityReference(entity) => {
+            render_entity_reference(entity, out);
+        }
+        AnyMdInline::MdInlineHtml(_) | AnyMdInline::MdHtmlBlock(_) => {
+            // HTML tags are stripped in alt text
         }
     }
-
-    result
 }

 // ============================================================================
@@ -1590,6 +1641,80 @@ mod tests {
         assert_eq!(html, "<p><em>italic</em> and <strong>bold</strong></p>\n");
     }

+    #[test]
+    fn test_emphasis_complex_cases() {
+        // Test: Nested
+        let parsed = parse_markdown("**bold *and italic* text**\n");
+        assert_eq!(
+            parsed.syntax().kind(),
+            biome_markdown_syntax::MarkdownSyntaxKind::MD_DOCUMENT,
+            "Nested failed: {}",
+            parsed.syntax()
+        );
+
+        // Test: Rule of 3
+        let parsed = parse_markdown("***bold italic***\n");
+        assert_eq!(
+            parsed.syntax().kind(),
+            biome_markdown_syntax::MarkdownSyntaxKind::MD_DOCUMENT,
+            "Rule of 3 failed: {}",
+            parsed.syntax()
+        );
+
+        // Test: Multiple runs
+        let parsed = parse_markdown("*a **b** c*\n");
+        assert_eq!(
+            pars…
Overhauls the handling of block-level elements, with a major focus on list "tightness" and HTML block detection.

Changes include:
- Align HTML block detection conditions with CommonMark specs (tightening start/end conditions).
- Fix logic for "tight" vs "loose" lists, ensuring correct paragraph wrapping in HTML output.
- Correct indentation handling for nested lists and empty list items.
- Improve detection of blockquotes when they appear on the first line of a list item.
- Fix Setext heading edge cases (examples 086–091).
- Prevent hard line breaks from appearing incorrectly at the end of blocks.
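For context, the tight/loose rule these fixes target (paraphrasing the CommonMark spec, not code from this PR): a list is loose when its items are separated by blank lines, and only loose items wrap their contents in `<p>` tags:

```html
<!-- Tight input: "- a\n- b" -->
<ul>
<li>a</li>
<li>b</li>
</ul>

<!-- Loose input: "- a\n\n- b" (blank line between items) -->
<ul>
<li>
<p>a</p>
</li>
<li>
<p>b</p>
</li>
</ul>
```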
Final set of polish fixes to achieve 100% CommonMark conformance.

Changes include:
- Update render_textual to use trimmed text where appropriate to avoid excess whitespace.
- Fix specific HTML rendering edge cases (examples 174, 616, 619-626) related to entities and raw HTML.
- Resolve remaining edge cases for examples 042, 066, 073, 093, and 223.
- Ensure all remaining CommonMark tests pass.
Introduce a utf8proc-backed case fold helper and use it for Markdown link label normalization to align with CommonMark Unicode folding.

Tests: just f
Lint: just l
Conformance: cargo run -p xtask_coverage -- --suites=markdown/commonmark --detailed (100%)
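A minimal illustration of why a dedicated case-fold helper is needed (hypothetical code, not from this PR): full Unicode case folding maps both 'ß' (U+00DF) and 'ẞ' (U+1E9E) to "ss", so the labels [ß], [ẞ], and [SS] must all resolve to the same reference definition, whereas plain lowercasing cannot get there:

```rust
// Hypothetical helper for illustration only: matching reference labels via
// plain lowercasing. Full case folding (utf8proc / caseless) folds 'ß' to
// "ss", so "[ß]" and "[SS]" match; lowercasing alone misses that pair.
fn labels_match_by_lowercase(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

fn main() {
    // Lowercasing does handle the capital-sharp-s vs sharp-s pair...
    assert!(labels_match_by_lowercase("ẞ", "ß"));
    // ...but not the pair CommonMark requires full folding for.
    assert!(!labels_match_by_lowercase("ß", "SS"));
}
```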
Move inline parsing helpers into inline/ submodules (code spans, emphasis, entities, html, links) and rewire the inline dispatch and link parsing.

Keep strong emphasis closer tokens ("**", "__") intact when closing matches to avoid MD_BOGUS nodes in emphasis fixtures.

Tests: cargo test -p biome_markdown_parser --test spec_tests ok::ok::emphasis_complex_md -- --nocapture
Refactors newline/quote handling into helpers to clarify control flow. Consolidates stop-token inline parsing and shared lookahead logic.
@changeset-bot

changeset-bot bot commented Jan 29, 2026

⚠️ No Changeset found

Latest commit: e63d4f8

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@github-actions github-actions bot added A-CLI Area: CLI A-Core Area: core A-Project Area: project A-Linter Area: linter A-Parser Area: parser A-Formatter Area: formatter A-Tooling Area: internal tools A-LSP Area: language server protocol L-JavaScript Language: JavaScript and super languages L-CSS Language: CSS L-JSON Language: JSON and super languages A-Diagnostic Area: diagnostics L-HTML Language: HTML and super languages L-Grit Language: GritQL A-Type-Inference Area: type inference labels Jan 29, 2026
@jfmcdowell jfmcdowell closed this Jan 29, 2026
@jfmcdowell jfmcdowell reopened this Jan 29, 2026
@jfmcdowell jfmcdowell changed the base branch from main to next January 29, 2026 12:46
@github-actions github-actions bot removed A-Core Area: core A-Project Area: project A-Linter Area: linter labels Jan 29, 2026
@coderabbitai
Contributor

coderabbitai bot commented Jan 30, 2026

Walkthrough

The PR restructures the Markdown parser: it splits and reintroduces inline parsing into modular submodules (code spans, emphasis, links, HTML, entities), adds lexer contexts CodeSpan and EmphasisInline, and implements a CommonMark-style delimiter/emphasis engine. HTML block detection/termination logic was rewritten with explicit kinds. List, header and quote parsing gained virtual-line/indent handling and emphasis scoping. The HTML renderer was replaced by a visitor-buffered HtmlRenderer. Utilities added: reference-label normalization, Unicode punctuation table, and case-folding support; small Cargo deps updated.

Possibly related PRs

  • biomejs/biome PR 8525: Large, overlapping refactor touching the same lexer, parser and inline parsing modules (lexer/mod.rs, parser.rs, syntax/inline.rs, fenced_code_block.rs, header.rs, list.rs, quote.rs).
  • biomejs/biome PR 8857: Related work moving CommonMark conformance tests and test-suite handling (coverage/xtask integration).

Suggested reviewers

  • dyc3
  • ematipico
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
  • Title check — ✅ Passed: The title accurately summarises the main objective: improving CommonMark conformance and restructuring the parser, which aligns with the substantial changes across the codebase.
  • Description check — ✅ Passed: The description comprehensively covers the conformance improvements, structural refactoring, dependency additions, and test plan, clearly relating to the extensive changeset across parser modules and dependencies.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
crates/biome_markdown_parser/src/syntax/list.rs (2)

307-440: Indent-aware blank-line lookahead is still missing for nested lists.
skip_blank_lines_between_items uses a non-indent-aware predicate, so nested lists with blank lines can terminate early when the next marker is indented >3 spaces. Consider threading the marker indent into the predicate (and using the new indent-aware helper).

🔧 Suggested fix
-fn parse_list_element_common<M, FMarker, FParse>(
+fn parse_list_element_common<M, FMarker, FParse, FHas>(
     p: &mut MarkdownParser,
     marker_state: &mut Option<M>,
     current_marker: FMarker,
     parse_item: FParse,
-    has_item_after_blank_lines: fn(&mut MarkdownParser) -> bool,
+    has_item_after_blank_lines: FHas,
     is_tight: &mut bool,
     last_item_ends_with_blank: &mut bool,
 ) -> ParsedSyntax
 where
     FMarker: Fn(&mut MarkdownParser) -> Option<M>,
     FParse: Fn(&mut MarkdownParser) -> (ParsedSyntax, ListItemBlankInfo),
+    FHas: Fn(&mut MarkdownParser) -> bool,
 {
-        parse_list_element_common(
+        parse_list_element_common(
             p,
             &mut self.marker_kind,
             current_bullet_marker,
             parse_bullet,
-            has_bullet_item_after_blank_lines,
+            |p| has_bullet_item_after_blank_lines_at_indent(p, self.marker_indent),
             &mut self.is_tight,
             &mut self.last_item_ends_with_blank,
         )

1478-1683: Guard first-line block parsing when the item should be an indented code block.
When spaces_after_marker >= 4, CommonMark treats the line as an indented code block; the current first-line checks can still parse ATX headings/quotes/thematic breaks. Skipping those checks in that case prevents misclassification (e.g., - # heading should be code, not a heading).

🛠️ Suggested guard
-        if first_line {
+        if first_line && spaces_after_marker <= INDENT_CODE_BLOCK_SPACES {
             enum NestedListMarker {
                 Bullet,
                 Ordered,
             }
             ...
         }
crates/biome_markdown_parser/src/syntax/fenced_code_block.rs (1)

266-339: Closing fences inside block quotes won’t be recognised.
Quote prefixes are stripped before at_closing_fence, but line_has_closing_fence insists the prefix is whitespace‑only. That means > markers make the check fail, so quoted fences never close and will emit unterminated diagnostics. Please allow block‑quote markers in the prefix or skip them in line_has_closing_fence.

💡 One way to adjust `line_has_closing_fence` (below this block)
 fn line_has_closing_fence(p: &MarkdownParser, is_tilde_fence: bool, fence_len: usize) -> bool {
     let Some((start, source)) = get_source_context(p) else {
         return false;
     };

     let line_start = find_line_start(&source[..start]);

-    if !is_whitespace_prefix(source, start, line_start) {
+    let quote_depth = p.state().block_quote_depth;
+    if quote_depth == 0 && !is_whitespace_prefix(source, start, line_start) {
         return false;
     }

     let list_indent = p.state().list_item_required_indent;

     // Skip required list indent (must have enough whitespace)
     let Some(idx) = consume_indent(source.as_bytes(), line_start, list_indent, true) else {
         return false;
     };

     // Skip optional extra indent (up to 3 spaces per CommonMark)
     // This always succeeds since required=false
-    let idx = consume_indent(source.as_bytes(), idx, 3, false).unwrap();
+    let mut idx = consume_indent(source.as_bytes(), idx, 3, false).unwrap();

+    // Skip block-quote markers
+    let bytes = source.as_bytes();
+    for _ in 0..quote_depth {
+        if bytes.get(idx) != Some(&b'>') {
+            return false;
+        }
+        idx += 1;
+        if matches!(bytes.get(idx), Some(b' ' | b'\t')) {
+            idx += 1;
+        }
+    }

     let fence_char = if is_tilde_fence { b'~' } else { b'`' };
🤖 Fix all issues with AI agents
In `@Cargo.toml`:
- Line 164: The utf8proc dependency breaks wasm32-unknown-unknown because it
wraps C; make it optional and feature-gated or replace it with a pure-Rust
crate: in Cargo.toml mark utf8proc = { version = "0.1.2", optional = true } and
add a feature like "with-utf8proc", or move utf8proc under a non-wasm target
section (e.g. [target.'cfg(not(target_arch = "wasm32"))'.dependencies]); then
update any code that directly calls into utf8proc to be behind cfg(feature =
"with-utf8proc")/cfg(not(target_arch = "wasm32")) guards and provide an
implementation using a pure-Rust alternative (e.g. unicode-normalization or
unicode-width) when the feature is disabled or on wasm. Ensure Cargo features
and conditional compilation are consistent so wasm builds never require the
utf8proc C library.

In `@crates/biome_markdown_parser/src/syntax/thematic_break_block.rs`:
- Around line 41-109: is_thematic_break_pattern currently only accepts
MD_TEXTUAL_LITERAL when its text length is exactly 1, which rejects tokens like
"***" or "* * *"; update is_thematic_break_pattern to treat multi-char
MD_TEXTUAL_LITERAL tokens as sequences of break characters and optional spaces:
when the initial token is MD_TEXTUAL_LITERAL with length>1, verify its
characters are only the candidate break char or whitespace and increment the
count by the number of break characters it contains (treating internal spaces as
skipped), and in the main loop when encountering MD_TEXTUAL_LITERAL with
length>1, scan its chars to accumulate more break characters (and ensure no
invalid chars appear) before calling p.bump_any(); keep existing checks for
T![*], T![-], UNDERSCORE and ensure final condition still requires count >= 3
and line end (NEWLINE or EOF).
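As a standalone sketch of the rule this lookahead must implement (assumed logic per the CommonMark spec, not the parser's token-based code — in the real parser "***" may arrive as a single literal token): up to three leading spaces, then three or more of the same -, _, or * character with only spaces or tabs in between:

```rust
// Sketch of CommonMark's thematic-break test over a raw line (illustrative).
fn is_thematic_break(line: &str) -> bool {
    let mut chars = line.chars().peekable();
    // At most 3 leading spaces; 4+ would make this an indented code block.
    let mut leading = 0;
    while chars.peek() == Some(&' ') {
        leading += 1;
        if leading > 3 {
            return false;
        }
        chars.next();
    }
    // The first non-space character fixes the break character.
    let marker = match chars.peek() {
        Some(&c) if matches!(c, '-' | '_' | '*') => c,
        _ => return false,
    };
    let mut count = 0;
    for c in chars {
        if c == marker {
            count += 1;
        } else if c != ' ' && c != '\t' {
            return false; // any other character invalidates the break
        }
    }
    count >= 3
}

fn main() {
    assert!(is_thematic_break("***"));
    assert!(is_thematic_break("* * *"));
    assert!(is_thematic_break("   ---"));
    assert!(!is_thematic_break("    ---")); // 4 spaces: indented code
    assert!(!is_thematic_break("--"));
    assert!(!is_thematic_break("-*-"));
}
```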
🧹 Nitpick comments (14)
crates/biome_markdown_parser/tests/md_test_suite/ok/paren_depth_limit.md (1)

1-1: Nice stress test for parenthesis nesting.

This effectively exercises the depth limit (32 levels) mentioned in the PR objectives. Consider adding a brief comment at the top explaining the test's intent, and perhaps a companion test in the err/ folder that exceeds the limit to validate the boundary behaviour.

crates/biome_string_case/Cargo.toml (1)

16-16: Consider feature-gating utf8proc.

Since utf8proc adds a native C library dependency, you might want to make it optional behind a feature flag (e.g., casefold) so consumers who don't need Unicode case folding can avoid the dependency.

♻️ Suggested feature-gating
 [dependencies]
 biome_rowan = { workspace = true, optional = true }
-utf8proc    = { workspace = true }
+utf8proc    = { workspace = true, optional = true }

 [features]
 biome_rowan = ["dep:biome_rowan"]
+casefold = ["dep:utf8proc"]
crates/biome_string_case/src/lib.rs (1)

668-669: Add a doctest for to_casefold_cow.

The trait method would benefit from inline documentation with a doctest example, consistent with other methods in this trait.

📝 Suggested documentation
+    /// Returns Unicode case-folded text as a Cow, allocating only when needed.
+    ///
+    /// ```
+    /// use biome_string_case::StrOnlyExtension;
+    ///
+    /// assert_eq!("ss", "ß".to_casefold_cow());
+    /// assert_eq!("test", "test".to_casefold_cow());
+    /// ```
     fn to_casefold_cow(&self) -> Cow<'_, Self>;
crates/biome_unicode_table/src/punctuation.rs (1)

24-374: Add a reference comment for table verification.

Per the PR discussion, this table would benefit from a comment indicating how to regenerate or verify it against the source (e.g., Unicode version, markdown-rs commit, or a script).

📝 Suggested comment
+// Generated from Unicode 15.1 General_Category P* and S* ranges.
+// Source: markdown-rs punctuation table (commit: <hash>)
+// To regenerate, see: <link or instructions>
 const PUNCTUATION_RANGES: &[(char, char)] = &[
crates/biome_markdown_parser/src/syntax/reference.rs (1)

23-38: Avoid an extra allocation when case‑folding.
to_casefold_cow().into_owned() clones even when the case‑fold is unchanged. You can keep the existing out when it’s already folded.

♻️ Suggested tweak
-    out.as_str().to_casefold_cow().into_owned()
+    match out.as_str().to_casefold_cow() {
+        std::borrow::Cow::Borrowed(_) => out,
+        std::borrow::Cow::Owned(folded) => folded,
+    }
crates/biome_markdown_parser/src/syntax/html_block.rs (4)

40-49: Consider adding brief doc comments to enum variants.

Whilst the type names (Type1–Type7) map to CommonMark §4.6 directly, a one-liner per variant would help readers unfamiliar with the spec. Optional, but helpful.


151-153: Minor: split_once returns the separator-stripped halves.

This is correct for your use case. Just noting that CRLF (\r\n) lines will leave a trailing \r on line because \n is matched first. If you want to strip that too, consider adding .trim_end_matches('\r') or splitting on ['\n', '\r'] more carefully. Likely a non-issue for the block-kind check, but worth a thought.


325-349: line is rebuilt per-token, allocating repeatedly.

Per reviewer feedback in the PR summary, storing owned Strings in parser state or loops is a concern. Here, the loop accumulates tokens into line and clears it at each newline. For large HTML blocks, this can be allocation-heavy.

A lighter alternative is to track byte ranges and slice the source directly when checking for the terminator. Not blocking, but worth considering for a follow-up optimisation.


351-373: Manual case-insensitive search is correct but could be simplified.

Rust's str::to_ascii_lowercase() or using the unicase crate would make this more readable. The current byte-level iteration is fine for ASCII terminators, but if you ever need Unicode awareness, this will need revisiting.

crates/biome_markdown_parser/src/syntax/inline/code_span.rs (1)

93-161: Potential code duplication with list detection in syntax/mod.rs.

at_list_marker_after_newline replicates logic from at_bullet_list_item, at_order_list_item, and textual_looks_like_list_marker. Consider extracting a shared helper to avoid drift between these implementations.

crates/biome_markdown_parser/src/syntax/inline/html.rs (1)

147-159: skip_spaces closure always returns Some(bool) — the Option wrapper is unnecessary.

The closure never returns None; it always yields Some(skipped). You could simplify the signature to just return bool and remove the ? operator at call sites.

Suggested simplification
-    let skip_spaces = |i: &mut usize| -> Option<bool> {
+    let skip_spaces = |i: &mut usize| -> bool {
         let mut skipped = false;
         while *i < bytes.len() {
             match bytes[*i] {
                 b' ' | b'\t' | b'\n' | b'\r' | b'\x0c' => {
                     skipped = true;
                     *i += 1;
                 }
                 _ => break,
             }
         }
-        Some(skipped)
+        skipped
     };
crates/biome_markdown_parser/src/syntax/header.rs (1)

191-263: header_content_source_len lookahead is thorough but dense.

The function handles trailing hashes, whitespace, and hard-line literals. Consider adding a brief inline comment at the start of each major branch to aid readability.

crates/biome_markdown_parser/src/syntax/mod.rs (1)

1123-1182: is_quote_only_blank_line_from_source scans raw source for quote markers.

This is necessary because token-based checks can't see consumed trivia. The byte-level iteration is correct, though somewhat verbose. Consider a brief comment explaining why source scanning is required here.

crates/biome_markdown_parser/src/to_html.rs (1)

1520-1562: percent_encode_uri preserves existing percent-encoded sequences.

The logic at lines 1525–1537 detects %XX patterns and passes them through unmodified. This prevents double-encoding. However, the check bytes[i + 1].is_ascii_alphanumeric() is slightly loose — valid hex digits are [0-9A-Fa-f], not all alphanumerics. You could use bytes[i + 1].is_ascii_hexdigit() for stricter validation.

Stricter hex digit check
-                && bytes[i + 1].is_ascii_alphanumeric()
-                && bytes[i + 2].is_ascii_alphanumeric()
+                && bytes[i + 1].is_ascii_hexdigit()
+                && bytes[i + 2].is_ascii_hexdigit()
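A self-contained sketch of the encoding rule under discussion (illustrative, not Biome's actual renderer; the safe-character set here is an assumption): valid %HH sequences with true hex digits pass through untouched, a bare % becomes %25, and other unsafe bytes are percent-encoded:

```rust
fn percent_encode_uri(value: &str) -> String {
    // Assumed set of punctuation bytes left unencoded, for illustration.
    const SAFE: &str = "-_.~!$&'()*+,;=:/?#[]@";
    let bytes = value.as_bytes();
    let mut out = String::new();
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        if b == b'%' {
            // Preserve only valid %HH: strict hex digits, not all alphanumerics.
            if i + 2 < bytes.len()
                && bytes[i + 1].is_ascii_hexdigit()
                && bytes[i + 2].is_ascii_hexdigit()
            {
                out.push_str(std::str::from_utf8(&bytes[i..i + 3]).unwrap());
                i += 3;
                continue;
            }
            out.push_str("%25");
        } else if b.is_ascii_alphanumeric() || SAFE.contains(b as char) {
            out.push(b as char);
        } else {
            out.push_str(&format!("%{b:02X}"));
        }
        i += 1;
    }
    out
}

fn main() {
    // The emoji is encoded byte-by-byte, %20 survives, and the trailing
    // bare % becomes %25.
    assert_eq!(
        percent_encode_uri("https://a\u{1F44D}b.c/%20/%"),
        "https://a%F0%9F%91%8Db.c/%20/%25"
    );
    // %GH is not a valid sequence, so only the % is encoded.
    assert_eq!(
        percent_encode_uri("https://example.com/%GH"),
        "https://example.com/%25GH"
    );
}
```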

@ematipico
Copy link
Member

to_html.rs: This exists specifically as a test harness for CommonMark spec compliance. CommonMark provides test cases as pairs (markdown, expected_html), so that's how we validate correctness. Would you rather we move this to tests/ or gate it behind a feature flag?

That's insightful. I didn't know that; thank you for letting me know. I mean, eventually we will need a way to transform markdown to HTML, so maybe for now we can gate it behind a feature flag of the crate biome_markdown_parser, so that in the future, when we implement proper transformation, we can still validate it against the conformance suite.

An alternative could be to store TextRange and normalize on lookup. Do you have a preference here?

I think yes, we can normalise on lookup. We will need to set up some benchmarks, so we can see if these optimisations have an actual impact. Unfortunately, our current benchmarks don't measure memory usage.

We could feature-gate it for non-WASM, write a pure Rust crate, or make a simplified inline table but that may break on edge cases. Do you have a preference?

CodeRabbit here #8908 (comment) suggests the use of WASM-friendly crates. We already use unicode-width internally; however, it's possible we will need the other crate.

Switch the test HTML renderer to a preorder visitor and consolidate list/quote handling in a stateful renderer.
Extract list-marker follow checks and document marker rules.
Avoid an extra allocation in reference label normalization by reusing the
already-built string when casefolding is a no-op.
Move reference definition collection into a dedicated module and
use trimmed label text before normalization.
Match lookahead validation with parser behavior by accepting single-token
break patterns and reusing the minimum-break constant.
Use utf8proc only on non-wasm builds and fall back to Unicode lowercasing on wasm to keep builds working.
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/biome_markdown_parser/src/syntax/link_block.rs (1)

238-336: ⚠️ Potential issue | 🟠 Major

Bare destination lookahead currently allows whitespace inside raw URLs.
When a whitespace token appears after content in the bare-destination path, the scan continues and can swallow trailing text. That makes inputs like [label]: foo bar look valid in lookahead, while parse_link_destination stops at the first space — the two paths diverge.

A safer approach is to stop the destination scan after the first separator whitespace and let the outer logic decide whether a title starts or trailing text invalidates the definition.

🛠️ Suggested fix
-            if text.chars().all(|c| c == ' ' || c == '\t') && !text.is_empty() {
-                if has_content {
-                    saw_separator = true;
-                }
-                p.bump_link_definition();
-                continue;
-            }
+            if text.chars().all(|c| c == ' ' || c == '\t') && !text.is_empty() {
+                if has_content {
+                    saw_separator = true;
+                    // Consume separator whitespace, then stop scanning destination
+                    p.bump_link_definition();
+                    while !p.at(EOF)
+                        && !p.at(NEWLINE)
+                        && p.cur_text().chars().all(|c| c == ' ' || c == '\t')
+                        && !p.cur_text().is_empty()
+                    {
+                        p.bump_link_definition();
+                    }
+                    break;
+                }
+                p.bump_link_definition();
+                continue;
+            }
🤖 Fix all issues with AI agents
In `@crates/biome_markdown_parser/src/syntax/html_block.rs`:
- Around line 249-265: at_html_block_interrupt currently lacks the ≤3-space
indent guard, so indented code can be misdetected as an HTML block; update
at_html_block_interrupt to early-return false when the current line is indented
more than 3 spaces (mirror the guard used in at_html_block) before calling
html_block_kind, i.e. add the same indentation check used in at_html_block
(using the parser's indentation/column API) so html_block_kind is only consulted
for lines with ≤3 leading spaces.

In `@crates/biome_markdown_parser/src/syntax/mod.rs`:
- Around line 763-784: The function real_line_indent_from_source currently finds
the line start by rfind('\n') only, which fails for CR-only or CRLF line
endings; update the logic that computes line_start to look for the last line
break char (either '\n' or '\r')—e.g., rfind on a predicate c == '\n' || c ==
'\r' (or compute the last index of either and take the greater), and ensure CRLF
is treated as a single break when computing the start (if you find '\n' and the
preceding byte is '\r', use the '\r' index + 1 consistently); keep the remaining
column counting over source[line_start..].chars() unchanged and reference
p.source().source_text() and p.cur_range().start() to locate the position.

In `@crates/biome_markdown_parser/src/syntax/thematic_break_block.rs`:
- Around line 59-83: The fast-path that handles a single MD_TEXTUAL_LITERAL
currently returns true whenever the token text contains only break
chars/whitespace, but it must also verify the token ends the line; update the
logic in the thematic break detector (the block around the
p.at(MD_TEXTUAL_LITERAL) check) to perform a lookahead: consume the current
MD_TEXTUAL_LITERAL token, skip any following SPACE/TAB tokens, and assert the
next token is NEWLINE or EOF before returning true, mirroring the same
end-of-line guard used in the remap fast path (also update the analogous
fast-path at the later block around lines ~168-176).

In `@crates/biome_markdown_parser/src/to_html.rs`:
- Around line 1531-1546: The percent-encoding validator erroneously accepts
non-hex chars because it uses bytes[i + 1].is_ascii_alphanumeric() and bytes[i +
2].is_ascii_alphanumeric(); update both checks to bytes[i +
1].is_ascii_hexdigit() and bytes[i + 2].is_ascii_hexdigit() in the loop that
iterates value.char_indices(), keeping the surrounding logic that pushes the
preserved 3-byte slice (result.push_str(&value[i..i + 3])) and updates last = i
+ 3; (refer to variables value, bytes, i, last, and result) so only valid %HH
sequences are preserved.
🧹 Nitpick comments (6)
crates/biome_markdown_parser/src/to_html.rs (4)

237-240: Consider adding a clarifying comment for this edge case.

The condition quote_indent > 1 && idx > 0 && remaining.len() > 1 followed by stripping a single character is non-obvious. A brief comment explaining when this path is taken (presumably for lazy continuation lines in nested quotes?) would aid future maintainers.


500-509: Suppressed node check is O(n) per enter call.

The linear scan through suppressed_inline_nodes on every enter() could become expensive for documents with many inline links. For a test harness this is acceptable, but production use would benefit from a HashSet<SyntaxNode> for O(1) lookups.


1866-1871: Test doesn't catch the hex digit validation bug.

The test uses %20 which contains valid hex digits, so it passes despite the is_ascii_alphanumeric() bug. Consider adding a test case with an invalid percent sequence like %GH to verify correct encoding.

🧪 Suggested additional test case
     #[test]
     fn test_percent_encode_uri() {
         let input = format!("https://a{}b.c/%20/%", '\u{1F44D}');
         let encoded = percent_encode_uri(&input);
         assert_eq!(encoded, "https://a%F0%9F%91%8Db.c/%20/%25");
+
+        // Invalid percent sequence should have % encoded
+        let invalid = "https://example.com/%GH";
+        let encoded = percent_encode_uri(invalid);
+        assert_eq!(encoded, "https://example.com/%25GH");
     }

1314-1318: Minor: Redundant clone.

text.clone() at line 1317 could be avoided since text is already owned. You could restructure to return (text.clone(), Some(text)) or use (text, None) patterns, but this is a minor nit.

crates/biome_markdown_parser/src/syntax/inline/emphasis.rs (1)

238-277: Potential quadratic scan in the link‑bracket prepass.
The per‑position check_bracket_pattern plus linear contains lookups can go O(n²) on long inputs. Consider a single‑pass bracket matcher with O(1) start lookups to keep this linear.
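The single-pass alternative could look like this (a hypothetical sketch over plain text, ignoring backslash escapes for brevity; the parser's version would work on tokens): one stack pairs each [ with its ] in O(n), and the resulting pair list supports O(1) lookups by start offset:

```rust
// Single-pass bracket matcher: O(n) over the input, one stack, no rescans.
fn match_brackets(s: &str) -> Vec<(usize, usize)> {
    let mut stack = Vec::new();
    let mut pairs = Vec::new();
    for (i, c) in s.char_indices() {
        match c {
            '[' => stack.push(i),
            ']' => {
                // Each close pairs with the most recent unmatched open.
                if let Some(open) = stack.pop() {
                    pairs.push((open, i));
                }
            }
            _ => {}
        }
    }
    pairs // unmatched opens left on the stack are simply dropped
}

fn main() {
    // Inner pair closes first, then the outer one.
    assert_eq!(match_brackets("[a[b]c]"), vec![(2, 4), (0, 6)]);
    // Stray closers and unmatched openers produce no pairs.
    assert_eq!(match_brackets("]["), vec![]);
}
```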

crates/biome_markdown_parser/src/syntax/inline/links.rs (1)

494-847: Consider de‑duplicating link helper routines.
bump_textual_link_def, is_whitespace_token, get_title_close_char, and parse_title_content mirror link_block.rs. A shared helper module would reduce drift long‑term.

Comment on lines +763 to +784
/// Compute the real leading indent of the current line from source text.
/// This is needed because leading whitespace may have been consumed as trivia
/// in list item context, making `line_start_leading_indent()` return 0.
/// Token-based lookahead cannot recover the original column once trivia is skipped.
fn real_line_indent_from_source(p: &MarkdownParser) -> usize {
    let source = p.source().source_text();
    let pos: usize = p.cur_range().start().into();

    // Find the start of the current line
    let line_start = source[..pos].rfind('\n').map_or(0, |i| i + 1);

    // Count leading whitespace columns on this line
    let mut column = 0;
    for c in source[line_start..].chars() {
        match c {
            ' ' => column += 1,
            '\t' => column += 4 - (column % 4),
            _ => break,
        }
    }
    column
}

⚠️ Potential issue | 🟡 Minor

Handle CR-only line endings when computing real indent.
real_line_indent_from_source only searches for \n, so files using \r as line breaks will treat the whole file as one line and miscompute indentation.

🔧 Suggested fix
-    let line_start = source[..pos].rfind('\n').map_or(0, |i| i + 1);
+    let line_start = source[..pos]
+        .rfind(|c| c == '\n' || c == '\r')
+        .map_or(0, |i| i + 1);
🤖 Prompt for AI Agents
In `@crates/biome_markdown_parser/src/syntax/mod.rs` around lines 763 - 784, The
function real_line_indent_from_source currently finds the line start by
rfind('\n') only, which fails for CR-only or CRLF line endings; update the logic
that computes line_start to look for the last line break char (either '\n' or
'\r')—e.g., rfind on a predicate c == '\n' || c == '\r' (or compute the last
index of either and take the greater), and ensure CRLF is treated as a single
break when computing the start (if you find '\n' and the preceding byte is '\r',
use the '\r' index + 1 consistently); keep the remaining column counting over
source[line_start..].chars() unchanged and reference p.source().source_text()
and p.cur_range().start() to locate the position.

Cover whitespace normalization and escape preservation for reference labels.
…e folding

Replace the C library utf8proc with pure-Rust caseless crate for Unicode
case folding. This ensures consistent behavior across all platforms
including WebAssembly, which is a first-class target for Biome.

- Use caseless::default_case_fold_str for full Unicode case folding
- Optimize to avoid allocation when string is unchanged
- Remove platform-specific cfg gates for WASM
- Update reference.rs comment and fix clippy lint
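The gap that full case folding closes can be shown with std alone. This sketch does not use the `caseless` crate; it only demonstrates why `to_lowercase()` is insufficient for CommonMark reference labels: folding maps `ß` to `ss`, but lowercasing leaves `ß` unchanged.

```rust
// Why simple lowercasing is not enough for reference-label matching:
// a label "Straße" and a definition "STRASSE" should match under full
// Unicode case folding, but do NOT match under to_lowercase().
fn main() {
    let label = "Straße";
    let definition = "STRASSE";

    // Lowercasing does not make them equal: "straße" vs "strasse".
    assert_ne!(label.to_lowercase(), definition.to_lowercase());

    // Because 'ß' uppercases to "SS", uppercasing shows the intended match.
    assert_eq!(label.to_uppercase(), "STRASSE");
    println!("ok");
}
```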
Skip HTML block interrupts when line indent exceeds 3 spaces.
Ensure single-token thematic breaks only match when the line ends,
and guard the remap fast path similarly.
Only preserve %HH sequences when both digits are hex.
@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@crates/biome_markdown_parser/src/syntax/html_block.rs`:
- Around line 260-268: The pattern in the matches! call incorrectly uses struct
syntax for the tuple variant HtmlBlockKind::Type1; update the pattern to use
tuple syntax (e.g., HtmlBlockKind::Type1(_) or HtmlBlockKind::Type1(..)) so the
match compiles, leaving the other variants (Type2..Type6) unchanged; this change
should be applied where the matches!(kind, ...) invocation is defined.
🧹 Nitpick comments (5)
crates/biome_markdown_parser/src/syntax/thematic_break_block.rs (3)

44-146: Consider consolidating with list.rs::is_thematic_break_pattern.

This file and crates/biome_markdown_parser/src/syntax/list.rs (lines 166-193) both define is_thematic_break_pattern with identical signatures but different implementations—this one operates on tokens while the other operates on raw source text. Given the PR's goal of improving maintainability, consolidating these into a single shared utility (or clearly documenting why both are needed) would reduce duplication and potential drift.


189-191: Silent early return could mask upstream bugs.

If has_eol is false here, the function returns without emitting any token, leaving the AST potentially incomplete. While this "shouldn't happen" if at_thematic_break_block is correct, a debug_assert! would help catch any future regressions.

🛡️ Suggested defensive assertion
         if !has_eol {
+            debug_assert!(false, "parse_thematic_break_tokens called but EOL check failed");
             return;
         }

168-250: Optional: extract shared logic with is_thematic_break_pattern.

This function duplicates much of the whitespace-skipping, break-character detection, and EOL-checking logic from is_thematic_break_pattern. If you find yourself modifying one, you'll likely need to mirror changes in the other. A small internal helper for the shared patterns could reduce this burden—though given the complexity, deferring this is reasonable.

crates/biome_markdown_parser/src/syntax/html_block.rs (1)

8-9: Module documentation is now outdated.

The note claims all HTML blocks terminate at blank lines, but the implementation below correctly handles specific terminators for Type1–Type5 (e.g., </script>, -->, ?>, >, ]]>). Worth updating to reflect the improved CommonMark compliance.

Suggested update
-//! **Note:** Not 100% CommonMark §4.6 compliant. All HTML blocks terminate at
-//! blank lines, whereas CommonMark types 1-5 have specific terminators (`-->`, `?>`, etc.).
+//! Implements CommonMark §4.6 HTML block types with appropriate terminators:
+//! Type 1-5 use specific terminators, Type 6-7 terminate at blank lines.
crates/biome_markdown_parser/src/to_html.rs (1)

254-326: Consider feature-gating this test harness API.
The module says “test harness”, but HtmlRenderContext/document_to_html are public in the main crate. A feature flag or moving the harness under tests would avoid shipping a non‑production API.

Comment on lines +260 to +268
matches!(
    kind,
    HtmlBlockKind::Type1 { .. }
        | HtmlBlockKind::Type2
        | HtmlBlockKind::Type3
        | HtmlBlockKind::Type4
        | HtmlBlockKind::Type5
        | HtmlBlockKind::Type6
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find HtmlBlockKind definition
fd "\.rs$" crates/biome_markdown_parser/src | head -20

Repository: biomejs/biome

Length of output: 1098


🏁 Script executed:

# Search for HtmlBlockKind definition
rg "enum HtmlBlockKind|Type1" crates/biome_markdown_parser/src --type rs -B 2 -A 10

Repository: biomejs/biome

Length of output: 83


🏁 Script executed:

# Check if the file compiles
cd crates/biome_markdown_parser && cargo check 2>&1 | grep -A 5 "error\|warning" | head -50

Repository: biomejs/biome

Length of output: 143


🏁 Script executed:

# Search for HtmlBlockKind enum definition without type flag
rg "enum HtmlBlockKind" crates/biome_markdown_parser/src -A 15

Repository: biomejs/biome

Length of output: 1546


🏁 Script executed:

# Read the html_block.rs file to see context around line 260
cat -n crates/biome_markdown_parser/src/syntax/html_block.rs | sed -n '1,50p'

Repository: biomejs/biome

Length of output: 2217


🏁 Script executed:

# Check the specific lines mentioned in the review (260-268)
cat -n crates/biome_markdown_parser/src/syntax/html_block.rs | sed -n '250,270p'

Repository: biomejs/biome

Length of output: 760


Fix pattern syntax for tuple variant Type1.

HtmlBlockKind::Type1 is a tuple variant Type1(Type1Tag), but line 262 uses struct pattern syntax { .. }. For tuple variants, use Type1(_) instead.

Fix
         matches!(
             kind,
-            HtmlBlockKind::Type1 { .. }
+            HtmlBlockKind::Type1(_)
                 | HtmlBlockKind::Type2
                 | HtmlBlockKind::Type3
                 | HtmlBlockKind::Type4
                 | HtmlBlockKind::Type5
                 | HtmlBlockKind::Type6
         )
🤖 Prompt for AI Agents
In `@crates/biome_markdown_parser/src/syntax/html_block.rs` around lines 260 -
268, The pattern in the matches! call incorrectly uses struct syntax for the
tuple variant HtmlBlockKind::Type1; update the pattern to use tuple syntax
(e.g., HtmlBlockKind::Type1(_) or HtmlBlockKind::Type1(..)) so the match
compiles, leaving the other variants (Type2..Type6) unchanged; this change
should be applied where the matches!(kind, ...) invocation is defined.

@jfmcdowell jfmcdowell requested a review from ematipico January 31, 2026 03:28
Move percent-encoding to biome_markdown_parser and caseless to biome_string_case.
Return Cow from normalize_reference_label to skip allocations for normalized
labels and avoid extra string copies when case folding is a no-op.
@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@crates/biome_markdown_parser/src/to_html.rs`:
- Around line 1595-1663: The alt-text extractor is double-escaping characters
because extract_alt_text_inline calls render_textual and also uses
escape_html/escape_html_attribute later; update extract_alt_text_inline so it
collects raw text (use collect_raw_inline_text or equivalent) instead of calling
render_textual in the AnyMdInline::MdTextual branch, and stop pre-escaping in
the MdInlineCode and MdAutolink branches (append the raw collected content after
any trimming instead of calling escape_html); leave final escaping to the code
that writes the alt attribute (escape_html_attribute) so alt text is escaped
exactly once.
🧹 Nitpick comments (1)
crates/biome_markdown_parser/src/syntax/inline/emphasis.rs (1)

274-274: Consider using a HashSet for link_bracket_starts lookups.

The contains(&i) call on a Vec is O(n) per lookup. For documents with many brackets, this could degrade performance. A HashSet<usize> would provide O(1) lookups.

♻️ Suggested refactor
-    let mut link_bracket_starts = Vec::new();
+    let mut link_bracket_starts = std::collections::HashSet::new();
     for pos in 0..bytes.len() {
         if bytes[pos] == b'['
             && let Some(result) = check_bracket_pattern(bytes, pos)
         {
             if result.is_inline_or_full_ref {
                 // Inline link or full reference link
-                link_bracket_starts.push(pos);
+                link_bracket_starts.insert(pos);
             } else {
                 // Could be a shortcut reference - check if definition exists
                 let label = extract_label_text(source, pos, result.close_pos);
                 let normalized = normalize_reference_label(label);
                 if !normalized.is_empty() && reference_checker(normalized.as_ref()) {
-                    link_bracket_starts.push(pos);
+                    link_bracket_starts.insert(pos);
                 }
             }
         }
     }

`extract_alt_text_inline` was escaping HTML characters, but the result was also passed through `escape_html_attribute` when writing the alt attribute. This caused characters like `&` to become `&amp;amp;`.

Now `extract_alt_text_inline` collects raw text without escaping:
- `MdTextual`: process backslash escapes only, no HTML escape
- `MdInlineCode`: append raw content without `escape_html`
- `MdAutolink`: append raw content without `escape_html`
- `MdEntityReference`: decode entity without `escape_html`

Also escape alt text when writing unresolved reference images as
literal text.
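The double-escape failure mode is easy to reproduce in isolation. This is a minimal sketch with a toy `escape_html` (not the renderer's actual function): escaping while collecting and again at the write site turns `&` into `&amp;amp;`, hence the rule of escaping exactly once, where the attribute is written.

```rust
// Demonstrates the double-escape bug: applying HTML escaping twice.
fn escape_html(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

fn main() {
    let raw = "a & b";
    let once = escape_html(raw);
    let twice = escape_html(&once); // what the old code effectively did
    assert_eq!(once, "a &amp; b");
    assert_eq!(twice, "a &amp;amp; b");
    println!("ok");
}
```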
@ematipico (Member)

@jfmcdowell

Will merge this, however please note that myself and @tidefield found an issue here

AnyLeafBlock =
    MdThematicBreakBlock
    | MdHeader
    | MdSetextHeader
    | AnyCodeBlock
    | MdHtmlBlock
    | MdLinkReferenceDefinition
    | MdLinkBlock
    | MdParagraph
    | MdNewline

AnyContainerBlock =
    MdQuote
    | MdBulletListItem
    | MdOrderedListItem

These two nodes don't follow the internal naming convention, and they fail in some parts of the codegen of the formatter. Can you send a PR to fix it? Or @tidefield will do it in their PR

@ematipico ematipico merged commit 38c3fd7 into biomejs:next Feb 6, 2026
31 checks passed
@tidefield (Contributor)

@ematipico I'll do it and include it in my current PR.

@jfmcdowell (Contributor, Author)

> @ematipico I'll do it and include it in my current PR.

@ematipico / @tidefield thanks for the call out and assist. Let me know if you need any help @tidefield

Labels

A-CLI (Area: CLI), A-Parser (Area: parser), A-Tooling (Area: internal tools), L-JavaScript (Language: JavaScript and super languages), L-JSON (Language: JSON and super languages)
