Implement documentation parser #103
snelusha wants to merge 18 commits into ballerina-platform:main from
Conversation
Force-pushed 06f9272 to 3fd5e13
fa23cc0 to
90d1fc1
Compare
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Adds a full Markdown documentation pipeline: a documentation lexer + parser, integrates it into the main parser and AST builder, attaches Markdown docs to functions/constants, extends the pretty-printer to emit docs, adjusts the lexer/tokenizer API, and adds many documentation-focused corpus and test assets.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant MainParser as Main Parser
    participant DocLexer as Documentation Lexer
    participant TokenReader as TokenReader
    participant DocParser as Documentation Parser
    participant ASTBuilder as AST Builder
    participant PrettyPrinter as Pretty Printer
    MainParser->>DocLexer: newDocumentationLexer(CharReader, leadingTrivia, diagnostics)
    DocLexer->>TokenReader: emit documentation tokens
    MainParser->>TokenReader: CreateTokenReader(lexer, debugCtx)
    MainParser->>DocParser: NewDocumentationParser(TokenReader, dbg)
    MainParser->>DocParser: Parse() -> markdown ST node(s)
    DocParser->>ASTBuilder: return STNode tree (markdown nodes)
    ASTBuilder->>ASTBuilder: createMarkdownDocumentationAttachment(...)
    ASTBuilder->>ASTBuilder: attach docs to function/constant nodes
    ASTBuilder->>PrettyPrinter: pass AST with documentation nodes
    PrettyPrinter->>PrettyPrinter: printMarkdownDocumentation() -> output
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed (2 warnings)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Force-pushed 2e145ac to 3aa9182
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
ast/pretty_printer.go (1)
728-738: ⚠️ Potential issue | 🟠 Major

Remove the extra indent decrement in printErrorTypeNode.

indentLevel is decremented without a matching increment, which can make indentation negative and corrupt subsequent output formatting.

🔧 Suggested fix

```diff
 if !node.IsTop() {
 	p.indentLevel++
 	p.PrintInner(node.detailType.TypeDescriptor.(BLangNode))
 	p.indentLevel--
 }
-p.indentLevel--
 p.endNode()
```
🤖 Fix all issues with AI agents
In `@ast/node_builder.go`:
- Around line 2905-2939: In transformCodeBlock on NodeBuilder, keep code block
line breaks when flattening: when iterating codeLines (variable codeLines and
codeLine.CodeDescription()), append a newline (e.g., "\n") after each
codeDescription.Text() (except possibly the last line) and ensure you insert a
newline before appending endBacktick.Text(); adjust the docText construction
around startBacktick, codeLines loop, and endBacktick so multi-line code blocks
preserve their line breaks in the resulting bLangDocLine.Text.
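A minimal sketch of the suggested flattening, using strings.Builder and plain strings in place of the real STNode/code-line types (flattenCodeBlock and its parameters are hypothetical names, not the node-builder API):

```go
package main

import (
	"fmt"
	"strings"
)

// flattenCodeBlock joins code-block lines with explicit newlines so that
// multi-line snippets keep their line breaks when collapsed into a single
// doc-line text. startBacktick and endBacktick stand in for the delimiter
// token texts.
func flattenCodeBlock(startBacktick string, codeLines []string, endBacktick string) string {
	var docText strings.Builder
	docText.WriteString(startBacktick)
	docText.WriteString("\n")
	for _, line := range codeLines {
		docText.WriteString(line)
		docText.WriteString("\n") // preserve the original line break
	}
	docText.WriteString(endBacktick)
	return docText.String()
}

func main() {
	fmt.Printf("%q\n", flattenCodeBlock("`", []string{"int x = 1;", "int y = 2;"}, "`"))
}
```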
In `@parser/documentation_lexer.go`:
- Around line 637-650: In readDocReferenceTypeToken, call dl.reader.Mark() at
the start so the lexer snapshot is set before consuming chars; this ensures
subsequent getLexeme() (via processReferenceType or
getDocSyntaxTokenWithoutTrivia) returns the correct token text without including
prior trivia or being empty. Add the Mark() before checking dl.peek(), keep the
existing behavior for the BACKTICK branch (still Advance() and SwitchMode), and
then proceed to advance identifier chars and call processReferenceType() as
before.
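The intent of the fix can be illustrated with a toy char reader (Mark, Advance, Peek, and Lexeme here are simplified stand-ins for the real text.CharReader API):

```go
package main

import "fmt"

// reader is a toy stand-in for the lexer's char reader. Mark records where
// the current lexeme starts; Lexeme returns everything consumed since then.
type reader struct {
	src  []rune
	pos  int
	mark int
}

func (r *reader) Mark()    { r.mark = r.pos }
func (r *reader) Advance() { r.pos++ }

func (r *reader) Peek() rune {
	if r.pos >= len(r.src) {
		return 0
	}
	return r.src[r.pos]
}

func (r *reader) Lexeme() string { return string(r.src[r.mark:r.pos]) }

// readIdentifier marks before consuming, so the lexeme excludes any trivia
// that was consumed earlier.
func readIdentifier(r *reader) string {
	r.Mark() // snapshot BEFORE consuming identifier chars
	for c := r.Peek(); c != 0 && c != ' '; c = r.Peek() {
		r.Advance()
	}
	return r.Lexeme()
}

func main() {
	r := &reader{src: []rune("  foo")}
	r.Advance() // skip leading trivia
	r.Advance()
	fmt.Println(readIdentifier(r)) // foo
}
```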
- Around line 30-45: Rename the exported DocumentationLexer type and
NewDocumentationLexer constructor to unexported names (documentationLexer and
newDocumentationLexer) and update all internal references accordingly;
specifically change the type declaration "type DocumentationLexer struct" to
"type documentationLexer struct" and the constructor signature from "func
NewDocumentationLexer(charReader text.CharReader, leadingTriviaList
[]tree.STNode, diagnostics []tree.STNodeDiagnostic, debugCtx
*debugcommon.DebugContext) *DocumentationLexer" to "func
newDocumentationLexer(...) *documentationLexer", preserving the body (calls to
NewLexer, setting lexer.context.leadingTriviaList/diagnostics,
StartMode(PARSER_MODE_DOC_LINE_START_HASH), and previousBacktickMode =
PARSER_MODE_DEFAULT_MODE) and then replace every use site that expects
DocumentationLexer/NewDocumentationLexer (e.g., the single usage in parser code)
to the new unexported names and adjust types accordingly so compilation
succeeds.
In `@parser/documentation_parser.go`:
- Around line 105-115: The isInlineCodeRef function calls
p.getNextNextToken().Kind() without nil checks and can panic on truncated input;
update isInlineCodeRef to first capture t := p.getNextNextToken(), check if t ==
nil and if so return a conservative boolean (e.g., false) and emit a
missing-token diagnostic via the parser's diagnostic API (e.g., p.addDiagnostic
or p.reportMissingToken) before proceeding; then use t.Kind() in the existing
switch arms (replace direct calls to p.getNextNextToken().Kind() with the
guarded local t) to avoid panics.
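A sketch of the guarded lookahead under hypothetical token/parser types (the real parser's token kinds and diagnostic API differ):

```go
package main

import "fmt"

type tokenKind int

const backtickToken tokenKind = iota + 1

type token struct{ kind tokenKind }

type parser struct {
	tokens      []*token
	pos         int
	diagnostics []string
}

// getNextNextToken returns nil when the two-token lookahead runs past input.
func (p *parser) getNextNextToken() *token {
	if p.pos+2 >= len(p.tokens) {
		return nil
	}
	return p.tokens[p.pos+2]
}

// isInlineCodeRef captures the lookahead token first instead of calling
// Kind() on a possibly-nil token; on truncated input it records a diagnostic
// and returns a conservative false.
func (p *parser) isInlineCodeRef() bool {
	t := p.getNextNextToken()
	if t == nil {
		p.diagnostics = append(p.diagnostics, "missing token at end of input")
		return false
	}
	return t.kind == backtickToken
}

func main() {
	p := &parser{tokens: []*token{{kind: backtickToken}}}
	fmt.Println(p.isInlineCodeRef(), p.diagnostics) // no panic on short input
}
```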
In `@parser/testdata/parser/documentation/default_value_initialization/main.json`:
- Around line 1-3: Add an ignore rule to biome.json to prevent Biome from
attempting to parse Git LFS pointer files under parser/testdata; update the
file's configuration (look for the biome.json root object) and add an "ignore"
or "ignorePatterns" entry that excludes "parser/testdata/**" (or a broader
pattern matching LFS-managed files) so Biome will skip those paths during
lint/parse steps and avoid JSON parse errors from LFS pointer files.
In `@parser/testdata/parser/documentation/type_models_project/type_models.json`:
- Around line 1-3: The file is a Git LFS pointer (it starts with "version
https://git-lfs.github.com/spec/v1" and contains "oid sha256:"), so JSON
consumers must detect and skip or resolve LFS pointers instead of parsing them
as JSON; update the JSON-reading logic/tests (where files like type_models.json
are opened) to check the file start for the LFS pointer header and either (a)
treat it as non-JSON and skip the file, or (b) invoke the CI/checkout step to
fetch LFS content before parsing, and add a unit test that verifies the reader
returns a clear error/skip when encountering the LFS pointer header.
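The header check is simple to implement; a sketch (isLFSPointer is a hypothetical helper name):

```go
package main

import (
	"bytes"
	"fmt"
)

// lfsPointerHeader is the first line of every Git LFS pointer file.
var lfsPointerHeader = []byte("version https://git-lfs.github.com/spec/v1")

// isLFSPointer reports whether content is a Git LFS pointer rather than the
// real payload, so JSON readers can skip or resolve it instead of failing
// to parse.
func isLFSPointer(content []byte) bool {
	return bytes.HasPrefix(content, lfsPointerHeader)
}

func main() {
	pointer := []byte("version https://git-lfs.github.com/spec/v1\noid sha256:deadbeef\nsize 42\n")
	realJSON := []byte(`{"kind":"ModulePart"}`)
	fmt.Println(isLFSPointer(pointer), isLFSPointer(realJSON)) // true false
}
```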
In `@parser/testdata/parser/record/record_annotation.json`:
- Around line 1-3: The CI release workflow is not fetching Git LFS objects so
parser/testdata/**/*.json LFS pointers fail JSON linting; update the release
workflow (release.yml) to set lfs: true so LFS files are fetched before linting,
or alternately add an exclusion for parser/testdata/** (or the JSON linting
rule) in biome.json to skip those testdata files during linting—change either
the release.yml lfs flag or the biome.json lint exclude pattern accordingly.
In `@parser/tokenizer.go`:
- Around line 23-29: Rename the exported interface Lexer to an unexported lexer
(lowercase) and update all internal type references: change the interface
declaration from "type Lexer interface { ... }" to "type lexer interface {
NextToken() tree.STToken; StartMode(ParserMode); SwitchMode(ParserMode);
EndMode(); GetCurrentMode() ParserMode }", then update the tokenizer.go field
type at the previous line 32 and the parameter type at the previous line 39 to
use the unexported lexer type; ensure any other internal parser package
references are updated accordingly (no external packages should rely on this
exported name).
🧹 Nitpick comments (1)
parser/testdata/parser/statements/vardeclr/module_record_var_decl_annotation_negetive.json (1)
1-3: Test data metadata update is appropriate.

This Git LFS pointer file indicates the underlying test data has been updated (size increased from 17001 to 23217 bytes), which aligns with the PR's objective of adding parser tests and documentation corpus.
Note: The static analysis errors from Biome are false positives—it's attempting to parse this as JSON when it's actually a Git LFS pointer file with a specific non-JSON format.
Optional: Consider fixing the filename typo.
The filename contains "negetive" instead of "negative". While this is a pre-existing typo and fixing it is low priority, consider renaming the file in a future cleanup if it won't break test references.
```go
}

// Look ahead for a "Deprecated" word match.
for i := 0; i < 10; i++ {
```
can we do something like this

```go
if dl.reader.PeekN(10) != "Deprecated" {
	return dl.readDocInternalToken()
}
```
We'd need to add a PeekString method to charReader since PeekN returns a single char, not a string.
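A sketch of such a PeekString method, assuming PeekN(i) returns the single rune at offset i without consuming input (names mirror the discussion, not the actual charReader):

```go
package main

import (
	"fmt"
	"strings"
)

type charReader struct {
	src []rune
	pos int
}

// PeekN returns the rune n positions ahead without consuming it (0 at EOF).
func (r *charReader) PeekN(n int) rune {
	if r.pos+n >= len(r.src) {
		return 0
	}
	return r.src[r.pos+n]
}

// PeekString concatenates the next n runes without consuming any input,
// letting callers compare a whole word like "Deprecated" in one call.
func (r *charReader) PeekString(n int) string {
	var b strings.Builder
	for i := 0; i < n; i++ {
		c := r.PeekN(i)
		if c == 0 {
			break
		}
		b.WriteRune(c)
	}
	return b.String()
}

func main() {
	r := &charReader{src: []rune("Deprecated parameters")}
	fmt.Println(r.PeekString(10)) // Deprecated
}
```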
Force-pushed c67558f to c4f63b4
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
ast/pretty_printer.go (1)
741-752: ⚠️ Potential issue | 🟠 Major

Unbalanced indent decrement causes misalignment.

When node.IsTop() is true, the if block (lines 745-749) is skipped, but line 750 still decrements indentLevel. This causes the indent to become misaligned (potentially negative).

🔧 Proposed fix

```diff
 func (p *PrettyPrinter) printErrorTypeNode(node *BLangErrorTypeNode) {
 	p.startNode()
 	p.printString("error-type")
 	if !node.IsTop() {
 		p.indentLevel++
 		p.PrintInner(node.detailType.TypeDescriptor.(BLangNode))
 		p.indentLevel--
 	}
-	p.indentLevel--
 	p.endNode()
 }
```
🤖 Fix all issues with AI agents
In `@ast/pretty_printer.go`:
- Around line 419-436: The parameter printing is skipped when a node has
documentation but zero RequiredParams because the condition uses "if hasParams
|| !hasDoc"; update pretty_printer.go so the parameter block is always emitted:
always call p.printString("("), increment p.indentLevel, iterate over
node.RequiredParams (printing none if empty), decrement p.indentLevel, then call
p.printSticky(")"); remove the "!hasDoc" dependency and keep existing
indent/PrintInner logic around node.RequiredParams and
MarkdownDocumentationAttachment to preserve formatting.
In `@parser/documentation_lexer.go`:
- Around line 34-43: The constructor call in newDocumentationLexer uses the
exported NewLexer for an unexported lexer type; change the mismatch by either
renaming NewLexer to newLexer in the lexer implementation and update this call
to newLexer(charReader, debugCtx), or export the type by renaming the unexported
lexer type to Lexer and keep NewLexer; update all references accordingly (look
for NewLexer and the lexer type declaration and usage in newDocumentationLexer
and documentationLexer to ensure consistent exported/unexported naming).
In `@parser/documentation_parser.go`:
- Around line 567-586: The default panic in
DocumentationParser.parseBacktickExpr should be replaced with non-crashing
parser behavior: detect the unexpected token from p.peek() and return a
synthesized/missing token node (with a diagnostic/error recorded via the
parser's existing error reporting) instead of panicking; preserve current
behavior for BACKTICK_TOKEN, DOT_TOKEN (calling parseMethodCall) and
OPEN_PAREN_TOKEN (calling parseFuncCall), and use the same error reporting
mechanism other parse methods use so callers receive a recoverable node rather
than a crash (refer to parseQualifiedIdentifier, peek, consume, parseMethodCall,
parseFuncCall to find where to attach the diagnostic).
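The recovery pattern the prompt describes can be sketched like this (token kinds, createMissingToken, and the diagnostics slice are illustrative, not the real parser API):

```go
package main

import "fmt"

type tokenKind int

const (
	backtickToken tokenKind = iota
	dotToken
	openParenToken
	missingToken
)

type node struct {
	kind    tokenKind
	missing bool
}

type docParser struct {
	diagnostics []string
}

// createMissingToken synthesizes a placeholder node and records a diagnostic
// so callers receive a recoverable tree instead of a crash.
func (p *docParser) createMissingToken(got tokenKind) node {
	p.diagnostics = append(p.diagnostics, fmt.Sprintf("unexpected token kind %d in backtick expr", got))
	return node{kind: missingToken, missing: true}
}

// parseBacktickExpr keeps the existing branches but never panics on the
// default case.
func (p *docParser) parseBacktickExpr(next tokenKind) node {
	switch next {
	case backtickToken, dotToken, openParenToken:
		return node{kind: next} // stand-in for the real parse branches
	default:
		return p.createMissingToken(next)
	}
}

func main() {
	p := &docParser{}
	n := p.parseBacktickExpr(tokenKind(99))
	fmt.Println(n.missing, len(p.diagnostics)) // true 1
}
```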
🧹 Nitpick comments (3)
parser/documentation_lexer.go (2)
188-203: Minor: switch structure can be simplified.

The switch in processWhitespaces (lines 196-201) has an empty case and a default that both lead to the same outer break. Consider simplifying:

♻️ Optional simplification

```diff
 switch c {
 case SPACE, TAB, FORM_FEED:
 	reader.Advance()
 	continue
-case CARRIAGE_RETURN, NEWLINE:
-default:
-	break
 }
 break
```
498-532: Consider using strings.Builder for identifier construction.

Lines 503-507 use string concatenation in a loop (identifier += string(lookAheadChar)), which creates a new string allocation on each iteration. While identifiers are typically short, using strings.Builder is more idiomatic and efficient.

♻️ Optional improvement

```diff
 func (dl *documentationLexer) processDocumentationReference(nextChar rune) bool {
 	lookAheadChar := nextChar
 	lookAheadCount := 0
-	identifier := ""
+	var identifier strings.Builder
 	for isIdentifierInitialChar(lookAheadChar) {
-		identifier += string(lookAheadChar)
+		identifier.WriteRune(lookAheadChar)
 		lookAheadCount++
 		lookAheadChar = dl.reader.PeekN(lookAheadCount)
 	}
-	switch identifier {
+	switch identifier.String() {
```

Add "strings" to the imports if not already present.

ast/pretty_printer.go (1)
733-737: Remove commented-out code.

Dead code clutters the codebase. If this formatting was intentional but not needed, remove it.

♻️ Proposed cleanup

```diff
 p.indentLevel--
-// p.buffer.WriteString("\n")
-// for i := 0; i < p.indentLevel; i++ {
-// 	p.buffer.WriteString(" ")
-// }
 p.buffer.WriteString(")")
 p.addSpaceBeforeNode = true
 }
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@ast/pretty_printer.go`:
- Around line 746-748: The unconditional p.indentLevel-- after p.endNode() will
corrupt indentation because indentLevel is only incremented when !node.IsTop();
change the logic so you only decrement the same branch that incremented it: make
the decrement conditional on the same check (node.IsTop()) or remove the
duplicated decrement so that p.indentLevel is decremented exactly once for the
earlier increment; locate the surrounding node.IsTop() check and the two
p.indentLevel-- occurrences and ensure the decrement mirrors the increment
(i.e., wrap p.indentLevel-- in the same if !node.IsTop() block or delete the
extra decrement).
🧹 Nitpick comments (1)
ast/pretty_printer.go (1)
729-732: Remove commented-out code.

Dead code should be removed to maintain clarity. If this was debugging code or an intentional design decision, consider documenting the rationale elsewhere.

🧹 Proposed fix

```diff
-// p.buffer.WriteString("\n")
-// for i := 0; i < p.indentLevel; i++ {
-// 	p.buffer.WriteString(" ")
-// }
 p.buffer.WriteString(")")
```
Force-pushed 5a200ae to 0ba1dbb
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@ast/pretty_printer.go`:
- Around line 815-853: In printMarkdownReturnParameterDocumentation, when
ReturnParameterDocumentationLines is non-empty the code leaves
p.addSpaceBeforeNode = false which can cause "(return-type" to be concatenated
with the closing ")" of the doc-lines; before printing the return type (i.e.
before calling p.printString("(return-type") /
p.PrintInner(node.ReturnType.(BLangNode))) ensure a separator by setting
p.addSpaceBeforeNode = true (or inserting a single space) after finishing
doc-lines so the printString("(return-type") will be emitted with a space
between the doc-lines closing ")" and the "(return-type" token.
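The separator issue is easy to reproduce with a toy printer (a sketch; the real PrettyPrinter carries more state than this):

```go
package main

import (
	"fmt"
	"strings"
)

type printer struct {
	buf                strings.Builder
	addSpaceBeforeNode bool
}

// printString emits a leading space only when addSpaceBeforeNode is set,
// then clears the flag, mimicking the pretty-printer's spacing contract.
func (p *printer) printString(s string) {
	if p.addSpaceBeforeNode {
		p.buf.WriteString(" ")
	}
	p.buf.WriteString(s)
	p.addSpaceBeforeNode = false
}

func main() {
	// Without resetting the flag after the doc-lines, the next token glues on.
	p := &printer{}
	p.printString("(doc-lines ...)")
	p.printString("(return-type")
	fmt.Println(p.buf.String()) // (doc-lines ...)(return-type

	// Setting addSpaceBeforeNode = true after the doc-lines inserts the space.
	q := &printer{}
	q.printString("(doc-lines ...)")
	q.addSpaceBeforeNode = true
	q.printString("(return-type")
	fmt.Println(q.buf.String()) // (doc-lines ...) (return-type
}
```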
In `@corpus/bal/subset2/01-documentation/function_deprecation-v.bal`:
- Around line 34-35: This test file is missing the required `@output` comment
block so the corpus harness can't record expected output; add an `@output` comment
block after the public function main() containing the actual output the test
should produce (use the exact marker "@output" and the output lines beneath it)
so the *-v.bal corpus test can verify end-to-end output.
In `@parser/lexer.go`:
- Around line 48-54: The public constructor NewLexer should be unexported:
rename NewLexer to newLexer so the package does not expose a public constructor
that returns an unexported type (*lexer); update all internal call sites to use
newLexer and keep the signature (reader text.CharReader, debugCtx
*debugcommon.DebugContext) *lexer unchanged, ensuring imports/tests inside the
parser package are updated accordingly and no external packages reference
NewLexer.
In `@parser/testdata/parser/documentation/markdown_native_function.json`:
- Around line 1-3: The CI is failing because
parser/testdata/parser/documentation/*.json contains Git LFS pointer files (not
real JSON); update the CI lint job (the JSON lint step) to either fetch LFS
objects before running linters or skip these testdata JSON files — i.e.,
configure the CI JSON lint step to run `git lfs pull` prior to linting or add an
exclude pattern for parser/testdata/parser/documentation/*.json (or update the
JSON linter config used by the JSON lint job) so the linter does not attempt to
parse LFS pointer files.
🧹 Nitpick comments (1)
parser/testdata/parser/statements/vardeclr/module_record_var_decl_annotation_negetive.json (1)
1-3: Git LFS pointer update looks correct.

This is a Git LFS pointer file tracking test data. The static analysis errors from Biome are false positives: it's attempting to parse the LFS pointer format as JSON content, which is incorrect.
One minor note: the filename contains a typo ("negetive" → "negative"). Consider renaming for consistency if other files follow the correct spelling.
Force-pushed 0ba1dbb to 4f2571f
Force-pushed 4f2571f to 34d2d8a
Force-pushed 34d2d8a to 3932af1
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #103      +/-   ##
==========================================
+ Coverage   27.55%   29.00%   +1.45%
==========================================
  Files         252      254       +2
  Lines       53694    55159    +1465
==========================================
+ Hits        14794    15999    +1205
- Misses      37969    38162     +193
- Partials      931      998      +67
```

☔ View full report in Codecov by Sentry.
Purpose
Resolves #16
This PR implements a documentation parser and end-to-end Markdown documentation support across the parser, syntax tree, pretty-printer, and test corpus.
Key changes
Outcomes