Commit c3042fe

feat: add markitdown-inspired file parsers (Word, PowerPoint, Excel, EPub, ZIP) (#128)
* feat: add markitdown-inspired file parsers

  Add support for parsing additional file formats inspired by microsoft/markitdown:
  - Word (.docx) - using python-docx
  - PowerPoint (.pptx) - using python-pptx
  - Excel (.xlsx) - using openpyxl
  - Audio (.mp3, .wav, .m4a, etc.) - metadata extraction using mutagen
  - EPub (.epub) - using ebooklib
  - ZIP (.zip) - iterate contents

  All parsers convert content to markdown and delegate to MarkdownParser for tree structure creation, following OpenViking's parser pattern.

  Dependencies added to pyproject.toml:
  - python-docx, python-pptx, openpyxl
  - ebooklib, beautifulsoup4
  - mutagen

  Includes comprehensive tests for all new parsers.

  Refs: markitdown-parsers

* feat: make markitdown parsers built-in capabilities

  Move parser dependencies from optional to main dependencies. Register parsers directly without graceful fallback. Remove optional registration infrastructure.

  Parsers now built-in:
  - Word (.docx) via python-docx
  - PowerPoint (.pptx) via python-pptx
  - Excel (.xlsx) via openpyxl
  - EPub (.epub) via ebooklib
  - ZIP (.zip) via built-in zipfile
  - Audio (.mp3, .wav, etc.) via mutagen

* refactor: align parsers with existing ecosystem

  Set source_format on ParseResult like TextParser does:
  - word: source_format = 'docx'
  - powerpoint: source_format = 'pptx'
  - excel: source_format = 'xlsx'
  - epub: source_format = 'epub'
  - zip: source_format = 'zip'
  - audio: source_format = 'audio'

  All parsers now follow the same pattern as existing TextParser and PDFParser for consistency.

* fix: align markitdown parsers with ecosystem patterns

  Critical fixes:
  - Remove duplicate zip_archive.py (conflicting ZipParser class name)
  - Use zip_parser.py as canonical ZIP parser (follows TextParser pattern)
  - Fix parse_content() to delegate to MarkdownParser instead of raising ValueError (all parse_content tests were broken)
  - Set parser_name on all ParseResult outputs (was missing)
  - Set source_format AFTER MarkdownParser call (was being overwritten)
  - Accept ParserConfig in all parser __init__ (ecosystem consistency)
  - Add .xlsm to ExcelParser supported_extensions
  - Fix AudioParser._format_size to match ZipParser format (500.0 B)
  - Fix pyproject.toml urllib3 indentation corruption
  - Add tests/parse/conftest.py with VikingFS test fixture
  - Rewrite tests to actually pass and cover registry integration

  All 23 tests passing.

* fix: improve markitdown parser consistency and add real-file tests

  Critical fixes:
  - WordParser: preserve table position in document order (was appending all tables at end, losing context). Walk document body XML in order instead of iterating paragraphs then tables separately.
  - PowerPointParser: replace magic number (type == 1) with proper PP_PLACEHOLDER enum constants, also handle CENTER_TITLE.
  - AudioParser: add Vorbis/FLAC/OGG tag extraction (previously only handled ID3 and MP4 formats). Tries all format mappings with dedup.
  - ZipParser: replace emoji in tree view with plain text markers for robustness in text processing pipelines.
  - TextParser: set parser_name='TextParser' on parse_content results for consistency with all other parsers.
  - __init__.py: export all new parser classes for public API.

  Tests (16 new, 39 total):
  - Real .docx/.xlsx/.pptx file creation and parsing
  - EPub HTML-to-markdown conversion edge cases
  - ZIP bad-file error handling and no-emoji tree view
  - AudioParser Vorbis tag extraction and edge cases
  - WordParser can_parse() extension matching

* feat: update for rebase and remove audio redundant

* feat: rollback unexpected change

* chore: remove redundant mutagen for audio file
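The delegation pattern described in the message above (delegate to MarkdownParser, then set source_format and parser_name once the delegate has returned) can be sketched roughly as follows. This is a minimal illustration, not the OpenViking implementation: `StubResult`, `StubMarkdownParser`, and `DocxLikeParser` are hypothetical stand-ins introduced here.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubResult:
    """Hypothetical stand-in for OpenViking's ParseResult."""

    content: str
    source_format: str = "markdown"
    parser_name: str = "MarkdownParser"


class StubMarkdownParser:
    """Hypothetical stand-in for MarkdownParser; it stamps its own labels."""

    async def parse_content(self, content: str) -> StubResult:
        return StubResult(content=content)


class DocxLikeParser:
    """Illustrative wrapper: delegate first, relabel afterwards."""

    def __init__(self) -> None:
        self._md_parser = StubMarkdownParser()

    async def parse_content(self, content: str) -> StubResult:
        result = await self._md_parser.parse_content(content)
        # Assign the labels only after the delegate returns; the delegate
        # builds the result with its own defaults for both fields.
        result.source_format = "docx"
        result.parser_name = "DocxLikeParser"
        return result


result = asyncio.run(DocxLikeParser().parse_content("# Title\n\nBody"))
print(result.source_format, result.parser_name)
```

Relabeling after the delegate call means the wrapper's values are the ones that survive on the returned result, which appears to be the ordering problem the "was being overwritten" fix refers to.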
1 parent 66d2a36 commit c3042fe

10 files changed, +938 -12 lines changed

openviking/parse/parsers/__init__.py

Lines changed: 10 additions & 0 deletions
@@ -3,18 +3,28 @@
 
 from .base_parser import BaseParser
 from .code import CodeRepositoryParser
+from .epub import EPubParser
+from .excel import ExcelParser
 from .html import HTMLParser, URLType, URLTypeDetector
 from .markdown import MarkdownParser
 from .pdf import PDFParser
+from .powerpoint import PowerPointParser
 from .text import TextParser
+from .word import WordParser
+from .zip_parser import ZipParser
 
 __all__ = [
     "BaseParser",
     "CodeRepositoryParser",
+    "EPubParser",
+    "ExcelParser",
     "HTMLParser",
     "URLType",
     "URLTypeDetector",
     "MarkdownParser",
     "PDFParser",
+    "PowerPointParser",
     "TextParser",
+    "WordParser",
+    "ZipParser",
 ]

openviking/parse/parsers/epub.py

Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,173 @@
+# Copyright (c) 2026 Beijing Volcano Engine Technology Co., Ltd.
+# SPDX-License-Identifier: Apache-2.0
+"""
+EPub (.epub) parser for OpenViking.
+
+Converts EPub e-books to Markdown then parses using MarkdownParser.
+Inspired by microsoft/markitdown approach.
+"""
+
+import html
+import re
+import zipfile
+from pathlib import Path
+from typing import List, Optional, Union
+
+from openviking.parse.base import ParseResult
+from openviking.parse.parsers.base_parser import BaseParser
+from openviking_cli.utils.config.parser_config import ParserConfig
+from openviking_cli.utils.logger import get_logger
+
+logger = get_logger(__name__)
+
+
+class EPubParser(BaseParser):
+    """
+    EPub e-book parser for OpenViking.
+
+    Supports: .epub
+
+    Converts EPub e-books to Markdown using ebooklib (if available)
+    or falls back to manual extraction, then delegates to MarkdownParser.
+    """
+
+    def __init__(self, config: Optional[ParserConfig] = None):
+        """Initialize EPub parser."""
+        from openviking.parse.parsers.markdown import MarkdownParser
+
+        self._md_parser = MarkdownParser(config=config)
+        self.config = config or ParserConfig()
+
+    @property
+    def supported_extensions(self) -> List[str]:
+        return [".epub"]
+
+    async def parse(self, source: Union[str, Path], instruction: str = "", **kwargs) -> ParseResult:
+        """Parse EPub e-book from file path."""
+        path = Path(source)
+
+        if path.exists():
+            markdown_content = self._convert_to_markdown(path)
+            result = await self._md_parser.parse_content(
+                markdown_content, source_path=str(path), instruction=instruction, **kwargs
+            )
+        else:
+            result = await self._md_parser.parse_content(
+                str(source), instruction=instruction, **kwargs
+            )
+        result.source_format = "epub"
+        result.parser_name = "EPubParser"
+        return result
+
+    async def parse_content(
+        self, content: str, source_path: Optional[str] = None, instruction: str = "", **kwargs
+    ) -> ParseResult:
+        """Parse content - delegates to MarkdownParser."""
+        result = await self._md_parser.parse_content(content, source_path, **kwargs)
+        result.source_format = "epub"
+        result.parser_name = "EPubParser"
+        return result
+
+    def _convert_to_markdown(self, path: Path) -> str:
+        """Convert EPub e-book to Markdown string."""
+        # Try using ebooklib first
+        try:
+            import ebooklib
+            from ebooklib import epub
+
+            return self._convert_with_ebooklib(path, ebooklib, epub)
+        except ImportError:
+            pass
+
+        # Fall back to manual extraction
+        return self._convert_manual(path)
+
+    def _convert_with_ebooklib(self, path: Path, ebooklib, epub) -> str:
+        """Convert EPub using ebooklib."""
+        book = epub.read_epub(path)
+        markdown_parts = []
+
+        title = self._get_metadata(book, "title")
+        author = self._get_metadata(book, "creator")
+
+        if title:
+            markdown_parts.append(f"# {title}")
+        if author:
+            markdown_parts.append(f"**Author:** {author}")
+
+        for item in book.get_items():
+            if item.get_type() == ebooklib.ITEM_DOCUMENT:
+                content = item.get_content().decode("utf-8", errors="ignore")
+                md_content = self._html_to_markdown(content)
+                if md_content.strip():
+                    markdown_parts.append(md_content)
+
+        return "\n\n".join(markdown_parts)
+
+    def _get_metadata(self, book, key: str) -> str:
+        """Get metadata from EPub book."""
+        try:
+            metadata = book.get_metadata("DC", key)
+            if metadata:
+                return metadata[0][0]
+        except Exception:
+            pass
+        return ""
+
+    def _convert_manual(self, path: Path) -> str:
+        """Convert EPub manually using zipfile and HTML parsing."""
+        markdown_parts = []
+
+        with zipfile.ZipFile(path, "r") as zf:
+            html_files = [f for f in zf.namelist() if f.endswith((".html", ".xhtml", ".htm"))]
+
+            for html_file in sorted(html_files):
+                try:
+                    content = zf.read(html_file).decode("utf-8", errors="ignore")
+                    md_content = self._html_to_markdown(content)
+                    if md_content.strip():
+                        markdown_parts.append(md_content)
+                except Exception as e:
+                    logger.warning(f"Failed to process {html_file}: {e}")
+
+        return (
+            "\n\n".join(markdown_parts)
+            if markdown_parts
+            else "# EPub Content\n\nUnable to extract content."
+        )
+
+    def _html_to_markdown(self, html_content: str) -> str:
+        """Simple HTML to markdown conversion."""
+        # Remove script and style tags
+        html_content = re.sub(r"<script[^>]*>.*?</script>", "", html_content, flags=re.DOTALL)
+        html_content = re.sub(r"<style[^>]*>.*?</style>", "", html_content, flags=re.DOTALL)
+
+        # Convert headers
+        html_content = re.sub(r"<h1[^>]*>(.*?)</h1>", r"# \1", html_content, flags=re.DOTALL)
+        html_content = re.sub(r"<h2[^>]*>(.*?)</h2>", r"## \1", html_content, flags=re.DOTALL)
+        html_content = re.sub(r"<h3[^>]*>(.*?)</h3>", r"### \1", html_content, flags=re.DOTALL)
+        html_content = re.sub(r"<h4[^>]*>(.*?)</h4>", r"#### \1", html_content, flags=re.DOTALL)
+
+        # Convert bold and italic
+        html_content = re.sub(r"<strong>(.*?)</strong>", r"**\1**", html_content, flags=re.DOTALL)
+        html_content = re.sub(r"<b>(.*?)</b>", r"**\1**", html_content, flags=re.DOTALL)
+        html_content = re.sub(r"<em>(.*?)</em>", r"*\1*", html_content, flags=re.DOTALL)
+        html_content = re.sub(r"<i>(.*?)</i>", r"*\1*", html_content, flags=re.DOTALL)
+
+        # Convert paragraphs
+        html_content = re.sub(r"<p[^>]*>(.*?)</p>", r"\1\n\n", html_content, flags=re.DOTALL)
+
+        # Convert line breaks
+        html_content = re.sub(r"<br\s*/?>", "\n", html_content)
+
+        # Remove remaining HTML tags
+        html_content = re.sub(r"<[^>]+>", "", html_content)
+
+        # Unescape HTML entities
+        html_content = html.unescape(html_content)
+
+        # Normalize whitespace
+        html_content = re.sub(r"\n\s*\n", "\n\n", html_content)
+        html_content = re.sub(r"[ \t]+", " ", html_content)
+
+        return html_content.strip()
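
To make the conversion order in `_html_to_markdown` easier to follow, here is a trimmed-down, standalone sketch that applies a subset of the same substitutions in the same order. The function name `html_to_markdown` and the sample strings are illustrative assumptions; only the regular expressions are taken from the diff.

```python
import html
import re


def html_to_markdown(html_content: str) -> str:
    """Apply a subset of the diff's regex substitutions, in the same order."""
    # Drop script blocks entirely, including their contents.
    html_content = re.sub(r"<script[^>]*>.*?</script>", "", html_content, flags=re.DOTALL)
    # Headings and bold text become Markdown markers.
    html_content = re.sub(r"<h1[^>]*>(.*?)</h1>", r"# \1", html_content, flags=re.DOTALL)
    html_content = re.sub(r"<strong>(.*?)</strong>", r"**\1**", html_content, flags=re.DOTALL)
    # Paragraphs are unwrapped and followed by a blank line.
    html_content = re.sub(r"<p[^>]*>(.*?)</p>", r"\1\n\n", html_content, flags=re.DOTALL)
    # Any tag that is still present is stripped, then entities are decoded.
    html_content = re.sub(r"<[^>]+>", "", html_content)
    html_content = html.unescape(html_content)
    # Collapse runs of blank lines.
    html_content = re.sub(r"\n\s*\n", "\n\n", html_content)
    return html_content.strip()


sample = "<h1>Chapter 1</h1>\n<p>Tom &amp; <strong>Jerry</strong></p>\n<script>var x = 1;</script>"
converted = html_to_markdown(sample)
print(converted)

# Adjacent tags on one line: the heading rule adds no newline of its own.
fused = html_to_markdown("<h1>A</h1><p>B</p>")
print(fused)
```

Entities are decoded only after tag stripping, so an escaped `&lt;b&gt;` in the text is not mistaken for markup. The second call shows a limitation of the regex approach: since the heading substitution emits no trailing newline, a heading immediately followed by a paragraph on the same line is fused into a single heading line.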

openviking/parse/parsers/excel.py

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
+# Copyright (c) 2026 Beijing Volcano Engine Technology Co., Ltd.
+# SPDX-License-Identifier: Apache-2.0
+"""
+Excel (.xlsx/.xls/.xlsm) parser for OpenViking.
+
+Converts Excel spreadsheets to Markdown then parses using MarkdownParser.
+Inspired by microsoft/markitdown approach.
+"""
+
+from pathlib import Path
+from typing import List, Optional, Union
+
+from openviking.parse.base import ParseResult
+from openviking.parse.parsers.base_parser import BaseParser
+from openviking_cli.utils.config.parser_config import ParserConfig
+from openviking_cli.utils.logger import get_logger
+
+logger = get_logger(__name__)
+
+
+class ExcelParser(BaseParser):
+    """
+    Excel spreadsheet parser for OpenViking.
+
+    Supports: .xlsx, .xls, .xlsm
+
+    Converts Excel spreadsheets to Markdown using openpyxl,
+    then delegates to MarkdownParser for tree structure creation.
+    """
+
+    def __init__(self, config: Optional[ParserConfig] = None, max_rows_per_sheet: int = 1000):
+        """
+        Initialize Excel parser.
+
+        Args:
+            config: Parser configuration
+            max_rows_per_sheet: Maximum rows to process per sheet (0 = unlimited)
+        """
+        from openviking.parse.parsers.markdown import MarkdownParser
+
+        self._md_parser = MarkdownParser(config=config)
+        self.config = config or ParserConfig()
+        self.max_rows_per_sheet = max_rows_per_sheet
+
+    @property
+    def supported_extensions(self) -> List[str]:
+        return [".xlsx", ".xls", ".xlsm"]
+
+    async def parse(self, source: Union[str, Path], instruction: str = "", **kwargs) -> ParseResult:
+        """Parse Excel spreadsheet from file path."""
+        path = Path(source)
+
+        if path.exists():
+            import openpyxl
+
+            markdown_content = self._convert_to_markdown(path, openpyxl)
+            result = await self._md_parser.parse_content(
+                markdown_content, source_path=str(path), instruction=instruction, **kwargs
+            )
+        else:
+            result = await self._md_parser.parse_content(
+                str(source), instruction=instruction, **kwargs
+            )
+        result.source_format = "xlsx"
+        result.parser_name = "ExcelParser"
+        return result
+
+    async def parse_content(
+        self, content: str, source_path: Optional[str] = None, instruction: str = "", **kwargs
+    ) -> ParseResult:
+        """Parse content - delegates to MarkdownParser."""
+        result = await self._md_parser.parse_content(content, source_path, **kwargs)
+        result.source_format = "xlsx"
+        result.parser_name = "ExcelParser"
+        return result
+
+    def _convert_to_markdown(self, path: Path, openpyxl) -> str:
+        """Convert Excel spreadsheet to Markdown string."""
+        wb = openpyxl.load_workbook(path, data_only=True)
+
+        markdown_parts = []
+        markdown_parts.append(f"# {path.stem}")
+        markdown_parts.append(f"**Sheets:** {len(wb.sheetnames)}")
+
+        for sheet_name in wb.sheetnames:
+            sheet = wb[sheet_name]
+            sheet_content = self._convert_sheet(sheet, sheet_name)
+            markdown_parts.append(sheet_content)
+
+        return "\n\n".join(markdown_parts)
+
+    def _convert_sheet(self, sheet, sheet_name: str) -> str:
+        """Convert a single sheet to markdown."""
+        parts = []
+        parts.append(f"## Sheet: {sheet_name}")
+
+        max_row = sheet.max_row
+        max_col = sheet.max_column
+
+        if max_row == 0 or max_col == 0:
+            parts.append("*Empty sheet*")
+            return "\n\n".join(parts)
+
+        parts.append(f"**Dimensions:** {max_row} rows × {max_col} columns")
+
+        rows_to_process = max_row
+        if self.max_rows_per_sheet > 0:
+            rows_to_process = min(max_row, self.max_rows_per_sheet)
+
+        rows = []
+        for _row_idx, row in enumerate(
+            sheet.iter_rows(min_row=1, max_row=rows_to_process, values_only=True), 1
+        ):
+            row_data = []
+            for cell in row:
+                if cell is None:
+                    row_data.append("")
+                else:
+                    row_data.append(str(cell))
+            rows.append(row_data)
+
+        if rows:
+            from openviking.parse.base import format_table_to_markdown
+
+            table_md = format_table_to_markdown(rows, has_header=True)
+            parts.append(table_md)
+
+        if self.max_rows_per_sheet > 0 and max_row > self.max_rows_per_sheet:
+            parts.append(f"\n*... {max_row - self.max_rows_per_sheet} more rows truncated ...*")
+
+        return "\n\n".join(parts)
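
The row handling in `_convert_sheet` (stringify cells, cap the row count, append a truncation note) can be illustrated without openpyxl using plain tuples. This is a sketch under assumptions: `rows_to_markdown_table` is a hypothetical stand-in for `format_table_to_markdown`, whose actual output format is not shown in this diff, and `convert_rows` is an illustrative name.

```python
from typing import List, Optional, Sequence


def rows_to_markdown_table(rows: List[List[str]]) -> str:
    """Hypothetical stand-in for format_table_to_markdown; row 0 is the header."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    lines.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(lines)


def convert_rows(raw_rows: Sequence[Sequence[Optional[object]]], max_rows: int = 1000) -> str:
    """Stringify cells, cap the row count, and report how many rows were dropped."""
    total = len(raw_rows)
    # A cap of 0 means unlimited, matching the max_rows_per_sheet docstring.
    limit = min(total, max_rows) if max_rows > 0 else total
    rows = [["" if cell is None else str(cell) for cell in row] for row in raw_rows[:limit]]
    parts = [rows_to_markdown_table(rows)] if rows else ["*Empty sheet*"]
    if total > limit:
        parts.append(f"*... {total - limit} more rows truncated ...*")
    return "\n\n".join(parts)


data = [("name", "qty"), ("apple", 3), ("pear", None), ("fig", 7)]
table = convert_rows(data, max_rows=3)
print(table)
```

Note that the header row counts toward the cap here, as it does in the diff, where `iter_rows` starts at `min_row=1` and the first row is passed as the header.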
