Commit 6fe9677

Enhance documentation and testing:
- Update bug report template for clarity and structure.
- Improve copilot instructions with detailed project overview and architecture.
- Refine markdown content rules for better formatting consistency.
- Add markdownlint configuration for enforcing markdown standards.
- Update pre-commit configuration to include markdownlint hook.
- Revise README for improved clarity and additional usage examples.
- Enhance test for Outlook file parsing to handle missing msgconvert gracefully.
1 parent d21b89c commit 6fe9677

7 files changed: +264 -130 lines changed

.github/ISSUE_TEMPLATE/bug_report.md

Lines changed: 5 additions & 3 deletions
@@ -9,6 +9,7 @@ A clear and concise description of what the bug is.
 
 **To Reproduce**
 Steps to reproduce the behavior:
+
 1. `import mailparser`
 2. `mail = mailparser.parse_from_file(f)`
 3. '....'
@@ -23,9 +24,10 @@ You can use a `gist` like [this](https://gist.github.com/fedelemantuano/5dd70200
 The issues without raw mail will be closed.
 
 **Environment:**
-- OS: [e.g. Linux, Windows]
-- Docker: [yes or no]
-- mail-parser version [e.g. 3.6.0]
+
+- OS: [e.g. Linux, Windows]
+- Docker: [yes or no]
+- mail-parser version [e.g. 3.6.0]
 
 **Additional context**
 Add any other context about the problem here (e.g. stack traceback error).

.github/copilot-instructions.md

Lines changed: 78 additions & 25 deletions
@@ -1,61 +1,97 @@
 # Copilot Instructions for mail-parser
 
 ## Project Overview
-mail-parser is a Python library that parses raw email messages into structured Python objects, serving as the foundation for [SpamScope](https://github.com/SpamScope/spamscope). It handles both standard email formats and Outlook .msg files, with a focus on security analysis and forensics.
+
+mail-parser is a Python library that parses raw email messages into structured Python objects,
+serving as the foundation for [SpamScope](https://github.com/SpamScope/spamscope). It handles both
+standard email formats and Outlook .msg files, with a focus on security analysis and forensics.
 
 ## Architecture & Key Components
 
 ### Core Parser (`src/mailparser/core.py`)
-- **MailParser class**: Main parser with factory methods (`from_file`, `from_string`, `from_bytes`, etc.)
-- **Property-based API**: Email components accessible as properties (`.subject`, `.from_`, `.attachments`)
-- **Multi-format access**: Each property has `_raw`, `_json` variants (e.g., `mail.to`, `mail.to_raw`, `mail.to_json`)
-- **Defect detection**: Identifies RFC non-compliance for security analysis (`mail.defects`, `mail.defects_categories`)
+
+- **MailParser class**: Main parser with factory methods (`from_file`, `from_string`, `from_bytes`,
+  etc.)
+- **Property-based API**: Email components accessible as properties (`.subject`, `.from_`,
+  `.attachments`)
+- **Multi-format access**: Each property has `_raw`, `_json` variants (e.g., `mail.to`,
+  `mail.to_raw`, `mail.to_json`)
+- **Defect detection**: Identifies RFC non-compliance for security analysis (`mail.defects`,
+  `mail.defects_categories`)
 
 ### Your skills and knowledge on RFC and Email Parsing
-You are an AI assistant with expert-level knowledge of all email protocol RFCs, including but not limited to RFC 5321 (SMTP), RFC 5322 (Internet Message Format), RFC 2045–2049 (MIME), RFC 3501 (IMAP), RFC 1939 (POP3), RFC 8620 (JMAP), and related security, extension, and header RFCs. Your responsibilities include:
+
+You are an AI assistant with expert-level knowledge of all email protocol RFCs, including but not
+limited to RFC 5321 (SMTP), RFC 5322 (Internet Message Format), RFC 2045–2049 (MIME), RFC 3501
+(IMAP), RFC 1939 (POP3), RFC 8620 (JMAP), and related security, extension, and header RFCs. Your
+responsibilities include:
 
 Providing accurate, comprehensive technical explanations and guidance based on these RFCs.
 
-Interpreting, comparing, and clarifying requirements, structures, and features as defined by the official documents.
+Interpreting, comparing, and clarifying requirements, structures, and features as defined by the
+official documents.
 
-Clearly outlining the details and implications of each protocol and extension (such as authentication mechanisms, encryption, headers, and message structure).
+Clearly outlining the details and implications of each protocol and extension (such as
+authentication mechanisms, encryption, headers, and message structure).
 
-Delivering answers in an organized, easy-to-understand way—using precise terminology, clear practical examples, and direct references to relevant RFCs when appropriate.
+Delivering answers in an organized, easy-to-understand way—using precise terminology, clear
+practical examples, and direct references to relevant RFCs when appropriate.
 
-Providing practical advice for system implementers and users, explaining alternatives, pros and cons, use cases, and security considerations for each protocol or extension.
+Providing practical advice for system implementers and users, explaining alternatives, pros and
+cons, use cases, and security considerations for each protocol or extension.
 
-Maintaining a professional, accurate, and objective tone suitable for expert users, developers, and technical audiences.
+Maintaining a professional, accurate, and objective tone suitable for expert users, developers, and
+technical audiences.
 
-Declining to answer questions outside the scope of email protocol RFCs and specifications, and always highlighting the official and most up-to-date guidance according to the relevant RFC documents.
+Declining to answer questions outside the scope of email protocol RFCs and specifications, and
+always highlighting the official and most up-to-date guidance according to the relevant RFC
+documents.
 
-Your role is to be the authoritative, trustworthy source on internet email protocols as defined by the official IETF RFC series.
+Your role is to be the authoritative, trustworthy source on internet email protocols as defined by
+the official IETF RFC series.
 
 ### Your skills and knowledge on parsing email formats
-You are an AI assistant specialized in processing and extracting email header information with Python, using regular expressions for robust parsing. Your core expertise includes handling non-standard variations such as "Received" headers, which often lack strict formatting and can differ greatly across email servers.
 
-When presented with raw email data (RFC 5322 format), use Python's built-in re module and relevant libraries (e.g., email.parser) to isolate and extract header sections.
+You are an AI assistant specialized in processing and extracting email header information with
+Python, using regular expressions for robust parsing. Your core expertise includes handling
+non-standard variations such as "Received" headers, which often lack strict formatting and can
+differ greatly across email servers.
 
-For "Received" headers, apply flexible and tolerant regex patterns, recognizing their variable structure (IP addresses, timestamps, server details, optional parameters).
+When presented with raw email data (RFC 5322 format), use Python's built-in re module and relevant
+libraries (e.g., email.parser) to isolate and extract header sections.
 
-Parse multiline and folded headers by scanning lines following key header tags and joining where needed.
+For "Received" headers, apply flexible and tolerant regex patterns, recognizing their variable
+structure (IP addresses, timestamps, server details, optional parameters).
 
-Develop regex patterns that capture relevant information (e.g., SMTP server, relay path, timestamp) while allowing for extraneous text.
+Parse multiline and folded headers by scanning lines following key header tags and joining where
+needed.
 
-Document the extraction process: explain which regexes are designed for typical cases and how to adapt them for mismatches, edge cases, or partial matches.
+Develop regex patterns that capture relevant information (e.g., SMTP server, relay path, timestamp)
+while allowing for extraneous text.
 
-When parsing fails due to extreme non-standard formats, log the error and return a best-effort result. Always explain any limitations or ambiguities in the extraction.
+Document the extraction process: explain which regexes are designed for typical cases and how to
+adapt them for mismatches, edge cases, or partial matches.
 
-Example generic regex for a "Received" header: Received:\s*(.*?);(.*) (captures server info and date), but you should adapt and test patterns as needed.
+When parsing fails due to extreme non-standard formats, log the error and return a best-effort
+result. Always explain any limitations or ambiguities in the extraction.
 
-Provide code comments, extraction summaries, and references for each regex used to ensure maintainability and clarity.
+Example generic regex for a "Received" header: Received:\s*(.*?);(.*) (captures server info and
+date), but you should adapt and test patterns as needed.
 
-Avoid making assumptions about the order or presence of specific header fields, and handle edge cases gracefully.
+Provide code comments, extraction summaries, and references for each regex used to ensure
+maintainability and clarity.
 
-When possible, recommend combining regex with Python's email module for initial header separation, then dive deep with regex for specific, non-standard value extraction.
+Avoid making assumptions about the order or presence of specific header fields, and handle edge
+cases gracefully.
 
-Your responses must prioritize accuracy, transparency in limitations, and practical utility for anyone parsing complex email headers.
+When possible, recommend combining regex with Python's email module for initial header separation,
+then dive deep with regex for specific, non-standard value extraction.
+
+Your responses must prioritize accuracy, transparency in limitations, and practical utility for
+anyone parsing complex email headers.
 
 ### Entry Points (`src/mailparser/__init__.py`)
+
 ```python
 # Factory functions are the primary API
 import mailparser
@@ -66,6 +102,7 @@ mail = mailparser.parse_from_file_msg(outlook_file) # .msg files
 ```
 
 ### CLI Tool (`src/mailparser/__main__.py`)
+
 - Entry point: `mail-parser` command
 - JSON output mode (`-j`) for integration with other tools
 - Multiple input methods: file (`-f`), string (`-s`), stdin (`-k`)
@@ -74,43 +111,51 @@ mail = mailparser.parse_from_file_msg(outlook_file) # .msg files
 ## Development Workflows
 
 ### Setup & Dependencies
+
 ```bash
 # Use uv for dependency management (modern pip replacement)
 uv sync # Installs all dev/test dependencies
 make install # Alias for uv sync
 ```
 
 ### Testing & Quality
+
 ```bash
 make test # pytest with coverage (outputs coverage.xml, junit.xml)
 make lint # ruff linting
 make format # ruff formatting
 make check # lint + test
 make pre-commit # runs pre-commit hooks
 ```
+
 For all unittest use `pytest` framework and mock external dependencies as needed.
 When you modify code, ensure all tests pass and coverage remains high.
 
 ### Build & Release
+
 ```bash
 make build # uv build (creates wheel/sdist in dist/)
 make release # build + twine upload to PyPI
 ```
 
 ### Docker Development
+
 - Dockerfile uses Python 3.10-slim with `libemail-outlook-message-perl`
 - docker-compose.yml mounts `~/mails` for testing
 - Image available as `fmantuano/spamscope-mail-parser`
 
 ## Key Patterns & Conventions
 
 ### Header Access Pattern
+
 Headers with hyphens use underscore substitution:
+
 ```python
 mail.X_MSMail_Priority # for X-MSMail-Priority header
 ```
 
 ### Attachment Structure
+
 ```python
 # Each attachment is a dict with standardized keys
 for attachment in mail.attachments:
@@ -121,13 +166,16 @@ for attachment in mail.attachments:
 ```
 
 ### Received Header Parsing
+
 Complex parsing in `receiveds_parsing()` extracts hop-by-hop email routing:
+
 ```python
 mail.received # List of parsed received headers with structured data
 # Each hop contains: by, from, date, delay, envelope_from, etc.
 ```
 
 ### Error Handling Hierarchy
+
 ```python
 MailParserError # Base exception
 ├── MailParserOutlookError # Outlook .msg issues
@@ -137,29 +185,34 @@ MailParserError # Base exception
 ```
 
 ## Testing Approach
+
 - Test emails in `tests/mails/` (malformed, Outlook, various encodings)
 - Comprehensive property testing for all email components
 - CLI integration tests in CI pipeline
 - Coverage reporting with pytest-cov
 
 ## Security Focus
+
 - **Defect detection**: Identifies malformed boundaries that could hide malicious content
 - **IP extraction**: `get_server_ipaddress()` with trust levels for forensic analysis
 - **Epilogue analysis**: Detects hidden content in malformed MIME boundaries
 - **Fingerprinting**: Mail and attachment hashing for threat intelligence
 
 ## Build System Specifics
+
 - **pyproject.toml**: Modern Python packaging with hatch backend
 - **uv**: Used instead of pip for faster, reliable dependency resolution
 - **src/ layout**: Package in `src/mailparser/` for cleaner imports
 - **Dynamic versioning**: Version from `src/mailparser/version.py`
 
 ## External Dependencies
+
 - **Outlook support**: Requires system package `libemail-outlook-message-perl` + Perl module `Email::Outlook::Message`
 - **six**: Python 2/3 compatibility (legacy requirement)
 - **Minimal runtime deps**: Only `six>=1.17.0` required
 
 When working with this codebase:
+
 - Use factory functions, not direct MailParser() instantiation
 - Test with various malformed emails from `tests/mails/`
 - Remember header property naming (underscores for hyphens)
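
The copilot instructions in this file recommend splitting headers with Python's `email` module first and then applying a tolerant regex to each "Received" value, falling back to a best-effort result when the pattern does not match. A minimal sketch of that idea follows, using only the standard library. It is not mail-parser's own `receiveds_parsing()` implementation; the sample message, the `parse_received` helper, and the `route`/`date` group names are assumptions made for illustration.

```python
import re
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical sample message: one folded Received header, one without a date.
RAW = (
    "Received: from mail.example.org (mail.example.org [192.0.2.10])\n"
    "\tby mx.example.com with ESMTP id abc123;\n"
    "\tMon, 01 Jan 2024 10:00:00 +0000\n"
    "Received: from localhost by mail.example.org\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

# Adapted from the generic pattern quoted in the instructions. The email
# module already strips the "Received:" name, so only "<route>;<date>" remains.
RECEIVED_RE = re.compile(r"^\s*(?P<route>.*?);\s*(?P<date>.*)$", re.DOTALL)


def parse_received(raw_message):
    """Return one dict per Received header, best effort on malformed values."""
    msg = message_from_string(raw_message)
    hops = []
    for value in msg.get_all("Received", []):
        # Unfold continuation lines into a single line.
        unfolded = re.sub(r"\s*\r?\n\s+", " ", value).strip()
        match = RECEIVED_RE.match(unfolded)
        if match is None:
            # No ";" separator: keep the raw text and mark the date as unknown.
            hops.append({"raw": unfolded, "route": unfolded, "date": None})
            continue
        try:
            date = parsedate_to_datetime(match.group("date").strip())
        except (TypeError, ValueError):
            date = None  # date text present but not parseable
        hops.append(
            {"raw": unfolded, "route": match.group("route").strip(), "date": date}
        )
    return hops


for hop in parse_received(RAW):
    print(hop["route"], "|", hop["date"])
```

The non-greedy `route` group stops at the first semicolon, so a header whose comments contain `;` would be split too early; splitting on the last semicolon is one possible adaptation.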

.github/instructions/markdown.instructions.md

Lines changed: 22 additions & 11 deletions
@@ -7,28 +7,39 @@ applyTo: '**/*.md'
 
 The following markdown content rules are enforced in the validators:
 
-1. **Headings**: Use appropriate heading levels (H2, H3, etc.) to structure your content. Do not use an H1 heading, as this will be generated based on the title.
+1. **Headings**: Use appropriate heading levels (H2, H3, etc.) to structure your content. Do not
+   use an H1 heading, as this will be generated based on the title.
 2. **Lists**: Use bullet points or numbered lists for lists. Ensure proper indentation and spacing.
-3. **Code Blocks**: Use fenced code blocks for code snippets. Specify the language for syntax highlighting.
+3. **Code Blocks**: Use fenced code blocks for code snippets. Specify the language for syntax
+   highlighting.
 4. **Links**: Use proper markdown syntax for links. Ensure that links are valid and accessible.
 5. **Images**: Use proper markdown syntax for images. Include alt text for accessibility.
 6. **Tables**: Use markdown tables for tabular data. Ensure proper formatting and alignment.
 7. **Line Length**: Limit line length to 400 characters for readability.
 8. **Whitespace**: Use appropriate whitespace to separate sections and improve readability.
-9. **Front Matter**: Include YAML front matter at the beginning of the file with required metadata fields.
+9. **Front Matter**: Include YAML front matter at the beginning of the file with required metadata
+   fields.
 
 ## Formatting and Structure
 
 Follow these guidelines for formatting and structuring your markdown content:
 
-- **Headings**: Use `##` for H2 and `###` for H3. Ensure that headings are used in a hierarchical manner. Recommend restructuring if content includes H4, and more strongly recommend for H5.
-- **Lists**: Use `-` for bullet points and `1.` for numbered lists. Indent nested lists with two spaces.
-- **Code Blocks**: Use triple backticks (`) to create fenced code blocks. Specify the language after the opening backticks for syntax highlighting (e.g., `csharp).
-- **Links**: Use `[link text](URL)` for links. Ensure that the link text is descriptive and the URL is valid.
-- **Images**: Use `![alt text](image URL)` for images. Include a brief description of the image in the alt text.
-- **Tables**: Use `|` to create tables. Ensure that columns are properly aligned and headers are included.
-- **Line Length**: Break lines at 80 characters to improve readability. Use soft line breaks for long paragraphs.
-- **Whitespace**: Use blank lines to separate sections and improve readability. Avoid excessive whitespace.
+- **Headings**: Use `##` for H2 and `###` for H3. Ensure that headings are used in a hierarchical
+  manner. Recommend restructuring if content includes H4, and more strongly recommend for H5.
+- **Lists**: Use `-` for bullet points and `1.` for numbered lists. Indent nested lists with two
+  spaces.
+- **Code Blocks**: Use triple backticks (`) to create fenced code blocks. Specify the language
+  after the opening backticks for syntax highlighting (e.g.,`csharp).
+- **Links**: Use `[link text](URL)` for links. Ensure that the link text is descriptive and the
+  URL is valid.
+- **Images**: Use `![alt text](image URL)` for images. Include a brief description of the image in
+  the alt text.
+- **Tables**: Use `|` to create tables. Ensure that columns are properly aligned and headers are
+  included.
+- **Line Length**: Break lines at 80 characters to improve readability. Use soft line breaks for
+  long paragraphs.
+- **Whitespace**: Use blank lines to separate sections and improve readability. Avoid excessive
+  whitespace.
 
 ## Validation Requirements
 

.markdownlint.yaml

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+# Markdownlint configuration
+# See https://github.com/DavidAnson/markdownlint/blob/main/doc/Rules.md
+
+# MD013/line-length - Line length
+MD013:
+  # Disable line length check for code blocks and tables
+  line_length: 120
+  code_blocks: false
+  tables: false
+
+# MD033/no-inline-html - Inline HTML
+MD033:
+  # Allow specific HTML elements commonly used in GitHub markdown
+  allowed_elements:
+    - a
+    - img
+    - br
+
+# MD041/first-line-heading - First line in file should be a top level heading
+MD041: false # Allow files to start with badges or other content
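
To make the effect of the `MD013` settings above concrete, here is a rough Python approximation of a check that flags lines longer than 120 characters while skipping fenced code blocks and table rows. It is only a sketch: markdownlint's real rule has further exceptions that are not modelled here, and the `long_lines` helper and its table detection (a line starting with `|`) are assumptions made for illustration.

```python
def long_lines(markdown_text, limit=120):
    """Return (line_number, length) for over-long lines outside code and tables."""
    offenders = []
    in_fence = False
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        stripped = line.strip()
        # A fence marker toggles code-block state and is never reported itself.
        if stripped.startswith("```") or stripped.startswith("~~~"):
            in_fence = not in_fence
            continue
        # Mirrors code_blocks: false and tables: false in the configuration.
        if in_fence or stripped.startswith("|"):
            continue
        if len(line) > limit:
            offenders.append((number, len(line)))
    return offenders


sample = "\n".join(
    ["# Title", "x" * 130, "```", "y" * 200, "```", "| " + "z" * 150 + " |", "short"]
)
print(long_lines(sample))
```

Only the 130-character prose line is reported for this sample; the 200-character line inside the fence and the wide table row are ignored.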

.pre-commit-config.yaml

Lines changed: 6 additions & 0 deletions
@@ -27,3 +27,9 @@ repos:
         args: [ --fix ]
       # Run the formatter.
       - id: ruff-format
+
+  - repo: https://github.com/igorshubovych/markdownlint-cli
+    rev: v0.42.0
+    hooks:
+      - id: markdownlint
+        args: ['--fix']
