
Commit c8ea55e

Update/annual patches (#115)
* Update README.md with a better description of the Python capabilities
* Update pyproject.toml and Dependencies
* Revert last pyproject.toml command name
* Add non-privileged user to Docker
* Fix failing test cases
* Add supply chain attestations to GitHub Action
* Update for failing Docker builds
1 parent 30df0bd commit c8ea55e

10 files changed (+410 / -57 lines)

.github/workflows/docker-publish.yml

Lines changed: 45 additions & 1 deletion

```diff
@@ -3,6 +3,12 @@ name: Manual Docker Build and Push
 on:
   workflow_dispatch: # Allows manual triggering
 
+# Add permissions for pushing packages and OIDC token
+permissions:
+  contents: read
+  packages: write # Needed to push container images
+  id-token: write # Needed for signing/attestations
+
 jobs:
   build-and-push:
     runs-on: ubuntu-latest
@@ -16,12 +22,50 @@ jobs:
           username: ${{ secrets.DOCKERHUB_USERNAME }}
           password: ${{ secrets.DOCKERHUB_TOKEN }}
 
+      # Install the cosign tool
+      # https://github.com/sigstore/cosign-installer
+      - name: Install cosign
+        uses: sigstore/cosign-installer@v3.5.0
+        with:
+          cosign-release: 'v2.2.4'
+
+      # Setup Docker buildx
+      # https://github.com/docker/build-push-action/issues/461
+      - name: Setup Docker buildx
+        uses: docker/setup-buildx-action@v3
+
+      # Extract metadata (tags, labels) for Docker
+      # https://github.com/docker/metadata-action
+      - name: Extract Docker metadata
+        id: meta
+        uses: docker/metadata-action@v5
+        with:
+          images: sethblack/python-seo-analyzer # Your Docker Hub image
+
+      # Build and push Docker image with attestation
+      # https://github.com/docker/build-push-action
       - name: Build and push Docker image
+        id: build-and-push # Add id to reference outputs
         uses: docker/build-push-action@v5
         with:
           context: .
           push: true
-          tags: sethblack/python-seo-analyzer:latest # Use your Docker Hub username
+          tags: ${{ steps.meta.outputs.tags }} # Use tags from metadata
+          labels: ${{ steps.meta.outputs.labels }} # Use labels from metadata
+          # Attestations for provenance and SBOM
+          attests: |
+            provenance:builder-id=${{ github.workflow }}/${{ github.job_id }}
+            sbom:scan-mode=local,scan-args=--exclude=./tests
+
+      # Sign the resulting Docker image digest.
+      # https://github.com/sigstore/cosign
+      - name: Sign the published Docker image
+        env:
+          # https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#using-an-intermediate-repository-for-the-build
+          COSIGN_EXPERIMENTAL: "true"
+        # This step uses the identity token to provision an ephemeral certificate
+        # against the sigstore community Fulcio instance.
+        run: echo "${{ steps.meta.outputs.tags }}" | xargs -I {} cosign sign --yes {}@${{ steps.build-and-push.outputs.digest }}
 
   publish-to-pypi:
     needs: build-and-push # Optional: Make this job depend on the Docker build
```
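
The signing step uses GitHub's OIDC token for keyless signing, so the published image can be checked with `cosign verify`. A minimal sketch, assuming the workflow signed against the public Fulcio/Rekor instances; the identity regexp is an assumption about this repo's layout, not something recorded in the commit:

```bash
# Verify the keyless signature on the pushed image (sketch; identity values are assumptions)
cosign verify sethblack/python-seo-analyzer:latest \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  --certificate-identity-regexp "^https://github.com/sethblack/python-seo-analyzer/"
```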

Dockerfile

Lines changed: 16 additions & 2 deletions

```diff
@@ -1,4 +1,4 @@
-FROM python:3.12-bullseye
+FROM python:3.13.2-bookworm
 
 RUN apt-get update -y && apt-get upgrade -y
 
@@ -12,6 +12,20 @@ RUN uv cache clean --verbose
 
 COPY . /python-seo-analyzer
 
+# Create a non-root user
+RUN groupadd -r appgroup && useradd --no-log-init -r -g appgroup appuser
+
+# Set ownership of the app directory
+RUN chown -R appuser:appgroup /python-seo-analyzer
+
+# Switch back to root to install the package system-wide
+USER root
 RUN python3 -m pip install /python-seo-analyzer
 
-ENTRYPOINT ["/usr/local/bin/seoanalyze"]
+# Switch back to the non-root user
+USER appuser
+
+WORKDIR /app
+
+ENTRYPOINT ["python-seo-analyzer"]
+CMD ["--version"]
```

README.md

Lines changed: 69 additions & 4 deletions

````diff
@@ -1,10 +1,18 @@
 Python SEO and GEO Analyzer
-===================
+===========================
 
-A modern SEO and GEO analysis tool that combines technical optimization and authentic human value. Beyond traditional site crawling and structure analysis, it uses AI to evaluate content's expertise signals, conversational engagement, and cross-platform presence. It helps you maintain strong technical foundations while ensuring your site demonstrates genuine authority and value to real users.
+[![PyPI version](https://badge.fury.io/py/pyseoanalyzer.svg)](https://badge.fury.io/py/pyseoanalyzer)
+[![Docker Pulls](https://img.shields.io/docker/pulls/sethblack/python-seo-analyzer.svg)](https://hub.docker.com/r/sethblack/python-seo-analyzer)
+
+A modern SEO and GEO (Generative AI Engine Optimization or, better, AI Search Optimization) analysis tool that combines technical optimization and authentic human value. Beyond traditional site crawling and structure analysis, it uses AI to evaluate content's expertise signals, conversational engagement, and cross-platform presence. It helps you maintain strong technical foundations while ensuring your site demonstrates genuine authority and value to real users.
 
 The AI features were heavily influenced by the clickbait-titled SEL article [A 13-point roadmap for thriving in the age of AI search](https://searchengineland.com/seo-roadmap-ai-search-449199).
 
+Note About Python
+-----------------
+
+I've written quite a bit about the speed of Python and how there are very specific use cases where it isn't the best choice. I feel like crawling websites is definitely one of those cases. I wrote this tool in Python around 2010 to solve the very specific need of crawling some small HTML-only websites for startups I was working at. I'm excited to see how much it has grown and how many people are using it. I feel like Python SEO Analyzer is acceptable for most smaller use cases, but if you are looking for something faster, I've built a much faster and more comprehensive tool, [Black SEO Analyzer](https://github.com/sethblack/black-seo-analyzer).
+
 Installation
 ------------
 
@@ -16,10 +24,67 @@ pip install pyseoanalyzer
 
 ### Docker
 
-The docker image is available on [Docker Hub](https://hub.docker.com/r/sethblack/python-seo-analyzer) and can be run with the same command-line arguments as the script.
+#### Using the Pre-built Image from Docker Hub
+
+The easiest way to use the Docker image is to pull it directly from [Docker Hub](https://hub.docker.com/r/sethblack/python-seo-analyzer).
+
+```bash
+# Pull the latest image
+docker pull sethblack/python-seo-analyzer:latest
+
+# Run the analyzer (replace example.com with the target URL)
+# The --rm flag automatically removes the container when it exits
+docker run --rm sethblack/python-seo-analyzer http://example.com/
+
+# Run with specific arguments (e.g., sitemap and HTML output)
+# Note: If the sitemap is local, you'll need to mount it (see mounting example below)
+docker run --rm sethblack/python-seo-analyzer http://example.com/ --sitemap /path/inside/container/sitemap.xml --output-format html
+
+# Run with AI analysis (requires ANTHROPIC_API_KEY)
+# Replace "your_api_key_here" with your actual Anthropic API key
+docker run --rm -e ANTHROPIC_API_KEY="your_api_key_here" sethblack/python-seo-analyzer http://example.com/ --run-llm-analysis
+
+# Save HTML output to your local machine
+# This mounts the current directory (.) into /app/output inside the container.
+# The output file 'results.html' will be saved in your current directory.
+# The tool outputs JSON by default to stdout, so we redirect it for HTML.
+# Since the ENTRYPOINT handles the command, we redirect the container's stdout.
+# We need a shell inside the container to handle the redirection.
+docker run --rm -v "$(pwd):/app/output" sethblack/python-seo-analyzer /bin/sh -c "seoanalyze http://example.com/ --output-format html > /app/output/results.html"
+# Note for Windows CMD users: Use %cd% instead of $(pwd)
+# docker run --rm -v "%cd%:/app/output" sethblack/python-seo-analyzer /bin/sh -c "seoanalyze http://example.com/ --output-format html > /app/output/results.html"
+# Note for Windows PowerShell users: Use ${pwd} instead of $(pwd)
+# docker run --rm -v "${pwd}:/app/output" sethblack/python-seo-analyzer /bin/sh -c "seoanalyze http://example.com/ --output-format html > /app/output/results.html"
+
+
+# Mount a local sitemap file
+# This mounts 'local-sitemap.xml' from the current directory to '/app/sitemap.xml' inside the container
+docker run --rm -v "$(pwd)/local-sitemap.xml:/app/sitemap.xml" sethblack/python-seo-analyzer http://example.com/ --sitemap /app/sitemap.xml
+# Adjust paths and Windows commands as needed (see volume mounting example above)
 
 ```
-docker run sethblack/python-seo-analyzer [ARGS ...]
+
+#### Building the Image Locally
+
+You can also build the Docker image yourself from the source code. Make sure you have Docker installed and running.
+
+```bash
+# Clone the repository (if you haven't already)
+# git clone https://github.com/sethblack/python-seo-analyzer.git
+# cd python-seo-analyzer
+
+# Build the Docker image (tag it as 'my-seo-analyzer' for easy reference)
+docker build -t my-seo-analyzer .
+
+# Run the locally built image
+docker run --rm my-seo-analyzer http://example.com/
+
+# Run with AI analysis using the locally built image
+docker run --rm -e ANTHROPIC_API_KEY="your_api_key_here" my-seo-analyzer http://example.com/ --run-llm-analysis
+
+# Run with HTML output saved locally using the built image
+docker run --rm -v "$(pwd):/app/output" my-seo-analyzer /bin/sh -c "seoanalyze http://example.com/ --output-format html > /app/output/results.html"
+# Adjust Windows commands as needed (see volume mounting example above)
 ```
 
 Command-line Usage
````

pyproject.toml

Lines changed: 13 additions & 10 deletions

```diff
@@ -4,18 +4,21 @@ build-backend = "hatchling.build"
 
 [project]
 name = "pyseoanalyzer"
-version = "2024.12.12"
+version = "2025.4.3"
 authors = [
     {name = "Seth Black", email = "sblack@sethserver.com"},
 ]
 dependencies = [
-    "beautifulsoup4>=4.12.3",
-    "certifi>=2024.8.30",
-    "Jinja2>=3.1.4",
-    "lxml>=5.3.0",
-    "MarkupSafe>=3.0.2",
-    "trafilatura>=2.0.0",
-    "urllib3>=2.2.3",
+    "beautifulsoup4==4.13.3",
+    "certifi==2025.1.31",
+    "Jinja2==3.1.6",
+    "langchain==0.3.22",
+    "langchain-anthropic==0.3.10",
+    "lxml==5.3.1",
+    "MarkupSafe==3.0.2",
+    "python-dotenv==1.1.0",
+    "trafilatura==2.0.0",
+    "urllib3==2.3.0",
 ]
 requires-python = ">= 3.8"
 description = "An SEO tool that analyzes the structure of a site, crawls the site, count words in the body of the site and warns of any technical SEO issues."
@@ -47,9 +50,9 @@ classifiers = [
 ]
 
 [project.scripts]
-seoanalyze = "pyseoanalyzer.__main__:main"
+python-seo-analyzer = "pyseoanalyzer.__main__:main"
 
 [project.urls]
 Homepage = "https://github.com/sethblack/python-seo-analyzer"
 Repository = "https://github.com/sethblack/python-seo-analyzer.git"
-Issues = "https://github.com/sethblack/python-seo-analyzer/issues"
+Issues = "https://github.com/sethblack/python-seo-analyzer/issues"
```

pyseoanalyzer/__init__.py

Lines changed: 22 additions & 0 deletions

```diff
@@ -1,3 +1,25 @@
 #!/usr/bin/env python3
 
+import sys
+
+# Use importlib.metadata (available in Python 3.8+) to get the version
+# defined in pyproject.toml. This avoids duplicating the version string.
+if sys.version_info >= (3, 8):
+    from importlib import metadata
+else:
+    # Fallback for Python < 3.8 (requires importlib-metadata backport)
+    # Consider adding 'importlib-metadata; python_version < "3.8"' to dependencies
+    # if you need to support older Python versions.
+    import importlib_metadata as metadata
+
+try:
+    # __package__ refers to the package name ('pyseoanalyzer')
+    __version__ = metadata.version(__package__)
+except metadata.PackageNotFoundError:
+    # Fallback if the package is not installed (e.g., when running from source)
+    # You might want to handle this differently, e.g., raise an error
+    # or read from a VERSION file. For now, setting it to unknown.
+    __version__ = "0.0.0-unknown"
+
+
 from .analyzer import analyze
```
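
With the version now resolved from installed package metadata, a one-liner confirms the single-sourcing works. A sketch; the fallback string only appears when the package isn't installed (e.g., running straight from a source checkout):

```bash
python3 -c "import pyseoanalyzer; print(pyseoanalyzer.__version__)"
# installed package: 2025.4.3
# uninstalled source tree: 0.0.0-unknown (the PackageNotFoundError fallback)
```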

pyseoanalyzer/__main__.py

Lines changed: 8 additions & 3 deletions

```diff
@@ -4,15 +4,20 @@
 import inspect
 import json
 import os
+import sys
 
 from .analyzer import analyze
+from . import __version__
 
 
 def main():
     module_path = os.path.dirname(inspect.getfile(analyze))
-
-    arg_parser = argparse.ArgumentParser()
-
+    arg_parser = argparse.ArgumentParser(
+        description="Analyze SEO aspects of a website."
+    )
+    arg_parser.add_argument(
+        "--version", action="version", version=f"%(prog)s {__version__}"
+    )
     arg_parser.add_argument("site", help="URL of the site you are wanting to analyze.")
     arg_parser.add_argument(
         "-s", "--sitemap", help="URL of the sitemap to seed the crawler with."
```

pyseoanalyzer/page.py

Lines changed: 19 additions & 13 deletions

```diff
@@ -81,10 +81,10 @@ def __init__(
         self.analyze_extra_tags = analyze_extra_tags
         self.encoding = encoding
         self.run_llm_analysis = run_llm_analysis
-        self.title: str
-        self.author: str
-        self.description: str
-        self.hostname: str
+        self.title: str = ""
+        self.author: str = ""
+        self.description: str = ""
+        self.hostname: str = ""
         self.sitename: str
         self.date: str
         self.keywords = {}
@@ -224,15 +224,21 @@ def analyze(self, raw_html=None):
         )
 
         # I want to grab values from this even if they don't exist
-        metadata = metadata.as_dict() if metadata else {}
-
-        self.title = metadata.get("title", "")
-        self.author = metadata.get("author", "")
-        self.description = metadata.get("description", "")
-        self.hostname = metadata.get("hostname", "")
-        self.sitename = metadata.get("sitename", "")
-        self.date = metadata.get("date", "")
-        metadata_keywords = metadata.get("keywords", "")
+        metadata_dict = metadata.as_dict() if metadata else {}
+
+        # Helper function to get value or default to "" if None or 'None'
+        def get_meta_value(key):
+            value = metadata_dict.get(key)
+            return "" if value is None or value == "None" else value
+
+        # Ensure fields are strings, defaulting to "" if None or 'None'
+        self.title = get_meta_value("title")
+        self.author = get_meta_value("author")
+        self.description = get_meta_value("description")
+        self.hostname = get_meta_value("hostname")
+        self.sitename = get_meta_value("sitename")
+        self.date = get_meta_value("date")
+        metadata_keywords = get_meta_value("keywords")
 
         if len(metadata_keywords) > 0:
             self.warn(
```
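
The helper exists because the extracted metadata can hold either a real `None` or the literal string `"None"`; both now collapse to an empty string. A standalone sketch of that behavior, using a hypothetical sample dict rather than real trafilatura output:

```bash
python3 - <<'PY'
# standalone copy of the commit's helper; metadata_dict is a made-up sample
metadata_dict = {"title": "Example Domain", "author": None, "description": "None"}

def get_meta_value(key):
    value = metadata_dict.get(key)
    return "" if value is None or value == "None" else value

print(repr(get_meta_value("title")))        # 'Example Domain'
print(repr(get_meta_value("author")))       # ''  (real None)
print(repr(get_meta_value("description")))  # ''  (literal string 'None')
PY
```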

pyseoanalyzer/website.py

Lines changed: 14 additions & 8 deletions

```diff
@@ -78,18 +78,24 @@ def crawl(self):
                 if page.parsed_url.netloc != page.base_domain.netloc:
                     continue
 
-                page.analyze()
+                # Analyze the page and check if successful
+                analysis_successful = page.analyze()
 
-                self.content_hashes[page.content_hash].add(page.url)
-                self.wordcount.update(page.wordcount)
-                self.bigrams.update(page.bigrams)
-                self.trigrams.update(page.trigrams)
+                # Only process and add the page if analysis completed
+                if analysis_successful:
+                    self.content_hashes[page.content_hash].add(page.url)
+                    self.wordcount.update(page.wordcount)
+                    self.bigrams.update(page.bigrams)
+                    self.trigrams.update(page.trigrams)
 
-                self.page_queue.extend(page.links)
+                    # Only add links if following is enabled and analysis was successful
+                    if self.follow_links:
+                        self.page_queue.extend(page.links)
 
-                self.crawled_pages.append(page)
-                self.crawled_urls.add(page.url)
+                    self.crawled_pages.append(page)
+                    self.crawled_urls.add(page.url)
 
+                # Stop after the first page if not following links, regardless of analysis success
                 if not self.follow_links:
                     break
             except Exception as e:
```

requirements.txt

Lines changed: 9 additions & 7 deletions

```diff
@@ -1,10 +1,12 @@
-beautifulsoup4==4.12.3
-certifi==2024.8.30
+beautifulsoup4==4.13.3
+certifi==2025.1.31
 Jinja2==3.1.6
-langchain==0.3.11
-langchain-anthropic==0.3.0
-lxml==5.3.0
+langchain==0.3.22
+langchain-anthropic==0.3.10
+lxml==5.3.1
 MarkupSafe==3.0.2
-python-dotenv==1.0.1
+pytest==8.3.2 # Added for testing
+pytest-mock==3.14.0 # Added for testing
+python-dotenv==1.1.0
 trafilatura==2.0.0
-urllib3==2.2.3
+urllib3==2.3.0
```
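
With pytest and pytest-mock now pinned alongside the runtime dependencies, the test fixes mentioned in the commit message can be re-run locally. A minimal sketch, assuming the tests sit on pytest's default discovery path:

```bash
pip install -r requirements.txt

# run the suite this commit fixes
python -m pytest
```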
