The typical problem at a large media outlet: volume + legacy
Expreso showed the classic problem areas of a site with years of content:
- 404s and other 4XX errors from old URLs, URLs invented by bots, or URLs generated by incorrect patterns.
- Broken images (missing assets) that triggered the “Broken images” and “Page has broken image” audit warnings.
- Mixed content (HTTPS pages with HTTP resources).
- Duplicate meta tags (multiple meta descriptions) caused by overlapping components.
- Intermittent sitemaps (occasional 503s caused by load, proxy, or cache).
- External links that redirect (http → https, shorteners).
At a media outlet, these details multiply quickly: without a maintenance system the site can keep publishing non-stop, but every week the “noise” grows and the reports become harder to control.
Strategy: attack in layers (server → CDN → WordPress).
The logic was simple: instead of chasing bugs one by one, we built a cleaning and prevention system.
1) NGINX: global hygiene to reduce noise and protect resources
The first layer filtered what reaches the site:
- Typical scanner requests (.env, wp-config.php, admin.php, shells, fake paths).
- Repetitive paths that bloat logs and audits (e.g. mraid.js variants, pagespeed, etc.).
- “Hygienic” behavior for 4XX responses without breaking legitimate endpoints.
The rules were organized into snippets so as not to put the vhost at risk, and a suitable response was applied to each case:
- 444 / 410 / 204 depending on the case, so junk traffic is never passed to PHP (444 is NGINX's non-standard code that closes the connection without sending a response).
- Clear exceptions where the site must respond (e.g. valid .well-known or real routes).
Impact: less wasted load, fewer false errors and a more stable base for crawling.
2) CDN (S3/CloudFront): fix “Broken images” at scale
This is where the real volume was.
At a media outlet, many historical images are lost or deleted over time. Recovering them is not feasible, but it is possible to prevent the consequences:
- users seeing broken images,
- crawlers finding thousands of 404s,
- the site looking permanently “dirty” in audits.
Practical solution applied in Expreso: placeholders in S3 + invalidation in CloudFront.
Flow:
- From the report (CSV), extract the URLs of images returning 4XX on cdn.expreso.press.
- Upload a minimal placeholder (1px or a lightweight image) with the correct Content-Type.
- Invalidate CloudFront to clear the cached 404s.
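The first step of this flow can be sketched in a few lines of Python. This is only an illustration, not the script used at Expreso: the column names `URL` and `HTTP status code` are assumptions about the audit export, and should be adjusted to whatever the report actually contains.

```python
import csv
from urllib.parse import urlsplit

CDN_HOST = "cdn.expreso.press"  # CDN host named in the text


def broken_image_urls(report_lines, url_col="URL", status_col="HTTP status code"):
    """Return the unique CDN URLs that the report lists with a 4XX status.

    `report_lines` is any iterable of CSV lines (an open file works).
    The column names are hypothetical defaults, not a known export format.
    """
    seen = set()
    urls = []
    for row in csv.DictReader(report_lines):
        url = (row.get(url_col) or "").strip()
        status = (row.get(status_col) or "").strip()
        # Keep only client errors (400-499); skip blank or non-numeric cells.
        if not status.isdigit() or not 400 <= int(status) <= 499:
            continue
        # Keep only assets served from the CDN host.
        if urlsplit(url).hostname != CDN_HOST:
            continue
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Typical use would be `broken_image_urls(open("report.csv", newline="", encoding="utf-8"))`, with the resulting list feeding the upload and invalidation steps.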
Key lesson: the naive AWS CLI approach (one head-object call per URL) can become very slow or hang on large lists. For Expreso, we migrated to a “FAST” method using boto3 (the SDK) with parallelism and timeouts, and with progress visible in the logs.
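A minimal sketch of what such a parallel check might look like is shown below. It is not the actual Expreso script: the function names, worker count, timeout values and log format are all assumptions. The S3 client is passed in so the checking logic can be exercised without AWS credentials.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def make_s3_client(workers=16, connect_timeout=5, read_timeout=10):
    """Build a boto3 S3 client with explicit timeouts (values are illustrative)."""
    import boto3
    from botocore.config import Config

    cfg = Config(
        connect_timeout=connect_timeout,
        read_timeout=read_timeout,
        retries={"max_attempts": 2},
        max_pool_connections=workers,
    )
    return boto3.client("s3", config=cfg)


def key_exists(client, bucket, key):
    """Return True if the object exists, False on a 404; re-raise anything else."""
    try:
        client.head_object(Bucket=bucket, Key=key)
        return True
    except Exception as exc:
        # botocore's ClientError exposes the error code under exc.response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code in ("404", "NoSuchKey", "NotFound"):
            return False
        raise


def find_missing_keys(client, bucket, keys, workers=16, log_every=100):
    """Check keys in parallel; return (missing keys, [(key, error), ...])."""
    keys = list(keys)
    missing, failed = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(key_exists, client, bucket, k): k for k in keys}
        for done, fut in enumerate(as_completed(futures), start=1):
            key = futures[fut]
            try:
                if not fut.result():
                    missing.append(key)
            except Exception as exc:  # timeouts, access denied, etc.
                failed.append((key, repr(exc)))
            if done % log_every == 0 or done == len(keys):
                print(f"checked {done}/{len(keys)} "
                      f"missing={len(missing)} failed={len(failed)}")
    return sorted(missing), failed
```

Collecting failures in a separate list, rather than aborting, keeps one slow or forbidden key from stalling the whole batch, which is the failure mode described for the one-call-per-URL approach.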
Important detail:
- Some reports list paths that already start with wp-content/uploads/..., so the system must avoid producing duplicated keys (wp-content/uploads/wp-content/uploads/...).
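One way to guard against that duplication is to normalize every input to a single key form before touching S3. The sketch below assumes that objects live under the `wp-content/uploads/` prefix in the bucket and that inputs may be full URLs, absolute paths or relative paths; `to_s3_key` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

UPLOADS_PREFIX = "wp-content/uploads/"  # assumed key prefix in the bucket


def to_s3_key(path_or_url, prefix=UPLOADS_PREFIX):
    """Normalize a report entry to an S3 key that carries the prefix exactly once."""
    # urlsplit drops scheme, host and query string; plain paths pass through.
    path = urlsplit(path_or_url.strip()).path.lstrip("/")
    # Collapse a prefix that was already duplicated upstream.
    while path.startswith(prefix + prefix):
        path = path[len(prefix):]
    # Add the prefix only when it is missing.
    if not path.startswith(prefix):
        path = prefix + path
    return path
```

Percent-encoded characters are left untouched here; if the report contains encoded URLs, decoding them before building keys would be an extra step.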
3) Mixed content: cut the “HTTP inside HTTPS”
Mixed content doesn't always break the page, but it does leave:
- security alerts
- negative reports
- technical inconsistencies
The work standardized:
- resources and embeds on HTTPS
- correction of legacy patterns (http://...) where applicable
- operating rule: nothing “insecure” in linked resources
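To find those legacy patterns in stored HTML, a small scanner along these lines could help. It is a simplified sketch using only the standard library: the tag-to-attribute table is an assumption about which resources matter, it ignores `srcset` and inline CSS, and it deliberately skips `<a href>` because navigation links are not mixed content.

```python
from html.parser import HTMLParser

# Tags whose attributes load a subresource (illustrative, not exhaustive).
RESOURCE_ATTRS = {
    "img": ("src",),
    "script": ("src",),
    "iframe": ("src",),
    "embed": ("src",),
    "source": ("src",),
    "audio": ("src",),
    "video": ("src", "poster"),
    "object": ("data",),
    "link": ("href",),
}


class MixedContentFinder(HTMLParser):
    """Collect (tag, attribute, url) for resources referenced over http://."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # For <link>, only stylesheets are treated as resources in this sketch.
        if tag == "link" and "stylesheet" not in (attr_map.get("rel") or "").lower():
            return
        for name in RESOURCE_ATTRS.get(tag, ()):
            value = (attr_map.get(name) or "").strip()
            if value.lower().startswith("http://"):
                self.found.append((tag, name, value))


def find_insecure_resources(html):
    """Return every insecure resource reference found in an HTML fragment."""
    finder = MixedContentFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

Run against post content exported from the database, the output gives a list of candidates to rewrite to https:// or to review by hand when the remote host does not support HTTPS.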
4) External links with redirection (External 3XX)
In audits, it is common to find:
- external links on http:// redirecting to https://
- shorteners (bit.ly, tinyl...) that add hops
- domains that changed their canonical host or redirect to unexpected paths
At Expreso the work was split into:
- Auto: same host (safe normalization http→https / www→no-www).
- Review: shorteners, odd redirect chains, redirects to login/paywall, etc.
This makes it possible to batch-fix what is safe and leave the rest for manual review, without taking risks.
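The Auto/Review split can be expressed as a small, deliberately conservative classifier. The sketch below is an assumption about how such a rule might be written, not the rule used at Expreso: the shortener list is a placeholder to extend, and anything that is not a pure scheme or `www.` change on the same host, path and query goes to review.

```python
from urllib.parse import urlsplit

SHORTENERS = {"bit.ly"}  # placeholder list; extend with the shorteners you see


def _bare_host(url):
    """Lower-cased hostname without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def classify_redirect(source_url, final_url, hops=1):
    """Return 'auto' when the link can be rewritten safely, else 'review'."""
    src, dst = urlsplit(source_url), urlsplit(final_url)
    src_host, dst_host = _bare_host(source_url), _bare_host(final_url)
    # Shorteners and multi-hop chains always need a human look.
    if src_host in SHORTENERS or hops > 1:
        return "review"
    same_host = src_host != "" and src_host == dst_host
    # Treat an empty path and "/" as the same target.
    same_target = (src.path or "/") == (dst.path or "/") and src.query == dst.query
    if same_host and same_target and dst.scheme == "https":
        return "auto"
    return "review"
```

Requiring an identical path and query means a redirect to a login or paywall page on the same domain still lands in the review pile.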
Result: moving from “firefighting” to operating with control
Once the layers (server + CDN + normalization) have been applied, the site is:
- cleaner for crawlers,
- with fewer visible breaks,
- and with a clear maintenance routine so that it does not accumulate again.
Future operation (what keeps Expreso healthy)
Simple technical rules for content editors (without getting into the content itself)
- Always paste links in https
- Avoid shorteners: paste the final URL
- Images only from the Media Library (no hotlinking)
- Embeds/iframes always in https
Weekly routine for admins (20-30 minutes)
- Run an audit (Ahrefs or similar) and address issues in this order:
  - Broken images
  - Links to broken pages
  - Mixed content
  - Multiple meta descriptions
  - Timeouts / 5XX
- Run batch fixes (scripts + invalidation) when applicable.
- Validate sitemaps (if they are intermittent, fix that at the performance/proxy/cache level, not as a “content” issue).
Conclusion
In an environment such as Expreso, technical SEO is not solved with a single isolated tweak: it is solved with system + routine.
When noise is controlled, delivery is stabilized and broken assets are corrected at scale, the site becomes more crawlable, more reliable and much easier to maintain week to week.