Coding May 2, 2026 2 min read Hacker News (Top) EN

Show HN: Large Scale Article Extract of Newspapers 1730s-1960s

A pioneering effort to digitize and index historical newspapers has yielded a vast, searchable archive of over 600,000 pages, leveraging machine learning to overcome the technical hurdles of layout variability, font sizes, and image quality. The resulting dataset, built from the Chronicling America collection, now enables semantic and agentic search capabilities, a significant improvement over existing keyword-based search tools. This achievement marks a crucial step in unlocking the historical record for researchers and the public. AI-assisted, human-reviewed.

Hello HN, over the past 7 months I've spent nearly 3,000 hours building SNEWPAPERS, the first historical newspaper archive with full-text extractions, nearly perfect OCR, a vast categorization taxonomy, and of course semantic and agentic search capabilities.

Problem: I wanted to search through newspaper archives, but every service I tried only lets you search by keyword and date, and gives you back raw images of the papers, too many of them and with no context. A sea of noise.

Solution: I taught machines how to read the newspapers, and so far I've extracted the content from more than 600k pages (about 5TB) of the Chronicling America collection.

Problems I had to deal with included an infinite variety of layouts, font sizes, image scan qualities, resolutions, and aspect ratios, as well as navigating around the images on the page. I also had to figure out how to get OCR to be nearly perfect so people
