

Rip and read: 6 OCR tools put to the test

Journalists have been handed wonky PDF files or had to scan mountains of paper documents for years, but until relatively recently there hasn’t been an easy way to translate those docs into digital text. Several tools for converting PDF files into text using optical character recognition, or OCR for short, have popped up recently, but which one works best?

To see which OCR tools did the job and which ones fell flat, a one-page online document was printed and scanned on an HP DeskJet F4280 printer at 200 DPI. The results are below and you can view the original document here.
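For readers who want to put a number on "accuracy" rather than eyeball it, here is a minimal sketch in Python. It was not part of the original test, and the function name `ocr_accuracy` is hypothetical. It assumes you have the original document's text and a tool's OCR output as plain strings, and it uses the standard library's `difflib` to score how closely they match.

```python
from difflib import SequenceMatcher


def ocr_accuracy(reference: str, ocr_output: str) -> float:
    """Return a rough 0-to-1 similarity score between two texts.

    This is an illustrative measure, not the method used in the test above.
    """
    # Collapse runs of whitespace so line-wrapping differences
    # between the original and the OCR output are not counted as errors.
    ref = " ".join(reference.split())
    out = " ".join(ocr_output.split())
    # autojunk=False keeps difflib from ignoring frequent characters
    # in longer documents.
    return SequenceMatcher(None, ref, out, autojunk=False).ratio()


if __name__ == "__main__":
    original = "The quick brown fox jumps over the lazy dog."
    scanned = "The qu1ck brovvn fox jumps over the lazy dog"
    print(f"Similarity: {ocr_accuracy(original, scanned):.2%}")
```

A score of 1.0 means the two texts are identical after whitespace is normalized. Lower scores mean more garbled or missing characters, though the number is only a rough guide and does not say where the errors are.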

 

SimpleOCR

Downloadable software available for PC

Accuracy: The software gets the majority of the text right, but portions of the document are translated into indecipherable characters, especially the italic text.

View results of OCR with SimpleOCR

 

DocumentCloud

Private document storehouse and analysis tool for newsrooms and journalists

Accuracy: Pretty close with a few errors here and there.

View results of OCR with DocumentCloud

 

SayWhat Translator

OCR app available from iTunes for $9.99

Accuracy: Total fail. Couldn’t recognize a single word. The results aren’t much better with larger or less text.

 

 

Google Docs

Free document creation, sharing, and storage system with OCR feature

Accuracy: Close to perfect with a few odd characters throughout the text.

View results of OCR with Google Docs

 

OCR Online

Online OCR and conversion tool; several format and language options; free with restrictions

Accuracy: Near perfect with a few missing punctuation marks. Great results for a free tool.

View results of OCR with OCR Online

 

Adobe Acrobat X Pro

PDF management software with OCR capabilities; $499

Accuracy: Results are near perfect and comparable to OCR Online. That means unless you already have the program or are willing to pay 500 bucks, OCR Online is the more attractive choice.

View results of OCR with Acrobat X Pro


5 Creative uses of DocumentCloud

Since DocumentCloud burst onto the scene in 2009, newsrooms have used the tool to publish, analyze, and annotate all sorts of documents, including police reports, government documents, and court records. DocumentCloud can, however, be used for more than just your run-of-the-mill documents and in a variety of creative ways, as evidenced by the examples below.

 

WNYC: NYC Ballot Design

How do you show the public that a ballot for an upcoming election is flawed and hard to use? WNYC found its answer in DocumentCloud and used the site’s annotation tool to highlight flaws in the design of a NYC ballot and how they could be corrected.

 

ProPublica: The Magnetar Trade

After a hedge fund named Magnetar declined to comment on detailed questions posed to it by ProPublica, the site decided to turn the tables and publish the actual letter the company sent in response. In the investigative story, ProPublica linked directly to the paragraphs in the letter where Magnetar refused to comment. Here’s an example.

 

Washington Post: U.S. Constitution in the news

For a project for The Post created by yours truly, we used one of the oldest documents in American history, the U.S. Constitution, and denoted which sections are currently under political debate or referenced in recent news.

 

Las Vegas Sun: Do No Harm

Posting a bunch of online documents can be useful but daunting for the reader. For a five-part story on preventable injuries in local hospitals, the Sun uploaded the documents to DocumentCloud and linked to them in each article. The docs are also cleverly organized on a standard webpage with easy-to-identify thumbnails and in a unique interactive visualization.

 
PBS NewsHour: Mark Twain

NewsHour used DocumentCloud to publish a century-old manuscript of an unpublished Mark Twain essay that had been sitting in archives. Those who find Twain’s handwriting hard to read can switch between the original document and an easier-to-read text view.

Many thanks to the DocumentCloud team for their assistance in crafting this post.

Should PDFs be a necessary part of a news site?

The advantage of reading news on the net is that anyone can read exactly what they want without leafing through huge newspaper pages or sitting through long broadcasts. A PDF version of a newspaper is a great way to bridge the gap between the print and online product and gives readers the news they want in a compact format.

PDF, or Portable Document Format, is a file format developed by Adobe that packs text, graphics and fonts into a single file. A PDF document is somewhat similar to an HTML page: it may contain hyperlinks and multimedia elements, and it can be downloaded and printed or saved for later reading.


Metro Newspapers offers PDFs of its Silicon Valley, Santa Cruz, and North Bay California papers. The UK-based Telegraph has archived PDF versions of its afternoon paper, Telegraph pm, and the Santa Monica Daily Press does the same for its paper. Find more magazine PDFs here.

So how do you do it? A talented copy editor can lay out selected stories and images and, depending on the program being used, export the file to Adobe PDF. The file can then be uploaded to your site and made available as a link.

If you don’t have a copy editor to spare, xFruits offers a unique tool that converts your RSS feeds into a handy, though not as visually appealing, interactive document in a few minutes. Imagine offering users a sports section dedicated exclusively to their favorite team, based on an existing RSS feed. To work properly, the RSS feed must include the full content of each article, not just a link to it. Check out the xFruits-created 10,000 Words PDF here.
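If you are not sure whether your feed carries full content or just teasers, a quick check can save you a failed conversion. The sketch below is not an xFruits tool. It is a hypothetical helper, `items_missing_content`, that assumes an RSS 2.0 feed whose article body lives in either `content:encoded` or `description`, and it treats anything under an arbitrary 200 characters as "just a link."

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace commonly used for the full-text <content:encoded> element.
CONTENT_ENCODED = "{http://purl.org/rss/1.0/modules/content/}encoded"


def items_missing_content(rss_xml: str, min_chars: int = 200) -> List[str]:
    """Return titles of feed items whose body looks too short to be a full article.

    The 200-character threshold is an assumption chosen for illustration.
    """
    root = ET.fromstring(rss_xml)
    missing = []
    for item in root.iter("item"):
        title = item.findtext("title", default="(untitled)")
        # Prefer the full-text element; fall back to the description.
        body = item.findtext(CONTENT_ENCODED) or item.findtext("description") or ""
        if len(body.strip()) < min_chars:
            missing.append(title)
    return missing


if __name__ == "__main__":
    sample = (
        '<rss version="2.0" '
        'xmlns:content="http://purl.org/rss/1.0/modules/content/">'
        "<channel>"
        "<item><title>Full story</title>"
        "<content:encoded>" + "Lots of article text. " * 20 + "</content:encoded>"
        "</item>"
        "<item><title>Teaser only</title>"
        "<description>Read more on our site.</description></item>"
        "</channel></rss>"
    )
    print(items_missing_content(sample))
```

Any titles the function returns are items that would likely show up in the generated PDF as little more than a headline.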