Posted by: Diabolic Preacher | July 18, 2007

use captchas to help correct book scans

two birds with one stone!

reCAPTCHA is a very interesting project that puts to use the effort millions of people spend typing out the distorted text shown in what are popularly known as captcha images.

So how exactly does this work? First of all, the system is developed by the same people who developed the original captcha system. How well this effort succeeds depends on how many blogs and news sites adopt reCAPTCHA as the captcha that keeps bots out of their feedback or comment systems. What the system aims to do is correct the scanned text of books that were published before the computer age (kinda like those in the public domain) and are being scanned by the Internet Archive.

Now, the effort isn’t just to make free PDF exports of unreadable, smudgy textbooks. The books are meant to be digitized and stored in searchable online libraries like the Internet Archive itself. Although OCR (optical character recognition) technology has improved a lot, it still doesn’t give highly accurate results. So this is where human intervention comes to the rescue… but reading whole books and digitizing them manually, word by word!! That’s just freakin’ too much, ain’t it? reCAPTCHA aims to harness the collective effort of all commenters on all blogs implementing it: each commenter is asked to type one word that the OCR failed to identify, along with another word that the OCR had already identified correctly. Sort of like checking that you are human with the known word and then trusting your answer for the unknown one.
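To make the two-word idea a bit more concrete, here is a rough Python sketch of how such a check could work. This is only my guess at the logic, not reCAPTCHA's actual code: the function names, the in-memory `votes` store, the case-insensitive comparison and the `min_votes` threshold are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory store: id of an unknown-word image -> tally of guesses.
votes = defaultdict(Counter)


def normalize(text):
    # Assumption: answers are compared case-insensitively, ignoring outer spaces.
    return text.strip().lower()


def check_answer(control_answer, typed_control, unknown_id, typed_unknown):
    """Return True if the visitor passes the captcha.

    The visitor passes only if the control word (the one the OCR already
    read correctly) is typed correctly. Only then is the guess for the
    unknown word recorded as a vote.
    """
    if normalize(typed_control) != normalize(control_answer):
        return False
    votes[unknown_id][normalize(typed_unknown)] += 1
    return True


def accepted_reading(unknown_id, min_votes=3):
    """Return the most common guess once it has at least min_votes, else None."""
    if not votes[unknown_id]:
        return None
    word, count = votes[unknown_id].most_common(1)[0]
    return word if count >= min_votes else None
```

So a commenter who gets the known word wrong is turned away and their guess is thrown out, while a commenter who gets it right adds one vote for the unknown word. Once enough people agree on the same reading, `accepted_reading` returns it.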

Some doubts you have there, eh? Well, every new effort can and should be doubted so we can find scope for further improvement, but not doubted just because we are resistant to change.

My blog friends who use WordPress and are able to install plugins are requested to make use of this reCAPTCHA system and pass on the word, so that more and more blogs and hence morer and morer (hehe) users can contribute a wee bit of their time and observation powers towards the digitizing of old text resources.

The project has plugins for WordPress and MediaWiki (the software that powers Wikipedia), as well as code snippets for PHP to add to your custom-designed sites.

p.s. reCAPTCHA has an audio option as well, to help vision-impaired visitors to your blog.

