Thursday, 30 September 2010

Captcha basics: pre-processing

Before writing my updates about the captchas, I want to explain that there are a few phases that occur every time (all of which are solvable with enough resources).
This post describes the basics of pre-processing.
The three phases:
  1. Pre-processing
  2. Segmentation
  3. Characterizing

To make segmenting the characters harder, captcha makers usually insert random pixels (noise) into the images.
Noisy pixels

As seen in the image above, a lot of noisy pixels have been added and some of the numbers are slightly faded.

Looked at closely, a noisy pixel has values like these:
A:255  R:189  G:21  B:86

Whereas a normal pixel looks like this:
A:255  R:15  G:12  B:31

Conclusion: In the black pixels that we want, the R, G and B values are ALWAYS around the same value as each other. In the noisy pixels they are far apart.
Note: the A value, or alpha layer, sets the transparency, which is not used here.
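To make that conclusion concrete, here is a minimal Python sketch that measures how far apart the colour channels of a pixel are. The function name `channel_spread` is my own, and measuring the "difference" as largest channel minus smallest channel is an assumption; the two sample pixels are the ones listed above.

```python
def channel_spread(r, g, b):
    """Return the gap between the largest and smallest colour channel."""
    return max(r, g, b) - min(r, g, b)


noisy = (189, 21, 86)   # the noisy pixel from above (alpha ignored)
normal = (15, 12, 31)   # the normal pixel from above (alpha ignored)

print(channel_spread(*noisy))   # 168
print(channel_spread(*normal))  # 19
```

The noisy pixel has a large spread and the normal one a small spread, which is what makes a simple threshold workable.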

Solution: Remove all pixels where the difference between the R, G and B values is higher than 30. Some stray pixels will remain that are not part of a number, so we also remove pixels that aren't in a group bigger than 5 pixels.
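The two cleaning steps could be sketched roughly like this in Python. This is not the exact code I run, just an illustration under a few assumptions: the image is a plain list of rows of (R, G, B) tuples, the background is pure white, the "difference" is largest channel minus smallest channel, and pixels belong to the same group when they touch horizontally, vertically or diagonally. The name `clean_captcha` and its parameters are made up for this example.

```python
from collections import deque


def clean_captcha(pixels, max_spread=30, max_noise_group=5,
                  background=(255, 255, 255)):
    """Return a cleaned copy of `pixels` (a list of rows of (r, g, b) tuples)."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0

    # Step 1: a pixel counts as ink only if it is not background and its
    # channels are close together (assumed: max channel - min channel).
    ink = [[p != background and max(p) - min(p) <= max_spread for p in row]
           for row in pixels]

    # Step 2: flood-fill each group of touching ink pixels (8-connected,
    # an assumption) and drop groups of max_noise_group pixels or fewer.
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not ink[y][x] or seen[y][x]:
                continue
            group = []
            queue = deque([(x, y)])
            seen[y][x] = True
            while queue:
                cx, cy = queue.popleft()
                group.append((cx, cy))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and ink[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            if len(group) <= max_noise_group:
                for gx, gy in group:
                    ink[gy][gx] = False

    # Everything that is not ink becomes background.
    return [[pixels[y][x] if ink[y][x] else background for x in range(width)]
            for y in range(height)]
```

With the defaults, a group has to contain at least 6 pixels to survive, which matches "bigger than 5". Both thresholds will probably need tuning for other captchas.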

Captcha cleaned

After these steps you are able to perfectly segment all the characters.

Starting the blog

Basically, I plan on updating this blog with OCR solutions for captchas.
Since captchas are a pain in the ass, I'm going to post progress and results on OCR'ing them.
I will probably host the solutions as a free service rather than as programs.
Anyway, I hope this blog will help people, and if you have requests for particular captchas to OCR, you can always send me a message.