CAPTCHA Recognition Tools

旭龍 2013-04-19

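Preface

First, the tools I will be using: tesseract/ImageMagick/…etc.

What is tesseract?

Tesseract is an open-source OCR (Optical Character Recognition) engine, originally developed at HP and now maintained by Google. To quote the tesseract-ocr page on Google Code, it is probably the most accurate engine in the open-source world: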

Tesseract is probably the most accurate open source OCR engine available. Combined with the Leptonica Image Processing Library it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google. It is released under the Apache License 2.0.

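What is ImageMagick?

ImageMagick is an open-source software suite for viewing and editing bitmap images and for converting between image formats.
I mention it here because some image formats have to be converted with this tool before they can be recognized.

What is Leptonica?

Leptonica is an image processing and image analysis library, and tesseract depends on it. Depending on the build, not every format (JPEG, for example) can be read, so we use ImageMagick to convert formats first. Leptonica's format limitations are: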

Here's a summary of compression support and limitations:
    - All formats except JPEG support 1 bpp binary.
    - All formats support 8 bpp grayscale (GIF must have a colormap).
    - All formats except GIF support 24 bpp rgb color.
    - All formats except PNM support 8 bpp colormap.
    - PNG and PNM support 2 and 4 bpp images.
    - PNG supports 2 and 4 bpp colormap, and 16 bpp without colormap.
    - PNG, JPEG, TIFF and GIF support image compression; PNM and BMP do not.
    - WEBP supports 24 bpp rgb color.

Installing the tools

If you dutifully download the latest tar.gz from the tesseract-ocr project on Google Code, the build looks like this:

$ tar xzvf tesseract-ocr-3.02.02.tar.gz -C ~/Downloads
$ cd ~/Downloads/tesseract-ocr
$ less README
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
$ sudo ldconfig

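You may well get stuck at autogen.sh (because the build environment is not set up), and there are dependencies to sort out as well.
If your distribution has an official or third-party binary package, why compile it yourself? Just install from the command line (on my Arch Linux, for example):

[hilo@hilo ~]$ sudo pacman -S tesseract # dependencies such as leptonica and libpng are resolved automatically
[hilo@hilo ~]$ sudo pacman -S tesseract-data-eng # the English language data is still required
[hilo@hilo ~]$ sudo pacman -S imagemagick # if you have not installed imagemagick yet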
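
Whichever way you install, it is worth confirming that the commands are actually on your PATH before going further. Here is a minimal sketch; the function name check_tools is my own invention, not part of any of these packages:

```shell
#!/bin/sh
# check_tools (hypothetical helper): report which of the given commands
# can be found on PATH. Returns 0 if all are present, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call for the tools used in this post:
#   check_tools tesseract convert
```

This only checks that the executables exist; it does not verify that the English language data is installed.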

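Recognizing CAPTCHAs

Basic usage

Say I have an image named a.jpg:

[hilo@hilo ~]$ convert a.jpg a.tif # first convert it to a.tif, which tesseract can read
[hilo@hilo ~]$ tesseract a.tif out
[hilo@hilo ~]$ cat out.txt # view the recognized CAPTCHA text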
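
If you do this often, the three steps can be wrapped in a small function. This is only a sketch: the name ocr_image is made up for illustration, and it assumes the file name has an extension and that tesseract writes its result to OUTPUTBASE.txt, as in the commands above.

```shell
#!/bin/sh
# ocr_image (hypothetical helper): convert an image to TIFF, run tesseract
# on it, and print the recognized text on stdout.
# Usage: ocr_image a.jpg
ocr_image() {
    src=$1
    base=${src%.*}                      # strip the extension: a.jpg -> a
    convert "$src" "$base.tif" || return 1
    # Send anything tesseract prints to stderr so stdout carries only the text.
    tesseract "$base.tif" "$base" >&2 || return 1
    cat "$base.txt"
}
```

For example, `ocr_image a.jpg` leaves a.tif and a.txt next to the original and prints the contents of a.txt.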

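Improving image quality

The recognition success rate depends heavily on image quality. A freshly downloaded CAPTCHA usually has to be converted to grayscale, binarized, and denoised first, and ImageMagick makes all of this easy.

convert foo.png -monochrome bar.png # binarize the image

Here I recommend reading 鬼仔's post 高級驗證碼識別 (advanced CAPTCHA recognition).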
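
The grayscale, binarize, and denoise steps can also be spelled out one by one instead of relying on -monochrome. A minimal sketch follows, using the ImageMagick options -colorspace, -threshold and -despeckle; the function name preprocess_captcha and the 50% default threshold are my own assumptions, and the right threshold depends on the CAPTCHA style.

```shell
#!/bin/sh
# preprocess_captcha (hypothetical helper):
# grayscale -> binarize at a fixed threshold -> remove speckle noise.
# Usage: preprocess_captcha SRC DST [THRESHOLD]
preprocess_captcha() {
    src=$1
    dst=$2
    threshold=${3:-50%}     # assumed default; tune it for your images
    convert "$src" -colorspace Gray -threshold "$threshold" -despeckle "$dst"
}
```

For example, `preprocess_captcha foo.png bar.tif 60%` writes a cleaned-up TIFF that can be handed straight to tesseract. Heavier noise (lines drawn through the characters, for instance) will need more than this.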

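What if I only want to recognize letters and digits?

OK, no problem; see the FAQ. For digits only, you just append digits to the end of the command: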

tesseract imagename outputbase digits
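
Here digits is the name of a config file shipped with tesseract. For a different character set (letters plus digits, say), one option is to write your own config file that sets the tessedit_char_whitelist variable and pass its path in the same position. The sketch below assumes your tesseract version falls back to treating the argument as a file path when no bundled config of that name exists; the function name write_whitelist and the file name captcha_chars are made up for illustration.

```shell
#!/bin/sh
# write_whitelist (hypothetical helper): write a tesseract config file that
# restricts recognition to the given characters.
# Usage: write_whitelist FILE CHARS
write_whitelist() {
    printf 'tessedit_char_whitelist %s\n' "$2" > "$1"
}

# Example (hypothetical file name):
#   write_whitelist ./captcha_chars 0123456789abcdefghijklmnopqrstuvwxyz
#   tesseract a.tif out ./captcha_chars
```

If the whitelist seems to be ignored, check the documentation for your tesseract version, since config handling has changed between releases.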

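Training your tesseract

Admittedly, tesseract's recognition rate for English text is already quite good (with the stock tesseract-data-eng), but on CAPTCHAs it is still underwhelming. Keep in mind, though, that tesseract's recognition has to be trained.

(To be continued.)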

FAQ

Here are some problems that the official FAQ does not cover:

empty page!!

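Strictly speaking, this is not a bug (tesseract 3.0). The error appears because tesseract cannot work out the layout of the characters in the image. If you have read the tesseract wiki, you will know how to fix it: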

-psm N
    Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:

    0 = Orientation and script detection (OSD) only.
    1 = Automatic page segmentation with OSD.
    2 = Automatic page segmentation, but no OSD, or OCR.
    3 = Fully automatic page segmentation, but no OSD. (Default)
    4 = Assume a single column of text of variable sizes.
    5 = Assume a single uniform block of vertically aligned text.
    6 = Assume a single uniform block of text.
    7 = Treat the image as a single text line.
    8 = Treat the image as a single word.
    9 = Treat the image as a single word in a circle.
    10 = Treat the image as a single character.

For the character layout of our CAPTCHA a.tif, -psm 7 (single text line) is the better fit.

$ tesseract a.tif out -l eng -psm 7; cat out.txt
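
When it is not obvious which mode suits an image, a quick way to decide is to run several and compare the results by eye. A rough sketch: the function name try_psm_modes and the out_psmN output names are my own, and the flag is kept in a variable because tesseract 3.x spells it -psm while 4.x and later use --psm.

```shell
#!/bin/sh
# try_psm_modes (hypothetical helper): run tesseract on one image with
# several page segmentation modes and print each result.
# Usage: try_psm_modes a.tif 7 8 6
PSM_FLAG=${PSM_FLAG:--psm}    # tesseract 3.x spelling; set to --psm for 4.x+

try_psm_modes() {
    img=$1
    shift
    for mode in "$@"; do
        # Skip modes that fail; tesseract's own messages go to stderr.
        tesseract "$img" "out_psm$mode" -l eng "$PSM_FLAG" "$mode" >&2 || continue
        printf 'psm %s: %s\n' "$mode" "$(cat "out_psm$mode.txt")"
    done
}
```

For a short CAPTCHA, modes 7 (single line) and 8 (single word) are the natural ones to compare.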
