Tips and Tricks 2024

Surya - multilingual OCR for documents

Surya is billed as a multilingual OCR toolkit with text recognition for documents. It is a terminal-based program that can run on a CPU or a GPU.

There is also a Streamlit app. Streamlit is software that turns data scripts into shareable web apps.

Installing Surya

To run Surya you need Python 3.9 or higher and PyTorch. PyTorch provides libraries for tensor manipulation on CPUs or GPUs, a built-in neural network library, utilities for model training, and a multiprocessing library that can work with shared memory.
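
The version requirement can be checked before anything is installed. The following is a minimal sketch that only compares the running interpreter against the 3.9 minimum stated above; the function name python_is_supported is made up for this illustration.

```python
import sys

MIN_VERSION = (3, 9)  # minimum Python version stated in the text


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_supported() else "too old for Surya"
    print(f"Python {current}: {status}")
```

On the Debian bookworm system used in this tip, the interpreter is Python 3.11, which satisfies the check.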

In this tip we install and use the CPU version. The GPU version should normally be faster, but was not tested. We also install Surya in an isolated Python environment (a virtual environment), for which we install the supporting software with the following command:

dany@pindabook:~$ sudo apt install python3-venv -y
[sudo] password for root: 
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following package was automatically installed and is no longer required:
  linux-image-6.1.0-21-amd64
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
  python3-distutils python3-lib2to3 python3-pip-whl python3-setuptools-whl python3.11-venv
The following NEW packages will be installed:
  python3-distutils python3-lib2to3 python3-pip-whl python3-setuptools-whl python3-venv python3.11-venv
0 upgraded, 6 newly installed, 0 to remove and 0 not upgraded.
Need to get 3,043 kB of archives.
After this operation, 4,197 kB of additional disk space will be used.
Get:1 http://deb.debian.org/debian bookworm/main amd64 python3-lib2to3 all 3.11.2-3 [76.3 kB]
Get:2 http://deb.debian.org/debian bookworm/main amd64 python3-distutils all 3.11.2-3 [131 kB]
Get:3 http://deb.debian.org/debian bookworm/main amd64 python3-pip-whl all 23.0.1+dfsg-1 [1,717 kB]
Get:4 http://deb.debian.org/debian bookworm/main amd64 python3-setuptools-whl all 66.1.1-1 [1,111 kB]
Get:5 http://deb.debian.org/debian bookworm/main amd64 python3.11-venv amd64 3.11.2-6+deb12u2 [5,896 B]
Get:6 http://deb.debian.org/debian bookworm/main amd64 python3-venv amd64 3.11.2-1+b1 [1,200 B]
Fetched 3,043 kB in 0s (10.6 MB/s)
Selecting previously unselected package python3-lib2to3.
(Reading database ... 178767 files and directories currently installed.)
Preparing to unpack .../0-python3-lib2to3_3.11.2-3_all.deb ...
Unpacking python3-lib2to3 (3.11.2-3) ...
...
Selecting previously unselected package python3-venv.
Preparing to unpack .../5-python3-venv_3.11.2-1+b1_amd64.deb ...
Unpacking python3-venv (3.11.2-1+b1) ...
Setting up python3-setuptools-whl (66.1.1-1) ...
Setting up python3-pip-whl (23.0.1+dfsg-1) ...
Setting up python3-lib2to3 (3.11.2-3) ...
Setting up python3-distutils (3.11.2-3) ...
Setting up python3.11-venv (3.11.2-6+deb12u2) ...
Setting up python3-venv (3.11.2-1+b1) ...

Next we create a directory that will hold the isolated Python environment, and change into the newly created directory:

dany@pindabook:~$ mkdir pytorch_env
dany@pindabook:~$ cd pytorch_env

We create and activate the isolated Python environment with the following commands:

dany@pindabook:~/pytorch_env$ python3 -m venv pytorch_env
dany@pindabook:~/pytorch_env$ source pytorch_env/bin/activate
(pytorch_env) dany@pindabook:~/pytorch_env$
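
Besides the (pytorch_env) prefix in the prompt, the interpreter itself can report whether it runs inside a virtual environment. The sketch below relies on the standard behaviour of venv, where sys.prefix points at the environment and sys.base_prefix at the system installation; the function name in_virtualenv is hypothetical.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter runs inside a venv.

    Inside a venv, sys.prefix points to the environment directory while
    sys.base_prefix still points to the system-wide installation.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```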

The changed prompt shows that you are now working in the isolated Python environment. To install PyTorch with CPU-only support, run:

(pytorch_env) dany@pindabook:~/pytorch_env$ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
Looking in indexes: https://download.pytorch.org/whl/cpu
Collecting torch
 Downloading https://download.pytorch.org/whl/cpu/torch-2.4.0%2Bcpu-cp311-cp311-linux_x86_64.whl (195.1 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 195.1/195.1 MB 5.0 MB/s eta 0:00:00
Collecting torchvision
 Downloading https://download.pytorch.org/whl/cpu/torchvision-0.19.0%2Bcpu-cp311-cp311-linux_x86_64.whl (1.6 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 10.0 MB/s eta 0:00:00
Collecting torchaudio
 Downloading https://download.pytorch.org/whl/cpu/torchaudio-2.4.0%2Bcpu-cp311-cp311-linux_x86_64.whl (1.7 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 10.6 MB/s eta 0:00:00
Collecting filelock
 Downloading https://download.pytorch.org/whl/filelock-3.13.1-py3-none-any.whl (11 kB)
Collecting typing-extensions>=4.8.0
 Downloading https://download.pytorch.org/whl/typing_extensions-4.9.0-py3-none-any.whl (32 kB)
Collecting sympy
 Downloading https://download.pytorch.org/whl/sympy-1.12-py3-none-any.whl (5.7 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.7/5.7 MB 10.7 MB/s eta 0:00:00
Collecting networkx
 Downloading https://download.pytorch.org/whl/networkx-3.2.1-py3-none-any.whl (1.6 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 9.5 MB/s eta 0:00:00
Collecting jinja2
 Downloading https://download.pytorch.org/whl/Jinja2-3.1.3-py3-none-any.whl (133 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.2/133.2 kB 4.7 MB/s eta 0:00:00
Collecting fsspec
 Downloading https://download.pytorch.org/whl/fsspec-2024.2.0-py3-none-any.whl (170 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 170.9/170.9 kB 5.9 MB/s eta 0:00:00
Collecting numpy
 Downloading https://download.pytorch.org/whl/numpy-1.26.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 10.0 MB/s eta 0:00:00
Collecting pillow!=8.3.*,>=5.3.0
 Downloading https://download.pytorch.org/whl/pillow-10.2.0-cp311-cp311-manylinux_2_28_x86_64.whl (4.5 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.5/4.5 MB 10.8 MB/s eta 0:00:00
Collecting MarkupSafe>=2.0
 Downloading https://download.pytorch.org/whl/MarkupSafe-2.1.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28 kB)
Collecting mpmath>=0.19
 Downloading https://download.pytorch.org/whl/mpmath-1.3.0-py3-none-any.whl (536 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 7.4 MB/s eta 0:00:00
Installing collected packages: mpmath, typing-extensions, sympy, pillow, numpy, networkx, MarkupSafe, fsspec, filelock, jinja2, torch, torchvision, torchaudio
Successfully installed MarkupSafe-2.1.5 filelock-3.13.1 fsspec-2024.2.0 jinja2-3.1.3 mpmath-1.3.0 networkx-3.2.1 numpy-1.26.3 pillow-10.2.0 sympy-1.12 torch-2.4.0+cpu torchaudio-2.4.0+cpu torchvision-0.19.0+cpu typing-extensions-4.9.0
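
Whether the three packages are now importable from the environment can be verified without loading them. This is a small sketch using only the standard library; the helper name installed is an assumption made for the example and is not part of PyTorch.

```python
from importlib.util import find_spec


def installed(names):
    """Map each top-level module name to True when it can be imported."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Module names of the packages installed by the pip command above.
    for name, present in installed(["torch", "torchvision", "torchaudio"]).items():
        print(f"{name}: {'found' if present else 'missing'}")
```

Run it with the Python interpreter of the isolated environment; outside that environment the modules will normally be reported as missing.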

Finally, you can install Surya:

(pytorch_env) dany@pindabook:~/pytorch_env$ pip install surya-ocr
Collecting surya-ocr
 Downloading surya_ocr-0.4.15-py3-none-any.whl (95 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95.2/95.2 kB 2.1 MB/s eta 0:00:00
Collecting filetype<2.0.0,>=1.2.0
 Downloading filetype-1.2.0-py2.py3-none-any.whl (19 kB)
Collecting ftfy<7.0.0,>=6.1.3
 Downloading ftfy-6.2.0-py3-none-any.whl (54 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54.4/54.4 kB 2.8 MB/s eta 0:00:00
Collecting opencv-python<5.0.0.0,>=4.9.0.80
 Downloading opencv_python-4.10.0.84-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (62.5 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.5/62.5 MB 7.9 MB/s eta 0:00:00
Requirement already satisfied: pillow<11.0.0,>=10.2.0 in ./pytorch_env/lib/python3.11/site-packages (from surya-ocr) (10.2.0)
Collecting pydantic<3.0.0,>=2.5.3
 Downloading pydantic-2.8.2-py3-none-any.whl (423 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 423.9/423.9 kB 7.6 MB/s eta 0:00:00
Collecting pydantic-settings<3.0.0,>=2.1.0
 Downloading pydantic_settings-2.4.0-py3-none-any.whl (23 kB)
Collecting pypdfium2<5.0.0,>=4.25.0
 Downloading pypdfium2-4.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.8/2.8 MB 10.5 MB/s eta 0:00:00
Collecting python-dotenv<2.0.0,>=1.0.0
 Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Collecting tabulate<0.10.0,>=0.9.0
 Downloading tabulate-0.9.0-py3-none-any.whl (35 kB)
Requirement already satisfied: torch<3.0.0,>=2.3.0 in ./pytorch_env/lib/python3.11/site-packages (from surya-ocr) (2.4.0+cpu)
Collecting transformers<5.0.0,>=4.41.0
 Downloading transformers-4.43.3-py3-none-any.whl (9.4 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.4/9.4 MB 10.4 MB/s eta 0:00:00
Collecting wcwidth<0.3.0,>=0.2.12
 Downloading wcwidth-0.2.13-py2.py3-none-any.whl (34 kB)
Requirement already satisfied: numpy>=1.21.2 in ./pytorch_env/lib/python3.11/site-packages (from opencv-python<5.0.0.0,>=4.9.0.80->surya-ocr) (1.26.3)
Collecting annotated-types>=0.4.0
 Downloading annotated_types-0.7.0-py3-none-any.whl (13 kB)
Collecting pydantic-core==2.20.1
 Downloading pydantic_core-2.20.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 10.8 MB/s eta 0:00:00
Requirement already satisfied: typing-extensions>=4.6.1 in ./pytorch_env/lib/python3.11/site-packages (from pydantic<3.0.0,>=2.5.3->surya-ocr) (4.9.0)
Requirement already satisfied: filelock in ./pytorch_env/lib/python3.11/site-packages (from torch<3.0.0,>=2.3.0->surya-ocr) (3.13.1)
Requirement already satisfied: sympy in ./pytorch_env/lib/python3.11/site-packages (from torch<3.0.0,>=2.3.0->surya-ocr) (1.12)
Requirement already satisfied: networkx in ./pytorch_env/lib/python3.11/site-packages (from torch<3.0.0,>=2.3.0->surya-ocr) (3.2.1)
Requirement already satisfied: jinja2 in ./pytorch_env/lib/python3.11/site-packages (from torch<3.0.0,>=2.3.0->surya-ocr) (3.1.3)
Requirement already satisfied: fsspec in ./pytorch_env/lib/python3.11/site-packages (from torch<3.0.0,>=2.3.0->surya-ocr) (2024.2.0)
Collecting huggingface-hub<1.0,>=0.23.2
 Downloading huggingface_hub-0.24.5-py3-none-any.whl (417 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 417.5/417.5 kB 10.8 MB/s eta 0:00:00
Collecting packaging>=20.0
 Downloading packaging-24.1-py3-none-any.whl (53 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54.0/54.0 kB 2.8 MB/s eta 0:00:00
Collecting pyyaml>=5.1
 Downloading PyYAML-6.0.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (757 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 757.7/757.7 kB 10.5 MB/s eta 0:00:00
Collecting regex!=2019.12.17
 Downloading regex-2024.7.24-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (786 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 786.6/786.6 kB 9.9 MB/s eta 0:00:00
Collecting requests
 Downloading requests-2.32.3-py3-none-any.whl (64 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.9/64.9 kB 2.6 MB/s eta 0:00:00
Collecting safetensors>=0.4.1
 Downloading safetensors-0.4.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 10.4 MB/s eta 0:00:00
Collecting tokenizers<0.20,>=0.19
 Downloading tokenizers-0.19.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 10.5 MB/s eta 0:00:00
Collecting tqdm>=4.27
 Downloading tqdm-4.66.5-py3-none-any.whl (78 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.4/78.4 kB 3.7 MB/s eta 0:00:00
Requirement already satisfied: MarkupSafe>=2.0 in ./pytorch_env/lib/python3.11/site-packages (from jinja2->torch<3.0.0,>=2.3.0->surya-ocr) (2.1.5)
Collecting charset-normalizer<4,>=2
 Downloading charset_normalizer-3.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (140 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 140.3/140.3 kB 7.1 MB/s eta 0:00:00
Collecting idna<4,>=2.5
 Downloading idna-3.7-py3-none-any.whl (66 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.8/66.8 kB 3.6 MB/s eta 0:00:00
Collecting urllib3<3,>=1.21.1
 Downloading urllib3-2.2.2-py3-none-any.whl (121 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.4/121.4 kB 5.7 MB/s eta 0:00:00
Collecting certifi>=2017.4.17
 Downloading certifi-2024.7.4-py3-none-any.whl (162 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 163.0/163.0 kB 6.1 MB/s eta 0:00:00
Requirement already satisfied: mpmath>=0.19 in ./pytorch_env/lib/python3.11/site-packages (from sympy->torch<3.0.0,>=2.3.0->surya-ocr) (1.3.0)
Installing collected packages: wcwidth, filetype, urllib3, tqdm, tabulate, safetensors, regex, pyyaml, python-dotenv, pypdfium2, pydantic-core, packaging, opencv-python, idna, ftfy, charset-normalizer, certifi, annotated-types, requests, pydantic, pydantic-settings, huggingface-hub, tokenizers, transformers, surya-ocr
Successfully installed annotated-types-0.7.0 certifi-2024.7.4 charset-normalizer-3.3.2 filetype-1.2.0 ftfy-6.2.0 huggingface-hub-0.24.5 idna-3.7 opencv-python-4.10.0.84 packaging-24.1 pydantic-2.8.2 pydantic-core-2.20.1 pydantic-settings-2.4.0 pypdfium2-4.30.0 python-dotenv-1.0.1 pyyaml-6.0.1 regex-2024.7.24 requests-2.32.3 safetensors-0.4.3 surya-ocr-0.4.15 tabulate-0.9.0 tokenizers-0.19.1 tqdm-4.66.5 transformers-4.43.3 urllib3-2.2.2 wcwidth-0.2.13

You can extend the installation with a graphical web app:

(pytorch_env) dany@pindabook:~/pytorch_env$ pip install streamlit
Collecting streamlit
 Downloading streamlit-1.37.0-py2.py3-none-any.whl (8.7 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.7/8.7 MB 10.0 MB/s eta 0:00:00
Collecting altair<6,>=4.0
 Downloading altair-5.3.0-py3-none-any.whl (857 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 857.8/857.8 kB 9.1 MB/s eta 0:00:00
Collecting blinker<2,>=1.0.0
 Downloading blinker-1.8.2-py3-none-any.whl (9.5 kB)
Collecting cachetools<6,>=4.0
 Downloading cachetools-5.4.0-py3-none-any.whl (9.5 kB)
Collecting click<9,>=7.0
 Downloading click-8.1.7-py3-none-any.whl (97 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97.9/97.9 kB 5.4 MB/s eta 0:00:00
Requirement already satisfied: numpy<3,>=1.20 in ./pytorch_env/lib/python3.11/site-packages (from streamlit) (1.26.3)
Requirement already satisfied: packaging<25,>=20 in ./pytorch_env/lib/python3.11/site-packages (from streamlit) (24.1)
Collecting pandas<3,>=1.3.0
 Downloading pandas-2.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.0/13.0 MB 10.3 MB/s eta 0:00:00
Requirement already satisfied: pillow<11,>=7.1.0 in ./pytorch_env/lib/python3.11/site-packages (from streamlit) (10.2.0)
Collecting protobuf<6,>=3.20
 Downloading protobuf-5.27.3-cp38-abi3-manylinux2014_x86_64.whl (309 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 309.3/309.3 kB 8.3 MB/s eta 0:00:00
Collecting pyarrow>=7.0
 Downloading pyarrow-17.0.0-cp311-cp311-manylinux_2_28_x86_64.whl (39.9 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.9/39.9 MB 8.9 MB/s eta 0:00:00
Requirement already satisfied: requests<3,>=2.27 in ./pytorch_env/lib/python3.11/site-packages (from streamlit) (2.32.3)
Collecting rich<14,>=10.14.0
 Downloading rich-13.7.1-py3-none-any.whl (240 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 240.7/240.7 kB 7.5 MB/s eta 0:00:00
Collecting tenacity<9,>=8.1.0
 Downloading tenacity-8.5.0-py3-none-any.whl (28 kB)
Collecting toml<2,>=0.10.1
 Downloading toml-0.10.2-py2.py3-none-any.whl (16 kB)
Requirement already satisfied: typing-extensions<5,>=4.3.0 in ./pytorch_env/lib/python3.11/site-packages (from streamlit) (4.9.0)
Collecting gitpython!=3.1.19,<4,>=3.0.7
 Downloading GitPython-3.1.43-py3-none-any.whl (207 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 207.3/207.3 kB 7.0 MB/s eta 0:00:00
Collecting pydeck<1,>=0.8.0b4
 Downloading pydeck-0.9.1-py2.py3-none-any.whl (6.9 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.9/6.9 MB 10.8 MB/s eta 0:00:00
Collecting tornado<7,>=6.0.3
 Downloading tornado-6.4.1-cp38-abi3-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (436 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 436.8/436.8 kB 8.8 MB/s eta 0:00:00
Collecting watchdog<5,>=2.1.5
 Downloading watchdog-4.0.1-py3-none-manylinux2014_x86_64.whl (83 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83.0/83.0 kB 3.9 MB/s eta 0:00:00
Requirement already satisfied: jinja2 in ./pytorch_env/lib/python3.11/site-packages (from altair<6,>=4.0->streamlit) (3.1.3)
Collecting jsonschema>=3.0
 Downloading jsonschema-4.23.0-py3-none-any.whl (88 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.5/88.5 kB 3.6 MB/s eta 0:00:00
Collecting toolz
 Downloading toolz-0.12.1-py3-none-any.whl (56 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.1/56.1 kB 3.7 MB/s eta 0:00:00
Collecting gitdb<5,>=4.0.1
 Downloading gitdb-4.0.11-py3-none-any.whl (62 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.7/62.7 kB 2.5 MB/s eta 0:00:00
Collecting python-dateutil>=2.8.2
 Downloading python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 229.9/229.9 kB 8.4 MB/s eta 0:00:00
Collecting pytz>=2020.1
 Downloading pytz-2024.1-py2.py3-none-any.whl (505 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 505.5/505.5 kB 10.0 MB/s eta 0:00:00
Collecting tzdata>=2022.7
 Downloading tzdata-2024.1-py2.py3-none-any.whl (345 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 345.4/345.4 kB 7.5 MB/s eta 0:00:00
Requirement already satisfied: charset-normalizer<4,>=2 in ./pytorch_env/lib/python3.11/site-packages (from requests<3,>=2.27->streamlit) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in ./pytorch_env/lib/python3.11/site-packages (from requests<3,>=2.27->streamlit) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in ./pytorch_env/lib/python3.11/site-packages (from requests<3,>=2.27->streamlit) (2.2.2)
Requirement already satisfied: certifi>=2017.4.17 in ./pytorch_env/lib/python3.11/site-packages (from requests<3,>=2.27->streamlit) (2024.7.4)
Collecting markdown-it-py>=2.2.0
 Downloading markdown_it_py-3.0.0-py3-none-any.whl (87 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87.5/87.5 kB 3.9 MB/s eta 0:00:00
Collecting pygments<3.0.0,>=2.13.0
 Downloading pygments-2.18.0-py3-none-any.whl (1.2 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 11.0 MB/s eta 0:00:00
Collecting smmap<6,>=3.0.1
 Downloading smmap-5.0.1-py3-none-any.whl (24 kB)
Requirement already satisfied: MarkupSafe>=2.0 in ./pytorch_env/lib/python3.11/site-packages (from jinja2->altair<6,>=4.0->streamlit) (2.1.5)
Collecting attrs>=22.2.0
 Downloading attrs-24.1.0-py3-none-any.whl (63 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.9/63.9 kB 4.3 MB/s eta 0:00:00
Collecting jsonschema-specifications>=2023.03.6
 Downloading jsonschema_specifications-2023.12.1-py3-none-any.whl (18 kB)
Collecting referencing>=0.28.4
 Downloading referencing-0.35.1-py3-none-any.whl (26 kB)
Collecting rpds-py>=0.7.1
 Downloading rpds_py-0.19.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (355 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 355.5/355.5 kB 8.2 MB/s eta 0:00:00
Collecting mdurl~=0.1
 Downloading mdurl-0.1.2-py3-none-any.whl (10.0 kB)
Collecting six>=1.5
 Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Installing collected packages: pytz, watchdog, tzdata, tornado, toolz, toml, tenacity, smmap, six, rpds-py, pygments, pyarrow, protobuf, mdurl, click, cachetools, blinker, attrs, referencing, python-dateutil, pydeck, markdown-it-py, gitdb, rich, pandas, jsonschema-specifications, gitpython, jsonschema, altair, streamlit
Successfully installed altair-5.3.0 attrs-24.1.0 blinker-1.8.2 cachetools-5.4.0 click-8.1.7 gitdb-4.0.11 gitpython-3.1.43 jsonschema-4.23.0 jsonschema-specifications-2023.12.1 markdown-it-py-3.0.0 mdurl-0.1.2 pandas-2.2.2 protobuf-5.27.3 pyarrow-17.0.0 pydeck-0.9.1 pygments-2.18.0 python-dateutil-2.9.0.post0 pytz-2024.1 referencing-0.35.1 rich-13.7.1 rpds-py-0.19.1 six-1.16.0 smmap-5.0.1 streamlit-1.37.0 tenacity-8.5.0 toml-0.10.2 toolz-0.12.1 tornado-6.4.1 tzdata-2024.1 watchdog-4.0.1

After that you can leave the isolated Python environment:

(pytorch_env) dany@pindabook:~/pytorch_env$ deactivate
dany@pindabook:~/pytorch_env$ cd

Surya text recognition in the terminal

On first use, the model weights are downloaded automatically. In the example below, the text to be recognized is in the scanned image file surya.png, and the text is in Dutch.

dany@pindabook:~$ pytorch_env/pytorch_env/bin/surya_ocr surya.png --langs nl
preprocessor_config.json: 100%|████████████████████████████████████████████████████| 675/675 [00:00<00:00, 3.64MB/s]
config.json: 100%|█████████████████████████████████████████████████████████████████| 811/811 [00:00<00:00, 5.04MB/s]
model.safetensors: 100%|█████████████████████████████████████████████████████████| 154M/154M [00:13<00:00, 11.4MB/s]
Loaded detection model vikp/surya_det3 on device cpu with dtype torch.float32
config.json: 100%|█████████████████████████████████████████████████████████████| 6.91k/6.91k [00:00<00:00, 30.9MB/s]
model.safetensors: 100%|███████████████████████████████████████████████████████| 1.05G/1.05G [01:32<00:00, 11.4MB/s]
generation_config.json: 100%|██████████████████████████████████████████████████████| 181/181 [00:00<00:00, 1.17MB/s]
Loaded recognition model vikp/surya_rec on device cpu with dtype torch.float32
preprocessor_config.json: 100%|████████████████████████████████████████████████████| 608/608 [00:00<00:00, 3.95MB/s]
Detecting bboxes: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:33<00:00, 33.95s/it]
Recognizing Text: 100%|███████████████████████████████████████████████████████████████| 5/5 [07:35<00:00, 91.07s/it]
Wrote results to results/surya/surya

Surya writes the result in JSON format to the directory results/surya/surya. That is handy for further processing, but not suited to comfortable reading.
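
The JSON output can be turned into plain text with a few lines of Python. The sketch below rests on assumptions that are not confirmed by this tip: that the directory contains a file named results.json, that its top level is a dictionary keyed by input file name, that each value is a list of pages, and that each page has a text_lines list whose entries carry a text field. Check the actual file before relying on it; the function name extract_text is made up for the example.

```python
import json
from pathlib import Path


def extract_text(results):
    """Map each document name to its recognized lines, page after page.

    Assumed layout: {name: [{"text_lines": [{"text": ...}, ...]}, ...]}
    """
    return {
        name: [line["text"] for page in pages for line in page.get("text_lines", [])]
        for name, pages in results.items()
    }


if __name__ == "__main__":
    # Assumed file name inside the output directory mentioned in the text.
    results_file = Path("results/surya/surya/results.json")
    if results_file.exists():
        data = json.loads(results_file.read_text(encoding="utf-8"))
        for name, lines in extract_text(data).items():
            print(f"== {name} ==")
            print("\n".join(lines))
    else:
        print(f"{results_file} not found")
```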

Graphical

Things get a good deal more user-friendly with the web app. It only works inside the isolated Python environment, though, which you activate with:

dany@pindabook:~$ cd pytorch_env
dany@pindabook:~/pytorch_env$ source pytorch_env/bin/activate

You start the graphical Surya web app with the command:

(pytorch_env) dany@pindabook:~/pytorch_env$ surya_gui  

     Welcome to Streamlit!

     If you’d like to receive helpful onboarding emails, news, offers, promotions,
     and the occasional swag, please enter your email address below. Otherwise,
     leave this field blank.

     Email:   

 You can find our privacy policy at https://streamlit.io/privacy-policy

 Summary:
 - This open source library collects usage statistics.
 - We cannot see and do not store information contained inside Streamlit apps,
   such as text, charts, images, etc.
 - Telemetry data is stored in servers in the United States.
 - If you'd like to opt out, add the following to ~/.streamlit/config.toml,
   creating that file if necessary:

   [browser]
   gatherUsageStats = false


 You can now view your Streamlit app in your browser.

 Local URL: http://localhost:8501
 Network URL: http://192.168.129.29:8501

Loaded detection model vikp/surya_det3 on device cpu with dtype torch.float32
Loaded recognition model vikp/surya_rec on device cpu with dtype torch.float32
Loaded detection model vikp/surya_layout3 on device cpu with dtype torch.float32
Loaded reading order model vikp/surya_order on device cpu with dtype torch.float32
Detecting bboxes: 100%|█████████████████████████████████████████████████████████| 1/1 [00:41<00:00, 41.52s/it]
Detecting bboxes: 100%|█████████████████████████████████████████████████████████| 1/1 [00:36<00:00, 36.25s/it]
Detecting bboxes: 100%|█████████████████████████████████████████████████████████| 1/1 [00:34<00:00, 34.53s/it]
Recognizing Text: 100%|█████████████████████████████████████████████████████████| 5/5 [05:46<00:00, 69.33s/it]
/home/dany/pytorch_env/pytorch_env/lib/python3.11/site-packages/PIL/Image.py:3186: DecompressionBombWarning: Image size (142214688 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.
 warnings.warn(
Detecting bboxes: 100%|█████████████████████████████████████████████████████████| 1/1 [00:37<00:00, 37.31s/it]
Detecting bboxes: 100%|█████████████████████████████████████████████████████████| 1/1 [00:36<00:00, 36.54s/it]
Detecting bboxes: 100%|█████████████████████████████████████████████████████████| 1/1 [00:34<00:00, 34.78s/it]
Detecting bboxes: 100%|█████████████████████████████████████████████████████████| 1/1 [00:35<00:00, 35.09s/it]
Finding reading order: 100%|████████████████████████████████████████████████████| 1/1 [00:15<00:00, 15.84s/it]
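
The Streamlit banner in the log above mentions that usage statistics can be switched off through ~/.streamlit/config.toml. The following sketch writes exactly the two lines quoted in that banner; the function name write_opt_out is hypothetical, and the function deliberately refuses to touch an existing configuration file.

```python
from pathlib import Path

# The two lines quoted in the Streamlit banner.
OPT_OUT = "[browser]\ngatherUsageStats = false\n"


def write_opt_out(config_dir):
    """Create config.toml with the opt-out setting; return False if it exists."""
    config_dir = Path(config_dir)
    config_file = config_dir / "config.toml"
    if config_file.exists():
        return False  # never overwrite an existing configuration
    config_dir.mkdir(parents=True, exist_ok=True)
    config_file.write_text(OPT_OUT, encoding="utf-8")
    return True
```

Calling write_opt_out(Path.home() / ".streamlit") applies the setting for the current user. If the function returns False, a configuration file is already present and the two lines have to be merged in by hand.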

In the example below we again used the scanned image file surya.png. Its text is in Dutch but includes terminal lines in English. We therefore added the language Dutch under Languages. A click on the button in the app sets Surya to work. Wait until Surya has finished, then press the following button. Surya shows an image with the result on the left and the original on the right.


Below that, the result appears in JSON format. By clicking the tab Text Lines (for debugging) you can use the result as plain text.


Surya already delivers impressive results. Tests with various images show strong text recognition, all the more so given that the software is still at an early stage of development.

The software works better with documents containing printed text, and the results can be improved by preprocessing images or by changing the image resolution.
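
One concrete case of changing the resolution shows up in the web app log, where Pillow warns that an image of 142214688 pixels exceeds its limit of 89478485 pixels. The sketch below only computes a reduced size that stays within that limit while keeping the aspect ratio. The function name fit_within_limit and the example dimensions are made up, and whether a smaller image actually improves recognition was not tested here.

```python
import math

PIXEL_LIMIT = 89_478_485  # limit quoted in the Pillow warning in the log


def fit_within_limit(width, height, limit=PIXEL_LIMIT):
    """Return (width, height) scaled down so that width * height <= limit."""
    pixels = width * height
    if pixels <= limit:
        return width, height  # already small enough, leave unchanged
    scale = math.sqrt(limit / pixels)  # same factor for both sides
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


if __name__ == "__main__":
    # Hypothetical scan dimensions, chosen only for illustration.
    print(fit_within_limit(16000, 9000))
```

The resulting size could then be passed to an image tool of your choice, for example Pillow's Image.resize, before handing the file to Surya.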

The software supports more than 90 languages.

Anyone who wants to follow Surya's development or would like more information can visit the Surya GitHub page.

Removing Surya