Ecosyste.ms: OpenCollective

An open API service for software projects hosted on Open Collective.

OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.
Collective - Host: opensource - https://opencollective.com/ocrmypdf - Code: https://github.com/jbarlow83/OCRmyPDF

Add support for -b (skip big pages)

github.com/ocrmypdf/OCRmyPDF - db311fb6a2100484e0f384473dacfca57dcc6748 authored almost 10 years ago by Jim Barlow <[email protected]>
Remove filenames from .hocr files

As documented, Tesseract does not escape the filename when inserting it
into .hocr, potentially ...

github.com/ocrmypdf/OCRmyPDF - 02c1dcec8e722f6b3e4004c7099518cd55cddd09 authored almost 10 years ago by Jim Barlow <[email protected]>
Support Tesseract 3.03 quirk: .html vs .hocr extension

github.com/ocrmypdf/OCRmyPDF - 52dc74d3cee9c82fccd2093dba5b5cd43c705bec authored almost 10 years ago by Jim Barlow <[email protected]>
Convert the final image to a JPEG if the original image was a JPEG

Of course, this introduces recompression artifacts, and is unnecessary
if no options are given t...

github.com/ocrmypdf/OCRmyPDF - cc2af2bc15184463074ca3804fb859629a8f7ecb authored almost 10 years ago by Jim Barlow <[email protected]>
Use the appropriate PNG rendered given the types of image present

github.com/ocrmypdf/OCRmyPDF - 638c6db05de22495eb85bc904a353adefec836bc authored almost 10 years ago by Jim Barlow <[email protected]>
Use Ghostscript -> PNG instead of pdftoppm for rendering

Ghostscript has the clunkiest imaginable syntax, obtuse documentation,
quirky behavior, and poor...

github.com/ocrmypdf/OCRmyPDF - f7db8d9aff044596449608f75f4f292ed0b0ff8f authored almost 10 years ago by Jim Barlow <[email protected]>
Support Ghostscript 9.14's new color conversion engine (not portable)

The flag -dUseCIEColor is now deprecated, as it invokes the old engine
which introduces color er...

github.com/ocrmypdf/OCRmyPDF - 564fb7a87eaac24e12a9bfdd5337098e41098325 authored almost 10 years ago by Jim Barlow <[email protected]>
Standardize tmpfile prefix

github.com/ocrmypdf/OCRmyPDF - 4d88e64774b88002d1535661ffd196b7910e4ad2 authored almost 10 years ago by Jim Barlow <[email protected]>
Handle case where a page contains no images - don't OCR

It doesn't make much sense to do anything with an all vector page
except extract the page unmodi...

github.com/ocrmypdf/OCRmyPDF - 26f1163b46f90c8ea31827d9ca01817236a39b0d authored almost 10 years ago by Jim Barlow <[email protected]>
Implement debug text only page option

github.com/ocrmypdf/OCRmyPDF - 40058e99e0550311eded97e8b12822064996a680 authored almost 10 years ago by Jim Barlow <[email protected]>
Describe what decision was made based on -f and -s and presence of text

github.com/ocrmypdf/OCRmyPDF - bece4c3e024c91d8bb35d5d4d334a07cf848db6e authored almost 10 years ago by Jim Barlow <[email protected]>
When deciding on OCR, check for presence of text rather than a font

It appears to be possible to have a PDF with an embedded font that is
either unused or used only...

github.com/ocrmypdf/OCRmyPDF - f0f6b57c8712260ff45ac1340a116f648d3950bd authored almost 10 years ago by Jim Barlow <[email protected]>
Logic error

github.com/ocrmypdf/OCRmyPDF - dc2a4ab04494d80ca99b052120b2d0d739ff33d8 authored almost 10 years ago by Jim Barlow <[email protected]>
Implement skipping OCR when -s is specified

Appears to be necessary to disable each state of the pipeline that is
inactive, not just initial...

github.com/ocrmypdf/OCRmyPDF - b16d6f5b81d25aab0445eeb3b08b245c2a5e31a5 authored almost 10 years ago by Jim Barlow <[email protected]>
Not a named param

github.com/ocrmypdf/OCRmyPDF - 69ce6ff7b5f4d1199a31566dbd97d7dbd0b6e3ec authored about 10 years ago by Jim Barlow <[email protected]>
Add Tesseract timeout to keep things reasonable

github.com/ocrmypdf/OCRmyPDF - 32ba50b8dca1eb5555b557549e8bba1ae22207cb authored about 10 years ago by Jim Barlow <[email protected]>
The -dci options now work (and valid combinations thereof)

github.com/ocrmypdf/OCRmyPDF - 36aca45f35260297fe4f626a01f4c1353f1b1ad7 authored about 10 years ago by Jim Barlow <[email protected]>
Leptonica deskew can handle .pnm input, unlike imagemagick

github.com/ocrmypdf/OCRmyPDF - 925290342d29564111a491ba61adb50fe371dcb6 authored about 10 years ago by Jim Barlow <[email protected]>
Add leptonica deskew

github.com/ocrmypdf/OCRmyPDF - 4dc0370c57f1c590cf62370984e4922231c8acde authored about 10 years ago by Jim Barlow <[email protected]>
Run as a module instead

github.com/ocrmypdf/OCRmyPDF - b92f8e43f22040bc79755034798d809e980450d8 authored about 10 years ago by Jim Barlow <[email protected]>
Merge branch 'feature/findskew' into develop

github.com/ocrmypdf/OCRmyPDF - 22b0733a1df0abfeb0badeb86ea4e762c06c1295 authored about 10 years ago by Jim Barlow <[email protected]>
Attempt to fix multiprocessing pickling error

github.com/ocrmypdf/OCRmyPDF - 6021684ab6b43ad564342454284143360da4d9da authored about 10 years ago by Jim Barlow <[email protected]>
Fix symlink error that occurs in multipage processing

github.com/ocrmypdf/OCRmyPDF - f4b1d0cdfe516953ad45c2f5963dfc084513b031 authored about 10 years ago by Jim Barlow <[email protected]>
Comments

github.com/ocrmypdf/OCRmyPDF - d0d804862102df7140ee16975014bf4858be9072 authored about 10 years ago by Jim Barlow <[email protected]>
Use abspath instead of relpath for temporary directory symlink

github.com/ocrmypdf/OCRmyPDF - cfd119325dc1b3e01a42d1cb2b251b16763eba1e authored about 10 years ago by Jim Barlow <[email protected]>
Support missing tess_cfg_files parameter when omitted by OCRmyPDF.sh

github.com/ocrmypdf/OCRmyPDF - ad30833ffca8bb6f0a151f8b8bc5075f13acb4d9 authored about 10 years ago by Jim Barlow <[email protected]>
Use TIFFs as intermediates

pdftoppm in recent versions (0.26.4,5) seems to be incapable of
producing valid TIFFs, so have i...

github.com/ocrmypdf/OCRmyPDF - e5c79a6666f8b8936e72be1f54a47d3456c339ba authored about 10 years ago by Jim Barlow <[email protected]>
Standardize intermediate filenames better

convert .pnm -deskew <...> .pnm seems to have a bug that produces an
invalid .pnm file which lat...

github.com/ocrmypdf/OCRmyPDF - 63dc753c1b3f27e34b07b3c47678de021b985c29 authored about 10 years ago by Jim Barlow <[email protected]>
Basic error handling

github.com/ocrmypdf/OCRmyPDF - 017bc1f25214fc94aed7fe30400c2e927b527916 authored about 10 years ago by Jim Barlow <[email protected]>
Sort of working, but fragile; uses tmp folder properly now

github.com/ocrmypdf/OCRmyPDF - bcd67c009d4cb63d8e4b2b51234bdcc663ea5722 authored about 10 years ago by Jim Barlow <[email protected]>
start rewrite ocrmypdf in python

github.com/ocrmypdf/OCRmyPDF - 635358884e7e1f1f03ee9564e090801be7b8a72d authored about 10 years ago by fritz-hh <[email protected]>
Now produces a finished OCR-PDF page

github.com/ocrmypdf/OCRmyPDF - 2f6cfafdfc64713e99b6ee7ec78961b9b784ec77 authored about 10 years ago by Jim Barlow <[email protected]>
First crack at Ruffus, working well

github.com/ocrmypdf/OCRmyPDF - 25234fa30bbf53ae8494057586cfcadc72a78822 authored about 10 years ago by Jim Barlow <[email protected]>
Merge remote-tracking branch 'origin/v2.x' into v3.x

github.com/ocrmypdf/OCRmyPDF - 5b173418041b9534a693c1f348bf427427cf8011 authored about 10 years ago by fritz-hh <[email protected]>
fixes #95

Exit if the output path points to a folder
Exit if the output path points to an existing file

github.com/ocrmypdf/OCRmyPDF - 9bedfa9a72e6e6b723f3808cfda33cd4077a87af authored about 10 years ago by fritz-hh <[email protected]>
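The two checks named in this commit can be sketched in Python as follows. This is only an illustration under stated assumptions, not the project's code: the function name `validate_output_path` and the error messages are hypothetical.

```python
import os


def validate_output_path(path):
    """Return an error message if `path` is unusable as output, else None.

    Hypothetical helper mirroring the two checks in the commit message.
    """
    # Refuse to write onto a folder.
    if os.path.isdir(path):
        return f"output path is a folder: {path}"
    # Refuse to overwrite a file that is already there.
    if os.path.exists(path):
        return f"output file already exists: {path}"
    return None
```

A caller would typically print the returned message and exit with a non-zero status when the result is not None.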
make clear it is a draft from v3.x branch

github.com/ocrmypdf/OCRmyPDF - e1f122097034861a564dc33fd39c08d409f9ef3e authored over 10 years ago by fritz-hh <[email protected]>
Merge remote-tracking branch 'origin/v2.x' into v3.x

github.com/ocrmypdf/OCRmyPDF - 5855bcd1fe5cb71ac80fe7f57ee3f461dda5b098 authored over 10 years ago by fritz-hh <[email protected]>
make clear it is a draft from v2.x branch

github.com/ocrmypdf/OCRmyPDF - a14af5b9eeea786c9c1ca7a785d4f5c68a051c4f authored over 10 years ago by fritz-hh <[email protected]>
Update ROADMAP.md

github.com/ocrmypdf/OCRmyPDF - ea5cfa40c181663ad80d38acef8cd2ba2d6cf6c4 authored over 10 years ago by fritz-hh <[email protected]>
roadmap usage updated

github.com/ocrmypdf/OCRmyPDF - 90d892512af4f36c21bb77b3156b51d8c6fad9dd authored over 10 years ago by fritz-hh <[email protected]>
usage corrected [-f|-s]

github.com/ocrmypdf/OCRmyPDF - 9c6fedb15b99b9714a540c76a2102b5cc0828b77 authored over 10 years ago by fritz-hh <[email protected]>
roadmap arguments specified

github.com/ocrmypdf/OCRmyPDF - 3a7175115f6b194ad850fb14e854dd62521faca0 authored over 10 years ago by fritz-hh <[email protected]>
typo in usage

github.com/ocrmypdf/OCRmyPDF - 98c41f322386166453eb2b51d02490c328801bd9 authored over 10 years ago by fritz-hh <[email protected]>
roadmap: better layout

github.com/ocrmypdf/OCRmyPDF - d101e96e1665f6a20348c32726b0aa2a6fa7bf21 authored over 10 years ago by fritz-hh <[email protected]>
roadmap rename steps

github.com/ocrmypdf/OCRmyPDF - a446b6c4407e3b9c67b0bae1a7ce187d130e981e authored over 10 years ago by fritz-hh <[email protected]>
roadmap detailed

github.com/ocrmypdf/OCRmyPDF - b1fec0f1b19854098496e60209f74fda784a15dc authored over 10 years ago by fritz-hh <[email protected]>
draft roadmap for v3.x

github.com/ocrmypdf/OCRmyPDF - 1dfdc93745224c98aedd1785ef1930a1f93713c9 authored over 10 years ago by fritz-hh <[email protected]>
default language now set in the config.sh file

github.com/ocrmypdf/OCRmyPDF - 6c5ee4095cf379bcbe844efe29e34882652fd3d3 authored over 10 years ago by fritz-hh <[email protected]>
Introduce -s option + fix bug when -C not set

- Introduce -s option to not OCR pages containing fonts
- Solve issue with -f and -s if -C is not...

github.com/ocrmypdf/OCRmyPDF - 986fbf63a42de91278313b2a8cacb8c3ecf47d38 authored over 10 years ago by fritz-hh <[email protected]>
correct download path

github.com/ocrmypdf/OCRmyPDF - 2612105d3213b7b6a1ab4ee74450945a4f9cf755 authored over 10 years ago by fritz-hh <[email protected]>
update release notes for v2.2-stable

github.com/ocrmypdf/OCRmyPDF - 954fe13f542c9e743749b9b94dc2fa7b3fd6cfe3 authored over 10 years ago by fritz-hh <[email protected]>
Make clear this is a draft

github.com/ocrmypdf/OCRmyPDF - bb5a00685e29bb3abe69ba4479eb28b9b752bfd3 authored over 10 years ago by fritz-hh <[email protected]>
deskew and clean

github.com/ocrmypdf/OCRmyPDF - dabbddb04eea4754bf3f350789c1d5e392141efa authored over 10 years ago by Jim Barlow <[email protected]>
return right return code

Python does not map the expression to its return code automatically, so
this line returns succes...

github.com/ocrmypdf/OCRmyPDF - 5f173e5acb42b8bc594e3b8b3d5c9b42b5b4ea68 authored over 10 years ago by fritz-hh <[email protected]>
remove reportlab patch. fixes #91

remove patch that was required for versions of reportlab <3.0 (fixed in
3.0 now)
patch was neces...

github.com/ocrmypdf/OCRmyPDF - b28ff40aea81cf0059ab274faf980e3a56b23f3a authored over 10 years ago by fritz-hh <[email protected]>
Moving quickly - we can now output .ppm files at correct resolution

github.com/ocrmypdf/OCRmyPDF - fccfb4589e93f2209a7f326528ce9733fd62a16f authored over 10 years ago by Jim Barlow <[email protected]>
Initial ocrpage.py rewrite into python3

github.com/ocrmypdf/OCRmyPDF - 5384c980137c026613508e47030813eec2cacdb3 authored over 10 years ago by Jim Barlow <[email protected]>
Merge pull request #89 from jbarlow83/feature/readlink-osx

More portable solution (works also on OS X) to get OCRmyPDF.sh path (following symlinks)

github.com/ocrmypdf/OCRmyPDF - 2ed2307573d1769bf52fa212266e6d1eed6b1603 authored over 10 years ago by fritz-hh <[email protected]>
Eliminate readlink entirely and do the same thing on all platforms

github.com/ocrmypdf/OCRmyPDF - 3f8a2d8d3ed18583b46ff7a5707ce7958b7d2ec5 authored over 10 years ago by Jim Barlow <[email protected]>
Check if the input file exists

Previously I checked only whether the folder that should contain the
input file exists

github.com/ocrmypdf/OCRmyPDF - 1a13b7c85fbc3b0cc364a9f2166d767a33e94c26 authored over 10 years ago by fritz-hh <[email protected]>
Merge branch 'feature/keep-text-pages' into develop

github.com/ocrmypdf/OCRmyPDF - d7130a1e56f93c9544a19f2d50813b83488be8b9 authored over 10 years ago by Jim Barlow <[email protected]>
Fix parameter order problems

Put TESS_CFG_FILES last because it is optional and can be blank. If
omitted it breaks the sequen...

github.com/ocrmypdf/OCRmyPDF - f69054cb17e5c2a66eb033d149da49ef7d846a6f authored over 10 years ago by Jim Barlow <[email protected]>
Merge branches 'feature/readlink-osx' and 'feature/keep-text-pages' into develop

Conflicts:
OCRmyPDF.sh

github.com/ocrmypdf/OCRmyPDF - 80dc6eca2cab25c7b3456e2e9072dc834ef9f491 authored over 10 years ago by Jim Barlow <[email protected]>
Fix call to readlink on OS X

readlink -f is a GNU coreutils extension, so not available on OS X and
other platforms.

github.com/ocrmypdf/OCRmyPDF - d250fbb3d60afa7092c39f4a827ebaec779164b8 authored over 10 years ago by Jim Barlow <[email protected]>
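For comparison, Python's standard library can resolve symlinks without relying on GNU `readlink -f`. The sketch below is a swapped-in illustration, not the fix applied to OCRmyPDF.sh; the function name `script_directory` is hypothetical.

```python
import os


def script_directory(script_path):
    """Return the directory holding the real file behind `script_path`.

    os.path.realpath follows symlinks, so the result is the same whether
    the script is invoked directly or through a link.
    """
    return os.path.dirname(os.path.realpath(script_path))
```

Passing `__file__` (or `sys.argv[0]`) would give the folder from which helper scripts and configuration can be located, even when the tool is started via a symlink.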
Add command line option to skip pages that contain font data

If a page contains font data, the script would abort, unless -f was given,
in which case it woul...

github.com/ocrmypdf/OCRmyPDF - 09bbe92611d23b8aeb89a018a3ffff0452df8928 authored over 10 years ago by Jim Barlow <[email protected]>
Check for missing pdftoppm when poppler installed with --disable-splash-output

When I upgraded to poppler 0.24.5, pdftoppm was not compiled because the
script had --disable-sp...

github.com/ocrmypdf/OCRmyPDF - 69d922e096314117b1c3f5b3645949d4c5fde81e authored over 10 years ago by Jim Barlow <[email protected]>
prevent new spurious jhove message from being displayed

github.com/ocrmypdf/OCRmyPDF - d510e7e4aee471c7a167a79c9819e3edbece38b8 authored over 10 years ago by fritz-hh <[email protected]>
update to jhove v1.11

github.com/ocrmypdf/OCRmyPDF - 5893290dd9fba6c878798d2e445218417043ebb6 authored over 10 years ago by fritz-hh <[email protected]>
typo in OCRmyPDF.sh

github.com/ocrmypdf/OCRmyPDF - 5c3bbc4031ab44050a182fcd6431309030f4b1c4 authored over 10 years ago by fritz-hh <[email protected]>
add link to heise open source

github.com/ocrmypdf/OCRmyPDF - 27cd8cf0db14b7f77a3da51b2d4f26ddeace3029 authored over 10 years ago by fritz-hh <[email protected]>
Release notes updated for v2.1-stable

github.com/ocrmypdf/OCRmyPDF - b403016d5b8ec428cbcd2e124e8fbf78e4720221 authored over 10 years ago by fritz-hh <[email protected]>
Merge pull request #82 from orbitcowboy/v2.x

Fixed typo

github.com/ocrmypdf/OCRmyPDF - 5a81823969a021f05c9b75b4123b69097013a6a8 authored over 10 years ago by fritz-hh <[email protected]>
Merge pull request #83 from DorianScholz/v2.x

- small changes to make this work on Ubuntu 12.04 called via symlink
- lowered minimum parallel...

github.com/ocrmypdf/OCRmyPDF - 17801401cd437300d7a082dd24ac44d183618037 authored over 10 years ago by fritz-hh <[email protected]>
lowered minimum version for parallel to 20121122

github.com/ocrmypdf/OCRmyPDF - 5c7b2a2a364573f47eb8f8696b7dbd1447a19a4e authored over 10 years ago by Dorian Scholz <[email protected]>
added BASEPATH to allow for execution via symlink

github.com/ocrmypdf/OCRmyPDF - 1db06de2877f352c97d01c99f3030144adb33350 authored over 10 years ago by Dorian Scholz <[email protected]>
Fixed typo

github.com/ocrmypdf/OCRmyPDF - 3904178d44609bc4755df9eaacd8e48018571ecf authored over 10 years ago by Martin Ettl <[email protected]>
Merge pull request #81 from MoritzFago/v2.x

fixed tipo ghostcript to ghostscript

github.com/ocrmypdf/OCRmyPDF - 8bb9c3610cd5044f7450ba76f7411163056a7e26 authored over 10 years ago by fritz-hh <[email protected]>
fixed tipo ghostcript to ghostscript

github.com/ocrmypdf/OCRmyPDF - 7dcc382ccc07f9a00035e54b1f930128efb48646 authored over 10 years ago by MoritzFago <[email protected]>
Merge pull request #77 from andysigner/v2.x

Fixed typo in help text

github.com/ocrmypdf/OCRmyPDF - b71fc807d208251401155146be0a3980b1a9ed5c authored over 10 years ago by fritz-hh <[email protected]>
Fixed typo in help text

github.com/ocrmypdf/OCRmyPDF - 15d28d970aa89c0dbde2a0f4795aad63c53431ed authored over 10 years ago by Andy Signer <[email protected]>
Merge pull request #73 from andreas-christ/v2.x

Fixed typo in import of reportlab.

github.com/ocrmypdf/OCRmyPDF - e083a860e9805eda0e37711a8cc1ed8a02d1444a authored over 10 years ago by fritz-hh <[email protected]>
Fixed typo in import of reportlab.

github.com/ocrmypdf/OCRmyPDF - 6463b9dd840cc0eb76aeeb558b5e8c5c9d58dcba authored over 10 years ago by Andreas Christ <[email protected]>
Consider that the hocr file has not always the same name

Closes #72

github.com/ocrmypdf/OCRmyPDF - c873de6ca4ce15880a62c221eea4699090805bcb authored over 10 years ago by fritz-hh <[email protected]>
support both older and newer versions of reportlab

closes #71

github.com/ocrmypdf/OCRmyPDF - b70863b47e3e092d8761a9bee1e4564e68caec5f authored over 10 years ago by fritz-hh <[email protected]>
ignore *.pyc files

github.com/ocrmypdf/OCRmyPDF - 3546f84c6d13224daad9744309a880f92009bdc3 authored over 10 years ago by fritz-hh <[email protected]>
Add command line option to skip pages that contain font data

If a page contains font data, the script would abort, unless -f was given,
in which case it woul...

github.com/ocrmypdf/OCRmyPDF - 1d98917db933d8b7fdad44c57b8983c102cbc56f authored almost 11 years ago by Jim Barlow <[email protected]>
RELEASE_NOTES update prior to delivery of v2.0-stable

github.com/ocrmypdf/OCRmyPDF - 1c34fd69cf9bd40578ec33fc303c23cf9db9a714 authored almost 11 years ago by fritz-hh <[email protected]>
fixes #51

Allow tesseract 3.02.01 to be used.
Even 3.02.01 fails in few cases (see issue #28). I decided t...

github.com/ocrmypdf/OCRmyPDF - 4cf38404ccc34e4d01dece0c228e5bb92d9273c9 authored almost 11 years ago by fritz-hh <[email protected]>
Expose pixFindSkew API

github.com/ocrmypdf/OCRmyPDF - 112fb5098bfcddd6a5d23a620d47623170f6db38 authored almost 11 years ago by Jim Barlow <[email protected]>
Bug fix: leptonica generates .png when asked to produce .pbm/pgm/ppm

Leptonica does not interpret those extensions correctly. However, when
asked to produce a .pnm ...

github.com/ocrmypdf/OCRmyPDF - 5ace6906c73f475e3208459794a41eadb946999d authored almost 11 years ago by Jim Barlow <[email protected]>
Fix a silly typo, and other minor cleanup

github.com/ocrmypdf/OCRmyPDF - 8cfbdaf0d022dea822e76f1b93bf6a672f5ac07d authored almost 11 years ago by Jim Barlow <[email protected]>
Replace ImageMagick-convert with Leptonica

github.com/ocrmypdf/OCRmyPDF - 670343497677ac44b52b66ed1daedc2876442a46 authored almost 11 years ago by Jim Barlow <[email protected]>
Implement ctypes wrapper around Leptonica to access its deskew function

A few design notes:
Leptonica's deskew is far superior to ImageMagick's convert -deskew command ...

github.com/ocrmypdf/OCRmyPDF - 62edc15cd766d701728996ed6a41fb43b4d528b4 authored almost 11 years ago by Jim Barlow <[email protected]>
List supported languages

If the requested language is not supported, list the supported languages
in the error message

github.com/ocrmypdf/OCRmyPDF - be830ddc3132af8768035899bf513b6c786e125d authored almost 11 years ago by fritz-hh <[email protected]>
fixes #60

Check whether the languages passed to tesseract (-l) are supported

github.com/ocrmypdf/OCRmyPDF - 18322b424f4a9824dab417eaadb5d5dd4f9337a3 authored almost 11 years ago by fritz-hh <[email protected]>
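A check of this kind might look like the following Python sketch. It assumes the list of installed languages has already been obtained elsewhere; the function name `unsupported_languages` is hypothetical and this is not the project's implementation. The only Tesseract behavior relied on is that `-l` accepts several language codes joined by `+`.

```python
def unsupported_languages(requested, available):
    """Return the requested language codes that are missing from `available`.

    `requested` uses Tesseract's -l syntax, e.g. "eng+deu".
    `available` is an iterable of installed language codes (assumed input).
    """
    installed = set(available)
    # Split on "+" and ignore empty pieces such as those from "eng+".
    codes = [code for code in requested.split("+") if code]
    return [code for code in codes if code not in installed]
```

An empty result means every requested language is installed; otherwise the caller can report the missing codes along with the supported ones. The installed list could be built by parsing the output of `tesseract --list-langs`.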
more robust way to check tesseract version

better way of checking if the tesseract version is compatible with the
script.
If the required t...

github.com/ocrmypdf/OCRmyPDF - 6901c60db4bb211c1ea83cd5da7e131148705be2 authored almost 11 years ago by fritz-hh <[email protected]>
config file: version updated to v2.0-rc2

github.com/ocrmypdf/OCRmyPDF - e369ce67669f5f9a9a1cac4c8d8f3b96b7d3d0e4 authored almost 11 years ago by fritz-hh <[email protected]>
release notes updated for v2.0-rc2

github.com/ocrmypdf/OCRmyPDF - 64e4e5d91e7f71ca7d2843a028561bf96f7fdf1c authored almost 11 years ago by fritz-hh <[email protected]>