Ecosyste.ms: OpenCollective

An open API service for software projects hosted on Open Collective.

OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.
Collective - Host: opensource - https://opencollective.com/ocrmypdf - Code: https://github.com/jbarlow83/OCRmyPDF

More Dockerfile repair

I'm not fully happy with this arrangement, as it effectively downloads
OCRmyPDF twice, not to me...

github.com/ocrmypdf/OCRmyPDF - 58f4582517ea8f9ccf1f910bb87abccdf4e2fcfc authored almost 9 years ago by James R. Barlow <[email protected]>
Merge branch 'develop'

github.com/ocrmypdf/OCRmyPDF - 2d15c09cca214fa81dbc219b89c3c1414702a1f7 authored almost 9 years ago by James R. Barlow <[email protected]>
Fetch application from PyPI instead of local

setuptools_scm barfs because it can't find the version, because Docker hub
retrieves the applica...

github.com/ocrmypdf/OCRmyPDF - 04cb8865b0d1267d1cd3ff955420eb7a2f455a98 authored almost 9 years ago by James R. Barlow <[email protected]>
v3.2.1

github.com/ocrmypdf/OCRmyPDF - 6fe32bbaf7089af06b756c04068a9b34b577893a authored almost 9 years ago by James R. Barlow <[email protected]>
Bump Dockerfile versions

github.com/ocrmypdf/OCRmyPDF - 4abb20390dc7f881f25e2aad9d337c296123d67f authored almost 9 years ago by James R. Barlow <[email protected]>
Fix img2pdf 0.2 usage

All tests pass when forced to rely on img2pdf, so seems okay

github.com/ocrmypdf/OCRmyPDF - daa3916430558c1923b2201bfc9cd6336043cf4c authored almost 9 years ago by James R. Barlow <[email protected]>
Try img2pdf 0.2

github.com/ocrmypdf/OCRmyPDF - e9b87cefccffc0afaf74f9531615a7617aee0e4a authored almost 9 years ago by James R. Barlow <[email protected]>
Tighten up package requirements to deal with incompatible img2pdf 0.2 release

github.com/ocrmypdf/OCRmyPDF - 60593b5ad300ca1791aca7f1b8be541ded714b9f authored almost 9 years ago by James R. Barlow <[email protected]>
Fix Python 2.7 warning

github.com/ocrmypdf/OCRmyPDF - f708b11ea49162ac6465d1a5db2baf173a521a7a authored almost 9 years ago by James R. Barlow <[email protected]>
Try tweaking Dockerfile for automated build again

github.com/ocrmypdf/OCRmyPDF - 7982f58b2ea053cd7b55795821a6d890454b8c56 authored almost 9 years ago by James R. Barlow <[email protected]>
Minor fix for Dockerfile polyglot

github.com/ocrmypdf/OCRmyPDF - e805c1908a8cbd6c0d01f2ca22c03814e3fc465c authored almost 9 years ago by James R. Barlow <[email protected]>
Merge branch 'release/v3.2' into develop

github.com/ocrmypdf/OCRmyPDF - cb3ba8e97391f7b20363dafea3c5dbfd0ccb645b authored almost 9 years ago by James R. Barlow <[email protected]>
Merge branch 'release/v3.2'

github.com/ocrmypdf/OCRmyPDF - 344fc40cbcb6a0349cd853c864d905d5b1592ce9 authored almost 9 years ago by James R. Barlow <[email protected]>
Merge branch 'develop' into release/v3.2

github.com/ocrmypdf/OCRmyPDF - 7e5c37137b44fc4ded8f32dbe55a554b3bc10357 authored almost 9 years ago by James R. Barlow <[email protected]>
Update release notes for v3.2

github.com/ocrmypdf/OCRmyPDF - 1aae11714b243ad47dd796af06c06f9f2daa2ca6 authored almost 9 years ago by James R. Barlow <[email protected]>
Update .gitignore

github.com/ocrmypdf/OCRmyPDF - d82f14a7aabee34ec9a863163725cac71099f379 authored almost 9 years ago by James R. Barlow <[email protected]>
Set JPEG output quality to 95 for better transcoding

github.com/ocrmypdf/OCRmyPDF - 4b65e0b09384621df94c2cee0676953593386574 authored almost 9 years ago by James R. Barlow <[email protected]>
Bug in tesseract_noop spoof: produced wrong page sizes

Now checks input image to ensure the implied page size of its .hocr file
matches the rest of the...

github.com/ocrmypdf/OCRmyPDF - 43b0faa83071e298e51541ba64fcc8c51e99deb1 authored almost 9 years ago by James R. Barlow <[email protected]>
Merge commit 'ccfbb54e8c26784e438ba2fcac2179f21e7d857b' into release/v3.2

github.com/ocrmypdf/OCRmyPDF - 8674c9fb2083dc93fe2473b03cbd24dffb45df9a authored almost 9 years ago by James R. Barlow <[email protected]>
Update release notes for v3.2

Fix the notes

github.com/ocrmypdf/OCRmyPDF - ccfbb54e8c26784e438ba2fcac2179f21e7d857b authored almost 9 years ago by jbarlow83 <[email protected]>
Suppress tesseract argument printout

github.com/ocrmypdf/OCRmyPDF - 9893ebf889c066537796ca33e3da6410d200ae4e authored almost 9 years ago by James R. Barlow <[email protected]>
Merge commit 'ca546d70e5bff9e9b115371f7813f3c326822bd8' into release/v3.2

github.com/ocrmypdf/OCRmyPDF - 303eb3e93aa876019ea6dd33e3b345725acaa41f authored almost 9 years ago by James R. Barlow <[email protected]>
Merge pull request #45 from spwhitton/hocrtransform-shebang-fix

fix shebang in hocrtransform.py

github.com/ocrmypdf/OCRmyPDF - ca546d70e5bff9e9b115371f7813f3c326822bd8 authored almost 9 years ago by jbarlow83 <[email protected]>
fix shebang in hocrtransform.py

github.com/ocrmypdf/OCRmyPDF - 6a5ea2d64ae825b339d31bffafb3f4b77421cbb7 authored almost 9 years ago by Sean Whitton <[email protected]>
Reorg gitignore

github.com/ocrmypdf/OCRmyPDF - ec3d92ad8e71971f606a45d476324447970ef4c4 authored almost 9 years ago by James R. Barlow <[email protected]>
Improve organization of CFFI setup

github.com/ocrmypdf/OCRmyPDF - 66a095d7de3d440379b69d428bd3d0d39c701891 authored almost 9 years ago by James R. Barlow <[email protected]>
Experiment with CFFI instead of ctypes

github.com/ocrmypdf/OCRmyPDF - 411981efbcba27175dbc928f5ce528e959782ce0 authored almost 9 years ago by James R. Barlow <[email protected]>
Leptonica: convert to CFFI

github.com/ocrmypdf/OCRmyPDF - 350ad5210e075f2b9496931c26c2fdd495db8514 authored almost 9 years ago by James R. Barlow <[email protected]>
Suppress tesseract argument printout

github.com/ocrmypdf/OCRmyPDF - f3b588764ee0779a45be4ab653f4dcb6e444e15f authored almost 9 years ago by James R. Barlow <[email protected]>
Support optionally using leptonica to deskew

unpaper doesn't seem to be good at deskewing. It fails on test case
with a lot of italics. I thi...

github.com/ocrmypdf/OCRmyPDF - b49f5a7d7716b1effba05c424b841c72b16c3da2 authored almost 9 years ago by James R. Barlow <[email protected]>
Merge branch 'release/v3.2-rc1'

github.com/ocrmypdf/OCRmyPDF - bacbcba58a0e33d68f941163c0d2678c71c15978 authored almost 9 years ago by James R. Barlow <[email protected]>
Update release notes for v3.2-rc1

github.com/ocrmypdf/OCRmyPDF - 52e8aa434fd1029e0f6d3b19716935a11cb226e9 authored almost 9 years ago by James R. Barlow <[email protected]>
Better versioning: no silly version files, but wrong ver in development

Small price to pay.

github.com/ocrmypdf/OCRmyPDF - 37c508f3f884e7203d6053ea0d319cf2e5eaf8b6 authored almost 9 years ago by James R. Barlow <[email protected]>
More fiddling with version

github.com/ocrmypdf/OCRmyPDF - 26e36422cc66e6ebff7dfacce116ac5deb661c5c authored almost 9 years ago by James R. Barlow <[email protected]>
Try automatic versioning with setuptools_scm

github.com/ocrmypdf/OCRmyPDF - f82cb002bcfa4f55452575096e93eaeaef85f2aa authored almost 9 years ago by James R. Barlow <[email protected]>
Fix name of pdfa_def.ps

Used to include a copy of the parent dir's name.

github.com/ocrmypdf/OCRmyPDF - c1eb047a4b76aeeca1bb4983c99f8967e65119e5 authored almost 9 years ago by James R. Barlow <[email protected]>
Remove stale comment

github.com/ocrmypdf/OCRmyPDF - 626ca18f5c06188fb3f54310be299f3bc5f5981f authored almost 9 years ago by James R. Barlow <[email protected]>
New tests for ccitt, jbig2 encodings

github.com/ocrmypdf/OCRmyPDF - 9058dedfbefa5ca35904cbc1b11c021ca20c6474 authored almost 9 years ago by James R. Barlow <[email protected]>
Optimize: use img2pdf stream instead of repeated copies

github.com/ocrmypdf/OCRmyPDF - a0952bfca39f8a1550baae54ba4e0e18f9e42290 authored almost 9 years ago by James R. Barlow <[email protected]>
Use os.makedirs for test output directories

Broke Travis

github.com/ocrmypdf/OCRmyPDF - 354e61946e0ad7ec090189c609ebdb99824e1973 authored almost 9 years ago by James R. Barlow <[email protected]>
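
To illustrate the os.makedirs commit above, here is a minimal sketch of creating a nested test output directory in one call. The helper name ensure_output_dir and the directory names are hypothetical and are not taken from the OCRmyPDF test suite.

```python
import os
import tempfile


def ensure_output_dir(base, *parts):
    """Create base/parts... and return the path (hypothetical helper)."""
    path = os.path.join(base, *parts)
    # exist_ok=True keeps repeated test runs from failing when the
    # directory already exists; missing parents are created as needed.
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(ensure_output_dir(tmp, "output", "resources"))
```

Unlike os.mkdir, os.makedirs does not require the parent directory to exist, which matters on a fresh CI checkout.
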
Merge branch 'feature/pypdf-page-merge' into develop

github.com/ocrmypdf/OCRmyPDF - fd6d1d748a419086849d2a54d8c6a2cb35895502 authored almost 9 years ago by James R. Barlow <[email protected]>
Adjust test_oversample test case

Add -f to force generation of the background image at the desired
oversample resolution. Our ne...

github.com/ocrmypdf/OCRmyPDF - 360acd1e2cdc01156c08811f744d7280ea1ae785 authored almost 9 years ago by James R. Barlow <[email protected]>
Fix all but test_oversample[hocr]

github.com/ocrmypdf/OCRmyPDF - fc0479f1100a3baea8726b477546d5bdf9c798c3 authored almost 9 years ago by James R. Barlow <[email protected]>
Implement image+text merging in other cases

5 failed, 28 passed

failures:
test_oversample[hocr], test_skip_ocr, test_skip_big, test_maximum...

github.com/ocrmypdf/OCRmyPDF - 62728205b6ff770f681ce9af19110adac870fa5d authored almost 9 years ago by James R. Barlow <[email protected]>
Render hocr page: no longer needs an image as input

github.com/ocrmypdf/OCRmyPDF - dc0fb25e64d8387e454988ba492a1693a8ea6a12 authored almost 9 years ago by James R. Barlow <[email protected]>
Update pipeline.svg

github.com/ocrmypdf/OCRmyPDF - f3e04cce56cd153d3f888d25269e4b7471619910 authored almost 9 years ago by James R. Barlow <[email protected]>
Add safety check to prevent merge from running when not sensible

github.com/ocrmypdf/OCRmyPDF - 7067110308bbcbca4cbb26249cb9313b2d52fb37 authored almost 9 years ago by James R. Barlow <[email protected]>
Implement "perfect reconstruction" - transfer page and watermark OCR layer

Works, does not account for changes to clean/deskew, etc.
Surprisingly, it works. PyPDF2 fixes s...

github.com/ocrmypdf/OCRmyPDF - 599d8897039dd29419a293d3a1289a58c70b5b0e authored almost 9 years ago by James R. Barlow <[email protected]>
Merge branch 'feature/test-pageinfo-cleanup' into develop

github.com/ocrmypdf/OCRmyPDF - 2fa8366632db2db54ec38212ba040da89ac31422 authored almost 9 years ago by James R. Barlow <[email protected]>
New hocrtransform test

github.com/ocrmypdf/OCRmyPDF - c368c51badd7352a0b01a7cc9107e88ee5a6feda authored almost 9 years ago by James R. Barlow <[email protected]>
Move pageinfo test into tests folder

github.com/ocrmypdf/OCRmyPDF - 7c558b37133a9cb18f0966165a8b361036c1286c authored almost 9 years ago by James R. Barlow <[email protected]>
Merge branch 'feature/pagesegmode' into develop

github.com/ocrmypdf/OCRmyPDF - 8d323ae5102569640d3f55fca228f4154037b22c authored almost 9 years ago by James R. Barlow <[email protected]>
Use tesseract cache for -psm

github.com/ocrmypdf/OCRmyPDF - 3b53e9adac014aa46ab7a2f378632feef5271755 authored almost 9 years ago by James R. Barlow <[email protected]>
Activate --tesseract-pagesegmode

github.com/ocrmypdf/OCRmyPDF - 074c1d71b49a49f025efed1f76c353000bab29a9 authored almost 9 years ago by James R. Barlow <[email protected]>
Adjust command line parameters

Was splitting each argument to --tesseract-config into a list of single
character strings

github.com/ocrmypdf/OCRmyPDF - 1fca9a004dd1f6d6b6e5419908416c5cc818c977 authored almost 9 years ago by James R. Barlow <[email protected]>
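
The symptom described above (an argument split into single-character strings) is a common Python pitfall. The sketch below shows one way it can arise with argparse and one way to avoid it. It is an assumption about the class of bug, not a reconstruction of the actual commit; the function names parse_buggy and parse_fixed are hypothetical.

```python
import argparse


def parse_buggy(argv):
    parser = argparse.ArgumentParser()
    # type=list calls list("hocr"), which iterates over the string and
    # yields one-character strings.
    parser.add_argument("--tesseract-config", type=list, default=[])
    return parser.parse_args(argv).tesseract_config


def parse_fixed(argv):
    parser = argparse.ArgumentParser()
    # action="append" stores each occurrence of the option as one
    # whole string in a list.
    parser.add_argument("--tesseract-config", action="append", default=[])
    return parser.parse_args(argv).tesseract_config


if __name__ == "__main__":
    print(parse_buggy(["--tesseract-config", "hocr"]))
    print(parse_fixed(["--tesseract-config", "hocr"]))
```
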
Override ruffus' handling of --jobs

Ruffus treats omitted parameter as -j1. For our purposes it makes more
sense for omitting the pa...

github.com/ocrmypdf/OCRmyPDF - b485a1ef78c86baf9f3f10722ea9e60ecc5c94bf authored almost 9 years ago by James R. Barlow <[email protected]>
Merge branch 'hotfix/v3.1.1' into develop

# Conflicts:
# RELEASE_NOTES.rst

github.com/ocrmypdf/OCRmyPDF - 326ef7a3ac949b06eaf5cda1243ac8d981de2b0b authored almost 9 years ago by James R. Barlow <[email protected]>
Merge branch 'hotfix/v3.1.1'

github.com/ocrmypdf/OCRmyPDF - 12bc58b5b63a83a4a4988070ac4114471603bd14 authored almost 9 years ago by James R. Barlow <[email protected]>
Bump version

github.com/ocrmypdf/OCRmyPDF - 6af0815681ea9cd22532e73ad9feec24cf9e7df5 authored almost 9 years ago by James R. Barlow <[email protected]>
Merge branch 'hotfix/v3.1.1' into develop

github.com/ocrmypdf/OCRmyPDF - 66c2b9b78e22db3d81fbc70693b899992bfb4242 authored almost 9 years ago by James R. Barlow <[email protected]>
Supporting all languages bloats the image by an extra 1 GB

Make it a special image

github.com/ocrmypdf/OCRmyPDF - d03c056cb11f8ec9cc4b01c9f77c220ff5b7a3fe authored about 9 years ago by James R. Barlow <[email protected]>
Dockerfile: remove manual build of unpaper

Fortunately unpaper now exists as binary package, eliminating the need
to install all of the bui...

github.com/ocrmypdf/OCRmyPDF - 3f94d628fa8ed0f8ae6ad719e56048cbc00b4d82 authored about 9 years ago by James R. Barlow <[email protected]>
Update dockerfile: include all languages

Also update ignore files

github.com/ocrmypdf/OCRmyPDF - a64c7dbe99946ea7c6e1abc4ef1acfb901beb8f4 authored about 9 years ago by James R. Barlow <[email protected]>
Place ruffus database in temporary folder

Because we don't really use ruffus checkpoint feature, putting the
database in a permanent locat...

github.com/ocrmypdf/OCRmyPDF - 61b3ccb57c2d59d07a0f6696dc33ec8031c6eeaf authored about 9 years ago by James R. Barlow <[email protected]>
Just go right ahead and demand Python 3.4

github.com/ocrmypdf/OCRmyPDF - 424b4b33b15b923d4d55929a8205f0570a6ce022 authored about 9 years ago by James R. Barlow <[email protected]>
Python 2 warning message

github.com/ocrmypdf/OCRmyPDF - e510f89792d3fbe4def6832f314ec6263a2d09c1 authored about 9 years ago by James R. Barlow <[email protected]>
Off by one error in page info calculation

github.com/ocrmypdf/OCRmyPDF - 49cd6cc619fc7ce3a1be28a898af06be3cc2eb69 authored about 9 years ago by James R. Barlow <[email protected]>
Tell Travis about the cache

github.com/ocrmypdf/OCRmyPDF - 9aa3d340d46cd1fb1f57a937848bef01c8771399 authored about 9 years ago by James R. Barlow <[email protected]>
Adjust test cases to use cache and noop more effectively

This reduces total execution time to 164s on my machine, down from
about double that.

github.com/ocrmypdf/OCRmyPDF - 09782242c84f836cf5001353d021c028118d6348 authored about 9 years ago by James R. Barlow <[email protected]>
Add tesseract caching to speed up tests

github.com/ocrmypdf/OCRmyPDF - 9ec4aa039dbf512ea91ca4f7dac0d48ca2b122ee authored about 9 years ago by James R. Barlow <[email protected]>
Let some tests use the spoofed tesseract

Where getting OCR doesn't matter

github.com/ocrmypdf/OCRmyPDF - ecebe2f24b5ca5e5a9cb2fa1ce2ebc8ee58858f7 authored about 9 years ago by James R. Barlow <[email protected]>
Implement pdf renderer side of tess spoof

github.com/ocrmypdf/OCRmyPDF - 7313a77c2a0f8d0e20b61a50b55a13fd18420498 authored about 9 years ago by James R. Barlow <[email protected]>
Add Tesseract spoofing

github.com/ocrmypdf/OCRmyPDF - 45113676a3c635c25f1b431d84a9fdd6b1dc6ff0 authored about 9 years ago by James R. Barlow <[email protected]>
Check for encrypted PDF and complain appropriately

github.com/ocrmypdf/OCRmyPDF - 102bd07019ee494a02436bfdbfaa9360470d717f authored about 9 years ago by James R. Barlow <[email protected]>
Use envvars in a new test case

And get rid of the messy binary replacement spoofing

github.com/ocrmypdf/OCRmyPDF - 9622e31da9e783e0b7b936ff12b93ae452487f92 authored about 9 years ago by James R. Barlow <[email protected]>
Environment variables can now override default programs

github.com/ocrmypdf/OCRmyPDF - 1731ce2a44b6c62674e8a29182542cf9f4494093 authored about 9 years ago by James R. Barlow <[email protected]>
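
A minimal sketch of how an environment variable can override a default program name, as the commit above describes. The OCRMYPDF_ prefix convention and the get_program helper are assumptions made for illustration; the commit message does not state the variable names.

```python
import os


def get_program(name, environ=None):
    """Return the executable to run for `name` (hypothetical helper).

    Looks for an override in OCRMYPDF_<NAME> (assumed naming scheme)
    and falls back to the bare program name, which is then resolved
    through PATH as usual.
    """
    environ = os.environ if environ is None else environ
    return environ.get("OCRMYPDF_" + name.upper(), name)


if __name__ == "__main__":
    print(get_program("tesseract"))
```

Passing the environment as a parameter lets a test substitute a fake mapping instead of modifying os.environ.
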
Did a quick test of Ghostscript vs QPDF at PDF page splitting

qpdf won so hard it wasn't funny, even though it must be called once
per page to do the job. Per...

github.com/ocrmypdf/OCRmyPDF - 276f421c446d8a202c0a461896551d3110eaf081 authored about 9 years ago by James R. Barlow <[email protected]>
All subprocess invocations refactored out of main.py

github.com/ocrmypdf/OCRmyPDF - 133357779aa2e947a98ab72bd6da79bcb21d7980 authored about 9 years ago by James R. Barlow <[email protected]>
Move PDF validation check to qpdf.py

github.com/ocrmypdf/OCRmyPDF - 5d8167b232eb2fb64d01cb61b75af77f16d1908c authored about 9 years ago by James R. Barlow <[email protected]>
Move more qpdf calls into qpdf.py

github.com/ocrmypdf/OCRmyPDF - e76ae8c46c44400adb81b9e889372d72cfe54d0a authored about 9 years ago by James R. Barlow <[email protected]>
Refactor qpdf subprocess calls into module

github.com/ocrmypdf/OCRmyPDF - 53a7c0e66892acff8afeee41fd4c846738ca160f authored about 9 years ago by James R. Barlow <[email protected]>
Merge commit '9f374461559460527e47237323e511123f31b6b0' into feature/envvars

github.com/ocrmypdf/OCRmyPDF - 4ca243e4900dcee8e815c76516ce2306e337ce93 authored about 9 years ago by James R. Barlow <[email protected]>
Merge pull request #34 from shemgp/master

Don't exit when qpdf repairs the file successfully but displays warning

github.com/ocrmypdf/OCRmyPDF - 9f374461559460527e47237323e511123f31b6b0 authored about 9 years ago by jbarlow83 <[email protected]>
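
The idea behind this pull request can be sketched as follows. qpdf documents exit status 0 for success, 2 for errors, and 3 for warnings without errors, so a caller that treats every nonzero status as fatal will also abort on a successful repair. The function name qpdf_succeeded is hypothetical and this is not the code from the pull request.

```python
# Exit statuses documented by qpdf.
QPDF_OK = 0
QPDF_ERRORS = 2
QPDF_WARNINGS = 3


def qpdf_succeeded(returncode):
    """Return True when qpdf produced usable output (hypothetical helper).

    Warnings (status 3) mean the file was processed, possibly after
    repair, so they are not treated as a failure here.
    """
    return returncode in (QPDF_OK, QPDF_WARNINGS)


if __name__ == "__main__":
    for code in (0, 2, 3):
        print(code, qpdf_succeeded(code))
```

Returning a boolean rather than an integer flag matches the follow-up commit "Use boolean instead of integers".
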
Use boolean instead of integers

github.com/ocrmypdf/OCRmyPDF - d7c7559b05d49a461c5a41a21fa7082ea827c77e authored about 9 years ago by Shem Pasamba <[email protected]>
Don't exit when qpdf repair was successful

github.com/ocrmypdf/OCRmyPDF - b2b66d134482f559922c072f2bddd7e0948207cc authored about 9 years ago by Shem Pasamba <[email protected]>
Refactor tesseract --pdfrenderer calls to tesseract.py

github.com/ocrmypdf/OCRmyPDF - 5d111a3c04d1c1fb6d0f1e8cf5a7475b66252539 authored about 9 years ago by James R. Barlow <[email protected]>
Migrate tesseract-hocr code to tesseract module, because modularity

github.com/ocrmypdf/OCRmyPDF - 10416f847f20968af2e542de6512da2fa3baab5f authored about 9 years ago by James R. Barlow <[email protected]>
All tests passed, bump version

github.com/ocrmypdf/OCRmyPDF - 79b3472b26d32183664bfc8d91f79c2259f36e76 authored about 9 years ago by James R. Barlow <[email protected]>
Merge branch 'feature/pdfa-2' into develop

github.com/ocrmypdf/OCRmyPDF - f1b2f1ae0857ff03cdad170e566eb7857ca35c45 authored about 9 years ago by James R. Barlow <[email protected]>
Trivial

github.com/ocrmypdf/OCRmyPDF - ee7d97ae8c50187ddf35a4f95e4a57f4e7af8e89 authored about 9 years ago by James R. Barlow <[email protected]>
Remove eval() call by introspecting ExitCode

github.com/ocrmypdf/OCRmyPDF - 7d9f473bb1b3dcf39c1f5142b09a917728739ddd authored about 9 years ago by James R. Barlow <[email protected]>
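
As a sketch of the general technique named in the commit above, a name can be mapped to an exit code by looking it up on the enumeration rather than passing a string to eval(). The ExitCode members and values shown here are hypothetical placeholders, and exit_code_from_name is an illustrative helper, not the project's code.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    # Hypothetical members and values, for illustration only.
    ok = 0
    bad_args = 1
    input_file = 2


def exit_code_from_name(name):
    """Look up an ExitCode member by name without eval()."""
    try:
        # Enum item access only accepts defined member names, so
        # arbitrary expressions cannot be executed.
        return ExitCode[name]
    except KeyError:
        raise ValueError("unknown exit code name: %r" % (name,))


if __name__ == "__main__":
    print(int(exit_code_from_name("bad_args")))
```
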
We don't want threads. Really. Do. Not. Want.

github.com/ocrmypdf/OCRmyPDF - e77a5e5e75dba89322e9437864ddabd010d09e5f authored about 9 years ago by James R. Barlow <[email protected]>
Comments

github.com/ocrmypdf/OCRmyPDF - 6ab19af1220fa2493c6028c701e2a0e68f11515d authored about 9 years ago by James R. Barlow <[email protected]>
Better error messages for input file not found or invalid

Not as good as finding a general way to deal with ruffus exceptions, but
better than nil.

github.com/ocrmypdf/OCRmyPDF - 276fe498679070b4ca22d213c06c6b859c6c7fdd authored about 9 years ago by James R. Barlow <[email protected]>
Fix issue #20 - fails on uppercase .PDF

github.com/ocrmypdf/OCRmyPDF - acb31abe86bc3fd7b55e81557f216ed50237bee5 authored about 9 years ago by James R. Barlow <[email protected]>
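
A minimal sketch of a case-insensitive extension check, which is one way to avoid failing on an uppercase .PDF suffix. This is an assumption about the kind of fix involved; the helper name has_pdf_extension is hypothetical and the commit message does not describe the actual change.

```python
import os


def has_pdf_extension(path):
    """Return True if path ends in .pdf, ignoring case (hypothetical helper)."""
    # splitext returns (root, ext); lowercasing ext makes ".PDF" and
    # ".Pdf" compare equal to ".pdf".
    return os.path.splitext(path)[1].lower() == ".pdf"


if __name__ == "__main__":
    for name in ("scan.pdf", "SCAN.PDF", "scan.png"):
        print(name, has_pdf_extension(name))
```
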
Introduce --pdf-renderer auto

Tess 3.03 has various quality problems like wrong DPI that are fixed
in Tess 3.04. Idea here i...

github.com/ocrmypdf/OCRmyPDF - 4f964a3c8ad0a97b52fbd0a4e39108497f43cb29 authored about 9 years ago by James R. Barlow <[email protected]>
pageinfo: workaround PyPDF extractText limitations on hidden text

It appears that extractText() does not find all text. At a glance it
may be that Tesseract's PDF...

github.com/ocrmypdf/OCRmyPDF - df1fda74388edd2125852db7cf2dde9840c3e7e7 authored about 9 years ago by James R. Barlow <[email protected]>
pageinfo: improve robustness of text test for Tesseract produced PDFs

github.com/ocrmypdf/OCRmyPDF - d6124c17878a99f8ed7dde24a7c83c35cca79899 authored about 9 years ago by James R. Barlow <[email protected]>
Set /Creator metadata to OCRmyPDF

with reference to Tess version and settings

github.com/ocrmypdf/OCRmyPDF - 80d89b54208c911f56a3949a7f07b8f5cc56e7df authored about 9 years ago by James R. Barlow <[email protected]>
Choose PDF/A-2b by default instead of A-1b

github.com/ocrmypdf/OCRmyPDF - 74059eecf1240961e64d84a2a34ae95091a369cf authored about 9 years ago by James R. Barlow <[email protected]>