Ecosyste.ms: OpenCollective

An open API service for software projects hosted on Open Collective.

OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.
Collective - Host: opensource - https://opencollective.com/ocrmypdf - Code: https://github.com/jbarlow83/OCRmyPDF

Actually link the release notes

github.com/ocrmypdf/OCRmyPDF - 6e6f918630bba7077ba9a50d75a138767422bce7 authored over 9 years ago by jbarlow83 <[email protected]>
Fix git clone command with one I tested ;)

github.com/ocrmypdf/OCRmyPDF - 46338122461b510add63fd83c39693ff96a52020 authored over 9 years ago by jbarlow83 <[email protected]>
Update README with more detailed instructions

github.com/ocrmypdf/OCRmyPDF - 14bd1555aa2f5b214e662b036fe8774236aa6d3c authored over 9 years ago by jbarlow83 <[email protected]>
Fixes: clarify install instructions and reactivate external program checks

github.com/ocrmypdf/OCRmyPDF - b9d7687fa096cca819adcc1800cc652238857af2 authored over 9 years ago by James R. Barlow <[email protected]>
Merge branch 'develop'

# Conflicts:
# RELEASE_NOTES.md
# src/config.sh
# src/hocrTransform.py
# src/ocrPage.sh

github.com/ocrmypdf/OCRmyPDF - 93b36965e2b33bed12603b2c4bdad749fb74a2c1 authored over 9 years ago by James R. Barlow <[email protected]>
-rc2: because pypi won't accept -rc1

github.com/ocrmypdf/OCRmyPDF - 9e0c443c2f2d9923feeb05c3e5c0778bf1c209d6 authored over 9 years ago by James R. Barlow <[email protected]>
Don't mess with options

github.com/ocrmypdf/OCRmyPDF - 60832152b1d25698ef8d9b7de7dc96a8256a77de authored over 9 years ago by James R. Barlow <[email protected]>
Update release notes, add copyrights

github.com/ocrmypdf/OCRmyPDF - 6a160d22fe4a5712156013fd6aeace9fc5ed3fe7 authored over 9 years ago by James R. Barlow <[email protected]>
More test cases

github.com/ocrmypdf/OCRmyPDF - e35526192ce0de18713a75b6bc20b1cf01ee6ad5 authored over 9 years ago by James R. Barlow <[email protected]>
More test cases for other parameters

github.com/ocrmypdf/OCRmyPDF - bea57bdded53ddd117740f289861e45b37dcbcd4 authored over 9 years ago by James R. Barlow <[email protected]>
Minor tweaks to uncommon arguments

github.com/ocrmypdf/OCRmyPDF - 2a9da225e4636c0b5b81776f4dfb5a8f99380883 authored over 9 years ago by James R. Barlow <[email protected]>
Test cases for --tesseract-timeout

github.com/ocrmypdf/OCRmyPDF - a3f37de9b5f24323dfc34e17d60a302b050b5ed9 authored over 9 years ago by James R. Barlow <[email protected]>
Get rid of subprocess call on import of tesseract, unpaper -- bit nasty

github.com/ocrmypdf/OCRmyPDF - 606416095389c170d43bdb1962f9bc1423c7c6b9 authored over 9 years ago by James R. Barlow <[email protected]>
Drop nose, all tests working reasonably again

Although the real issue was that the ruffus pipeline cannot be executed
twice in the same proces...

github.com/ocrmypdf/OCRmyPDF - 850814131426c0e009472665c8253b625bb134f7 authored over 9 years ago by James R. Barlow <[email protected]>
nose can't really handle external tests so looking into py.test instead

Specifically it trips over the need to reimport ocrmypdf.main. That in
turn raises questions ab...

github.com/ocrmypdf/OCRmyPDF - 1c9559788297984628514a346e22edf987da7df2 authored over 9 years ago by James R. Barlow <[email protected]>
--oversample: Default to 0

github.com/ocrmypdf/OCRmyPDF - 587fa63c8e8e0364c33e4384a1fad8794bb2f2f3 authored over 9 years ago by James R. Barlow <[email protected]>
Add --oversample test for hocr rendering

github.com/ocrmypdf/OCRmyPDF - b40eec4cb0b07d30b8dec32b6ebfbe1efc6c9559 authored over 9 years ago by James R. Barlow <[email protected]>
Add test to confirm that metadata is transferred to final PDF/A

github.com/ocrmypdf/OCRmyPDF - 7bcd48c26924b79d988248db4ac1890c209f1597 authored over 9 years ago by James R. Barlow <[email protected]>
Improve argument handling, test cases

github.com/ocrmypdf/OCRmyPDF - 2e7cd52c0f7002f9fa17003b97e7465c5c02dd3e authored over 9 years ago by James R. Barlow <[email protected]>
Put ghostscript in a module

github.com/ocrmypdf/OCRmyPDF - 77d4cb367e14b7f86371e28e8795a28989084448 authored over 9 years ago by James R. Barlow <[email protected]>
Implement tesseract timeout

github.com/ocrmypdf/OCRmyPDF - 2c45c5abc63d29cc91b8caa956163a6ee26b6ec0 authored over 9 years ago by James R. Barlow <[email protected]>
Implement tesseract PDF rendering as an alternative

It's much better at rendering text baselines than hocr and seems to
produce small file sizes, so ...

github.com/ocrmypdf/OCRmyPDF - a89afabd798329c451f0099fa1a9756392ddffd9 authored over 9 years ago by James R. Barlow <[email protected]>
setup.py: Only do program checks when installing

github.com/ocrmypdf/OCRmyPDF - 03f7c9bf07aaa381cc752e4ddd34cad2b6322b59 authored over 9 years ago by James R. Barlow <[email protected]>
setup.py: check for third party program requirements

github.com/ocrmypdf/OCRmyPDF - d5f4862749eba85a1d309a676c5551f269f7d469 authored over 9 years ago by James R. Barlow <[email protected]>
More testing: JPEG

github.com/ocrmypdf/OCRmyPDF - 8aced0b6d3c5d003d11a309651ed24ac1b61e2e0 authored over 9 years ago by James R. Barlow <[email protected]>
Don't create inline images in output PDFs

...except that Ghostscript will sometimes turn out of line images into
inline images on its own,...

github.com/ocrmypdf/OCRmyPDF - 6b9adef6849f4013b0caee82787d947bd924dcee authored over 9 years ago by James R. Barlow <[email protected]>
Make this PDF a whole image page

Originally it had a smaller image centred in a page, which is not quite
supported.

github.com/ocrmypdf/OCRmyPDF - 5440d988fc57cf3457a9103c981b551d6fe5185d authored over 9 years ago by James R. Barlow <[email protected]>
pageinfo: drop pdftotext and use PyPDF instead

github.com/ocrmypdf/OCRmyPDF - 30da4fc569ac3f8ca68cc32510b1b5cc10384bd1 authored over 9 years ago by James R. Barlow <[email protected]>
Test cases for pageinfo; complain about inline images

github.com/ocrmypdf/OCRmyPDF - 2c1b5e100b79d7c620a21cdef75e9741e6b4159e authored over 9 years ago by James R. Barlow <[email protected]>
Add some pageinfo test cases; found problem with inline images

github.com/ocrmypdf/OCRmyPDF - 3684f278ed6b00d51bf2177eb3369387a07bcf98 authored over 9 years ago by James R. Barlow <[email protected]>
Remove redundant *res_render

github.com/ocrmypdf/OCRmyPDF - 6c3cb6acba9856bd5af312537cdb6875c06d5b77 authored over 9 years ago by James R. Barlow <[email protected]>
Replace .md with .rst

GitHub supports both, and PyPI expects .rst files, so use .rst and make
everyone happy.

Auto-co...

github.com/ocrmypdf/OCRmyPDF - b98ba8d17451e1a89a4fa710abcc1fd53d63077e authored over 9 years ago by James R. Barlow <[email protected]>
More packaging changes: move jhove, fix console script

github.com/ocrmypdf/OCRmyPDF - d3088829af87192719077f93abc80c98374bab17 authored over 9 years ago by James R. Barlow <[email protected]>
Packaging stuff

github.com/ocrmypdf/OCRmyPDF - 9aaaba17149600c787217e444d4428b8b075ddfd authored over 9 years ago by James R. Barlow <[email protected]>
Prepare for Python packaging - move to ocrmypdf folder

github.com/ocrmypdf/OCRmyPDF - 9adb0d696f6d0112bca8fe7ae6376ed1048bcd12 authored over 9 years ago by Jim Barlow <[email protected]>
Update release notes so far

github.com/ocrmypdf/OCRmyPDF - c270f1ba5fd14d41f62524fcbb4abbf1c0972d50 authored over 9 years ago by Jim Barlow <[email protected]>
Metadata override from command line

github.com/ocrmypdf/OCRmyPDF - 7b255b575abb0fbd6f47c766cb0892ee9d2ab532 authored over 9 years ago by Jim Barlow <[email protected]>
Transfer Unicode document information from input PDF to output PDF

What a pain getting Unicode right, but there it is.

I cannot find anything to confirm that it i...

github.com/ocrmypdf/OCRmyPDF - d7a9f3a2ab4622410b50afb9b948a533e5f9d69e authored over 9 years ago by Jim Barlow <[email protected]>
Copy document metadata from source document into output (untested)

This works for ASCII only; will do Unicode version.

github.com/ocrmypdf/OCRmyPDF - abf2e7e9bb2a62052ae6531100169f3df91f84ef authored over 9 years ago by Jim Barlow <[email protected]>
Reimplement debug pages

github.com/ocrmypdf/OCRmyPDF - 72e5fa9ba04900214545bc21602fc09625214e03 authored over 9 years ago by Jim Barlow <[email protected]>
Reimplement skip text pages

github.com/ocrmypdf/OCRmyPDF - 32c1078d2cd721c4ea441cbd6978c997ce5551d0 authored over 9 years ago by Jim Barlow <[email protected]>
Change @subdivide to @split

@split is for "1 to many" operations, so it's the right tool for this
case.

github.com/ocrmypdf/OCRmyPDF - 133f901a6912860afca8300c837774745afe5981 authored over 9 years ago by Jim Barlow <[email protected]>
Try to make pdfinfo less obnoxious by printing too many decimals

github.com/ocrmypdf/OCRmyPDF - 42cd683ec072275a36b125191c18a362a384c26a authored over 9 years ago by Jim Barlow <[email protected]>
For now, unpaper is the only deskew provider

github.com/ocrmypdf/OCRmyPDF - 151eb0537751c32ce5d5144e38607f14f30fb40c authored over 9 years ago by Jim Barlow <[email protected]>
Remove ability to override temporary (working) folder

Little point to this feature - on most platforms the environment
variable can be overridden if d...

github.com/ocrmypdf/OCRmyPDF - 16177d0a5208dde52a2f3abe194aebc80aeecfae authored over 9 years ago by Jim Barlow <[email protected]>
Automatically try to use all available CPUs

github.com/ocrmypdf/OCRmyPDF - 5ce544289fd206485d8f8f1b0ad75e0cf2e11f46 authored over 9 years ago by Jim Barlow <[email protected]>
Remove duplicate test folder

github.com/ocrmypdf/OCRmyPDF - 77bd35c3c7d91456c612db68d760b9e16b1e90fe authored over 9 years ago by Jim Barlow <[email protected]>
Goodbye, so long, farewell, shell...

github.com/ocrmypdf/OCRmyPDF - 0c5c208db0ac18fbf309bb6a3e065adcfd4a30f4 authored over 9 years ago by Jim Barlow <[email protected]>
Split selecting final image and render PDF result into separate tasks

Simplifies the logic - one deals with all images, the other deals
with an image and .hocr. Als...

github.com/ocrmypdf/OCRmyPDF - 60eb745331cda44b2c4db3a65af7cb905442740a authored over 9 years ago by Jim Barlow <[email protected]>
Modularize unpaper; get -d and -c working again

github.com/ocrmypdf/OCRmyPDF - 9f90b5cb0a31927617e23728ece2e162cadf09df authored over 9 years ago by Jim Barlow <[email protected]>
Remove more dead/old code

github.com/ocrmypdf/OCRmyPDF - 5adff94545ea2df71da5932260fc7b22456f2547 authored over 9 years ago by Jim Barlow <[email protected]>
Implement deskew and clean using unpaper

github.com/ocrmypdf/OCRmyPDF - aa2baabfa9914fc06276f8a5c6dac1fa8404c1c8 authored over 9 years ago by Jim Barlow <[email protected]>
Cleanup externals

github.com/ocrmypdf/OCRmyPDF - 75c2b23efcf32a0e291d08bd405ae101e7ea744f authored over 9 years ago by Jim Barlow <[email protected]>
Implement oversample

github.com/ocrmypdf/OCRmyPDF - 6451017962a2b1c83d7906349193f3f9333ce9c8 authored over 9 years ago by Jim Barlow <[email protected]>
Put .rendered.pdf files into temp folder

github.com/ocrmypdf/OCRmyPDF - 0f857a6a3459973bc8f0f688ef947523f81b69e5 authored over 9 years ago by Jim Barlow <[email protected]>
Change 'clean' to 'repair' for clarity since 'clean' is what unpaper does

github.com/ocrmypdf/OCRmyPDF - 7638a88a6a32e59f48bfd8ff474ba0cfbf35470c authored over 9 years ago by Jim Barlow <[email protected]>
Remove 'pdftoppm' renderer

Ghostscript is more reliable than Poppler's pdftoppm renderer. gs is
also a hard dependency, as ...

github.com/ocrmypdf/OCRmyPDF - bed12d20218e640b712a75ebff91fc675e6f7a4c authored over 9 years ago by Jim Barlow <[email protected]>
Platform independent search for iccprofiles for PDF/A

github.com/ocrmypdf/OCRmyPDF - 8c0dc9a06da9a9fac10a220570fcce22dacd3136 authored over 9 years ago by Jim Barlow <[email protected]>
First successful PDF/A produced by new pipeline

github.com/ocrmypdf/OCRmyPDF - 289e4025ad61df4fbdec2ff2d340b39b4b38cb17 authored over 9 years ago by Jim Barlow <[email protected]>
Rasterize PDF pages and generate .hocr files

github.com/ocrmypdf/OCRmyPDF - 5476eafe4ca4046fc38a02894bb151b227f3bb99 authored over 9 years ago by Jim Barlow <[email protected]>
Language checking

github.com/ocrmypdf/OCRmyPDF - df32f283cd01f713d9b8c59a440aa26371d6b2fc authored over 9 years ago by Jim Barlow <[email protected]>
Add tesseract version check

github.com/ocrmypdf/OCRmyPDF - 68ecaac9cca59765b6b6e6c31429d4850c500463 authored over 9 years ago by Jim Barlow <[email protected]>
Add PDF/A validation

github.com/ocrmypdf/OCRmyPDF - cffd4623ca77d49c54e9f0c827bb9b69c9ac7bd9 authored over 9 years ago by Jim Barlow <[email protected]>
Can now generate PDF/A files, multipage and single page

github.com/ocrmypdf/OCRmyPDF - 6dc2782e806d4a7d5df7032550a44d15ad1a7560 authored over 9 years ago by Jim Barlow <[email protected]>
Wrap a proxy around pdfinfo block so it can be passed around processes

github.com/ocrmypdf/OCRmyPDF - 5df187c086ad306347b5e221dc6ee0bed1aea231 authored over 9 years ago by Jim Barlow <[email protected]>
Get rid of chdir, replace deprecated @split with @subdivide

github.com/ocrmypdf/OCRmyPDF - 7fd172e41e5f277f8dffdced3dcb1bc991f42b42 authored over 9 years ago by Jim Barlow <[email protected]>
Try a method for passing along the pdfinfo struct

github.com/ocrmypdf/OCRmyPDF - 619528a1b57ee6e765d0d94c7f5b4f7e7d4ba88d authored over 9 years ago by Jim Barlow <[email protected]>
Reinstate WrapperLogger with more multiprocessing fixes

github.com/ocrmypdf/OCRmyPDF - 596d468c1457eb45f532d0a36e96a06d333ca4cf authored over 9 years ago by Jim Barlow <[email protected]>
diff --git a/src/ocrmypdf.py b/src/ocrmypdf.py

index 68d1591..95afa8f 100755
--- a/src/ocrmypdf.py
+++ b/src/ocrmypdf.py
@@ -24,6 +24,7 @@ impor...

github.com/ocrmypdf/OCRmyPDF - eddbf1060a8dcbf6ec2e7c6dca8dc5d7d9471a1b authored over 9 years ago by Jim Barlow <[email protected]>
Move pageinfo code out of the pipeline

github.com/ocrmypdf/OCRmyPDF - 33731a686448768ef328e75a3ce8a73d0679ab42 authored over 9 years ago by Jim Barlow <[email protected]>
Fix errors related to use of working directory

Mainly work around the lack of @split(...output_dir) in ruffus

github.com/ocrmypdf/OCRmyPDF - 0c36cd2e24bb48b296cc6cfbcc29237f00a90a10 authored over 9 years ago by Jim Barlow <[email protected]>
New pipeline runs, splits pages

github.com/ocrmypdf/OCRmyPDF - 5cef1be26d4b9b047d711bb59d242f96d5908ea8 authored over 9 years ago by Jim Barlow <[email protected]>
Fixes from early testing of new pipeline

github.com/ocrmypdf/OCRmyPDF - e89f482c3dc09c654317ca42bff51dfb23f835c6 authored over 9 years ago by Jim Barlow <[email protected]>
Learn to split PDF into pages

github.com/ocrmypdf/OCRmyPDF - fe3e40305dcd3976d9da2f9e4938639d8ddcd1ef authored over 9 years ago by Jim Barlow <[email protected]>
Begin unifying main script and page script

github.com/ocrmypdf/OCRmyPDF - a92b5ceb6b944a285e2c1d16a68f874998c2c659 authored over 9 years ago by Jim Barlow <[email protected]>
Suppress the xref warning for now

github.com/ocrmypdf/OCRmyPDF - 0e7e7d843794dfd6266ee70332fa4f2fd60524e7 authored over 9 years ago by Jim Barlow <[email protected]>
Fixes to colorspace and other inquiries

github.com/ocrmypdf/OCRmyPDF - f47fa98f3343166feca0ad50ec363c28038475b8 authored over 9 years ago by Jim Barlow <[email protected]>
Replace pdfimages -list call to poppler with PyPDF test for image

The immediate reason for doing this is that (newer?) versions of parse()
seem to choke on the pa...

github.com/ocrmypdf/OCRmyPDF - d3d5879911fb3a8fc3c8e0232ddbccbf7d3fa77a authored over 9 years ago by Jim Barlow <[email protected]>
Require Py3 for tests

github.com/ocrmypdf/OCRmyPDF - b2168e11db59d5c73a1e1fb059ef9e5e589b1560 authored over 9 years ago by Jim Barlow <[email protected]>
New test: check skew

github.com/ocrmypdf/OCRmyPDF - 6d5d8be70897c31526512d2f73740340f6a97d62 authored over 9 years ago by Jim Barlow <[email protected]>
Add another test

github.com/ocrmypdf/OCRmyPDF - ce2dbdf372ec711f6c134914363e035943f5cbb8 authored over 9 years ago by Jim Barlow <[email protected]>
Basic test cases

github.com/ocrmypdf/OCRmyPDF - ec8a35a7a676dc32d8e2563c3d8adb6bc2639eee authored over 9 years ago by Jim Barlow <[email protected]>
Complete wrapping of logger/logger_mutex

github.com/ocrmypdf/OCRmyPDF - f6577c22c3c2de8cf6ba818e369699cb47dbdb58 authored over 9 years ago by Jim Barlow <[email protected]>
Implement oversampling in ocrpage.py

github.com/ocrmypdf/OCRmyPDF - 43d6c030930ca14eeb09798b7a33b3dec6a32e53 authored almost 10 years ago by Jim Barlow <[email protected]>
More consistent spacing

github.com/ocrmypdf/OCRmyPDF - 1870f116bbb0fea87c7e5d64046dd53dad561860 authored almost 10 years ago by Jim Barlow <[email protected]>
Don't presume two jobs

github.com/ocrmypdf/OCRmyPDF - 8b87def013489c7f1817ba0f81762ff5db3cc413 authored almost 10 years ago by Jim Barlow <[email protected]>
Tidy up readme

github.com/ocrmypdf/OCRmyPDF - de599d97b5eb97854e39fe3660314af847f6824b authored almost 10 years ago by Jim Barlow <[email protected]>
Cleanup logger

github.com/ocrmypdf/OCRmyPDF - 5d7e6b45c4b0634ba56d344213b6341f1dca8575 authored almost 10 years ago by Jim Barlow <[email protected]>
Change python2 -> python3 for readlink()

github.com/ocrmypdf/OCRmyPDF - c6091bcfe18bd3e32309118ea8716514af3b0336 authored almost 10 years ago by Jim Barlow <[email protected]>
It's now py3 that uses lxml, reportlab

github.com/ocrmypdf/OCRmyPDF - 466a8a13186bf7f49c25bb420cca429da8a6ea54 authored almost 10 years ago by Jim Barlow <[email protected]>
Add rudimentary support for combining OCR layer with existing content

It appears to be very fragile due to weaknesses in PyPDF. Better
option is probably to use pdftk...

github.com/ocrmypdf/OCRmyPDF - a99ba3b6966cd73f718af35ae54d7eb0b9c9a0db authored almost 10 years ago by Jim Barlow <[email protected]>
Add option to render text as invisible OCR text

Prior to this change, hocrtransform would render printable text (black
on white) and then a full...

github.com/ocrmypdf/OCRmyPDF - 9229f7c6cc2306fe16d628bd984f2e9afb2843c0 authored almost 10 years ago by Jim Barlow <[email protected]>
Clean up pixel transform logic with namedtuple

github.com/ocrmypdf/OCRmyPDF - bf114bb1883c2cf0d3ca6312f62a003fee685e3c authored almost 10 years ago by Jim Barlow <[email protected]>
More PEP8/lint

github.com/ocrmypdf/OCRmyPDF - b8eed2f8612c9fbbab15e3fe5e2b0543043bef7f authored almost 10 years ago by Jim Barlow <[email protected]>
Call HocrTransform directly instead of through a subprocess

github.com/ocrmypdf/OCRmyPDF - ccb1e347be58f03eb73cb02a5140a52945893967 authored almost 10 years ago by Jim Barlow <[email protected]>
Rename hocrTransform -> hocrtransform

github.com/ocrmypdf/OCRmyPDF - 8698974f11830b73fafc7264a0b2f0d36388761b authored almost 10 years ago by Jim Barlow <[email protected]>
Convert hocrtransform to py3

github.com/ocrmypdf/OCRmyPDF - f2c79c4341f9e09cf6b7c887a0746aac61bb91c4 authored almost 10 years ago by Jim Barlow <[email protected]>
Module marker for src folder

github.com/ocrmypdf/OCRmyPDF - 4966d1346b0f9e633dfed26ae49466a4861dc98f authored almost 10 years ago by Jim Barlow <[email protected]>