github.com/ocrmypdf/OCRmyPDF commits | Ecosyste.ms: OpenCollective

Actually link the release notes

6e6f918630bba7077ba9a50d75a138767422bce7 authored over 9 years ago

Fix git clone command with one I tested ;)

46338122461b510add63fd83c39693ff96a52020 authored over 9 years ago

Update README with more detailed instructions

14bd1555aa2f5b214e662b036fe8774236aa6d3c authored over 9 years ago

Fixes: clarify install instructions and reactivate external program checks

b9d7687fa096cca819adcc1800cc652238857af2 authored over 9 years ago

Merge branch 'develop'

# Conflicts:
# RELEASE_NOTES.md
# src/config.sh
# src/hocrTransform.py
# src/ocrPage.sh

93b36965e2b33bed12603b2c4bdad749fb74a2c1 authored over 9 years ago

-rc2: because pypi won't accept -rc1

9e0c443c2f2d9923feeb05c3e5c0778bf1c209d6 authored over 9 years ago

Don't mess with options

60832152b1d25698ef8d9b7de7dc96a8256a77de authored over 9 years ago

Update release notes, add copyrights

6a160d22fe4a5712156013fd6aeace9fc5ed3fe7 authored over 9 years ago

More test cases

e35526192ce0de18713a75b6bc20b1cf01ee6ad5 authored over 9 years ago

More test cases for other parameters

bea57bdded53ddd117740f289861e45b37dcbcd4 authored over 9 years ago

Minor tweaks to uncommon arguments

2a9da225e4636c0b5b81776f4dfb5a8f99380883 authored over 9 years ago

Test cases for --tesseract-timeout

a3f37de9b5f24323dfc34e17d60a302b050b5ed9 authored over 9 years ago

Get rid of subprocess call on import of tesseract, unpaper -- bit nasty

606416095389c170d43bdb1962f9bc1423c7c6b9 authored over 9 years ago

Drop nose, all tests working reasonably again

Although the real issue was that the ruffus pipeline cannot be executed
twice in the same proces...

850814131426c0e009472665c8253b625bb134f7 authored over 9 years ago

nose can't really handle external tests so looking into py.test instead

Specifically it trips over the need to reimport ocrmypdf.main. That in
turn raises questions ab...

1c9559788297984628514a346e22edf987da7df2 authored over 9 years ago

--oversample: Default to 0

587fa63c8e8e0364c33e4384a1fad8794bb2f2f3 authored over 9 years ago

Add --oversample test for hocr rendering

b40eec4cb0b07d30b8dec32b6ebfbe1efc6c9559 authored over 9 years ago

Add test to confirm that metadata is transferred to final PDF/A

7bcd48c26924b79d988248db4ac1890c209f1597 authored over 9 years ago

Improve argument handling, test cases

2e7cd52c0f7002f9fa17003b97e7465c5c02dd3e authored over 9 years ago

Put ghostscript in a module

77d4cb367e14b7f86371e28e8795a28989084448 authored over 9 years ago

Implement tesseract timeout

2c45c5abc63d29cc91b8caa956163a6ee26b6ec0 authored over 9 years ago

Implement tesseract PDF rendering as an alternative

It's much better at rendering text baselines than hocr and seems to
produce small file sizes, so ...

a89afabd798329c451f0099fa1a9756392ddffd9 authored over 9 years ago

setup.py: Only do program checks when installing

03f7c9bf07aaa381cc752e4ddd34cad2b6322b59 authored over 9 years ago

setup.py: check for third party program requirements

d5f4862749eba85a1d309a676c5551f269f7d469 authored over 9 years ago

More testing: JPEG

8aced0b6d3c5d003d11a309651ed24ac1b61e2e0 authored over 9 years ago

Don't create inline images in output PDFs

...except that Ghostscript will sometimes turn out of line images into
inline images on its own,...

6b9adef6849f4013b0caee82787d947bd924dcee authored over 9 years ago

Make this PDF a whole image page

Originally it had a smaller image centred in a page, which is not quite
supported.

5440d988fc57cf3457a9103c981b551d6fe5185d authored over 9 years ago

pageinfo: drop pdftotext and use PyPDF instead

30da4fc569ac3f8ca68cc32510b1b5cc10384bd1 authored over 9 years ago

Test cases for pageinfo; complain about inline images

2c1b5e100b79d7c620a21cdef75e9741e6b4159e authored over 9 years ago

Add some pageinfo test cases; found problem with inline images

3684f278ed6b00d51bf2177eb3369387a07bcf98 authored over 9 years ago

Remove redundant *res_render

6c3cb6acba9856bd5af312537cdb6875c06d5b77 authored over 9 years ago

Replace .md with .rst

Github supports both, and PyPI expects .rst files, so use .rst and make
everyone happy.

Auto-co...

b98ba8d17451e1a89a4fa710abcc1fd53d63077e authored over 9 years ago

More packaging changes: move jhove, fix console script

d3088829af87192719077f93abc80c98374bab17 authored over 9 years ago

Packaging stuff

9aaaba17149600c787217e444d4428b8b075ddfd authored over 9 years ago

Prepare for Python packaging - move to ocrmypdf folder

9adb0d696f6d0112bca8fe7ae6376ed1048bcd12 authored over 9 years ago

Update release notes so far

c270f1ba5fd14d41f62524fcbb4abbf1c0972d50 authored over 9 years ago

Metadata override from command line

7b255b575abb0fbd6f47c766cb0892ee9d2ab532 authored over 9 years ago

Transfer Unicode document information from input PDF to output PDF

What a pain getting Unicode right, but there it is.

I cannot find anything to confirm that it i...

d7a9f3a2ab4622410b50afb9b948a533e5f9d69e authored over 9 years ago

Copy document metadata from source document into output (untested)

This works for ASCII only; will do Unicode version.

abf2e7e9bb2a62052ae6531100169f3df91f84ef authored over 9 years ago

Reimplement debug pages

72e5fa9ba04900214545bc21602fc09625214e03 authored over 9 years ago

Reimplement skip text pages

32c1078d2cd721c4ea441cbd6978c997ce5551d0 authored over 9 years ago

Change @subdivide to @split

@split is for "1 to many" operations, so it's the right tool for this
case.

133f901a6912860afca8300c837774745afe5981 authored over 9 years ago

Try to make pdfinfo less obnoxious about printing too many decimals

42cd683ec072275a36b125191c18a362a384c26a authored over 9 years ago

For now, unpaper is the only deskew provider

151eb0537751c32ce5d5144e38607f14f30fb40c authored over 9 years ago

Remove ability to override temporary (working) folder

Little point to this feature - on most platforms the environment
variable can be overridden if d...

16177d0a5208dde52a2f3abe194aebc80aeecfae authored over 9 years ago

Automatically try to use all available CPUs

5ce544289fd206485d8f8f1b0ad75e0cf2e11f46 authored over 9 years ago

Remove duplicate test folder

77bd35c3c7d91456c612db68d760b9e16b1e90fe authored over 9 years ago

Goodbye, so long, farewell, shell...

0c5c208db0ac18fbf309bb6a3e065adcfd4a30f4 authored over 9 years ago

Split selecting final image and render PDF result into separate tasks

Simplifies the logic - one deals with all images, the other deals
with an image and .hocr. Als...

60eb745331cda44b2c4db3a65af7cb905442740a authored over 9 years ago

Modularize unpaper; get -d and -c working again

9f90b5cb0a31927617e23728ece2e162cadf09df authored over 9 years ago

Remove more dead/old code

5adff94545ea2df71da5932260fc7b22456f2547 authored over 9 years ago

Implement deskew and clean using unpaper

aa2baabfa9914fc06276f8a5c6dac1fa8404c1c8 authored over 9 years ago

Cleanup externals

75c2b23efcf32a0e291d08bd405ae101e7ea744f authored over 9 years ago

Implement oversample

6451017962a2b1c83d7906349193f3f9333ce9c8 authored over 9 years ago

Put .rendered.pdf files into temp folder

0f857a6a3459973bc8f0f688ef947523f81b69e5 authored over 9 years ago

Change 'clean' to 'repair' for clarity since 'clean' is what unpaper does

7638a88a6a32e59f48bfd8ff474ba0cfbf35470c authored over 9 years ago

Remove 'pdftoppm' renderer

Ghostscript is more reliable than Poppler's pdftoppm renderer. gs is
also a hard dependency, as ...

bed12d20218e640b712a75ebff91fc675e6f7a4c authored over 9 years ago

Tidy up

587569fcb6a302af61e8cb3160c1719fb5f0a5b1 authored over 9 years ago

Platform independent search for iccprofiles for PDF/A

8c0dc9a06da9a9fac10a220570fcce22dacd3136 authored over 9 years ago

First successful PDF/A produced by new pipeline

289e4025ad61df4fbdec2ff2d340b39b4b38cb17 authored over 9 years ago

Rasterize PDF pages and generate .hocr files

5476eafe4ca4046fc38a02894bb151b227f3bb99 authored over 9 years ago

Language checking

df32f283cd01f713d9b8c59a440aa26371d6b2fc authored over 9 years ago

Add tesseract version check

68ecaac9cca59765b6b6e6c31429d4850c500463 authored over 9 years ago

Add PDF/A validation

cffd4623ca77d49c54e9f0c827bb9b69c9ac7bd9 authored over 9 years ago

Can now generate PDF/A files, multipage and single page

6dc2782e806d4a7d5df7032550a44d15ad1a7560 authored over 9 years ago

Wrap a proxy around pdfinfo block so it can be passed around processes

5df187c086ad306347b5e221dc6ee0bed1aea231 authored over 9 years ago

Get rid of chdir, replace deprecated @split with @subdivide

7fd172e41e5f277f8dffdced3dcb1bc991f42b42 authored over 9 years ago

Try a method for passing along the pdfinfo struct

619528a1b57ee6e765d0d94c7f5b4f7e7d4ba88d authored over 9 years ago

Reinstate WrapperLogger with more multiprocessing fixes

596d468c1457eb45f532d0a36e96a06d333ca4cf authored over 9 years ago

diff --git a/src/ocrmypdf.py b/src/ocrmypdf.py

index 68d1591..95afa8f 100755
--- a/src/ocrmypdf.py
+++ b/src/ocrmypdf.py
@@ -24,6 +24,7 @@ impor...

eddbf1060a8dcbf6ec2e7c6dca8dc5d7d9471a1b authored over 9 years ago

Move pageinfo code out of the pipeline

33731a686448768ef328e75a3ce8a73d0679ab42 authored over 9 years ago

Fix errors related to use of working directory

Mainly workaround lack of @split(...output_dir) in ruffus

0c36cd2e24bb48b296cc6cfbcc29237f00a90a10 authored over 9 years ago

New pipeline runs, splits pages

5cef1be26d4b9b047d711bb59d242f96d5908ea8 authored over 9 years ago

Fixes from early testing of new pipeline

e89f482c3dc09c654317ca42bff51dfb23f835c6 authored over 9 years ago

Learn to split PDF into pages

fe3e40305dcd3976d9da2f9e4938639d8ddcd1ef authored over 9 years ago

Begin unifying main script and page script

a92b5ceb6b944a285e2c1d16a68f874998c2c659 authored over 9 years ago

Suppress the xref warning for now

0e7e7d843794dfd6266ee70332fa4f2fd60524e7 authored over 9 years ago

Fixes to colorspace and other inquiries

f47fa98f3343166feca0ad50ec363c28038475b8 authored over 9 years ago

Replace pdfimages -list call to poppler with PyPDF test for image

The immediate reason for doing this is that (newer?) versions of parse()
seem to choke on the pa...

d3d5879911fb3a8fc3c8e0232ddbccbf7d3fa77a authored over 9 years ago

Require Py3 for tests

b2168e11db59d5c73a1e1fb059ef9e5e589b1560 authored over 9 years ago

New test: check skew

6d5d8be70897c31526512d2f73740340f6a97d62 authored over 9 years ago

Add another test

ce2dbdf372ec711f6c134914363e035943f5cbb8 authored over 9 years ago

Basic test cases

ec8a35a7a676dc32d8e2563c3d8adb6bc2639eee authored over 9 years ago

Complete wrapping of logger/logger_mutex

f6577c22c3c2de8cf6ba818e369699cb47dbdb58 authored over 9 years ago

Implement oversampling in ocrpage.py

43d6c030930ca14eeb09798b7a33b3dec6a32e53 authored almost 10 years ago

More consistent spacing

1870f116bbb0fea87c7e5d64046dd53dad561860 authored almost 10 years ago

Don't presume two jobs

8b87def013489c7f1817ba0f81762ff5db3cc413 authored almost 10 years ago

Tidy up readme

de599d97b5eb97854e39fe3660314af847f6824b authored almost 10 years ago

Cleanup logger

5d7e6b45c4b0634ba56d344213b6341f1dca8575 authored almost 10 years ago

Change python2 -> python3 for readlink()

c6091bcfe18bd3e32309118ea8716514af3b0336 authored almost 10 years ago

It's now py3 that uses lxml, reportlab

466a8a13186bf7f49c25bb420cca429da8a6ea54 authored almost 10 years ago

Add rudimentary support for combining OCR layer with existing content

It appears to be very fragile due to weaknesses in PyPDF. Better
option is probably to use pdftk...

a99ba3b6966cd73f718af35ae54d7eb0b9c9a0db authored almost 10 years ago

Add option to render text as invisible OCR text

Prior to this change, hocrtransform would render printable text (black
on white) and then a full...

9229f7c6cc2306fe16d628bd984f2e9afb2843c0 authored almost 10 years ago

Clean up pixel transform logic with namedtuple

bf114bb1883c2cf0d3ca6312f62a003fee685e3c authored almost 10 years ago

More PEP8/lint

b8eed2f8612c9fbbab15e3fe5e2b0543043bef7f authored almost 10 years ago

Call HocrTransform directly instead of through a subprocess

ccb1e347be58f03eb73cb02a5140a52945893967 authored almost 10 years ago

Rename hocrTransform -> hocrtransform

8698974f11830b73fafc7264a0b2f0d36388761b authored almost 10 years ago

Convert hocrtransform to py3

f2c79c4341f9e09cf6b7c887a0746aac61bb91c4 authored almost 10 years ago

Module marker for src folder

4966d1346b0f9e633dfed26ae49466a4861dc98f authored almost 10 years ago

PEP8

4a9337f757b25dd767217cb26b1e50d1d70a879f authored almost 10 years ago