Ecosyste.ms: OpenCollective

An open API service for software projects hosted on Open Collective.

github.com/ocrmypdf/OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
https://github.com/ocrmypdf/OCRmyPDF

Actually link the release notes

6e6f918630bba7077ba9a50d75a138767422bce7 authored over 9 years ago
Fix git clone command with one I tested ;)

46338122461b510add63fd83c39693ff96a52020 authored over 9 years ago
Update README with more detailed instructions

14bd1555aa2f5b214e662b036fe8774236aa6d3c authored over 9 years ago
Fixes: clarify install instructions and reactivate external program checks

b9d7687fa096cca819adcc1800cc652238857af2 authored over 9 years ago
Merge branch 'develop'

# Conflicts:
# RELEASE_NOTES.md
# src/config.sh
# src/hocrTransform.py
# src/ocrPage.sh

93b36965e2b33bed12603b2c4bdad749fb74a2c1 authored over 9 years ago
-rc2: because pypi won't accept -rc1

9e0c443c2f2d9923feeb05c3e5c0778bf1c209d6 authored over 9 years ago
Don't mess with options

60832152b1d25698ef8d9b7de7dc96a8256a77de authored over 9 years ago
Update release notes, add copyrights

6a160d22fe4a5712156013fd6aeace9fc5ed3fe7 authored over 9 years ago
More test cases

e35526192ce0de18713a75b6bc20b1cf01ee6ad5 authored over 9 years ago
More test cases for other parameters

bea57bdded53ddd117740f289861e45b37dcbcd4 authored over 9 years ago
Minor tweaks to uncommon arguments

2a9da225e4636c0b5b81776f4dfb5a8f99380883 authored over 9 years ago
Test cases for --tesseract-timeout

a3f37de9b5f24323dfc34e17d60a302b050b5ed9 authored over 9 years ago
Get rid of subprocess call on import of tesseract, unpaper -- bit nasty

606416095389c170d43bdb1962f9bc1423c7c6b9 authored over 9 years ago
Drop nose, all tests working reasonably again

Although the real issue was that the ruffus pipeline cannot be executed
twice in the same proces...

850814131426c0e009472665c8253b625bb134f7 authored over 9 years ago
nose can't really handle external tests so looking into py.test instead

Specifically it trips over the need to reimport ocrmypdf.main. That in
turn raises questions ab...

1c9559788297984628514a346e22edf987da7df2 authored over 9 years ago
--oversample: Default to 0

587fa63c8e8e0364c33e4384a1fad8794bb2f2f3 authored over 9 years ago
Add --oversample test for hocr rendering

b40eec4cb0b07d30b8dec32b6ebfbe1efc6c9559 authored over 9 years ago
Add test to confirm that metadata is transferred to final PDF/A

7bcd48c26924b79d988248db4ac1890c209f1597 authored over 9 years ago
Improve argument handling, test cases

2e7cd52c0f7002f9fa17003b97e7465c5c02dd3e authored over 9 years ago
Put ghostscript in a module

77d4cb367e14b7f86371e28e8795a28989084448 authored over 9 years ago
Implement tesseract timeout

2c45c5abc63d29cc91b8caa956163a6ee26b6ec0 authored over 9 years ago
Implement tesseract PDF rendering as an alternative

It's much better at rendering text baselines than hocr and seems to
produce small file sizes, so ...

a89afabd798329c451f0099fa1a9756392ddffd9 authored over 9 years ago
setup.py: Only do program checks when installing

03f7c9bf07aaa381cc752e4ddd34cad2b6322b59 authored over 9 years ago
setup.py: check for third party program requirements

d5f4862749eba85a1d309a676c5551f269f7d469 authored over 9 years ago
More testing: JPEG

8aced0b6d3c5d003d11a309651ed24ac1b61e2e0 authored over 9 years ago
Don't create inline images in output PDFs

...except that Ghostscript will sometimes turn out of line images into
inline images on its own,...

6b9adef6849f4013b0caee82787d947bd924dcee authored over 9 years ago
Make this PDF a whole image page

Originally it had a smaller image centred in a page, which is not quite
supported.

5440d988fc57cf3457a9103c981b551d6fe5185d authored over 9 years ago
pageinfo: drop pdftotext and use PyPDF instead

30da4fc569ac3f8ca68cc32510b1b5cc10384bd1 authored over 9 years ago
Test cases for pageinfo; complain about inline images

2c1b5e100b79d7c620a21cdef75e9741e6b4159e authored over 9 years ago
Add some pageinfo test cases; found problem with inline images

3684f278ed6b00d51bf2177eb3369387a07bcf98 authored over 9 years ago
Remove redundant *res_render

6c3cb6acba9856bd5af312537cdb6875c06d5b77 authored over 9 years ago
Replace .md with .rst

GitHub supports both, and PyPI expects .rst files, so use .rst and make
everyone happy.

Auto-co...

b98ba8d17451e1a89a4fa710abcc1fd53d63077e authored over 9 years ago
More packaging changes: move jhove, fix console script

d3088829af87192719077f93abc80c98374bab17 authored over 9 years ago
Packaging stuff

9aaaba17149600c787217e444d4428b8b075ddfd authored over 9 years ago
Prepare for Python packaging - move to ocrmypdf folder

9adb0d696f6d0112bca8fe7ae6376ed1048bcd12 authored over 9 years ago
Update release notes so far

c270f1ba5fd14d41f62524fcbb4abbf1c0972d50 authored over 9 years ago
Metadata override from command line

7b255b575abb0fbd6f47c766cb0892ee9d2ab532 authored over 9 years ago
Transfer Unicode document information from input PDF to output PDF

What a pain getting Unicode right, but there it is.

I cannot find anything to confirm that it i...

d7a9f3a2ab4622410b50afb9b948a533e5f9d69e authored over 9 years ago
Copy document metadata from source document into output (untested)

This works for ASCII only; will do Unicode version.

abf2e7e9bb2a62052ae6531100169f3df91f84ef authored over 9 years ago
Reimplement debug pages

72e5fa9ba04900214545bc21602fc09625214e03 authored over 9 years ago
Reimplement skip text pages

32c1078d2cd721c4ea441cbd6978c997ce5551d0 authored over 9 years ago
Change @subdivide to @split

@split is for "1 to many" operations, so it's the right tool for this
case.

133f901a6912860afca8300c837774745afe5981 authored over 9 years ago
Try to make pdfinfo less obnoxious by not printing too many decimals

42cd683ec072275a36b125191c18a362a384c26a authored over 9 years ago
For now, unpaper is the only deskew provider

151eb0537751c32ce5d5144e38607f14f30fb40c authored over 9 years ago
Remove ability to override temporary (working) folder

Little point to this feature - on most platforms the environment
variable can be overridden if d...

16177d0a5208dde52a2f3abe194aebc80aeecfae authored over 9 years ago
Automatically try to use all available CPUs

5ce544289fd206485d8f8f1b0ad75e0cf2e11f46 authored over 9 years ago
Remove duplicate test folder

77bd35c3c7d91456c612db68d760b9e16b1e90fe authored over 9 years ago
Goodbye, so long, farewell, shell...

0c5c208db0ac18fbf309bb6a3e065adcfd4a30f4 authored over 9 years ago
Split selecting final image and render PDF result into separate tasks

Simplifies the logic - one deals with all images, the other deals
with an image and .hocr. Als...

60eb745331cda44b2c4db3a65af7cb905442740a authored over 9 years ago
Modularize unpaper; get -d and -c working again

9f90b5cb0a31927617e23728ece2e162cadf09df authored over 9 years ago
Remove more dead/old code

5adff94545ea2df71da5932260fc7b22456f2547 authored over 9 years ago
Implement deskew and clean using unpaper

aa2baabfa9914fc06276f8a5c6dac1fa8404c1c8 authored over 9 years ago
Cleanup externals

75c2b23efcf32a0e291d08bd405ae101e7ea744f authored over 9 years ago
Implement oversample

6451017962a2b1c83d7906349193f3f9333ce9c8 authored over 9 years ago
Put .rendered.pdf files into temp folder

0f857a6a3459973bc8f0f688ef947523f81b69e5 authored over 9 years ago
Change 'clean' to 'repair' for clarity since 'clean' is what unpaper does

7638a88a6a32e59f48bfd8ff474ba0cfbf35470c authored over 9 years ago
Remove 'pdftoppm' renderer

Ghostscript is more reliable than Poppler's pdftoppm renderer. gs is
also a hard dependency, as ...

bed12d20218e640b712a75ebff91fc675e6f7a4c authored over 9 years ago
Tidy up

587569fcb6a302af61e8cb3160c1719fb5f0a5b1 authored over 9 years ago
Platform independent search for iccprofiles for PDF/A

8c0dc9a06da9a9fac10a220570fcce22dacd3136 authored over 9 years ago
First successful PDF/A produced by new pipeline

289e4025ad61df4fbdec2ff2d340b39b4b38cb17 authored over 9 years ago
Rasterize PDF pages and generate .hocr files

5476eafe4ca4046fc38a02894bb151b227f3bb99 authored over 9 years ago
Language checking

df32f283cd01f713d9b8c59a440aa26371d6b2fc authored over 9 years ago
Add tesseract version check

68ecaac9cca59765b6b6e6c31429d4850c500463 authored over 9 years ago
Add PDF/A validation

cffd4623ca77d49c54e9f0c827bb9b69c9ac7bd9 authored over 9 years ago
Can now generate PDF/A files, multipage and single page

6dc2782e806d4a7d5df7032550a44d15ad1a7560 authored over 9 years ago
Wrap a proxy around pdfinfo block so it can be passed around processes

5df187c086ad306347b5e221dc6ee0bed1aea231 authored over 9 years ago
Get rid of chdir, replace deprecated @split with @subdivide

7fd172e41e5f277f8dffdced3dcb1bc991f42b42 authored over 9 years ago
Try a method for passing along the pdfinfo struct

619528a1b57ee6e765d0d94c7f5b4f7e7d4ba88d authored over 9 years ago
Reinstate WrapperLogger with more multiprocessing fixes

596d468c1457eb45f532d0a36e96a06d333ca4cf authored over 9 years ago
diff --git a/src/ocrmypdf.py b/src/ocrmypdf.py

index 68d1591..95afa8f 100755
--- a/src/ocrmypdf.py
+++ b/src/ocrmypdf.py
@@ -24,6 +24,7 @@ impor...

eddbf1060a8dcbf6ec2e7c6dca8dc5d7d9471a1b authored over 9 years ago
Move pageinfo code out of the pipeline

33731a686448768ef328e75a3ce8a73d0679ab42 authored over 9 years ago
Fix errors related to use of working directory

Mainly workaround lack of @split(...output_dir) in ruffus

0c36cd2e24bb48b296cc6cfbcc29237f00a90a10 authored over 9 years ago
New pipeline runs, splits pages

5cef1be26d4b9b047d711bb59d242f96d5908ea8 authored over 9 years ago
Fixes from early testing of new pipeline

e89f482c3dc09c654317ca42bff51dfb23f835c6 authored over 9 years ago
Learn to split PDF into pages

fe3e40305dcd3976d9da2f9e4938639d8ddcd1ef authored over 9 years ago
Begin unifying main script and page script

a92b5ceb6b944a285e2c1d16a68f874998c2c659 authored over 9 years ago
Suppress the xref warning for now

0e7e7d843794dfd6266ee70332fa4f2fd60524e7 authored over 9 years ago
Fixes to colorspace and other inquiries

f47fa98f3343166feca0ad50ec363c28038475b8 authored over 9 years ago
Replace pdfimages -list call to poppler with PyPDF test for image

The immediate reason for doing this is that (newer?) versions of parse()
seem to choke on the pa...

d3d5879911fb3a8fc3c8e0232ddbccbf7d3fa77a authored over 9 years ago
Require Py3 for tests

b2168e11db59d5c73a1e1fb059ef9e5e589b1560 authored over 9 years ago
New test: check skew

6d5d8be70897c31526512d2f73740340f6a97d62 authored over 9 years ago
Add another test

ce2dbdf372ec711f6c134914363e035943f5cbb8 authored over 9 years ago
Basic test cases

ec8a35a7a676dc32d8e2563c3d8adb6bc2639eee authored over 9 years ago
Complete wrapping of logger/logger_mutex

f6577c22c3c2de8cf6ba818e369699cb47dbdb58 authored over 9 years ago
Implement oversampling in ocrpage.py

43d6c030930ca14eeb09798b7a33b3dec6a32e53 authored almost 10 years ago
More consistent spacing

1870f116bbb0fea87c7e5d64046dd53dad561860 authored almost 10 years ago
Don't presume two jobs

8b87def013489c7f1817ba0f81762ff5db3cc413 authored almost 10 years ago
Tidy up readme

de599d97b5eb97854e39fe3660314af847f6824b authored almost 10 years ago
Cleanup logger

5d7e6b45c4b0634ba56d344213b6341f1dca8575 authored almost 10 years ago
Change python2 -> python3 for readlink()

c6091bcfe18bd3e32309118ea8716514af3b0336 authored almost 10 years ago
It's now py3 that uses lxml, reportlab

466a8a13186bf7f49c25bb420cca429da8a6ea54 authored almost 10 years ago
Add rudimentary support for combining OCR layer with existing content

It appears to be very fragile due to weaknesses in PyPDF. Better
option is probably to use pdftk...

a99ba3b6966cd73f718af35ae54d7eb0b9c9a0db authored almost 10 years ago
Add option to render text as invisible OCR text

Prior to this change, hocrtransform would render printable text (black
on white) and then a full...

9229f7c6cc2306fe16d628bd984f2e9afb2843c0 authored almost 10 years ago
Clean up pixel transform logic with namedtuple

bf114bb1883c2cf0d3ca6312f62a003fee685e3c authored almost 10 years ago
More PEP8/lint

b8eed2f8612c9fbbab15e3fe5e2b0543043bef7f authored almost 10 years ago
Call HocrTransform directly instead of through a subprocess

ccb1e347be58f03eb73cb02a5140a52945893967 authored almost 10 years ago
Rename hocrTransform -> hocrtransform

8698974f11830b73fafc7264a0b2f0d36388761b authored almost 10 years ago
Convert hocrtransform to py3

f2c79c4341f9e09cf6b7c887a0746aac61bb91c4 authored almost 10 years ago
Module marker for src folder

4966d1346b0f9e633dfed26ae49466a4861dc98f authored almost 10 years ago
PEP8

4a9337f757b25dd767217cb26b1e50d1d70a879f authored almost 10 years ago