Tesseract 4 has a new OCR engine based on long short-term memory (LSTM)
neural networks, which improves accuracy considerably and helps our VM
tests.
I ran the new version across a number of different screenshots and
compared the results to the 3.x branch. It makes a big difference,
especially with various font rendering settings.
The only downside is that version 4 hasn't been released yet and is
currently in alpha state. It will eventually get there, though, and the
only solutions I could think of for sticking with version 3 were really
sub-par:
* Use several passes with different color negation on the screenshots.
* Train Tesseract 3 specifically for screenshots. This is sub-par
because we'd need to do it for Tesseract 4 from scratch again.
* Change the test systems so that they use *only* an OCR font when
displaying text. I've actually tried this, but it isn't accurate
enough with our default font rendering setup either.
* Turn off special font rendering settings for our tests. In
conjunction with changing to an OCR font this might work but it won't
catch all the cases, because applications might use their own font
rendering.
Given that version 4 is also faster[1] at OCR, and given the points just
mentioned, I think even using the alpha version just for tests isn't
going to hurt anybody.
[1]: https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance
Signed-off-by: aszlig <aszlig@redmoonstudios.org>
This is from the commit message I've written for the upstream pull
request (jflesch/pyocr#62):
This is a bit more involved, because Tesseract 3.05.00 comes not
only with improvements but also with a few quirks we need to deal
with.
The first quirk is that the order of the arguments of the
`tesseract' command now matters and the list of configurations has
to be at the end of the command line. So we add a new attribute
tesseract_flags to the BaseBuilder class that contains a list of all
the flags to pass to `tesseract'. The tesseract_configs attribute
remains pretty much the same, but now only contains a list of
configs instead of being mixed with flag arguments.
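The ordering rule can be sketched as follows. This is only an
illustration and not the actual pyocr code: build_tesseract_command is a
hypothetical helper, and the file names, the `-l eng' flag and the
`hocr' config are example values.

```python
def build_tesseract_command(input_file, output_base, flags, configs):
    """Return an argv list with all option flags before the config names."""
    # Hypothetical helper: option flags such as ["-l", "eng"] come right
    # after the input/output names, config names such as ["hocr"] go last.
    return ["tesseract", input_file, output_base] + list(flags) + list(configs)


cmd = build_tesseract_command("input.bmp", "output", ["-l", "eng"], ["hocr"])
print(" ".join(cmd))
```

Keeping flags and configs in two separate lists means the builder never
has to guess which entries must end up at the end of the command line.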
Another quirk has to do with Leptonica >= 1.74, which Tesseract
3.05.00 now requires. Leptonica has special handling of files that
reside in /tmp and assumes that they are internal temporary files of
Leptonica. In order to deal with it, we now run Tesseract in a
temporary directory which contains the input/output files, and use
the relative names of these files, because Leptonica only checks for
path names beginning with /tmp.
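A minimal sketch of that workaround, assuming nothing about the real
pyocr implementation: run_in_workdir and the file names are made up for
illustration, and a small Python one-liner stands in for the `tesseract'
binary so the example runs without Tesseract installed.

```python
import os
import subprocess
import sys
import tempfile


def run_in_workdir(command_prefix, image_bytes):
    """Run a command inside a fresh temporary directory using relative names."""
    with tempfile.TemporaryDirectory() as workdir:
        # The input file lives inside the working directory ...
        with open(os.path.join(workdir, "input.bmp"), "wb") as handle:
            handle.write(image_bytes)
        # ... but the command only ever sees the relative file names.
        argv = list(command_prefix) + ["input.bmp", "output"]
        result = subprocess.run(argv, cwd=workdir)
        return argv, result.returncode


# Stand-in for the OCR binary: exits 0 if the relative input path exists.
CHECKER = [
    sys.executable,
    "-c",
    "import os, sys; sys.exit(0 if os.path.exists(sys.argv[1]) else 1)",
]
argv, returncode = run_in_workdir(CHECKER, b"not a real image")
```

Even if the temporary directory itself is created below /tmp, the path
names handed to the child process do not begin with /tmp.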
Fortunately the last item we need to address is not really a quirk,
but an API change. In Tesseract 3.05.00 there is now a new function
called TessBaseAPIDetectOrientationScript(), which doesn't fill the
OSResults object anymore but instead lets us pass the values we're
interested in directly by reference. We need to use this new
function because the old function TessBaseAPIDetectOS() now *always*
returns false.
I've tested this specifically on NixOS and in conjunction with Paperwork
(the only package that's using pyocr so far) and all the tests of the
dependency chain are now succeeding. However, I didn't do any manual
tests of Paperwork.
Signed-off-by: aszlig <aszlig@redmoonstudios.org>
Upstream changes for version 0.4.5:
* Clean up exceptions raised when OCR fails:
* Now, all tools raise only exceptions inheriting from
pyocr.PyocrException
* There is now one and only one TesseractError (shared between
pyocr.libtesseract and pyocr.tesseract)
Upstream changes for version 0.4.6:
* hOCR outputs: Generate valid XHTML files
The full upstream changelog can be found at:
https://github.com/jflesch/pyocr/blob/master/ChangeLog
Note that because of the version bump of Tesseract neither version 0.4.4
nor version 0.4.6 builds successfully, so we need to fix this up soon.
Signed-off-by: aszlig <aszlig@redmoonstudios.org>
The changes are a bit too big to include them here in the commit message,
so if you want the details of what changed, please visit this URL:
http://leptonica.org/source/version-notes.html
I have also provided openjpeg, giflib and libwebp as dependencies so
that Leptonica is able to read/write those file formats.
Additionally I've added a patch that uses pkgconfig to resolve all
dependencies (except giflib), because unlike AC_CHECK_LIB() the
PKG_CHECK_MODULES() macro defines *_LIBS variables to include the linker
search path.
Unfortunately that patch alone is not enough, because the *_LIBS
variables are substituted by the upstream configure.ac to *not* include
the linker search paths, so we need to remove the AC_SUBST() calls
within PKG_CHECK_MODULES().
The only dependency that's not yet using PKG_CHECK_MODULES() is giflib,
because giflib doesn't have a pkg-config description file, therefore
we're using substituteInPlace to insert the linker search path after the
lept.pc file was generated by configure.
Another thing that we no longer need is the dependency on libpng version
1.2, because Leptonica now also works with more recent libpng versions.
Tested by building the package itself and also the following packages
that immediately depend on leptonica:
* k2pdfopt
* tesseract
* jbig2enc
All of these packages built successfully on x86_64-linux.
The main reason why I'm bumping Leptonica to version 1.74.1 is that we
need at least version 1.74 to bump Tesseract to the latest upstream
version.
Signed-off-by: aszlig <aszlig@redmoonstudios.org>
There are a few dozen new failures on Darwin, probably related to
updates of stdenv's llvm and/or pkgconfig.
Still, the total number of successes increases.
Including apple_sdk.sdk is generally a recipe for a bad time on LLVM 3.8
and above, since you end up with bad headers in the wrong place that hurt
the new libc++ in 3.8 and above. In this case, qt only wanted the super-
generic SDK for CUPS headers, which we can just depend on directly now.
According to the ABI report at
https://abi-laboratory.pro/tracker/timeline/ffmpeg/
I see only one removed function and one removed field; both should be
detected at compile time. The rest are changes that don't matter
when everything rebuilds.
The idea is to have an almost-automatic conversion from QuickLisp, the
definitive Common Lisp package repository, to Nix. The benefit over just
using lispPackages.quicklisp is automatic installation of non-Lisp
dependencies from NixPkgs (and integration with Nix package management).
The benefit over lispPackages for normal Lisp packages is packaging just
a snapshot of QuickLisp which is known to be tested for version
compatibility between libraries.
There are some packages in lispPackages that are not from QuickLisp (for
example, the installable wrapper of QuickLisp itself). My hope is to
replace the rest with the expressions converted from QuickLisp.
Note that the current commit is a mere addition.
Now works with a newer version of the vim youcompleteme plugin.
Details:
- The OS X patch is no longer necessary as that code was removed upstream.
- It seems to want LLVM version 4 now.
- It annoyingly wants to symlink libclang.4 to libclang.4.0; nix already
did this.
All 20 tests failed because no gpg binary was found. With gnupg1 as a
build input they never finish. Deactivating them might be the best
option for now (it also improves the current situation, since they never
actually succeeded anyway and the build was failing; I noticed this
while running nox-review for #24390).
Additional tools:
- gpg-key2latex
- gpgdir
- gpgwrap
This module is really hacky and the dependencies are very messy... :o
However I tried my best at testing all 19 individual tools and they
should (hopefully) all work now (apart from sendmail which can be
provided by multiple packages) :)
The code is very redundant (sorry) but imho it's easier to read and
maintain it that way.
TODO: There are some additional manual pages that could be included (I'm
too exhausted for that atm...). And there might be a lot of stuff that
could be improved in the future.
This patch restructures the expression and wrapper to minimize Nix store
references captured by the user's state directory.
The previous version would write lots of references to the Nix store into
the user's state directory, resulting in synchronization issues between
the Store and the local state directory. At best, this would cause TBB to
stop working when the version used to instantiate the local state was
garbage collected; at worst, a user would continue to use the old version
even after an upgrade.
To solve the issue, hard-code as much as possible at the Store side and
minimize the amount of stuff being copied into the local state dir.
Currently, only a few files generated at firefox startup and fontconfig
cache files end up capturing store paths; these files are simply removed
upon every startup. Otherwise, no capture should occur and the user
should always be using the TBB associated with the tor-browser wrapper
script.
To check for stale Store paths, do
`grep -Ero '/nix/store/[^/]+' ~/.local/share/tor-browser`
This command should *never* return any other store path than the one
associated with the current tor-browser wrapper script, even after an
update (assuming you've run tor-browser at least once after updating).
Deviations from this general rule are considered bugs from now on.
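For scripted checks, a rough Python equivalent of the grep command above
could look like this. It is only a sketch: find_store_paths is a
hypothetical helper, not part of the package, and it reads every file
under the given directory completely into memory.

```python
import os
import re

# Same pattern as the grep call: a store path up to (excluding) the next slash.
STORE_PATH = re.compile(rb"/nix/store/[^/\n]+")


def find_store_paths(state_dir):
    """Return the set of Nix store paths mentioned in files under state_dir."""
    found = set()
    for root, _dirs, files in os.walk(state_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, "rb") as handle:
                    data = handle.read()
            except OSError:
                continue  # skip files that cannot be opened or read
            for match in STORE_PATH.findall(data):
                found.add(match.decode("utf-8", "replace"))
    return found
```

Calling find_store_paths(os.path.expanduser("~/.local/share/tor-browser"))
should then return at most the one store path that belongs to the current
tor-browser wrapper script.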
Note that no attempt has been made to support pluggable transports; they
are still broken with this patch (to be fixed in a follow-up patch).
User visible changes:
- Wrapper retains only environment variables required for TBB to work
- pulseaudioSupport can be toggled independently of mediaSupport (the
latter weakly implies the former).
- Store local state under $TBB_HOME. Defaults to $XDG_DATA_HOME/tor-browser
- Stop obnoxious first-run stuff (NoScript redirect, in particular)
- Set desktop item GenericName to Web Browser
Some minor enhancements:
- Disable Hydra builds
- Specify system -> source mapping to make it easier to
extend supported platforms.
The community support window for Qt 5.5 has ended. All packages should
- update to Qt 5.8, or
- pin to Qt 5.6 (the 3-year long-term support release), or
- for proprietary software, use the vendored libraries.
The community support window for Qt 5.7 has ended. All packages should
- update to Qt 5.8, or
- pin to Qt 5.6 (the 3-year long-term support release), or
- for proprietary software, use the vendored libraries.
Ppx_ast selects a specific version of the OCaml Abstract Syntax
Tree from the migrate-parsetree project that is not necessarily
the same one as the one being used by the compiler.
Homepage: https://github.com/janestreet/ppx_ast
It's effectively required for GTK3 applications because various parts of the library use GIO to store settings.
Also propagate GTK for clarity (it should be there anyway).
* renderdoc: init at version 0.34pre
Initialising a few commits after the latest release due to some upstream
improvements to the build system.
* fix maintainer
This should eliminate the branched logic for gfortran on Darwin, as well
as preventing accidental inclusion of impure paths in gcc and gfortran
builds.