nixpkgs/pkgs/applications/graphics/tesseract/tesseract4.nix

{ lib, stdenv, fetchFromGitHub, autoreconfHook, autoconf-archive, pkg-config
, leptonica, libpng, libtiff, icu, pango, opencl-headers, fetchpatch }:

stdenv.mkDerivation rec {
  pname = "tesseract";
  version = "4.1.1";

  src = fetchFromGitHub {
    owner = "tesseract-ocr";
    repo = "tesseract";
    rev = version;
    hash = "sha256-lu/Y5mlCI8AajhiWaID0fGo5PghEQZdgt2X0K9c/QrE=";
  };

  patches = [
    # Fix the build on aarch64-darwin, see
    # https://github.com/tesseract-ocr/tesseract/issues/3447
    (fetchpatch {
      url = "https://github.com/tesseract-ocr/tesseract/commit/dbc79b09d195490dfa3f7d338eadac07ad6683f7.patch";
      hash = "sha256-lGlg0etuU4RXfdq1QH2bYObdeGrFHKf9O8zMUAbfNIQ=";
    })
    (fetchpatch {
      url = "https://github.com/tesseract-ocr/tesseract/commit/6dc4b184b1ebf2e68461f6b63f63a033bc7245f7.patch";
      hash = "sha256-DwIX3r5NmeajI6WgIVHDbkhLH/ygJIjPO5XrbzWQhSw=";
    })
  ];

  enableParallelBuilding = true;

  nativeBuildInputs = [
    pkg-config
    autoreconfHook
    autoconf-archive
  ];

  buildInputs = [
    leptonica
    libpng
    libtiff
    icu
    pango
    opencl-headers
  ];

  meta = {
    description = "OCR engine";
    homepage = "https://github.com/tesseract-ocr/tesseract";
    license = lib.licenses.asl20;
    maintainers = with lib.maintainers; [ viric earvstedt ];
    platforms = with lib.platforms; linux ++ darwin;
  };
}