Some PDF documents I have to work with are not well made, which results in a very inconsistent quoting/highlighting experience and sometimes makes text extraction impractical or outright infeasible.
I sometimes end up with something like this:
textthathasnospacesandsudden lythereisagiantspacebecausethereisabreakline
I’m learning that not all PDFs are created equal. For some of them (the “bad” ones), the PDF “driver” actually needs to interpret how much physical space there is between letters in order to infer whether to output 0, 1 or more spaces when copying text. (I’ve just learned that PDFs don’t even have a concept of words or anything like that; the content is just a mapping of characters and figures.)
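To make the idea concrete, here is a minimal sketch of how I understand such a gap heuristic might work. It is not how any particular PDF reader actually does it: `Glyph`, `rebuild_line` and the `gap_ratio` threshold are names and values I made up for illustration, and the coordinates are invented.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned character on a line (hypothetical structure)."""
    char: str
    x0: float  # left edge of the glyph box
    x1: float  # right edge of the glyph box


def rebuild_line(glyphs, gap_ratio=0.3):
    """Join glyphs left to right, inserting a space wherever the
    horizontal gap exceeds gap_ratio times the average glyph width.
    The 0.3 default is an arbitrary assumption, not a standard value."""
    glyphs = sorted(glyphs, key=lambda g: g.x0)
    if not glyphs:
        return ""
    avg_width = sum(g.x1 - g.x0 for g in glyphs) / len(glyphs)
    out = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur.x0 - prev.x1
        if gap > gap_ratio * avg_width:
            out.append(" ")  # gap is wide enough to count as a word break
        out.append(cur.char)
    return "".join(out)


# Four glyphs: small gaps inside "hi" and "yo", a wide gap between them.
sample = [
    Glyph("h", 0.0, 5.0),
    Glyph("i", 5.5, 10.5),
    Glyph("y", 14.0, 19.0),
    Glyph("o", 19.5, 24.5),
]
print(rebuild_line(sample))  # prints "hi yo"
```

If the threshold is too high for a given font or layout, words run together; if it is too low, spaces appear inside words. That would match the kind of output shown above.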
To remedy this, I’ve been using the free version of the ABBYY software, but I have the feeling that there must be open-source software (perhaps even server-based?) that could produce proper text from poorly created PDFs… right?
Any suggestions?