Wrong characters / difference between extraction and display

I noticed that in old PDFs (probably digitized ones), characters are displayed correctly but extracted incorrectly. Since the purpose of pdfalto and GROBID is to extract text from PDFs when the original text is not available, this issue seems important to address. However, I am not sure whether there is a solution.

I have come across many examples, but this [PDF](https://we.tl/t-PHY4sD4wpM) is a particularly clear one. All instances of the word "_awkward_" are displayed identically by PDF viewers, but the last one is extracted by pdfalto as "_awkM'ard_".

When I inspected the file, the text object actually contains _awkM'ard_, but it would be very beneficial if it were mapped to the correct, or at least a meaningful, character.
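To illustrate the kind of mapping I have in mind, here is a minimal post-processing sketch in Python. It is not part of pdfalto or GROBID, and the names `CONFUSIONS`, `SUSPICIOUS`, `repair_token`, and `repair_text` are hypothetical. It assumes a small hand-written table of known bad sequences (only the `M'` to `w` entry comes from the example above) and only rewrites tokens that already look malformed.

```python
import re

# Hypothetical confusion table: sequences found in the stored text layer,
# mapped to the letter the page actually shows. Only "M'" -> "w" is taken
# from the example in this issue; a real table would need to be curated.
CONFUSIONS = {"M'": "w"}

# Heuristic (an assumption, not a pdfalto rule): a token is suspicious when
# an uppercase letter, optionally followed by an apostrophe, sits between
# two lowercase letters, as in "awkM'ard".
SUSPICIOUS = re.compile(r"[a-z][A-Z]'?[a-z]")


def repair_token(token: str) -> str:
    """Apply the confusion table, but only to tokens that look malformed."""
    if not SUSPICIOUS.search(token):
        return token
    for bad, good in CONFUSIONS.items():
        token = token.replace(bad, good)
    return token


def repair_text(text: str) -> str:
    """Repair every whitespace-delimited token, keeping the spacing intact."""
    return re.sub(r"\S+", lambda m: repair_token(m.group(0)), text)


if __name__ == "__main__":
    print(repair_text("an awkM'ard pause"))
```

Tokens such as `don't` are left alone because they do not match the heuristic, and mixed-case names such as `McDonald` are flagged but unchanged because they contain no entry from the table. A fix inside the extractor itself would likely be more robust than this kind of after-the-fact patching.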

Wrong characters / difference between extraction and display #160
