Some PDFs in a project could not be read by the parser.
On closer examination of the binary data, I noticed that the byte offset following startxref is often preceded by a space.
After a brief search on the Internet, I could not find any information on whether this space is permitted. Perhaps someone here who is more familiar with the subject knows more.
Allowing optional whitespace at this point in the RegEx lets the parser read these PDFs again.
The RegEx would then look like this:
'/(?<=[\r\n])startxref[\s]*[\r\n]+[\s]*([0-9]+)[\s]*[\r\n]+%%EOF/i'
pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php
Lines 884 to 891 in f44ada0
    // Find all startxref tables from this $offset forward
    $startxrefPreg = preg_match_all(
        '/(?<=[\r\n])startxref[\s]*[\r\n]+([0-9]+)[\s]*[\r\n]+%%EOF/i',
        $pdfData,
        $startxrefMatches,
        \PREG_SET_ORDER,
        $offset
    );
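To illustrate the difference, here is a minimal sketch that runs both patterns against a trailer whose offset line begins with a space. The sample data and the variable names other than `$pdfData` are made up for this example and are not taken from the affected PDFs or from the library; it only shows how the two expressions behave, not whether such a space is valid according to the PDF specification.

```php
<?php
// Hypothetical trailer: the byte offset after "startxref" is preceded by a space.
$pdfData = "trailer\n<< /Size 10 >>\nstartxref\n 12345\n%%EOF\n";

// Pattern currently used in RawDataParser.php (as quoted above).
$currentPattern = '/(?<=[\r\n])startxref[\s]*[\r\n]+([0-9]+)[\s]*[\r\n]+%%EOF/i';

// Proposed pattern: an extra [\s]* allows whitespace before the offset digits.
$proposedPattern = '/(?<=[\r\n])startxref[\s]*[\r\n]+[\s]*([0-9]+)[\s]*[\r\n]+%%EOF/i';

// Count the matches of each pattern on the same input.
$currentCount = preg_match_all($currentPattern, $pdfData, $currentMatches, \PREG_SET_ORDER);
$proposedCount = preg_match_all($proposedPattern, $pdfData, $proposedMatches, \PREG_SET_ORDER);

echo "current pattern matches:  {$currentCount}\n";
echo "proposed pattern matches: {$proposedCount}\n";

// Print the captured offset if the proposed pattern found one.
if ($proposedCount > 0) {
    echo "captured offset: {$proposedMatches[0][1]}\n";
}
```

With this input, the current pattern finds nothing because the character after the line break is a space rather than a digit, while the proposed pattern captures the offset. A trailer without the leading space is matched by both patterns, so the change should not affect PDFs that are already read correctly.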