Possible startxrefPreg extension #756

Some PDFs in a project could not be read by the parser.

On closer examination of the binary data, I noticed that these files often have a space in front of the byte offset that follows `startxref`.

A brief search online did not turn up whether this space is permitted. Perhaps someone here who is more familiar with the PDF format knows more.

Allowing optional whitespace at this point in the regex, directly before the offset, lets the parser recognize these PDFs.

The regex would then look like this:

`'/(?<=[\r\n])startxref[\s]*[\r\n]+[\s]*([0-9]+)[\s]*[\r\n]+%%EOF/i'`

https://github.com/smalot/pdfparser/blob/f44ada017eac4f607ffeb1caca96a2347d48f38f/src/Smalot/PdfParser/RawData/RawDataParser.php#L884-L891

	// Find all startxref tables from this $offset forward
	$startxrefPreg = preg_match_all(
		'/(?<=[\r\n])startxref[\s]*[\r\n]+([0-9]+)[\s]*[\r\n]+%%EOF/i',
		$pdfData,
		$startxrefMatches,
		\PREG_SET_ORDER,
		$offset
	);
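
To make the difference concrete, here is a minimal sketch that runs both patterns against two made-up trailers. It assumes the current pattern uses `[\s]*` quantifiers in the same places as the proposed one. The sample strings and the variable names (`$currentPattern`, `$proposedPattern`, `$samples`, `$results`) are hypothetical and not taken from the library or from a real PDF.

```php
<?php
// Pattern assumed to be the one currently used in RawDataParser.php.
$currentPattern = '/(?<=[\r\n])startxref[\s]*[\r\n]+([0-9]+)[\s]*[\r\n]+%%EOF/i';

// Proposed pattern: additionally allows whitespace between the line break and the offset.
$proposedPattern = '/(?<=[\r\n])startxref[\s]*[\r\n]+[\s]*([0-9]+)[\s]*[\r\n]+%%EOF/i';

// Hypothetical trailers; the second one has a space in front of the byte offset.
$samples = [
    'plain offset'  => "trailer\n<< /Size 10 >>\nstartxref\n1234\n%%EOF\n",
    'leading space' => "trailer\n<< /Size 10 >>\nstartxref\n 1234\n%%EOF\n",
];

// For each sample, record the captured offset per pattern, or null if there is no match.
$results = [];
foreach ($samples as $label => $pdfData) {
    $results[$label] = [
        'current'  => preg_match($currentPattern, $pdfData, $m) ? $m[1] : null,
        'proposed' => preg_match($proposedPattern, $pdfData, $m) ? $m[1] : null,
    ];
}

var_export($results);
```

With these samples, both patterns capture `1234` from the plain trailer. For the trailer with the leading space, the current pattern finds no match, while the proposed one still captures `1234`.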
