Extract text from non-top-to-bottom rendered PDF

05 Jun 2022

Quality checks on PDF documents can be challenging, as the text can't always be copied easily for verification. There are a few ways to extract text from a PDF document:

Simple method: Right click > Select All > Copy, and then paste it into your text editor.
Adobe suggested method: Go to Edit > Copy file to Clipboard and paste it into your text editor.
Programmatic text extraction: You can use a PDF manipulation library such as iText to extract the text content, and then automate the quality checks on top of it.

This post focuses on programmatic text extraction and the challenges you may face.

The iText library offers various PDF creation and manipulation methods, which help you build complex PDF documents. It also provides supporting methods to extract the text content from a PDF document. The implementation is simple for the consuming application.

iText ships with various text extraction strategies. The following two are the most useful, depending on your text extraction requirements:

SimpleTextExtractionStrategy - returns the text in exactly the order in which the application that produced the PDF placed the content.
LocationTextExtractionStrategy - uses the x, y co-ordinates (vector points) of the text blocks to determine the text extraction order.

SimpleTextExtractionStrategy is amazingly simple and suitable for most use cases. The implementation is as below:

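// Types used below: iText.Kernel.Pdf (PdfReader, PdfDocument) and
// iText.Kernel.Pdf.Canvas.Parser(.Listener) (PdfTextExtractor, SimpleTextExtractionStrategy)
var reader = new PdfReader(openFileDialog1.FileName);
// Disposing the document also closes the underlying reader
using var pdfDocument = new PdfDocument(reader);
var pagesCount = pdfDocument.GetNumberOfPages();

for (var i = 1; i <= pagesCount; i++)
{
        // Use a fresh strategy per page: SimpleTextExtractionStrategy keeps
        // accumulating text, so a shared instance would repeat earlier pages
        var strategy = new SimpleTextExtractionStrategy();
        // text content from current page
        var text = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy);
}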

SimpleTextExtractionStrategy works fine for documents that are rendered from top to bottom, i.e., the text blocks are placed in exactly the same order as they are displayed in the PDF document. A properly constructed document can be extracted perfectly using this strategy and quality checked without any issues.

If you are using third-party PDF conversion software, e.g., an HTML to PDF converter or a Word to PDF converter, then the order of text placement within a page is not within your control. If the software places the text paragraphs starting from the end of the page, then the text extraction will start from that text block and follow the order in which the PDF file was rendered. This creates a problem for quality checks.

For this issue, I've implemented a custom text extraction strategy that tries to extract the PDF contents in the same order as you see them from top to bottom of a page. It uses a combination of the document flow and the co-ordinates to determine where a text block should be placed.

using System;
using System.Collections.Generic;
using System.Text;
using iText.Kernel.Geom;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

namespace ITextReadPdf;

/// <summary>
///     The top to bottom text extraction strategy.
/// </summary>
public class TopToBottomTextExtractionStrategy : ITextExtractionStrategy
{
    private readonly SortedList<PdfTextBlocks, PdfTextBlocks> _textBlocks = new(new PdfTextBlocksComparer());
    private PdfTextBlocks _currentTextBlock = new();
    private int secondaryOrderCount = 1;

    private Vector? _lastEnd;
    private Vector? _lastStart;
    private readonly bool _debug = false;
    private bool firstRender = true;

    /// <inheritdoc />
    public void EventOccurred(IEventData data, EventType type)
    {
        if (type.Equals(EventType.RENDER_TEXT))
        {
            var renderInfo = (TextRenderInfo)data;
            var segment = renderInfo.GetBaseline();
            var start = segment.GetStartPoint();
            var end = segment.GetEndPoint();

            var sameLine = false;
            if (!firstRender)
            {
                var x0 = start;
                var x1 = _lastStart;
                var x2 = _lastEnd;

                var dist = x2!.Subtract(x1).Cross(x1!.Subtract(x0)).LengthSquared() / x2.Subtract(x1).LengthSquared();

                var sameLineThreshold = 2f;
                // dist is a squared distance: if the new chunk starts within the
                // threshold of the previous baseline, we're still on the same line
                if (dist <= sameLineThreshold)
                    sameLine = true;

                if ((int) Math.Floor(x0.Get(1)) > (int) Math.Floor(x1!.Get(1)))
                {
                    if (_debug)
                        _currentTextBlock.Content.Append($"\n == Order Change [{x0.Get(1)} > {x1.Get(1)}] == \n");
                    // There seems to be change in the order make it as a new block
                    _textBlocks.Add(_currentTextBlock, _currentTextBlock);
                    secondaryOrderCount++;
                    _currentTextBlock = new PdfTextBlocks
                    {
                        StartX = (int) Math.Floor(x0.Get(0)),
                        StartY = (int) Math.Floor(x0.Get(1)),
                        SecondaryOrder = secondaryOrderCount
                    };
                }
                // Check for sequential content
                // A squared distance above 400 (about 20 units from the previous baseline)
                // is treated as a large gap and starts a new block
                else if (dist > 400)
                {
                    if (_debug)
                        _currentTextBlock.Content.Append($"\n == Large Gaps [{dist} > 400] == \n");
                    _textBlocks.Add(_currentTextBlock, _currentTextBlock);
                    _currentTextBlock = new PdfTextBlocks
                    {
                        StartX = (int) Math.Floor(x0.Get(0)),
                        StartY = (int) Math.Floor(x0.Get(1)),
                        SecondaryOrder = secondaryOrderCount
                    };
                }
            }


            if (firstRender)
            {
                _currentTextBlock.StartX = (int) Math.Floor(start.Get(0));
                _currentTextBlock.StartY = (int) Math.Floor(start.Get(1));
                _currentTextBlock.SecondaryOrder = secondaryOrderCount;
                firstRender = false;
            }
            else
            {
                // Only consider adding a space when the new text is not empty
                // and does not already start with a space
                if (renderInfo.GetText().Length > 0 && !renderInfo.GetText().StartsWith(" "))
                {
                    // Calculate the distance between the end of the previous chunk
                    // and the start of the new one
                    var spacing = _lastEnd!.Subtract(start).Length();
                    // If it "looks" like it should be a space
                    if (spacing > renderInfo.GetSingleSpaceWidth() / 2f) //Add a space
                        _currentTextBlock.Content.Append(" ");
                }
            }

            _currentTextBlock.Content.Append((sameLine ? string.Empty : "\n") + renderInfo.GetText());

            _lastStart = start;
            _lastEnd = end;
        }
    }

    /// <inheritdoc />
    public ICollection<EventType> GetSupportedEvents()
    {
        // Returning null tells iText that this listener handles every event type
        return null!;
    }

    /// <inheritdoc />
    public virtual string GetResultantText()
    {
        var buf = new StringBuilder();
        if (_currentTextBlock.Content.Length > 0) _textBlocks.Add(_currentTextBlock, _currentTextBlock);
        foreach (var sortedBlock in _textBlocks) buf.AppendLine(sortedBlock.Value.ToString());
        Reset();
        return buf.ToString();
    }

    /// <summary>
    ///     Reset current page buffer
    /// </summary>
    public void Reset()
    {
        _currentTextBlock = new PdfTextBlocks();
        firstRender = true;
        _textBlocks.Clear();
        secondaryOrderCount = 1;
    }
}
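
The strategy above depends on two helper types, PdfTextBlocks and PdfTextBlocksComparer, that are not listed in this post. If you want to compile the class on its own, the following is a minimal hypothetical sketch of what they could look like. The sort order (top of the page first, then SecondaryOrder) and the Sequence tie-breaker are assumptions made for this sketch and may differ from the implementation in the GitHub project.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading;

// Hypothetical stand-in for the PdfTextBlocks type used by the strategy
public class PdfTextBlocks
{
    private static int _nextSequence;

    // Creation order (assumption): final tie-breaker so that two distinct
    // blocks never compare as equal, which SortedList.Add would reject
    public int Sequence { get; } = Interlocked.Increment(ref _nextSequence);

    public int StartX { get; set; }
    public int StartY { get; set; }
    public int SecondaryOrder { get; set; }
    public StringBuilder Content { get; } = new();

    public override string ToString() => Content.ToString();
}

// Hypothetical comparer: orders blocks from the top of the page downwards
public class PdfTextBlocksComparer : IComparer<PdfTextBlocks>
{
    public int Compare(PdfTextBlocks? a, PdfTextBlocks? b)
    {
        if (ReferenceEquals(a, b)) return 0;
        if (a is null) return -1;
        if (b is null) return 1;

        // PDF co-ordinates start at the bottom-left corner,
        // so a larger Y means the block sits higher on the page
        var byY = b.StartY.CompareTo(a.StartY);
        if (byY != 0) return byY;

        // Same height: keep the order in which the blocks were detected
        var bySecondaryOrder = a.SecondaryOrder.CompareTo(b.SecondaryOrder);
        if (bySecondaryOrder != 0) return bySecondaryOrder;

        return a.Sequence.CompareTo(b.Sequence);
    }
}
```

With this ordering, a block that starts at Y = 700 is emitted before a block that starts at Y = 50, no matter which of the two was rendered first.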

You can download a test Windows application that uses this strategy to extract text from a PDF from the following GitHub project:

Ref: davidsekar/iText-TopToBottomTextExtractionStrategy (github.com)

Notes: There are a few harmless assumptions (for example, the distance > 400 check) that may result in additional line breaks. But that shouldn't be a deal breaker for the quality check ;). Also, when there is a 2-column layout in the PDF and text in the second column sits higher on the page (by Y position) than text in the first column, it gets extracted in front of the left/first column.
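
To get a feel for these thresholds: the dist value in EventOccurred is the squared perpendicular distance from the start of the new text chunk to the line through the previous chunk's baseline. So the same-line threshold of 2 corresponds to roughly 1.4 units, and the 400 check to 20 units. The sketch below reproduces that calculation on its own so you can experiment with the numbers. It assumes System.Numerics.Vector3 as a stand-in for iText's Vector. The BaselineMath class and its zero-length guard are illustrative additions, not part of the strategy itself.

```csharp
using System;
using System.Numerics;

public static class BaselineMath
{
    // Squared perpendicular distance from point p to the infinite line
    // that passes through lineStart and lineEnd
    public static float SquaredDistanceToLine(Vector3 p, Vector3 lineStart, Vector3 lineEnd)
    {
        var direction = lineEnd - lineStart;
        var lengthSquared = direction.LengthSquared();

        // Guard (assumption): a zero-length baseline has no direction,
        // so fall back to the plain point-to-point distance
        if (lengthSquared == 0f)
            return (p - lineStart).LengthSquared();

        // Same formula as in EventOccurred: |(x2 - x1) x (x1 - x0)|^2 / |x2 - x1|^2
        return Vector3.Cross(direction, lineStart - p).LengthSquared() / lengthSquared;
    }
}
```

For example, with a previous baseline running from (0, 0) to (10, 0), a chunk that starts 3 units above it gives a squared distance of 9, which is over the same-line threshold of 2 but far below 400, so it becomes a new line within the same block.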

This code can be tweaked further and fixes can be made to make it as accurate as possible. Let me know if this code helps you, and do share any ideas you have for making it more accurate.