Registering custom text extractor does not work #21

ghost · 2020-04-29T15:34:53Z

I have registered a custom text extractor (PdfPig).

However, it doesn't hit any of my breakpoints, and it doesn't seem to return any results.

I have registered it as below:

using Examine;
using Examine.LuceneEngine.Providers;
using Umbraco.Core;
using Umbraco.Core.Composing;
using UmbracoExamine.PDF;
using UmbracoExaminePDF.Extractors;

namespace UmbracoExaminePDF.Composers
{
    [ComposeAfter(typeof(ExaminePdfComposer))] //this must execute after the ExaminePdfComposer composer
    public class ExaminePdfComposer : ComponentComposer<ExaminePdfComponent>, IUserComposer
    {
        public override void Compose(Composition composition)
        {
            composition.RegisterUnique<IPdfTextExtractor, PdfPigTextExtractor>();
        }
    }

    public class ExaminePdfComponent : IComponent
    {
        private readonly IExamineManager _examineManager;

        public ExaminePdfComponent(IExamineManager examineManager)
        {
            _examineManager = examineManager;
        }

        public void Initialize()
        {
            //Get both the external and pdf index
            if (_examineManager.TryGetIndex(Constants.UmbracoIndexes.ExternalIndexName, out var externalIndex)
                && _examineManager.TryGetIndex(PdfIndexConstants.PdfIndexName, out var pdfIndex))
            {
                //register a multi searcher for both of them
                var multiSearcher = new MultiIndexSearcher("MultiSearcher", new IIndex[] { externalIndex, pdfIndex });
                _examineManager.AddSearcher(multiSearcher);
            }
        }

        public void Terminate() { }
    }
}

And the PdfPig extractor is pretty simple:

using System.IO;
using System.Text;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using UmbracoExamine.PDF;

namespace UmbracoExaminePDF.Extractors
{
    /// <summary>
    /// Extracts text from a PDF using PdfPig
    /// https://github.com/UglyToad/PdfPig
    /// </summary>
    public class PdfPigTextExtractor : IPdfTextExtractor
    {
        public string GetTextFromPdf(Stream pdfFileStream)
        {
            using (PdfDocument document = PdfDocument.Open(pdfFileStream))
            {
                var result = new StringBuilder();
                foreach (Page page in document.GetPages())
                {
                    result.AppendLine(page.Text);
                }

                return result.ToString();
            }
        }
    }
}

Any help would be appreciated.

cleversolutions · 2020-09-15T20:13:51Z

I'll take a look at this and get back to you. PdfPig looks really promising. I have been putting a ton of effort into adding text extraction to PDFSharp, and this seems to do a pretty decent job out of the box. It's also Apache 2.0 licensed.

kdx-perbol · 2022-01-14T09:42:54Z

This works for us. We RegisterUnique in a composer that ComposeAfters ExaminePdfComposer, and our extractor runs. The code is trivial, but let me know if it's needed.
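
For reference, a minimal sketch of what such a composer might look like. This is not kdx-perbol's actual code. It assumes Umbraco 8, that the package's composer is `UmbracoExamine.PDF.ExaminePdfComposer`, and that `PdfPigTextExtractor` is the class from the original post; the namespace `MySite.Composers` and the class name `PdfPigExtractorComposer` are hypothetical. The composer is deliberately given a name other than `ExaminePdfComposer`: in C#, a type declared in the enclosing namespace takes precedence over one brought in by a `using` directive, so a user class that is itself called `ExaminePdfComposer` would make `typeof(ExaminePdfComposer)` refer to itself rather than to the package's composer.

```csharp
using Umbraco.Core;
using Umbraco.Core.Composing;
using UmbracoExamine.PDF;
using UmbracoExaminePDF.Extractors; // PdfPigTextExtractor from the original post

namespace MySite.Composers // hypothetical namespace
{
    // Run after the package's composer so this registration replaces its default extractor.
    // The class name differs from ExaminePdfComposer, so the typeof below
    // resolves to UmbracoExamine.PDF.ExaminePdfComposer (assumed location).
    [ComposeAfter(typeof(ExaminePdfComposer))]
    public class PdfPigExtractorComposer : IUserComposer
    {
        public void Compose(Composition composition)
        {
            // Swap the IPdfTextExtractor implementation for the PdfPig-based one.
            composition.RegisterUnique<IPdfTextExtractor, PdfPigTextExtractor>();
        }
    }
}
```

Keeping the extractor registration in its own small composer, separate from any multi-searcher component, also makes it easier to check in the debugger whether `Compose` is reached at all.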
