Extract Highlighted Text & Remove All Text from PDF Documents using .NET

Aspose.PDF for .NET 18.6 introduces new features related to text manipulation and PDF/UA validation. This release supports extracting highlighted text from PDF documents and adds a new method for removing all text from PDF pages. Using this method is recommended when removing all text from a PDF document, as it minimizes processing time.


Lane Cove, NSW, 2066 Australia, July 16, 2018 - (PressReleasePoint) -

What's New in this Release?

The Aspose team is excited to announce Aspose.PDF for .NET 18.6. This release introduces new features related to text manipulation and PDF/UA validation, and it also fixes bugs reported in earlier versions of the API.

Extracting highlighted text from PDF documents has been an essential requirement. Earlier versions could extract text from PDF documents by matching a regular expression or by searching for a specified string, and the TextFragmentAbsorber and TextAbsorber classes have been used often and efficiently for that purpose. To support highlighted text, the team investigated the feature and introduced the TextMarkupAnnotation.GetMarkedText() and TextMarkupAnnotation.GetMarkedTextFragments() methods. Users can extract highlighted text from a PDF document by filtering annotations for TextMarkupAnnotation and calling these methods. An example demonstrating the feature is also available in the API documentation.

When removing text from PDF documents with earlier versions of the API, users needed to set each found text fragment to an empty string. Doing so invokes a number of checks and text position adjustments, which is why several performance issues were observed during such operations. The number of checks and adjustments cannot be reduced, because they are essential in text editing scenarios. Moreover, users cannot determine how many text fragments will be removed and adjusted when they are processed in a loop. In Aspose.PDF for .NET 18.6, a new method based on Aspose.Pdf.Operators.TextShowOperator has been introduced to remove all text from PDF pages. Using this method is recommended when removing all text from a PDF document, as it minimizes processing time.
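The two text features described above can be sketched roughly as follows. This is a minimal illustration rather than the official sample: the file names input.pdf and output.pdf are hypothetical, and the use of OperatorSelector together with page.Contents.Accept() and page.Contents.Delete() is an assumption about how the TextShowOperator-based removal is wired up; please check the API documentation for the exact usage in your version.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;

class TextFeaturesSketch
{
    static void Main()
    {
        // Hypothetical input file name.
        Document doc = new Document("input.pdf");

        // 1. Extract marked (for example, highlighted) text by filtering
        //    each page's annotations for TextMarkupAnnotation.
        foreach (Page page in doc.Pages)
        {
            foreach (Annotation annotation in page.Annotations)
            {
                TextMarkupAnnotation markup = annotation as TextMarkupAnnotation;
                if (markup == null)
                {
                    continue; // not a text markup annotation
                }
                Console.WriteLine(markup.GetMarkedText());
            }
        }

        // 2. Remove all text from every page by selecting the text-showing
        //    operators and deleting them from the page contents
        //    (assumed usage, see the note above).
        foreach (Page page in doc.Pages)
        {
            OperatorSelector selector =
                new OperatorSelector(new Aspose.Pdf.Operators.TextShowOperator());
            page.Contents.Accept(selector);
            page.Contents.Delete(selector.Selected);
        }

        // Hypothetical output file name.
        doc.Save("output.pdf");
    }
}
```

Note that TextMarkupAnnotation also covers markup types other than highlights, so a stricter filter may be needed if only highlights are of interest.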
In the latest release of Aspose.PDF for .NET, all descendants of Aspose.Pdf.Operator were moved into the Aspose.Pdf.Operators namespace. Thus ‘new Aspose.Pdf.Operators.GSave()’ should be used instead of ‘new Aspose.Pdf.Operator.GSave()’. When upgrading to the latest version of the API, users will need to update any existing code that uses the previous Aspose.Pdf.Operator names. The team has also worked on accessibility features as part of its 508 compliance (WCAG) effort: a PDF/UA validation feature and Tagged PDF support were added. The important new and improved features are listed below.

  • Add feature "Extract Highlighted Text from HighlightTextMarkUpAnnotations" to the TextFragmentAbsorber class
  • Add support of OTF font when embedding in PDF
  • Text Extraction - Spaces are improperly embedded inside words
  • TableAbsorber throws exception while trying to access any row other than first row of first table or any other table than first
  • PDF to Image - Some contents are overlapping
  • PDF to JPEG - Incorrect output
  • TableAbsorber: incorrect table count in PDF
  • Text is overlapped when saving particular document as image or HTML
  • PDF to HTML - Object reference not set to an instance of an object
  • Conversion HTML to PDF produces incorrect output
  • PDF to PDFA - Comments are broken in resultant document
  • Flattening Fields is not flattening the Print button inside PDF
  • The output is too big after conversion to PDFA_1B format
  • After conversion PDF-to-PDFA the output contains corrupted diagram
  • The document loaded from HTML file looks different than the original
  • PDF to PDF/A-1b - the output PDF does not pass compliance test
  • PDF to PDF/A-1b - the output PDF does not pass compliance test
  • PDF to JPG - Blue gradient is darker in the JPG compared to the PPT slide PDF
  • PDF to JPG - Objects fading to transparent
  • PDF to JPG - transparent turns to white
  • PDF to JPG - Objects fading to transparent causes image differences
  • PDF to JPG - Objects fading to transparent causes image differences
  • Yellow background not same after converting PDF to PDF/A
  • JPEG output loses the fade effect on the source document
  • The document image loses fading to transparent in PDF output
  • Blank pages added after HTML to PDF rendition
  • PDF to PDF/A-2b - the chart labels are rotated
  • PDF to PDF/A-2b - some labels get blurred
  • Duplicated evaluation watermarks when saving EPUB document
  • Output image or html is filled with black color
  • HTML to PDF - exception thrown
  • Flattening Fields is not flattening the buttons inside PDF
  • Multi byte characters not displayed in PDF
  • Header added but footer is missing (HTML->PDF)
  • The header and the footer exist only on the first page.
  • Missing table after adding to Footer
  • PDF to PDF/A-2b
  • Unable to load OTF Font from a resource stream 

Other most recent bug fixes are also included in this release.

Newly added documentation pages and articles

Some new tips and articles have been added to the Aspose.PDF for .NET documentation that briefly guide users on how to use Aspose.PDF for different tasks, like the following.

Overview: Aspose.Pdf for .NET

Aspose.Pdf is a .NET PDF component for creating and manipulating PDF documents without using Adobe Acrobat. PDF files can be created through the API, XML templates, and XSL-FO files. It supports form field creation, PDF compression options, table creation and manipulation, graph objects, extensive hyperlink functionality, extended security controls, custom font handling, adding or removing bookmarks, TOC, attachments, and annotations, importing or exporting PDF form data, and more. It can also convert HTML, XSL-FO, and MS Word documents to PDF.

More about Aspose.Pdf for .NET

Contact Information

Aspose Pty Ltd, Suite 163,
79 Longueville Road
Lane Cove, NSW, 2066
Australia
Aspose - The File Format Experts
sales@aspose.com
Phone: 888.277 6734
Fax: 866.810 9465


Press Contact:
David
Aspose Pty Ltd, Suite 163,
79 Longueville Road
Lane Cove, NSW, 2066
Australia
888.277.6734