An Examination of the Redaction Functionality of Adobe Acrobat Pro DC 2017
Introduction
There have been numerous cases of security breaches resulting from a failure to effectively redact sensitive or private
information from documents prior to release into the public domain [1][2]. To assist in mitigating this security risk, Adobe
Acrobat Pro DC 2017 provides redaction and sanitisation functionality that aims to completely remove undesirable
information and other hidden information (e.g. metadata) from PDF documents.
This document provides guidance on the efficacy of redaction facilities within Adobe Acrobat Pro DC 2017 and is
intended for information technology and information security professionals within organisations looking to redact
sensitive or personal information from PDF documents before releasing them into the public domain or to other third
parties.

Scope of testing
The Australian Signals Directorate (ASD) previously examined the redaction functionality in Adobe Acrobat Pro 10 to
determine if any redacted information could be recovered from PDF documents. The current round of testing by the
Australian Cyber Security Centre (ACSC) aimed to examine the same functionality previously tested but in Adobe
Acrobat Pro DC 2017.
For the purposes of this document, the definition of successful redaction was the complete removal of redacted data
from every location in a PDF document’s file structure.
As part of testing, a number of test cases were considered that represented some of the different ways that
information could be stored within a PDF document. This included:

• embedded text
• embedded image
• data from historical editing
• interactive form
• embedded text obscured by an embedded image
• embedded text in an encrypted PDF
• embedded metadata.

[1] https://nakedsecurity.sophos.com/2011/04/18/how-not-to-redact-a-pdf-nuclear-submarine-secrets-spilled/
[2] https://www.pcworld.com/article/183788/article.html

Unless otherwise stated, the following application versions were used:

• Adobe Acrobat Pro DC 2017 (2017.012.20093), which installs Adobe PDF Library 15 and Adobe Acrobat Distiller 17
• Microsoft Word 2010 (14.0.6023.1)
• LibreOffice Writer 5.1.6.2
• CutePDF Writer 3.2, which installs Ghostscript 8.15.

The Calibri font was used in Microsoft Office documents in Microsoft Windows. This font was installed in Ubuntu Linux
so that test files could be opened in LibreOffice Writer.
PDF documents were generated using each of the following rendering engines:

• Adobe Acrobat (using the ‘Create PDF’ Microsoft Word add-in or native PDF authoring within Adobe Acrobat, both of which use the Adobe PDF Library)
• Adobe Acrobat Distiller (printing to the Adobe PDF printer)
• Microsoft Word (using ‘Save As’ PDF functionality)
• CutePDF (printing to the CutePDF Writer printer)
• LibreOffice Writer (using the ‘Export As PDF’ functionality).

In some cases, not all rendering engines were tested as not all possessed the necessary functionality.
For the purposes of testing, the application used to create the PDF documents refers to the rendering engine that did
the PDF conversion. For example, using the ‘Create PDF’ Microsoft Word add-in installed by Adobe Acrobat, the PDF
conversion is performed by the Adobe Acrobat rendering engine (Adobe PDF Library). Similarly, when printing to the
Adobe PDF printer installed by Adobe Acrobat within Microsoft Word, the file conversion is performed by the Adobe
Acrobat Distiller rendering engine. Only choosing to save the file by selecting the ‘Save As’ PDF option in Microsoft
Word results in a PDF being rendered by Microsoft Word.
The previous testing conducted by ASD in 2011 used a single rendering engine to create PDF documents. In the current
round of testing, PDF documents using different rendering engines were used to determine whether the source of the
PDF document had any impact upon successful redaction.
Depending on the rendering engine used, PDF documents were generated using different versions of the PDF
standard:

• Adobe Acrobat
  ◦ Adobe PDF Library (PDF version 1.5)
  ◦ Adobe Acrobat Distiller (PDF version 1.5)
• Microsoft Word (PDF version 1.5)
• CutePDF (PDF version 1.4)
• LibreOffice Writer (PDF version 1.4).

Depending on the Adobe Acrobat function used, the resulting PDF documents used different versions of the PDF standard:

• interactive form (PDF version 1.6)
• encryption (PDF version 1.6)
• sanitisation (PDF version 1.6)
• redaction (PDF version 1.7).
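
The PDF version of a file can usually be read from the %PDF-x.y header at the start of the file. The following is a minimal Python sketch and was not part of the testing described here; the function name `pdf_header_version` is hypothetical. From PDF version 1.4 onwards, a Version entry in the document catalog can override the header, which this sketch does not check.

```python
import re
from typing import Optional


def pdf_header_version(data: bytes) -> Optional[str]:
    """Return the version from a '%PDF-x.y' header, or None if absent."""
    # The header is expected at the start of the file. Only the first
    # 1024 bytes are searched, in case a writer prepends a few bytes.
    match = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    return match.group(1).decode("ascii") if match else None
```

For example, the function could be called as `pdf_header_version(open(path, "rb").read())`, where `path` is a file of the reader's choosing.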

The redacted PDF documents were analysed with free or open source tools to determine whether any redacted
information could be recovered:

• PDFMiner toolkit written by Yusuke Shinyama (pdf2txt) [3]
• Poppler toolkit (pdfimages) [4]
• Origami toolkit (pdfwalker) [5]
• PDF Stream Dumper 9.3 by David Zimmer [6].

Testing results and recommendations
Successful redaction outcomes
No redacted information was recovered from PDF documents created with Adobe Acrobat, Adobe Acrobat Distiller and
Microsoft Word. This result is similar to that of previous testing conducted by ASD in 2011 which examined the
redaction functionality of Adobe Acrobat Pro 10.

Failures in redacting information
Remnants of redacted information were recovered from PDF documents created with CutePDF and LibreOffice Writer.
The remnants of redacted information were located within objects containing embedded font maps (CMap objects) [7].
The ability to recover these data remnants was the result of differences in the mechanisms used by CutePDF and
LibreOffice Writer to embed font maps, and the Adobe Acrobat redaction functionality’s inability to identify and
remove them. The Adobe Acrobat sanitisation functionality also failed to remove these data remnants.
It is not known whether PDF documents created by rendering engines not tested during this round of testing will also
fail to be successfully redacted. Until this is known, assurance that data remnants cannot be recovered from redacted
PDF documents requires that the creation of PDF documents be restricted to Adobe Acrobat, Adobe Acrobat Distiller
and Microsoft Word.
To assist in identifying the software used to create a PDF document, the metadata can be examined via the document’s
properties in Adobe Acrobat or Adobe Acrobat Reader. If the PDF document was created with Adobe Acrobat, Adobe
Acrobat Distiller or Microsoft Word, the PDF Producer field will contain ‘Adobe PDF Library’, ‘Acrobat Distiller’ or
‘Microsoft Word’ respectively. If the PDF Producer field contains something else, there is a chance that redaction of
sensitive or private information might fail. Note that if the PDF document has previously been sanitised, the metadata
will have been deleted and the PDF Producer field will be empty. In these cases, the document should be treated as if it was
created by a rendering engine whose output cannot be successfully redacted by Adobe Acrobat.
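
The PDF Producer check described above could be automated along the following lines. This is a minimal Python sketch under stated assumptions: the names `TESTED_PRODUCERS` and `producer_is_tested` are hypothetical, the Producer string is assumed to have already been read from the document's properties, and the removal of the registered-trademark sign is an assumption made in case a producer includes it in its name. Metadata can be edited, so a match is an indication of how the file was created rather than proof.

```python
from typing import Optional

# Producer names listed in the guidance above.
TESTED_PRODUCERS = ("Adobe PDF Library", "Acrobat Distiller", "Microsoft Word")


def producer_is_tested(producer: Optional[str]) -> bool:
    """Return True only if the Producer names a successfully tested engine."""
    # An empty or missing Producer (for example after sanitisation) is
    # treated as untested, as the guidance recommends.
    if not producer or not producer.strip():
        return False
    # Assumption: strip a registered-trademark sign before comparing.
    normalised = producer.replace("\u00ae", "").lower()
    return any(name.lower() in normalised for name in TESTED_PRODUCERS)
```

A document for which this returns False would be handled as if redaction with Adobe Acrobat might leave remnants.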

Detailed testing results
For a full breakdown and discussion of the testing results, see Appendix A and Appendix B respectively.

[3] https://euske.github.io/pdfminer/
[4] https://poppler.freedesktop.org/
[5] https://www.aldeid.com/wiki/Origami/pdfwalker
[6] https://zeltser.com/pdf-stream-dumper-malicious-file-analysis/
[7] https://www.adobe.com/content/dam/acom/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf

Recommendations
When a requirement exists to redact sensitive or private information from PDF documents before releasing them into
the public domain or to other third parties, organisations should:

• verify the original PDF document was created using Adobe Acrobat, Adobe Acrobat Distiller or Microsoft Word by checking the metadata of the PDF document
• perform redaction and sanitisation of the document using Adobe Acrobat Pro DC 2017.

Further information
A guide to redacting sensitive information from PDF documents, including step-by-step instructions, is available from
Adobe at https://helpx.adobe.com/acrobat/using/removing-sensitive-content-pdfs.html.
The latest PDF standard (ISO 32000-2:2017) is available for purchase from the International Organization for
Standardization (ISO) at https://www.iso.org/standard/63534.html.

Contact details
Organisations or individuals with questions regarding this advice can contact the ACSC by emailing
asd.assist@defence.gov.au or calling 1300 CYBER1 (1300 292 371).

Appendix A: Detailed testing results
Test 1: Redaction of embedded text
The aim of Test 1 was to determine whether remnants of redacted text could be found in PDF documents redacted with
Adobe Acrobat Pro DC 2017.
A Microsoft Word document was created that contained a title and three lines of text. The last two lines represented
sensitive information.

A corresponding PDF document was created using each of the rendering engines being examined: Adobe Acrobat,
Adobe Acrobat Distiller, Microsoft Word, CutePDF and LibreOffice Writer. This represented five PDF documents.
The internal structures of the PDF documents were parsed with the PDF Stream Dumper tool. For each of the PDF
documents, the objects within the file structures that contained embedded text were identified. For example, the
embedded text object from the file generated using Adobe Acrobat is shown below with embedded text highlighted in
green.

Examination of the embedded structural objects within each PDF document revealed that all rendering engines utilised
font subsets to reduce file size. However, in regard to the mapping of character codes to character selectors (glyphs),
different engines used different mechanisms.
Microsoft Word did not embed a CMap but instead used pre-defined WinAnsiEncoding. Adobe Acrobat Distiller only
embedded a CMap for a single character mapping and for the remaining characters utilised pre-defined
WinAnsiEncoding.

Adobe Acrobat embedded a custom ToUnicode CMap within the PDF document [8]. The order of the mappings in the
CMap reflected that in Unicode.

[8] https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf

In contrast, the PDF documents created with CutePDF (which uses the open source Ghostscript) and with the open source
LibreOffice Writer both embedded ToUnicode CMap objects in which the order of mappings reflected the order in which
characters first appeared in the text. The CMap object from the former is shown.

Apart from facilitating the mapping of character codes to character selectors, the CMap created an artefact where the
order of the mappings itself encoded a text string. The string contained the password from the PDF document which is
highlighted in red below.
Test PDFilhdanvcbmurw:@3946$(%)
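
The way such an artefact arises can be illustrated with a minimal Python sketch. It is a simplified model, not the code of any rendering engine: it assumes that one mapping is created per character the first time that character is used, and the function name `first_appearance_order` is hypothetical.

```python
def first_appearance_order(text: str) -> str:
    """Return each distinct character of text once, in order of first use."""
    # dict preserves insertion order (Python 3.7+), so duplicates are
    # dropped while the order of first appearance is kept.
    return "".join(dict.fromkeys(text))
```

Under this model, a title made up of distinct characters is reproduced in full by the order of the mappings, and any later text contributes only the characters that have not been seen before.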

The bottom two lines of text were then redacted from each PDF document with Adobe Acrobat.

The internal structures of the redacted PDF documents were parsed with the PDF Stream Dumper tool. In all cases, the
redacted text was successfully removed from the embedded PDF text objects. For example, the object from the PDF
document produced by Adobe Acrobat is shown below with embedded text highlighted in green.

However, examination of the CMap objects within the redacted PDF documents created by CutePDF or LibreOffice
Writer revealed that remnants of redacted text remained. For example, the CMap object from the redacted PDF
document that was generated with LibreOffice Writer is shown below with the data artefact that reveals the password
string shown in red.

It appears that the redaction functionality of Adobe Acrobat does not identify artefacts of redacted text in CMap
objects when the PDF document was generated by CutePDF or LibreOffice Writer. In this test case, the redacted
password was fully recoverable.
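
The recovery described in this test can be sketched in Python as follows. The sketch assumes the ToUnicode CMap has already been extracted and decompressed into text, and it handles only single-character entries between beginbfchar and endbfchar; range entries (beginbfrange) are ignored. The sample CMap and the names `SAMPLE_CMAP` and `mapping_order_string` are hypothetical and are not taken from the test files.

```python
import re

# Hypothetical ToUnicode CMap fragment, for illustration only.
SAMPLE_CMAP = """
1 begincodespacerange
<00> <FF>
endcodespacerange
4 beginbfchar
<01> <0054>
<02> <0065>
<03> <0073>
<04> <0074>
endbfchar
"""


def mapping_order_string(cmap_text: str) -> str:
    """Join the Unicode targets of bfchar entries in character code order."""
    pairs = []
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # Targets are UTF-16BE code units written as hexadecimal.
            pairs.append((int(src, 16), bytes.fromhex(dst).decode("utf-16-be")))
    return "".join(char for _, char in sorted(pairs))
```

If character codes were assigned in order of first appearance, the returned string reads as the distinct characters of the original text in the order they were first used.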

Test 2: Redaction of text within an embedded image
The aim of Test 2 was to verify that an embedded image containing text within a PDF document was edited by the
redaction process and not simply obscured.
A representation of the text from the PDF document in Test 1 was copied into an image file. The image file was in turn
used to create five separate PDF documents with the rendering engines used in Test 1.
Each PDF document was analysed with the pdf2txt tool which extracts embedded text. In all cases, no extractable text
was found. This was the expected result, given that all text was represented within an embedded image. For example,
the result of running the pdf2txt tool against the PDF document generated by Adobe Acrobat is shown below. To verify
this result, each PDF document was also parsed with the PDF Stream Dumper tool.

The pdfimages tool, which can detect and analyse embedded images, was run against the PDF documents. The tool
successfully identified the embedded images. The output of the tool when run against the PDF document generated by
Adobe Acrobat is shown below.

To test the redaction functionality, the last two lines of the PDF documents (representing sensitive information) were
redacted using Adobe Acrobat. The redacted PDF documents were again analysed with the pdfimages tool which
revealed that the embedded image file had changed in size. For example, the image file in the redacted PDF document
created with Adobe Acrobat had decreased from 22.9 KB to 21.1 KB, indicating it had been edited by the redaction
process.

To verify that the sensitive information within the embedded image file had been successfully redacted, the embedded
image was extracted from the PDF documents using the pdfimages tool and examined. For example, the embedded
image file from the PDF document created with Adobe Acrobat is shown below. It had been edited by the redaction
process to remove the sensitive information.

Test 3: Redaction of historical revisions of text
PDF documents can store historical revisions of edited text. The aim of Test 3 was to verify that all historical revisions of
sensitive text were removed by redaction with Adobe Acrobat.
The PDF documents from Test 1 were opened in Adobe Acrobat. In each case, one of the lines representing sensitive
text was edited multiple times, making sure that each edit was saved.
The PDF documents were then parsed with the pdfwalker tool with the revisions of the PDF documents being reflected
within the file structures. For example, the file generated by Adobe Acrobat is shown below.

Sensitive text was redacted using Acrobat and the PDF documents were again parsed with the pdfwalker tool. The
output of the pdfwalker tool indicated that previous revisions of the sensitive text had been removed. For example, the
output from parsing the PDF document generated by Adobe Acrobat is shown below. Note that the pdfwalker tool always
shows ‘Revision 1’ as an artefact.

This result was verified by using the PDF Stream Dumper tool to count the file objects in the PDF
documents before and after redaction. In the case of the PDF document generated with Adobe Acrobat, the number of file
objects was reduced from 105 before redaction to 18 after redaction.
Use of the PDF Stream Dumper tool to examine the embedded text objects showed that in every case historical
revisions of sensitive text were removed by redaction. In contrast, the data remnant within CMap objects remained for
PDF documents generated with CutePDF or LibreOffice Writer, as was found in Test 1.
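
A rough check of revisions and object counts can also be scripted. Saved edits are typically appended to a PDF file as incremental updates, each ending with its own %%EOF marker, so counting the markers gives an approximate number of revisions. The Python sketch below is a heuristic only and does not reproduce how pdfwalker or PDF Stream Dumper work; the function names are hypothetical. Linearised files can contain an extra %%EOF marker, objects stored inside compressed object streams are not counted, and binary stream data may produce false matches.

```python
import re


def count_eof_markers(data: bytes) -> int:
    """Count %%EOF markers as a rough indicator of saved revisions."""
    return len(re.findall(rb"%%EOF", data))


def count_indirect_objects(data: bytes) -> int:
    """Count 'N G obj' headers that are visible in the raw file bytes."""
    # The lookbehind stops a match from starting in the middle of a number.
    return len(re.findall(rb"(?<!\d)\d+\s+\d+\s+obj\b", data))
```

Comparing both counts before and after redaction gives a quick indication of whether earlier revisions were discarded.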

Test 4: Redaction of text within a PDF form
Using Adobe Acrobat, text was entered into two PDF form fields.

The PDF documents were subsequently analysed with the pdfwalker tool. Text was found in the three objects shown
below. Two of the objects corresponded to the two form fields. The third object with text data corresponded to
the form dictionary.

The embedded text within sections of the form dictionary is shown below. Embedded text is highlighted in green.

For the first test, both the form fields in the PDF document were fully redacted using Adobe Acrobat.

For the second test, the form fields in the PDF document were partially redacted using Adobe Acrobat.

The redacted PDF documents were then parsed with the pdfwalker tool. In both cases, the form objects were deleted
leaving only the object that had contained the form dictionary. The output from parsing the partially redacted PDF
document is shown below.

The contents of the remaining form dictionary objects were further analysed with the PDF Stream Dumper tool and no
text remnants were found. For example, the content of the form dictionary from the partially redacted PDF document
is shown below.

The result was the same for the PDF document where the form fields were fully redacted.

Test 5: Embedded text obscured with an image
On occasion, attempts at redaction have failed when underlying text was merely obscured by covering it with another
layer in the form of a blackened rectangle or image [9].
Starting with the Microsoft Word file used in Test 1, text was obscured by inserting an overlying image file as shown
below. A PDF document was then generated using each of the five rendering engines.

[9] https://www.sharevault.com/resources/glossary/how-to-redact

Using the pdf2txt tool, it was identified that the underlying text remained within the PDF documents despite being
obscured with an overlying image. For example, the output of the pdf2txt tool for the PDF document generated with
CutePDF is shown below.

This demonstrated that obscuring information in a PDF document using an image is not an effective way to redact
information.
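
The reason is that an overlying image is drawn by a separate operation and leaves the text-showing operations in the page content untouched. The Python sketch below illustrates this on a hypothetical, uncompressed content stream; `SAMPLE_STREAM` and `shown_strings` are illustrative names. Real content streams are usually compressed and may use hexadecimal strings or the TJ operator, which this sketch does not handle, and it is not a substitute for the pdf2txt tool used in testing.

```python
import re
from typing import List

# Hypothetical page content: one line of text, then an image (/Im0)
# painted over the same area of the page.
SAMPLE_STREAM = (
    "BT /F1 12 Tf 72 700 Td (Sensitive line one) Tj ET\n"
    "q 200 0 0 20 72 695 cm /Im0 Do Q\n"
)


def shown_strings(content_stream: str) -> List[str]:
    """Return the literal strings passed to the Tj (show text) operator."""
    # Matches '(...) Tj'; backslash escapes are tolerated but not decoded.
    return re.findall(r"\(((?:\\.|[^\\()])*)\)\s*Tj", content_stream)
```

In the sample, the text is still returned even though the image is painted over it.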

Test 6: Redacting encrypted PDF documents
PDF documents from previous tests were encrypted with Adobe Acrobat, redacted and then parsed with PDF analysis
tools. The aim of this test was to determine whether encryption changed the underlying file structure of PDF documents
and, if so, whether this affected the effectiveness of redaction.
For all previous test cases, the results were similar and were unaffected by starting with an encrypted PDF document.
For PDF documents created with CutePDF or LibreOffice Writer, as found previously, remnants of redacted text were
left in the ToUnicode CMap objects. No other remnants of redacted data were found.

Test 7: Sanitising PDF documents
Adobe Acrobat offers a sanitisation feature (i.e. ‘Remove Hidden Information’) that removes hidden data, for example
metadata that might identify the author of a document.
Prior to sanitisation, the PDF documents contained objects with embedded metadata. The PDF documents were each
parsed with the PDF Stream Dumper tool and the metadata analysed. The metadata from the PDF document produced
by CutePDF is shown below. Useful information is highlighted in green and includes the rendering engine (CutePDF or
Ghostscript), the software used to access the rendering engine (Microsoft Word) and the author’s user account
(IEUser).
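
Metadata of this kind is commonly held in a document information dictionary as entries such as /Producer, /Creator and /Author. The Python sketch below pulls those three entries out of a dictionary that has already been extracted as text. The sample dictionary is hypothetical (the Creator value is a placeholder and the Producer string is assumed, not copied from the test file), as are the names `SAMPLE_INFO` and `info_entries`. Strings stored in hexadecimal or UTF-16 form, and metadata held in XMP streams, are not handled.

```python
import re
from typing import Dict

# Hypothetical information dictionary, for illustration only.
SAMPLE_INFO = (
    "<< /Producer (GPL Ghostscript 8.15) "
    "/Creator (Example Creator) /Author (IEUser) >>"
)


def info_entries(raw_dictionary: str) -> Dict[str, str]:
    """Return the Producer, Creator and Author literal strings, if present."""
    pattern = r"/(Producer|Creator|Author)\s*\(((?:\\.|[^\\()])*)\)"
    return dict(re.findall(pattern, raw_dictionary))
```

An empty result after sanitisation would be consistent with the metadata object having been deleted.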

The metadata is also available from the PDF document’s properties within a PDF reader. The metadata from the PDF
document produced by CutePDF and read by Adobe Acrobat Reader is shown below.

To check the efficacy of the sanitisation feature, the PDF documents were sanitised and parsed with the pdfwalker tool.
In all test cases, sanitisation deleted the object containing metadata. For example, the file structure from the PDF
document produced by CutePDF before (left) and after (right) sanitisation is shown below. The object containing
metadata is indicated.

The successful sanitisation of metadata was also confirmed by checking the PDF document’s properties with Adobe
Acrobat Reader as shown below. Empty metadata fields are highlighted.

This test also investigated whether sanitising a PDF document would affect the CMap data remnants identified in
previous tests. The PDF documents that were rendered with CutePDF and LibreOffice Writer were sanitised and the
resultant PDF documents were parsed with the PDF Stream Dumper tool. In both cases, this had no effect on the CMap
remnants in the redacted PDF documents.
Furthermore, this test also investigated whether sanitising a PDF document would remove multiple historical revisions
of text as seen previously in Test 3. When the PDF documents were sanitised and parsed with the pdfwalker tool, it was
evident that historical revisions of text had been removed. The structure of the sanitised PDF document is shown
below.

Appendix B: Discussion of testing results
It was demonstrated that there was a difference between the CMap objects generated by the different rendering
engines. Microsoft Word did not embed a custom CMap and Adobe Acrobat Distiller only did so for a single character.
Adobe Acrobat, CutePDF and LibreOffice Writer all embedded custom CMap objects. Adobe Acrobat did not customise
the order of character code to character selector mappings and instead used the order of mappings as they appear in
Unicode. In contrast, CutePDF using open source Ghostscript and LibreOffice Writer both customised the order of
mappings so that it reflected the order that characters first appeared in text. Thus, the order of mappings in the CMap
objects created an encoding mechanism from which meaningful data could be extracted.
Within CMap data structures, mappings are created the first time a character appears in a per-font context. The CMap
will remain unless all characters of a particular font are deleted from the PDF document. If a single character remains,
the CMap will remain. Thus, if redaction removes all characters that are mapped within a CMap, it can be expected that
the CMap will be deleted and redaction will be successful. This remains to be tested. In this case, the analysis only
partially redacted the text and this left the CMap objects in place.
PDF documents created with Adobe Acrobat, Adobe Acrobat Distiller or Microsoft Word were able to be successfully
redacted by Adobe Acrobat. Parsing of PDF documents failed to identify any remnants of redacted data. This is a result
of the fact that these rendering engines either did not use embedded CMap objects or if they did, the order of
mappings did not reflect the order that characters first appeared in text. In contrast, PDF documents created with
CutePDF or LibreOffice Writer were not successfully redacted. Parsing of these PDF documents found remnants of
redacted data within CMap data structures.
Data remnants that were found in redacted PDF documents were the result of:

• the rendering engine creating CMap objects in which the order of mappings was determined by the order that characters first appeared in text
• redaction failing to reorder the mappings within CMap objects
• redaction failing to delete orphaned mappings for characters that no longer existed
• CMap objects remaining because not all text of a particular font was redacted.
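
The orphaned mappings referred to above can be modelled with a short Python sketch. It works on an abstract model in which the CMap is represented only by its characters in mapping order; the function name `orphaned_characters` and the sample strings in the note below are hypothetical.

```python
def orphaned_characters(cmap_order: str, remaining_text: str) -> str:
    """Return mapped characters that no longer occur in the remaining text."""
    remaining = set(remaining_text)
    # The original mapping order is kept, which is what leaks information.
    return "".join(ch for ch in cmap_order if ch not in remaining)
```

For a hypothetical document whose text was "Title key: x9$", the mapping order would be "Title ky:x9$". If only "Title " survives redaction, the orphaned mappings read "ky:x9$": most of the redacted text, in order, with only the characters already used by the title missing.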

This represents a security vulnerability which occurs if the following pre-conditions are met:

• the PDF document was rendered by CutePDF (Ghostscript) or LibreOffice Writer
• Adobe Acrobat was used to redact the PDF document.

The PDF standard was checked for a requirement in regard to input characters and the order of mappings within a
CMap object but none was found. Neither was a requirement found within the Adobe CMap and CIDFont Files
Specification [10]. Adobe has released a developer technical note that specifies that the order of character selectors in a
CMap must be in increasing byte order but not how the order of mappings relates to input characters [11].
If the relationship between input characters and the order of mappings within a CMap is arbitrary, it explains the
results in this document and means that custom mechanisms are not prohibited by any specification. It might also
mean that the redaction software should be responsible for identifying and removing artefacts of redacted data from a
CMap object. In this regard, the Application Software Extended Package for Redaction Tools protection profile [12],
published by the National Information Assurance Partnership (NIAP), is ambiguous. There is a requirement for the
target of evaluation to remove all references and indicators in the structural data to objects that are ‘completely
redacted’. If text is partially redacted on a per-font basis, this could mean that data remnants in a CMap would be
allowed. The protection profile has not been assigned to any software products.

[10] https://www.adobe.com/content/dam/acom/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf
[11] https://www.adobe.com/content/dam/acom/en/devnet/font/pdfs/5099.CMapResources.pdf
[12] https://www.niap-ccevs.org/Profile/Info.cfm?id=390

Since CMap objects are created on a per-font basis, the likelihood of recovering remnants increases the closer the
redacted text is located to the first incidence of a character from a particular font. In addition, it was observed during
testing that mappings within CMaps were unique, although there is no requirement that this be the case. As a result, no
matter how much previous text exists on a per-font basis, the likelihood of recovering remnants increases if the
redacted text is composed of unique characters that occur for the first time. The likelihood of recovering remnants also
increases as the amount of text in the PDF document decreases.
It is simple to demonstrate this security vulnerability in a test environment. In real-world PDF documents however, the
likelihood of recovering redacted text from CMap objects would vary. Real-world examples that could be vulnerable to
the recovery of remnants might include:

• redacted text occurring at the beginning of a paragraph where a font is used for the first time (e.g. at the beginning of information quoted from another source which is highlighted via use of a different font)
• redacted text that is a password, key or passphrase composed of unusual characters that occur nowhere else within the PDF document
• a small PDF document with very little text.

This security vulnerability could be mitigated if Adobe Acrobat’s redaction functionality:

• randomised the order of mappings in CMap objects or used a pre-existing order such as that found in Unicode
• parsed CMap objects and deleted orphaned mappings.
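
Both mitigations can be sketched on an abstract model of a CMap as a table from character code to character. The Python sketch below is an illustration only and does not describe how Adobe Acrobat works; the function name `prune_and_reorder` is hypothetical. It drops mappings whose characters no longer occur and assigns new codes in Unicode order. A real implementation would also have to rewrite the character codes in the page content and in the embedded font subset, which is outside the scope of the sketch.

```python
from typing import Dict


def prune_and_reorder(mapping: Dict[int, str], remaining_text: str) -> Dict[int, int]:
    """Return an old-code to new-code table for the surviving characters."""
    remaining = set(remaining_text)
    # Delete orphaned mappings: keep only characters that are still used.
    used = {code: ch for code, ch in mapping.items() if ch in remaining}
    # Assign new codes in Unicode order rather than first-appearance order.
    ordered = sorted(used, key=lambda code: ord(used[code]))
    return {old: new for new, old in enumerate(ordered, start=1)}
```

After such a renumbering, the order of mappings reflects only which characters remain in the document, not the order in which they first appeared.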

This security vulnerability could also be mitigated if the CutePDF and LibreOffice Writer rendering engines changed the
order of mappings in CMap objects so that it did not reflect the order in which characters first appear in text.
Due to the above, the highest assurance that remnants of redacted data will not remain in PDF documents requires
organisations to:

• verify the original PDF document was created using Adobe Acrobat, Adobe Acrobat Distiller or Microsoft Word by checking the metadata of the PDF document
• perform redaction and sanitisation of the document using Adobe Acrobat Pro DC 2017.
