Search This Blog

Friday, 12 March 2021

Is PDF accessible?

Overview

The Portable Document Format (PDF) is a file format developed by Adobe Systems. PDF makes it possible to distribute documents with original formatting intact. PDF files are created by scanning an original print document or by using a variety of popular software applications. 

Accessibility

The popularity of PDF has created concerns about accessibility, particularly for users of screen readers and for those who have low vision. While Adobe has taken steps to permit access to those who use screen readers, it is essential that documents be correctly marked up (commonly referred to as “tagged”) so that screen readers have the information they need to identify items such as headings and alt text for images. Tables must also be marked up so that screen reader users can navigate them and clearly understand the association of data with appropriate column and row names.

Tagged PDF

Few authors are currently creating tagged PDF files, either because this requires additional effort or because of lack of awareness. Authors are also limited by the capabilities of their word processing or desktop publishing tools, many of which have PDF export capabilities that do not currently support tagged PDF. Microsoft Office, particularly with its most recent versions, does provide good PDF exporting, assuming that appropriate styles are used when first creating a document in Word.

Available Documentation

Adobe provides accessibility documentation at adobe.com/accessibility. Among other resources available from this site, Adobe has developed a variety of Acrobat accessibility training resources that describe in detail the process of creating accessible PDF documents using Word, InDesign, and Acrobat. 

Support In Operating Systems

PDF accessibility also requires support from operating system and assistive technology developers. In Microsoft Windows, both JAWS and NVDA support tagged PDF. However, there is currently no support for tagged PDF in other operating systems.

Is PDF the Correct Choice of Format

Despite advances in accessibility, many users and advocacy groups continue to recommend that PDF documents be accompanied, or replaced, by alternative format documents that are more universally accessible, such as HTML. PDF unfortunately is still not indexed as well as HTML and so if content is to be used for SEO then it is often converted to HTML. 

Adobe PDF Base-14 Fonts

 

A number of fonts are included with Adobe Acrobat and therefore don't need to be embedded in PDF files. In our products Impress, Impress Pro and TOCBuilder these fonts are marked in the font lists in Red:

  • 4 font sets in the Helvetica family: normal, bold, and bold italic, with any size. XSL-FO "sans-serif" font family is normally mapped to "Helvetica".
  • 4 font sets in the Times family: normal, bold, and bold italic, with any size XSL-FO "serif" font family is normally mapped to "Times".
  • 4 font sets in the Courier family: normal, bold, and bold italic, with any size. XSL-FO "monospace" font family is normally mapped to "Courier".
  • 1 font sets in the Symbol family: normal, with any size. "Symbol" is normally used for Greek alphabets and some symbols like: Ω, φ, ≠, ©.
  • 1 font sets in the ZapfDingbats family: normal, with any size. "ZapfDingbats" is normally used for Zapf dingbats like: ✌, ✍, ❀, ☺.

Camelot Project - the Precursor to PDF and Acrobat

 The Camelot Project

 J. Warnock

This document describes the base technology and ideas behind the project named “Camelot.” This project’s goal is to solve a fundamental problem that confronts today’s companies. The problem is concerned with our ability to communicate visual material between different computer applications and systems. The specific problem is that most programs print to a wide range of printers, but there is no universal way to communicate and view this printed information electronically. The popularity of FAX machines has given us a way to send images around to produce remote paper, but the lack of quality, the high communication bandwidth and the device specific nature of FAX has made the solution less than desirable. What industries badly need is a universal way to communicate documents across a wide variety of machine configurations, operating systems and communication networks. These documents should be viewable on any display and should be printable on any modern printers. If this problem can be solved, then the fundamental way people work will change. 

The invention of the PostScript language has gone a long way to solving this problem. PostScript is a device independent page description language. Adobe’s PostScript interpreter has been implemented on over 100 commercially available printer products. These printer products include color machines, high resolution machines, high speed machines and low-cost machines. Over 4000 applications output their printed material to PostScript machines. This support for PostScript as a standard make the PostScript solution a candidate for this electronic document interchange. 

Within the PostScript and Display PostScript context the “view and print anywhere” problem has been implemented and solved. Since most applications have PostScript print drivers, documents from a wide variety of applications can be viewed from operating systems that use Display PostScript. PostScript files can be shipped around communication networks and printed remotely. “Encapsulated PostScript” is a type of PostScript file that can be used by many applications to include a PostScript image as part of a page the application builds. 

The reason the Display PostScript and PostScript solutions are not a total solution in today’s world is that this solution requires powerful desktop machines and PostScript printers. The Display PostScript and PostScript solutions are the correct long-term solution as the power of machines increases over time, but this solution offers little help for the vast majority of today’s users with today’s machines. 

The Camelot Project is an attempt to define technologies and products that will give the value that Display PostScript and PostScript delivers to the vast number of installed machines that exists today. For the purposes of this discussion these machines include 640K Intel 286/386/486 machines (PC compatibles), Apple Macintosh machines, mainframes, and workstations. The displays must include CGA, EGA, VGA and any other higher resolution or color displays supported by the above machines.

Our vision for Camelot is to provide a collection of utilities, applications, and system software so that a corporation can effectively capture documents from any application, send electronic versions of these documents anywhere, and view and print these documents on any machines. 

There are at least two technical approaches to the Camelot project. Both solutions depend on the PostScript technology. One approach is to try to make Display PostScript and PostScript implementations smaller and faster so that they can run on the vast majority of today’s machines. This approach has been tried and is extremely difficult. 

A second approach is to divide the problem into smaller problems. This approach would allow each piece to run independently on the smaller machines while achieving acceptable performance and a solution for the complete problem. This latter approach requires that the problem be divided in a way that is natural for users, and provides a solution for every user. An approach to the Camelot project will now be described that will divide the problem into smaller pieces. This solution depends on a unique property of the PostScript language. 

PostScript, as an interpretive language, has some properties that other interpretive languages do not have. In particular, the semantics of operators is not fixed. Operators can be redefined to have any desired behavior. This property of PostScript allows the execution of a PostScript file to have side effects that are very different from the normal printing of a page. An example might be instructive. Suppose a PostScript file draws 10 sided polygon with the following PostScript procedure: 

/poly 

    {1 0 moveto 

        /ang 36 def 

        10 {ang cos ang sin lineto 

         /ang ang 36 add def 

     }repeat 

 }def 

This procedure will build a path that is a ten sided polygon. In this procedure the verbs: “moveto” and “lineto” have the standard semantics of building a PostScript path within the PostScript Language. 

By redefining “moveto” and “lineto” very different things can happen. For example, if these operators are defined as follows: 

/moveto 

    {exch writenumber writenumber (moveto) writestring}def 

/lineto 

    {exch writenumber writenumber (lineto) writestring}def 

then when the “poly” procedure is executed a file is written that has the following contents: 

1.0 0.0 moveto 

0.809 0.588 

lineto 0.309 0.951 

lineto -0.309 0.951 

lineto -0.809 0.588 

lineto -1.0 0.0 

lineto -0.809 -0.588 

lineto -0.309 -0.951 

lineto 0.309 -0.951 

lineto 0.809 -0.588 

lineto 1.0 0.0 

lineto In this example the new redefined “moveto” and “lineto” definitions don’t build a path. Instead they write out the coordinates they have been given and then write out the names of their own operations. The resulting file that is written by these new definitions draws the same polygon as the original file but only uses the “moveto” and “lineto” operators. Here, the execution of the PostScript file has allowed a derivative file to be generated. In some sense this derivative file is simpler and uses fewer operators than the original PostScript file but has the same net effect. We will call this operation of processing one PostScript file into another form of PostScript file “rebinding.“ 

The above example illustrates a capability of the PostScript language that is not frequently used. This “rebinding” of the language, however, is extremely valuable. The Camelot project depends on variations on this idea. 

The approach we will take with Camelot is to define a new language of operators and conventions. For the purposes of this discussion we will call this language “Interchange PostScript” or IPS. IPS will primarily contain the graphics and imaging operators of PostScript. The language will be defined so that any IPS file is a valid PostScript file. The file will have the appropriate baggage so that it is a valid EPS file. IPS files will print on PostScript printer and will be able to be used by applications that accept EPS files. IPS will also be structured so that the complete PostScript parser is not necessary to read any file written in IPS. IPS will have an adequate set of operators so that any practical document expressed in PostScript can be represented in IPS. There will be situations in IPS where the IPS file cannot represent visual situations that can be theoretically generated in PostScript. However we believe these situations are extremely rare, and all practical application documents can be represented efficiently in IPS. The right way to think about IPS is as it relates to English. No person in the world knows every English word, but a small subset of the English words, and certain usage patterns enable people to consistently communicate. 

Once we have defined IPS, we will build a version of the PostScript interpreter (IPS binder) that will read any PostScript file and rebind that file into an IPS file. The IPS binder can be quite small in that it does not need the graphics, font or device machinery contained in full PostScript interpreter. Another function of the IPS binder will be to include reconstituted fonts into the IPS file. The idea here is to include just the characters of a font that are actually used in the document. A result of including the necessary characters from the fonts used is that an IPS file will be completely self contained. In other words, when I send a file around the country, I don’t have to worry about whether the receiving location has all the fonts required by the document. The current situation is that complex font substitution schemes are used to deal with locations not having the appropriate fonts. 

Once IPS is defined and the IPS binder implemented, then users can capture any PostScript file emitted by a PostScript driver, and convert that file to a self contained IPS file. This file can be shipped anywhere around the network and printed on any PostScript machine (management utilities will be written to ease this printing process.) 

In addition to the IPS binder, a viewer and browser will be written that will read IPS files, and render those files on displays or to dumb raster printers. It is believed that IPS interpreters can be substantially simpler, and smaller than full PostScript interpreters. It is also believed that an IPS interpreter can have acceptable performance on small machines. The real hope is to make the IPS viewer and browser small enough so that it can co-exist with other applications. It is interesting to think about what those applications can be. 

One obvious application for the IPS viewer is in its use in electronic mail systems. Imagine being able to send full text and graphics documents (newspapers, magazine articles, technical manuals etc.) over electronic mail distribution networks. These documents could be viewed on any machine and any selected document could be printed locally. This capability would truly change the way information is managed. Large centrally maintained databases of documents could be accessed remotely and selectively printed remotely. This would save millions of dollars in document inventory costs. 

Specific large visual data bases like the value-line stock charts, encyclopedias, atlases, Military maps, Service Manuals, Time-Life Books etc. could be shipped on CD-ROM’s with a viewer. This would allow full publication (text, graphics, images and all) to be viewed and printed across a very large base of machines. 

Imagine if the IPS viewer is also equipped with text searching capabilities. In this case the user could find all documents that contain a certain word or phrase, and then view that word or phrase in context within the document. Entire libraries could be archived in electronic form, and since IPS files are self-contained, would be printable at any location. 

One of the central requirements of the Camelot Project is that the IPS file format is device independent. This is essential because it is necessary to be able to print the documents on color or black and white machines — on low or high resolution machines. This requirement is also essential in order to visualize the documents at various magnifications on the screen. For example, it is imperative that the user be able to magnify portions of complex maps, so that subportions of the image are easy to read even on low resolution displays. 

To accomplish the above requirement it is necessary that consistent font rendering machinery be available to the viewer. For this reason the viewers will need to contain the full ATM implementations as part of each system. 

In considering all the requirements of corporations regarding documents, it is important to structure Camelot components so that they can be sold in ways that are useful to the corporations. Several ideas have come to mind. 

Components of Camelot are generally not interesting to single users. The exception to this is in the distribution of large generally useful databases. If someone produced a CD-ROM with “maps of the world” on it, then one can imagine selling a retail package with one viewer and the CD-ROM. 

In most other applications, the distribution of information is to many people. In these latter cases a corporation would like a copy of the viewer for every PC. One can imagine viewers integrated into mail systems, or as general stand-alone browsing systems. In any event corporations should be interested in site-licensing arrangements. (more to come)

History of PDF


A Short History of PDF

Adobe Systems made the PDF specification available free of charge in 1993. In the early years PDF was popular mainly in desktop publishing workflows and the first PDF Export was created for PageMaker 5 by Mapsoft. PDF competed with a variety of formats such as DjVu, Envoy, Common Ground Digital Paper, Farallon Replica and even Adobe's own PostScript format.

Released as an ISO standard

PDF was a proprietary format controlled by Adobe until it was released as an open standard on July 1, 2008, and published by the International Organization for Standardization as ISO 32000-1:2008,[5][6] at which time control of the specification passed to an ISO Committee of volunteer industry experts. In 2008, Adobe published a Public Patent License to ISO 32000-1 granting royalty-free rights for all patents owned by Adobe that are necessary to make, use, sell, and distribute PDF-compliant implementations.[7]

PDF 1.7, the sixth edition of the PDF specification and the version accompaning Acrobat version 8 became ISO 32000-1, includes some proprietary technologies defined only by Adobe, such as Adobe XML Forms Architecture (XFA) and JavaScript extension for Acrobat, which are referenced by ISO 32000-1 as normative and indispensable for the full implementation of the ISO 32000-1 specification. These proprietary technologies are not standardized and their specification is published only on Adobe's website and many of them are also not supported by popular third-party implementations of PDF.

In December, 2020, the second edition of PDF 2.0, ISO 32000-2:2020, was published, including clarifications, corrections and critical updates to normative references.[13] ISO 32000-2 does not include any proprietary technologies as normative references.[14]

Information taken in part from Wikipedia

https://www.mapsoft.com


Thursday, 21 May 2020

Summary of the Structure of PDF files

PDF can be looked upon as a combination of different file types presented in a single container. The reason for this is that a PDF file contains Text, vector art, images, fonts and other file format can be embedded - even the native files that were used to create the PDF in the first place.

An object orientated file format with were items can be connected directly or indirectly to each other. 



The objects within a PDF file can be divided into the following types:

Dictionaries

A group containing direct or references to indirect objects. Dictionaries can be seen as the glue holding together the elements in a PDF files. The example below shows the structure of a typical page dictionary:



The Contents stream has an attributes dictionary that contains a filter name and the length of the stream
The CropBox array contains the coordinates of the rectangle that defines the area that is visible on the page.
The MediaBox array contains the coordinates of the rectangle that defines the media size. This will typically match a standard media size such as Letter or A4 and will allow the PDF page to be reliably printed on a device that contains these standard media sizes.
The Resources dictionary contains references and information for elements that are needed to reliably output the visual elements of the page such as colors, fonts and Images.
 
Streams

The collection of operators outputting information onto the page. Normally the stream will also require elements of the page resources dictionary such as colors and fonts. Streams are either stored as a single element or in an array.

q
567.48 61.011 -540 720 re
W* n
q
/GS0 gs
0 720 -541.1399536 0 567.4799194 61.0105438 cm
/Im0 Do
Q
Q
/CS0 cs 0.302 0.302 0.302  scn
1 i 
/GS1 gs
56.7 286.911 m
56.7 295.191 56.7 303.471 56.7 311.751 c
59.1 311.751 61.5 311.751 63.9 311.751 c
63.9 306.831 63.9 301.911 63.9 296.991 c
65.88 296.991 67.8 296.991 69.72 296.991 c
69.72 301.191 69.72 305.391 69.72 309.591 c
72 309.591 74.22 309.591 76.5 309.591 c
76.5 305.391 76.5 301.191 76.5 296.991 c
81.06 296.991 85.62 296.991 90.18 296.991 c
90.18 293.631 90.18 290.271 90.18 286.911 c
79.02 286.911 67.86 286.911 56.7 286.911 c
f*

You can see that there are several references to items in the page resources dictionary:
GS0 is a reference to a graphics state and gs is the operator that sets it.
Im0 is an XObject image and the Do operator draws the image.
CS0 is a reference to a color dictionary and the scn operator assigns it to strokes.

You can also see usage of several path operators re - rectangle, m - moveto, c - curve f* - fill.

Text strings

These can either be ANSI (single byte characters) or Unicode (multi-byte). The example here is the representation of the last date modified in the catalog dictionary.





Images

Images are normally held within the page resources and the stream will also have an associated Attributes dictionary that will describe the attributes of the data within the stream. BitsPerComponent size of the data that is used to define a single pixel (dot) within the image. The ColorSpace dictionary describes the colour model that is used to define the colors within the image.



Names

Used normally to provide a name that can be used to refer to a dictionary or dictionary item. For example, the pages dictionary has a name "Type" with the value "Pages" and a single page has a name of "Type" with a value of"Page".




Arrays

Fixed length data holding types and/or references to other elements. For an example see the Real Numbers example below.

Real numbers

Decimal numbers. In this example they are being used to define the rectangle of the page media box:


Integers

Whole numbers. For example to show the total number of  pages in the PDF file.



For further details see the PDF Specification at https://www.adobe.com/devnet/pdf/pdf_reference.html

Contact:

Michael Peters

Wednesday, 20 May 2020

Understanding of Colour and Colour models

There are a number of color  models but I am only going to cover 2 here as they are the most often used. 

RGB

This color model is primarily used to describe light. It is used mainly in cameras and scanners. It has 3 color elements that when added together at 100% represent white or pure light. The 3 different colors are Red, Green and Blue. The color model is almost infinite in its range and this in itself is ok until printing is required and that printing is being done through the CMYK color model. The model uses 3 values with each being in a range between 0 and 255 as in the Windows and applications such as Photoshop or as a decimal number up to a maximum of 1 in PDF for example. 

RGB is an additive color model. Adding all of the colors in equal amounts will result in white.

RGB Colour merge
In the web world RGB colours are represented by hex number combinations (the numbering system is ). So for example Red would be #FF0000, Green would be #00FF00 and Blue would be #0000FF. Black is #000000 and White is #FFFFFF. 

CMYK

Cyan/Magenta/Yellow/Knockout used primarily in printing.

The colors are created by printing the colors on top of each other to achieve the required shades. There may may overlaps required on the edges (trapping) to ensure that spaces are not seen as different paper types can expand and shrink when the ink/toner is applied. The color model is much more limited in its range than RGB and therefore care needs to be taken when converting from RGB to CMYK. This can be achieved through color management systems, adding additional colors to the printing run (such as Hexachrome) or using Spot colors that are usually already mixed colors such as Pantone. Printing is effected bu the resolution of the input and output and the paper stock that is being used to print onto both in the surface quality and base color of the media type and also the attributes of the inks being used. Additionally output effects and colors can be modified and enhanced through varnishes such as UV and foils to provide metallic effects.

The model uses 4 values each as a percentage of the 

CMYK is a subtractive color model. Adding all of the colors in equal amounts results in black. However in CMYK this will more than likely result in a dirty color and so with the addition of the K in CMYK the printers also have a real black in order to print a true black.

CMYK Colour merge

This is a simple look at color and I will expand on this in a future blog.

Contact info:

Michael Peters

Tuesday, 19 May 2020

What is an Acrobat Plug-in?


A way for software developers to add additional functionality to Acrobat or to modify current functionality.

Why are plug-ins required?

Adobe provides a product that is intended to be used across multiple industries and organisations. Supporting all multiple vertical markets bloats the application in proving features that would only be used by relatively few people when compared with the whole Acrobat market.

Can Acrobat plug-ins be used in the Adobe Reader?

Special support needs to be added to the plug-in so that it can run under Adobe Reader. However the Reader plug-in will require a special license and needs to go through an approval process with Adobe Systems Inc. - https://www.adobe.com/devnet/reader/ikla.html.

Are plug-ins specific to a particular version of Adobe Acrobat?

We have plug-ins that we developed for Acrobat 6 that still run without modification in Acrobat DC. However, if new features are used that are specific to a later version then it won't work under later versions. If earlier versions used the Adobe Dialog Manager (ADM) then they won't now work in current versions of Acrobat.

Examples of Plug-ins
  • New security handlers that might be specific to a particular organisation. For example, we have developed security handlers that do not allow PDF files to be viewed outside a particular organisations offices. 
  • New annotations. For example, we created a plug-in that supported all of the British Standard Markups.
  • Flattening annotations and form fields into the main document. This ensured that they could not be changed or modified and that they would print as part of the document even if the printing of annotations was switched off.
  • Adding text and images to PDF files.
  • Creating a table of Contents for PDF files
  • Adding fields for variable data printing
  • Hardware integration of Adobe Acrobat into whiteboards and interactive tables

Contact Info:

Michael Peters