Preface: I wish to acknowledge that this article was written with full reference to. Most of what I have learned about PDFs comes from the above reference. If you are really interested, take the time to read it. Surprisingly, it is easy and interesting to read!
I am writing this tutorial out of my interest in understanding the PDF specification. My quest started when I tried hard, but failed, to extract text from a simple PDF file that contained a single page of text. Please let me know if you find any errors. I have relied on the PDF specification (link on page top) to create this tutorial. This tutorial covers PDF files conforming to the ISO 32000-1 specification (pages vi to viii of PDF.pdf give more information on this).
' At the core of PDF is an advanced imaging model derived from the PostScript® page description language. This PDF Imaging Model enables the description of text and graphics in a device-independent and resolution-independent manner. To improve performance for interactive viewing, PDF defines a more structured format than that used by most PostScript language programs. Unlike Postscript, which is a programming language, PDF is based on a structured binary file format that is optimized for high performance in interactive viewing. PDF also includes objects, such as annotations and hypertext links, that are not part of the page content itself but are useful for interactive viewing and document interchange.'
(Quoted from page vii of PDF.pdf.)
White-space characters: Null, Horizontal tab, Line feed, Form feed, Carriage return and Space. Tip: If you need to remember them, imagine them being typed in sequence (except for Null, of course) on an imaginary typewriter. White-space characters separate names and other objects from each other. Interestingly, PDF treats all white-space characters outside a comment, string or stream as equivalent. Outside a comment, string or stream, PDF also treats any sequence of consecutive white-space characters as a single character. This means that five spaces in a row count as one.
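The white-space rule can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `split_tokens` is made up for this tutorial, and the code deliberately ignores comments, strings and streams, where the rule does not apply.

```python
import re

# The six PDF white-space characters:
# Null, Horizontal tab, Line feed, Form feed, Carriage return, Space.
WHITESPACE_RUN = re.compile(rb"[\x00\t\n\x0c\r ]+")


def split_tokens(data):
    """Split bytes on runs of PDF white-space.

    A run of any length acts as a single separator. Comments, strings
    and streams are NOT handled here (assumption made for brevity).
    """
    return [token for token in WHITESPACE_RUN.split(data) if token]


# One space, or a mix of tabs, EOLs, a Null and spaces: same tokens.
print(split_tokens(b"1 0 obj"))
print(split_tokens(b"1\t\t0\r\n\x00  obj"))
```

Both calls print the same three tokens, which is the point: the amount and kind of white-space between objects makes no difference.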
Note that this collapsing does not apply to white-space characters within strings, streams and comments.
The Carriage return and Line feed characters are end-of-line (EOL) markers. EOL markers show where one line ends and a new one starts. A Carriage return followed immediately by a Line feed is treated as a single EOL marker.
An interesting fact to note is that a PDF file may consist entirely of ASCII characters, or of a mix of ASCII characters and binary data. In simple terms, characters in ASCII files use only 7 of the 8 bits in a byte, while bytes in binary files use all 8 bits. This allows 128 unique values for ASCII files and 256 unique values for binary files. Most PDF files that are encrypted or contain images will have binary data (images are represented in binary). PDF files that contain binary data can get corrupted when they are edited, or even just opened and saved, in ordinary text editors like Notepad.
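To make the 7-bit versus 8-bit distinction concrete, here is a small sketch that checks whether some bytes are pure ASCII. The function name `is_seven_bit` and the sample bytes are invented for illustration; they are not taken from the specification.

```python
def is_seven_bit(data):
    """Return True if every byte fits in 7 bits (a value from 0 to 127)."""
    return all(byte < 128 for byte in data)


# Hypothetical samples: the second one contains bytes above 127.
ascii_only = b"%PDF-1.4\n1 0 obj\n"
with_binary = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

print(is_seven_bit(ascii_only))   # True
print(is_seven_bit(with_binary))  # False
```

A file for which this check fails should be treated as binary and opened with a hex editor or in binary mode rather than in a plain text editor.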
It is critical that we understand the difference between ASCII and binary files, as other areas of the PDF specification touch on them.
Boolean objects: The keywords true and false.
Numeric objects: There are two types: integer and real. Integers are numbers without a decimal point and may be preceded by a + or - sign. For instance, the number 10.
Real numbers must have a decimal point. For instance, 10.5, 0.0, +5.0, -1.0. Real numbers cannot be expressed in exponential format.
String Objects: Strings contain characters (possibly zero characters).
They can be literal characters within parentheses or hexadecimal data within angle brackets. Notice that the parentheses and angle brackets are delimiter characters.
(I love Java (and PDF))
There are escape sequences that can be used. Refer to the PDF specification for more details. The sequence \ddd, where ddd is an octal character code, can be used to represent characters outside the printable ASCII character set. Be aware that some of the escape sequences, especially the ones that move the text position, for instance \n (newline), did not have any visual effect when I added them to a string displayed by the PDF.
You can replace one of the characters in the string I have used in the sample PDF and notice no difference. We can also use octal character codes (usually to represent characters outside the printable ASCII character set) within parentheses. An octal character is written in the format \ddd. In the following example \052 represents * (asterisk), a printable ASCII character.
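As a rough illustration of how a reader might decode a literal string, here is a minimal sketch. The function name `decode_literal_string` is made up for this tutorial. It handles the single-character escapes and \ddd octal codes, and it assumes the input is one complete string including its outer parentheses. It does not handle a backslash at the end of a line (line continuation).

```python
# Single-character escapes: the byte after the backslash -> resulting byte.
SIMPLE_ESCAPES = {
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09,
    ord("b"): 0x08, ord("f"): 0x0C,
    ord("("): ord("("), ord(")"): ord(")"), ord("\\"): ord("\\"),
}
OCTAL_DIGITS = b"01234567"


def decode_literal_string(raw):
    """Decode a PDF literal string given with its enclosing parentheses."""
    if not (raw.startswith(b"(") and raw.endswith(b")")):
        raise ValueError("a literal string is delimited by parentheses")
    body = raw[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        byte = body[i]
        if byte != 0x5C:            # not a backslash: copy as-is
            out.append(byte)        # (balanced inner parentheses land here)
            i += 1
            continue
        i += 1                      # skip the backslash
        if i >= len(body):
            break
        nxt = body[i]
        if nxt in OCTAL_DIGITS:     # \ddd: one to three octal digits
            j = i
            while j < len(body) and j < i + 3 and body[j] in OCTAL_DIGITS:
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        else:                       # known escape, or ignore the backslash
            out.append(SIMPLE_ESCAPES.get(nxt, nxt))
            i += 1
    return bytes(out)


print(decode_literal_string(b"(I love Java (and PDF))"))
print(decode_literal_string(rb"(I love Java \052 PDF)"))
```

The first call shows that balanced parentheses inside a string need no escaping. The second shows \052 turning into an asterisk.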
Stream Objects: Streams are similar to strings except that they can be of unlimited length. They are usually used to represent large amounts of data that would not fit into a string. An image, for instance, can be represented as a stream. As you will see later, the contents of each page in a PDF are represented as a 'Contents' stream. A stream consists of a dictionary followed by the keyword ‘stream’, a newline, the stream’s data and the keyword ‘endstream’.
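Here is a minimal sketch of reading the data out of such a stream object. It assumes that the dictionary is written between << and >>, that it carries a Length entry (written /Length) holding a plain number with the byte count of the data, and that no filter is applied. The function name `read_stream` and the sample bytes are invented for illustration.

```python
import re


def read_stream(obj):
    """Return the raw bytes between 'stream' and 'endstream'.

    Assumes /Length is a direct number (not a reference to another
    object) and that the data is not filtered.
    """
    match = re.search(rb"/Length\s+(\d+)", obj)
    if match is None:
        raise ValueError("no direct /Length entry found")
    length = int(match.group(1))

    start = obj.index(b"stream", match.end()) + len(b"stream")
    # The keyword is followed by CR LF or a lone LF before the data.
    if obj[start:start + 2] == b"\r\n":
        start += 2
    elif obj[start:start + 1] == b"\n":
        start += 1

    data = obj[start:start + length]
    rest = obj[start + length:].lstrip(b"\r\n")
    if not rest.startswith(b"endstream"):
        raise ValueError("Length does not lead to the 'endstream' keyword")
    return data


# Hypothetical stream object with 11 bytes of data.
sample = b"<< /Length 11 >>\nstream\nHello world\nendstream"
print(read_stream(sample))  # b'Hello world'
```

Using the stated length instead of searching for the word endstream matters because the data itself may happen to contain those bytes.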
dictionary
stream
... (stream data) ...
endstream
While the stream holds the data, the dictionary contains information about the stream itself. Here are the keys that are common to most stream dictionaries. Of these, the only mandatory key is Length.
Length - Mandatory entry that contains the length (number of bytes) of the stream. An error occurs if the stream has more bytes of data than the length given in the dictionary.
Filter - The name(s) of the filter(s) used to decode the data in the stream. Can be a name or an array. The filters are applied in the order supplied.
DecodeParms - Parameter dictionary used by the filters. Can be a dictionary or an array. Parameters are values used by the filters. If a filter uses its default parameters, this entry can be skipped.
F - From PDF specification version 1.2, the stream contents can be stored in an external file. This entry names the file where they are stored. In this case the contents between stream and endstream are ignored.
FFilter - Similar to the Filter entry but for the stream's external file.
FDecodeParms - Similar to DecodeParms but for the filters of the stream's external file.
DL - An approximate size of the contents (after decoding) in the stream. This helps determine whether there is enough disk space for the stream.
Beginning with version 1.4, the version given in the document's catalog dictionary, if present, is used instead of the version in the file header. If the file has binary data, there will be at least four binary characters (bytes with values of 128 or more) in a comment line immediately after the header.
This is to show PDF reading applications (like Adobe Reader) that the PDF has binary data. Again, when opening some PDF files in their raw form (as in a text editor) you may notice the four binary characters just after the first comment.
File Body: The file body consists of indirect objects (discussed earlier). These objects represent text and other details (like font type etc.) used in displaying the PDF. As of version 1.5, the body can also contain object streams.
Cross-Reference Table: This table is similar to a directory.
It contains the location of each object within the PDF file. By looking at the entries in this table, the PDF reading application (for example, Adobe Reader) can easily locate an object within the file.
This saves time because an object can be accessed directly (random access) rather than by reading every line of the file. The cross-reference table can have one or more sections.
Each of these sections can have one or more subsections. Note: From PDF version 1.5 onward the cross-reference table can be stored as a stream and if so you will not be able to view the table as shown below when opening with a text editor.
Each section begins with a line containing the keyword 'xref'. Following this line are two numbers separated by a single space. The first number is the object number of the first object in the series of objects listed below it.
The second number is the number of entries in that subsection. For a PDF file that has just been created, or one that has not been incrementally updated, there shall be only one subsection and the object numbering starts with 0. Notice that the object numbers within a subsection have to be consecutive. In the example below, it is safe to assume that entries for objects 1, 2, 3 & 4 will follow.
xref
0 5
Each entry has the following format:
nnnnnnnnnn ggggg m eol
nnnnnnnnnn - This is a ten-digit value. It reveals how far the object is from the start of the file. For instance, the value 100 denotes that the object is 100 bytes from the start of the file.
ggggg - A 5-digit generation number.
m - Can be either 'n' or 'f'. 'n' denotes that the object is still in use and 'f' denotes that the object has been deleted and is free.
eol - End of line. Consists of 2 characters.
The ten digits, followed by a space, followed by five digits, followed by a space, followed by a single character and the eol make exactly 20 bytes. If the first two numbers are not long enough to be ten and five digits respectively, zeroes are added to the front.
Let's come back to the 0 5 that we saw in the example earlier. The 0 denotes the object number of the first object in this subsection.
The value 5 denotes that there are 5 entries (including the one for object 0) and that the remaining four entries are for objects with object numbers 1, 2, 3 and 4.
The first entry in the cross-reference table is for object 0. Object 0 will have 0000000000 as its first ten digits (if there are no other free objects) and will always have 65535 as its 5-digit generation number. It shall also have 'f' as the character.
The 0 4 denotes that there are four entries: the entry for object 0 followed by entries for objects 1, 2 & 3. The first ten digits of the entry for object 0 point to the next free object, which is object 3.
If there had been another free object, then the 4th entry would have held the object number of that next free object. In this case, as there are no other free objects, it points back to object 0.
Objects 1 & 2 are 15 & 75 bytes (respectively) away from the start of the file. The table thus lets the PDF application know that object 3 is free and can therefore be reused for other data.
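The 0 4 example can be reconstructed and parsed with a short sketch. The table text below is my reconstruction from the description above, not a copy from a real file. The generation number 00001 for the free object 3 and the use of a space plus Line feed as the 2-character eol are assumptions. The names `parse_xref_section` and `free_objects` are made up for this tutorial.

```python
def parse_xref_section(text):
    """Parse one 'xref' section into {object_number: (value, generation, kind)}."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "xref":
        raise ValueError("a section must start with the keyword 'xref'")
    entries = {}
    i = 1
    while i < len(lines):
        # Subsection header: first object number, then number of entries.
        first, count = (int(part) for part in lines[i].split())
        i += 1
        for obj_num in range(first, first + count):
            value, generation, kind = lines[i].split()
            if len(value) != 10 or len(generation) != 5 or kind not in ("n", "f"):
                raise ValueError("malformed entry for object %d" % obj_num)
            entries[obj_num] = (int(value), int(generation), kind)
            i += 1
    return entries


def free_objects(entries):
    """Follow the chain of free entries that starts at object 0."""
    chain = []
    nxt = entries[0][0]
    while nxt != 0 and nxt not in chain:
        chain.append(nxt)
        nxt = entries[nxt][0]
    return chain


# Reconstructed example: object 3 is free, objects 1 and 2 are in use.
SAMPLE = (
    "xref\n"
    "0 4\n"
    "0000000003 65535 f \n"   # object 0: next free object is 3
    "0000000015 00000 n \n"   # object 1: 15 bytes from the start
    "0000000075 00000 n \n"   # object 2: 75 bytes from the start
    "0000000000 00001 f \n"   # object 3: free, points back to object 0
)

table = parse_xref_section(SAMPLE)
print(table[1], table[2])   # (15, 0, 'n') (75, 0, 'n')
print(free_objects(table))  # [3]
```

For entries marked 'n' the first number is a byte offset, while for entries marked 'f' it is the object number of the next free object, which is why `free_objects` can walk the chain until it returns to object 0.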