Monday, August 03, 2009

Working with PDF documents in Ubuntu

PDF files can be a pain in the butt. The format is fantastic for keeping the layout of a page together, but irritating if you want to search the document, copy text or read on a small screen (especially a smartphone like the Nexus One).

There are a few Linux tools that can make working with PDFs a little easier.

Ubuntu comes with Evince, which it simply calls "Document Viewer." It doesn't show up in the Applications menu by default; it just opens when you double-click a PDF file. (If you want to open it without clicking on a PDF, open the application launcher (Alt-F2), type "evince" without the quotes, and click Run.)

Evince is quick and it lets you search, print and copy from a pdf document, but its copy function has some limitations.

To copy, you highlight text line by line. That's fine for simple documents, but it can cause problems if the document has columns or other odd formatting.

Rather than just getting the lines from one column, you'll often also get the adjacent lines from the next column. Sometimes you'll get lines from an entirely different part of the document (I assume that has something to do with the way the pdf was created).

The great thing about Linux is that you have options to make working with pdf files easier.

I use both Evince and a KDE app called Okular. Okular has some nice features, like the ability to annotate documents, and a useful copy function.

Copying with Okular
When you click the Selection option on the menu bar, it tells you to "Draw a rectangle around the text/graphics to copy."

That lets you select exactly the part of the document you want (as long as it fits inside a rectangle).

Once you've selected the part of the document you want to copy, Okular opens a menu with three options:
  • Text - Copy to clipboard
  • Image - Copy to clipboard
  • Image - Save to file

The first option (Text - Copy to clipboard) lets you paste the text into another document in roughly the same format as the pdf. Depending on what you grabbed, you may have to add or delete spaces to get it looking right.

"Image - Copy to clipboard" lets you paste whatever you grabbed into another document as an image.

"Image - Save to file" is really useful. When you select this option, it automatically creates an image file that you can save to your hard disk. You can save it as either .jpg or .png just by adding the extension to the file name.

Okular also lets you add annotations to pdf and other documents. This is handy, but it has some limitations.

The annotations are in a format that's unique to Okular. You can save them, but they're saved as metadata within Okular. They won't show up if you open the document with another application. (Okular stores the annotations in xml files in ~/.kde/share/apps/okular/docdata if you want to see them -- it's interesting how the developers keep the positioning and other information.)
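If you'd rather poke at those files with a script than open them one by one, here's a minimal Python sketch. It assumes nothing about Okular's actual XML layout: it just looks for .xml files anywhere under ~/.kde/share/apps/okular and prints each element's tag and attributes. The function names (find_xml_files, summarize_xml) are my own, not anything Okular provides.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_xml_files(base):
    """Return every .xml file under `base`, or an empty list if it doesn't exist."""
    base = Path(base).expanduser()
    if not base.is_dir():
        return []
    return sorted(base.rglob("*.xml"))


def summarize_xml(path):
    """Return (tag, attributes) for every element in the file, in document order."""
    tree = ET.parse(path)
    return [(element.tag, dict(element.attrib)) for element in tree.iter()]


if __name__ == "__main__":
    # Assumed location, based on the path mentioned in this post.
    for xml_file in find_xml_files("~/.kde/share/apps/okular"):
        print(xml_file)
        for tag, attrs in summarize_xml(xml_file):
            print("   ", tag, attrs)
```

This only reads the files; it doesn't change anything, so it's safe to run against your real annotations.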

Copying Tables
For copying tables from pdf docs I have much better luck with Evince, because it keeps the row and column spacing better than Okular.

I often copy tables from pdf files into other documents and the easiest way I've found to do it is by first copying them into OpenOffice Calc.

In Evince, I highlight the part of the table I want and copy it. Then I open a new Calc spreadsheet and right-click on the cell that will become the top-left-most cell of the table. From the menu I select Paste special.

That opens a dialog box from which I select Unformatted Text. That opens another box with a bunch of options.

The pasted table shows up in the Fields box, almost always in the same format as in the original pdf document. It won't necessarily show up that way when you paste it, but it looks nice in the box.

To preserve that layout, select Fixed width from the Separator options. Now, if you put your cursor in the ruler at the top of the Fields box, you'll see a line extending down through the table. When the line is up against the left edge of a column, click the mouse and it will stay there. Do the same thing for each column (except the left-most one -- it's already up against the left edge).

If you want to move one of the lines, just put your cursor over it in the ruler and the cursor will turn into an arrow. Click and you can move the line left or right. Double-click and the line disappears.

The left edge of columns containing numbers often won't align. Bigger numbers will extend farther to the left. To make sure you're capturing the longest numbers, scroll down through the table.
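To see why the position of each line matters, here's a rough Python sketch of the idea behind Fixed width. This is not Calc's code, just an illustration: each break position stands in for one line clicked into the ruler, and the sample table is made up.

```python
def split_fixed_width(lines, breaks):
    """Split each line at the given character positions.

    Each position in `breaks` plays the role of one separator line
    clicked into the ruler: it marks the left edge of a column.
    """
    edges = [0] + sorted(breaks)
    rows = []
    for line in lines:
        cells = []
        for i, start in enumerate(edges):
            end = edges[i + 1] if i + 1 < len(edges) else None
            cells.append(line[start:end].strip())
        rows.append(cells)
    return rows


# Made-up table text: names padded to 12 characters, values right-aligned in 5.
sample = [
    f"{name:<12}{value:>5}"
    for name, value in [("Region", "Sales"), ("North", "950"), ("South", "12500")]
]

good = split_fixed_width(sample, [12])  # break at the left edge of the longest number
bad = split_fixed_width(sample, [14])   # break lined up with the short number only

print(good)
print(bad[2])
```

With the break at 12, every number lands whole in the second column. With the break at 14, the last row comes out with 12500 cut in two, which is the same thing that happens in Calc if you only check the top of the table.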

Sometimes, depending on the formatting of the original table, parts of one column will extend into parts of the next column. This mostly happens when you copy a table that extends over more than one page in the original document.

You have two options: If it's just a line or two, leave it and just clean up the spreadsheet after you paste. If it's a lot, note where the overlap starts, then go back to the original document and copy from the top down to the problem part. Paste that. Then copy and paste the rest of the table.

Getting Okular
To get Okular, you can use Synaptic. Just search for Okular.

You may notice an option called okular-extra-backends, which lets Okular open other document types, like DjVu, TIFF, ePub and FAX. You can try it, but as of now it's not working in Ubuntu.