When you work in research (as perhaps in life in general) you start to amass a lot of stuff. In research this tends to be journal articles. In the old days these would all be photocopies which you would spend an age putting into folders and writing index cards for. Now pretty much everything is electronic, and with that you have swapped the lever-arch folders for directories of PDF files.

There are a lot of commercial packages out there for managing all these PDFs. Some cost a lot of money and are not very good. Others are free but don't quite do what you want. Others still only work on your phone or tablet and not your computer. Did I mention very few are available for Linux?

So even if you do find a package you like, it is doubtful that it will be available for all your platforms. And if you spend many, many, many days importing your PDF mess into their product and tagging each file with metadata, it is almost heartbreaking to have to do it all again a week later for another device or platform.

The obvious answer is to store the metadata with each PDF file. In fact the PDF format has plenty of scope for metadata storage: things like the article authors, title, abstract, keywords and dates are all easily stored within the file.
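For a concrete idea of what this looks like, the PDF specification's Document Information dictionary defines standard keys for exactly these fields. The helper below is a hypothetical sketch, not PDFtidy's actual code; the function name and the convention of putting the abstract in /Subject are my own choices.

```python
def article_to_pdf_info(authors, title, abstract, keywords):
    """Map journal-article metadata onto the standard PDF Document
    Information dictionary keys (/Author, /Title, /Subject, /Keywords)."""
    return {
        "/Author": "; ".join(authors),
        "/Title": title,
        # There is no dedicated abstract key; /Subject is a common home for it.
        "/Subject": abstract,
        "/Keywords": ", ".join(keywords),
    }

info = article_to_pdf_info(
    ["A. Author", "B. Author"],
    "An Example Article",
    "We study something and report the results.",
    ["metadata", "PDF"],
)
```

A PDF library then writes this dictionary into the file itself, so the metadata travels with the PDF wherever it goes.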

However, for some unknown reason, publishers never utilise this capability to populate these metadata tags.

With over 2,000 PDF files it would be a massive job to do by hand, and one I could not actually face doing. So I sat down one day and wrote a program to do it for me! PDFtidy is the result.

You see, all the metadata you probably want for a journal article already exists on the web, in resources such as Crossref. The problem is linking that metadata to your PDF file and then embedding it in the file itself, which is rather challenging. There is, however, a way to do this if you simply know the Digital Object Identifier (DOI) of the journal article. The DOI allows you to link to the article via the web using dx.doi.org, or to do a lookup using Crossref.
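As a sketch of what that lookup involves: Crossref exposes a public REST API where fetching `https://api.crossref.org/works/<DOI>` returns the article's metadata as JSON. The field names (`message`, `title`, `author`, `given`, `family`) come from the real Crossref response format; the helper names below and the trimmed sample payload are my own, purely illustrative.

```python
import json

CROSSREF_API = "https://api.crossref.org/works/"

def crossref_url(doi):
    """Build the Crossref REST API lookup URL for a DOI."""
    return CROSSREF_API + doi

def parse_crossref(payload):
    """Pull the title and author names out of a Crossref /works response."""
    msg = payload["message"]
    title = msg.get("title", [""])[0]
    authors = ["{} {}".format(a.get("given", ""), a.get("family", "")).strip()
               for a in msg.get("author", [])]
    return {"title": title, "authors": authors}

# A trimmed sample of the JSON Crossref returns (structure only):
sample = json.loads("""{
  "message": {
    "title": ["An Example Article"],
    "author": [{"given": "Jane", "family": "Doe"}]
  }
}""")
meta = parse_crossref(sample)
```

In practice you would fetch the URL over HTTP and feed the response body to the parser; the sample above just stands in for a live request.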

PDFtidy simply utilises several routines to try and discover the PDF's DOI. Once found, it automatically does the Crossref lookup, pulls the information down for storage in an XML file and populates the empty tags in the PDF file (or, more precisely, in a duplicate of the PDF file). This metadata is then readable by 99% of the PDF readers out there. If the reader is part of an ebook collection package such as Mantano then the data can also be searched.
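I won't claim this is exactly what PDFtidy's routines do, but a common first pass for DOI discovery is a regular expression run over text extracted from the article's first page. The pattern below is a widely used one for modern DOIs:

```python
import re

# Matches modern DOIs (prefix "10." plus 4-9 digit registrant code,
# a slash, then the publisher-assigned suffix).
DOI_RE = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+', re.IGNORECASE)

def find_doi(text):
    """Return the first DOI-like string in `text`, or None if absent."""
    m = DOI_RE.search(text)
    # A trailing full stop is usually sentence punctuation, not part of the DOI.
    return m.group(0).rstrip('.') if m else None

doi = find_doi("J. Imaginary Sci. 12, 345 (2010). doi:10.1234/jis.2010.0345.")
```

The journal name and DOI in the example are made up; the point is only that the identifier can be fished out of free text once you have the page's text in hand.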

PDFtidy is still a long way from perfect. I am working on the code base to improve the DOI detection. Some earlier journal articles did not use a DOI (in many cases it simply did not exist yet); for those I am working on a fallback that searches by title using Google and Crossref.

Finally, if you have an ebook collection it is sometimes useful to have keywords for each book. A lot of packages, such as Zotero, will generate these for you when you import the PDF file. So I thought that, just like the title and author information, having the keyword tags in the PDF file populated would be really useful. I am currently working on adding a keyword generator to PDFtidy.

At the moment the keyword generator utilises R and the tm package to do the analysis, generating a keyword file to be used in a subsequent step. It is somewhat processor-intensive, so watch this space.
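PDFtidy does this in R with tm, but the underlying idea (tokenise, drop stopwords, rank terms by frequency) fits in a few lines of any language. Here is a Python analogue; the stopword list and helper name are my own illustrative choices, not part of PDFtidy:

```python
import re
from collections import Counter

# A tiny illustrative stopword list; tm ships much larger ones.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "we", "with"}

def keywords(text, n=5):
    """Naive term-frequency keywords: tokenise, drop stopwords, count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(n)]

kw = keywords("Metadata for PDF files: we embed metadata tags in each PDF "
              "file so the metadata survives across devices.")
```

Real keyword extraction (as in tm) adds stemming and document-frequency weighting on top of this, which is where the processing time goes.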