Jailbreaking the PDF

Submitted by caseym on Tue June 25, 2013

tl;dr - We've created a community for getting scholarly information out of PDFs: Jailbreaking the PDF.  Our first hackathon was a lot of fun, and we're going to do it again at Codefest 2013 and Hack4Ac.

I recently attended the Extended Semantic Web Conference in Montpellier, France, which was great fun and very educational.  The highlight of the event was meeting people and getting stuff done.  To that end, my colleague Alex Garcia-Castro and I decided to organize a hackathon at the ABES Agency near the conference.  About 15 of us came together with the common mission of automating the extraction of information from PDFs.


Tools and Tasks from the Event

  • PDFX (http://pdfx.cs.man.ac.uk/) - This was the most comprehensive tool at the hackathon.  It accepts PDFs as input and attempts to automate conversion to a structured XML format similar to JATS.  Unfortunately, PDFX is not open source at this time.  However, it is available as a free RESTful service, which I have been using quite a lot since the hackathon.
  • LAPDFText (https://code.google.com/p/lapdftext/) - This is another very mature and well-documented open source tool for identifying and classifying content inside of PDFs for extraction purposes.  The developer, Gully Burns, participated from nine time-zones away via Google Hangout, which we all appreciated very much!
  • CiTaLo (http://wit.istc.cnr.it:8080/tools/citalo) - This is a tool that automatically annotates citations with properties defined in CiTO (Citation Typing Ontology).  The tool uses PDFX to attempt citation identification/classification in PDFs, and then annotates the citations.
  • PDF2SVG (https://bitbucket.org/petermr/pdf2svg-dev) - Part of a collection of Java libraries that perform various extractions on PDF data.
  • XtractPDF (http://xtractpdf.com) - My tool, which I'll write a longer post about later, provides a web interface that makes it easier for humans to clean up automated output.  It accepts PDFs as input, runs them through the PDFX web service, and produces a structured data model that can be worked on through the interface.

...and others

Next Steps - Systematic Evaluation of Tools

Also at the hackathon, we talked about the importance of evaluating the existing tools and building a benchmark that would allow them to be compared systematically.  Some of the capabilities we identified for evaluation were:

  1. The ability to identify contiguous blocks of text, based on position, font, and other characteristics
  2. The ability to classify identified blocks as headings, subheadings, figure captions, etc.
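
To make these two capabilities concrete, here is a minimal sketch in Python of what a tool under evaluation has to do.  It is not taken from any of the tools above: the Line and Block types, the group_lines and classify functions, and the thresholds (1.5 times the font size for line gaps, 1.4 times the body size for headings) are all hypothetical choices made for illustration.  It also assumes the text lines, with their positions and fonts, have already been pulled out of the PDF by some other library.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    text: str
    top: float   # distance from the top of the page, in points
    font: str    # font name as reported by the PDF
    size: float  # font size, in points


@dataclass
class Block:
    lines: list
    label: str = "body"

    @property
    def text(self):
        return " ".join(line.text for line in self.lines)


def group_lines(lines, max_gap=1.5):
    """Merge consecutive lines into one block when they share font and size
    and the vertical gap is at most max_gap times the font size."""
    blocks = []
    for line in sorted(lines, key=lambda ln: ln.top):
        if blocks:
            prev = blocks[-1].lines[-1]
            same_style = (prev.font, prev.size) == (line.font, line.size)
            close = (line.top - prev.top) <= max_gap * prev.size
            if same_style and close:
                blocks[-1].lines.append(line)
                continue
        blocks.append(Block([line]))
    return blocks


def classify(blocks):
    """Label blocks using font size relative to the median line size,
    plus a simple text cue for figure captions."""
    body_size = median(line.size for b in blocks for line in b.lines)
    for block in blocks:
        size = block.lines[0].size
        if block.text.lower().startswith(("figure", "fig.")):
            block.label = "caption"
        elif size >= 1.4 * body_size:
            block.label = "heading"
        elif size > body_size:
            block.label = "subheading"
    return blocks


# Made-up sample page, used only to exercise the two functions.
page = [
    Line("Jailbreaking the PDF", 50, "Helvetica-Bold", 18),
    Line("Introduction", 90, "Helvetica-Bold", 12),
    Line("PDFs are hard to mine", 110, "Times", 10),
    Line("because layout hides structure.", 122, "Times", 10),
    Line("Tools differ widely.", 134, "Times", 10),
    Line("Figure 1: Pipeline overview", 200, "Times", 10),
]

blocks = classify(group_lines(page))
for block in blocks:
    print(block.label, "|", block.text)
```

A benchmark could then compare the labels a tool produces against a hand-annotated set of blocks for the same page.  Real PDFs are much messier than this sample (multiple columns, footnotes, inconsistent fonts), which is exactly why a systematic evaluation is needed.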

The community meets at http://pdfjailbreak.com.  If you're interested, join the discussion on our mailing list.