OJ Webcrawler

One of the weirder sources that we retrieved at Reuters News was the Official Journal of the EU. We needed a specialised webcrawler to run daily and forage for new files.

The OJEU is a list of rules and recommendations. It was published in 10 languages, each with a web page containing links to dozens of PDFs. Initially, two long-suffering guys in London (Hi Dave. Hi Tony) had the chore of collecting the several daily issues by hand, a nightmare task.

My crawler was developed in Delphi, but harnessed Internet Explorer as an embedded web browser inside the app, so that it could parse the HTML. The operator simply navigated to the required date on a calendar, and the crawler then worked out which issues were available, and in which languages. Then it issued HTTP GETs for the links, but the OJ server was so under-powered that these requests would often fail. The crawler had to keep track of timeouts and re-issue each request a few times before giving up.
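For the curious, here is roughly what that retry loop looks like. It is a minimal sketch in Python, not the original Delphi code, which I no longer have to hand. The function name, the default of three attempts, the timeout and the pause are all my own assumptions for illustration.

```python
import time
import urllib.request


def fetch_with_retries(url, attempts=3, timeout=30.0, pause=5.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the body of `url` as bytes, or None once every attempt has failed.

    All names and default values here are hypothetical; the original
    crawler's settings are not recorded.
    """
    for attempt in range(1, attempts + 1):
        try:
            # The timeout guards against a server that accepts the
            # connection and then never answers.
            with opener(url, timeout=timeout) as response:
                return response.read()
        except OSError:
            # URLError, HTTPError and socket timeouts all derive from OSError.
            if attempt < attempts:
                # Give the struggling server a breather before asking again.
                sleep(pause)
    # Out of attempts: give up and let the caller flag the file as missing.
    return None
```

Returning None rather than raising lets the caller carry on with the remaining links and report the failures at the end. The opener and sleep parameters are only there so the loop can be exercised without a real server.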

The resulting files were automatically compiled by the crawler into one zip file per language and issue, using a free zipper component.
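The grouping step is simple enough to sketch as well. Again this is Python for illustration only, not the Delphi zipper component I actually used. The function name, the (language, issue, path) tuples and the archive naming scheme are assumptions of mine.

```python
import zipfile
from collections import defaultdict
from pathlib import Path


def zip_by_language_and_issue(files, out_dir):
    """Bundle downloaded files into one archive per (language, issue) pair.

    `files` is an iterable of (language, issue, path) tuples. The archive
    naming scheme below is hypothetical. Returns the archive paths created.
    """
    groups = defaultdict(list)
    for language, issue, path in files:
        groups[(language, issue)].append(Path(path))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    archives = []
    for (language, issue), paths in sorted(groups.items()):
        archive = out_dir / f"OJ_{issue}_{language}.zip"
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in paths:
                # Store only the file name, not the full download path.
                zf.write(path, arcname=path.name)
        archives.append(archive)
    return archives
```

Called with the English and French PDFs of a single issue, this would produce two archives, one per language.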

And for status indicators, there were assortments of red, amber and green lights to entertain Dave and Tony, who magically had time on their hands.

If I had to do this today, naturally I would use PHP and cURL, but at that time (1998), Delphi made a great equivalent.

Delphi
Date: 1998
