darkoshi: (Default)
[personal profile] darkoshi
About 6 years ago at work, I set up an online group on one of the corporate websites intended for that purpose. It was a group for our developers to share information and post questions & answers. Many coworkers joined the group, and it got a good bit of use in its first few years (about 180 posts/threads). But the activity eventually lessened, and the last post was 2 years ago (for various reasons, I suppose).

Recently all group owners were notified that the website was being shut down soon, in favor of some other new site on different technology. We were told that no tools would be provided for saving our group's content, but that we could copy and paste the content into Word documents.

I harrumphed at the thought. Opening each and every post, and copying/pasting it into a Word doc? You've got to be kidding. As the group hadn't even been used in 2 years, and much of the info there was no longer pertinent, there didn't seem to be much point in trying to save the content.

But yesterday I took some screenshots of the pages which listed the post titles, for memory's sake, or nostalgia, or because maybe that could somehow be useful.

Today a coworker emailed me a question. It reminded me of one of those posts, which explained how to find the foreign key relationships of a table in SQL Explorer. So I went back and read that post. It helped me answer the question.

Then I wondered if I could find an easier way to save the group data after all. I discovered that each thread had an option for saving to a PDF file - and to get that PDF, you only had to append ".pdf" to the URL of the thread's page.
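That ".pdf" trick is simple enough to sketch. The thread URLs below are made up, since the actual site isn't named here:

```python
# Hypothetical thread URLs -- the real site's URL scheme isn't shown in the post.
thread_urls = [
    "https://groups.example.com/threads/12345",
    "https://groups.example.com/threads/12399",
]

# Per the trick above: the printable PDF lives at the thread URL plus ".pdf".
pdf_urls = [url + ".pdf" for url in thread_urls]
```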

If I could get a list of all those URLs, then I could save off the PDFs. The post listing showed 20 titles & URLs per page, so I saved off about 10 HTML pages like that. Then I used File Locator Pro (an awesome tool; I highly recommend it) to parse out the URLs along with the titles. I used a regex search query and saved off the matches, using this method: export just the content found by a regex expression.
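The extraction step could look roughly like this in Python. The listing-page markup here is invented for illustration, and File Locator Pro has its own regex flavor, so treat this as a sketch of the idea rather than the actual query used:

```python
import re

# A made-up fragment of a saved listing page; the real site's markup is unknown.
html = """
<a class="thread" href="/threads/101">Foreign keys in SQL Explorer</a>
<a class="thread" href="/threads/102">Deploy checklist</a>
"""

# Capture the href and the link text in one pass.
pattern = re.compile(r'<a class="thread" href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>')
matches = [(m.group("url"), m.group("title")) for m in pattern.finditer(html)]
```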

Then I worked out how to save off the PDFs from the URLs. After logging into the website in my browser, entering the command "start firefox [URL]" in a command window would open the URL in a new tab of the browser. So I divided the URLs into groups of 10 and used a batch file to open them, ten at a time. (I didn't want to do all 180 at once, as I had a feeling that would either crash the browser or get me into some kind of trouble, as in: who's this person fetching a zillion pages from our webserver all at once?)
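The splitting-into-tens could be sketched like so. This is a Python stand-in for what was done by hand with a text editor, and the URLs are placeholders:

```python
def chunk(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Placeholder PDF URLs standing in for the ~180 real ones.
urls = [f"https://groups.example.com/threads/{n}.pdf" for n in range(1, 26)]

batches = chunk(urls, 10)
# Each batch would become one batch file full of lines like:
#   start firefox https://groups.example.com/threads/1.pdf
batch_file_lines = [[f"start firefox {u}" for u in batch] for batch in batches]
```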

Then I used a Firefox plugin, Mozilla Archive Format, to save all open tabs to a MAFF file. A MAFF file is a zip file containing a folder for each tab. Each folder has an index.html (or in my case index.pdf) file, along with an RDF file containing metadata, including the page's original filename.

So, once I had saved off MAFF files for all the URLs (about 18 MAFF files), I unzipped them all, extracted the PDFs, and used another batch file to rename them back to their original numeric filenames (which puts the posts in order by date) and to include the post titles as part of the filenames.
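Since a MAFF file is just a zip, the unzip-and-rename step can be sketched with Python's zipfile module. The example below builds a tiny in-memory stand-in for one MAFF; the folder name, the RDF snippet, and the URL are all invented, and the real metadata format is richer than this:

```python
import io
import re
import zipfile

# An in-memory stand-in for one MAFF file: one folder per saved tab,
# each holding an index.pdf plus an index.rdf metadata file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("tab_1/index.pdf", b"%PDF-1.4 fake content")
    zf.writestr(
        "tab_1/index.rdf",
        '<MAF:originalurl RDF:resource="https://example.com/threads/12345.pdf"/>',
    )

# Map each archived index.pdf back to its original numeric filename,
# recovered from the saved URL in the RDF metadata.
renamed = {}
with zipfile.ZipFile(buf) as zf:
    for folder in {name.split("/")[0] for name in zf.namelist()}:
        rdf = zf.read(f"{folder}/index.rdf").decode()
        match = re.search(r"/(\d+)\.pdf", rdf)
        if match:
            renamed[f"{folder}/index.pdf"] = f"{match.group(1)}.pdf"
```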

For creating the batch files, I used Notepad++'s column editing to edit a bunch of lines at once, and macros to apply the same changes to each line.

And voilà, I now have the group's entire content exported as PDF files that can be browsed or searched. And it only took me a few hours to do, most of which was figuring out how to do it as opposed to actually doing it.

I'm not sure what I'm going to do with the files now, but at least I have them.

Figuring out how to do things like that makes me feel clever.

Date: Saturday, May 13th, 2017 02:03 pm (UTC)
andrewducker: (Default)
From: [personal profile] andrewducker
Nicely done!

Date: Saturday, May 13th, 2017 11:33 pm (UTC)
randomdreams: riding up mini slickrock (Default)
From: [personal profile] randomdreams
That is in fact a quite lovely solution.