|
Bulk Loading Data into Drupal
This notebook entry documents (as much for myself as anyone else) how I finally figured out how to 'bulk load' pages. It became obvious after some time that the only content type I needed to load was a Drupal 'page'. Queries to the Drupal mailing lists got me nowhere. Finally I stumbled on this post and, after some very simple tests, found that I could write a very simple PHP program to pound a file into the database (i.e. create a page). What follows is an outline of what I did; it is very, very specific to my environment.
My first step was to write a little PHP program to see if I could actually create a Drupal page. I've long since lost the original code, but it survives as a function in my existing code. The original was pretty close to this:
I copied this HTML file to the document root of my Drupal site and then invoked that page (with a web browser or something like wget). Returning to the Drupal management screens, I checked that a node had been created and that it appeared to be correct. Great! After that I kept modifying the PHP code to do more and more, until it could read a file from the local file system of the web server. The final result was loadhtml.php, which expects a simple ASCII file containing the name of a local file (on the web server), the title of the page, and a string of three taxonomy values (which I'm not going to tell you anything about). An input file might look like:
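The real control file layout isn't reproduced here, so treat the following as a sketch under assumptions rather than the format loadhtml.php actually reads. It assumes one record per line with three pipe-separated fields (file name, title, taxonomy string), and it is written in shell rather than PHP purely to illustrate the parsing. The function name `read_control_file`, the sample title and the `t1,t2,t3` placeholders are all hypothetical.

```shell
#!/bin/sh
# Assumed (hypothetical) record layout, one page per line:
#   <local file>|<page title>|<three taxonomy values>
# for example:
#   Basic_Talks_01.htm.stripped|Basic Talks 01|t1,t2,t3
# Blank lines and lines starting with '#' are skipped.

read_control_file() {
    # Split each line on '|'; the last variable receives the rest of the line.
    # The "|| [ -n ... ]" part also handles a final line without a newline.
    while IFS='|' read -r file title terms || [ -n "$file" ]; do
        case "$file" in
            ''|'#'*) continue ;;
        esac
        printf 'file=%s\ntitle=%s\nterms=%s\n\n' "$file" "$title" "$terms"
    done < "$1"
}
```

Calling `read_control_file control.txt` prints one block of `file=`, `title=` and `terms=` lines per record, which is roughly the information the loader needs before it can create a page.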
I copy this control file and the input HTML files (e.g. Basic_Talks_01.htm.stripped) to the expected directory on the web server and then invoke loadhtml.php. The program gets a list of files to process, figures out the body, title and taxonomy values, and calls a function to create the Drupal pages. Wow, that was easier than all that mousing around!
In my case I had lots of existing HTML files at an old web site. They had all sorts of navigation HTML embedded in them. I wrote a Perl script to convert these files into something more Drupal-friendly. This involved stripping out the HTML headers, cleaning up the original HTML, and finding just the body of the HTML page. It also attempted to identify the 'title' of the page. I did not attempt to do everything possible... after all, I could always use the Drupal management screens to edit the HTML. Here's the Perl program (stripit.pl) to clean up the HTML.
My last step was to write a shell script to drive the whole process. This script (not shown) did the following:
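The original script isn't available, so what follows is only a guess at the shape of such a driver, based on the steps described in this entry: strip each old HTML file, copy the control file and the stripped files to the web server, then request loadhtml.php. Every path, host name and URL is a placeholder. The way stripit.pl is invoked (input file as argument, cleaned HTML on standard output) is an assumption, and `DRY_RUN` defaults to 1 so the script only prints what it would do.

```shell
#!/bin/sh
# Hypothetical driver sketch; all values below are placeholders.
SRC_DIR="${SRC_DIR:-./old_site}"                         # assumed: old HTML files
CONTROL_FILE="${CONTROL_FILE:-./control.txt}"            # assumed control file name
REMOTE="${REMOTE:-user@webserver:/path/to/load_dir}"     # assumed scp target
LOAD_URL="${LOAD_URL:-http://example.com/loadhtml.php}"  # assumed loader URL
DRY_RUN="${DRY_RUN:-1}"                                  # 1 = print commands only

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assumption: stripit.pl reads the file named on its command line
# and writes the cleaned HTML to standard output.
strip_one() {
    perl stripit.pl "$1" > "$1.stripped"
}

drive() {
    # Step 1: clean up every old HTML file.
    for f in "$SRC_DIR"/*.htm; do
        [ -e "$f" ] || continue   # no matches: skip the literal pattern
        run strip_one "$f"
    done
    # Step 2: copy the control file and the stripped files to the web server.
    run scp "$CONTROL_FILE" "$SRC_DIR"/*.stripped "$REMOTE"
    # Step 3: request the loader page so it creates the Drupal pages.
    run wget -q -O - "$LOAD_URL"
}

drive
```

Running it as is only lists the commands. Setting `DRY_RUN=0` (after replacing the placeholders) would actually execute them.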
This is certainly a thousand times faster than doing it manually. The downside was that once I had hundreds of new Drupal pages, I still had piles of menus to create by hand :-(. |