Bulk Loading Data into Drupal

This notebook entry documents (as much for myself as anyone else) how I finally figured out how to 'bulk load' pages. It became obvious after some time that the only type of content I needed to load was a Drupal 'page'.

Queries to the Drupal mailing lists got me nowhere. Finally I stumbled on this post, and after some very simple tests I found I could write a very simple PHP program to pound a file into the database (that is, create a page). What follows is an outline of what I did, and it is very, very specific to my environment.

The following will make most sense to those who are pretty comfortable with PHP and Perl. I am not presenting this code as anything like a general solution, but really as an outline of a one-time hack to pound files of HTML into Drupal. The code is filled with dozens of special cases for my environment. If you use any of this code, be very, very careful to change it to fit your needs. My examples are all based on Linux, but nothing here is tied to Linux; it's just my environment of choice.

My first step was to write a little PHP program to see if I could actually create a Drupal page. I've long since lost the original code, but it survives as a function in my existing code. The original was pretty close to this:

  $title = "Terry's Test node";
  $body = "<h2>Some text from Terry</h2>\n<p>here is a paragraph</p>\n";

  $mynode = array();
  $mynode['title'] = $title;
  $mynode['type'] = 'page';
  $mynode['body'] = $body;
  //  published=1 or unpublished=0 content
  $mynode['status'] = 1;
  //  uid is the user ID that owns the node. User 1 is the account created first
  //  after a Drupal installation and has all privileges. Make sure the user ID you
  //  pick has all privileges; preferably use uid 1 to save yourself the hassle.
  $mynode['uid'] = 1;
  //  promote =0 doesn't promote the content to the front page
  //  whereas promote=1 promotes the content to the front page
  $mynode['promote'] = 0; 
  //  comment 0=off , comment 1=readonly, comment 2=allowed
  $mynode['comment'] = '0';
  //  inputformat, format=0 means Filtered HTML,format=1 means PHP code , format=2 means Full HTML
  $mynode['format'] = '2'; 

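  //  Create the node and save it, then print the result so you can see the node number.
  //  Assumes Drupal is already bootstrapped so that node_submit() is available.
  //  Assumption: the node is handed over as an object built from the array above.
  $o = (object) $mynode;
  $newnode = node_submit($o);
  print_r($newnode);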

I copied this PHP file to the document root of my Drupal site and then invoked that page (with a web browser or something like wget). Returning to the Drupal management screens, I checked that a node was created and that it appeared to be correct. Great! After that I kept modifying the PHP code to do more and more, until it could read a file from the local file system of the web server. The final result was loadhtml.php, which expects a simple ASCII control file. Each line of that file gives the name of a local file (on the web server), the title of the page and a string of three taxonomy values (which I'm not going to tell you anything about). An input file might look like:

file=050919DharmaBB_web.htm.stripped ; title=                      ; taxonomy=aaron,Dharma,2005;
file=051017DharmaBB-web.htm.stripped ; title=                      ; taxonomy=aaron,Dharma,2005;
file=Basic_Talks_01.htm.stripped     ; title=Empowerment           ; taxonomy=aaron,Dharma,2005;
file=Basic_Talks_02.htm.stripped     ; title=Aaron:                ; taxonomy=aaron,Dharma,2005;
file=Basic_Talks_06.htm.stripped     ; title=Karma and Liberation  ; taxonomy=aaron,Dharma,2005;
file=Basic_Talks_07.htm.stripped     ; title=Dharma in the Belly   ; taxonomy=aaron,Dharma,2005;

I copy this control file and the input HTML files (e.g. Basic_Talks_01.htm.stripped) to the expected directory on the web server and then invoke loadhtml.php. The program gets a list of files to process, figures out the body, title and taxonomy values and calls a function to create the Drupal pages. Wow, that was easier than all that mousing around!
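To illustrate the parsing step, here is a minimal sketch of how lines in the control file format shown above could be split into their fields. This is not the actual loadhtml.php. The function names parse_control_line and parse_control_file are made up for this example, and it assumes that fields are separated by ';', that keys and values are separated by the first '=', and that taxonomy terms are separated by ','. A title that itself contains a ';' would break this simple approach.

```php
<?php
// Hypothetical helper: split one control-file line into file, title and taxonomy.
// Returns null for blank lines or lines without a file= field.
function parse_control_line($line) {
    $fields = array();
    foreach (explode(';', $line) as $part) {
        // Split on the first '=' only, so values may contain '=' themselves.
        $pair = explode('=', $part, 2);
        if (count($pair) == 2) {
            $fields[trim($pair[0])] = trim($pair[1]);
        }
    }
    if (!isset($fields['file']) || $fields['file'] === '') {
        return null;
    }
    $terms = array();
    if (isset($fields['taxonomy'])) {
        foreach (explode(',', $fields['taxonomy']) as $term) {
            $term = trim($term);
            if ($term !== '') {
                $terms[] = $term;
            }
        }
    }
    return array(
        'file'     => $fields['file'],
        'title'    => isset($fields['title']) ? $fields['title'] : '',
        'taxonomy' => $terms,
    );
}

// Hypothetical helper: parse the whole control file, skipping unusable lines.
function parse_control_file($text) {
    $entries = array();
    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        $entry = parse_control_line($line);
        if ($entry !== null) {
            $entries[] = $entry;
        }
    }
    return $entries;
}

// Usage: parse two lines taken from the sample control file.
$sample = "file=Basic_Talks_01.htm.stripped     ; title=Empowerment           ; taxonomy=aaron,Dharma,2005;\n"
        . "file=050919DharmaBB_web.htm.stripped ; title=                      ; taxonomy=aaron,Dharma,2005;\n";
print_r(parse_control_file($sample));
```

Each entry could then be used to read the named file for the body and to fill in a node array like the one in the first listing. Note that an empty title= field comes back as an empty string, so the caller has to decide what title to use in that case.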

In my case I had lots of existing HTML files at an old web site. They had all sorts of navigation HTML embedded in them. I wrote a Perl script to convert these files into something more Drupal-friendly. This involved stripping out the HTML headers, cleaning up the original HTML and finding just the body of the HTML page. It also attempted to identify the 'title' of the page. I did not attempt to do everything possible... after all, I could always use the Drupal management screens to edit the HTML. Here's the Perl program (stripit.pl) to clean up the HTML.
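The core idea of that clean-up (keep only what is inside the body element and guess a title) can be sketched in a few lines. This is only an illustration written in PHP rather than Perl, and it is not stripit.pl. The function name strip_page is hypothetical, and taking the title from the title element is an assumption. Regular expressions are not a real HTML parser, so treat this as a starting point for well-behaved files only.

```php
<?php
// Hypothetical helper: pull a title guess and the body content out of an HTML page.
function strip_page($html) {
    // Title guess: the text of the first <title> element, whitespace collapsed.
    $title = '';
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        $title = trim(preg_replace('/\s+/', ' ', strip_tags($m[1])));
    }

    // Body: everything between <body ...> and the last </body>.
    // If no body element is found, fall back to the whole input.
    $body = $html;
    if (preg_match('/<body[^>]*>(.*)<\/body>/is', $html, $m)) {
        $body = $m[1];
    }

    return array('title' => $title, 'body' => trim($body));
}

// Usage: a tiny page with a header that should be thrown away.
$page = "<html><head><title> Karma and\n Liberation </title></head>"
      . "<body class=\"old\"><p>Some text</p></body></html>";
print_r(strip_page($page));
```

Removing embedded navigation markup is site-specific and is left out here. That is exactly the kind of special case that has to be written for each old site.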

My last step was to write a shell script to drive the whole process. This script (not shown) did the following:

  • Create a temporary directory on my machine
  • Copy the HTML to the temporary directory
  • Invoke stripit.pl on the HTML in the temporary directory
  • Copy the stripped HTML files, the input config file and loadhtml.php to the document root of the Drupal site
  • Invoke loadhtml.php on the web site

This is certainly a thousand times faster than doing it manually. The downside was that once I had hundreds of new Drupal pages, I still had piles of menus to create by hand :-(.