Converting Tags To Categories In WordPress : Using The XML File

I made a post a few weeks ago on how to convert your tags back into categories in WordPress, in case you got into trouble because of its conversion tool. That particular method, however, needed you to tweak the MySQL database for your WordPress blog, so it would not work for people who do not have access to the database, say, people using WordPress.com. Andrew Patterson faced such a problem recently, and his idea of using the WordPress export file to achieve this sounds interesting. So here’s how to go about it. I haven’t tried it myself, but given the nature of what I’m going to suggest I’m pretty sure it will work. To be on the safe side though, I advise you to create a new WordPress.com blog just to test this procedure before trying it out on your actual blog. Here’s what to do after you’ve created the new blog:

Log in to the dashboard of your WordPress.com production blog (your current, actual blog).
Go to Manage > Export. Choose to export the backup file for your blog, and save it on your computer.
Go to the folder where you saved it, and open the file in Notepad. This is important – do NOT open this in WordPad or Word because they insert extraneous markup which will make your file useless.
In case you hate Notepad, and want stuff like automatic highlighting etc – which will be VERY handy for what you’re doing – download Notepad++. It’s free software and a pretty small download, and far more advanced than Notepad in features. I recommend that you do this. In case you’re like me and use Linux, the default editors like Gedit and Kate / KWrite have these features already.
After opening the file you have downloaded, the first thing you should do is to create a copy under a different name (File > Save As…). This is because there’s a possibility that you may make an error while editing, and it’s always good to have a backup copy to go back to for starting over.
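
If you’d rather make that backup copy with a script than through File > Save As…, here’s a minimal Python sketch. The file name wordpress-export.xml and the "-edited" suffix are names I made up for illustration; use whatever your export is actually called. The demo at the bottom works on a throwaway file so it can’t touch your real export.

```python
import shutil
import tempfile
from pathlib import Path


def make_working_copy(export_path, suffix="-edited"):
    """Copy the export file next to itself and return the path of the copy."""
    source = Path(export_path)
    copy_path = source.with_name(source.stem + suffix + source.suffix)
    if copy_path.exists():
        # Refuse to overwrite an earlier working copy by accident.
        raise FileExistsError(f"{copy_path} already exists; choose another suffix")
    shutil.copy2(source, copy_path)  # copy2 also keeps the file's timestamps
    return copy_path


# Demo on a throwaway file (hypothetical name) so the sketch runs anywhere.
demo_dir = Path(tempfile.mkdtemp())
original = demo_dir / "wordpress-export.xml"
original.write_text("<rss></rss>", encoding="utf-8")

working_copy = make_working_copy(original)
print(working_copy.name)
```

Edit the copy and leave the original alone; if something goes wrong you can simply make a fresh copy and start over.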

Now for a bit of theory, which I think is important for you to understand what you’re going to do. The file that you downloaded from WordPress is what they call a ‘WordPress eXtended RSS’ (WXR) file. You’ve probably noticed the ‘feed’ feature on blogs, right? Well, such feeds are published in a file format called XML conforming to different standards like Atom or RSS. Basically, an XML (which stands for ‘eXtensible Markup Language‘) file contains data whose relation with other contents within that data is defined. Confused? I’ll give an example. Consider the following pseudo-XML type code.

<food>
    <foodname>Lasagna</foodname>
    <foodrecipe>Take tomato…</foodrecipe>
</food>

This looks a lot like HTML, doesn’t it? That’s because XML and HTML are both based on another language called SGML. The difference with XML is that any content written in it has to be rigidly formatted to pre-defined specifications. Element names are case sensitive, so you can’t mix lowercase and uppercase; you MUST close tags; and so on. In the example I gave above, the relation between the items would have been defined elsewhere, which would tell any program using it what the relation between the items is. In this case, it tells that Lasagna is the foodname of a particular food, and its recipe is defined inside foodrecipe. A ‘schema‘ is where the relation between these items is defined.
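
To see how strict XML really is, here’s a small Python sketch using the standard library’s xml.etree.ElementTree parser. It feeds the parser the lasagna example, then a copy in which the closing tag’s case doesn’t match the opening tag. The constant and function names are my own, chosen just for this illustration.

```python
import xml.etree.ElementTree as ET

GOOD = "<food><foodname>Lasagna</foodname><foodrecipe>Take tomato…</foodrecipe></food>"
# Same idea, but the closing tag's case does not match the opening tag.
BAD = "<food><foodname>Lasagna</FoodName></food>"


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Once the text parses, a program can pull out items by element name.
food = ET.fromstring(GOOD)
name = food.findtext("foodname")

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
print(name)
```

A check like this only tells you the file is well-formed XML. It says nothing about whether the file follows a particular schema.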

RSS feeds published by a blog also use XML, but naturally, they use a different schema (since they don’t need to define relations between food items). Instead, they’d have stuff like ‘post title’, ‘post body’, ‘post comments’, etc. This data, in a normal blog feed, would be restricted to only the bare minimum. When you export a WordPress WXR file, you get an XML file containing ALL the data stored in the database associated with your blog. Apart from the stuff you’d find in a normal blog feed, it also contains data on what time the post was published, who the author was, the comments posted under an article, etc. The way in which this is present in the file and how it should be interpreted is defined by WordPress in its own schema. Which means you can’t go ahead and start editing it any way you like – it MUST follow the schema, otherwise WordPress will refuse to accept it.

That said, let’s get started with editing the WXR file you’ve got. I don’t use tags, so I don’t have a file to work with (and I’m too lazy to add tags and check it out). However, given the very nature of XML – rigidly defined structures – I’m sure this would work out. Have a look at the beginning of the file. At one point, near the beginning, you will find that all categories / tags are defined. Here’s a snippet from mine:

<wp:category>
    <wp:category_nicename>motion-pictures</wp:category_nicename>
    <wp:category_parent />
    <wp:cat_name>
        <![CDATA[ Motion Pictures ]]>
    </wp:cat_name>
</wp:category>
<wp:category>
    <wp:category_nicename>pix-sells</wp:category_nicename>
    <wp:category_parent />
    <wp:cat_name>
        <![CDATA[ Pix Sells ]]>
    </wp:cat_name>
</wp:category>

As you can see, there is a clearly-defined structure. It begins with wp:category, which is opened and closed for each different category. I’m assuming you know a bit of HTML and know that tags are <like this>opened and closed</like this>.
Then comes wp:category_nicename. This defines the ‘category slug’ – which is basically the part of the URL that identifies the category when you browse its page. In my case for example, this ‘slug’ would correspond to https://ankurb.net/category/motion-pictures/
Next up is wp:cat_name. This holds the ‘human readable’ (or publicly displayed) version of the category name. This is what is shown to your visitors – like the category names in my blog’s sidebar. The data to be held in this field is included under CDATA, which is the general way of storing data items like these. In my snippet there is a space after the opening square bracket of CDATA and another before the closing one. As far as I can tell these come from the way the browser displays the file, not from any XML rule: whatever sits between the brackets, spaces included, is treated as part of the data. So don’t add or remove spaces here; leave the text exactly as it appears in your own file.
Categories can have ‘parent’ categories too. For example, you might have used ‘Barack Obama’ as a sub-category under ‘Politics’. I don’t use such sub-categories, therefore, in the snippet above you’ll find wp:category_parent which isn’t opened and closed in the ‘normal way’. This is because it has no data to store under it, and such empty elements in an XML file can be written in a ‘self-closing’ form, i.e., <elementname />: the element name, then a forward slash, then the closing bracket. The space before the slash is optional, but the slash itself is not – remember, this is XML. However, in case you HAVE used parent-child sub-categorization, this element would be opened and closed in the ‘normal way’ and contain data regarding this.
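
If you’d rather let a program read the category definitions than pick through them in an editor, here’s a rough Python sketch. It parses a cut-down copy of my snippet and lists each category’s slug, parent and display name. The namespace URI is a placeholder I made up; the real one is declared near the top of your own export file. I wrote the second wp:category_parent as an open/close pair on purpose, so you can see the parser treats it like the self-closing form.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (an assumption): use the one declared in your export file.
WP_NS = "http://example.com/wp-export/"

SAMPLE = f"""<channel xmlns:wp="{WP_NS}">
  <wp:category>
    <wp:category_nicename>motion-pictures</wp:category_nicename>
    <wp:category_parent />
    <wp:cat_name><![CDATA[ Motion Pictures ]]></wp:cat_name>
  </wp:category>
  <wp:category>
    <wp:category_nicename>pix-sells</wp:category_nicename>
    <wp:category_parent></wp:category_parent>
    <wp:cat_name><![CDATA[ Pix Sells ]]></wp:cat_name>
  </wp:category>
</channel>"""

ns = {"wp": WP_NS}
root = ET.fromstring(SAMPLE)

categories = []
for cat in root.findall("wp:category", ns):
    slug = cat.findtext("wp:category_nicename", default="", namespaces=ns)
    # An empty element yields an empty string, whichever way it is written.
    parent = cat.findtext("wp:category_parent", default="", namespaces=ns)
    # Spaces inside CDATA are part of the data, so strip them for display only.
    name = cat.findtext("wp:cat_name", default="", namespaces=ns).strip()
    categories.append((slug, parent, name))

for slug, parent, name in categories:
    print(slug, "|", parent or "(no parent)", "|", name)
```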

All you have to do now is to have a look at YOUR file, and search out the corresponding elements for tags. Given the fact that this is an XML file (and borrowing a bit from WordPress’ database structure), I won’t be surprised if they are named wp:post_tag, wp:tag_nicename, wp:tag_name – but I could be wrong. Have a look at your file. After you’ve found these, all you need to do now is to use the Replace feature in your text editor and replace the elements associated with tags with their corresponding category equivalents explained above. So you need to replace the opening and closing elements of the tags defined with the opening (<wp:category>) and closing (</wp:category>) elements of a category. THIS is where you’ll find the advanced features of Notepad++ like syntax highlighting and replacing useful. Do it for each and every corresponding element, even empty ones like <wp:category_parent />
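
The same replacement can be scripted. Here’s a Python sketch of the idea, not a tested converter: the three tag element names are only my guesses from the paragraph above, and the sample tag with its slug is made up, so check your own file and adjust the mapping first. It renames opening, closing and self-closing forms of each element and leaves the data between them untouched.

```python
import re

# Guessed tag element names (unconfirmed) mapped to the category element names.
RENAMES = {
    "wp:post_tag": "wp:category",
    "wp:tag_nicename": "wp:category_nicename",
    "wp:tag_name": "wp:cat_name",
}


def rename_elements(xml_text, renames):
    """Rename elements in raw XML text without touching their contents."""
    for old, new in renames.items():
        # Match "<old" or "</old" only when followed by whitespace, "/" or ">",
        # so a short name can never match the start of a longer one.
        pattern = re.compile(r"<(/?)" + re.escape(old) + r"(?=[\s/>])")
        xml_text = pattern.sub(lambda m, new=new: "<" + m.group(1) + new, xml_text)
    return xml_text


# Made-up sample using the guessed element names.
SAMPLE = (
    "<wp:post_tag>"
    "<wp:tag_nicename>barack-obama</wp:tag_nicename>"
    "<wp:tag_name><![CDATA[ Barack Obama ]]></wp:tag_name>"
    "</wp:post_tag>"
)

converted = rename_elements(SAMPLE, RENAMES)
print(converted)
```

I went with a pattern anchored on the angle bracket and not a blind text replace, so that a post which happens to mention one of these names in its body is left alone.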

The ones at the beginning were just for WordPress to display categories in your administration panel, dashboard, etc. Scroll down a bit, and you will find the section where all your posts begin. As this is an extended RSS file, you’ll find a lot of extra info – but what I want you to concentrate on is the place above which your post content is present. This will be present in a CDATA element, above which you’ll have the guid isPermaLink element. Above this, you will find the elements related to categories and tags. Here’s a snippet from my file.

<category>
    <![CDATA[ Reviews ]]>
</category>
<category>
    <![CDATA[ Reviews ]]>
</category>
<category>
    <![CDATA[ Tech Takes ]]>
</category>
<category>
    <![CDATA[ Tech Takes ]]>
</category>

See that? Once again, all you need to do is to check in your file the corresponding elements for tags, and use the replace feature to replace it with the corresponding category element. I think this bit here should be self-explanatory, because this is just an extension of what you did earlier with the tag / category definitions at the top. I’ll just point out two things here: each category is declared separately; and the CDATA section is repeated twice for each category.
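
Once you’ve made the replacements, a quick way to check the result is to list which categories each post ends up with. Here’s a Python sketch under a few assumptions: posts sit in item elements (as in any RSS feed), each category is a plain category element, and the post title is made up. Your own file may have more going on, such as attributes on the opening tags, so treat this as a starting point.

```python
import xml.etree.ElementTree as ET

# Cut-down, made-up post; the category names are the ones from my snippet.
SAMPLE = """<channel>
  <item>
    <title>Made-up post title</title>
    <category><![CDATA[ Reviews ]]></category>
    <category><![CDATA[ Reviews ]]></category>
    <category><![CDATA[ Tech Takes ]]></category>
    <category><![CDATA[ Tech Takes ]]></category>
  </item>
</channel>"""


def categories_per_post(xml_text):
    """Map each post title to its category names, duplicates removed."""
    root = ET.fromstring(xml_text)
    result = {}
    for item in root.iter("item"):
        title = item.findtext("title", default="(untitled)")
        names = []
        for cat in item.findall("category"):
            name = (cat.text or "").strip()
            # Each category appears twice in the file, so keep only the first.
            if name and name not in names:
                names.append(name)
        result[title] = names
    return result


summary = categories_per_post(SAMPLE)
for title, names in summary.items():
    print(title, "->", ", ".join(names))
```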

That sure sounds like a lot of work, but believe me, it isn’t. Once you identify what the corresponding elements are, all you have to do is to use the find and replace feature in your text editor to make the necessary replacement, and it’ll do it for you, no matter how many times the words appear. With some luck, you’ll have the edited file ready within 15 minutes.

As I said earlier, try this out on a test blog first. If you’ve gotten anything wrong during editing, it’s better not to have your production blog destroyed. Once you’re satisfied with the results, log in to your current blog, and delete each and every post. This is because WordPress only picks out those posts from an XML file which aren’t already there. After deleting each post (and this can take quite some time if you have many), go to Manage > Import, and choose the option related to importing from a WordPress XML file. Hope that helps.

I have a request. If you find that this works, please leave a comment telling me what the ACTUAL elements related to tags are. That way, I can update this article to reflect what replacements need to be made; and you’ll be helping fellow users like you in a similar predicament. That’s the biggest draw of using open source software like WordPress – the support from the community.

PS – Sorry for the different font et al. There was a lot of stuff that I would’ve needed to convert into HTML markup for it to display properly, so I had to use Abiword.
