I have imported many documents from Word (.docx). It is the same for all of them:
The paragraph break that Word uses at the end of a paragraph seems to be different from Confluence's paragraph mark.
There is a line break in the Confluence page wherever there used to be a paragraph break in the Word document. But it is not a "real" paragraph break.
For example:
This is the result of the import:
In order to make it appear right I must do this:
1. move the cursor to a line which needs a paragraph break at its end
2. press the "End" button
3. press the "Return" button to create a "real Confluence" paragraph break
4. press the "Del" button to delete the imported Word paragraph break
Repeat 1-4 for each paragraph on the page.
After this treatment, the Confluence page will look like this:
Doing this manually is a real chore. Especially as I have hundreds of pages to treat :-(
Is there a chance to do this with a Search/Replace function?
Why aren't the paragraph breaks converted correctly during the import, anyway?
Yes, I have seen this annoyance. What is happening (I don't know why) is that the ends of paragraphs are getting a <br /> tag. Maybe this is a Confluence bug?
I handle this by opening the page in the confluence-source-editor and using RegEx find to replace all <br /> tags with </p><p>.
If you have a long misbehaving document, I bring in the entire doc to one page so I can then clean it up quickly with the source editor, rather than allow Confluence to split on headings (which often don't exist in Word docs).
I think this affects docs that were properly formatted (paragraph spacing controlled by the style) and not docs where people control spacing as on a typewriter (two returns).
OK, so I am not that far off. I tried this and it did not work well for me. But I might have done it wrongly, searched and replaced the wrong things. So, I will give it another try. Lest someone else comes up with something smarter ;-)
As a matter of fact we're on a cloud installation. So the confluence-source-editor is not available to us.
The one that I tried was the Source Editor for Confluence, which is not free. But I guess, I'll just try it again. Search & replace should work in any editor, shouldn't it?
Well, you have to perform this task on the source for the page (XHTML) and not in the normal view. And the free source editor supports RegEx, which is really helpful as you will get tags with additional classes, IDs, styles, etc. (even for <br />). It is a lot easier to search for <br.+?> than for all the combinations. ;-)
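For anyone who would rather script this step than use an editor's find/replace dialog, here is a minimal Python sketch of the same replacement. It is an illustration only, not something the source editor provides. The pattern `<br\b[^>]*>` is a slightly tighter variant of `<br.+?>` (an assumption on my part, so that a bare `<br>` is handled too), and the sample markup with its class name is made up.

```python
import re

# Tighter variant of the <br.+?> pattern: matches <br>, <br/>, <br /> and
# <br class="..."/>, and never runs past the tag's own closing bracket.
BR_TAG = re.compile(r"<br\b[^>]*>", re.IGNORECASE)


def breaks_to_paragraphs(xhtml: str) -> str:
    """Replace every line-break tag with a paragraph boundary."""
    return BR_TAG.sub("</p><p>", xhtml)


# Hypothetical page source; the class name is only an example.
sample = '<p>First paragraph<br class="example" />Second paragraph<br/>Third</p>'
print(breaks_to_paragraphs(sample))
# <p>First paragraph</p><p>Second paragraph</p><p>Third</p>
```

This only works cleanly when the break sits directly inside a `<p>` element; breaks nested in other tags need separate handling.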
As far as the commercial plugin, it is cheap when you think of the hours to process really long Word docs.
Another suggestion is to set up a small server version of Confluence ($10 license, could be local to your machine). Then you have access to all the tools. You can import the doc into a space, then export and reimport to your Cloud instance.
And be sure to do a cleanup pass on the Word doc. I typically do the following:
Spending a few hours on cleaning up a Word doc before importing pays YOOGE dividends.
Or you have to pay a professional to do it for you. ;-)
Thank you Bill, for your comments!
This is pretty much, what I do to those Word documents before I upload them. So, it's great to see that I am on the right track.
I believe that our company can afford the $10 that the commercial editor costs. So, if I can do the task with that, I'll be fine. I might need some support for this. I already tried search & replace with the commercial editor, but I probably did something wrong, which led to undesirable results... So, I think that I'll give it another try. Watch out for me coming back for some advice ;-) Currently, the paragraph breaks are what concerns me most. So, if the commercial editor can handle that, I'd be happy enough.
On the other hand - setting up my own little server might be an idea, too. Let me think about that.
Best regards
Robert
Bill, I now have set up my own little Confluence server. And I installed the free Source editor that you suggested on it.
Could you please be a little more specific about using RegEx to replace the paragraph ends?
And one more question: which would be the best way to move a page from the Cloud system to my server for treatment and then back? In the page-menu on the top right I find "Export to Word" but not a "simple" export...
--Robert
Starting from back to front:
Transferring content
On regex
Boom, now you have a cleaner page. One note: the source editor has a quirk where, when you reopen it, it will show your last value for find, but it won't work. You need to delete and re-enter one character for it to activate.
And when you have a mo, accept my answer please. ;-)
Yess, that did the trick. Thanks a million, Bill!
Best regards
Robert
I'm seeing the same behavior in my environment as well. A workaround, though not necessarily a great one, would be instead to attach the Word document to the page and embed it with the Word macro. That does show the paragraphs and line breaks correctly.
Well, I am afraid that this would impose new problems with that Word adapter and anyone who wants to edit the documents must have the adapter installed on her PC. I don't consider this feasible in this organization...
Also: the documents are pretty long (500+ pages) and the import function splits them up to give me one Confluence page for each level 1 chapter. If I used the embed function, I'd have to split the documents manually. That, too, looks like a bit of work that I would like to avoid.
Good morning Bill,
it's me again...
Your advice works very well, but I'd like to come back to two extra questions:
1. I find that sometimes the import trims a blank after a bold word. It looks like this:
Orientation<noblankhere>This parameter defines the appearance...
of course it should look like this:
Orientation This parameter defines the appearance...
In the source editor it looks like this:
<strong>Orientation</strong>This parameter defines....
Of course I could replace "</strong>" with "</strong> ". But there are lots of </strong>s that already have a space after them and this would double the spaces.
So my question, as I have no experience with RegExs: can I use a Find/Replace RegEx to only replace those </strong> tags that have any character (except dot, comma, semicolon) directly behind them?
2. Do you know whether the Source Editor has a "macro" feature that would allow me to automate the Find/Replace operations?
3. I use Notepad++ a lot. That editor also has the capability for RegEx Find/Replaces - and I believe that it can do macros. Would it be an idea to go like this:
a) open the document in the Cloud in the commercial editor (which obviously cannot do RegEx)
b) copy all content into a Notepad++ file
c) do the S/R there
d) copy the content back to the commercial editor, wiping out the former content
Could that work?
Thanks for your patience with me!!
Robert
Hi Bill,
please disregard my last message.
I found that the macros in Notepad++ can help me here a lot.
Now I got my boss to buy the commercial editor's licence. So I can
That's as swiftly as I can get it :-)
As for my </strong> - Problem: I tried finding "</strong>\w" and replacing it with "</strong> \w" but that didn't work. The RegEx seems not to "store" the character it found with the "\w" tag.
So I edited my NP++ - macro (inside shortcuts.xml) and added a search/replace sequence for each character and number. That's not very elegant, but it does the trick.
But, if you have a suggestion how I could achieve this in a RegEx, I'd be very happy. As I said, I have little to no experience with RegExs
Yeah, the free source editor has that limitation. Not sure about the commercial version.
I have also done the same thing for more complex routines, copying back and forth between notepad++ and Confluence.
In Regex, it works for find, but you cannot use it for replace. And it does take a while to come up to speed on it. There is an online tester to help with developing RegEx strings:
What you want now is a RegEx string to find the tag followed by a non space, which would be <\/strong>\S, but that also replaces the first non-whitespace character. So what you want to do is go ahead and insert the redundant space, THEN replace all multiple spaces after the tags with a single space -- something like <\/strong>\s\s+ in find (regex) and "</strong> " in replace (that is the space after >). Boom, problem solved.
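As a rough illustration outside the source editor, here is how both approaches could look in Python. The first function follows the two-pass idea above. The second uses a lookahead, which checks the next character without consuming it, so nothing gets swallowed by the replacement. Excluding `<` in the lookahead is my own assumption, made so that no space is added before a following tag. As far as I know, Notepad++'s regular-expression mode also accepts lookaheads, but I have not verified this against the macro described in this thread.

```python
import re


def two_pass(xhtml: str) -> str:
    """Two-pass approach: insert a space everywhere, then collapse runs."""
    padded = xhtml.replace("</strong>", "</strong> ")
    return re.sub(r"</strong>\s\s+", "</strong> ", padded)


# Lookahead: a </strong> directly followed by something other than
# whitespace, a dot, comma, semicolon, or the start of another tag.
MISSING_SPACE = re.compile(r"</strong>(?=[^\s.,;<])")


def add_space_after_strong(xhtml: str) -> str:
    """Single-pass approach: only touch tags that lack a following space."""
    return MISSING_SPACE.sub("</strong> ", xhtml)


sample = "<p><strong>Orientation</strong>This parameter defines the appearance...</p>"
print(add_space_after_strong(sample))
# <p><strong>Orientation</strong> This parameter defines the appearance...</p>
```

One difference worth knowing: the two-pass version also puts a space between `</strong>` and a following dot or comma, while the lookahead version leaves punctuation alone.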
Great stuff! Thank you very much, Bill!
This has saved me a couple of tons of hours. I am down to a few seconds per page, now.
So, the combination of the commercial editor and Notepad++ - macros really helps.
Thank you sooo much!
Best regards
Robert
Hi Bill,
it's me again. Meanwhile I have built a macro in Notepad++ that does all the regex-conversions. So cleaning up is a three step process for me now:
1. Open the commercial editor and cut the complete content
2. Open a Notepad++ window and press CTRL+SHIFT+C (this inserts the clipboard and runs my macro)
3. Cut and paste the content back from the Notepad++ window into the editor
That's nice! And it works for many documents, let's say about 4 out of 5.
But sometimes I get a syntax error when I save the document in the editor. I tried to fix these manually, but it just takes me from one error to the next.
To me it looks as if it is the same error all the time:
"Error validating XHTML x :
Error parsing xhtml: Unexpected close tag
</p>; expected </span>. at [row,col {unknown-source}]: [1506,91]"
(with different line/column numbers, of course ;-) )
These are lines 1505 to 1507 of this particular source; I highlighted positions 91 and 92:
<p><span style="color: rgb(0,47,90);">14=Bitmap</span></p>
<h2><span style="color: rgb(0,47,90);">Grafiken für directfax verfügbar machen</p><p></span></h2>
<p>Eine neue DirectFax-Grafik können Sie wie folgt erstellen:</p>
From what I understand, it looks like the order in which some </p> and </span> tags appear gets disturbed.
I would like to upload the source but I cannot find an "upload" button around here. So I put the macro-code and a before- and an after-macro source into my Google drive. You can find it here: https://drive.google.com/open?id=1668n2s66BPFn3En0MLL8PfFdI5rYdehQ
May I ask you to cast a look at them and tell me how I can refine my macro to avoid that syntax error? I'd really appreciate that!
Sure. I wonder if you are clearing out all tags with your regex. Does your regex account for class names, for example? If it doesn't, you could end up with open tags.
As an example, the following regex will clear out all span tags, both opening and closing with any number of attributes, in one go:
</?span.*?>
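To show what that pattern does, here is a minimal Python sketch applied to one of the lines quoted earlier in this thread. It is an illustration, not the source editor's own behavior. Note that `.` does not match a newline by default, so a span tag whose attributes wrap across lines would be missed.

```python
import re

# Opening and closing span tags, with or without attributes.
SPAN_TAG = re.compile(r"</?span.*?>")


def strip_spans(xhtml: str) -> str:
    """Remove span tags but keep the text they wrap."""
    return SPAN_TAG.sub("", xhtml)


sample = '<p><span style="color: rgb(0,47,90);">14=Bitmap</span></p>'
print(strip_spans(sample))
# <p>14=Bitmap</p>
```

Because opening and closing tags are removed in the same pass, the remaining tags stay balanced as long as the spans were balanced to begin with.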
If you are getting this error with ONLY p tags, it could be that you are not handling br tags properly (some have a class name). Are you finding <br.+?> and replacing it with </p><p>?
Once I hear back, I will take a look at the code.
Hi Bill,
during our last conversation, you gave me a couple of S/R to do with regex and I put them in a NP++ macro
These are:
S: <br /> R: <br/> (remove the blank between "br" and the slash)
S: <br.+?> R: </p><p> (yes, I do)
S: <p>\s+</p> R: nothing
S: <h.>/s+</h.> R: nothing
S: <\/strong>\s\s+ R: "</strong> " (one blank after the closing >)
That's all I do.
I do not touch any <span> or </span> in this macro.
And yes, as much as I can see, it keeps tripping over p-tags in conjunction with span in some way.
Is <h.>/s+</h.> a typo? It should be
<h.>\s+</h.>
For a slicker version that deletes empty p and h tags in one go, try this:
<[ph]\d*>\s+<\/[ph]\d*>
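A small Python sketch of the same idea, as an illustration only. The pattern above accepts any p or h closing tag after any p or h opening tag, so the variant here uses a backreference (`\1`) to require that the closing tag matches the opening one. That stricter form is my own suggestion and the sample strings are made up.

```python
import re

# Pattern as given above: opening and closing tag are matched independently.
LOOSE = re.compile(r"<[ph]\d*>\s+<\/[ph]\d*>")

# Stricter variant: \1 forces the closing tag to repeat the opening tag's name.
EMPTY_BLOCK = re.compile(r"<(p|h[1-6])>\s+</\1>")


def drop_empty_blocks(xhtml: str) -> str:
    """Delete p and h1..h6 elements that contain only whitespace."""
    return EMPTY_BLOCK.sub("", xhtml)


sample = "<h2> </h2><p>\n</p><p>Keep me</p>"
print(drop_empty_blocks(sample))
# <p>Keep me</p>
```

On well-formed pages both patterns should behave the same; the stricter one simply refuses to delete a mismatched pair such as `<p> </h2>`.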
But looking at the code, here is the offending structure:
<h2><span style="color: rgb(0,47,90);">Grafiken für directfax verfügbar machen<br /></span></h2>
Here the break tag should be just removed, rather than replaced (I wish people would stop using Word like a typewriter ;-)
So we need a routine to find these first:
<h.>.+?<br.+?>.+?<\/h.>
Now that you can find them, you can manually massage the code before going on.
BTW, you do not need to remove the blank between br and /. And I would strongly suggest you remove all the span tags -- they will interfere with the Confluence CSS (plus they clutter the code).
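If scripting is an option, the manual step could be avoided by handling headings first. The Python sketch below is an assumption-laden illustration, not a feature of either editor: it finds each heading, deletes the br tags inside it, and only then converts the remaining br tags to paragraph boundaries. The first sample line is the offending structure quoted above; the second is a made-up paragraph.

```python
import re

# A whole heading element, h1..h6, including everything inside it.
HEADING = re.compile(r"<(h[1-6])\b[^>]*>.*?</\1>", re.DOTALL)
BR_TAG = re.compile(r"<br\b[^>]*>")


def convert_breaks(xhtml: str) -> str:
    """Drop br tags inside headings, turn all others into </p><p>."""
    without_heading_breaks = HEADING.sub(
        lambda match: BR_TAG.sub("", match.group(0)), xhtml
    )
    return BR_TAG.sub("</p><p>", without_heading_breaks)


sample = (
    '<h2><span style="color: rgb(0,47,90);">'
    "Grafiken für directfax verfügbar machen<br /></span></h2>"
    "<p>Line one<br />Line two</p>"
)

# Blind replacement reproduces the broken nesting from the error message.
print(BR_TAG.sub("</p><p>", sample))
# Heading-aware replacement keeps the heading intact.
print(convert_breaks(sample))
```

The callback passed to `HEADING.sub` is what a plain find/replace dialog cannot express: it runs a second substitution only on the text of each heading match.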
Hi Bill,
thank you for your advice!
you are right: <h.>/s+</h.> is a typo. I am actually searching for the correct string: <h.>\s+</h.>
About removing the span tags: these control the colouring etc. Wouldn't I remove that along with the spans, too? What would be the S/R strings?
I cannot find the sequence <h.>.+?<br.+?>.+?<\/h.> in my cleaned up code (after having run the macro).
But when I search in the original source, before running the macro, I can find it. In my example page it is found twice (see 01 - Original source before cleanup-macro-run.txt on the Google drive).
So: should I start by replacing <h.>.+?<br.+?>.+?<\/h.> by something before I do all the other S/Rs? By what shall I replace it?
Hello Robert,
I suggest removing the span tags as they override Confluence styling. Basically it is a philosophical/best-practices issue. In order to keep the formatting of documents consistent within Confluence, I remove these low-level formatting overrides. If you need some special formatting, I believe it is better to wrap it in a user macro. Then as opinions change, you just have to update the CSS file, and boom, all instances are changed.
And yes, you should find <h.>.+?<br.+?>.+?<\/h.> before replacing br tags. I am not aware of any RegEx that you could use that would only highlight these br tags in heading tags. There is lookahead syntax, but it is very restrictive, or I am not smart enough to know how to write that expression.
So unfortunately, you have to run this find manually, OR you have to instruct authors using Word not to insert soft returns in headings to control formatting (probably won't happen).
Hello Bill,
thank you for your advice, again!
Meanwhile I did a couple of pages "manually". That is: I first ran my normal process and then, when I stored the page I received the error message. I then looked up the places that were marked as offending and I found this:
Each time it looked like this: <p></p></strong>.
I manually deleted the <p></p> and that put everything back into order.
So I wonder if I can just run this "normal" (not regex) S/R as the last step in my NP++ - macro:
S: <p></p></strong> R: </strong>
Well, I'll check it out, and come back, lest you already have a suggestion now.
Wouldn't running a Regex to delete empty p tags take care of it?
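One caveat, sketched here in Python as an illustration: the empty-tag patterns used earlier in this thread require at least one whitespace character (`\s+`), so they would not match a completely empty `<p></p>`. Changing the quantifier to `\s*` covers both cases. The sample line is made up to mirror the `<p></p></strong>` structure reported above.

```python
import re

EMPTY_WITH_WS = re.compile(r"<p>\s+</p>")  # needs at least one whitespace char
EMPTY_ANY = re.compile(r"<p>\s*</p>")      # also matches <p></p>

# Hypothetical line with a stray empty paragraph before </strong>.
sample = "<p><strong>Label<p></p></strong> text</p>"

print(EMPTY_WITH_WS.sub("", sample))
# <p><strong>Label<p></p></strong> text</p>
print(EMPTY_ANY.sub("", sample))
# <p><strong>Label</strong> text</p>
```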
Run things maybe in this order: