
Import of Word doc - paragraph breaks are not cleanly converted

Robert E_ Schneider November 16, 2018

I have imported many documents from Word (.docx). It is the same for all:

The paragraph break that Word puts at the end of a paragraph seems to be different from Confluence's paragraph break.

There is a line break in the Confluence page wherever there used to be paragraph break in the Word document. But it is not a "real" paragraph break.

For example:

This is the result of the import:
hc_018.jpg

 

In order to make it appear right I must do this:
1. move the cursor to a line which needs a paragraph break at its end

2. press the "End" button

3. press the "Return" button to create a "real Confluence" paragraph break

4. press the "Del" button to delete the imported Word paragraph break

Repeat 1-4 for each paragraph on the page.

After this treatment, the Confluence page will look like this:
hc_020.jpg


Doing this manually is a real chore, especially as I have hundreds of pages to treat :-(

Is there a chance to do this with a Search/Replace function?

Why aren't the paragraph breaks converted correctly during the import, anyway?

Answer accepted
Bill Bailey
Rising Star
Rising Stars are recognized for providing high-quality answers to other users. Rising Stars receive a certificate of achievement and are on the path to becoming Community Leaders.
November 17, 2018

Yes, I have seen this annoyance. What is happening (I don't know why) is that the ends of paragraphs are getting a <br /> tag. Maybe this is a Confluence bug?

How I handle this is to edit the page with the confluence-source-editor and, using RegEx find, replace all <br /> tags with </p><p>.

For a long misbehaving document, I bring the entire doc into one page so I can clean it up quickly with the source editor, rather than allow Confluence to split on headings (which often don't exist in Word docs).

I think this affects docs that were properly formatted (paragraph spacing controlled by the style) and not those where people control spacing as on a typewriter (two returns).

Robert E_ Schneider November 18, 2018

OK, so I am not that far off. I tried this and it did not work well for me. But I might have done it wrong and searched for and replaced the wrong things. So, I will give it another try, unless someone else comes up with something smarter ;-)

Robert E_ Schneider November 18, 2018

As a matter of fact we're on a cloud installation. So the confluence-source-editor is not available to us.
The one that I tried was the Source Editor for Confluence, which is not free. But I guess, I'll just try it again. Search & replace should work in any editor, shouldn't it?

Bill Bailey
Rising Star
November 19, 2018

Well, you have to perform this task on the source for the page (XHTML) and not in the normal view. And the free source editor supports RegEx, which is really helpful as you will get tags with additional classes, IDs, style, etc. (even for <br />). It is a lot easier to search for <br.+?> than for all the combinations. ;-)
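
As a rough illustration of why one pattern beats listing every variant, here is a minimal sketch using Python's `re` module as a stand-in for the source editor's RegEx engine (an assumption; the editor's engine may differ in details). The sample tags and the stricter `BR_STRICT` pattern are made up for this example and are not from the thread.

```python
import re

# Pattern from the post: "<br", at least one more character, then the next ">".
BR_LOOSE = re.compile(r"<br.+?>")

# Stricter variant (an assumption, not from the thread): after "br" only
# characters other than ">" are allowed, so the match cannot run past the tag.
BR_STRICT = re.compile(r"<br\b[^>]*>")

# Hypothetical break-tag variants as they might appear in page source.
samples = [
    "<br />",
    "<br/>",
    '<br class="x" />',
]

for tag in samples:
    # Both patterns cover all three variants.
    print(tag, bool(BR_LOOSE.fullmatch(tag)), bool(BR_STRICT.fullmatch(tag)))

# A bare "<br>" has no character between "br" and ">", so the loose pattern
# keeps going until the next ">" it can find.
bare = "<br>abc</p>"
print(BR_LOOSE.search(bare).group(0))   # <br>abc</p>
print(BR_STRICT.search(bare).group(0))  # <br>
```

If the source only ever contains the self-closing form `<br />`, both patterns behave the same; the stricter one only matters when a bare `<br>` can occur.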

As far as the commercial plugin, it is cheap when you think of the hours to process really long Word docs.

Another suggestion is to set up a small server version of Confluence ($10 license, could be local to your machine). Then you have access to all the tools. You can import the doc into a space, then export and reimport to your Cloud instance.

And be sure to do a cleanup pass on the Word doc. I typically do the following:

  • Remove any title page and ToC
  • Remove any headers and footers
  • Check that all the headings are really headings (formatted with H1, H2, etc.)
  • Check the heading hierarchy (really important when splitting a doc)
  • Delete any blank headings (you can see them easily in left-hand panel in Word).
  • If tracked changes are on, accept all edits and turn off (deleted text will get imported along with comments)
  • Delete all comments (see above)

Spending a few hours on cleaning up a Word doc before importing pays YOOGE dividends.

 

Or you have to pay a professional to do it for you. ;-)

Robert E_ Schneider November 19, 2018

Thank you Bill, for your comments!

This is pretty much, what I do to those Word documents before I upload them. So, it's great to see that I am on the right track.

I believe that our company can afford the $10 that the commercial editor costs. So, if I can do the task with that, I'll be fine. I might need some support for this. I already tried search&replace with the commercial editor but I probably did something wrong which led to undesirable results...  So, I think that I'll give it another try. Watch out for me coming back for some advice ;-) Currently, the paragraph breaks are what concerns me most. So, if the commercial editor can handle that I'd be happy enough.

On the other hand - setting up my own little server might be an idea, too. Let me think about that.

Best regards

Robert

Robert E_ Schneider November 21, 2018

Bill, I now have set up my own little Confluence server.  And I installed the free Source editor that you suggested on it.

Could you please be a little more specific about using RegEx to replace the paragraph ends?

And one more question: which would be the best way to move a page from the Cloud system to my server for treatment and then back? In the page-menu on the top right I find "Export to Word" but not a "simple" export...

--Robert

Bill Bailey
Rising Star
November 21, 2018

Starting from back to front:

Transferring content

  • Cloud does make it a bit more difficult, but you should be able to select everything in the edit-mode view of the page on the server, copy it, and paste it into a new page (server to server, I would go to the source code view to do this).
  • The other way is to do all of your work in a special transfer space, then when done with your work, export out a copy of the space, then restore to the cloud. Then you can move the pages where you want.

On regex

  • Not sure how your source looks, but if they are indeed break tags, typically your block of text will start with and end with p tags, with break tags spread throughout. In that case in the find insert <br.+?> and in replace </p><p> with RegEx checked, followed by replace all.
  • Then to clean up empty paragraphs, insert <p>\s+</p> in find and nothing in replace, and then Replace All.
  • And to clean out any empty heading tags, <h.>\s+</h.> in find, nothing in replace
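
The three find/replace passes above can be sketched as a small script. This is only an illustration under assumptions: it uses Python's `re` module rather than the source editor, the function name `clean_imported_source` is hypothetical, and the sample string is invented.

```python
import re


def clean_imported_source(xhtml: str) -> str:
    """Apply the three find/replace passes from the post, in order."""
    # 1. Turn every break tag (with or without attributes) into a paragraph boundary.
    xhtml = re.sub(r"<br.+?>", "</p><p>", xhtml)
    # 2. Delete paragraphs that contain only whitespace.
    xhtml = re.sub(r"<p>\s+</p>", "", xhtml)
    # 3. Delete headings that contain only whitespace.
    xhtml = re.sub(r"<h.>\s+</h.>", "", xhtml)
    return xhtml


# Invented sample: one paragraph with a soft break, an empty paragraph, an empty heading.
sample = "<p>First paragraph.<br />Second paragraph.</p><p> </p><h2> </h2>"
print(clean_imported_source(sample))
# <p>First paragraph.</p><p>Second paragraph.</p>
```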

Boom, now you have a cleaner page. One note: the source editor has a quirk where, when you reopen it, it will show your last value for find, but it won't work. You need to delete and re-enter one character for it to activate.

 

And when you have a mo, accept my answer please. ;-)

Robert E_ Schneider November 21, 2018

Yess, that did the trick. Thanks a million, Bill!

Best regards

Robert

0 votes
Davin Studer
Rising Star
November 16, 2018

I'm seeing the same behavior in my environment as well. A workaround, though not necessarily a great one, would be instead to attach the Word document to the page and embed it with the Word macro. That does show the paragraphs and line breaks correctly.

Robert E_ Schneider November 18, 2018

Well, I am afraid that this would impose new problems with that Word adapter and anyone who wants to edit the documents must have the adapter installed on her PC. I don't consider this feasible in this organization...

Also: the documents are pretty long (500+ pages) and the import function splits them up to give me one Confluence page for each level 1 chapter. If I used the embed function I'd have to split the documents manually. That, too, looks like a bit of work that I would like to circumvent.

Robert E_ Schneider November 22, 2018

Good morning Bill,
it's me again...

Your advice works very well, but I'd like to come back to two extra questions:

1. I find that sometimes the import trims a blank after a bold word. Looks like this:

         Orientation<noblankhere>This parameter defines the appearance...

   of course it should look like this:

        Orientation This parameter defines the appearance...

In the source editor it looks like this:

        <strong>Orientation</strong>This parameter defines....

Of course I could replace "</strong>" with "</strong> ". But there are lots of </strong>s that already have a space after them and this would double the spaces. 

So my question, as I have no experience with RegExs: can I use a Find/Replace RegEx to only replace those </strong> tags that have any character (except dot, comma, semicolon) directly behind them?

2. Do you know, whether the Source Editor has a "macro" feature that would allow me to automatize the Find/Replace operations? 

3. I use Notepad++ a lot. That editor also has the capability for RegEx Find/Replaces - and I believe that it can do macros. Would it be an idea to go like this:
   a) open the document in the Cloud in the commercial editor (which obviously cannot do RegEx)
   b) copy all content into a Notepad++ file
   c) do the S/R there
   d) copy the content back to the commercial editor, wiping out the former content
Could that work?

Thanks for your patience with me!!

Robert

Robert E_ Schneider November 22, 2018

Hi Bill,
please disregard my last message.

I found that the macros in Notepad++ can help me here a lot.

Now I got my boss to buy the commercial editor's licence. So I can

  1. Open a page in the Source Editor (Cloud)
  2. cut/paste the content into a NP++ file
  3. run a macro that takes care of all my pains
  4. copy/paste the content back

That's as swiftly as I can get it :-)

As for my </strong> - Problem: I tried finding "</strong>\w" and replacing it with "</strong> \w" but that didn't work. The RegEx seems not to "store" the character it found with the "\w" tag.
So I edited my NP++ - macro (inside shortcuts.xml) and added a search/replace sequence for each character and number. That's not very elegant, but it does the trick.
But, if you have a suggestion how I could achieve this in a RegEx, I'd be very happy. As I said, I have little to no experience with RegExs

Bill Bailey
Rising Star
November 22, 2018

Yeah, the free source editor has that limitation, Not sure about the commercial version.

I have also done the same thing for more complex routines, copying back and forth between notepad++ and Confluence.

In RegEx, \w works in find, but you cannot use it in replace. And it does take a while to come up to speed on RegEx. There is an online tester to help with developing RegEx strings:

https://regex101.com/.

What you want now is a RegEx string to find the tag followed by a non space, which would be <\/strong>\S, but that also replaces the first non-whitespace character. So what you want to do is go ahead and insert the redundant space, THEN replace all multiple spaces after the tags with a single space  -- something like <\/strong>\s\s+ in find (regex) and "</strong> " in replace (that is the space after >). Boom, problem solved.
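
A minimal sketch of the two-pass approach described above, written with Python's `re` module (an assumption; the thread uses editor find/replace). For comparison it also shows a lookahead variant, which is not from the thread: `(?=\w)` checks the next character without consuming it, so nothing after the tag is replaced. Whether a particular editor's engine supports lookahead needs checking. The function names and the sample text are made up.

```python
import re


def two_pass(xhtml: str) -> str:
    """The approach from the post: add a space everywhere, then collapse extras."""
    # Pass 1 (plain replace): put a space after every closing strong tag.
    xhtml = xhtml.replace("</strong>", "</strong> ")
    # Pass 2 (regex): two or more whitespace characters after the tag become one space.
    return re.sub(r"</strong>\s\s+", "</strong> ", xhtml)


def lookahead_fix(xhtml: str) -> str:
    """Alternative sketch: insert a space only when a word character follows directly."""
    return re.sub(r"</strong>(?=\w)", "</strong> ", xhtml)


sample = (
    "<strong>Orientation</strong>This parameter. "
    "<strong>Size</strong> is fine. "
    "<strong>End</strong>."
)
print(two_pass(sample))
print(lookahead_fix(sample))
```

On this sample the two-pass version also puts a space between the last `</strong>` and the full stop, while the lookahead version leaves punctuation untouched, which is closer to the "except dot, comma, semicolon" requirement in the question.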

Robert E_ Schneider November 23, 2018

Great stuff! Thank you very much, Bill!

This has saved me a couple of tons of hours. I am down to a few seconds per page, now.

So, the combination of the commercial editor and Notepad++ - macros really helps.

Thank you sooo much!

Best regards

Robert

Robert E_ Schneider December 18, 2018

Hi Bill,

it's me again. Meanwhile I have built a macro in Notepad++ that does all the regex-conversions. So cleaning up is a three step process for me now:
1. Open the commercial editor and cut the complete content
2. Open a Notepad++ window and press CTRL+SHIFT+C (this inserts the clipboard and runs my macro)
3. Cut and paste the content back from the Notepad++ window into the editor

That's nice! And it works for many documents, let's say about 4 out of 5.

But sometimes I get a syntax error when I save the document in the editor. I tried to fix these manually, but it just takes me from one error to the next. 

To me it looks as if it is the same error all the time:

"Error validating XHTML:
Error parsing xhtml: Unexpected close tag </p>; expected </span>. at [row,col {unknown-source}]: [1506,91]"

(with different line/column numbers, of course ;-) )

These are lines 1505 to 1507 of this particular source (the error is at positions 91 and 92 of the middle line):

<p><span style="color: rgb(0,47,90);">14=Bitmap</span></p>
<h2><span style="color: rgb(0,47,90);">Grafiken f&uuml;r directfax verf&uuml;gbar machen</p><p></span></h2>
<p>Eine neue DirectFax-Grafik k&ouml;nnen Sie wie folgt erstellen:</p>

From what I understand, it looks like the order of some </p> and </span> tags gets disturbed.
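
To locate this kind of nesting problem before pasting the source back, a small checker can help. The following is a sketch only, assuming Python's standard `html.parser` is an acceptable stand-in for Confluence's XHTML validator (it is more lenient and is not the same parser). The class and function names are hypothetical, and the set of void tags is deliberately short.

```python
from html.parser import HTMLParser

# Tags that never have a separate closing tag (short list, assumed sufficient here).
VOID_TAGS = {"br", "hr", "img"}


class NestingChecker(HTMLParser):
    """Track open tags on a stack and record every close tag that does not match."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.errors = []  # tuples of (line, column, found_close_tag, expected_close_tag)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        expected = self.stack[-1] if self.stack else None
        if tag == expected:
            self.stack.pop()
        else:
            line, col = self.getpos()
            self.errors.append((line, col, tag, expected))


def find_nesting_errors(xhtml: str):
    """Return the mismatched close tags; tags left open at the end are not reported."""
    checker = NestingChecker()
    checker.feed(xhtml)
    checker.close()
    return checker.errors


# The offending line quoted above.
bad = (
    '<h2><span style="color: rgb(0,47,90);">Grafiken f&uuml;r directfax '
    "verf&uuml;gbar machen</p><p></span></h2>"
)
for line, col, found, expected in find_nesting_errors(bad)[:1]:
    print(f"line {line}: unexpected </{found}>, expected </{expected}>")
```

Only the first reported error is reliable; once the stack is out of step, later mismatches tend to be follow-on errors, which fits the experience of fixing one error only to land on the next.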

I would like to upload the source but I cannot find an "upload" button around here. So I put the macro-code and a before- and an after-macro source into my Google drive. You can find it here: https://drive.google.com/open?id=1668n2s66BPFn3En0MLL8PfFdI5rYdehQ

May I ask you to cast a look at them and tell me how I can refine my macro to avoid that syntax error? I'd really appreciate that!

Bill Bailey
Rising Star
December 19, 2018

Sure, I wonder if you are clearing out all tags with your regex. Does your regex account for class names, for example? If it doesn't, you could end up with open tags.

As an example, the following regex will clear out all span tags, both opening and closing with any number of attributes, in one go:

</?span.*?>
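
A short sketch of what this pattern does, using Python's `re` module as a stand-in for the editor (an assumption). The helper name `strip_spans` is hypothetical; the sample line is the first of the three source lines quoted earlier in the thread.

```python
import re

# Pattern from the post: optional "/", "span", then anything up to the next ">".
SPAN_TAG = re.compile(r"</?span.*?>")


def strip_spans(xhtml: str) -> str:
    """Remove opening and closing span tags but keep the text between them."""
    return SPAN_TAG.sub("", xhtml)


sample = '<p><span style="color: rgb(0,47,90);">14=Bitmap</span></p>'
print(strip_spans(sample))
# <p>14=Bitmap</p>
```

Note that the inline styling carried by the span (here the colour) disappears together with the tag.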

 If you are getting this error with ONLY p tags, it could be that you are not handling br tags properly (some have a class name). Are you finding <br.+?> and replacing with </p><p>?

Once I hear back, I will take a look at the code.

Robert E_ Schneider December 19, 2018

Hi Bill,

during our last conversation, you gave me a couple of S/R to do with regex and I put them in a NP++ macro

These are:

S: <br />        R: <br/>        (remove the blank between "br" and the slash)

S: <br.+?>       R: </p><p>     (yes, I do)

S: <p>\s+</p>    R: nothing

S: <h.>/s+</h.>   R: nothing

S: <\/strong>\s\s+    R: "</strong> "    (one blank after the closing >)

That's all I do.

I do not touch any <span> or </span> in this macro.
And yes, as much as I can see, it keeps tripping over p-tags in conjunction with span in some way.

Bill Bailey
Rising Star
December 19, 2018

Is <h.>/s+</h.> a typo? It should be

<h.>\s+</h.>

 For a slicker version that deletes both empty p and empty h tags, try this:

<[ph]\d*>\s+<\/[ph]\d*>
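
One detail worth checking in this pattern is the `\s+`: it requires at least one whitespace character, so a pair with nothing at all between the tags, like the `<p></p>` reported further down in this thread, is not matched. The sketch below uses Python's `re` module (an assumption; the thread works in an editor) and compares the pattern with a `\s*` variant that is not from the thread. The sample string is invented.

```python
import re

# Pattern from the post: empty p or h tags containing at least one whitespace character.
EMPTY_WITH_WS = re.compile(r"<[ph]\d*>\s+</[ph]\d*>")

# Variant (an assumption, not from the thread): zero or more whitespace characters.
EMPTY_ANY = re.compile(r"<[ph]\d*>\s*</[ph]\d*>")

sample = "<h2> </h2><p>\n</p><p></p><p>Text</p>"

print(EMPTY_WITH_WS.sub("", sample))  # <p></p><p>Text</p>
print(EMPTY_ANY.sub("", sample))      # <p>Text</p>
```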

 But looking at the code, here is the offending structure:

<h2><span style="color: rgb(0,47,90);">Grafiken f&uuml;r directfax verf&uuml;gbar machen<br /></span></h2>

 Here the break tag should be just removed, rather than replaced (I wish people would stop using Word like a typewriter ;-)

So we need a routine to find these first:

<h.>.+?<br.+?>.+?<\/h.>

Now that you can find them, you can manually massage the code before going on.

BTW, you do not need to remove the blank between br and /. And I would strongly suggest you remove all the span tags -- they will interfere with the Confluence CSS (plus they clutter the code).

Robert E_ Schneider December 20, 2018

Hi Bill,

thank you for your advice!

you are right: <h.>/s+</h.>  is a typo. I am actually searching for the correct string: <h.>\s+</h.> 

About removing the span tags: these control the colouring etc. Wouldn't I remove that along with the spans, too? What would be the S/R strings?

I cannot find the sequence <h.>.+?<br.+?>.+?<\/h.> in my cleaned up code (after having run the macro).

But when I search in the original source, before running the macro, I can find it. In my example page it is found twice (see 01 - Original source before cleanup-macro-run.txt on the Google drive).

So: should I start by replacing <h.>.+?<br.+?>.+?<\/h.> by something before I do all the other S/Rs? By what shall I replace it?

Bill Bailey
Rising Star
December 20, 2018

Hello Robert,

I suggest removing the span tags as they override Confluence styling. Basically it is a philosophical/best-practices issue. In order to keep the formatting of documents consistent within Confluence, I remove these low-level formatting overrides. If you need some special formatting, I believe it is better to wrap it in a user macro. Then as opinions change, you just have to update the CSS file, and boom, all instances are changed.

And yes, you should find <h.>.+?<br.+?>.+?<\/h.> before replacing br tags. I am not aware of any RegEx that you could use that would only highlight these br tags in heading tags. There is lookahead syntax, but it is very restrictive - or I am not smart enough to know how to write that expression.

So unfortunately, you have to run this find manually. OR have to instruct authors using Word to not insert soft returns in headings to control formatting (probably won't happen)
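
Outside the editor's find/replace, a script can do this in two levels: one pattern matches each whole heading, and a callback removes the break tags inside that match only. The following is a sketch under assumptions, not something from the thread: it uses Python's `re` module, it presumes the page source can be copied out and back (as in the Notepad++ workflow above), and the name `drop_breaks_in_headings` is hypothetical.

```python
import re

# A whole heading element, h1 to h6, from its opening tag to the matching close tag.
HEADING = re.compile(r"<(h[1-6])\b[^>]*>.*?</\1>", re.DOTALL)
# Any break tag, with or without attributes.
BREAK = re.compile(r"<br\b[^>]*>")


def drop_breaks_in_headings(xhtml: str) -> str:
    """Delete break tags that sit inside headings; leave all other break tags alone."""
    return HEADING.sub(lambda match: BREAK.sub("", match.group(0)), xhtml)


# First line: the offending heading quoted in this thread. Second line: invented paragraph.
sample = (
    '<h2><span style="color: rgb(0,47,90);">Grafiken f&uuml;r directfax '
    "verf&uuml;gbar machen<br /></span></h2>\n"
    "<p>Line one<br />Line two</p>"
)
print(drop_breaks_in_headings(sample))
```

The break inside the h2 is removed, while the one inside the paragraph stays in place for the later replacement with `</p><p>`.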

Robert E_ Schneider December 21, 2018

Hello Bill,

thank you for your advice, again!

Meanwhile I did a couple of pages "manually". That is: I first ran my normal process and then, when I stored the page I received the error message. I then looked up the places that were marked as offending and I found this:
Each time it looked like this: <p></p></strong>.

I manually deleted the <p></p> and that put everything back into order.

So I wonder if I can just run this "normal" (not regex) S/R as the last step in my NP++ - macro:
S: <p></p></strong>    R: </strong>

Well, I'll check it out and come back, unless you already have a suggestion now.

Bill Bailey
Rising Star
December 21, 2018

Wouldn't running a Regex to delete empty p tags take care of it?

Run things maybe in this order:

  1. Manually find the br tags in heading tags and delete
  2. Search and delete all span tags
  3. Then S/R br tags with </p><p>
  4. Search for and delete all empty p tags (and maybe h tags)
