
Stripping the leading '/' from hyperlinks

PostPosted: Wed Feb 12, 2003 3:19 am
by Calum
hello. i didn't know where to put this question, so here it is.

Now my university has a website and i am required to read rather a lot of it for my course. i can either read it all online, print it all out (!) which would be a real pain in the bum since every page is only about 2 paragraphs long, or use the offline version they provide. the offline version comes as a set of *.exe files, but that's not the problem: they are winzip self-extractors and can be opened with 'unzip' no problem. the real problem is this: i noticed last night that every single hyperlink in the offline version has a leading '/' on it! so for example, if i am on the page /home/calum/T171/index.cfm and i click a link, instead of going to /home/calum/T171/calendar.cfm it will attempt to go to file:///T171/calendar.cfm (which does not exist). the only workaround i have thought of is to make / the site root directory, which i am obviously not going to do.

I thought maybe this was some incompatibility with my windoid university, but no, i tried it here on win2000 at work too and got the same problem. i also tried to wget the online version, but the site requires me to log in via a whole bunch of javascript and secure webpages and all that crap, and wget just times out because i can't authenticate myself to it.

Any idea how i can strip the leading '/' from the hyperlinks while leaving every other '/' undamaged, and not ruin the image files and so on?

ta.

PostPosted: Wed Feb 12, 2003 3:43 am
by Void Main
Do you have a sample page? I wonder if my search/replace script would be of any use?

http://voidmain.is-a-geek.net/files/sr

PostPosted: Wed Feb 12, 2003 6:01 am
by Calum
that script looks good to me, however as your comment says, if i replace '/' with ' ' then it will literally remove every '/' in the files (won't it?), so that might end up worse.

i put a copy of one of the html pages at http://calumsmusic.netfirms.com/sample.zip - it's inside a .zip file because if i upload plain text files to netfirms it changes them and puts adverts in and stuff. Only the sample text page is in that zip file, none of the pictures or stylesheets etc., they are all included in the offline stuff i downloaded, but their paths all have the leading '/' (which makes them absolute rather than relative, no matter how you cut it), so none of the images, stylesheets or links work.

I have logged this with the university too, but i expect that if they have any help to give it will simply be that they fix the links and tell me to download it again afterwards, and i'd rather see it sorted myself (because i'm a funny bugger like that).

PostPosted: Wed Feb 12, 2003 6:26 am
by Void Main
This will work:

Code: Select all
perl -i -p -e "s/href=\"\//href=\"/g;" *.cfm


Which will search for all occurrences of:

href="/

and replace it with:

href="

in all *.cfm files.

PostPosted: Wed Feb 12, 2003 7:31 am
by Calum
you're right! i should have thought of that sooner. i did think of it myself, but you had helpfully posted in the meantime!
thanks again, void main.

PostPosted: Thu Feb 13, 2003 5:42 pm
by Calum
okay, as i said i put this query to the university and they came up with the following:

me: i downloaded the module one offline archive following the instructions at http://nameofsite but i find an odd problem.
All the links in the offline pages have a leading '/', which means that whenever you click a link it points to a location that doesn't exist (i.e. you get an error), and this breaks all the internal links.

reply one: It is one of those nice things that Windows based web servers / web editors love to do, that frequently mucks up people when they copy stuff over onto Unix based servers / non MS servers like Apache. Even Apache on W2K doesn't like that style.

me: I tried this at work on windows 2000 as well and had the same problem. Has anybody else noticed this? I wonder if i am missing something obvious...

reply one: This bit should be solved by making sure you follow instructions step 5;
Click 'Unzip'. The file will be unzipped and copied to a folder on your hard disk called c:\T171.
If the files are in this folder on T171 it should work - certainly last years did on w2k, though you might have to then refer to the T171 FAQ about cfm files to solve the 'I clicked on index.cfm and it didn't work' problem.
Note that it will work through 'file open' on the browser, or the rename mentioned above, or even via IIS web server on w2k so long as the root directory filename is T171. It won't work with many other web servers on w2k.

reply two (addressing my tutor): Which browser is he using? There were reports that the offline version of TU170 didn't work well with Netscape 4.7 & Netscape 7, or with Opera. It seems that IE (all versions) & Netscape 6 (as supplied on the Applications CD) were the most successful. I imagine the T171 offline version is built on the same principles.
A change of browser might do the trick for him.

basically, i didn't notice before, but the instructions for this offline version state that you must put the folder 't171' within your 'C:\' directory, so that it's all in 'c:\t171\', which i think is really messy; there's no reason to make these links absolute. so the answer is that i have to put the entire t171 folder containing the whole offline site into my root directory. not so good in my opinion, pretty dumb in fact, so i am going to try to figure out how to strip the leading '/' myself and then post the results for my fellow students to see. as far as i can see, the following four commands must be given within every single directory in the offline site (in order to catch upper- and lowercase tags, and to get the pictures working too).

Code: Select all
perl -i -p -e "s/HREF=\"\/t171\//href=\"/g;" *.cfm
perl -i -p -e "s/href=\"\/t171\//href=\"/g;" *.cfm
perl -i -p -e "s/src=\"\/t171\//src=\"/g;" *.cfm
perl -i -p -e "s/SRC=\"\/t171\//SRC=\"/g;" *.cfm

this removes the leading '/t171/' from addresses (which is what i want, or else the relative addresses end up with two occurrences of 't171/' next to each other).
This isn't too easy to do by hand since there are a *lot* of subdirectories. i wonder if there is an easy way to make these commands recurse automatically? i am sure there is, but knowing nothing about perl, it just isn't obvious to me. should it be an if loop?

PostPosted: Thu Feb 13, 2003 7:04 pm
by Void Main
One of many ways:
Code: Select all
find -name '*.cfm' | xargs perl -i -p -e "s/\/t171\///g;"


But of course none of this would be necessary if the HTML were properly formatted in the first place. Also, if the files were downloaded with "wget" there are options that can automatically convert those hard-coded paths to relative ones; see "--convert-links" in the wget man page, among other options for manipulating this stuff.

At least they didn't hard code the server name into the URL like many novice web programmers do.

PostPosted: Thu Feb 13, 2003 11:26 pm
by X11
"top cool"

PostPosted: Fri Feb 14, 2003 2:59 am
by Calum
you know what i realised this morning?

the paths are all absolute to / so just stripping the prefix will break the internal links of most of the more deeply nested pages.
if a file is in ~/t171/courses/module1/ and i change all the links beginning with '/t171/', then those links will still be invalid, since '/t171/' should really be replaced with '../../'. that means if i write a script it will need to count how many subdirectories deep each page is and insert the correct number of '../'s. :( in effect i want the links to resolve relative to the top directory of the site (called 't171'), so i can move that folder around and all its contents will still work together with no problems.

it's all getting a little complicated i think. i will try wget next and if that fails, i will give in and stick it in my root directory :x however it would be nice to write a perl script (if i had the know-how; maybe this will be an incentive to look this stuff up) so i can let other people know about it. i am sure i am not the only person who has had problems from not unzipping it into the root directory.

thanks for all your help void main.

PostPosted: Fri Feb 14, 2003 12:43 pm
by Void Main
If whoever created the zip file in the first place would just recreate it using "wget --convert-links blah blah" that should fix it. Or, if you give me the original zip file, I will fix it for you; it isn't that hard.

PostPosted: Sat Feb 15, 2003 10:47 am
by Calum
no i can do it. thanks very much for the offer though!