I had some spare time, so I extracted the attached script.
I have not looked too deeply into it, but I have one suggestion: instead of removing the directory if it already exists, I would just quit and assume the mirror is already up to date. Not everyone likes an unconditional rm -rf; something like the sketch below would do.
The OP's script I suggest running inside a container, jail or something similar. RUN AT YOUR OWN RISK.
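A minimal sketch of that suggestion (untested; it reuses the WD variable the script defines below and would replace the rm -rf/mkdir pair):

if [ -d "$WD" ]
then
	echo "$WD already exists; assuming the mirror is up to date." 1>&2
	exit 0
fi
mkdir -- "$WD"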
#!/bin/sh
# Downloads the wiki and makes it nice for local usage, takes lots of time too.
# Works as intended on Debian Wheezy using the standard packages.
# Partially based off a command found on the wiki.
# Written by GhostlyDeath <[email protected]>
# Current time
DATE="$(date +%F)"
# Make directory to put files in
WD="osdevwiki_$DATE"
rm -rf -- "$WD"
mkdir -- "$WD"
# Get files
# I added more Special: exclusion depths, because otherwise you get something like
# ./special%3arecentchangeslinked/johnburger%3ademo/exec/ints/debug.html
cd -- "$WD"
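# The $(echo ... > /dev/null) lines below expand to nothing; they are just a
# trick to keep comments inside a backslash-continued command.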
wget --mirror -k -p --reject '*=*,Special:*' --exclude-directories='Special:*,Special:*/*,Special:*/*/*,Special:*/*/*/*,Special:*/*/*/*/*,Special:*/*/*/*/*/*,Special:*/*/*/*/*/*/*' \
--user-agent="osdev-mirror, new and improved." --limit-rate=128k \
$(echo "Do not create host directory (i.e. example.com" > /dev/null) \
-nH \
$(echo "Always end in htm/html" > /dev/null) \
-E \
$(echo "Force Windows compatible file names" > /dev/null) \
--restrict-file-names=lowercase,windows,nocontrol,ascii \
$(echo "Actual wiki URL" > /dev/null) \
http://wiki.osdev.org/Main_Page
# Some browsers (like Firefox) get completely confused by files on disk that
# have percent signs in their names, since % starts an escape sequence in URLs.
# So all of those links must be replaced in every single HTML page with a
# gigantic sed script.
# Turn % into %25 first, because that is how a literal % appears in the HTML.
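# For example (assumed from the wget naming scheme above): an on-disk file like
# special%3asearch.html shows up in hrefs as special%253asearch.html, and both
# forms must end up as special___3asearch.html.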
rm -f /tmp/$$.sed
find . -type f | grep '%[0-9a-fA-F][0-9a-fA-F]' | sed 's/^\.\///;s/%/%25/g' | while read -r line
do
# Map the old name to the new name.
# The old name also needs its . and / escaped (they confuse sed).
# Turn %25 into ___ to simplify the operation on the disk.
OLD="$(echo "$line" | sed 's/\([./]\)/\\\1/g')"
NEW="$(echo "$line" | sed 's/\([/]\)/\\\1/g;s/%25/___/g')"
echo "s/$OLD/$NEW/g" >> /tmp/$$.sed
done
# Go through all files again, and sed them (only web pages)
# This takes forever! Literally! At the time of this writing there are 1520 pages
# and if each one takes 1 second (on this Atom at least) then it would take about
# 25 minutes for this to complete.
NF="$(find . -type f | grep '\.htm[l]\{0,1\}$' | wc -l)"
CC="0"
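# Note: the pipe into while runs the loop in a subshell, so the CC counter is
# visible inside the loop for the progress output, but its final value is lost
# after done; that is fine here.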
find . -type f | grep '\.htm[l]\{0,1\}$' | while read -r line
do
CC="$(expr $CC + 1)"
echo ".. $line ($CC of $NF)" 1>&2
sed -f "/tmp/$$.sed" < "$line" > /tmp/$$
mv /tmp/$$ "$line"
done
echo "Done" 1>&2
# Go through all files, and change every % to ___
# First rename directories
find . -type d | grep '%' | while read -r line
do
# Keep changing % to ___
CUR="$line"
while true
do
TO="$(echo "$CUR" | sed 's/%/___/')"
# If line has not changed, then done renaming
if [ "$TO" = "$CUR" ]
then
break
fi
# Rename
mv -v "$CUR" "$TO"
CUR="$TO"
done
done
# Now rename all the files
find . -type f | grep '%' | while read -r line
do
TO="$(echo "$line" | sed 's/%/___/g')"
mv -v "$line" "$TO"
done
# Create an index.html which just redirects to the main page (expanded_main_page.html)...
echo '
<html>
<head>
<title>Redirecting</title>
<meta http-equiv="refresh" content="0; url=expanded_main_page.html">
</head>
<body>
<a href="expanded_main_page.html">expanded_main_page.html</a>
</body>
</html>
' > index.html
# Clean up any extra gunk
rm -f /tmp/$$ /tmp/$$.sed
# Go back out and archive it
cd ..
zip -r -9 "$WD.zip" "$WD"
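One more caveat I noticed while skimming: the directory rename pass walks the tree top-down (find lists parents before children), so if a directory with % in its name contains subdirectories that also have % in theirs, the parent gets renamed first and the later mv calls fail, or find misses the children entirely. A depth-first pass that only rewrites the last path component would avoid that; untested sketch, as a drop-in for the "First rename directories" loop:

find . -depth -type d | grep '%' | while read -r line
do
	DIR="$(dirname -- "$line")"
	BASE="$(basename -- "$line")"
	mv -v -- "$line" "$DIR/$(echo "$BASE" | sed 's/%/___/g')"
done

With -depth the children are renamed before their parent, and since only the basename is rewritten, the parent part of each path stays valid.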