User:Kellen/Scripts
Scripts I've written for manipulating Wikibooks data. Be sure to copy from the edit box, not the rendered page, as some HTML entities are used.
No guarantee of efficiency, correctness, or beauty. A guarantee of hacky sed regular expressions.
Isolate Cookbook pages from Special:Wantedpages:
 #!/bin/bash
 # Fetch Special:Wantedpages, keep only the Cookbook entries, and rewrite each
 # one as a wiki list item: a link to the wanted page plus its "n links"
 # what-links-here link. The &amp; below is the literal entity as it appears
 # in the HTML source that lynx dumps.
 lynx -source -dump "http://en.wikibooks.org/w/index.php?title=Special:Wantedpages&limit=5000&offset=0" | \
 grep Cookbook | \
 grep edit | \
 grep -v "Talk:Cookbook" | \
 grep -v "Cookbook_talk:" | \
 grep -v "<s>" | \
 sed 's/<[/li]\{2,3\}>//g' | \
 sed 's/_/ /g' | \
 sed 's/(<a/\n(<a/g' | \
 sed 's/&amp;action=edit//g' | \
 sed 's/^<a href="\/w\/index.php?title=\(Cookbook:\([^"]*\)\)".*$/[[\1|\2]]/g' | \
 sed 's/<a href="\/w\/index.php?title=\([^"]*\)"[^>]*>\([^<]*\)[^)]*/[[\1|\2]]/g' | \
 sed 's/&amp;target=/\//g' | \
 sed '1~2N;s/\n/ /g' | \
 sed 's/^/*/g'
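The last two sed stages do the reassembly: the GNU-only 1~2N address pairs each title line with the "(n links)" line that the earlier s/(<a/\n(<a/ substitution pushed onto its own line, and the final stage prefixes each joined pair with * so the output pastes straight into a wiki page as a list. A minimal sketch of just that pairing step, on made-up sample input (titles and counts invented for illustration):
 #!/bin/bash
 # Sample input: odd lines are titles, even lines are the "(n links)" fragments.
 printf '%s\n' '[[Cookbook:Apple Pie|Apple Pie]]' '(5 links)' \
               '[[Cookbook:Borscht|Borscht]]' '(2 links)' | \
 sed '1~2N;s/\n/ /g' | \
 sed 's/^/*/g'
 # Prints:
 # *[[Cookbook:Apple Pie|Apple Pie]] (5 links)
 # *[[Cookbook:Borscht|Borscht]] (2 links)
Note that first~step addressing is a GNU sed extension, so this (like the scripts themselves) needs GNU sed.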
Get redirects from the Cookbook's Special:Allpages listing:
 #!/bin/bash
 # Walk Special:Allpages for the Cookbook namespace (102), following the
 # "Next page" links, then strip each fetched page down to its redirect
 # entries and join everything into one wiki list.
 
 # filename prefix
 FILE=cookbookredirs
 # servername
 SERVER="http://en.wikibooks.org"
 # start page
 PAGE="/w/index.php?title=Special%3AAllpages&from=&namespace=102"
 
 # fetch all pages
 num=0
 while [ -n "$PAGE" ]
 do
   num=$(($num+1))
   echo "Getting page number ${num}, ${PAGE}"
   wget -q -O "${FILE}.${num}" "${SERVER}${PAGE}"
   # get the next page URL; the href uses &amp;, which wget needs as a bare &
   PAGE=`grep -o '<a href="[^"]*" title="[^"]*">Next page ([^)]*)' "${FILE}.${num}" | grep -o 'href="[^"]*"' | grep -o '"[^"]*"' | sed 's/"//g' | sed 's/&amp;/\&/g'`
   #if [ -n "$PAGE" ]; then
   #  echo "Next page is ${PAGE}"
   #fi
 done
 echo "Got ${num} files."
 
 # strip each file down to only redirects
 i=0
 while [ $i -lt "$num" ]
 do
   i=$(($i+1))
   FN="${FILE}.${i}"
   if [ ! -f $FN ]; then
     echo "Can't find ${FN}"
     break
   fi
   # add a marker to beginning of page list
   sed -i 's/<table style="background: inherit;" border="0" width="100%">/\nBREAKHERE\n/' $FN
   # kill everything above page list marker
   sed -i '0,/BREAKHERE/d' $FN
   # find end of page list and kill everything after
   sed -i 's/<\/table>/\n/1' $FN
   sed -i '2,$d' $FN
   # add a linebreak after each item, replacing /td
   sed -i 's|</td>|\n|g' $FN
   # remove all remaining opening and closing tr's and td's
   sed -i 's|<[trd/]\{2,3\}>||g' $FN
   # strip down to just the title
   sed -i 's/<a href="\/wiki\/\([^"]*\)" title="\([^"]*\)">.*$/\1\n/g' $FN
   # only keep redirects
   sed -i -n '/allpagesredirect/p' $FN
   sed -i 's/<div class="allpagesredirect">\(.*\)$/* [[\1]]/g' $FN
 done
 
 # Join files together
 i=0
 CATFILES=""
 while [ $i -lt "$num" ]
 do
   i=$(($i+1))
   CATFILES="${CATFILES} ${FILE}.${i}"
 done
 FINAL="${FILE}.final"
 cat $CATFILES > $FINAL
 rm $CATFILES
 
 # add a blank line so we can easily find where to put columns
 lines=`wc -l < $FINAL`
 col=`expr $lines / 3`
 pattern="${col}~${col}G"
 sed -i $pattern $FINAL
 echo "Resultant file is ${FINAL}"
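The closing step uses the same GNU-only first~step address form, this time with the G command: after every col-th line it appends the (empty) hold space, i.e. a blank line, pre-marking where the list splits into roughly three columns. A tiny illustration with col hard-coded to 3 and dummy input:
 #!/bin/bash
 # Append a blank line after every 3rd line, mirroring "${col}~${col}G" above.
 seq 1 9 | sed '3~3G'
 # Prints 1 2 3, a blank line, 4 5 6, a blank line, 7 8 9, a blank line.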