Rogerborg nimbA
Joined: 26 Oct 2010 Karma :
|
Posted: 16:56 - 03 Jun 2017 Post subject: Downloading the Project Gutenberg Science Fiction Bookshelf |
|
|
https://www.gutenberg.org/wiki/Science_Fiction_(Bookshelf)
I got hacked off with not being able to bulk download the whole lot, so hacked up some Perl to bulk download the whole lot. I might as well share the h4xx while they download.
You'll need Perl, wget and some sort of lunix shell, or Cygwin I suppose. This version pulls down the kindle.noimages content for each title and names the files "Actual Title.mobi" rather than the unhelpful file number; other formats are available.
As tradition demands, any nerds reading this must strive for alpha-geek status by saying how they'd have done it better if they'd done it which they didn't.
Code: |
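use strict;
use warnings;

my $htmlIndexFile = "Science_Fiction_(Bookshelf)";
# Use Perl built-ins here: the parentheses in the file name are a
# shell syntax error in an unquoted `rm`, so a stale index never got deleted.
mkdir "books" unless -d "books";
unlink $htmlIndexFile;
system("wget", "-q", "https://www.gutenberg.org/wiki/$htmlIndexFile") == 0
    or die "could not download $htmlIndexFile: wget status $?";

# Slurp the whole index page into one string
my $document = do {
    local $/ = undef;
    open my $fh, "<", $htmlIndexFile
        or die "could not open $htmlIndexFile: $!";
    <$fh>;
};

# Flat list of (number, title, number, title, ...) pairs
my @books = ( $document =~ /title="ebook:(\d+)">([^<]+)</g );
my $numberOfBooks = (scalar @books) / 2;

for (my $book = 0; $book < $numberOfBooks; ++$book) {
    my $number = $books[$book * 2];
    my $title  = $books[($book * 2) + 1];
    # Replace each double quote with two single quotes, and each slash with
    # a dash, so the title is usable as a single file name
    $title =~ s/"/''/g;
    $title =~ s{/}{-}g;
    my $url      = "https://www.gutenberg.org/ebooks/$number.kindle.noimages";
    my $filename = "books/$title.mobi";
    if (!-e $filename) {
        printf "Getting %d of %d: %s, #%s\n", $book + 1, $numberOfBooks, $title, $number;
        # List-form system() bypasses the shell, so a $ or ` in a title is harmless
        system("wget", "-q", $url, "--output-document", $filename);
        # 2048 is wget exit code 8 (server sent an error response); tolerate it and carry on
        if ($? != 0 && $? != 2048) {
            print "Download of $title, #$number failed with code $?\n";
            exit(1);
        }
    }
}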
|
____________________ Biking is 1/20th as dangerous as horse riding.
GONE: HN125-8, LF-250B, GPz 305, GPZ 500S, Burgman 400 // RIDING: F650GS (800 twin), Royal Enfield Bullet Electra 500 AVL, Ninja 250R because racebike |
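For anyone who wants to poke at the parsing step without hitting the Gutenberg servers, here is a minimal sketch of the same extraction done offline in shell. It assumes the index links look like `title="ebook:NNNN">Some Title<`, which is what the regex in the script above matches. The function name `extract_books` and the sample HTML line (including the number 1234) are made up for illustration.

```shell
#!/bin/sh
# extract_books (hypothetical helper): read HTML on stdin and print one
# "number title" line per link of the form title="ebook:NNNN">Some Title<
extract_books() {
    grep -oE 'title="ebook:[0-9]+">[^<]+' |
        sed -E 's/^title="ebook:([0-9]+)">(.*)$/\1 \2/'
}

# Made-up sample line, shaped like the links the script expects
sample='<li><a href="/ebooks/1234" title="ebook:1234">An Example Title</a> (Some Author)</li>'
printf '%s\n' "$sample" | extract_books
```

The number never contains a space, so splitting each output line at the first space recovers the number and the title.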
Rogerborg nimbA
Joined: 26 Oct 2010 Karma :
|
Posted: 18:50 - 03 Jun 2017 Post subject: |
|
|
ScaredyCat wrote: | Quote: | The Project Gutenberg Science Fiction CD is available via BitTorrent from The Project Gutenberg BitTorrent Tracker.
The March 2007 Science Fiction Bookshelf CD |
|
Aww, bless, you clicked a link.
As you'll be aware, a lot of SF has dropped out of copyright in the last decade, and the SF bookshelf is regularly updated. Enjoy your obsolete ickle sub-archive though. |
Rogerborg nimbA
Joined: 26 Oct 2010 Karma :
|
Posted: 23:50 - 03 Jun 2017 Post subject: |
|
|
You're lucky I didn't write it as a K&R C macro.
To be honest, the whole thread is just to milk pedant tears. Sweet, nourishing tears. |
barrkel World Chat Champion
Joined: 30 Jul 2012 Karma :
|
Posted: 01:18 - 04 Jun 2017 Post subject: |
|
|
Code: | wget -a log -O - 'https://www.gutenberg.org/wiki/Science_Fiction_(Bookshelf)' | egrep -o 'ebook:[0-9]+">[^<]+' | sed -r 's|"|\\"|g;s|ebook:([0-9]+)\\">(.*)|wget https://www.gutenberg.org/ebooks/\1.kindle.noimages -O "\2.mobi" -a log|' | bash
|
Just cos I can. Thanks Rogerborg. ____________________ Bikes: S1000R, SH350; Exes: Vity 125, PS125, YBR125, ER6f, VFR800, Brutale 920, CB600F, SH300x4
Best road ever ridden: www.youtube.com/watch?v=s2MhNxUEYtQ |
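One caveat with piping generated text into bash, as the one-liner above does: it escapes double quotes, but a title containing `$` or a backtick would still be expanded inside the `"\2.mobi"` it emits. A cautious sketch that avoids re-parsing the title is below. It assumes input lines of the form "number title"; the names `safe_name`, `fetch_list` and `DRY_RUN` are hypothetical, and the sample line is invented.

```shell
#!/bin/sh
# safe_name (hypothetical helper): make a title usable as one file name.
# Replaces "/" with "-" and each double quote with two single quotes.
safe_name() {
    printf '%s\n' "$1" | sed -e 's|/|-|g' -e "s|\"|''|g"
}

# fetch_list (hypothetical helper): read "number title" lines on stdin
# and fetch each book. With DRY_RUN=1 it only prints what it would do.
fetch_list() {
    while read -r number title; do
        file="$(safe_name "$title").mobi"
        url="https://www.gutenberg.org/ebooks/$number.kindle.noimages"
        if [ "${DRY_RUN:-0}" = 1 ]; then
            # Display only; this line is not quoted for pasting into a shell
            printf 'wget -q %s -O %s\n' "$url" "$file"
        else
            # $file is passed as one quoted argument, never re-parsed
            wget -q "$url" -O "$file"
        fi
    done
}

# Invented sample line; nothing is downloaded in dry-run mode
printf '%s\n' '1234 A "Quoted" $Title/Subtitle' | DRY_RUN=1 fetch_list
```

The trade-off is length: this is no longer a one-liner, which rather defeats the point of the original.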
Rogerborg nimbA
Joined: 26 Oct 2010 Karma :
|
Posted: 09:30 - 04 Jun 2017 Post subject: |
|
|
https://i.imgur.com/fwtajh9.jpg
I award you one alpha-geek award.
Disappointed that you didn't get some awk in there though. |