Make Subscription to Daily News with Scrapy and Gitbook and Push it to Kindle

About four years ago, after I bought a Kindle, I subscribed to daily news through Gouerduo Daily Report for a while. I canceled it because I found it useless. However, about three months ago, I started to feel a little isolated and wanted more information about the outside world.

I searched for and compared several news-subscription products, finally chose Kindle4rss, and ordered it for one year.

I was careless enough not to notice until a month later that a lot of the articles were incomplete. Some contained only the first page of the article. Since there was no hint of the truncation, I mistook them for broken articles from CanKaoXiaoXi itself. I also emailed Kindle4rss but got no reply. As a programmer, I decided to build it myself.

Principles: simple and easy to develop. I should be able to finish it within a week, working only during my daily lunch break.

The workflow is simple:

graph LR;
  id1(fetch articles)-->id2(write them into an ebook);
  id2(write them into an ebook)-->id3(push to my Kindle);

I picked tools for each step accordingly:

graph LR;
  id1(Scrapy for the fetching of articles)-->id2(Gitbook for ebook);
  id2(Gitbook for ebook)-->id3(send an email to my Kindle);

Fetching Articles

For the first version, I planned to fetch articles from the World News column only.

Fetch

After inspecting a few pages of the column on cankaoxiaoxi.com, I found that the pagination is powered by AJAX: the site sends an asynchronous request that returns JSON, and the data field of that JSON holds the article list we want.

start_urls = ['http://app.cankaoxiaoxi.com/?app=shlist&controller=milzuixin&action=world&page=1&pagesize=20']

I wanted to handle it simply, so I just extract every link from the list.

# The JSON payload is wrapped in one extra character on each side; strip them before parsing.
body = response.body[1:-1]
body = json.loads(body)
data = body["data"]
# The data field is an HTML fragment; extract every link it contains.
links = Selector(text=data).xpath("//a/@href").extract()

What I really want is the most recently published news; outdated items are useless. So I fetch only the first page and filter the links by publication date.

# Today's date, in the same format that appears in the article URLs.
date = datetime.datetime.strftime(datetime.datetime.now(), "%Y%m%d")

for link in links:
    # Only follow links dated today; stop at the first one that is not.
    if date not in link:
        return
    yield scrapy.Request(link, self.parse_article, dont_filter=False)
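
For context, here is a minimal sketch of how these snippets fit together in a single spider. The class name and the package name in the import are my assumptions; the spider name ckxx is taken from the crawl command in the shell script later on.

import datetime
import json

import scrapy
from scrapy.selector import Selector


class CkxxSpider(scrapy.Spider):
    # 'ckxx' matches the `scrapy crawl ckxx` call in the shell script below.
    name = 'ckxx'
    start_urls = ['http://app.cankaoxiaoxi.com/?app=shlist&controller=milzuixin&action=world&page=1&pagesize=20']

    def parse(self, response):
        # Strip the wrapping characters and parse the JSON payload.
        body = json.loads(response.body[1:-1])
        links = Selector(text=body["data"]).xpath("//a/@href").extract()

        # Only follow links whose URL contains today's date.
        date = datetime.datetime.strftime(datetime.datetime.now(), "%Y%m%d")
        for link in links:
            if date not in link:
                return
            yield scrapy.Request(link, self.parse_article)

    def parse_article(self, response):
        # See the full implementation below.
        pass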

Get the links and add them to the crawl queue, then parse each one with the parse_article method. Here lies the challenge that caused the problems in the Kindle4rss subscription: some articles have more than one page, so I also need to fetch the remaining pages. Some of those pages are Extra Readings (延伸阅读), which are useless to me and need to be dropped.

def parse_article(self, response):
    item = KindleItem()
    item['resource'] = "参考消息国际版"

    # Parse the content.
    item['title'] = response.xpath("//h1[contains(@class, 'YH')]/text()").extract_first()
    item['content'] = response.xpath('//div[contains(@class, "article-content")]').extract_first()
    item['url'] = response.url

    # Drop the extra readings.
    if '延伸阅读' in item['content']:
        return

    # Get the next page, if any.
    next_link = response.xpath("//p[contains(@class, 'fz-16')]/strong/a/@href").extract_first()
    if next_link:
        yield scrapy.Request(next_link, self.parse_article, dont_filter=False)

    # Put the item into the pipeline.
    yield item
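
The KindleItem used above is a plain Scrapy item. A minimal sketch of what its definition in items.py could look like, based only on the fields used by the spider and the pipeline:

import scrapy


class KindleItem(scrapy.Item):
    # Fields referenced by the spider and the pipeline.
    resource = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    url = scrapy.Field()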

Pipeline

Extracted items are handed to pipelines for further processing. Usually they would be stored in a database, but here I simply write them into Markdown pages laid out according to GitBook's conventions. The extracted content is HTML markup, which renders fine inside a Markdown file.

import datetime
import re
import sys


class KindlePipeline(object):
    def process_item(self, item, spider):
        date = datetime.datetime.strftime(datetime.datetime.now(), "%Y%m%d")

        # Output directory for today's book.
        d = sys.path[0] + "/posts/" + date + "/"

        # Extract the article id from the URL, together with the optional '_<n>' suffix
        # that marks the extra pages of a multi-page article. The suffix is the criterion
        # for deciding whether two pages belong to one article.
        result = re.findall(r'(?<=\/)(\d+)(_\d+)?(?=\.shtml)', item["url"])
        filename = result[0][0]

        # No suffix: this is the first page of an article. Create a new file, write the
        # title and content, and register the chapter in SUMMARY.md.
        if not result[0][1]:
            f = open(d + filename + '.md', 'w')
            f.write('# ' + item["title"] + '\n\n')
            f.write(item["content"])
            f.close()
            summary = open(d + 'SUMMARY.md', 'a+')
            summary.write('* [' + item['title'] + '](' + filename + '.md)\n')
            summary.close()

        # A suffix means a follow-up page: append its content to the existing file.
        else:
            f = open(d + filename + '.md', 'a+')
            f.write(item["content"])
            f.close()
        return item
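
For the pipeline to be invoked at all, it has to be enabled in the project's settings.py. A minimal sketch, assuming the project package is called kindlepush (the real module path may differ):

# settings.py -- enable the pipeline; the package name is an assumption.
ITEM_PIPELINES = {
    'kindlepush.pipelines.KindlePipeline': 300,
}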

Make an e-book

This takes just one GitBook command.

$ gitbook mobi ./ book.mobi
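
At this point, today's book directory looks roughly like this. README.md and the initial SUMMARY.md come from gitbook init, book.json from the shell script shown below, and the article pages from the pipeline; the names are placeholders:

posts/<YYYYMMDD>/
├── book.json          # book title, written by the shell script
├── README.md          # generated by `gitbook init`
├── SUMMARY.md         # table of contents, appended to by the pipeline
└── <article-id>.md    # one file per article, written by the pipeline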

Push to Kindle

Send an email with the .mobi file attached to the Kindle address, using mutt together with msmtp.
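
This needs only a minimal local mail setup. A sketch of the two config files, where every address, host, and credential is a placeholder rather than my real setup; note that the sending address also has to be on the approved sender list of the Kindle account, or the attachment will be rejected.

# ~/.msmtprc -- SMTP account used by msmtp (placeholders throughout)
defaults
auth            on
tls             on
tls_trust_file  /etc/ssl/certs/ca-certificates.crt   # Debian/Ubuntu path
logfile         ~/.msmtp.log

account         default
host            smtp.example.com
port            587
from            sender@example.com
user            sender@example.com
password        app-password

# ~/.muttrc -- tell mutt to send through msmtp
set sendmail  = "/usr/bin/msmtp"
set from      = "sender@example.com"
set use_from  = yes
set realname  = "kindlepush"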

Finally, I glue all the steps together in a shell script, which crontab runs every day.

#!/bin/bash

ls_date=`date +%Y%m%d`

# Create today's book directory and a GitBook skeleton in it.
cd posts
mkdir ${ls_date}
cd ${ls_date}
gitbook init

# Set the book title (kindle推送 = "kindle push") to include today's date.
echo "{\"title\": \"kindle推送-${ls_date}\"}" >> book.json

# Run the spider; the pipeline writes the articles into posts/<date>/.
cd ../..
/usr/local/bin/scrapy crawl ckxx

# Build the .mobi file.
cd posts
cd ${ls_date}
gitbook mobi ./ ./../../ebooks/${ls_date}.mobi

# Mail it to the Kindle address.
cd ../..
echo "kindle推送-${ls_date}" | mutt -s "kindle推送-${ls_date}" icily0719@kindle.cn -a "ebooks/${ls_date}.mobi"

Problem

There is no exception handling here. I’m too lazy to do that!

The complete code and files can be found here: kindlepush