Fetching HTML content with Node

# Requirement

Copy the content of a website, i.e. its HTML. The pages are rendered directly on the server.

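# Analysis

Analysis of the target site:

The site is server-side rendered, so all page elements are returned in a single response.
The site is built with jQuery + div + CSS.
Page URL: http://www.lukuoyi.cn/tSNHMEHzsrK/?tid=820. Testing shows that tid increases gradually but is not contiguous.
The Linux command curl returns a site's HTML when given a link, so would the same approach work from within Node?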
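
Before wiring curl into a scraper, the idea can be checked in isolation. The following is a minimal sketch, assuming the curl binary is available on PATH. `runCommand` and `curlPage` are hypothetical helper names, and the synchronous `execFileSync` is used here only for brevity; it is not the asynchronous call the post settles on.

```typescript
import { execFileSync } from "node:child_process";

// Run a program and return its stdout as a string. Arguments are passed as an
// array, so no shell is involved and nothing in them needs quoting.
function runCommand(file: string, args: string[]): string {
  return execFileSync(file, args, {
    encoding: "utf8",
    maxBuffer: 10 * 1024 * 1024, // allow up to 10 MiB of output
  });
}

// Hypothetical helper: fetch one page by tid using the curl binary.
function curlPage(tid: number): string {
  const url = `http://www.lukuoyi.cn/tSNHMEHzsrK/?tid=${tid}`;
  // -s hides the progress meter, -S still reports errors,
  // -L follows redirects, --max-time caps the request at 15 seconds.
  return runCommand("curl", ["-sS", "-L", "--max-time", "15", url]);
}
```

Calling `curlPage(820)` should return the page's HTML as a string if the site is reachable; `execFileSync` throws when curl exits with a non-zero status.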

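# Solution

Testing confirmed that this works: child_process can start a child process, and child_process.exec runs the command and delivers its output asynchronously.
A page is treated as valid if it contains a specific page element (the title).
The function recurses over the ids and, up to a fixed limit, fetches the HTML for each id and writes it to a file through a stream.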
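
The validity check and the bounded walk over ids can be sketched separately from the network call. This is an illustration under stated assumptions, not the code used in this post: `extractTitle`, `crawl`, `Fetcher` and `Saver` are hypothetical names, the fetcher and saver are injected so the loop can be exercised without touching the site, the title pattern uses a non-greedy group, and a plain loop stands in for recursion.

```typescript
// The title is assumed to live in <div class="test_tit_t">…</div>.
const TITLE_RE = /<div class="test_tit_t">(.*?)<\/div>/;

// Return the cleaned page title, or null when the marker element is
// missing or empty, which is how an invalid page is recognised.
function extractTitle(html: string): string | null {
  const match = TITLE_RE.exec(html);
  if (!match) return null;
  // Drop characters that are awkward in file names (all occurrences).
  const title = match[1].trim().replace(/[?\/？]/g, "");
  return title.length > 0 ? title : null;
}

type Fetcher = (id: number) => Promise<string>;
type Saver = (fileName: string, html: string) => void;

// Walk ids upward from startId and stop once maxMisses pages were invalid.
// Returns the number of pages that were saved.
async function crawl(
  startId: number,
  maxMisses: number,
  fetchHtml: Fetcher,
  save: Saver,
): Promise<number> {
  let misses = 0;
  let saved = 0;
  for (let id = startId; misses < maxMisses; id += 1) {
    let html = "";
    try {
      html = await fetchHtml(id);
    } catch {
      // A failed request leaves html empty and is counted as a miss below.
    }
    const title = extractTitle(html);
    if (title === null) {
      misses += 1;
      continue;
    }
    save(`${id}-${title}.html`, html);
    saved += 1;
  }
  return saved;
}
```

In real use, `fetchHtml` would wrap the curl call and `save` would write into the output directory. Because ids are not contiguous, the miss limit has to be large enough to bridge the gaps between valid ids.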

# Code

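const fs = require('fs')
const path = require('path')
const child_process = require('child_process')

// Directory the pages are saved to; created on startup if it is missing
const outputDir = path.join(__dirname, '../../lukuoyi-data')

const lukuoyiTask = async () => {
  const id = 100
  const errorCount = 0
  fs.mkdirSync(outputDir, { recursive: true })
  const res = await getResponseHtml(id, errorCount)
  console.log(res)
}

// Fetch page `id` with curl, save it if it has a title, then continue with id + 1.
// The promise resolves with a summary string once errorCount exceeds 1000.
function getResponseHtml(id: number, errorCount: number): Promise<string> {
  return new Promise(resolve => {
    // The URL is quoted so the shell does not interpret the `?`
    const command = `curl "http://www.lukuoyi.cn/tvQGvz2xtio/?tid=${id}"`
    child_process.exec(command, function (err: any, stdout: any, stderr: any) {
      try {
        if (err) {
          errorCount += 1
        }
        if (errorCount > 1000) {
          resolve(`current id: ${id}, errorCount: ${errorCount}`)
          return
        }
        const html = stdout
        const reg = /<div class="test_tit_t">(.*)<\/div>/
        const match = reg.exec(html)
        // Remove every character that is awkward in a file name, not only the first one
        const title = match ? match[1].trim().replace(/[?/？]/g, '') : ''
        if (!title) {
          errorCount += 1
        } else {
          const ws = fs.createWriteStream(path.join(outputDir, `${id}-${title}.html`))
          console.log(`${id}-${title}.html`)
          ws.end(html) // write the page and close the stream
        }
        // Resolve with the next step; a bare `return` here would leave the outer promise pending forever
        resolve(getResponseHtml(id + 1, errorCount))
      } catch (e) {
        console.log(e)
        resolve(`error - current id: ${id}, errorCount: ${errorCount}`)
      }
    })
  })
}

// Start the task
lukuoyiTask()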


Last updated: 2021/12/19, 18:05:42
