Decoding the obfuscated font on Douyin share pages: turning font icons back into the user ID

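# Overview

Goal: obtain a user's Douyin ID and related information from that user's share page.

# Analysis

Inspecting the page elements in Chrome shows that most Arabic digits appear as a single unreadable glyph, so their values cannot be obtained by reading the element content.

The page has evidently been processed to hide the digits. A search on Baidu turns up several articles that explain the cause.
In short, the page uses a custom font to stand in for the digits. Those articles also offer solutions, but all of them are written in Python; I did not find one that does the decoding in JS.
The approach is therefore: find the mapping between digits and font character codes, then replace each font character code with its digit.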

# Solution

Find the mapping table

export const fontCodes:{[key: string]:any} = {
  "&#xe603;": "0", "&#xe60d;": "0", "&#xe616;": "0",
  "&#xe602;": "1", "&#xe60e;": "1", "&#xe618;": "1",
  "&#xe605;": "2", "&#xe610;": "2", "&#xe617;": "2",
  "&#xe604;": "3", "&#xe611;": "3", "&#xe61a;": "3",
  "&#xe606;": "4", "&#xe60c;": "4", "&#xe619;": "4",
  "&#xe607;": "5", "&#xe60f;": "5", "&#xe61b;": "5",
  "&#xe608;": "6", "&#xe612;": "6", "&#xe61f;": "6",
  "&#xe60a;": "7", "&#xe613;": "7", "&#xe61c;": "7",
  "&#xe60b;": "8", "&#xe614;": "8", "&#xe61d;": "8",
  "&#xe609;": "9", "&#xe615;": "9", "&#xe61e;": "9"
}
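As a sketch of how the table can be applied, the version below decodes every entity in a single regex pass instead of building one regex per key. The names `glyphSets`, `digitByCode` and `decodeFontEntities` are illustrative and not part of the original code; the code points are transcribed from `fontCodes` above, arranged so that the array index is the digit.

```typescript
// Hex code points per glyph set; the array index (0-9) is the digit.
// Transcribed from the fontCodes table; the names here are illustrative.
const glyphSets: string[][] = [
  ['e603', 'e602', 'e605', 'e604', 'e606', 'e607', 'e608', 'e60a', 'e60b', 'e609'],
  ['e60d', 'e60e', 'e610', 'e611', 'e60c', 'e60f', 'e612', 'e613', 'e614', 'e615'],
  ['e616', 'e618', 'e617', 'e61a', 'e619', 'e61b', 'e61f', 'e61c', 'e61d', 'e61e'],
]

// Build a lookup from hex code point to digit character.
const digitByCode: Record<string, string> = {}
for (const set of glyphSets) {
  set.forEach((code, digit) => {
    digitByCode[code] = String(digit)
  })
}

// Replace every known &#x....; entity with its digit in one pass.
// Unknown entities are left untouched.
function decodeFontEntities(raw: string): string {
  return raw.replace(/&#x([0-9a-f]+);/gi, (entity: string, hex: string) => {
    const digit = digitByCode[hex.toLowerCase()]
    return digit !== undefined ? digit : entity
  })
}

console.log(decodeFontEntities('&#xe617;&#xe61f;&#xe61e;')) // prints "269"
```

This only works on raw HTML in which the entities are still present as text.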

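Get the HTML of the target element

Note

This step took the most effort, mainly because the HTML that the tool returned was not the raw data I needed.

I used Puppeteer to do the parsing, because the short link redirects to a different site, and that second site is the one that actually holds the data.

# Pitfall: the replacement fails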

const shortidHtml  = await page.$eval('.shortid', (el: any) => el.innerHTML)
let shortid = shortidHtml.replace('抖音ID：     ', '').replace(/<i class="icon iconfont ">/g, '').replace(/<\/i>/g, '').replace(/\s/g, '')
for (let k in fontCodes) {
  const reg = new RegExp(`${k}`, 'g')
  shortid = shortid.replace(reg, fontCodes[k])
}
// This yields undefined; printing shortidHtml shows that the value retrieved is still
// "抖音ID：     <i class="icon iconfont ">  </i><i class="icon iconfont ">  </i><i class="icon iconfont ">  </i><i class="icon iconfont ">  </i><i class="icon iconfont ">  </i><i class="icon iconfont ">  </i><i class="icon iconfont ">  </i><i class="icon iconfont ">  </i><i class="icon iconfont ">  </i><i class="icon iconfont ">  </i>   "

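Fix: Puppeteer presumably emulates real browser behavior, so data read from the browser's DOM has already been decoded and the entity strings in the table no longer match. The values should be taken from the raw response data instead.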
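To illustrate why the lookup fails on DOM data: a reference such as `&#xe617;` becomes the single private-use character U+E617 once the browser has parsed it, so a table keyed by entity strings cannot match. The sketch below re-keys such a table by decoded character, which is one possible alternative to fetching the raw HTML. It assumes the DOM text contains those private-use characters, as the empty-looking output above suggests; `entityToChar`, `toCharTable` and `decodeDomText` are hypothetical helpers, not part of the original code.

```typescript
// Convert an entity such as "&#xe617;" into the single character it decodes to.
// String.fromCharCode is enough here because these code points are in the
// Basic Multilingual Plane.
function entityToChar(entity: string): string {
  const match = /^&#x([0-9a-f]+);$/i.exec(entity)
  if (!match) {
    throw new Error(`not a hex character reference: ${entity}`)
  }
  return String.fromCharCode(parseInt(match[1], 16))
}

// Re-key an entity table (shaped like fontCodes) by decoded character.
function toCharTable(entityTable: Record<string, string>): Record<string, string> {
  const charTable: Record<string, string> = {}
  for (const entity of Object.keys(entityTable)) {
    charTable[entityToChar(entity)] = entityTable[entity]
  }
  return charTable
}

// Decode text taken from the DOM: drop whitespace, then map known glyph
// characters to digits and keep every other character as-is.
function decodeDomText(text: string, charTable: Record<string, string>): string {
  return text
    .replace(/\s/g, '')
    .split('')
    .map((ch) => (charTable[ch] !== undefined ? charTable[ch] : ch))
    .join('')
}

// Two-entry excerpt of the table, for illustration only.
const sampleCharTable = toCharTable({ '&#xe617;': '2', '&#xe61f;': '6' })
console.log(decodeDomText('JOJO \ue617 \ue61f ', sampleCharTable)) // prints "JOJO26"
```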

# Fetching with the native https module

export const loadHtml = (url: string): Promise<string> => {
  return new Promise((resolve, reject) => {
    const https = require('https');
    // Arguments: the URL and a callback that receives the response
    https.get(url, (res: any) => {
      let html = '';
      // On each 'data' event, append the HTML chunk
      res.on('data', (data: any) => {
        html += data;
      });
      res.on('end', () => {
        resolve(html)
      });
    }).on('error', (err: Error) => {
      console.log('Failed to fetch data');
      // Reject so that the caller is not left waiting on a promise that never settles
      reject(err)
    });
  })
}

# Fetching with Puppeteer's response.text() method

page.on('response', async (response: any) => {
    if (response.url().indexOf('https://www.iesdouyin.com/share/user/') > -1) {
      const content = await response.text()
      // String.match returns the capture group directly, avoiding the deprecated RegExp.$1
      const matched = content.match(/<p class="shortid">(.*)   <\/p>/)
      let shortid = ''
      if (matched) {
        shortid = matched[1]
        shortid = shortid.replace('抖音ID：     ', '').replace(/<i class="icon iconfont ">/g, '').replace(/<\/i>/g, '').replace(/\s/g, '')
        for (let k in fontCodes) {
          const reg = new RegExp(`${k}`, 'g')
          shortid = shortid.replace(reg, fontCodes[k])
        }
      }
      obj.shortid = shortid
    }
  })
  // shortid 抖音ID：     JOJO<i class="icon iconfont "> &#xe617; </i><i class="icon iconfont "> &#xe617; </i><i class="icon iconfont "> &#xe61f; </i><i class="icon iconfont "> &#xe61e; </i><i class="icon iconfont "> &#xe61a; </i>
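The parsing inside the response handler can also be pulled out into a pure function, which makes it possible to check against a saved HTML sample without launching a browser. This is a minimal sketch rather than the original implementation: `extractShortId` is a hypothetical name, the table is passed in as a parameter (in practice `fontCodes`), and the sample HTML is built from the logged output above.

```typescript
// Pull the short ID out of raw (undecoded) HTML.
// Returns null when no <p class="shortid"> element is found.
function extractShortId(html: string, table: Record<string, string>): string | null {
  const matched = html.match(/<p class="shortid">([\s\S]*?)<\/p>/)
  if (!matched) {
    return null
  }
  return matched[1]
    .replace(/<[^>]+>/g, '') // drop the <i> wrappers
    .replace('抖音ID：', '') // drop the label
    .replace(/\s/g, '') // drop padding whitespace
    .replace(/&#x[0-9a-f]+;/gi, (entity: string) => {
      // Swap each known font code for its digit; keep unknown entities
      const digit = table[entity.toLowerCase()]
      return digit !== undefined ? digit : entity
    })
}

// Excerpt of the mapping table covering only the codes in the sample.
const sampleCodes: Record<string, string> = {
  '&#xe617;': '2',
  '&#xe61f;': '6',
  '&#xe61e;': '9',
  '&#xe61a;': '3',
}

const sampleHtml =
  '<p class="shortid">抖音ID：     JOJO' +
  '<i class="icon iconfont "> &#xe617; </i><i class="icon iconfont "> &#xe617; </i>' +
  '<i class="icon iconfont "> &#xe61f; </i><i class="icon iconfont "> &#xe61e; </i>' +
  '<i class="icon iconfont "> &#xe61a; </i>   </p>'

console.log(extractShortId(sampleHtml, sampleCodes)) // prints "JOJO22693"
```

Inside the handler, the call would then reduce to something like `obj.shortid = extractShortId(content, fontCodes) || ''`.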

Get the remaining data

const nickname = await page.$eval('.nickname', (el: any) => el.innerText)
const uid = await page.evaluate(`document.querySelector('.focus-btn ').attributes[1].value`)
const avatar = await page.evaluate(`document.querySelector('.avatar').src`)

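# Summary

The hard part of this scrape was getting hold of the raw HTML. Once that is available, the font codes are replaced according to the mapping table to recover the digits. Techniques used: regular expressions plus the relevant Puppeteer APIs.

# Complete code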

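import puppeteer from 'puppeteer'

// fontCodes is the mapping table shown earlier; IuserInfo, iPhoneDevice and sleep are not shown in this post.
// url: the short link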
const userInfoByShortUrl = async (url: string) => {
  let obj: IuserInfo = {
    uid: '',
    shortid: '',
    uniqueid: '',
    nickname: '',
    avatar: ''
  }
  const browser = await puppeteer.launch({
    // headless: false,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });
  const page = await browser.newPage();
  page.on('response', async (response: any) => {
    if (response.url().indexOf('https://www.iesdouyin.com/share/user/') > -1) {
      const content = await response.text()
      // String.match returns the capture group directly, avoiding the deprecated RegExp.$1
      const matched = content.match(/<p class="shortid">(.*)   <\/p>/)
      let shortid = ''
      if (matched) {
        shortid = matched[1]
        console.log('shortid', shortid)
        shortid = shortid.replace('抖音ID：     ', '').replace(/<i class="icon iconfont ">/g, '').replace(/<\/i>/g, '').replace(/\s/g, '')
        for (let k in fontCodes) {
          const reg = new RegExp(`${k}`, 'g')
          shortid = shortid.replace(reg, fontCodes[k])
        }
      }
      obj.shortid = shortid
    }
  })
  await page.emulate(iPhoneDevice);
  // Navigate to the page
  await page.goto(url);
  await sleep(200);
  const nickname = await page.$eval('.nickname', (el: any) => el.innerText)
  const uid = await page.evaluate(`document.querySelector('.focus-btn ').attributes[1].value`)
  const avatar = await page.evaluate(`document.querySelector('.avatar').src`)
  obj.nickname = nickname
  obj.uid = uid
  obj.uniqueid = uid
  obj.avatar = avatar
  await browser.close()
  return obj

}


Last updated: 2021/12/19, 18:05:42
