igxe Wear and Price Comparison Crawler

2019-08-26


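igxe CS:GO skin wear and price comparison crawler

This was my first practice project while teaching myself web development and Python web scraping.

Difficulty: easy

Technologies used: the requests, re, jsonpath, and BeautifulSoup libraries

Main takeaways:

Using the XHR filter in the Network tab of the browser developer tools to capture the requests a page sends asynchronously (for example, the data loaded after changing pages)

Using the jsonpath library to extract fields from JSON data

Basic use of the re library to extract information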
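
The takeaways above can be tried offline. The snippet below is a minimal sketch rather than the crawler's own code: the JSON body and the HTML fragment are made-up samples shaped like the fields the crawler reads (page.page_count, exterior_wear, unit_price, and /product/... links), and it uses only the standard library, so plain dictionary access stands in for the jsonpath expressions.

```python
import json
import re

# Hypothetical XHR response body; the field names follow the ones the crawler reads.
sample_body = '''
{
  "page": {"page_count": 3},
  "d_list": [
    {"exterior_wear": "0.1532", "unit_price": "420.00"},
    {"exterior_wear": "0.2871", "unit_price": "365.50"}
  ]
}
'''

data = json.loads(sample_body)

# Plain-Python equivalent of the JSONPath expression $.page.page_count
page_count = data["page"]["page_count"]

# Plain-Python equivalent of $.d_list[*].exterior_wear and $.d_list[*].unit_price
listings = [(item["exterior_wear"], item["unit_price"]) for item in data["d_list"]]

# Hypothetical search-result HTML; the product id 12345 is made up.
sample_html = '<a href="/product/730/12345"><div class="name">AWP</div></a>'

# Same pattern the crawler uses, written as a raw string.
links = re.findall(r"/product/\d{3}/\d+", sample_html)

print(page_count)
print(listings)
print(links)
```

With the jsonpath library installed, jsonpath.jsonpath(data, '$.page.page_count') would return the same value wrapped in a list, or False when nothing matches.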

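Shortcomings: igxe's anti-scraping measures are light. The site does not require login verification, slider CAPTCHAs, or other complex steps, so the project is very easy.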

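Usage

Requirements: the requests, BeautifulSoup, and jsonpath libraries, installed with pip install (the BeautifulSoup package on PyPI is named beautifulsoup4)

How to use:

Enter a keyword (the full item name is recommended, e.g. AWP | 二西莫夫 (久经沙场), which is AWP | Asiimov (Field-Tested))

Enter the highest price you are willing to pay; the program reports the lowest wear available at or below that price

The program creates a data.text file in the root directory (the directory the script is run from), containing every wear value for the item and its corresponding price

The choice variable in the source defaults to 0; if the skin you are searching for also has a StatTrak version, change choice to 1 to show the version with the kill counter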
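
To illustrate the budget step, here is a small sketch of how the lowest wear within a price limit can be picked from lines in the "wear price" format of data.text. The function name best_wear_within_budget and the sample values are hypothetical; the point is that wear and price are compared as numbers rather than as strings.

```python
def best_wear_within_budget(lines, max_price):
    """Return (wear, price) with the lowest wear at or below max_price, or None."""
    listings = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        listings.append((float(parts[0]), float(parts[1])))
    affordable = [pair for pair in listings if pair[1] <= max_price]
    # Tuples compare by wear first, so min() returns the lowest-wear listing.
    return min(affordable) if affordable else None


# Made-up sample in the "wear price" format of data.text.
sample_lines = ["0.35 300.0\n", "0.16 420.0\n", "0.21 380.0\n", "0.28 350.0\n"]

print(best_wear_within_budget(sample_lines, 400))  # (0.21, 380.0)
print(best_wear_within_budget(sample_lines, 100))  # None
```

Parsing to float matters here: sorted as strings, a price of "100" would come before "20".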

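Afterword

This project is only my first practice exercise. Depending on how things go, I may add price comparison tools for other skin trading sites (NetEase BUFF is the most requested), though those sites may well be harder to scrape. We'll see how it goes (。・∀・)ノ

Code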

>import requests
>from bs4 import BeautifulSoup
>import re
>import json
>import jsonpath
>
>
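>def getLink(url):
>    """Search igxe, store the product links of each result in listItem, and return the result count."""
>    try:
>        r = requests.get(url, params=mykv, timeout=10, headers=kv)
>        r.raise_for_status()
>    except requests.RequestException:
>        # Return 0 so the caller can compare the count numerically.
>        return 0
>    r.encoding = r.apparent_encoding
>    soup = BeautifulSoup(r.text, "html.parser")
>    count = 0
>    # Each search result has a <div class="name">; its parent element holds the product link.
>    for link in soup.find_all('div', "name"):
>        listItem.append(re.findall(r"/product/\d{3}/\d+", str(link.parent)))
>        count += 1
>    # The second result is usually the StatTrak version.
>    print("Found " + str(count) + " result(s)")
>    return count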
>
>
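>def newLink(url, num, id):
>    """Fetch every page of listings and write one "wear price" line per listing to data.text."""
>    try:
>        r = requests.get(url, params=mykv, timeout=10, headers=kv)
>        r.raise_for_status()
>        r.encoding = 'utf-8'
>        data = json.loads(r.text)
>        page_count = jsonpath.jsonpath(data, '$.page.page_count')[0]
>        with open('data.text', 'w', encoding='utf-8') as infile:
>            # page_no is sent starting from 0; if the endpoint is 1-based,
>            # use range(1, page_count + 1) instead.
>            for count in range(page_count):
>                itemUrl = "https://www.igxe.cn/product/trade" + num + "?sort_rule=0&buy_method=0&status_locked=0&is_sticker=0&" \
>                          "gem_attribute_id=&gem_id=&paint_seed_type=0&paint_seed_id=0&page_no=" + str(count) + '&cur_page=1&product_id=' + id
>                # Named resp so that the re module is not shadowed.
>                resp = requests.get(itemUrl, params=mykv, timeout=10, headers=kv)
>                resp.encoding = 'utf-8'
>                newData = json.loads(resp.text)
>                # jsonpath returns False when nothing matches, e.g. on an empty page.
>                items = jsonpath.jsonpath(newData, '$.d_list[*]') or []
>                for item in items:
>                    wear = item.get('exterior_wear')
>                    price = item.get('unit_price')
>                    if wear in (None, '') or price in (None, ''):
>                        continue  # skip listings without a wear value or a price
>                    infile.write(str(wear) + ' ' + str(price) + '\n')
>    except (requests.RequestException, ValueError, KeyError, TypeError, AttributeError) as exc:
>        print("Failed to fetch listings: " + str(exc))
>        return "fail"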
>
>
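>def compare():
>    """Read data.text and report the lowest wear available within the given budget."""
>    listings = []  # (wear, price) pairs parsed as floats
>    with open('data.text', 'r', encoding='utf-8') as outfile:
>        for line in outfile:
>            parts = line.split()
>            if len(parts) < 2:
>                continue  # skip blank or malformed lines
>            listings.append((float(parts[0]), float(parts[1])))
>    if not listings:
>        print('No listings were found in data.text')
>        return
>    high = float(input("Enter the highest price you are willing to pay: "))
>    # Compare prices as numbers; sorted as strings, "100" would come before "20".
>    affordable = [pair for pair in listings if pair[1] <= high]
>    if not affordable:
>        print('Forget it, that budget will not buy anything (￣_￣|||)')
>        return
>    # Tuples compare by wear first, then price, so min() picks the lowest wear.
>    best_wear, best_price = min(affordable)
>    print('Lowest wear in this price range: ' + str(best_wear) + '\n' + 'Price: ' + str(best_price))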
>
>
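>def main():
>    keyword = input("Enter a keyword: ")
>    myurl = url + keyword
>    count = getLink(myurl)
>    if not isinstance(count, int) or count <= 0:
>        print('No item found for that keyword; please enter the full item name')
>        return
>    # choice picks which search result to use; make sure it has a product link.
>    if choice >= len(listItem) or not listItem[choice]:
>        print('No product link found for the selected result')
>        return
>    product_path = listItem[choice][0]
>    newurl = mainurl + product_path
>    print('The non-StatTrak item is selected by default')
>    print('Product link, in case you want to check it manually:')
>    print(newurl)
>    # A path of the form /product/<3 digits>/<id> yields product_num "/<3 digits>/<id>" and product_id "<id>".
>    product_num = product_path.split('/product')[1]
>    product_id = product_num.split('/')[2]
>    dataLink = "https://www.igxe.cn/product/trade" + product_num + "?sort_rule=0&buy_method=0&status_locked=0&is_sticker=0&" \
>                                                                   "gem_attribute_id=&gem_id=&paint_seed_type=0&paint_seed_id=" \
>                                                                   "0&page_no=2&cur_page=1&product_id=" + product_id
>    newLink(dataLink, product_num, product_id)
>    compare()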
>
>
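>compare_list = []
>price_list = []
>wear_list = []
>dict = {}
>listItem = []  # one list of matched product links per search result
>mainurl = "https://www.igxe.cn"
>url = "https://www.igxe.cn/csgo/730?keyword="
>kv = {'user-agent': 'Chrome/10'}  # request headers
>mykv = {'wd': 'Pypy'}  # extra query parameters sent with every request
>choice = 0  # index of the search result to use; 1 usually selects the StatTrak version
>
>if __name__ == "__main__":
>    main()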
>
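
The listings URL in the code is assembled by string concatenation. As an alternative, here is a minimal sketch that builds the same query string with urllib.parse.urlencode from the standard library. The helper name build_trade_url and the product path are made up for illustration; the parameter names are the ones that appear in the code above.

```python
from urllib.parse import urlencode


def build_trade_url(product_num, product_id, page_no):
    """Build the listings URL for one page (hypothetical helper name)."""
    params = {
        "sort_rule": 0,
        "buy_method": 0,
        "status_locked": 0,
        "is_sticker": 0,
        "gem_attribute_id": "",
        "gem_id": "",
        "paint_seed_type": 0,
        "paint_seed_id": 0,
        "page_no": page_no,
        "cur_page": 1,
        "product_id": product_id,
    }
    return "https://www.igxe.cn/product/trade" + product_num + "?" + urlencode(params)


# The product path below is a made-up example.
product_path = "/product/730/12345"
product_num = product_path.split("/product")[1]  # "/730/12345"
product_id = product_num.split("/")[2]           # "12345"

trade_url = build_trade_url(product_num, product_id, 1)
print(trade_url)
```

Keeping the parameters in a dictionary makes it easier to change page_no in a loop without editing a long string literal.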