Python Web Scraping: Batch-Extracting Douyin Videos

Author: Zhang Xiaoji (张小鸡), columnist for the Python Enthusiasts Community (Python爱好者社区)

Zhihu ID: https://www.zhihu.com/people/mr.ji

Personal WeChat official account: 鸡仔说

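I finally had some free time over the holiday to organize my notes. The Douyin videos turned out to be a pain: each one had to be exported to local storage, sent to WeChat's File Transfer assistant, downloaded again, and uploaded to Evernote (印象笔记), which wasted a lot of time. Isn't that exactly the kind of job a crawler is for? Hence this article.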

Tools and environment

  • Language: Python 3.6

  • Editor: PyCharm

  • Database: MongoDB

  • Tool: Charles

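Preface:

Before using Charles you need some basic configuration: proxy your phone's network traffic through your local computer so that you can capture and analyze the packets. The following two articles may help.

Charles: From Beginner to Master (Charles 从入门到精通)

https://www.jianshu.com/p/a3f005628d07

Charles, a Handy Tool for Capturing and Debugging Mobile App Traffic (移动应用抓包调试利器Charles)

https://www.jianshu.com/p/68684780c1b0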

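Crawling approach

Target site: https://www.douyin.com/

The approach here is so simple that I worry this article may feel a bit thin. Once the capture environment is configured correctly, open the Douyin app and do a few simple things in it. Charles will list the data that the server returns to the app, and that data contains everything we need, for example the links of the videos we have liked, which are what we want to download today.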

As you use the app, watch how the entries in Charles change. You will find that the tabs under your profile page correspond to three kinds of requests: videos, feed, and likes. Each time you perform the matching action, a few more request URLs appear in the list.

Screenshot: request capture results in Charles

Screenshot: the "Me" tab in Douyin

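Leaving everything else aside for now, let's check whether the response of each request holds the data we want. Open the response of any one of them.

You can see a play_addr field in it, and the URL contains the word video, so this is almost certainly what we are after. I have verified it: as we guessed, this response contains all the information about the video.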
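To make the structure of that response concrete, here is a minimal sketch of pulling the play URLs out of a video-detail payload. It assumes the nesting aweme_detail → video → play_addr → url_list, which is the path the download code in this article indexes into; the sample payload and the function name extract_play_urls are made up for illustration.

```python
from typing import Any, Dict, List


def extract_play_urls(detail_response: Dict[str, Any]) -> List[str]:
    """Return the candidate play URLs from a detail response, or [] if any level is missing."""
    node: Any = detail_response
    for key in ("aweme_detail", "video", "play_addr"):
        if not isinstance(node, dict):
            return []
        node = node.get(key)
    if not isinstance(node, dict):
        return []
    urls = node.get("url_list")
    return list(urls) if isinstance(urls, list) else []


# Hypothetical, trimmed-down payload for illustration only.
sample = {
    "aweme_detail": {
        "video": {"play_addr": {"url_list": ["https://example.com/v/1.mp4"]}}
    }
}
print(extract_play_urls(sample))
print(extract_play_urls({"aweme_detail": None}))
```

Walking the keys one level at a time means a deleted or restricted video yields an empty list instead of raising a KeyError or TypeError halfway through a batch.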

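So all we need to do is replicate this request. First look at what the request has to carry. After comparing a few of them you will see that only a handful of parameters actually change. The first group (above the red line in the original screenshot) is device and app information; the only core encrypted parameters are mas, as, and ts. I searched online for an existing implementation to reuse and, with a bit of dumb luck, found one: https://github.com/AppSign/douyin

It can be used as-is. All of this author's cracking projects are related to ByteDance, which makes me half suspect that the company let its own employees release them. Anyway, with an implementation of the encrypted parameters in hand, the rest is easy.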

In that author's code for extracting a video, the key video-related parameter is aweme_id. Once we have it, we can directly build the request that fetches the original video.

Enough talk, on to the code.

Show me the code

Core request:

    import time

    import requests

    # getSign is the signing helper based on https://github.com/AppSign/douyin
    # (its import is not shown in the original snippet).


    def grab_favorite(self, user_id, max_cursor=0):
        # Copy the template so repeated calls do not mutate the shared dict.
        favorite_params = dict(self.FAVORITE_PARAMS)
        favorite_params["user_id"] = user_id
        favorite_params["max_cursor"] = max_cursor
        # Merge endpoint parameters with the device/app parameters.
        query_params = {**favorite_params, **self.common_params}
        sign = getSign(self.gettoken(), query_params)
        params = {**query_params, **sign}
        resp = requests.get(self.FAVORITE_URL,
                            params=params,
                            verify=False,
                            headers=self.HEADERS)
        favorite_info = resp.json()
        has_more = favorite_info.get("has_more")
        max_cursor = favorite_info.get("max_cursor")
        video_infos = favorite_info.get("aweme_list") or []
        for per_video in video_infos:
            author_nickname = per_video["author"].get("nickname")
            author_uid = per_video["author"].get("uid")
            video_desc = per_video.get("desc")
            download_item = {
                "author_nickname": author_nickname,
                "video_desc": video_desc,
                "author_uid": author_uid,
            }
            aweme_id = per_video.get("aweme_id")
            self.download_favorite_video(aweme_id, download_item)
            time.sleep(5)  # pause between downloads
        # Return the paging state so the caller can request the next page.
        return has_more, max_cursor

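Here the device parameters, the app information, and the user are combined into the query parameters, which are passed together with the token to getSign to build the encrypted fields. Finally, all of these dictionaries are merged and sent to the favorites endpoint (https://aweme.snssdk.com/aweme/v1/aweme/favorite/), which returns the corresponding response data. You may notice that I have not yet explained the max_cursor parameter. On the first request it is 0. After each request, if the returned has_more is 1, there is more data, and the next request must carry the max_cursor from the previous response. Think of it as scrolling down to load the next page.

That is why the function returns these two values: it lets the caller check whether there is more data and, if so, keep paging and downloading.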
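The query-building step described above can be sketched in isolation. This is only an illustration of the merge order: dummy_signer stands in for the real signing helper, the parameter names device_id and app_name are hypothetical, and the placeholder values are not real signatures.

```python
from typing import Callable, Dict

Params = Dict[str, object]


def build_signed_params(common: Params, endpoint: Params,
                        signer: Callable[[Params], Params]) -> Params:
    """Merge device/app parameters with endpoint parameters, then add the signature fields."""
    query = {**common, **endpoint}  # later dicts win on key clashes
    signature = signer(query)       # the signer sees the full query
    return {**query, **signature}


def dummy_signer(query: Params) -> Params:
    # Placeholder only: a real signer would compute mas/as/ts from the query.
    return {"mas": "<mas>", "as": "<as>", "ts": "0"}


params = build_signed_params(
    common={"device_id": "<device-id>", "app_name": "aweme"},  # illustrative values
    endpoint={"user_id": "12345", "max_cursor": 0},
    signer=dummy_signer,
)
print(sorted(params))
```

Building new dictionaries with `{**a, **b}` rather than updating a shared one keeps the class-level parameter templates unchanged between requests.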

Pagination:

    def grab_favorite_main(self, user_id):
        count = 1
        self.logger.info("Crawling page 👉 {} 👈 ...".format(count))
        # The first request uses the default max_cursor of 0.
        has_more, max_cursor = self.grab_favorite(user_id)
        while has_more:
            count += 1
            self.logger.info("Crawling page 👉 {} 👈 ...".format(count))
            # Later requests carry the max_cursor returned by the previous page.
            has_more, max_cursor = self.grab_favorite(user_id, max_cursor)

After the first request we know whether there is more data and we have the max_cursor value, so the rest is simple: as long as there is more data, keep requesting.
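The same cursor logic can be tried without touching the network. The sketch below is a generic version of the loop, assuming a response shaped like the one described here (aweme_list, has_more, max_cursor); iter_pages, FAKE_PAGES, and the max_pages cap are illustrative additions, not part of the original project.

```python
from typing import Callable, Dict, Iterator, List


def iter_pages(fetch_page: Callable[[int], Dict], start_cursor: int = 0,
               max_pages: int = 1000) -> Iterator[List[dict]]:
    """Yield each page's item list, following max_cursor while has_more is truthy."""
    cursor = start_cursor
    for _ in range(max_pages):  # hard cap guards against an endless loop
        page = fetch_page(cursor)
        yield page.get("aweme_list") or []
        if not page.get("has_more"):
            break
        cursor = page.get("max_cursor", cursor)


# Fake three-page backend keyed by cursor, for illustration only.
FAKE_PAGES = {
    0: {"aweme_list": [{"aweme_id": "a"}], "has_more": 1, "max_cursor": 10},
    10: {"aweme_list": [{"aweme_id": "b"}], "has_more": 1, "max_cursor": 20},
    20: {"aweme_list": [{"aweme_id": "c"}], "has_more": 0, "max_cursor": 20},
}

ids = [item["aweme_id"] for page in iter_pages(FAKE_PAGES.__getitem__) for item in page]
print(ids)  # ['a', 'b', 'c']
```

Separating the paging loop from the fetch function makes it easy to test the stop condition with canned responses before pointing it at a live endpoint.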

Video download

    def download_favorite_video(self, aweme_id, video_infos):
        video_content = self.download_video(aweme_id)
        author_nickname = video_infos.get("author_nickname")
        author_uid = video_infos.get("author_uid")
        video_desc = video_infos.get("video_desc")
        # str.join takes a single iterable of strings.
        video_name = "_".join([str(author_nickname), str(author_uid), str(video_desc)])
        self.logger.info("download_favorite_video downloading video {} ".format(video_name))
        if not video_content:
            self.logger.warning("The video you are downloading has vanished "
                                "thanks to some mysterious force, skipping...")
            return
        # Write the raw bytes in binary mode.
        with open("../videos/{}.mp4".format(video_name), "wb") as f:
            f.write(video_content)


    def download_video(self, aweme_id, retrytimes=0):
        # Copy so the shared common parameters are not mutated.
        query_params = dict(self.common_params)
        query_params["aweme_id"] = aweme_id
        sign = getSign(self.gettoken(), query_params)
        params = {**query_params, **sign}
        post_data = {
            "aweme_id": aweme_id
        }
        resp = requests.get(self.VIDEO_DETAILURL,
                            params=params,
                            data=post_data,
                            verify=False,
                            headers=self.HEADERS)
        resp_result = resp.json()
        play_addr_raw = resp_result["aweme_detail"]["video"]["play_addr"]["url_list"]
        if not play_addr_raw:
            return None  # no playable URL, let the caller skip this video
        play_addr = play_addr_raw[0]  # url_list is a list, take the first entry
        content = requests.get(play_addr).content
        return content

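Similarly, after building the sign signature we request the video-detail endpoint with the corresponding aweme_id to get the video data we want, and then write it straight to a file in binary mode. For the file name I use the user's nickname, the user's unique ID, and the video description. If that is too long for you, change it to whatever file name you like.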
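Since video descriptions can be long and may contain characters that file systems reject, one option is to clean the name before writing. The helper below is a sketch of that idea and is not part of the original project; safe_video_name, the character set, and the 80-character cap are all assumptions chosen for illustration.

```python
import re


def safe_video_name(nickname: str, uid: str, desc: str, max_len: int = 80) -> str:
    """Join the parts with underscores, replace unsafe characters, and cap the length."""
    parts = (nickname, str(uid), desc)
    raw = "_".join(part.strip() for part in parts if part and part.strip())
    # Replace path separators, reserved punctuation, and line breaks.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]+', "_", raw)
    cleaned = cleaned.strip(" ._")
    return cleaned[:max_len] or "untitled"


print(safe_video_name("Tom", "42", "cats / dogs: best?"))
```

The result can then be passed to the same "../videos/{}.mp4" template used by the download function.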

Finally, start the crawler, and the liked videos are downloaded to the local videos directory.

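Those are the steps for crawling every video you have liked on Douyin. You can walk through the whole process yourself, or simply grab my code on GitHub (https://github.com/hacksman/spider_world).

Remember to change user_id to your own! I will keep adding interesting and practical crawlers to this repository. Stars are welcome, and if you run into any problems, send me feedback so we can learn and improve together.

GitHub project: https://github.com/hacksman/spider_world

Personal website: http://www.zxiaoji.com/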

