Written by php-style
on 2021-03-26

Python (Instagram crawling)


1. CMD

2. jupyter notebook

> pip install jupyter

> python -m pip install --upgrade pip

> jupyter notebook

https://wikidocs.net/book/1

Study Python with "Jump to Python" (점프 투 파이썬)!

https://chromedriver.storage.googleapis.com/index.html

Check your Chrome version, then download the matching Chrome WebDriver -> move the exe file into the source directory (the path printed by %pwd)
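
Before launching the browser it can help to confirm that the driver file really sits in the notebook's working directory. The following is only a minimal sketch using the standard library; `find_driver` is a hypothetical helper name, not part of Selenium.

```python
from pathlib import Path

def find_driver(directory=".", name="chromedriver.exe"):
    """Return the full path of the driver file if it exists in `directory`, else None."""
    candidate = Path(directory).resolve() / name
    return candidate if candidate.is_file() else None

# Check the current working directory (the same path that %pwd prints)
driver_path = find_driver()
if driver_path is None:
    print("chromedriver.exe not found in", Path.cwd())
else:
    print("Found driver at", driver_path)
```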

! runs a cmd (shell) command

% runs an IPython magic command (for example %pwd, which works like the Linux pwd command)

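1. Install modules

selenium (tool for controlling a web browser): !pip install selenium

pillow (image editing tool, a mini Photoshop): !pip install pillow

bs4 (tool for parsing HTML): !pip install bs4

requests (tool for sending HTTP requests and reading the responses): !pip install requests

matplotlib (tool for drawing charts and graphs): !pip install matplotlib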
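
After installing, you can check which packages are still missing without importing them one by one. This is a rough sketch using only the standard library; `missing_packages` and the `IMPORT_NAMES` table are hypothetical names made up for this example.

```python
import importlib.util

# pip package name -> name used in `import`
# (pillow is imported as PIL; the other import names match the pip names)
IMPORT_NAMES = {
    "selenium": "selenium",
    "pillow": "PIL",
    "bs4": "bs4",
    "requests": "requests",
    "matplotlib": "matplotlib",
}

def missing_packages(import_names=IMPORT_NAMES):
    """Return the pip package names whose import name cannot be found."""
    return [pkg for pkg, mod in import_names.items()
            if importlib.util.find_spec(mod) is None]

print("Missing packages:", missing_packages())
```

An empty list means every package from this step can be imported.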

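2. Open the Instagram window

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    import time  # time package

    # Selenium 4 takes the driver path through a Service object
    # (Selenium 3 accepted webdriver.Chrome('chromedriver.exe') directly)
    driver = webdriver.Chrome(service=Service('chromedriver.exe'))
    driver.get('https://www.instagram.com')  # open the Instagram window
    time.sleep(2)  # pause for about 2 seconds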
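
A fixed time.sleep(2) is either too long or too short depending on how fast the page loads. One alternative is to poll until a condition becomes true. The helper below is a generic sketch, not something from the original post; `wait_until` is a hypothetical name, and Selenium's own WebDriverWait is built on the same idea.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)
```

With a driver open, `condition` could be a lambda that looks up the login inputs and returns the (possibly empty) list of matches, so the notebook continues as soon as they appear.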

    ## Function that builds the Instagram hashtag search URL
    def insta_searching(word):
        url = 'https://www.instagram.com/explore/tags/' + word
        return url

    insta_searching("코로나")  # "코로나" means "corona"
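
insta_searching pastes the search word straight into the URL. Browsers usually cope with Korean characters, but if you want a URL that is safe to log or pass to other tools, you can percent-encode the word first. This is a sketch of a variant, assuming you also want to accept a leading "#"; `insta_tag_url` is a hypothetical name.

```python
from urllib.parse import quote

BASE_URL = 'https://www.instagram.com/explore/tags/'

def insta_tag_url(word):
    """Build a hashtag search URL, percent-encoding non-ASCII characters."""
    tag = word.strip().lstrip('#')   # accept "#코로나" as well as "코로나"
    return BASE_URL + quote(tag)

print(insta_tag_url("코로나"))
```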

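3. Log in to Instagram automatically

Press F12 in the new browser window

Double-click the class attribute of the input element and copy it

input._2hvTZ.pexuQ.zyHYP  # replace each space in the class value with a .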
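
The space-to-dot conversion can be done in code as well, which avoids typos when a class value is long. A small sketch (`class_to_selector` is a hypothetical helper name):

```python
def class_to_selector(tag, class_attr):
    """Turn a tag name and a copied class attribute into a CSS selector."""
    classes = class_attr.split()          # split on any run of whitespace
    return tag + ''.join('.' + c for c in classes)

print(class_to_selector('input', '_2hvTZ pexuQ zyHYP'))  # prints input._2hvTZ.pexuQ.zyHYP
```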

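    from selenium.webdriver.common.by import By

    email = "your_id"
    # find_elements locates elements on the page, much like pointing at them with the mouse;
    # this lookup is the same for both fields. Index 0 is the ID field.
    input_id = driver.find_elements(By.CSS_SELECTOR, 'input._2hvTZ.pexuQ.zyHYP')[0]
    input_id.clear()
    input_id.send_keys(email)

    password = "your_password"
    # Index 1 is the password field
    input_pw = driver.find_elements(By.CSS_SELECTOR, 'input._2hvTZ.pexuQ.zyHYP')[1]
    input_pw.clear()
    input_pw.send_keys(password)
    input_pw.submit()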
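
Typing the ID and password directly into a notebook cell makes it easy to leak them when the notebook is shared. One option is to read them from environment variables instead. This is only a sketch: INSTA_ID and INSTA_PW are variable names assumed for this example, and `load_credentials` is a hypothetical helper.

```python
import os

def load_credentials(env=os.environ):
    """Read the login ID and password from environment variables.

    INSTA_ID and INSTA_PW are hypothetical variable names chosen for this sketch.
    """
    email = env.get("INSTA_ID")
    password = env.get("INSTA_PW")
    if not email or not password:
        raise RuntimeError("Set INSTA_ID and INSTA_PW before running the login cell")
    return email, password
```

In the login cell you would then write `email, password = load_credentials()` in place of the two string literals.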

Prompts such as "Save login info" and the settings dialogs are handled manually

    word = "제주도맛집"  # "Jeju restaurants"
    url = insta_searching(word)
    driver.get(url)

4. Select the first post

    from selenium.webdriver.common.by import By

    def select_first(driver):
        first = driver.find_element(By.CSS_SELECTOR, 'div._9AhH0')
        first.click()
        time.sleep(3)

    select_first(driver)

HTML: <tag class="classname">

CSS selector: tagname.classname

CSS is the language that styles the page

5. Get the post content

Body text, posting time, likes

    import re
    from bs4 import BeautifulSoup

    def get_content(driver):
        # Get the current page
        html = driver.page_source  # everything shown by right-click > View source
        soup = BeautifulSoup(html, 'html.parser')  # html.parser is the parser

        # Body text (with exception handling)
        try:
            content = soup.select('div.C4VMK > span')[0].text
        except IndexError:
            content = ' '

        # Hashtags
        tags = re.findall(r'#[^\s#,\\]+', content)

        # Posting date
        date = soup.select('time._1o9PC.Nzb55')[0]['datetime'][:10]  # keep only the first 10 characters of datetime

        # Number of likes
        try:
            like = soup.select('div.Nm9Fw > button')[0].text[4:-1]  # from index 4 up to, but not including, the last character
        except IndexError:
            like = 0

        # Location
        try:
            place = soup.select('div.M30cS')[0].text
        except IndexError:
            place = ''

        # Store the collected fields
        data = [content, date, like, place, tags]
        return data

    get_content(driver)
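
The string handling inside get_content (the hashtag regex, the [:10] date slice and the [4:-1] like slice) is easier to follow on plain strings. The sketch below applies the same operations to made-up sample text. It assumes the like button text has the form "좋아요 N개" ("N likes"), and `parse_post_fields` is a hypothetical helper, not part of the original code.

```python
import re

def parse_post_fields(content, like_text, datetime_attr):
    """Apply the same string handling as get_content to already-extracted text."""
    # Hashtags: '#' followed by anything up to whitespace, '#', ',' or a backslash
    tags = re.findall(r'#[^\s#,\\]+', content)
    # Keep only the 'YYYY-MM-DD' part of the datetime attribute
    date = datetime_attr[:10]
    # Drop the 4-character prefix ("좋아요 ") and the last character ("개")
    like = like_text[4:-1]
    like = int(like.replace(',', '')) if like else 0
    return tags, date, like

# The sample strings below are made up for illustration.
print(parse_post_fields('제주 여행 #제주도맛집 #jeju,#food',
                        '좋아요 1,234개',
                        '2021-03-26T10:00:13.000Z'))
# prints (['#제주도맛집', '#jeju', '#food'], '2021-03-26', 1234)
```

Unlike get_content, this version also converts the like count to an integer, which is convenient if you plan to sort or chart the results later.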

6. Open the previous/next post

    ## Move right (next post)
    from selenium.webdriver.common.by import By

    def move_right(driver):
        right = driver.find_element(By.CSS_SELECTOR, 'a.coreSpriteRightPaginationArrow')
        right.click()
        time.sleep(3)

    move_right(driver)

    ## Move left (previous post)
    from selenium.webdriver.common.by import By

    def move_left(driver):
        left = driver.find_element(By.CSS_SELECTOR, 'a.coreSpriteLeftPaginationArrow')
        left.click()
        time.sleep(3)

    move_left(driver)
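
With get_content and move_right in place, collecting several posts is a matter of repeating "read, then move right". The loop below is a sketch of one way to do that, not the original author's code. It takes the two steps as plain callables so it can be tried without a browser; `collect_posts` is a hypothetical name.

```python
def collect_posts(count, get_content, move_right):
    """Collect up to `count` posts by reading one post and then moving right."""
    results = []
    for i in range(count):
        try:
            results.append(get_content())
            if i < count - 1:         # no need to move after the last post
                move_right()
        except Exception as exc:      # e.g. the next arrow is missing on the last post
            print("Stopped early:", exc)
            break
    return results
```

In the notebook this could be called as `collect_posts(10, lambda: get_content(driver), lambda: move_right(driver))` after select_first(driver) has opened the first post.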

from http://qucdas.tistory.com/45 by ccl(A) rewrite - 2021-03-26 10:00:13
