BeautifulSoup

BeautifulSoup

• A powerful Python library for parsing HTML and XML.
• Simple and concise: elements can be located by tag name, id, class, and similar attributes without writing regular expressions.
• Installation: pip install beautifulsoup4

Main functions

| Function | Description |
| --- | --- |
| find_all(tag) | Returns a list of every element that matches the tag |
| find(tag) | Returns only the first element that matches the tag |
| select(selector) | Returns a list of the elements that match a CSS selector |

Usage

find_all(), find()
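```python
from bs4 import BeautifulSoup

# ... code that fetches the page into `html` is omitted ...
# 'lxml' requires the lxml package; the built-in 'html.parser' also works.
soup = BeautifulSoup(html, 'lxml')

# 1. List of posts
post_list = soup.find_all('div', {'class': 'post-preview'})

# 2. Title, subtitle, and date of each post
for post in post_list:
    title = post.find('h2', {'class': 'post-title'}).text.strip()
    subtitle = post.find('h3', {'class': 'post-subtitle'}).text.strip()
    date = post.find('p', {'class': 'post-meta'}).text.strip()
    print(title, subtitle, date)
```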
select()
```python
# ... setup code omitted ...
selector = '#yesFixCorner > dl > dd > ul.yesCornerLi > li:nth-child(1) > a'
data = soup.select(selector)  # list of every element matching the CSS selector
```
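
To see what a selector of this shape matches, here is a minimal sketch on hand-written markup that imitates the id and class names used in the selector above. The HTML is invented for the example and is not the actual yes24 page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only imitates the names used in the selector.
html = """
<div id="yesFixCorner">
  <dl>
    <dd>
      <ul class="yesCornerLi">
        <li><a href="/first">First link</a></li>
        <li><a href="/second">Second link</a></li>
      </ul>
    </dd>
  </dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

selector = "#yesFixCorner > dl > dd > ul.yesCornerLi > li:nth-child(1) > a"

# select() always returns a list, even when only one element matches.
data = soup.select(selector)
link_texts = [a.text for a in data]
link_hrefs = [a["href"] for a in data]

# select_one() returns the first match directly, or None if there is none.
first_link = soup.select_one(selector)

print(link_texts, link_hrefs)
```

Each `>` step matches direct children only, and `li:nth-child(1)` keeps just the first list item, so the second link is not selected.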

Tag ํ™•์ธ ๋ฐฉ๋ฒ•

โ€ข
๋งˆ์šฐ์Šค ์šฐํด๋ฆญ > Copy > Copy element ๋˜๋Š” Copy Selector ๋“ฑ ์ƒํ™ฉ์— ๋”ฐ๋ผ ์ ์ ˆํ•˜๊ฒŒ ์‚ฌ์šฉ

๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ๋ฐฉ๋ฒ•

<div class="post-preview"> <a href="index.html"> <h2 class="post-title"> ๋‹จํ’๊ตฌ๊ฒฝ๊ฐ€์ž~~ </h2> <h3 class="post-subtitle"> ๋™ํ‚ค๋ž‘ ๋‹จํ’๊ตฌ๊ฒฝ ๋‹ค๋…€์™”์–ด์š” </h3> </a> <p class="post-meta">September 24, 2021</p> </div> <hr> <div class="post-preview"> <a href="index.html"> <h2 class="post-title"> ๋‹ค์ด์–ดํŠธ ์‹œ์ž‘! </h2> <h3 class="post-subtitle"> ๊ฟ€์€ ์ด์ œ ๊ทธ๋งŒ, ๋‚ ์”ฌํ•œ ๊ณฐ๋Œ์ด๋กœ ๋‹ค์‹œ ํƒœ์–ด๋‚˜๊ธฐ </h3> </a> <p class="post-meta">June 18, 2021</p> </div> <hr>
HTML
๋ณต์‚ฌ
1. Work at the post level: split the page into individual posts.
   a. The posts on a page may not all share the same format, so splitting the page into posts has to come first.
2. Analyze each post: extract the title, subtitle, and date contained in every single post.
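
The two steps can be sketched as follows. The sample markup is modeled on the HTML listing above, with invented placeholder titles and a second post that deliberately lacks a subtitle, to show why each post is handled on its own. The helper `text_or_none` is a hypothetical name introduced for this example.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the listing above; the second post
# deliberately lacks a subtitle to show a differing format.
html = """
<div class="post-preview">
  <a href="index.html">
    <h2 class="post-title"> First post </h2>
    <h3 class="post-subtitle"> A subtitle </h3>
  </a>
  <p class="post-meta">September 24, 2021</p>
</div>
<hr>
<div class="post-preview">
  <a href="index.html">
    <h2 class="post-title"> Second post </h2>
  </a>
  <p class="post-meta">June 18, 2021</p>
</div>
"""


def text_or_none(parent, tag, class_name):
    """Return the stripped text of the first matching child, or None."""
    found = parent.find(tag, {"class": class_name})
    return found.text.strip() if found is not None else None


soup = BeautifulSoup(html, "html.parser")

# Step 1: split the page into individual posts.
posts = soup.find_all("div", {"class": "post-preview"})

# Step 2: analyze each post on its own.
records = []
for post in posts:
    records.append({
        "title": text_or_none(post, "h2", "post-title"),
        "subtitle": text_or_none(post, "h3", "post-subtitle"),
        "date": text_or_none(post, "p", "post-meta"),
    })

for record in records:
    print(record)
```

Searching inside each `post` rather than the whole `soup` keeps a missing field in one post from shifting the values of the posts that follow it.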

[Practice] yes24

[Assignment description]
From the bestseller page of yes24 (http://www.yes24.com/), print the rank, title, author, publisher, sale price, and publication date of the books ranked 1-10 on the domestic books bestseller list.
[Sample output]
1. Title: 트렌드 코리아 2024, Author: 김난도, 전미영, 최지혜, 이수진, 권정윤 외 6명, Publisher: 미래의창, Price: 17,100원, Publication date: 2023년 10월 05일
2. ........................
3. ........................
4. ....................
5. ...................
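
Once the fields have been scraped, producing lines in the format of the sample output is a small formatting step. The sketch below assumes each book has already been collected into a dictionary. The key names and the `format_book` helper are hypothetical, and the single record is copied from the sample output rather than scraped.

```python
# Hypothetical record layout; the values are copied from the sample output above.
books = [
    {
        "title": "트렌드 코리아 2024",
        "author": "김난도, 전미영, 최지혜, 이수진, 권정윤 외 6명",
        "publisher": "미래의창",
        "price": "17,100원",
        "pub_date": "2023년 10월 05일",
    },
]


def format_book(rank, book):
    """Build one output line: the rank followed by the labeled fields."""
    return (
        f"{rank}. Title: {book['title']}, Author: {book['author']}, "
        f"Publisher: {book['publisher']}, Price: {book['price']}, "
        f"Publication date: {book['pub_date']}"
    )


# enumerate(..., start=1) supplies the rank; [:10] keeps ranks 1-10 only.
lines = [format_book(rank, book) for rank, book in enumerate(books[:10], start=1)]
for line in lines:
    print(line)
```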

[Practice] Tripadvisor

• Site to crawl: https://www.tripadvisor.co.kr/Restaurants-g294197-Seoul.html
• Reference source code: Jupyter notebook
1. Read the site's HTML: use requests.get(url)
2. Parse the text data into its HTML tags: BeautifulSoup
3. Find only the specific tags needed: find_all, find
4. Clean the required data values
5. Remove sponsored (ad) listings
6. Build a DataFrame from the results

[Assignment] Coupang

• Site to crawl: https://www.coupang.com
• Reference source code: Jupyter notebook
1. Read the site's HTML: use requests.get(url)
   • Set the request headers: http://www.useragentstring.com/
   • Understand the format of the URL (GET method)
2. Parse the text data into its HTML tags: BeautifulSoup
3. Find only the specific tags needed: find_all, find
4. Clean the required data values
   • Remove ad products
5. Save the data
   • Save in CSV format
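
The request setup in step 1 and the saving in step 5 can be sketched with the standard library alone. The search endpoint, the parameter names `q` and `page`, the placeholder User-Agent value, and the sample rows are all assumptions for illustration. The actual `requests.get` call is left commented out so the sketch runs offline.

```python
import csv
import io
from urllib.parse import urlencode

# Assumed search endpoint and parameter names; check the address bar
# of an actual search to confirm them.
BASE_URL = "https://www.coupang.com/np/search"


def build_search_url(keyword, page=1):
    """Build a GET URL: base address + '?' + key=value pairs joined by '&'."""
    return f"{BASE_URL}?{urlencode({'q': keyword, 'page': page})}"


# A User-Agent header identifies the client as a browser.
# The value below is only a placeholder; copy a real one from useragentstring.com.
headers = {"User-Agent": "Mozilla/5.0"}

url = build_search_url("laptop", page=2)
# response = requests.get(url, headers=headers)  # not executed in this sketch

# Hypothetical cleaned rows, standing in for scraped results.
rows = [
    {"name": "Laptop A", "price": 1200000, "is_ad": False},
    {"name": "Laptop B", "price": 990000, "is_ad": True},
]

# Remove ad products before saving.
products = [row for row in rows if not row["is_ad"]]

# Write the CSV into an in-memory buffer; replace it with
# open("products.csv", "w", newline="", encoding="utf-8-sig") to write a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"], extrasaction="ignore")
writer.writeheader()
writer.writerows(products)
csv_text = buffer.getvalue()

print(url)
print(csv_text)
```

`urlencode` also percent-encodes non-ASCII keywords, so a Korean search term can be passed to `build_search_url` unchanged.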