
BeautifulSoup

- Easy to learn

- Requires a separate HTTP library (e.g. requests) to fetch pages

- Relatively slow

- Doesn't work well with some pages (e.g. Amazon); a common header workaround is sketched after the example below

 

Examples

# In terminal
pip install beautifulsoup4
pip install requests

# Extract
from bs4 import BeautifulSoup
import requests

URL = 'https://blog.scrapinghub.com/'

# Download the page and parse it with the built-in html.parser
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')

# Each post title sits inside a <div class="post-header"> element
titles = soup.find_all('div', {'class': 'post-header'})

for title in titles:
    result = title.find('h2').text.strip()
    print(result)


'''
result
Blog Comments API (BETA): Extract Blog Comment DATA At Scale
Your Price Intelligence Questions Answered
Data Center Proxies vs. Residential Proxies
How to Get High Success Rates With Proxies: 3 Steps to Scale Up
Job Postings API: Stable release
Web Scraping Basics: A Developer’s Guide To Reliably Extract Data
Extracting Article & News Data: The Importance of Data Quality
Price Gouging or Economics at Work: Price Intelligence to Track Consumer Sentiment
A Practical Guide to Web Data QA Part III: Holistic Data Valid
'''
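
Sites that block the default requests client (e.g. Amazon, as noted above) can sometimes be handled by sending a browser-like User-Agent header. A minimal sketch, assuming the site only checks that header; the header value below is purely illustrative:

# Workaround for pages that reject the default python-requests User-Agent
from bs4 import BeautifulSoup
import requests

URL = 'https://blog.scrapinghub.com/'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}  # illustrative value

page = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.title.text)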

 

Scrapy

- Fast

- Not as beginner-friendly (requires its own project structure and commands)

 

Examples

# Install
pip install scrapy

# Start a project
scrapy startproject <project_name>

# Run the 'posts' spider defined below (from inside the project directory)
scrapy crawl posts

# Extract (save the spider under <project_name>/spiders/)
import scrapy


class PostsSpider(scrapy.Spider):
    name = 'posts'

    start_urls = [
        'https://blog.scrapinghub.com/',
    ]

    def parse(self, response):
        # Each blog post is wrapped in a <div class="post-item">
        for post in response.css('div.post-item'):
            yield {
                'title': post.css('.post-header h2 a::text').get(),
            }

        # Follow the "next page" link until there are none left
        next_page = response.css('a.next-posts-link::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
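
To save the results instead of printing them, Scrapy's built-in feed export can write the yielded items straight to a file:

# Run and export the scraped items to JSON
scrapy crawl posts -o posts.json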

 

Selenium Driver

- Versatile

- Works well with JavaScript-heavy pages

- Relatively slow

 

A Selenium example will be added later.
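
Until then, here is a minimal sketch of what such an example might look like, assuming Selenium 4 with Chrome (Selenium 4.6+ can download a matching driver automatically); the CSS selector is borrowed from the Scrapy example above:

# In terminal
pip install selenium

# Extract with a real browser (the page's JavaScript is executed)
from selenium import webdriver
from selenium.webdriver.common.by import By

URL = 'https://blog.scrapinghub.com/'

driver = webdriver.Chrome()
try:
    driver.get(URL)
    titles = driver.find_elements(By.CSS_SELECTOR, '.post-header h2 a')
    for title in titles:
        print(title.text.strip())
finally:
    driver.quit()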

 

 

 
