instascrape.scrapers package¶
Submodules¶
instascrape.scrapers.hashtag module¶
Hashtag¶
Scrape data from a Hashtag page
-
class
instascrape.scrapers.hashtag.
Hashtag
(source: Union[str, bs4.BeautifulSoup, Dict[str, Any]])¶ Bases:
instascrape.core._static_scraper._StaticHtmlScraper
Scraper for an Instagram hashtag page
-
get_recent_posts
(amt: int = 71) → List[instascrape.scrapers.post.Post]¶ Return a list of recent posts to the hashtag
Parameters: amt (int) – Amount of recent posts to return Returns: posts – List containing the recent posts and their available data Return type: List[Post]
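For orientation, a minimal usage sketch (assumes instascrape is installed and network access is available when the function is called; the tag name and URL pattern are illustrative):

```python
def recent_hashtag_posts(tag: str, amt: int = 12):
    """Scrape a hashtag page and return its recent posts.

    Requires the instascrape package and network access at call time;
    the tag passed in is purely illustrative.
    """
    from instascrape import Hashtag  # deferred so the sketch imports lazily

    hashtag = Hashtag(f"https://www.instagram.com/explore/tags/{tag}/")
    hashtag.scrape()
    return hashtag.get_recent_posts(amt=amt)
```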
-
scrape
(mapping=None, keys: Optional[List[str]] = None, exclude: Optional[List[str]] = None, headers={'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57'}, inplace=True, session=None, webdriver=None) → None¶ Scrape data from the source
Parameters: - mapping (Dict[str, deque]) – Dictionary of parsing queues that tell the JSON engine how to process the JSON data
- keys (List[str]) – List of strings that correspond to desired attributes for scraping
- exclude (List[str]) – List of strings that correspond to which attributes to exclude from being scraped
- headers (Dict[str, str]) – Dictionary of request headers to be passed on the GET request
- inplace (bool) – Determines if data is modified in place or a new object is returned with the scraped data
- session (requests.Session) – Session for making the GET request
- webdriver (selenium.webdriver.chrome.webdriver.WebDriver) – Webdriver for scraping the page, overrides any default or passed session
Returns: Optionally returns a scraped instance instead of modifying in place if the inplace arg is False
Return type: return_instance
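A hedged sketch of calling scrape() with a subset of attributes and a reused Session (the attribute names passed to keys are illustrative guesses, not a verified list):

```python
def scrape_hashtag_selectively(url: str):
    """Scrape only a subset of attributes from a hashtag page.

    The key names below are illustrative; requires instascrape and
    network access at call time.
    """
    import requests
    from instascrape import Hashtag  # deferred import

    session = requests.Session()
    hashtag = Hashtag(url)
    # Limit scraping to a few attributes and reuse one Session for the GET.
    hashtag.scrape(keys=["name", "amount_of_posts"], session=session)
    return hashtag
```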
-
session
= <requests.sessions.Session object>¶
-
to_csv
(fp: str) → None¶ Write scraped data to .csv at the given filepath
Parameters: fp (str) – Filepath to write data to
-
to_dict
(metadata: bool = False) → Dict[str, Any]¶ Return a dictionary containing all of the data that has been scraped
Parameters: metadata (bool) – Boolean value that determines if metadata specified in self._METADATA_KEYS will be included in the dictionary. Returns: data_dict – Dictionary containing the scraped data Return type: Dict[str, Any]
-
to_json
(fp: str) → None¶ Write scraped data to .json file at the given filepath
Parameters: fp (str) – Filepath to write data to
-
instascrape.scrapers.post module¶
Post¶
Scrape data from a Post page
-
class
instascrape.scrapers.post.
Post
(source: Union[str, bs4.BeautifulSoup, Dict[str, Any]])¶ Bases:
instascrape.core._static_scraper._StaticHtmlScraper
Scraper for an Instagram post page
-
SUPPORTED_DOWNLOAD_EXTENSIONS
= ['.mp3', '.mp4', '.png', '.jpg']¶
-
download
(fp: str) → None¶ Download an image or video from a post to your local machine at the given filepath
Parameters: fp (str) – Filepath to download the media to
-
embed
() → str¶ Return embeddable HTML str for this post
Returns: html_template – HTML string with embed markup for this Post Return type: str
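A sketch tying download() and embed() together (the URL and filepath are placeholders; requires instascrape and network access when called):

```python
def save_and_embed_post(url: str, fp: str):
    """Scrape a post, download its media, and return embed HTML.

    Placeholder URL/filepath; fp should use one of the supported
    extensions (.mp3, .mp4, .png, .jpg).
    """
    from instascrape import Post  # deferred import

    post = Post(url)
    post.scrape()
    post.download(fp=fp)   # e.g. "my_post.png" or "my_post.mp4"
    return post.embed()    # HTML string with embed markup
```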
-
get_recent_comments
() → List[instascrape.scrapers.comment.Comment]¶ Return a list of Comment objects that contain data regarding some of the post's comments
Returns: comments_arr – List of Comment objects Return type: List[Comment]
-
scrape
(mapping=None, keys: Optional[List[str]] = None, exclude: Optional[List[str]] = None, headers={'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57'}, inplace=True, session=None, webdriver=None) → None¶ Scrape data from the source
Parameters: - mapping (Dict[str, deque]) – Dictionary of parsing queues that tell the JSON engine how to process the JSON data
- keys (List[str]) – List of strings that correspond to desired attributes for scraping
- exclude (List[str]) – List of strings that correspond to which attributes to exclude from being scraped
- headers (Dict[str, str]) – Dictionary of request headers to be passed on the GET request
- inplace (bool) – Determines if data is modified in place or a new object is returned with the scraped data
- session (requests.Session) – Session for making the GET request
- webdriver (selenium.webdriver.chrome.webdriver.WebDriver) – Webdriver for scraping the page, overrides any default or passed session
Returns: Optionally returns a scraped instance instead of modifying in place if the inplace arg is False
Return type: return_instance
-
session
= <requests.sessions.Session object>¶
-
to_csv
(fp: str) → None¶ Write scraped data to .csv at the given filepath
Parameters: fp (str) – Filepath to write data to
-
to_dict
(metadata: bool = False) → Dict[str, Any]¶ Return a dictionary containing all of the data that has been scraped
Parameters: metadata (bool) – Boolean value that determines if metadata specified in self._METADATA_KEYS will be included in the dictionary. Returns: data_dict – Dictionary containing the scraped data Return type: Dict[str, Any]
-
to_json
(fp: str) → None¶ Write scraped data to .json file at the given filepath
Parameters: fp (str) – Filepath to write data to
-
instascrape.scrapers.profile module¶
-
class
instascrape.scrapers.profile.
Profile
(source: Union[str, bs4.BeautifulSoup, Dict[str, Any]])¶ Bases:
instascrape.core._static_scraper._StaticHtmlScraper
Scraper for an Instagram profile page
-
get_posts
(webdriver, amount=None, login_first=False, login_pause=60, max_failed_scroll=300, scrape=False, scrape_pause=5)¶ Return Post objects from a profile, scraped using a webdriver (not included with the library)
Parameters: - webdriver (selenium.webdriver.chrome.webdriver.WebDriver) – Selenium webdriver for rendering JavaScript and loading dynamic content
- amount (int) – Amount of posts to return, default is all of them
- login_first (bool) – Start on login page to allow user to manually login to Instagram
- login_pause (int) – Length of time in seconds to pause before starting scrape
- max_failed_scroll (int) – Maximum amount of scroll attempts before stopping if scroll is stuck
- scrape (bool) – Scrape posts with the webdriver prior to returning
- scrape_pause (int) – Time in seconds between each scrape
Returns: posts – Post objects gathered from the profile page
Return type: List[Post]
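A sketch of driving get_posts() with Selenium (assumes selenium and a chromedriver binary are available; the selenium-3-style Chrome(driver_path) constructor is an assumption, and newer Selenium versions use a Service object instead):

```python
def all_profile_posts(profile_url: str, driver_path: str):
    """Dynamically load and scrape every post on a profile.

    Paths and URL are placeholders; requires instascrape, selenium,
    a chromedriver binary, and network access at call time.
    """
    from selenium.webdriver import Chrome  # deferred import
    from instascrape import Profile

    webdriver = Chrome(driver_path)  # selenium-3-style constructor
    profile = Profile(profile_url)
    profile.scrape()
    # login_first pauses so you can log in manually before scrolling.
    posts = profile.get_posts(webdriver=webdriver, login_first=True,
                              scrape=True, scrape_pause=5)
    webdriver.quit()
    return posts
```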
-
get_recent_posts
(amt: int = 12) → List[instascrape.scrapers.post.Post]¶ Return a list of the profile's recent posts. Max available for return is 12.
Parameters: amt (int) – Amount of recent posts to return Returns: posts – List containing the recent 12 posts and their available data Return type: List[Post]
-
scrape
(mapping=None, keys: Optional[List[str]] = None, exclude: Optional[List[str]] = None, headers={'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57'}, inplace=True, session=None, webdriver=None) → None¶ Scrape data from the source
Parameters: - mapping (Dict[str, deque]) – Dictionary of parsing queues that tell the JSON engine how to process the JSON data
- keys (List[str]) – List of strings that correspond to desired attributes for scraping
- exclude (List[str]) – List of strings that correspond to which attributes to exclude from being scraped
- headers (Dict[str, str]) – Dictionary of request headers to be passed on the GET request
- inplace (bool) – Determines if data is modified in place or a new object is returned with the scraped data
- session (requests.Session) – Session for making the GET request
- webdriver (selenium.webdriver.chrome.webdriver.WebDriver) – Webdriver for scraping the page, overrides any default or passed session
Returns: Optionally returns a scraped instance instead of modifying in place if the inplace arg is False
Return type: return_instance
-
session
= <requests.sessions.Session object>¶
-
to_csv
(fp: str) → None¶ Write scraped data to .csv at the given filepath
Parameters: fp (str) – Filepath to write data to
-
to_dict
(metadata: bool = False) → Dict[str, Any]¶ Return a dictionary containing all of the data that has been scraped
Parameters: metadata (bool) – Boolean value that determines if metadata specified in self._METADATA_KEYS will be included in the dictionary. Returns: data_dict – Dictionary containing the scraped data Return type: Dict[str, Any]
-
to_json
(fp: str) → None¶ Write scraped data to .json file at the given filepath
Parameters: fp (str) – Filepath to write data to
-
instascrape.scrapers.reel module¶
-
class
instascrape.scrapers.reel.
Reel
(source: Union[str, bs4.BeautifulSoup, Dict[str, Any]])¶ Bases:
instascrape.scrapers.post.Post
Scraper for an Instagram reel
-
SUPPORTED_DOWNLOAD_EXTENSIONS
= ['.mp3', '.mp4', '.png', '.jpg']¶
-
download
(fp: str) → None¶ Download an image or video from a post to your local machine at the given filepath
Parameters: fp (str) – Filepath to download the media to
-
embed
() → str¶ Return embeddable HTML str for this post
Returns: html_template – HTML string with embed markup for this Post Return type: str
-
get_recent_comments
() → List[instascrape.scrapers.comment.Comment]¶ Return a list of Comment objects that contain data regarding some of the post's comments
Returns: comments_arr – List of Comment objects Return type: List[Comment]
-
scrape
(mapping=None, keys: Optional[List[str]] = None, exclude: Optional[List[str]] = None, headers={'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57'}, inplace=True, session=None, webdriver=None) → None¶ Scrape data from the source
Parameters: - mapping (Dict[str, deque]) – Dictionary of parsing queues that tell the JSON engine how to process the JSON data
- keys (List[str]) – List of strings that correspond to desired attributes for scraping
- exclude (List[str]) – List of strings that correspond to which attributes to exclude from being scraped
- headers (Dict[str, str]) – Dictionary of request headers to be passed on the GET request
- inplace (bool) – Determines if data is modified in place or a new object is returned with the scraped data
- session (requests.Session) – Session for making the GET request
- webdriver (selenium.webdriver.chrome.webdriver.WebDriver) – Webdriver for scraping the page, overrides any default or passed session
Returns: Optionally returns a scraped instance instead of modifying in place if the inplace arg is False
Return type: return_instance
-
session
= <requests.sessions.Session object>¶
-
to_csv
(fp: str) → None¶ Write scraped data to .csv at the given filepath
Parameters: fp (str) – Filepath to write data to
-
to_dict
(metadata: bool = False) → Dict[str, Any]¶ Return a dictionary containing all of the data that has been scraped
Parameters: metadata (bool) – Boolean value that determines if metadata specified in self._METADATA_KEYS will be included in the dictionary. Returns: data_dict – Dictionary containing the scraped data Return type: Dict[str, Any]
-
to_json
(fp: str) → None¶ Write scraped data to .json file at the given filepath
Parameters: fp (str) – Filepath to write data to
-
instascrape.scrapers.location module¶
-
class
instascrape.scrapers.location.
Location
(source: Union[str, bs4.BeautifulSoup, Dict[str, Any]])¶ Bases:
instascrape.core._static_scraper._StaticHtmlScraper
Scraper for an Instagram location page
-
get_recent_posts
(amt: int = 24) → List[instascrape.scrapers.post.Post]¶ Return a list of recent posts to the location
Parameters: amt (int) – Amount of recent posts to return Returns: posts – List containing the recent 24 posts and their available data Return type: List[Post]
-
scrape
(mapping=None, keys: Optional[List[str]] = None, exclude: Optional[List[str]] = None, headers={'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57'}, inplace=True, session=None, webdriver=None) → None¶ Scrape data from the source
Parameters: - mapping (Dict[str, deque]) – Dictionary of parsing queues that tell the JSON engine how to process the JSON data
- keys (List[str]) – List of strings that correspond to desired attributes for scraping
- exclude (List[str]) – List of strings that correspond to which attributes to exclude from being scraped
- headers (Dict[str, str]) – Dictionary of request headers to be passed on the GET request
- inplace (bool) – Determines if data is modified in place or a new object is returned with the scraped data
- session (requests.Session) – Session for making the GET request
- webdriver (selenium.webdriver.chrome.webdriver.WebDriver) – Webdriver for scraping the page, overrides any default or passed session
Returns: Optionally returns a scraped instance instead of modifying in place if the inplace arg is False
Return type: return_instance
-
session
= <requests.sessions.Session object>¶
-
to_csv
(fp: str) → None¶ Write scraped data to .csv at the given filepath
Parameters: fp (str) – Filepath to write data to
-
to_dict
(metadata: bool = False) → Dict[str, Any]¶ Return a dictionary containing all of the data that has been scraped
Parameters: metadata (bool) – Boolean value that determines if metadata specified in self._METADATA_KEYS will be included in the dictionary. Returns: data_dict – Dictionary containing the scraped data Return type: Dict[str, Any]
-
to_json
(fp: str) → None¶ Write scraped data to .json file at the given filepath
Parameters: fp (str) – Filepath to write data to
-
instascrape.scrapers.igtv module¶
-
class
instascrape.scrapers.igtv.
IGTV
(source: Union[str, bs4.BeautifulSoup, Dict[str, Any]])¶ Bases:
instascrape.scrapers.post.Post
Scraper for an IGTV post
-
SUPPORTED_DOWNLOAD_EXTENSIONS
= ['.mp3', '.mp4', '.png', '.jpg']¶
-
download
(fp: str) → None¶ Download an image or video from a post to your local machine at the given filepath
Parameters: fp (str) – Filepath to download the media to
-
embed
() → str¶ Return embeddable HTML str for this post
Returns: html_template – HTML string with embed markup for this Post Return type: str
-
get_recent_comments
() → List[instascrape.scrapers.comment.Comment]¶ Return a list of Comment objects that contain data regarding some of the post's comments
Returns: comments_arr – List of Comment objects Return type: List[Comment]
-
scrape
(mapping=None, keys: Optional[List[str]] = None, exclude: Optional[List[str]] = None, headers={'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57'}, inplace=True, session=None, webdriver=None) → None¶ Scrape data from the source
Parameters: - mapping (Dict[str, deque]) – Dictionary of parsing queues that tell the JSON engine how to process the JSON data
- keys (List[str]) – List of strings that correspond to desired attributes for scraping
- exclude (List[str]) – List of strings that correspond to which attributes to exclude from being scraped
- headers (Dict[str, str]) – Dictionary of request headers to be passed on the GET request
- inplace (bool) – Determines if data is modified in place or a new object is returned with the scraped data
- session (requests.Session) – Session for making the GET request
- webdriver (selenium.webdriver.chrome.webdriver.WebDriver) – Webdriver for scraping the page, overrides any default or passed session
Returns: Optionally returns a scraped instance instead of modifying in place if the inplace arg is False
Return type: return_instance
-
session
= <requests.sessions.Session object>¶
-
to_csv
(fp: str) → None¶ Write scraped data to .csv at the given filepath
Parameters: fp (str) – Filepath to write data to
-
to_dict
(metadata: bool = False) → Dict[str, Any]¶ Return a dictionary containing all of the data that has been scraped
Parameters: metadata (bool) – Boolean value that determines if metadata specified in self._METADATA_KEYS will be included in the dictionary. Returns: data_dict – Dictionary containing the scraped data Return type: Dict[str, Any]
-
to_json
(fp: str) → None¶ Write scraped data to .json file at the given filepath
Parameters: fp (str) – Filepath to write data to
-
instascrape.scrapers.scrape_tools module¶
-
instascrape.scrapers.scrape_tools.
determine_json_type
(json_data: Union[Dict[str, Any], str]) → str¶ Return the type of Instagram page based on the JSON data parsed from source
Parameters: json_data (Union[JSONDict, str]) – JSON data that will be checked and parsed to determine what type of page it represents (Profile, Post, Hashtag, etc.) Returns: instagram_type – Name of the type of page being parsed Return type: str
-
instascrape.scrapers.scrape_tools.
flatten_dict
(json_dict: Dict[str, Any]) → Dict[str, Any]¶ Returns a flattened dictionary of data
Parameters: json_dict (dict) – Input dictionary for flattening Returns: flattened_dict – Flattened dictionary Return type: dict
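The flattening behavior can be illustrated with a minimal reimplementation — a sketch of the idea only; instascrape's actual flatten_dict may handle key collisions or key naming differently:

```python
def flatten(json_dict):
    """Recursively merge nested dicts into a single-level dict.

    Illustrative sketch; collision rules may differ from
    instascrape.scrapers.scrape_tools.flatten_dict.
    """
    flat = {}
    for key, value in json_dict.items():
        if isinstance(value, dict):
            flat.update(flatten(value))  # recurse into nested dicts
        else:
            flat[key] = value
    return flat

nested = {"graphql": {"user": {"username": "alice", "edge_followed_by": {"count": 3}}}}
print(flatten(nested))  # {'username': 'alice', 'count': 3}
```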
-
instascrape.scrapers.scrape_tools.
json_from_html
(source: Union[str, BeautifulSoup], as_dict: bool = True, flatten=False) → Union[Dict[str, Any], str]¶ Return JSON data parsed from Instagram source HTML
Parameters: - source (Union[str, BeautifulSoup]) – Instagram HTML source code to parse the JSON from
- as_dict (bool = True) – Return JSON as dict if True else return JSON as string
- flatten (bool) – Flatten the dictionary prior to returning it
Returns: json_data – Parsed JSON data from the HTML source as either a JSON-like dictionary or just the string serialization
Return type: Union[JSONDict, str]
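Conceptually, this pulls the JSON blob Instagram embeds in a script tag of the page source. A simplified sketch of that idea (the window._sharedData pattern and the regex are illustrative; the real function parses with BeautifulSoup and handles more page layouts):

```python
import json
import re

def json_from_html_sketch(html: str, as_dict: bool = True):
    """Extract the JSON assigned to window._sharedData in page source.

    Simplified illustration only, not the library's exact parsing.
    """
    match = re.search(r"window\._sharedData\s*=\s*(\{.*?\});", html, re.DOTALL)
    raw = match.group(1)
    return json.loads(raw) if as_dict else raw

html = '<script>window._sharedData = {"entry_data": {"PostPage": []}};</script>'
data = json_from_html_sketch(html)  # {'entry_data': {'PostPage': []}}
```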
-
instascrape.scrapers.scrape_tools.
json_from_soup
(source, as_dict: bool = True, flatten=False)¶
-
instascrape.scrapers.scrape_tools.
json_from_url
(url: str, as_dict: bool = True, headers={'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57'}, flatten=False) → Union[Dict[str, Any], str]¶ Return JSON data parsed from a provided Instagram URL
Parameters: - url (str) – URL of the page to get the JSON data from
- as_dict (bool = True) – Return JSON as dict if True else return JSON as string
- headers (Dict[str, str]) – Dictionary of request headers to be passed on the GET request
- flatten (bool) – Flatten the dictionary prior to returning it
Returns: json_data – Parsed JSON data from the URL as either a JSON-like dictionary or just the string serialization
Return type: Union[JSONDict, str]
-
instascrape.scrapers.scrape_tools.
parse_data_from_json
(json_dict, map_dict, default_value=nan)¶ Parse data from a JSON dictionary using a mapping dictionary that tells the parsing engine how to traverse the JSON
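The mapping is a dictionary of deques, each a queue of keys to follow into the nested JSON. A sketch of how such a queue drives the parse (illustrative, not the library's exact engine; the attribute and key names are invented):

```python
from collections import deque

def parse_sketch(json_dict, map_dict, default_value=float("nan")):
    """Walk each deque of keys into json_dict; fall back to default_value."""
    data = {}
    for attr, path in map_dict.items():
        node = json_dict
        try:
            for key in path:  # follow the queue of keys left to right
                node = node[key]
            data[attr] = node
        except (KeyError, TypeError):
            data[attr] = default_value  # key path not present in this JSON
    return data

mapping = {"username": deque(["owner", "username"]),
           "likes": deque(["edge_liked_by", "count"])}
post_json = {"owner": {"username": "bob"}, "edge_liked_by": {"count": 7}}
print(parse_sketch(post_json, mapping))  # {'username': 'bob', 'likes': 7}
```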
-
instascrape.scrapers.scrape_tools.
scrape_posts
(posts: List[Post], session: Optional[requests.sessions.Session] = None, webdriver: Optional[selenium.webdriver.chrome.webdriver.WebDriver] = None, limit: Union[int, datetime.datetime, None] = None, headers: dict = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57'}, pause: int = 5, on_exception: str = 'raise', silent: bool = True, inplace: bool = False)¶
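A sketch of batch scraping with a datetime limit (the two-tuple return of scraped and unscraped posts is an assumption, since the signature documents no return annotation; verify against the source before relying on it):

```python
import datetime

def scrape_last_weeks_posts(posts):
    """Batch-scrape Post objects, stopping at posts older than a week.

    Assumes instascrape is installed and that scrape_posts returns a
    (scraped, unscraped) pair; both are unverified assumptions here.
    """
    from instascrape import scrape_posts  # deferred import

    cutoff = datetime.datetime.now() - datetime.timedelta(days=7)
    scraped, unscraped = scrape_posts(posts, limit=cutoff, pause=5,
                                      on_exception="raise")
    return scraped
```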
Module contents¶
Primary API scraper tools