Prerequisites
Data must be machine readable before it can be pulled into a database, data warehouse, or data visualization application. This typically means the data lives in a structured database or is stored in a tabular format (e.g. CSV, TSV, or Excel). However, being able to work with semi-structured data (JSON) and HTML tables lets you access data from a much wider variety of sources.
Imagine you have a website that you log in to that generates reports. A manual way of loading these reports into a data dashboard, such as Power BI or Tableau, would involve 1) logging into the website, 2) downloading a report, and 3) combining the data from the new report with data from previous reports and loading it into the data visualization software. These three steps can be automated using three Python libraries: Selenium, Requests, and Pandas.
pip install selenium
To automate the process of logging into the website, open Dev Tools (right click + Inspect) or press F12. Using Dev Tools, select the username input box, right click on the element, and choose 'Copy full XPath'. Locate the element with Selenium's find-element-by-XPath method, wrapped in an explicit wait with an expected condition so the action runs as soon as the element becomes available.
Then use send_keys to enter your username and password, reading the credentials from environment variables rather than hard-coding them.
With Dev Tools open, switch to the Network tab and hit submit. Analyze the network traffic, looking specifically for XHR/HTTP requests; this shows you how the login logic works. If the website uses a simple login protocol that validates your username and password and returns session credentials (e.g. session or auth cookies), you can use the Requests library instead of Selenium to obtain the necessary cookies. If the login process is more complex, it is better to use Selenium to automate authentication.
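For the simple case, the Requests route can be sketched like this. The endpoint and form field names are assumptions for illustration; copy the real ones from the request payload captured in the Network tab.

```python
import os

import requests

# Hypothetical login endpoint -- replace with the URL observed in the
# Network tab when you hit submit.
LOGIN_URL = "https://example.com/api/login"


def login_with_requests():
    """Return a requests.Session holding the site's auth cookies."""
    session = requests.Session()
    resp = session.post(
        LOGIN_URL,
        data={
            # Field names are assumptions; match the captured payload.
            "username": os.environ["REPORT_SITE_USER"],
            "password": os.environ["REPORT_SITE_PASS"],
        },
    )
    resp.raise_for_status()
    # On success, the Set-Cookie headers (session / auth cookies) are
    # stored on the session and sent automatically on later requests.
    return session
```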
Once you are logged in, run a report with the Network tab open. You should be able to see the following: 1) where the report is requested from (the endpoint, usually an API), 2) what parameters it expects and in what form (for example, date ranges and the specific format the dates must be sent in), and 3) how the data is returned. The data will likely come back as JSON, a CSV/TSV, or an HTML table. If it is returned in one of these formats, you can begin loading the data with Pandas. It is also possible that the report will be returned as a full HTML webpage, which is still usable, though it will show up as a "document" rather than an "XHR/HTTP" request.
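Replaying that report request from Python might look like the sketch below. Everything here is an assumption for illustration: a hypothetical endpoint that takes a date range as YYYY-MM-DD query parameters. Substitute the endpoint and parameter names you observed in the Network tab.

```python
import requests

# Hypothetical report endpoint, for illustration only.
REPORT_URL = "https://example.com/api/reports/daily"


def download_report(session: requests.Session, start_date: str, end_date: str):
    """Fetch a report for the given date range using an authenticated session."""
    resp = session.get(
        REPORT_URL,
        # Parameter names and the YYYY-MM-DD format are assumptions;
        # send whatever the captured request actually used.
        params={"start_date": start_date, "end_date": end_date},
    )
    resp.raise_for_status()
    # The Content-Type header tells you whether you got JSON, CSV, or HTML.
    return resp.headers.get("Content-Type", ""), resp.text
```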
Pandas has methods to create DataFrames from JSON, CSV, TSV, and HTML tables. If the JSON is deeply nested, or if the data you need is buried inside it, use the standard json library to traverse the structure and extract what you need. If the HTML table is embedded within a larger HTML document, use Beautiful Soup to isolate the specific table you are looking for.
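Each of those formats maps to a short Pandas call. The sample data below is made up for the sketch; note that read_html requires an HTML parser such as lxml to be installed.

```python
import io
import json

import pandas as pd

# CSV text such as a report endpoint might return.
csv_text = "date,orders\n2024-01-01,12\n2024-01-02,15"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Nested JSON: use the json library to pull out the list of records first.
json_text = (
    '{"meta": {"rows": 2}, "data": ['
    '{"date": "2024-01-01", "orders": 12},'
    '{"date": "2024-01-02", "orders": 15}]}'
)
df_json = pd.DataFrame(json.loads(json_text)["data"])

# HTML: read_html returns one DataFrame per <table> in the document.
html_text = (
    "<table><tr><th>date</th><th>orders</th></tr>"
    "<tr><td>2024-01-01</td><td>12</td></tr></table>"
)
df_html = pd.read_html(io.StringIO(html_text))[0]

# Combining a new report with previously saved data (step 3 of the
# workflow) is a concat plus a de-duplication pass.
df_all = pd.concat([df_csv, df_json], ignore_index=True).drop_duplicates()
```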