
chatbot

This repository is under active development. It is part of ongoing research at Illinois Institute of Technology, Chicago, on using LLMs in chatbot/agent development.

The web scraping pipeline has been refined substantially over the past couple of weeks (as of Jan 15, 2024). The scraping logic is currently split across two main files:

  1. url_crawling.py: The MultiCrawler class in this module accepts multiple base_urls. For each base_url, a list of URLs is built by crawling that URL and its child URLs recursively, up to the max_depth parameter. This is handled by the Url_Crawl class, which inherits from LangChain's RecursiveURLLoader. Note that LangChain's source code has been tweaked so that URLs belonging to the same top-level (root) domain are also collected. Out of the box, LangChain sticks strictly to the base_url; because of this, we previously had to identify subdomains ourselves and feed them in separately to obtain their child URLs. Our modification extends the original functionality so that subdomains lying under a given base_url are followed automatically.

    • input format: a list of tuples [(base_url, max_depth), ...]; each tuple contains a base_url along with the maximum recursion depth for collecting child URLs
    • output format: a CSV file containing a single column of the consolidated URLs (an illustrative sketch follows these bullets)
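
The following is a minimal, self-contained sketch of the crawling flow described above. It does not use the repository's MultiCrawler / Url_Crawl classes or the tweaked LangChain loader; the requests/BeautifulSoup crawl, the same_root_domain heuristic, and the url column name in the output CSV are illustrative assumptions.

```python
import csv
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def same_root_domain(url: str, base_url: str) -> bool:
    """Keep URLs whose hostname falls under the base URL's root domain
    (e.g. bulletin.iit.edu matches iit.edu). Naive two-label heuristic."""
    base_host = urlparse(base_url).hostname or ""
    root = ".".join(base_host.split(".")[-2:])
    host = urlparse(url).hostname or ""
    return host == root or host.endswith("." + root)


def crawl(base_url: str, max_depth: int) -> set[str]:
    """Breadth-first crawl of base_url, following same-root-domain links
    up to max_depth levels deep."""
    seen = {base_url}
    queue = deque([(base_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            child = urljoin(url, a["href"]).split("#")[0]
            if child not in seen and same_root_domain(child, base_url):
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


if __name__ == "__main__":
    # Input format: list of (base_url, max_depth) tuples.
    targets = [("https://www.iit.edu", 2), ("https://bulletin.iit.edu", 1)]
    urls: set[str] = set()
    for base_url, max_depth in targets:
        urls |= crawl(base_url, max_depth)
    # Output format: one-column CSV of consolidated URLs (filename is illustrative).
    with open("consolidated_urls.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])
        writer.writerows([u] for u in sorted(urls))
```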
  2. webscrapping.py: This module scrapes webpages iteratively, reading URLs from the CSV file produced by url_crawling.py. There are two aspects to the scraping method:

    i. scraping based on the type of webpage: for IIT webpages, we have designed a customized scraper using the BeautifulSoup library, whereas other webpages are handled with the Unstructured.io library.

    ii. splitting: This is our research area. Three classes are defined in this module. The primary orchestrator is the WebScraper class, whose use_split parameter controls whether scraped pages are stored in per-split_id directories. Before scraping, the URLs are split into categories according to the kind of information they contain; for IIT, some URLs pertain to admissions, others to academics, and so on. We curated five categories using our collaborative judgement, and an additional split_id column was added manually to the CSV file generated by url_crawling.py. When use_split = True in WebScraper, each URL is scraped and the resulting text file is saved in the directory corresponding to its split_id (see the sketch after this list).

    • input format: a CSV file, with or without a split_id column
    • output format: text files
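
Below is a rough sketch of the split-aware scraping loop, assuming the input CSV has a url column and, optionally, a split_id column. The scrape_iit_page and scrape_other_page helpers are hypothetical stand-ins for the repository's customized BeautifulSoup scraper and its Unstructured.io path; directory and file names are illustrative.

```python
import csv
from pathlib import Path
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup


def scrape_iit_page(url: str) -> str:
    """Stand-in for the customized BeautifulSoup scraper used for IIT pages:
    here we simply extract the main-content text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    main = soup.find("main") or soup.body or soup
    return main.get_text(separator="\n", strip=True)


def scrape_other_page(url: str) -> str:
    """Stand-in for the Unstructured.io path used for non-IIT pages."""
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)


def run(csv_path: str, out_dir: str = "scraped", use_split: bool = True) -> None:
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    for i, row in enumerate(rows):
        url = row["url"]  # assumed column name
        # Fall back to a single directory when split_id is absent or splitting is disabled.
        split_id = row.get("split_id", "all") if use_split else "all"
        target_dir = Path(out_dir) / str(split_id)
        target_dir.mkdir(parents=True, exist_ok=True)
        host = urlparse(url).hostname or ""
        scraper = scrape_iit_page if host.endswith("iit.edu") else scrape_other_page
        try:
            text = scraper(url)
        except requests.RequestException:
            continue
        (target_dir / f"page_{i}.txt").write_text(text, encoding="utf-8")


if __name__ == "__main__":
    run("consolidated_urls.csv", use_split=True)
```

With use_split=True and a split_id column present, each page's text lands in a subdirectory named after its category; otherwise everything is written to a single directory.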
