- Get a github personal access token from: https://github.com/settings/tokens
- Set up your `.Renviron` file (to make it easier, run `usethis::edit_r_environ()`, or edit `~/.Renviron` directly)
- Add a line like this to the file: `GH_TOSS_TOKEN='YOUR_TOKEN'`
  - You can use `GH_TOSS_TOKEN='1c06459fc9b515e2a5aa748b06913f3495068a45'`, but it may not work since it is not your own token.
- Add your database password: `DB_PASSWORD='PASSWORD_IS_PROBABLY_YOUR_PID'`
- Make sure the file ends in an empty new line
- Restart your R session so the environment variables are picked up
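To confirm the variables were picked up after the restart, a quick check like the following can be run (a minimal sketch; `GH_TOSS_TOKEN` and `DB_PASSWORD` are the variable names defined above):

```r
# Environment variables from ~/.Renviron; Sys.getenv() returns "" when a variable is unset.
nchar(Sys.getenv("GH_TOSS_TOKEN")) > 0  # should be TRUE
nchar(Sys.getenv("DB_PASSWORD")) > 0    # should be TRUE
```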
The CRAN pull is taken from the packages listed on: https://cran.r-project.org/web/packages/available_packages_by_name.html
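As an illustration of where that list comes from, the page can be read into a data frame roughly like this (a minimal sketch assuming the `rvest` package; the actual scraping in `01-cran_scrape.R` may differ):

```r
library(rvest)

# Read the CRAN "available packages by name" page and pull out its package table.
cran_url <- "https://cran.r-project.org/web/packages/available_packages_by_name.html"
page <- read_html(cran_url)

# html_table() returns one data frame per <table>; the package listing is the first one.
pkg_table <- html_table(page)[[1]]
head(pkg_table)  # package name + title
```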
```
Rscript ./src/01-data_collection/scrape/CRAN/01-cran_scrape.R
Rscript ./src/01-data_collection/scrape/CRAN/02_parse_cran.R
Rscript ./src/01-data_collection/scrape/CRAN/03_missingness.R
Rscript ./src/01-data_collection/scrape/CRAN/04_CI_Checks.R
Rscript ./src/01-data_collection/scrape/CRAN/05_CI_OSI_subsets.R
```

- `01_cran_scrape.R`: This is the main script used to parse HTML from CRAN and store it on the server. It contains a small section of parallelized code to speed up the scraping process.
- `02_parse_cran.R`: Transforms the data we scraped from CRAN from large lists into data frames.
- `03_missingness.R`: Generates a table of the missingness in the variables (columns) scraped from CRAN.
- `04_CI_Checks.R`: Using links from the original CRAN scrape (script 01), we know where the continuous integration checks are located for each package. Using a package's link to its own CI results, we then scrape the HTML from that page and store it. This script is almost identical to `01_cran_scrape.R`, except the goal is to scrape CI checks for each package.
- `05_CI_OSI_subsets.R`: No filtering happens until this script, and a lot of filtering happens here. TLDR: the goal of this script is to get all packages that have OSI-approved licenses, passing CI tests, and valid GitHub repos. To check whether packages were OSI approved, we got a list of all OSI-approved licenses and used it to label packages as "approved" or not. I did this in Excel and then imported the data back onto the server, as I found it to be faster. From the set of all OSI-approved packages, we then exclude any packages with the CI status "Error" or "Fail", so we are left with CI-passing packages. From that set, we filter down to packages that have GitHub links. Various data cleaning steps were taken to find the link if it was missing (taking the link from bug reports or issues), and we looked for GitHub-related URLs in the dataset.
  - Note: some code in this script breaks because we cannot use the `sdalr` package to get a table from the old SDAL database (lines 9-12).
- `06_github_scrape.R`: This is the script that gets us GitHub data. We get it by feeding a slug into some GitHub API calls that Bayoan wrote. The same data cleaning steps are performed as in script 05. Lines 125 to 132 are attempts to resolve NA repos from the first GitHub scrape; we fix this by getting the slugs of the NA repos and feeding them back into the function.
- `07_github_LoC`: Making the GitHub lines of code calculations.
- `08_CRAN_source_files.R`: For CRAN packages that are not on GitHub, we can calculate their lines of code by downloading the source files and then running a line count on the files. This script grabs all the links to the source files for all CRAN packages.
- `09_Cran_LOC.R`: This downloads all of CRAN and then runs line counts on each package. Calculates cost with the same formula as the GitHub packages; only `.R`, `.c`, and `.h` files are counted (see the sketch after this list).
- `10_github_contrib.R`: This script works with GitHub contributors and makes more GitHub API calls. The code doesn't work right now, but it doesn't write out any data, so I believe it was not used for anything. Seems like a Bayoan script.
- `11_uploads.R`: Combines the CRAN and GitHub information along with lines of code and contributor information, then uploads them to the SDAL database. The resulting tables are for packages that are OSI approved, CI passing, and on GitHub.
- `12_cleaning_keys.R`: A cleaning script that produces a list of OSI-approved and CI-approved GitHub slugs with package names. Mostly data cleaning steps here.
- `13_github_fix.R`: We missed a few packages in data cleaning, so we just make sure to obtain the relevant information on what we missed (see the snippet after this list).
- `14_costs.R`: Attaching the cost calculation to an analysis table.
- `15_cost_comparisons.R`: Exploratory work to look at the difference in lines of code between source files from CRAN and lines of code info from GitHub. Small plotting efforts, but nothing that was used in the poster.
- `16_HEX.R`: Exploratory plotting comparing outdegree and cost across Python, Julia, CDN (JS), and R. None of these visuals were used in the poster.
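The line-count step referenced in `09_Cran_LOC.R` boils down to something like the following (a minimal sketch, not the script itself; the tarball URL, temp paths, and the restriction to `.R`/`.c`/`.h` files are assumptions taken from the description above):

```r
# Count lines of code in one CRAN source tarball, keeping only .R, .c, and .h files.
count_package_loc <- function(tarball_url) {
  tf <- tempfile(fileext = ".tar.gz")
  td <- tempfile()
  download.file(tarball_url, tf, quiet = TRUE)
  untar(tf, exdir = td)

  # Find the source files we care about and sum their line counts.
  src_files <- list.files(td, pattern = "\\.(R|c|h)$", recursive = TRUE, full.names = TRUE)
  sum(vapply(src_files, function(f) length(readLines(f, warn = FALSE)), integer(1)))
}

# Hypothetical example URL (CRAN layout: src/contrib/<pkg>_<version>.tar.gz):
# count_package_loc("https://cran.r-project.org/src/contrib/abc_2.2.1.tar.gz")
```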
Also, there are two other scripts in the repo right now (chk.R and dependencies.R). I’m pretty certain that Bayoan wrote these scripts either to check my work or to do a little bit of work on CRAN. Either way I don’t think I used/edited anything in those scripts.
```r
# load our list of keys, and identify the set of packages that we have missed
keys <- readRDS('./data/oss/working/CRAN_2018/name_slug_keys.RDS') # from script 12
Analysis <- readRDS('./data/oss/working/CRAN_2018/Analysis.RDS') # from uploads (script 11)
missed <- setdiff(keys$slug, Analysis$slug) # this should be 220 packages
```
After identifying what we missed, we get the information and bind it back to our master analysis table.
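That "bind it back" step is essentially a row-bind of the newly collected rows onto the analysis table, roughly like this (a minimal sketch assuming `dplyr`; `fetch_missing_info()` is a hypothetical stand-in for the real collection step, not a function from the repo):

```r
library(dplyr)

# Keys for the packages the snippet above identified as missed.
missed_keys <- keys %>% filter(slug %in% missed)

# fetch_missing_info() is a hypothetical stand-in for re-running the GitHub
# collection on just these slugs (as in script 06); here it only returns the
# slugs so the example runs, with the remaining columns filled as NA.
fetch_missing_info <- function(slugs) tibble(slug = slugs)

# Append the recovered rows to the master analysis table.
Analysis <- bind_rows(Analysis, fetch_missing_info(missed_keys$slug))
```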
Some of the "original" data sets cannot be located, but they have been saved in the database:

- `./data/oss/final/PyPI/complete_osi_info.csv` can be found in the SDAL database as `postgresql/oss/public/python_general_pkg_info`
- `./data/oss/working/pypi/10_github_api_info.csv` can be found in the SDAL database as `postgresql/oss/public/python_final`
- `./data/oss/final/PyPI/python_pkg_dependencies.csv` can be found in the SDAL database as `postgresql/oss/public/python_pkg_dependencies`
`00_final_cleaning.R`: Run this script last, after script 10. Using the final data table produced in `10_num_contributors_loc.R`, this script cleans all of the final data.
- input: `~/oss/data/oss/final/PyPI/complete_osi_info.csv` from: `10_num_contributors_loc.R`
- input: `~/oss/data/oss/working/pypi/10_github_api_info.csv` from: `06_github_api.R`
- input: `~/oss/data/oss/final/PyPI/python_pkg_dependencies.csv` from: `07_dependencies_cleaning.R`
- output: db: `oss/python_cost_estimates`
`01_names.R`: This is the beginning of the Pip data collection and cleaning. This script collects all of the names of Python packages from pypi.org in the development categories Production/Stable and Mature.
- input: None
- output: `~/oss/data/oss/working/pypi/02_prod_stable_pkgs_names.csv`
- output: `~/oss/data/oss/working/pypi/02_mature_pkgs_names.csv`
`02_all_names_cleaning.R`: Using the names scraped from pypi.org in the Production/Stable and Mature categories by the `01_names.R` script, this script cleans the data to produce a final clean list of these package names.
- input: `~/oss/data/oss/working/pypi/02_prod_stable_pkgs_names.csv`
- input: `~/oss/data/oss/working/pypi/02_mature_pkgs_names.csv`
- output: `~/oss/data/oss/working/pypi/03_prod_mature_names.csv`
`03_librariesio_licenses.R`: Using the final list of package names produced by `02_all_names_cleaning.R`, this script scrapes libraries.io to collect licenses for the packages.
- input: `~/oss/data/oss/working/pypi/03_prod_mature_names.csv`
- output: `~/oss/data/oss/working/pypi/04_prod_mature_w_licenses.csv`
`04_licenses_cleaning_sort.R`: Using the licenses collected in `03_librariesio_licenses.R`, this script uses an already produced data set of all OSI-approved licenses to create a column for each package indicating whether that package's license is OSI approved (see the sketch after this entry). This script also includes some exploratory analysis code.
- input: `~/oss/data/oss/working/pypi/04_prod_mature_w_licenses.csv`
- input: `~/oss/data/oss/final/PyPI/osi_approved_licenses.csv` from: unknown
- output: `~/oss/data/oss/working/pypi/05_prod_mature_names_w_osi_approved_status.csv`
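The labelling step could look roughly like this (a minimal sketch assuming `dplyr`/`readr` and that both CSVs have a `license` column; the real column names in the script may differ):

```r
library(dplyr)
library(readr)

pkgs <- read_csv("~/oss/data/oss/working/pypi/04_prod_mature_w_licenses.csv")
osi  <- read_csv("~/oss/data/oss/final/PyPI/osi_approved_licenses.csv")

# Flag each package whose license appears in the OSI-approved list.
# `license` is an assumed column name used here for illustration.
pkgs <- pkgs %>%
  mutate(osi_approved = license %in% osi$license)

write_csv(pkgs, "~/oss/data/oss/working/pypi/05_prod_mature_names_w_osi_approved_status.csv")
```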
`05_repository_scraping.R`: Using the list of just the packages with OSI-approved licenses created by `04_licenses_cleaning_sort.R`, this script collects repository URLs for these packages using libraries.io.
- input: `~/oss/data/oss/working/pypi/05_prod_mature_names_w_osi_approved_status.csv`
- output: `~/oss/data/oss/working/pypi/06_osi_approved_w_repos.csv`
- output: `~/oss/data/oss/working/pypi/07_names_prod_mature_osi_approved.csv`
`06_github_api.R`: Using the list of packages with OSI-approved licenses and repository URLs collected by `05_repository_scraping.R`, this script uses the GitHub API to collect repositories for packages where libraries.io had no repository listed, and/or to ensure that if a package has a Github.com repository, the data table has the correct URL for that repository to be used later. This script also uses the GitHub API to collect activity information for packages with valid GitHub repositories, including start date, end date, additions, and deletions for the top contributors to the project (see the sketch after this entry).
- input: `~/oss/data/oss/working/pypi/06_osi_approved_w_repos.csv`
- output: `~/oss/data/oss/working/pypi/10_github_api_info.csv`
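For one repository, the kind of call involved looks roughly like the following (a minimal sketch using the `gh` package and the public `GET /repos/{owner}/{repo}/stats/contributors` endpoint; the script's own API helpers and field choices may differ, and the slug used here is only illustrative):

```r
library(gh)

# Use the token set up in ~/.Renviron earlier in this README.
token <- Sys.getenv("GH_TOSS_TOKEN")

# Per-contributor weekly additions/deletions for one repo ("tidyverse/dplyr" is
# just an example slug, not necessarily one of the scraped packages).
stats <- gh("GET /repos/{owner}/{repo}/stats/contributors",
            owner = "tidyverse", repo = "dplyr", .token = token)

# GitHub may return an empty body while it computes these stats; retry if so.
one <- stats[[1]]  # one contributor's record: author info plus weekly a/d/c counts
sum(vapply(one$weeks, function(w) w$a, numeric(1)))  # that contributor's additions
sum(vapply(one$weeks, function(w) w$d, numeric(1)))  # that contributor's deletions
```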
`07_dependencies_cleaning.R`: Using the dependencies collected for all Production/Stable and Mature OSI-approved packages by `dep_script.py`, this script parses out just the dependencies for each package.
- input: `~/oss/data/oss/working/pypi/07_names_prod_mature_osi_approved.csv`
- input: `~/oss/data/oss/working/pypi/01_dependencies_files/`
- output: `~/oss/data/oss/final/PyPI/python_pkg_dependencies.csv`
- output: db: `oss/python_pkg_dependencies`
`08_github_contd.R`: Using the list of packages with OSI-approved licenses and repository URLs collected by `05_repository_scraping.R`, this script uses the GitHub API to collect the number of stars for each package with a valid GitHub repository.
- input: `~/oss/data/oss/working/pypi/06_osi_approved_w_repos.csv`
- output: `~/oss/data/oss/working/pypi/09_github_api_info_w_stars.csv`
`09_additional_info.R`: Using the data produced by `08_github_contd.R`, this script collects the latest release date as well as the latest version number for these packages from pypi.org.
- input: `~/oss/data/oss/working/pypi/09_github_api_info_w_stars.csv`
- output: `~/oss/data/oss/working/pypi/10_github_and_additional_info.csv`
`10_num_contributors_loc.R`: Using the data produced by `09_additional_info.R`, this script sums the number of contributors and lines of code from the contribution information previously collected (see the sketch after this entry).
- input: `~/oss/data/oss/working/pypi/10_github_api_info.csv` from: `06_github_api.R`
- input: `~/oss/data/oss/working/pypi/10_github_and_additional_info.csv`
- output: `~/oss/data/oss/final/PyPI/complete_osi_info.csv`
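The summing step is essentially a grouped aggregation, along these lines (a minimal sketch assuming `dplyr` and a per-contributor table `contrib_info` with hypothetical `package`, `contributor`, `additions`, and `deletions` columns; the real column names may differ):

```r
library(dplyr)

# Roll up one row per (package, contributor) to one row per package:
# a contributor count and a lines-of-code style total (additions minus deletions
# is an assumed definition used only for illustration).
pkg_summary <- contrib_info %>%
  group_by(package) %>%
  summarise(
    num_contributors = n_distinct(contributor),
    lines_of_code    = sum(additions - deletions, na.rm = TRUE)
  )
```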
`first_try_pypi_scraping.R`: Initial attempt to scrape pypi.

`first_try_scraping_func.R`: Another first attempt to scrape pypi.
Scripts found under `01-data_collection/scrape/CDN/`:
`01-scrape_w_API.R`: Scraping cdnjs with the CDN API. This phase obtains project names, repo URLs, and the least reliable package info (see the sketch after this entry).
- output: `data/oss/working/CDN/raw_response.csv`
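The cdnjs API call behind this step looks roughly like the following (a minimal sketch assuming `jsonlite` and the public `https://api.cdnjs.com/libraries` endpoint; the `fields` requested here are illustrative and the script may ask for different ones):

```r
library(jsonlite)

# Ask the cdnjs API for every library, with a few fields per package.
cdn_url <- "https://api.cdnjs.com/libraries?fields=name,version,repository,license"
resp <- fromJSON(cdn_url, flatten = TRUE)

# The response nests the packages under `results`; save the raw pull.
raw_response <- resp$results
write.csv(raw_response, "data/oss/working/CDN/raw_response.csv", row.names = FALSE)
```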
`02-scrape_w_gitHub_api.R`: Obtain info with the GitHub API; save author information, license, version, and dependencies as CSV.
- input: `data/oss/final/CDN/general_info.csv`
- input: `./data/oss/original/CDN/CDN_json/*`
- output: db: `CDN_authors_info`
- output: db: `CDN_licenses_info`
- output: db: `CDN_dependencies_info`
- output: db: `cdn_keywords_info`
`03-parsing_CDN.R`: Parsing the one big data set into several subsets.
- input: `data/oss/working/CDN/raw_response.csv`
- output: `data/oss/final/CDN/general_info.csv`
- output: `data/oss/final/CDN/keword_info.csv`
- output: `data/oss/final/CDN/author_info.csv`
`04-revise_general_info.R`: Parse the general info df again to improve data quality.
- input: `data/oss/final/CDN/general_info.csv`
- output: `data/oss/working/CDN/pkg_langs.csv`
- output: `data/oss/working/CDN/pkg_langs_finalized.csv`
- output: `data/oss/final/CDN/general_info.csv`
`05-process_author.R`: Parse author info in the first round.
- input: `data/oss/final/CDN/general_info.csv`
- input: `./data/oss/original/CDN/CDN_json/*`
- output: db: `CDN_authors_info`
`06-process_license.R`: Parse license info in the first round (see the sketch after this entry).
- input: `./data/oss/original/CDN/CDN_json/*`
- output: db: `CDN_licenses_info`
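Pulling the license field out of the stored package JSON files could look like this (a minimal sketch assuming `jsonlite` and that each file under `CDN_json/` is one package's npm-style metadata with top-level `name` and `license` fields; the real structure may differ):

```r
library(jsonlite)

json_files <- list.files("./data/oss/original/CDN/CDN_json", pattern = "\\.json$",
                         full.names = TRUE)

# One row per package: package name plus whatever the `license` field contains.
licenses <- do.call(rbind, lapply(json_files, function(f) {
  pkg <- fromJSON(f)
  data.frame(
    package = if (!is.null(pkg$name)) pkg$name else basename(f),
    license = if (!is.null(pkg$license)) paste(unlist(pkg$license), collapse = "; ") else NA,
    stringsAsFactors = FALSE
  )
}))
```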
`07-process_dependencies.R`: Parse dependencies for each package.
- input: `./data/oss/original/CDN_json/*`
- output: db: `cdn_dependencies_info`
`08-revise_author.R`: Parse author info in the second round.
- input: db: `CDN_authors_info`
- input: `data/oss/final/CDN/general_info.csv`
- output: db: `cdn_authors_info`
`09-revise_license.R`: Parse license info in the second round.
- input: db: `licenses`
- input: db: `CDN_licenses_info`
- input: `data/oss/final/CDN/general_info.csv`
- output: `data/oss/final/CDN/license_per_manual.csv`
- output: db: `cdn_license_info`
The data is collected from the API by the following script (you only want to run this once):

```
Rscript ./src/01-data_collection/scrape/code_gov/use_api/01-get_repos.R
```

- Tag which repositories are OSI approved
- Domain of the repositoryURL

```
Rscript ./src/02-data_processing/code_gov/01-add_columns.R
```

Looks at the repositoryURL, domains, and licenses for code.gov (see the sketch below for the domain step).
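The domain column can be derived from `repositoryURL` with a simple pattern match, for example (a minimal sketch in base R; `repos$repositoryURL` is an assumed column name taken from the description above, and URLs without a scheme are returned unchanged):

```r
# Strip the scheme and anything after the first slash to get the host,
# e.g. "https://github.com/GSA/code-gov" -> "github.com".
extract_domain <- function(urls) {
  sub("^[a-zA-Z]+://(www\\.)?([^/]+).*$", "\\2", urls)
}

repos$domain <- extract_domain(repos$repositoryURL)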
```
Rscript -e "rmarkdown::render(here::here('./src/exploratory/code_gov/repository_domains.Rmd'), output_dir = here::here('./output/code_gov'))"
Rscript -e "bad_html <- './src/exploratory/code_gov/repository_domains.html'; if (file.exists(here::here(bad_html))) file.remove(here::here(bad_html))"
```