Technical Article

Join Csv files using SQL

,

Csv files must frequently be joined. It would be fantastic if we could connect CSV files using the power of SQL. This script accomplishes that. It executes SQL that is provided as an input parameter. In this example, I've made two csv files. This script may be expanded to handle multiple csv files, multiple column joins, group by, order by, and limit operations. Without this script, one may have to utilize vlookup, which might be expensive, time-consuming, and challenging to implement.

employees.csv

emp_id,emp_name,dept_id
1,Paul,1
2,John,1
3,Jane,1
4,Barb,2
5,Cody,2
6,Bill,2

 

departments.csv

dept_id,dept_name
1,Sales
2,Marketing

I wrote a python script called ExecuteSqlForCsv.py and I can join the employees.csv and departments.csv files

python3 ExecuteSqlForCsv.py "SELECT emp_name, dept_name FROM employees.csv, departments.csv WHERE employees.csv.dept_id = departments.csv.dept_id"
The below output gets produced:

 

 emp_name dept_name
0 Paul Sales
1 John Sales
2 Jane Sales
3 Barb Marketing
4 Cody Marketing
5 Bill Marketing
#=====================================================================
# ExecuteSqlForCsv.py:
# Usage: python3 ExecuteSqlForCsv.py "SELECT emp_name, dept_name FROM employees.csv, departments.csv  WHERE employees.csv.dept_id = departments.csv.dept_id"
#=====================================================================
import pandas as pd
import sys

def main():
  strSql = "".join(sys.argv[1:])
  strSelectClause = strSql.split("FROM")[0].replace("SELECT","").replace(" ","")
  lstColumns = strSelectClause.split(",")
  strFromClause = strSql.split("FROM")[1]
  strFromClause = strFromClause.split("WHERE")[0]
  strCsv1 = strFromClause.split(",")[0].strip()
  strCsv2 = strFromClause.split(",")[1].strip()
  strWhereClause = strSql.split("WHERE")[1]
  strLeftOn = strWhereClause.split("=")[0].replace(strCsv1,"").replace(".","").strip()
  strRightOn = strWhereClause.split("=")[1].replace(strCsv2,"").replace(".","").strip()
  dfCsv1 = pd.read_csv(strCsv1)
  dfCsv2 = pd.read_csv(strCsv2)
  dfOutput = pd.merge(dfCsv1, dfCsv2,  how="left", left_on=[strLeftOn], right_on = [strRightOn])
  dfOutput = dfOutput[lstColumns]
  print(dfOutput)
# End main()

if __name__ == "__main__":
  main()

Rate

5 (1)

You rated this post out of 5. Change rating

Share

Share

Rate

5 (1)

You rated this post out of 5. Change rating