Apache Spark using PySpark ETL help

$30-50 USD

Cancelled

Posted

about 4 years ago

Paid on delivery

Basically I have an ETL with 2 updates and I want to write the same updates in Pyspark

table_a:

+---+-----------+-------+--------------+
|key| col_a     | col_b | current_flag |
+---+-----------+-------+--------------+
|001| Value1    | T123  | Y            |
|002| oth_val1  | T123  | N            |
|003| oth_val2  | T123  | N            |
|004| oth_val3  | T123  | N            |
|005| Value2    | T123  | Y            |
|006| oth_val4  | T789  | N            |
|007| Value2    | T789  | Y            |
|008| Value1    | T789  | N            |
+---+-----------+-------+--------------+

UPDATE table_abc
SET col_a = 'Value1'
WHERE col_b IN (
    SELECT col_b
    FROM table_abc
    WHERE col_a = 'Value1' and current_flag = 'Y'
)
AND current_flag = 'N'
COMMIT;

+---+-----------+-------+--------------+
|key| col_a     | col_b | current_flag |
+---+-----------+-------+--------------+
|001| Value1    | T123  | Y            |
|002| Value1    | T123  | N            | -- updated
|003| Value1    | T123  | N            | -- updated
|004| Value1    | T123  | N            | -- updated
|005| Value2    | T123  | Y            |
|006| oth_val4  | T789  | N            |
|007| Value2    | T789  | Y            |
|008| Value1    | T789  | N            |
+---+-----------+-------+--------------+

UPDATE table_abc
SET col_a = 'Value2'
WHERE col_b IN (
    SELECT col_b
    FROM table_abc
    WHERE col_a = 'Value2' and current_flag = 'Y'
)
AND current_flag = 'N'
COMMIT

+---+-----------+-------+--------------+
|key| col_a     | col_b | current_flag |
+---+-----------+-------+--------------+
|001| Value1    | T123  | Y            |
|002| Value1    | T123  | N            |
|003| Value1    | T123  | N            |
|004| Value1    | T123  | N            |
|005| Value2    | T123  | Y            |
|006| Value2    | T789  | N            | -- updated
|007| Value2    | T789  | Y            |
|008| Value2    | T789  | N            | -- updated
+---+-----------+-------+--------------+

---------------------------------------------------------

#pyspark code to reproduce the updates
#initial dataframe is "table_a"
tval1 = [login to view URL](
    col("col_a") == lit("Value1") & col("current_flag") == lit("Y")
)

t = [login to view URL]("t1").join(
    [login to view URL]("tval1"),
    col("t1.col_b") == col("tval1.col_b"),
    "left-outer"
).select(
    col("[login to view URL]"),
    when(
        col("tval1.col_b").isNotNull(), lit("Value1")
    ).otherwise(col("t1.col_a")).alias("col_a"),
    col("t1.col_b"),
    col("t1.current_flag")
)

#use data frame t from above
tval2 = [login to view URL](
    col("col_a") == lit("Value2") & col("current_flag") == lit("Y")
)

t_new = [login to view URL]("t1").join(
    [login to view URL]("tval2"),
    col("t1.col_b") == col("tval2.col_b"),
    "left-outer"
).select(
    col("[login to view URL]"),
    when(
        col("tval2.col_b").isNotNull(), lit("Value2")
    ).otherwise(col("t1.col_a")).alias("col_a"),
    col("t1.col_b"),
    col("t1.current_flag")
)

but what really happens in Pyspark is this:

t_new:

+---+-----------+-------+--------------+
|key| col_a     | col_b | current_flag |
+---+-----------+-------+--------------+
|001| Value1    | T123  | Y            |
|002| Value2    | T123  | N            |
|003| Value2    | T123  | N            |
|004| Value2    | T123  | N            |
|005| Value2    | T123  | Y            |
|006| Value2    | T789  | N            |
|007| Value2    | T789  | Y            |
|008| Value2    | T789  | N            |
+---+-----------+-------+--------------+

Python

Spark

Linux

Project ID: 25337503

About this project

23 proposals

Remote project

Active 4 years ago

23 freelancers are bidding on this job at an average price of $82 USD

@ankitbansal1996

Hi, I have more than a year of experience working with PySpark ETL jobs. I have also written big data ETL jobs with complex operations. Ping me to discuss it.

$50 USD in 1 day

5.0

(30 reviews)

5.1

@letsstartcoding

Hello, I need 2 to 3 hours at most to get this job done. I am waiting for your reply and am ready to start work now.

$55 USD in 1 day

4.8

(17 reviews)

5.0

@dineshrajputit

Hi, I have 8 years of experience working on Hadoop, Spark, NoSQL, Java, BI tools (Tableau, Power BI), and cloud platforms (Amazon, Google, Microsoft Azure). I have done end-to-end data warehouse management projects on AWS with Hadoop, Hive, Spark, and PrestoDB. I have worked on multiple ETL projects involving Spring Boot, Angular, Node, PHP, Kafka, NiFi, Flume, MapReduce, Spark with XML/JSON, Cassandra, MongoDB, HBase, Redis, Oracle, SAP HANA, ASE, and many more. Let's discuss the requirements in detail. I am committed to getting work done and strong at resolving issues as well. Thanks

$56 USD in 1 day

5.0

(6 reviews)

4.2

@irfaanmeah

Hi, Project - I have used PySpark for data cleaning and updates in previous projects. I would need some sample data to help you with the issue. I am a Data Scientist with 9+ years of experience and expertise in machine learning using tools like R, Python, SQL and Excel. I am new to freelancing and I want to make sure my clients get the best work from me and choose me again in the future. I keep to deadlines and make sure they are well tracked and communicated. Let me know if you have time to discuss the project so you know I am the PERSON for the job. Thanks, Md Irfaan Meah

$50 USD in 1 day

4.9

(3 reviews)

3.4

@nmogilip

Hi, I am a certified big data developer and have used PySpark extensively. Please let’s connect and discuss your requirements further.

$111 USD in 5 days

5.0

(4 reviews)

3.2

@alexzvetkov1

Hello there! I am a Python expert. I work mainly with Python and the Django framework because it's my major skill. I can complete your project in a short time. Happy day :)

$100 USD in 1 day

5.0

(5 reviews)

3.0

@kovacspjotr

Hey, Let me know if you agree with the price and I can resolve it ASAP. I have a lot of experience with Spark :) I will provide unit-tests on top of the code for free.

$170 USD in 1 day

5.0

(1 review)

2.8

@rnaushad

Hi there, I have about 16 years of experience in Java, Python and big data and associated frameworks like Spring, Hadoop, MapReduce, Spark, etc. I have reviewed your problem and it looks like a quick fix. Please feel free to review the feedback I have received on other projects on Freelancer. Kindly consider my proposal. Regards, Rabiya

$56 USD in 1 day

5.0

(5 reviews)

3.0

@tanushsoftware87

Hello, it's late to bid on this project, but if it's still open then I am interested. Let me know if you will consider my proposal. Thanks.

$356 USD in 2 days

4.1

(5 reviews)

1.8

@singhrahul2016

Hi, I work at an MNC as a Data Engineer, currently in the big data field using PySpark and Hadoop frameworks. I have more than 4 years of production experience in big data and have done freelance work as a PySpark and Hadoop developer. Please share the details so we can start. I am a certified PySpark developer. Thanks, Rahul.

$40 USD in 1 day

5.0

(2 reviews)

1.2

@bestflancer

Hi, rows 2, 3 and 4 are wrongly updated by the PySpark code. Where is your solution hosted on the cloud? I can help you fix this issue and will require access to the cloud. Looking forward to your reply.

$50 USD in 2 days

5.0

(3 reviews)

1.1

@dominikstrm

Hello, I'm a Python expert with experience spanning 6+ years. I'd kindly like to know the details of the project. Thank you for your cooperation.

$299 USD in 1 day

0.0

(0 reviews)

0.0

@dachakg

Hi, I've been working as a data engineer for almost two years. I currently work with Scala and Spark, but I can work in PySpark as well, since it is pretty similar. I've seen your issue and understood it, and there are a couple of ways of solving it. P.S. I've already found one way to solve the first issue. The second issue is pretty much the same, just with other parameters. Kind regards, Danilo

$50 USD in 1 day

0.0

(0 reviews)

0.0

@vinodkb24

Hi, I have more than 4 years of experience in PySpark ETL, which lets me complete the work more efficiently.

$30 USD in 7 days

0.0

(0 reviews)

0.0

@kwongwuisim

Hi, I am experienced in Python and SQL. Do let me know if you still need help with this task. I could do this within 1 hour. Thanks.

$50 USD in 1 day

0.0

(0 reviews)

0.0

@sugamchawla

I am an expert in PySpark, working on big data and building ETL jobs with PySpark. I can do this task easily!

$35 USD in 1 day

0.0

(0 reviews)

0.0

@AbShivaPrasad

I am good with the following: PySpark and Spark Streaming. I have worked on large datasets and larger tables.

$30 USD in 7 days

0.0

(0 reviews)

0.0

@Adrija04

I am a software engineer who has been working with big data technologies like PySpark for the last year, so I can achieve the required results by using the SQL equivalents of the queries as they are. Connect to discuss further.

$40 USD in 1 day

0.0

(0 reviews)

0.0

@PerfectInfo

Hi, I have 12 years of experience in Spark with Python and Scala. I've done similar work in the past and I am confident I can complete this work in the given time. It is just a one-hour job for me. Please hire me. You will not be disappointed and will re-hire me for sure.

$40 USD in 1 day

0.0

(0 reviews)

0.0

@himanshu192

Hi, I am a Databricks and Azure certified professional Data Engineer with expertise in big data architecture, Azure cloud architecture, Spark/Scala/ETL, Hadoop, MySQL, and MongoDB. I have completed around 4 projects in end-to-end development and data pipeline implementation.

$50 USD in 1 day